#038 – Kubernetes Supercharging Particle Physics with Ricardo Rocha (CERN)

Ricardo Rocha

Computing Engineer, Platform Infrastructure Lead, CERN

Episode Overview

In this episode of Kubernetes for Humans, host Itiel Shwartz sits down with Ricardo Rocha, Computing Engineer and platform infrastructure lead at CERN, the European laboratory for particle physics. Ricardo walks through how CERN turns billions of particle collisions per second into roughly 100 petabytes of new data each year, and the worldwide computing grid of over a million CPU cores that processes it. The conversation traces CERN's journey from the early multi-orchestrator days (Mesos, Docker Swarm, Kubernetes) to running Kubernetes everywhere from business IT to ML platforms to the online filtering systems sitting next to the detectors. Ricardo also shares how CERN's CNCF involvement shapes maturity assessment for end users, why the lab runs many smaller clusters as cattle to limit blast radius, and how GenAI's demand for batch primitives is finally bringing scientific-computing-grade scheduling to upstream Kubernetes.

In this episode we discuss:

CERN's data scale: 40 million collision images per second, 100 PB of new data per year, and the worldwide LHC computing grid spanning 200 sites and 1M+ CPU cores
How CERN evaluated Mesos, Docker Swarm, and Kubernetes around 2015 and ended up standardizing on Kubernetes for both business and scientific computing
Pushing Kubernetes into the most critical systems: online detector filters, accelerator control systems, and ML platforms
Ricardo's work on the CNCF Technical Oversight Committee and End User Technical Advisory Board, including reference architectures and project maturity levels
Clusters as cattle: why CERN runs many smaller Kubernetes clusters and how that policy saved them when they accidentally deleted a third of their capacity

Key Takeaways

1. CERN operates under a fixed budget that does not grow, so the driver for adopting Kubernetes and cloud native is doing 10x more with the same money, not selling more.

2. Kubernetes is now used across CERN for business IT, ML platforms, online detector filtering, and even accelerator beam control: the real core of the business is moving onto Kubernetes.

3. CNCF maturity levels (sandbox, incubating, graduated) exist to give end users a signal about sustainability and governance; graduated projects need far less internal due diligence than sandbox ones.

4. Running many smaller clusters instead of one large one limits blast radius. CERN once accidentally deleted a third of its capacity and stayed up with only service degradation.

5. GenAI's appetite for batch scheduling, queuing, and accelerator support is finally pushing the primitives scientific computing has long needed into upstream Kubernetes.

Full Transcript

Itiel Shwartz: Hello everyone, and welcome to another episode of the Kubernetes for Humans podcast. Today with me on the show we have Ricardo. Ricardo, do you mind introducing yourself?

Ricardo Rocha: Hello. Thanks for the invitation. It’s a pleasure to participate. My name is Ricardo. I’m a computing engineer at CERN, and I lead the platform infrastructure teams in the CERN IT department. I’ve been at CERN for a long time. I’ve worked a little bit on many things, from the grid times almost 20 years ago as a student to then working on storage systems, computing systems, and monitoring systems. Apart from a little break from CERN, after which I came back to work on cloud computing, I’ve been quite involved in cloud native, Kubernetes, and also the Cloud Native Computing Foundation. I’m a member of the Technical Oversight Committee and the End User Technical Advisory Board. So I help out a bit in the community and try to give as much as we can from CERN to the community as well.

Itiel Shwartz: So, a super rich and interesting background. Maybe let’s do it chronologically: let’s start at the beginning and continue from there. What led you to computer science? How did you start, what were your first jobs, and how did you end up at CERN? Do I pronounce it right, CERN? So yeah, what led you there? Tell us a bit of your origin story, if you wish.

Ricardo Rocha: Yeah, absolutely. Well, I’ve always been around computers a little bit. My dad gave me one when I was quite young; that was probably the first mistake. Back in the days it was quite painful, actually, to work with computers, but I think I always took on the challenge a bit, and I was quite entertained trying to understand and going through the pains of computing at the time. This led me to a couple of summer jobs, programming here and there, helping out in a couple of companies of people I knew, and then eventually to taking computer engineering at university. The transition to CERN was actually from the university. I had a professor who was collaborating with one of the experiments here at CERN called ATLAS, one of the two main experiments that were looking for the Higgs boson a few years ago. From there we started a small effort inside the university to create a cluster to help out and donate some computing to CERN. It was a very small effort at the time, but I also found out there was a possibility to do internships here at CERN as a student, and that was it. I made contact, sent my application, and ended up doing my master’s thesis already here at CERN. So that was my...

Itiel Shwartz: And you stayed? You never left?

Ricardo Rocha: I never left. I did take a sabbatical for two years because I wanted to try industry a bit as well. So I took two years away from CERN and then I eventually came back.

Itiel Shwartz: Okay, that’s super cool. I think we all hear about CERN quite a lot. I know the place, but I’d be happy if you could explain to our listeners what it is, where it is, and why it is important, and share some general background.

Ricardo Rocha: Absolutely. So, CERN got into the big news around 2012 with the discovery of the Higgs boson, but CERN has existed for 70 years now. We celebrated 70 years last year. It’s the European laboratory for particle physics, and what we do is large experiments around fundamental research, in particular for particle physics. We try to see the very, very small by having these quite large experiments hosted here. So the mission of the facility is to do fundamental research. The big experiment that we have right now is called the Large Hadron Collider, which is a particle accelerator 27 kilometers in circumference, 100 meters underground. We circulate beams of protons in two different directions, clockwise and counterclockwise, and we accelerate them to very, very close to the speed of light, to very high energies. At specific points we make them collide. We smash them against each other at particular points where we’ve built these quite large experiments, and these experiments act like big cameras that help us try to see what happened in these collisions. What we try to understand is basically the nature of matter and the nature of the universe. Some of the experiments are centered on discovering new particles like the Higgs boson. Others, like the ALICE experiment, try to understand the nature of matter and how it looked right after the Big Bang, in something called quark-gluon plasma. It’s a very interesting experiment as well. Apart from that, there’s a ton of other things going on here. Right next to my office is what we call the antimatter factory, where they actually create antimatter and try to understand it as well.

Itiel Shwartz: That sounds super cool. First of all, again, it’s very unique and not like most of our previous guests so far. But explain to me: you are a computer science guy, right? I am as well. Most of what you’re describing sounds a bit more like physics, or maybe engineering, and you’ve been there for the last 20 years. So maybe you can share a bit: where is the software in all of it?

Ricardo Rocha: Yeah, absolutely. In reality, when we do these collisions, which happen billions of times per second in the different experiments, the detectors are acting like gigantic cameras, and they are taking something like 40 million pictures a second. That’s the rate at which we are looking at things. The result of this is, in reality, just data. If we count the large experiments, they will generate something like one petabyte of data per second, which obviously we cannot store and analyze. So what we do is we have what we call online systems that filter this data, and this is done at the nanosecond level. In many cases this is custom hardware; in other cases it is already done by software in things like FPGAs. Then we have additional filters that bring the data rates down to a few tens of gigabytes per second. This is what we store in our storage systems, here in two data centers on premises. Once we have this data stored, we have to do what we call reconstruction, which is to try to see from this raw data what actually happened in these collisions, and we reconstruct the events. For this we need a large amount of computing capacity. On premises we have something like half a million CPU cores. If we count the project that I actually came to help out with, the worldwide computing grid where we connect 200 different sites across the world, we offer over a million CPU cores today. This is where we need the software: for the systems that maintain all this infrastructure, but then of course a lot more software to actually analyze all this data, and even for the event filters. We also have a lot of simulation data that helps us design and calibrate the instruments. It’s always been like this at CERN: it’s always about pushing the boundaries of technology a bit, both from a hardware and a software point of view.

Itiel Shwartz: Again, that sounds very interesting and unique. So it’s real big data, right? At the end of the day, you’re saying there is the problem of collecting the data: you’re running the experiment, but then how can you monitor the results, or get the results, given this vast amount of information? And after you already have all of that data in your on-prem data storage, you need to analyze it, and you’re saying that in order to analyze such a vast amount of data you need a lot of unique software and hardware, right?

Ricardo Rocha: Yeah, absolutely. We collect around 100 petabytes of new data every year, and we’ve accumulated over 1 exabyte since the LHC has been running. All this storage is already quite a challenge, but then we have to distribute it to the computer centers that have the computing capacity, and then we need all the capacity to actually reconstruct the events and give the physicists the data in the formats they need to perform their analysis.
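
As a rough cross-check of the figures quoted above (about one petabyte per second coming off the large experiments, about 100 petabytes kept per year), the arithmetic can be sketched in Python. The decimal units, the assumption that data taking is spread evenly over a full calendar year, and all variable names are illustrative assumptions rather than CERN's own accounting. The accelerator does not run all year, so rates during data taking would be higher than this average.

```python
# Back-of-envelope check of the data volumes quoted in the conversation.
# Assumptions (not from CERN): decimal units (1 PB = 1e15 bytes) and data
# taking spread evenly over a full calendar year.

PB = 1e15  # bytes in a petabyte (decimal)
GB = 1e9   # bytes in a gigabyte (decimal)
SECONDS_PER_YEAR = 365 * 24 * 3600

raw_rate_bytes_per_s = 1 * PB      # ~1 PB/s generated by the large experiments
stored_per_year_bytes = 100 * PB   # ~100 PB of new data kept each year

# Average rate at which data reaches storage, if spread over the whole year.
avg_stored_rate = stored_per_year_bytes / SECONDS_PER_YEAR

# How much the online filters must reduce the stream, on average.
reduction_factor = raw_rate_bytes_per_s / avg_stored_rate

print(f"average stored rate: {avg_stored_rate / GB:.1f} GB/s")
print(f"average reduction factor: about {reduction_factor:,.0f}x")
```

Under these assumptions the filters have to discard all but a few parts per million of the raw stream, which is why the first filtering stages sit right next to the detectors.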

Itiel Shwartz: Okay. That sounds very interesting. I must ask, as the title of the podcast is Kubernetes for Humans: where is Kubernetes in all this? You also mentioned the CNCF and different SIGs that, I’ll be honest, I didn’t really know. So if you can share a bit, first of all, about your background with Kubernetes and when you started using it, and then maybe a bit more about the broader ecosystem of Kubernetes for computing models.

Ricardo Rocha: Absolutely. My personal experience comes from containers. When I took some time off from CERN, I was working in New Zealand to set up a public cloud infrastructure there. This was mostly virtualization, but we had already started looking at containerization as well and seeing what we could offer. So when I came back to CERN, I joined the cloud team, but I was really willing to start looking into containerization. At the time, if you remember, it was around 2015, and it wasn’t even certain that Kubernetes would be the winner. These were the early days. Docker was very well established, but people were also talking about LXC for containers, and if you looked at orchestration there was Mesos, which was very present, Docker Swarm, and Kubernetes. So what we did was start offering our users internally a way to orchestrate clusters in any of these systems. They could decide if they wanted Mesos, Swarm, or Kubernetes. But things moved pretty fast, and it became quite clear where the momentum was going. So basically I started going to KubeCon. I think the first one I attended was in 2017 in Berlin, and I started meeting the community. We had some pretty unique scaling challenges, so we started interacting with SIG Scalability as well to try to fix some of those. We already had some experience collaborating with open source communities, but we saw that here there was particularly good momentum to get things done, and we saw the potential for the next generation of our infrastructure. This led to us being very present, presenting our use cases at KubeCon in different sessions, making case studies, and contacting other end users. So we kind of made CERN quite popular in this cloud native community, and we joined the CNCF as end user members as well. Then we started by offering traditional IT services internally, more the business computing kind of thing, using Kubernetes. But the scientific computing use case was always there, and this is what we were pushing for. It took a couple of iterations to get all the things we needed to scale out, both on the management of containers and on the scalability of Kubernetes, large clusters, things like this. So it took a while. You asked where it is used; I would say it’s used everywhere. Business computing is very well established. In scientific computing, we have the machine learning platforms internally, and they run on Kubernetes. But the most exciting ones, I think, are the really critical things which we call online, which are the systems that control the detectors and run the filters. For example, for ATLAS for Run 4, which is the new upgrade we’ll do in the next couple of years, it will be a gigantic Kubernetes cluster underground next to the detector. ALICE and CMS are looking at doing the same, and even the control systems for the accelerator will, in the next couple of years, be fully managed by Kubernetes. This means controlling the beam, so the real core of the business will be in Kubernetes.

Itiel Shwartz: That sounds like a huge adoption. I’ll be honest here: I never worked in the public sector. All of my life I worked in industry, and there you measure things by money, right? At the end of the day, a company wants to earn money, and then it uses Kubernetes, and you have someone who guides the process. It can be a CTO, it can be an architect, whatever. And it’s always a trade-off: should we build it, should we buy it? There are a lot of different trade-offs that you’re trying to understand. Walk me through how it works for you. Who calls the shots? At the end of the day it sounds like Kubernetes is now super adopted, but I’d guess it cost you a lot of labor to make it happen. And because your use case is quite unique, why should you contribute to the CNCF? Why do it? So if you can share a bit about the internals of how things work.

Ricardo Rocha: Yeah, so that’s a very good question, and there’s no one answer, because CERN is actually quite a large place, so the adoption has been gradual in different teams and different departments. But as for the reasoning to adopt this kind of technology, it’s not that different internally from what you mentioned in industry. I think the main difference is that we have a fixed budget, and that budget is not going to grow, because we’re not selling anything. So our main challenge is always to do a lot more with the same budget. This is going to happen again in a couple of years, when we are increasing the amount of data by 10 times and we have to be able to store it and analyze it with the budget we have today. This is not going to increase, and it really pushes us to investigate new things. We used to trust that the hardware would improve at a rate that suited us, with Moore’s law doubling the capacity. This has no longer been the case for a few years, which means we need this kind of paradigm shift, be it in the infrastructure layer or in the computing models, like using machine learning platforms. So basically this is how it goes. If we look at 20 years ago when I joined, we were basically building our own software, our own infrastructure software and middleware, because no one else had big data, or at least they were not collaborating. Today this is no longer the case. We don’t have to build our own storage systems. Well, actually storage is a little bit special; we do build our own. But the workload management, the networking layer, all of this we moved into collaborating with these much larger communities, with very strong organizations behind them, and we joined those efforts. Internally, what we usually do is pick some good use cases that can really benefit from this and where we can show the benefit, and we get an agreement to do small proof-of-concept prototypes, which can be six months or a year. For example, recently with the accelerator sector we did a one-year effort, and after one year we had a couple of production use cases running, which meant we could then go to management and say this is the path we think people should start looking at. Then you get more resources and you can grow the teams. So I guess it’s not that different from industry. The main difference is that we don’t aim to make more money, because we don’t sell things. We try to be more efficient with a capped budget.

Itiel Shwartz: And again, different organizations take different approaches. Is it a top-down kind of thing, where you and a couple of other guys or girls sit together and decide the direction for all of CERN, or is it more independent areas? How does it work?

Ricardo Rocha: Yeah, it’s more the latter. In some areas there is a top-down approach, for things that are really core and also when we have strict timelines to comply with that are driven by the accelerator. But for this kind of experimentation, CERN is quite a good place. You can justify these efforts to experiment with new technologies for a limited time and then present some nice results, or not so nice in some cases. It is a very technology-aware environment, so there are ideas popping up from every team, and we try to put everyone together when there’s shared interest. For example, there is this effort we did with one of the experiments, with ATLAS, going back to the first experiments; I think we started looking in 2017 at replacing their online farm with Kubernetes. This actually resulted in quite a significant success, so the other experiments are looking to do basically the same, just reproducing the same environment in their own caverns. So it comes from individual teams, and then the word spreads and goes up to the management. At some point, when you need resources, you have to go up and ask for them.

Itiel Shwartz: That’s interesting. You mentioned you are in a couple of SIGs, right? Computing models, I want to say, or the end user ones. Can you share a bit about the bigger ecosystem? Because at the end of the day there are, I guess, other universities or labs and so on that might tackle the same or similar problems. So how does it work? Maybe share a bit more about the broader ecosystem.

Ricardo Rocha: You mean with the Cloud Native Computing Foundation?

Itiel Shwartz: The Cloud Native Computing Foundation.

Ricardo Rocha: Yeah, sure. The anecdote is always to show this picture of the CNCF landscape, with I don’t know how many projects it has there now. I think one of the main differences from previous efforts like OpenStack is that the stack in the CNCF covers all areas. It’s not just the core of the infrastructure. It goes to databases, to storage systems, to platform engineering; you name it, everything is there. So what the foundation really required, and it was there from the beginning, is what is called a Technical Oversight Committee that oversees the projects. It’s not leading the projects or deciding where they should go, but it takes care of the maturity of the projects, in the sense of recommending and setting them to different maturity levels. Usually a project will join the CNCF Sandbox, which is a very early phase: it is maybe building a community but doesn’t necessarily have the required governance in place, or the vendor neutrality and best practices that will ensure it’s sustainable long term. The role of the Technical Oversight Committee is to help those projects grow and to do some due diligence when they apply to move levels. The goal is to get them to the graduated stage. These levels are a reassurance for end users, so they know what they can expect from a project. If you are, say, a university or a company and you adopt a graduated project, you know the project has some level of sustainability that makes you quite confident. You probably still want to do some evaluation internally, but there are lots of things you can trust. If you’re doing the same for a sandbox project, then your due diligence should be a lot more extensive, especially if you plan to bet on it for your core business. So this is where the Technical Oversight Committee tries to help the projects and the community. Then recently we created what is called the End User Technical Advisory Board, which brings more of an end-user perspective to that. The goal here is to put end users together, but also to understand from the end users how they see the project health. So it’s complementing the maturity levels, but also understanding, by talking to the end users, what the gaps are. One of the efforts is to establish what are called reference architectures, where you just draw your stack and say which projects you use and why you’re using them; maybe you did some comparisons as well. Sometimes we identify gaps in the full stack that are being filled by commercial products, or where there’s no project that covers that item. Then we bring this back to the community and open it up, and maybe someone will have cool ideas and launch a startup or start a project to work on this area.
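
The idea of a reference architecture as "draw your stack, say which projects you use and why, then look for gaps" can be sketched as plain data in Python. This is only an illustration: the layer names, the reasons, and the uncovered layer are hypothetical, and the projects are simply the ones named elsewhere in this conversation, not an official CERN or CNCF reference architecture.

```python
# A reference architecture written down as data: each layer of the stack,
# the project that fills it, and why. Layer names and reasons are
# illustrative assumptions; the projects are ones named in the conversation.
stack = {
    "registry": ("Harbor", "container image storage"),
    "container runtime": ("containerd", "runs containers on each node"),
    "orchestration": ("Kubernetes", "cluster scheduling and API"),
    "runtime security": ("Falco", "detects suspicious behaviour"),
    "monitoring": ("Prometheus", "metrics collection"),
    "gitops": ("Argo", "application management"),
    # Hypothetical layer with no project, included only to show a gap.
    "example uncovered layer": (None, "no open source project chosen yet"),
}


def find_gaps(stack):
    """Return the layers that no project in the stack currently fills."""
    return [layer for layer, (project, _) in stack.items() if project is None]


print(find_gaps(stack))
```

Writing the stack down this way makes the gaps something that can be listed and shared, which is the part the board then brings back to the community.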

Itiel Shwartz: Yep, that’s cool. And how many projects are there specifically for your use case, particle physics? How big of an ecosystem is it?

Ricardo Rocha: It is quite large. There are pieces that are required by everyone if you have a cloud native stack, like the registry; we use Harbor for a registry. Then we have the Kubernetes clusters, and that’s already a bunch of them: containerd, Kubernetes itself, and all the add-ons that we add in the cluster. For runtime security we use Falco, for monitoring we use Prometheus, and all of this starts adding up. So I would say we can easily use at least 30 of those projects, not only for the cluster orchestration but also the higher-level tools, like Argo to do your application management with GitOps, and many other projects. We have a recommended stack, but then each individual team will also pick and choose according to their particular needs.

Itiel Shwartz: So the number of CPUs that you mentioned, 1 million, is quite a lot, right? I’m guessing those are some of the biggest clusters in the world, if I wanted to benchmark you, or very similar to the biggest projects. So are you still hitting the same technical limits that Kubernetes has, or...

Ricardo Rocha: Yeah. So 1 million cores is our total capacity across the whole world, and it doesn’t mean that everything is running on Kubernetes. One thing that is quite interesting in this particular area is that we had this grid computing infrastructure, for which we developed our own software for everything 15 or 20 years ago and ran it in production. What’s been happening is that a lot of these sites have completely replaced the full stack with just a Kubernetes endpoint. This is the proof of how useful these things are, because it gives us the abstraction of the whole thing: the abstraction to submit workloads, to monitor the workloads, to collect the logs, to do everything. So in many cases what these institutes and organizations have done is basically scrap their old infrastructure, deploy a large Kubernetes cluster, and give us the API to submit workloads. This works quite well, and the same happens internally. It is gradual: services move from their legacy deployments to Kubernetes step by step, and this has been happening in all areas, in both business and scientific computing. So it’s not one large cluster. We actually learned quite early in our process not to have very large clusters. The reason for that was scalability at the start, but also blast radius. We gave a pretty entertaining talk with a colleague two years ago about when, by mistake, we started deleting all our clusters during an operation. We deleted about a third of our capacity, and because, instead of having very large clusters, we have this policy of clusters as cattle to reduce blast radius, we actually didn’t have any downtime. We had some service degradation, but after deleting one third of our capacity we still stayed up. It was stressful on the day, but it made for a nice presentation and some lessons learned.
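
The blast-radius argument can be made concrete with a toy model in Python. The cluster counts and sizes below are made-up illustrative numbers, not CERN's actual layout, and `surviving_fraction` is a hypothetical helper written for this sketch.

```python
# Toy model of "clusters as cattle": how much capacity survives when a bad
# operation deletes some clusters. All numbers are illustrative assumptions.

def surviving_fraction(cluster_sizes, deleted_indices):
    """Fraction of total capacity left after deleting the given clusters."""
    total = sum(cluster_sizes)
    lost = sum(cluster_sizes[i] for i in set(deleted_indices))
    return (total - lost) / total


# One large cluster: a cluster-level deletion takes everything down.
one_big = [3000]
# The same capacity split into 30 smaller clusters of 100 nodes each.
many_small = [100] * 30

big_left = surviving_fraction(one_big, [0])              # the only cluster is gone
small_left = surviving_fraction(many_small, range(10))   # a third of clusters gone

print(f"one large cluster: {big_left:.0%} of capacity left")
print(f"many small clusters: {small_left:.0%} of capacity left")
```

The same mistake that is a total outage for a single large cluster becomes a degradation when capacity is spread over many small ones, provided services are deployed across more than one cluster.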

Itiel Shwartz: Okay, that’s cool. We are getting close to the end of our time. Maybe you can share what you think is in store for the Kubernetes ecosystem and for research institutions and so on. One, two, three years from now, what’s going to change? Is GenAI changing anything, by the way? Is it something that you are working with? What’s going to happen?

Ricardo Rocha: Absolutely, I love to talk about this part. For the last couple of years we have actually been pushing for Kubernetes to have all the primitives we need for scientific computing. This means batch primitives, like queuing and advanced scheduling capabilities, and we’ve been presenting this quite a bit. If you ask me what GenAI has done for us, it is that it made these features so required for GenAI that we are finally getting them. So even for more traditional scientific computing, we got these batch primitives thanks to GenAI. That said, we don’t necessarily do a lot of GenAI for research at CERN right now, but we do a lot of ML, a lot of machine learning. Our machine learning platforms have been on Kubernetes from the start, and this is where a lot of the investment will go, so that we have improved support for accelerators in these clusters, but also so that we can do better hybrid deployments and integrate external resources, public cloud resources, to expand our capacity and especially to access specialized accelerators. So I think where we are going is that Kubernetes will have to become even more flexible than it is in onboarding all these special environments: to put inference in FPGAs underground, to get some specialized accelerators somewhere in a region god knows where in the world. All these things will be essential for scientific computing and for GenAI, I think.
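
To illustrate what a batch primitive like queuing means in practice, here is a minimal Python sketch of all-or-nothing admission from a FIFO queue against a fixed pool of accelerators. It shows the general idea only; it is not the Kubernetes API or the implementation of any specific project, and the `Job` type, the `admit` function, and the job names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int  # accelerators the job needs, all at once


def admit(queue, free_gpus):
    """Admit jobs in FIFO order while the head of the queue fits entirely.

    A job is admitted only if all of its GPUs are available (all-or-nothing);
    otherwise it waits and blocks the jobs behind it (strict FIFO).
    Returns the admitted jobs and the remaining free GPUs.
    """
    admitted = []
    while queue and queue[0].gpus <= free_gpus:
        job = queue.popleft()
        free_gpus -= job.gpus
        admitted.append(job)
    return admitted, free_gpus


queue = deque([Job("train-a", 4), Job("train-b", 8), Job("infer-c", 2)])
admitted, free = admit(queue, free_gpus=10)
print([j.name for j in admitted], free, [j.name for j in queue])
```

A default service-oriented scheduler places work piece by piece as resources appear; batch workloads usually want the job to wait in a queue until everything it needs can start together, which is the behaviour this sketch models. Strict FIFO is the simplest policy; real batch systems typically add priorities, fair sharing, and backfilling of small jobs.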

Itiel Shwartz: Okay, that’s cool. Any last remarks that you want to share with our listeners?

Ricardo Rocha: Well, what I keep repeating is that this whole community has hugely changed the way we do computing at CERN, and it has also enabled us to look into the next 10 years and our challenges in a much more confident way. So we always thank everyone. Thanks again for the invitation, and thanks to everyone for all the work on all these tools.

Itiel Shwartz: Sure, thank you. A great and very different episode. It was a pleasure.

[Music] Kubernetes for Humans.

This is an AI generated transcript of the conversation

About the Guest

Ricardo Rocha is a Computing Engineer at CERN, where he leads the platform infrastructure teams in the IT department. He has been at CERN for nearly 20 years, starting as a student and working across storage, computing, and monitoring systems before taking a sabbatical to work on public cloud infrastructure in New Zealand. After returning to CERN, he joined the cloud team and drove the adoption of containers and Kubernetes for both business and scientific computing. Ricardo is a member of the CNCF Technical Oversight Committee (TOC) and serves on the End User Technical Advisory Board, helping connect the cloud native community with research institutions.

Resources Mentioned

CNCF (Cloud Native Computing Foundation)

CNCF Technical Oversight Committee

Harbor container registry

Falco runtime security

Argo CD (GitOps)