Tackling Computing Challenges @CERN

Summary

Maria Girone discusses the current challenges of capturing, storing, and processing the large volumes of data generated by the Large Hadron Collider experiments. She also discusses their ongoing research program at CERN Open Lab to explore alternative approaches, including the use of commercial clouds as well as alternative computing architectures, advanced data analytics, and deep learning.

Bio

Maria Girone is CTO at CERN openlab and has extensive knowledge of computing for high-energy physics experiments, having worked in scientific computing since 2002. She has worked on the development and deployment of services and tools for the Worldwide LHC Computing Grid, the global grid computing system used to store, distribute, and analyse the data produced by the experiments on the Large Hadron Collider.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Girone: My talk is going to be about what we do and how we process big data and what the challenges are around this. I will be giving a short introduction of CERN. How many people know about CERN here? Ok, so I'll skip the introduction maybe.

I will then concentrate on computing and describe a little bit the challenges and the environment we work in today. Most of my talk is going to be about what comes next: the machine is going to be upgraded, and the computing and storage needs, which are already huge today, are going to go even higher. I'm going to discuss a little bit how we have started our R&D program around this.

Introduction to CERN

CERN is certainly a wonderful place to work. It was founded by European member states just after the Second World War to give physicists, and scientists in general, the opportunity to come together after a big war and work together. This is actually what we still do today. It has been built across two different countries; we are constantly commuting between France and Switzerland, and it sits in an area which, from a geological perspective, is very interesting. This is the Jura, a very old mountain range. The complex of accelerators is built between Lake Geneva and this mountain.

It started with 12 member states; today it has 23 member states, plus many more observers and associate members. Collaborations and scientists come from all over the world, which makes CERN a truly worldwide scientific endeavor, probably the biggest we have. Today we support a large community of something like 15,000 researchers.

What do we do at CERN? We do so-called fundamental physics. Basically, we want to understand how the universe was formed, what happened just a few instants after the Big Bang. We use huge detectors, gigantic cameras, to look at the infinitely small. If you like, we are going back in time, trying to reproduce in the accelerators of CERN the conditions which were there at the beginning of the formation of the universe. We are also looking at explaining phenomena like why matter won over antimatter in our universe, and what dark matter is really made of. Answering fundamental questions to advance the frontier of knowledge is our main goal, but not the only one. We also advance technologies for detectors and accelerators, and these spin off into research in many areas, for instance medical diagnosis and therapy.

We are also leaders in other domains. The web was invented at CERN, and we're all certainly profiting from that. Another point is training the researchers and scientists of the future, with a huge program which allows students from all over the world to come to CERN, work there, and gain experience in such a great laboratory. What makes it really special is the opportunity to work with people from very different backgrounds. This is really something unique, I would say.

The Large Hadron Collider

We all know the Large Hadron Collider, a 27-kilometer-circumference collider that extends over this beautiful area. You can see the size of the CERN infrastructure compared to the runway of the airport. We have a chain of accelerators; CERN is really specialized in these machines and is unique in the world. They have been built over the years, and each serves as the injector for the next, largest machine, which today is the LHC.

There are four interaction points where the proton beams collide, with four huge, gigantic cameras, but these are not the only things we have. We have basically built a machine of records: we have the most powerful magnets, we have the fastest racetrack on this planet. We have the highest vacuum, emptier than outer space, and very tiny but very hot points, hotter than the temperature of the sun. Of course, given that the magnets have to be operated almost at absolute zero, we have one of the coldest places in the universe. I like to say that it's also the coolest place in the world to work, at least for me.

When we build an accelerator, this is not something we do from one day to the next. It's a very complex machine, it requires expertise from many people, and it's a long process. The first design concepts were published in 1985. We are now at the end of the first phase with Run 2. Next we will move to the next phase of the upgrade program of the LHC, the High-Luminosity LHC. This is a 20-year project from construction to physics. Overall, this machine was first conceived in 1985 and will continue to operate until 2035, so these are huge timescales. The next one, in case someone asks, is probably going to be a future circular collider; not certain, but very possible.

The LHC is constrained between the Jura and Lake Geneva. It is underground, so why not go under the lake? If we build a future circular collider, we will place it between the Jura and the Prealps, so it will actually be 80 to 100 kilometers of tunnel. The LHC is already about 150 meters underground. We'll see what this is going to be like.

The experiments are large, gigantic cameras, and they allow us to really see the infinitely small; the world of particle physics is made of particles which are quite small. The detectors themselves are huge, and it's impressive to see them; everybody is amazed going underground, even my parents. These are six- to seven-floor buildings. ATLAS is as big as Notre Dame de Paris. To stay in France, CMS, which is smaller than ATLAS but still quite big, weighs twice as much as the Eiffel Tower. It also has the most powerful superconducting magnet that has ever been built. There are huge amounts of readout electronics, and huge amounts of electronic signals need to be read out when the experiments are operating.

There are two main general-purpose experiments because they have to cross-check discoveries. For those who know about the Higgs boson discovery, the announcement, in fact, was made simultaneously by ATLAS and CMS together. They were cross-checking results, because at the time of the announcement the discovery was just above the required significance, putting together the data from both experiments. The other two experiments are specialized in understanding particular signatures in high-energy physics. ALICE is looking at the quark-gluon plasma state of matter, which really belongs to the early moments of the Big Bang. LHCb is devoted to understanding the asymmetry between matter and antimatter by studying the behavioral differences between the b quark and the anti-b quark.

What do events look like? Protons collide with protons; this is, in fact, the beam pipe, schematically. Particles are then produced in the final states of these collisions. If you were to picture an event, this is how it would look. Secondary particles are produced: tracks, when the particles are charged, and deposits in the calorimeters, which are these things over here. How many? Many: one billion collisions per second, lots of data. The amount of data this produces is not small. How do we select? Do we take everything? No, we can't, because we would have to read out and store petabytes per second. This is not what we want. Moreover, some of the data that is produced comes from physics processes which we know very well, and therefore we don't really want to record them.

What we do is filtering. These are called triggers; in high-energy physics we have hardware selections, working directly with the thresholds of the detectors and the detector electronics, and software triggers. These bring the data problem at the LHC down by huge amounts: 100,000 collisions per second come out of the hardware filters, and up to 1,000 events per second is what we record today in each of the experiments at the LHC. That is still gigabytes per second, and still something like petabytes at the end of the year. Today, something like 50 petabytes per experiment per year are collected at CERN.
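
As a rough back-of-the-envelope sketch of what those trigger stages mean for data volumes, here is a small Python calculation. The rates are the ones quoted above; the ~1 MB raw event size and the number of seconds of data-taking per year are illustrative assumptions, not official figures.

# Trigger data-reduction estimate. Rates are the ones quoted in the talk;
# the event size and live time are illustrative assumptions.
COLLISION_RATE_HZ = 40_000_000      # bunch crossings: 40 MHz
HARDWARE_TRIGGER_HZ = 100_000       # collisions passing the hardware filters
RECORDED_RATE_HZ = 1_000            # events recorded per second
EVENT_SIZE_BYTES = 1_000_000        # assumed ~1 MB per raw event

print(f"hardware filter keeps 1 in {COLLISION_RATE_HZ // HARDWARE_TRIGGER_HZ}")   # 1 in 400
print(f"software filter keeps 1 in {HARDWARE_TRIGGER_HZ // RECORDED_RATE_HZ}")    # 1 in 100

recorded_bytes_per_s = RECORDED_RATE_HZ * EVENT_SIZE_BYTES
print(f"recorded rate: {recorded_bytes_per_s / 1e9:.0f} GB/s")                    # ~1 GB/s

SECONDS_OF_DATA_TAKING = 7_000_000  # rough order of an LHC running year
per_year = recorded_bytes_per_s * SECONDS_OF_DATA_TAKING
print(f"raw data per year: {per_year / 1e15:.0f} PB")                             # several PB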

In real time, this is what happens: collisions at 40 megahertz, and selected data of 1,000 events per second, 1 kilohertz of raw data. This is our primary data; it will be permanently stored in archival facilities. In order to produce physics, we need to compare the data to the predictions coming from theory, and to go from the electronic signals of which raw data is made to something that we understand as a physics process, we need to do two major things. One is reconstruction: we need to reconstruct the data, with a process in which each electronic signal is translated into a track, or into an energy deposit if the particle was neutral. Here we use the conditions of the detectors, which are taken at the same rate as the collision data. We then compare the data to Monte Carlo, and from there comes the understanding of the physics processes; this is called data analysis. Simulation and reconstruction are our main usage of computing resources at the LHC; something like 70% of the entire compute and storage is needed for these two main tasks.
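
To make the "compare data to Monte Carlo" step concrete, here is a toy sketch in Python. Everything in it is synthetic and purely illustrative - the distributions, the binning, and the per-bin significance are stand-ins, not CERN's actual analysis code.

# Toy sketch of comparing "data" to a Monte Carlo prediction.
import numpy as np

rng = np.random.default_rng(42)

# Pretend "reconstructed" invariant masses (GeV): smooth background plus a narrow peak.
data = np.concatenate([
    rng.exponential(scale=50.0, size=9000) + 70.0,   # background
    rng.normal(loc=125.0, scale=2.0, size=300),      # a narrow signal
])

# Monte Carlo prediction for the same observable (background-only here).
mc = rng.exponential(scale=50.0, size=90000) + 70.0

bins = np.linspace(70, 200, 66)
data_counts, _ = np.histogram(data, bins=bins)
mc_counts, _ = np.histogram(mc, bins=bins)
mc_counts = mc_counts * (len(data) / len(mc))        # normalise MC to the data

# Simple per-bin significance of the data excess over the prediction.
significance = (data_counts - mc_counts) / np.sqrt(np.maximum(mc_counts, 1))
peak_bin = np.argmax(significance)
print(f"largest excess near {bins[peak_bin]:.0f} GeV, "
      f"~{significance[peak_bin]:.1f} sigma in that bin")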

What do we do with this data? We invented the grid: a very distributed, high-throughput computing paradigm, with sites all over the world, about 200 sites with different responsibilities, in a so-called hierarchical model. It starts from the central point where data is produced, CERN, the Tier-0, and then goes down and down. We have something like 13 Tier-1s around the world, a couple in the U.S., for instance, and then we have well over a hundred Tier-2s, which are mainly there for simulation and analysis; the Tier-1s are for the processing of data, and the Tier-0, of course, for acquiring data. Compared to a supercomputer, in terms of petaflops this is in the tens of petaflops, with about 1 million cores distributed around the world and something like an exabyte of data already stored on the grid.
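
A purely illustrative sketch of that tiered-responsibility model follows; the tier roles come from the talk, while the code structure and function names are hypothetical.

# Illustrative sketch of the hierarchical (tiered) grid model.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    responsibilities: tuple

WLCG_TIERS = [
    Tier("Tier-0", ("data acquisition", "first-pass processing", "archival")),
    Tier("Tier-1", ("data processing", "archival", "distribution")),
    Tier("Tier-2", ("simulation", "analysis")),
]

def tiers_for(task: str) -> list:
    """Naive routing: which tiers take on a given kind of work."""
    return [t.name for t in WLCG_TIERS if task in t.responsibilities]

print(tiers_for("simulation"))   # ['Tier-2']
print(tiers_for("archival"))     # ['Tier-0', 'Tier-1']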

This is a map of the transfers. We are leaders in transferring data all over the world, because physicists are everywhere and they want to analyze data within a couple of days of when it has been collected. We rely strongly on networks. The main centers are connected with 100 gigabit per second links, the smaller ones with a few gigabits per second. This is something that really has proven to be extremely reliable. How much data do we transfer? Something like 3 petabytes a day. A lot.
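
As a quick sanity check on those numbers - 3 PB a day over 100 Gb/s links, both taken from the talk - the average sustained rate works out as follows.

# Bandwidth sanity check for ~3 PB/day, using the link speeds quoted above.
PETABYTE = 1e15                     # bytes
DAY = 24 * 3600                     # seconds

daily_volume_bytes = 3 * PETABYTE
average_rate_gbps = daily_volume_bytes * 8 / DAY / 1e9   # gigabits per second
print(f"average transfer rate: {average_rate_gbps:.0f} Gb/s")          # ~278 Gb/s

# Sustaining 3 PB/day needs the equivalent of roughly three fully-used
# 100 Gb/s links -- in practice spread across many links between ~200 sites.
print(f"equivalent saturated 100 Gb/s links: {average_rate_gbps / 100:.1f}")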

That's the status; it works like a dream, but we are in the early phases of the exploitation of this machine. We finished just now, at the end of last year, the so-called Run 2; you can see Run 1, Run 2, we are here. We have started a period of shutdown; if you want to go underground at CERN, these are two years when we are allowing visits underground, so there are a lot of visitors. Then we will have the so-called Run 3 as of 2021. Two of the four experiments will be upgraded, ALICE and LHCb. They need to open up in order to get more data, to get more sensitivity to new physics. Therefore, they will be changed so that they can collect more data.

Similarly, the machine will be upgraded three or four years from now and will move to the so-called High-Luminosity LHC program. The luminosity is a quantity which tells you how many collisions can happen in a certain area in a given time. You see here that we will go much higher with respect to what we have been collecting so far; this is the so-called instantaneous luminosity. The experiments will also open their triggers: you remember I said we select 1,000 events per second; we will select something like 10,000 per second. Everything will go higher, which is why it's called the High-Luminosity LHC. We do this to increase our sensitivity to new physics. We want to discover more particles, and we want to study the events and the physics that we know already with greater precision.
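
A rough scaling sketch of what those numbers imply follows. The trigger rates and pileup values are the ones quoted in the talk; the assumption that reconstruction cost grows roughly linearly with pileup is a simplification for illustration - real reconstruction tends to scale worse than that.

# Rough Run 2 to High-Luminosity LHC scaling, using figures from the talk.
run2 = {"pileup": 70, "trigger_rate_hz": 1_000}
hl_lhc = {"pileup": 200, "trigger_rate_hz": 10_000}

rate_factor = hl_lhc["trigger_rate_hz"] / run2["trigger_rate_hz"]   # 10x more events kept
pileup_factor = hl_lhc["pileup"] / run2["pileup"]                   # ~3x busier events

print(f"events recorded:        x{rate_factor:.0f}")
print(f"complexity per event:   x{pileup_factor:.1f}")
print(f"naive compute scaling:  x{rate_factor * pileup_factor:.0f}")  # ~30x, assuming linear cost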

ATLAS and CMS will be upgraded as well, for Run 4. This is why we need it: the rare events that we want to study happen incredibly rarely; only one event out of 10 trillion is interesting at the LHC. It's really an impressive and complex task to choose and understand the events which make sense for us, because they are potentially interesting candidates to look at. Selecting 1 event in 10 trillion is like choosing one particular grain of sand out of 20 volleyball courts of sand. Impressive.

Events at the LHC also increasingly occur simultaneously, which adds to the complexity. At a given moment in time, there are something like 70 overlapping events which need to be reconstructed in Run 2, which just finished. At the High-Luminosity LHC it will be 200. If we want to understand such an event, we will have to reconstruct everything that happened in the detector. Can you imagine this? I still can't. People are looking at this and at ways to improve the algorithms in order to be able to disentangle these kinds of events.

More collisions will mean higher occupancy in the detectors. You saw the overlapping, so-called pileup, events where the vertices were shown. This is something like the occupancy of a slice of the detector which measures the energy of the particles. You see each of these is basically a jet. You need to understand all these energy deposits in a situation like this one. A lot of work needs to be done at the level of the reconstruction algorithms in order to be able to disentangle this kind of occupancy. Everything will become more complex with the upgrade, and the need for computing time and storage will go much higher.

Today, we are here; we use this much compute, and more or less this is the amount of storage that we have per year. In 2027, when we start the High-Luminosity LHC, we'll be here, so orders of magnitude higher. This black curve represents the expectations in terms of technology improvements, which will come from industry. But still, you see that with the budget we have, which is actually folded into this evolution, we will be factors away in terms of computing and in terms of storage. We are going to go to an exabyte per experiment per year. It's tough because we won't have more money; we'll have, if anything, less money, and therefore we will need to find solutions, in terms of R&D, for doing what we do today but 100 times more, with the same or less money. So, challenging.
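
The flat-budget projection behind that gap can be sketched as below. The annual technology gain and the overall growth in needs are assumptions chosen only to illustrate the shape of the problem; the talk itself quotes a remaining gap of roughly a factor of 5 to 10 once software and infrastructure improvements are also folded in.

# Illustrative flat-budget projection; the specific numbers are assumptions.
YEARS = 8                     # roughly now until the High-Luminosity LHC
ANNUAL_TECH_GAIN = 0.15       # assumed capacity gain per year at flat cost
NEEDED_GROWTH = 100           # "doing what we do today, but 100 times more"

affordable_growth = (1 + ANNUAL_TECH_GAIN) ** YEARS
shortfall = NEEDED_GROWTH / affordable_growth

print(f"capacity a flat budget buys in {YEARS} years: x{affordable_growth:.1f}")
print(f"remaining gap to close with R&D:             x{shortfall:.0f}")
# The talk's own estimate of the residual gap, after all other gains, is ~5-10x.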

To close this resource gap, the gap that I showed you with respect to what we can afford, we are looking into an important R&D program: to scale out capacity with public clouds, HPC centers, and new architectures; to increase performance with hardware accelerators like FPGAs and GPUs and with optimized software; and to adopt new techniques like machine learning, deep learning, and advanced data analytics. This is all part of the R&D that I'm going to cover in the next few moments.

Let me just make one digression here. I just said collaborating with industry is important. In particular, where I'm working now as CTO is CERN openlab, which is our framework for collaboration with industry. We have leading partners in technology; many companies and also many research centers participate in this, so we certainly have a great environment to work in. Created for the start of the LHC, today it looks at the challenges of the High-Luminosity LHC. It is supported by industry and targets the improvements in technologies that will help us close this resource gap.

Research & Development Program

Getting back to R&D now: data center technologies R&D. Hugo [Haas] was telling you earlier about the importance of layered services. We are using OpenStack; we were one of the early adopters of OpenStack, and our services are layered on top of it. Here is the experiment software; down here is all the infrastructure that we have - our own infrastructure, public clouds, volunteer computing, HPC, you name it.

Containers are being used more and more because they add flexibility to the way in which we access resources. We have, for instance, a very interesting project that has been set up recently, just to cite one, on Kubernetes. We really do a lot in the area of containers and layered, virtualized environments.

Commercial clouds are interesting: we've been testing that we can, from the experiment viewpoint, expand to cloud resources easily and absorb huge peaks. This is all our grid resources, and during Supercomputing last year we doubled the resources for a few days using the Google cloud - thanks, Google. On costs, we are watching the market; they are becoming more interesting, and therefore we have projects, supported by the European Commission, on how to work on cloud procurement and adoption.

HPC is an equally interesting source of resources for the future. Our software is not yet really HPC-ready; we're working on it. You know better than me that HPC today gives the opportunity to use accelerators; actually, most of the performance comes from the adoption of accelerators. We have a number of R&D efforts devoted to more effective use of HPC resources by high-energy physics. A number of European projects are helping us, and, of course, in the U.S. there are very similar approaches, with DOE and NSF supporting the HPC centers displayed here.

Software is important. We have some very large, long-lived applications; they live for decades and involve contributions from many people, and that's a strength. We have had many people work on the software. It's also somewhat of a weakness, because we need to maintain code from developers who left years and years ago. Someone asked me this morning how many people contribute: something like thousands of people have contributed to our main reconstruction and simulation codes. It's a challenge to evolve this software to be suitable for HPC.

We mentioned data: we are already at the exabyte scale overall, and we will go to an exabyte every year. We are not alone in this world of large data, but certainly we have challenges which are rather peculiar. How we improve in terms of data organization, data management, and data access is one of the R&D efforts we have launched. It certainly relies on the data transfer capability that we need to preserve: a layered approach based on the data lake concept, with storage that is shared and federated, accessible via multiple hybrid computing infrastructures, from the grid to clouds, HPC, and volunteer computing, and, up here, the layers of services of the experiments.

Computing architecture R&D looks into the capabilities of accelerators. We are certainly looking at GPUs, as they're showing interesting performance in view of the start of the High-Luminosity LHC. We might even be getting more performance compared to a single-core CPU - be careful here, because this comparison is, in a way, a bit unfair. We are looking at different vendors, and we are looking at offloading part of our reconstruction workflows onto GPUs to see whether we can gain the famous factors that we are missing.

The other good point of GPUs is that we can engage with a large community of open-source programmers, which is helping us a lot, in particular in the area of machine learning. We are also looking at other workflow-specific accelerators like FPGAs. In particular, we might soon be looking, in collaboration with a project, into setting up a lab with TPUs and benchmarking how our workflows perform there.

Machine learning is what I want to conclude with. We have a number of projects across our entire data processing chain: from data acquisition, the so-called online; to data reconstruction and data simulation, which are very important applications in our world; to the understanding and optimization of computing resources using machine learning; and to classic data analysis, where we have adopted neural nets for decades. This is also a way for us to possibly use HPC centers, as they are early adopters of these technologies.

Research & Development on New Techniques and Architectures

I want to share a few examples of R&D on GPUs and FPGAs. This is an R&D effort for the LHCb experiment, which is upgrading itself as of 2021; it helps with the readout processing from the detector, at the bunch-crossing rate of the LHC, up to the analysis layer. There are two ways to go, using FPGAs or using GPUs; here the decision-making process is on the order of a few hundred milliseconds. We are really in quasi-online decision-making, where we select or reject events following a certain classification.
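
A hypothetical sketch of "select or reject events following a certain classification" might look like the following. The features, the synthetic signal and background distributions, and the generic scikit-learn model are all illustrative assumptions standing in for the real detector quantities and the experiments' actual trigger software.

# Toy event selection via classification (synthetic data, generic model).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20_000

# Toy per-event features: total transverse energy, track multiplicity, missing energy.
background = rng.normal(loc=[50, 20, 10], scale=[15, 5, 5], size=(n, 3))
signal = rng.normal(loc=[90, 35, 30], scale=[20, 8, 10], size=(n, 3))

X = np.vstack([background, signal])
y = np.concatenate([np.zeros(n), np.ones(n)])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = GradientBoostingClassifier().fit(X_train, y_train)

# "Trigger" decision: keep events whose signal probability passes a threshold.
keep = clf.predict_proba(X_test)[:, 1] > 0.9
efficiency = keep[y_test == 1].mean()      # fraction of signal kept
rejection = 1 - keep[y_test == 0].mean()   # fraction of background rejected
print(f"signal efficiency: {efficiency:.2f}, background rejection: {rejection:.2f}")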

Another example here is more about reconstruction - I said reconstruction is very important for us. People are looking at GPUs to understand whether we can offload some parts of our work, which traditionally on the grid is executed on x86 architectures, onto GPUs. There are very interesting results, with speed-up factors of almost an order of magnitude. I said at the beginning that we need a factor of between 5 and 10, having gotten the rest from technology improvements by industry. Still, gaining factors of 5 to 10 is complex.
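
To illustrate what "offloading onto GPUs" means in the simplest terms, here is a minimal sketch using CuPy. This is just one convenient way to show the pattern; the experiments' actual reconstruction code is C++ with CUDA and similar toolchains, the workload here is a toy, and real speed-ups depend heavily on the algorithm and the data layout.

# Minimal CPU-vs-GPU offload illustration (toy workload, CuPy as an example).
import numpy as np
import cupy as cp          # requires an NVIDIA GPU and the CuPy package

# Toy workload: turn millions of hit coordinates into radii.
hits_cpu = np.random.rand(5_000_000, 3).astype(np.float32)
hits_gpu = cp.asarray(hits_cpu)                 # copy the data to the GPU

def radii_cpu(hits):
    return np.sqrt((hits ** 2).sum(axis=1))

def radii_gpu(hits):
    return cp.sqrt((hits ** 2).sum(axis=1))

r_cpu = radii_cpu(hits_cpu)
r_gpu = radii_gpu(hits_gpu)
cp.cuda.Stream.null.synchronize()               # wait for the GPU to finish

# Sanity check that both paths agree before trusting any timing comparison.
assert np.allclose(r_cpu, cp.asnumpy(r_gpu), atol=1e-5)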

Another example is using deep learning technologies for particle classification. This is an example from LHCb, which is a little more advanced today in tackling these problems because of its upgrade in two years. There is also the use of GANs, generative adversarial networks, to change the approach to simulation a bit, adopting GANs for fast simulation. Monitoring, automation, and anomaly detection are also used, a bit differently than in the experiment data processing: in this case, they are used to increase the uptime of the LHC machine in a number of areas, with very promising results and an impact on uptime which has already been exercised in Run 2.
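
In the spirit of "GANs for fast simulation", here is a minimal PyTorch sketch: a generator learns to mimic a distribution that would otherwise come from an expensive simulation, so drawing samples becomes a cheap forward pass. The 1D Gaussian "energy deposit" target, the network sizes, and the hyperparameters are toy assumptions, nothing like the real detector-response GANs; with these settings the match is only approximate.

# Minimal GAN sketch for "fast simulation" of a toy 1D distribution.
import torch
import torch.nn as nn

torch.manual_seed(0)
real_mean, real_std = 25.0, 5.0     # toy "energy deposit" distribution

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()
batch = 256

for step in range(3000):
    # Train the discriminator on real vs generated samples.
    real = torch.randn(batch, 1) * real_std + real_mean
    fake = G(torch.randn(batch, 8)).detach()
    loss_d = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Train the generator to fool the discriminator.
    fake = G(torch.randn(batch, 8))
    loss_g = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

# "Fast simulation": sampling from the trained generator is just a forward pass.
samples = G(torch.randn(10_000, 8))
print(f"generated mean {samples.mean():.1f}, std {samples.std():.1f} "
      f"(target {real_mean}, {real_std})")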

We are also looking at the very far future. This is not for the start of the High-Luminosity LHC, but just to complete the panorama of R&D, I wanted to mention some projects that have been started in collaboration with industry on neuromorphic computing and on quantum computing. I just wanted to leave this as a reference, if you don't believe me: there are a number of applications that we are testing on quantum computing simulators.

Takeaways

CERN, this wonderful laboratory, has been pushing the boundaries of technology for many decades. Computing is one of those boundaries, and the next program, which is going to start eight years from now, is really going to see an unprecedented amount of data and computing resource needs. We're looking at many avenues with a robust R&D program. We would like to count on the efficient use of HPC centers. We are looking at heterogeneous architectures, as they may solve our challenges and close our factor-of-10 resource gap. Collaboration with industry will be key, and I strongly believe that doing a joint program together is actually going to let us meet the challenges of the High-Luminosity LHC program.

I'm quoting something that one of our VIP visitors to CERN, not me, has said: "Magic is not happening at CERN; magic is being explained at CERN."

Questions & Answers

Participant 1: I'm not sure if you mentioned this, but what data storage system do you use for storage, analytics, etc? Is it something custom, open-source?

Girone: I did not mention it, I just hinted at it. For our archival purposes, for our raw data, we still typically use tape technology. Today it's still the most effective in terms of performance to cost. Otherwise, for data that needs to be online, we have a large pool of disk-based technologies, which span from SSDs to even cheaper...

Participant 2: Not at that low level, but at a high level: in a lot of organizations people would use something like Hadoop for processing and analyzing terabytes of data. What do you use?

Girone: It's a big mix, because every data center on the grid can provide storage services based on how they have set up their infrastructure. At CERN, in particular, we have an in-house developed system which is a file-based system. Mostly we are using file-based systems, backed by many different technologies, which brings the challenge of having to deal in the future with a much larger distributed environment for data storage. It's really more complex, because we need to make sure that we have interoperability between all these solutions.

Moderator: You've obviously been thinking about these new computing challenges for several years. What's something that happened relatively recently that was surprisingly more powerful than you expected? You're looking at hardware acceleration, you're looking at cloud computing, you're looking at machine learning. Was there something that has occurred in the last two or three years that was surprisingly moving it forward?

Girone: At the level of our infrastructure as it is operated today, I think we have really been sizing it as needed with this concept of the grid. What has really caught our attention, at least from my perspective, has been the area of heterogeneous architectures. We had been really focused on a very homogeneous distributed infrastructure, where basically all of our services were replicated all over the place. Today we are confronted with the fact that this paradigm of very uniform architectures simply isn't there anymore. This has come clearly to our attention with the development of supercomputers. I think that in the last three years there's been a lot of awareness and effort to try to adapt our infrastructure to these new paradigms.

Another thing I think we have been surprised about is that for years we thought we ran a very large computing infrastructure and were therefore special in a way. Then, as time went on and industry caught up, we were not alone in this anymore. We are not special anymore; we have some special tasks and some special needs, but nothing so remarkable or unique to our environment. This has meant a change of culture: embracing the technologies as they come from industry, and training our researchers so that they don't use only the tools we have developed in-house but also know how to use the tools that everybody uses. That is what has really made our life more interesting in the last few years.

Participant 2: You mentioned a little bit about quantum computing. Can you talk a little bit about it - is CERN the leading set of use cases for quantum computing and applications in that area? Any juicy tidbits would be appreciated.

Girone: I'm not the one who is looking at quantum; I'm more specialized in heterogeneous architectures and the ways in which we can optimize them. But certainly, quantum computing has the attention of many funding agencies and many initiatives. In the USA there is the National Quantum Initiative, with a lot of money available, and in Europe the Quantum Flagship program, which is supported by the EU. Clearly, quantum computing is the next thing to look at.

How did CERN get involved in this? We have been discussing with a number of technology providers, from D-Wave to IBM, and at the moment with Google. What people have done until now has been rather focused analyses, for instance Higgs boson analysis run on quantum devices; multi-parameter problems are also being investigated. Whether we can use quantum technologies to optimize a multi-parameter environment is something that people are looking at.

At the same time, researchers and scientists really look at this as a far-future technology. The contributions so far also come because there is very strong support from these national initiatives on quantum. The involvement of CERN is important and substantial, but it is not what people are focusing on in the medium term, because of the upgrade program, and this is reflected in the number of people who are working in these areas.

Having said that, CERN is a leader in many technologies, so people are also looking at joining up with the cryogenics groups, cryogenics being one of the key elements in quantum computers. I'm expecting that this will evolve in the future, and that more teams than just the analysts will gather effort at CERN, a bit like what has been happening in some national laboratories here in the USA.

 


Recorded at:

Aug 11, 2019
