LHC Grid: Data storage and analysis for the largest scientific instrument on the planet

The Large Hadron Collider (LHC) is a particle accelerator that aims to revolutionize our understanding of the universe. The Worldwide LHC Computing Grid (LCG) project provides data storage and analysis infrastructure for the entire high-energy physics community that will use the LHC.

The LCG, which was launched in 2003, aims to integrate thousands of computers in hundreds of data centers worldwide into a global computing resource to store and analyze the huge amounts of data that the LHC will collect. The LHC is estimated to produce roughly 15 petabytes (15 million gigabytes) of data annually. This is the equivalent of filling more than 1.7 million dual-layer DVDs a year! Thousands of scientists around the world want to access and analyze this data, so CERN is collaborating with institutions in 33 different countries to operate the LCG.
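
As a quick sanity check on those figures, the arithmetic behind the DVD comparison works out as follows (a minimal sketch in Python; the 8.5 GB value is the usual nominal capacity of a dual-layer DVD):

    # Rough check of the "more than 1.7 million dual-layer DVDs a year" claim,
    # assuming 15 PB of data per year and a nominal 8.5 GB per dual-layer DVD.
    annual_data_gb = 15_000_000      # 15 petabytes expressed in gigabytes
    dvd_capacity_gb = 8.5            # nominal dual-layer DVD capacity

    dvds_per_year = annual_data_gb / dvd_capacity_gb
    print(f"{dvds_per_year:,.0f} dual-layer DVDs per year")   # ~1,764,706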

Data from the LHC experiments is distributed around the globe, with a primary backup recorded on tape at CERN. After initial processing, this data is distributed to eleven large computer centers - in Canada, France, Germany, Italy, the Netherlands, the Nordic countries, Spain, Taipei, the UK, and two sites in the USA - with sufficient storage capacity for a large fraction of the data, and with round-the-clock support for the computing grid.

These so-called “Tier-1” centers make the data available to over 120 “Tier-2” centers for specific analysis tasks. Individual scientists can then access the LHC data from their home country, using local computer clusters or even individual PCs.

The LHC Computing Grid comprises three “tiers”, with 32 countries formally involved (a minimal sketch of the resulting hierarchy follows the list):

  • Tier-0 is one site: the CERN Computing Centre. All data passes through this central hub, but it provides less than 20% of the total compute capacity.
  • Tier-1 comprises eleven sites, located in Canada, France, Germany, Italy, the Netherlands, the Nordic countries, Spain, Taipei, and the UK, with two sites in the USA.
  • Tier-2 comprises over 140 sites, grouped into 38 federations covering Australia, Belgium, Canada, China, the Czech Republic, Denmark, Estonia, Finland, France, Germany, Hungary, Italy, India, Israel, Japan, Republic of Korea, the Netherlands, Norway, Pakistan, Poland, Portugal, Romania, Russia, Slovenia, Spain, Sweden, Switzerland, Taipei, Turkey, the UK, Ukraine, and the USA. Tier-2 sites will provide around 50% of the capacity needed to process the LHC data.
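
Architecturally, the tiers form a simple distribution hierarchy: the single Tier-0 hub feeds the eleven Tier-1 centres, which in turn feed the Tier-2 analysis sites. The sketch below models that hierarchy with plain Python dataclasses purely for illustration; the class names are invented here, the Tier-1 entries are identified only by the regions quoted above, and a single placeholder stands in for each group of Tier-2 sites.

    from dataclasses import dataclass, field

    @dataclass
    class Site:
        """One computing centre in an illustrative model of the LCG tier hierarchy."""
        name: str
        tier: int                                            # 0, 1 or 2
        feeds: list["Site"] = field(default_factory=list)    # sites it distributes data to

    # Tier-0: the CERN Computing Centre, the single hub all data passes through.
    tier0 = Site("CERN Computing Centre", tier=0)

    # Tier-1: eleven centres, identified here only by the regions quoted above.
    tier1_regions = ["Canada", "France", "Germany", "Italy", "Netherlands",
                     "Nordic countries", "Spain", "Taipei", "UK", "USA (1)", "USA (2)"]
    tier0.feeds = [Site(f"Tier-1 {region}", tier=1) for region in tier1_regions]

    # Tier-2: 140+ analysis sites in practice; one placeholder per Tier-1 here.
    for t1 in tier0.feeds:
        t1.feeds.append(Site(f"Tier-2 behind {t1.name}", tier=2))

    def total_sites(site: Site) -> int:
        """Count this site plus everything downstream of it."""
        return 1 + sum(total_sites(child) for child in site.feeds)

    print(total_sites(tier0))   # 1 Tier-0 + 11 Tier-1 + 11 placeholder Tier-2 sites = 23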

When the LHC accelerator is running optimally, access to experimental data needs to be provided for the 5000 scientists in some 500 research institutes and universities worldwide that are participating in the LHC experiments. In addition, all data needs to be available over the estimated 15-year lifetime of the LHC.

There were overwhelming reasons, both financial and technical, for adopting a distributed architecture:

The primary reason for the decision to adopt a distributed computing approach to managing LHC data was money. In 1999, when work began on the design of the computing system for LHC data analysis, it rapidly became clear that the required computing capacity was far beyond the funding capacity available at CERN. On the other hand, most of the laboratories and universities collaborating on the LHC had access to national or regional computing facilities. The obvious question was: Could these facilities be somehow integrated to provide a single LHC computing service? The rapid evolution of wide area networking—increasing capacity and bandwidth coupled with falling costs—made it look possible. From there, the path to the LHC Computing Grid was set.

During the development of the LHC Computing Grid, many additional benefits of a distributed system became apparent:

  • Multiple copies of data can be kept in different sites, ensuring access for all scientists involved, independent of geographical location.
  • Spare capacity at multiple computer centres can be used optimally, making the system as a whole more efficient.
  • Having computer centres in multiple time zones eases round-the-clock monitoring and the availability of expert support.
  • No single points of failure.
  • The cost of maintenance and upgrades is distributed, since individual institutes fund local computing resources and retain responsibility for these, while still contributing to the global goal.
  • Independently managed resources have encouraged novel approaches to computing and analysis.
  • So-called “brain drain”, where researchers are forced to leave their country to access resources, is reduced when resources are available from their desktop.
  • The system can be easily reconfigured to face new challenges, allowing it to evolve dynamically throughout the life of the LHC and grow in capacity to meet the rising demands as more data is collected each year.
  • Provides considerable flexibility in deciding how and where to supply future computing resources.
  • Allows the community to take advantage of new technologies that may appear and that offer improved usability, cost-effectiveness or energy efficiency.

The size of the overall project posed some interesting challenges to the LCG team:

  • Managing the sheer volume of data that has to be moved reliably around the grid.
  • Administering the storage space at each of the sites.
  • Keeping track of the tens of millions of files generated by 9000 physicists as they analyse the data (a toy sketch of this bookkeeping problem follows the list).
  • Ensuring adequate network bandwidth: optical links between the major sites, but also good reliable links to the most remote locations.
  • Guaranteeing security across a large number of independent sites while minimizing red tape and ensuring easy access by authenticated users.
  • Maintaining coherence of software versions installed in various locations.
  • Coping with heterogeneous hardware.
  • Providing accounting mechanisms so that different groups have fair access, based on their needs and contributions to the infrastructure.
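
To give a flavour of the file-bookkeeping problem mentioned above, the toy sketch below maps a logical file name to the sites holding a physical copy and verifies each transfer by checksum. It only illustrates the idea; the real grid relies on dedicated catalogue and data-management services, and every name and path here is hypothetical.

    import hashlib

    class ReplicaCatalogue:
        """Toy replica catalogue: map a logical file name (LFN) to the sites that
        hold a physical copy, plus a checksum used to verify transfers. The real
        grid uses dedicated catalogue and data-management services for this."""

        def __init__(self):
            self._entries = {}   # lfn -> {"checksum": str, "sites": set of site names}

        def register(self, lfn, data, site):
            """Record the first copy of a file and remember its checksum."""
            self._entries[lfn] = {"checksum": hashlib.sha1(data).hexdigest(),
                                  "sites": {site}}

        def add_replica(self, lfn, data, site):
            """Register a copy at another site only if the transferred bytes match."""
            entry = self._entries[lfn]
            if hashlib.sha1(data).hexdigest() != entry["checksum"]:
                raise ValueError(f"corrupted transfer of {lfn} to {site}")
            entry["sites"].add(site)

        def replicas(self, lfn):
            return self._entries[lfn]["sites"]

    # Usage: a hypothetical raw-data file recorded at CERN and copied to a Tier-1 site.
    catalogue = ReplicaCatalogue()
    raw = b"...detector data..."
    catalogue.register("/grid/lhc/run1234/raw.dat", raw, site="Tier-0 CERN")
    catalogue.add_replica("/grid/lhc/run1234/raw.dat", raw, site="Tier-1 UK")
    print(catalogue.replicas("/grid/lhc/run1234/raw.dat"))   # {'Tier-0 CERN', 'Tier-1 UK'}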

Security for such a large, distributed system is also a big challenge, and as "The Telegraph" reports, Greek hackers were able to gain momentary access to a CERN computer system of the Large Hadron Collider (LHC) while the first particles were zipping around the accelerator on September 10th.

Scientists working at CERN, the organisation that runs the vast smasher, were worried about what the hackers could do, because they were "one step away" from the computer control system of one of the machine's huge detectors, a vast magnet that weighs 12,500 tons and measures around 21 metres in length and 15 metres in width and height.

If they had hacked into a second computer network, they could have turned off parts of the vast detector and, as one insider put it, "it is hard enough to make these things work if no one is messing with it."

At the time of this writing, the website cmsmon.cern.ch is still inaccessible to the public as a result of the attack.

As for the operating system that drives the LCG, it is the Scientific Linux distribution, which is put together by Fermilab, CERN, and various other labs and universities around the world:

The LHC Computing Grid (LCG) consists of around 40,000 CPUs distributed worldwide that process the data. The participating Macs and PCs will run, among other software, the CERN-adapted Scientific Linux (currently Scientific Linux CERN 4).

Having such a powerful grid would mean nothing without powerful software running on it, so the LCG Developers Guide provides technical information for anyone developing or modifying code for the LCG and explains the procedures used to meet the quality required for production (a hypothetical sketch of the tagging step follows the list below):

The software development procedure can be broken down into a few simple steps.

  • Create a new module in CVS.
  • Write the code and documentation.
  • Thoroughly test the code.
  • Tag the CVS tree for the module.
  • Contact the build system manager to add your module to the list of modules to build.
  • Ensure that Autobuild has successfully created the package.
  • Thoroughly test the package.
  • Submit the autobuilt package to LCG.
  • Fix any bugs found by the integration and certification process.
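
To make the tagging step above concrete, here is a minimal sketch of how it might be scripted, assuming a configured CVS environment, a module with a make test target, and an invented module and tag name; it is not an official LCG tool.

    #!/usr/bin/env python3
    """Hypothetical helper for the 'tag the CVS tree' step above. The module name,
    tag convention and 'make test' target are assumptions for illustration only."""

    import subprocess
    import sys

    MODULE = "lcg-example-module"          # hypothetical CVS module name
    TAG = "lcg-example-module_R_1_0_0"     # hypothetical release tag

    def run(cmd):
        """Echo a command, run it, and abort the release if it fails."""
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    try:
        run(["cvs", "checkout", MODULE])      # 1. check out a clean copy of the module
        run(["make", "-C", MODULE, "test"])   # 2. thoroughly test the code before tagging
        run(["cvs", "rtag", TAG, MODULE])     # 3. tag the CVS tree for the module
        print(f"Tagged {MODULE} as {TAG}; ask the build system manager to add it to Autobuild.")
    except subprocess.CalledProcessError as exc:
        sys.exit(f"Release step failed: {exc}")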

APIs are developed using C/C++, Java or Perl and are documented with Doxygen, Javadoc or POD. Other software used in the Grid includes:

  • The Berkeley Database Information Index (BDII)
  • gLite, a framework for building grid applications
  • Xen, a virtual machine monitor
  • Glue 2, an abstract information model which is expressed via a schema independent of information system implementations
  • Gridview, a monitoring and visualization tool being developed to provide a high-level view of various functional aspects of the LCG (based on Java, PHP and Oracle 10g).

Grid computing is not the only answer to the LHC's computing challenges, and there are cases where volunteer computing makes sense. In particular, volunteer computing is a good fit for tasks that need a lot of computing power but relatively little data transfer. In 2004, CERN's IT Department became interested in evaluating the sort of technology used by volunteer computing projects like SETI@home. LHC@home became the overall title for these efforts; it is a volunteer computing program that enables users to contribute idle time on their computers to help physicists develop and exploit particle accelerators. It uses BOINC, a software platform for volunteer computing and desktop Grid computing.
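
The sketch below illustrates the shape of the volunteer-computing model rather than the BOINC API itself: a server splits the work into small, self-contained units, volunteer machines burn CPU on them locally, and only tiny results travel back. The Monte Carlo task and all names are invented for illustration.

    import queue
    import random
    import threading

    # Toy model of volunteer computing: lots of CPU per work unit, almost no
    # data transfer. This mimics the shape of the idea only, not the BOINC API.
    work_units = queue.Queue()
    results = queue.Queue()

    # The project "server" enqueues small, self-describing work units.
    for seed in range(20):
        work_units.put({"id": seed, "seed": seed, "iterations": 100_000})

    def volunteer():
        """A volunteer PC: fetch a unit, crunch it locally, return a tiny result."""
        while True:
            try:
                unit = work_units.get_nowait()
            except queue.Empty:
                return
            rng = random.Random(unit["seed"])
            # Heavy computation, tiny output: estimate pi by random sampling.
            hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0
                       for _ in range(unit["iterations"]))
            results.put({"id": unit["id"], "estimate": 4.0 * hits / unit["iterations"]})

    # Four "volunteer" threads stand in for thousands of donated desktop PCs.
    threads = [threading.Thread(target=volunteer) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # The server aggregates the small results it gets back.
    estimates = [results.get()["estimate"] for _ in range(20)]
    print(f"mean estimate of pi: {sum(estimates) / len(estimates):.4f}")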

You can search InfoQ for more information on Grid Computing and Architecture.
