InfoQ

Article

higgs-event

LHC Grid: Data storage and analysis for the largest scientific instrument on the planet

Posted by Dionysios G. Synodinos on Oct 01, 2008 07:14 AM

Community
Architecture,
Java
Topics
Grid Computing
Tags
Data Storage ,
Architecture Evaluation ,
Data Analysis

The Large Hadron Collider (LHC) is a particle accelerator that aims to revolutionize our understanding of our universe. The Worldwide LHC Computing Grid (LCG) project provides data storage and analysis infrastructure for the entire high energy physics community that will use the LHC.

The LCG, which was launched in 2003, aims to integrate thousands of computers in hundreds of data centers worldwide into a global computing resource to store and analyze the huge amounts of data that the LHC will collect. The LHC is estimated to produce roughly 15 petabytes (15 million gigabytes) of data annually. This is the equivalent of filling more than 1.7 million dual-layer DVDs a year! Thousands of scientists around the world want to access and analyze this data, so CERN is collaborating with institutions in 33 different countries to operate the LCG.

Data from the LHC experiments is distributed around the globe, with a primary backup recorded on tape at CERN. After initial processing, this data is distributed to eleven large computer centers - in Canada, France, Germany, Italy, the Netherlands, the Nordic countries, Spain, Taipei, the UK, and two sites in the USA - with sufficient storage capacity for a large fraction of the data, and with round-the-clock support for the computing grid.

These so-called “Tier-1” centers make the data available to over 120 “Tier-2” centers for specific analysis tasks. Individual scientists can then access the LHC data from their home country, using local computer clusters or even individual PCs.

The LHC Computing Grid comprises three “tiers” and 32 countries are formally involved:

  • Tier-0 is one site: the CERN Computing Centre. All data passes through this central hub but it provides less than 20% of the total compute capacity.
  • Tier-1 comprises eleven sites, located in Canada, France, Germany, Italy, the Netherlands, the Nordic countries, Spain, Taipei, and the UK, with two sites in the USA.
  • Tier-2 comprises over 140 sites, grouped into 38 federations covering Australia, Belgium, Canada, China, the Czech Republic, Denmark, Estonia, Finland, France, Germany, Hungary, Italy, India, Israel, Japan, Republic of Korea, the Netherlands, Norway, Pakistan, Poland, Portugal, Romania, Russia, Slovenia, Spain, Sweden, Switzerland, Taipei, Turkey, the U.K, Ukraine, and the U.S. Tier-2 sites will provide around 50% of the capacity needed to process the LHC data.

When the LHC accelerator is running optimally, access to experimental data needs to be provided for the 5000 scientists in some 500 research institutes and universities worldwide that are participating in the LHC experiments. In addition, all data need to be available over the 15 year estimated lifetime of the LHC.

There were overwhelming reasons, both financial and technical, that called for a distributed architecture:

The primary reason for the decision to adopt a distributed computing approach to managing LHC data was money. In 1999, when work began on the design of the computing system for LHC data analysis, it rapidly became clear that the required computing capacity was far beyond the funding capacity available at CERN. On the other hand, most of the laboratories and universities collaborating on the LHC had access to national or regional computing facilities. The obvious question was: Could these facilities be somehow integrated to provide a single LHC computing service? The rapid evolution of wide area networking—increasing capacity and bandwidth coupled with falling costs—made it look possible. From there, the path to the LHC Computing Grid was set.

During the development of the LHC Computing Grid, many additional benefits of a distributed system became apparent:

  • Multiple copies of data can be kept in different sites, ensuring access for all scientists involved, independent of geographical location.
  • Allows optimum use of spare capacity for multiple computer centres, making it more efficient.
  • Having computer centres in multiple time zones eases round-the-clock monitoring and the availability of expert support.
  • No single points of failure.
  • The cost of maintenance and upgrades is distributed, since individual institutes fund local computing resources and retain responsibility for these, while still contributing to the global goal.
  • Independently managed resources have encouraged novel approaches to computing and analysis.
  • So-called “brain drain”, where researchers are forced to leave their country to access resources, is reduced when resources are available from their desktop.
  • The system can be easily reconfigured to face new challenges, making it able to dynamically evolve throughout the life of the LHC, growing in capacity to meet the rising demands as more data is collected each year.
  • Provides considerable flexibility in deciding how and where to provide future computing resources.
  • Allows community to take advantage of new technologies that may appear and that offer improved usability, cost effectiveness or energy efficiency.

The size of the overall project posed some interesting challenges to the LCG team:

  • Managing the sheer volume of data that has to be moved reliably around the grid.
  • Administering the storage space at each of the sites.
  • Keeping track of the tens of millions of files generated by 9000 physicists as they analyse the data.
  • Ensuring adequate network bandwidth: optical links between the major sites, but also good reliable links to the most remote locations.
  • Guaranteeing security across a large number of independent sites while minimizing red-tape and ensuring easy access by authenticated users
  • Maintaining coherence of software versions installed in various locations
  • Coping with heterogeneous hardware
  • Providing accounting mechanisms so that different groups have fair access, based on their needs and contributions to the infrastructure.

Security for such a large-distributed system is also a big challenge and as “The Telegraph” reports, that Greek hackers were able to gain momentary access to a CERN computer system of the Large Hadron Collider (LHC) while the first particles were zipping around the particle accelerator on September 10th.

Scientists working at CERN, the organisation that runs the vast smasher, were worried about what the hackers could do because they were "one step away" from the computer control system of one of the huge detectors of the machine, a vast magnet that weighs 12,500 tons, measuring around 21 metres in length and 15 metres wide/high.

If they had hacked into a second computer network, they could have turned off parts of the vast detector and, said the insider, "it is hard enough to make these things work if no one is messing with it."

The website cmsmon.cern.ch is still at the time of this writing inaccessible by the public as a result of the attack.

As for the Operating System that drives the LCG, it is the Scientific Linux distribution, which is put together by Fermilab, CERN, and various other labs and universities around the world:

An LHC Computing Grid (LCG) consists of around 40,000 worldwide distributed CPUs that process the data. The participating MACs and PCs will have loaded, among other software, the CERN-adapted Scientific Linux (currently Scientific Linux CERN 4).

Having such a powerful grid would mean nothing without powerful software running on it so the LCG Developers Guide provides technical information for anyone developing or modifying code for LCG and explains the procedures used to meet the quality required for production:

The software development procedure can be broken down into a few simple steps.

  • Create a new module in CVS.
  • Write the code and documentation.
  • Thoroughly test the code.
  • Tag the CVS tree for the module.
  • Contact the building system manager to add your module in the list of modules to build.
  • Ensure that Autobuild has successfully created the package.
  • Thoroughly test the package.
  • Submit the autobuilt package to LCG.
  • Fix any bugs found by the integration and certification process.

APIs are developed using C/C++, Java or Perl and are documented with Doxygen, Javadoc or POD. Other software used in the Grid includes:

  • The Berkeley Database Information Index (BDII)
  • gLite, a framework for building grid applications
  • Xen, a virtual machine monitor
  • Glue 2, an abstract information model which is expressed via a schema independent of information system implementations
  • Gridview , a monitoring and visualization tool being developed to provide a high level view of various functional aspects of the LCG (based on Java, PHP and Oracle 10g).

Grid computing is not the only answer to all of the LHC challenges and there are cases where volunteer computing makes sense. In particular, volunteer computing is good for tasks which need a lot of computing power but relatively little data transfer. In 2004, CERN's IT Department became interested in evaluating the sort of technology that is used by volunteer computing projects like SETI@home. LHC@home became the overall title for these efforts and is a volunteer computing program which enables users to contribute idle time on their computer to help physicists develop and exploit particle accelerators. It uses BOINC, a software platform for volunteer computing and desktop Grid computing.

You can search InfoQ for more information on Grid Computing and Architecture.

2 comments

Reply

Additional important software used by Mark Pollack Posted Oct 6, 2008 8:43 AM
Taipei by Mickael Faivre-Macon Posted Oct 7, 2008 2:29 PM
  1. Back to top

    Additional important software used

    Oct 6, 2008 8:43 AM by Mark Pollack

    The data analysis system used at LHC is based on "root", root.cern.ch
    It is quite an impressive piece of work in itself.

  2. Back to top

    Taipei

    Oct 7, 2008 2:29 PM by Mickael Faivre-Macon

    After initial processing, this data is distributed to eleven large computer centers - in Canada, France, Germany, Italy, the Netherlands, the Nordic countries, Spain, Taipei, the UK, and two sites in the USA

    Taipei is the capital of Taiwan. Write "Taiwan", not "Taipei".

Exclusive Content

Book Except and Interview : Aptana RadRails, An IDE for Rails Development

Aptana RadRails: An IDE for Rails Development by Javier Ramírez discusses the latest Aptana RadRails IDE, a development environment for creating Ruby on Rails applications.

Fast Bytecodes for Funny Languages

Cliff Click discusses how to optimize generated bytecode for running on the JVM. Click analyzes and reports on several JVM languages and shows several places where they could increase performance.

Scott Ambler On Agile’s Present and Future

Scott Ambler, Practice Lead for Agile Development at IBM, speaks on the current status of the Agile community and practices having a look at the perspective of the Agile’s future.

Manager's Introduction to Test-Driven Development

Dave Nicolette and Karl Scotland try to introduce non-technical managers to one of the most popular Agile development techniques: Test-Driven Development (TDD).

Structured Event Streaming with Smooks

Smooks is best known for its transformation capabilities, but in this article Tom Fennelly describes how you can also use it for structured event streaming.

How to Work With Business Leaders to Manage Architectural Change

Successful architectures evolve over time to meet changing business requirements. Luke Hohmann presents how to collaborate with key members of your business to manage architectural changes.

Colors and the UI

In this article, Dr. Tobias Komischke explains how colors used in a GUI can influence our interaction with a computer and offers advice on using the appropriate colors for the interface.

Building your next service with the Atom Publishing Protocol

In his presentation, recorded at QCon San Francisco, MuleSource architect Dan Diephouse explores ways to use the Atom Publishing Protocol (AtomPub) when building services in a RESTful way.