What I learned working with Big Data at the Large Hadron Collider
While some are still fighting over the exact definition of what constitutes Big Data, there's little doubt that what we work with at the Large Hadron Collider (LHC) at CERN is BIG data.
From an analytics perspective, what we do at the LHC is large scale pattern recognition and hypothesis testing.
The experiments are the size of apartment buildings and are crammed with state-of-the-art electronics. The largest of these experiments, called ATLAS, hosts 100,000,000 read-out channels. Most of them are silicon pixels, like in a digital camera, but others use very different technologies, designed to trace the small, elusive particles emanating from the collisions of protons. The game is to put all these secondary measurements together and deduce what went on right after the original collision.
To put this in perspective: we are looking through a 90-megapixel 3D camera, reconstructing particle trajectories, particle types and their speeds. Piecing this information together requires a tremendous amount of processing in itself.
The only problem is that we have protons colliding 600 million times every second. That leaves us with 60 PB/s if we assume that each channel is 8 bits in size. 60 PB/s corresponds to the entire planet's accumulation of data every 83 minutes.
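As a quick back-of-the-envelope check, here is a sketch of the arithmetic using the round numbers above, assuming every channel is read out for every collision:

```python
# Rough arithmetic behind the 60 PB/s figure, using the round numbers quoted
# above: 100 million read-out channels, 8 bits per channel, 600 million
# collisions per second.
channels = 100_000_000
bytes_per_channel = 1                       # 8 bits
collisions_per_second = 600_000_000

bytes_per_collision = channels * bytes_per_channel        # ~100 MB per snapshot
raw_rate = bytes_per_collision * collisions_per_second    # bytes per second

print(f"{bytes_per_collision / 1e6:.0f} MB per collision")
print(f"{raw_rate / 1e15:.0f} PB/s of raw data")          # -> 60 PB/s
```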
Think hard about what you really want to store before you design your infrastructure
Since the world wasn't swallowed by an information black hole at the start of the LHC, we obviously don't store all this data. Instead, dedicated systems for preliminary data analysis and rejection select the collisions that are of interest to modern physics. This trigger architecture is based on FPGAs and close-to-the-experiment computer clusters that decide whether to save or discard collision data within microseconds.
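To give a feel for the idea, here is a deliberately simplified two-stage filter. This is not the actual ATLAS trigger menu; the quantities, thresholds and rates are invented for the sketch:

```python
import random

def hardware_trigger(event):
    """Toy first-level decision: one cheap threshold on a coarse quantity."""
    return event["total_energy"] > 100.0            # hypothetical cut in GeV

def software_trigger(event):
    """Toy high-level decision: a slightly more expensive selection."""
    return event["n_high_pt_tracks"] >= 2

def toy_collision():
    return {
        "total_energy": random.expovariate(1 / 30.0),   # most collisions are soft
        "n_high_pt_tracks": random.choices([0, 1, 2, 3], weights=[90, 8, 1.5, 0.5])[0],
    }

total = 1_000_000
kept = 0
for _ in range(total):
    event = toy_collision()
    if hardware_trigger(event) and software_trigger(event):
        kept += 1
print(f"stored {kept} of {total} collisions, discarded the rest")
```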
Discarding data is dangerous if anything is concluded from statistics afterwards. To avoid bias in our conclusions (to be or not to be a Higgs particle), we do extensive studies of what to expect in general from the physics that is already known.
The lesson is that designing a big data collection system to record both events carrying trivially known information and potentially rare new events is a sure way to overwhelm your analysis pipeline while decreasing your chances of finding the rare and valuable insights.
Know what you know and make sure you are aware of what you don’t.
Computer simulation works (statistically speaking)
Perhaps one of the biggest positive surprises at the LHC was how fantastically well the predictions of physics actually fitted the observed collisions.
High-energy particle collisions are predicted by quantum field theory. The quantum part means that every time two protons collide, there is only a certain chance that they turn into something new and interesting. Most of the time they don't. How often a new type of particle is created is indeed the main result from this branch of science.
The reason we need 600 million collisions per second is to create enough chances that a rare particle such as the Higgs particle or a dark matter particle is created.
When looking at particle data we usually look at histograms, with many collisions together forming a picture of the rate of a particle as a function of some quantity, say speed or mass. When a discovery is made, it is because a histogram of a certain type (typically of particle mass) deviates significantly from a histogram filled with simulated collisions. The simulated data contains only the particles we already know of from previous experiments.
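As an illustration of that comparison, here is a toy sketch with invented numbers (not real LHC data): fill one mass histogram from the "observed" collisions and one from background-only simulation, then look for the bin where they disagree the most:

```python
import numpy as np

rng = np.random.default_rng(1)

# Background-only simulation: a smooth, falling mass spectrum (invented shape).
simulated = rng.exponential(scale=50.0, size=200_000) + 80.0
# "Observed" collisions: the same background plus a small bump at 126 GeV.
observed = np.concatenate([
    rng.exponential(scale=50.0, size=200_000) + 80.0,
    rng.normal(loc=126.0, scale=2.0, size=1_000),
])

bins = np.linspace(80.0, 300.0, 111)          # 2 GeV wide mass bins
expected, _ = np.histogram(simulated, bins)
data, _ = np.histogram(observed, bins)

# Crude per-bin significance of the excess over the known-physics expectation.
significance = (data - expected) / np.sqrt(expected)
peak = int(np.argmax(significance))
print(f"largest excess near {bins[peak]:.0f} GeV, roughly {significance[peak]:.1f} sigma")
```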
We use Monte Carlo simulation of particle collisions to fill this histogram of expected particles. In the simulation, everything is modelled for millions of virtual collisions: the physics theory, how the secondary particles from the collision lose energy in the huge detectors, and how the readout electronics form the final signal.
One such collision takes around an hour to simulate in full. After that it is reconstructed by the same pattern recognition algorithms as the actual collision data from the LHC. This puts the virtual and the real collision events on an equal footing, and we can compare them. The surprise is that the virtual version of the collisions matches so well. Decades of effort have been poured into writing realistic simulations, but we were all still taken by surprise. This is definitely one of the major achievements that is rarely sung when talking about the impressive results of the LHC. Without good simulation we would not have discovered the Higgs particle as early as 2012.
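A heavily simplified sketch of that simulation chain is shown below. The process probabilities, resolutions and particle labels are invented purely for illustration; the real chain involves full detector geometry and physics generators:

```python
import random

def generate_collision():
    """Theory step: only by a small chance does something interesting appear."""
    if random.random() < 1e-3:                    # rare process
        return {"kind": "rare", "true_mass": 126.0}
    return {"kind": "known", "true_mass": random.uniform(80.0, 300.0)}

def simulate_detector(event):
    """Detector step: the measured mass is smeared by finite resolution."""
    event["measured_mass"] = random.gauss(event["true_mass"], 2.0)
    return event

def simulate_readout(event):
    """Readout step: digitise to the granularity the electronics would deliver."""
    event["measured_mass"] = round(event["measured_mass"], 1)
    return event

virtual_collisions = [simulate_readout(simulate_detector(generate_collision()))
                      for _ in range(100_000)]
print(sum(e["kind"] == "rare" for e in virtual_collisions), "rare virtual collisions")
```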
The lesson is that complex simulations are well worth the effort. If the work is started early in the design process, simulation can be instrumental in the design of the infrastructure. Simulation is truly the pair of glasses with which we look into the future (and fantasy) world of modern physics. Without simulation it would be impossible to search for exotic new physics such as supersymmetry, dark matter and magnetic monopoles.
Independent information is best stored in files
Having worked with large databases before I became a particle physicist, I was surprised by the use of container files as storage for collisions. Querying a large database with everything in it would have been much easier, right?
SELECT COUNT(*) FROM collisions WHERE mass = 126; -- GeV
What databases are (relatively) good at is relationships; for independent entries the advantage of a centralized database is much less visible. All collisions in the LHC are independent of each other, so we instead store collisions in binary files where C types and C++ structures can be saved and accessed efficiently.
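The idea, roughly, is sketched below. This is a minimal illustration only: the real experiments use the ROOT file format with far richer event structures, and the record layout and field names here are invented:

```python
import numpy as np

# Each collision is a fixed, C-like record; a dataset is just a set of flat
# binary files that any worker can read on its own, with no database in between.
collision_dtype = np.dtype([
    ("event_id", np.uint64),
    ("mass",     np.float32),   # GeV
    ("n_tracks", np.uint16),
])

rng = np.random.default_rng(0)
events = np.zeros(1_000, dtype=collision_dtype)
events["event_id"] = np.arange(1_000)
events["mass"] = rng.exponential(50.0, 1_000) + 80.0
events["n_tracks"] = rng.integers(0, 50, 1_000, dtype=np.uint16)

events.tofile("collisions_part_0001.bin")        # write one file of a dataset
loaded = np.fromfile("collisions_part_0001.bin", dtype=collision_dtype)
print(f"read back {len(loaded)} independent collisions")
```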
This means that it is important to keep track of all the files in a dataset, otherwise the statistical results would be wrong. It also means that computations happening on multiple computers simultaneously can write to their fast local storage and cache input files for fast access.
Computations on the LHC data are done through the Worldwide LHC Computing Grid, consisting of more than 150 computing centers around the world. These centers are connected through high-speed links (at least 10 Gbit/s) and also serve as storage sites for the 25 PB/yr of data generated at the LHC.
Using files allows a complex system with many heterogeneous resources to scale in a simple manner. It also fits well with the LHC computing policy of running the analysis where the data is located rather than transferring the data to your local data center. Versioning of data that is analyzed by thousands of physicists is handled by a DNS-like service that takes care of naming and routing of datasets to physical resources.
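Conceptually, that service is a lookup from a logical, versioned dataset name to the sites holding replicas. The toy sketch below is only meant to convey the idea; the dataset names, sites and scheduling choice are made up:

```python
# Logical dataset names resolve to the physical sites holding replicas,
# and the job is shipped to one of those sites instead of the data being
# shipped to the analyst.
catalog = {
    "collisions.2012.v3":  ["CERN", "site-A", "site-B"],
    "higgs_simulation.v1": ["CERN", "site-C"],
}

def resolve(dataset):
    """Return the sites hosting a replica of the named dataset."""
    return catalog[dataset]

def submit(dataset, analysis):
    site = resolve(dataset)[0]     # in reality: pick by load, locality, quotas...
    print(f"running {analysis} at {site}, where {dataset} already lives")

submit("collisions.2012.v3", "mass_histogram")
```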
An added advantage is that we can change the schema ad hoc from file to file, allowing us to sieve collisions that take 2 MB each down to a few kilobytes. This changes the scale from supercomputers to laptops, and radically shortens the turnaround time of the most expensive part of the analysis: the physicist.
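Concretely, the skimming step looks something like the sketch below, reusing the toy record layout from the earlier file example (real skims keep far more quantities per collision):

```python
import numpy as np

full_dtype = np.dtype([("event_id", np.uint64), ("mass", np.float32), ("n_tracks", np.uint16)])
slim_dtype = np.dtype([("event_id", np.uint64), ("mass", np.float32)])

# Stand-in for a full input file: many collisions, each with every recorded quantity.
rng = np.random.default_rng(0)
full = np.zeros(100_000, dtype=full_dtype)
full["event_id"] = np.arange(100_000)
full["mass"] = rng.exponential(50.0, 100_000) + 80.0
full["n_tracks"] = rng.integers(0, 50, 100_000, dtype=np.uint16)

selected = full[full["mass"] > 100.0]              # keep only interesting collisions
slim = np.zeros(len(selected), dtype=slim_dtype)   # and only the columns the analysis needs
slim["event_id"] = selected["event_id"]
slim["mass"] = selected["mass"]
slim.tofile("collisions_skim_0001.bin")

print(f"{full.nbytes / 1e3:.0f} kB in, {slim.nbytes / 1e3:.0f} kB out")
```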
The lesson is that fragmented data structures not only make distributed computing scalable, they also allow loosely coupled schema evolution and a faster turnaround time for analysts.
Nature (and analysts) always finds a way
When the first analyses of collision data began in the ATLAS collaboration in 2009, the client tools for processing data on the grid were unintuitive, slow and exposed many of the differing design decisions of the individual computing centers involved. That quickly meant that computing centers were blacklisted by users after a few bad experiences, and that even more users gave up on the grid and ran jobs on their local resources at their institutes.
These reactions were possible at the beginning of the experiment due to the relatively manageable data sizes at the time, but it was clear that this wouldn't scale once the collision rate began to climb the following year.
As a result, a small internal competition on client tools took place. The end result was a tool called Panda that evolved into a quite usable web site offering job statistics, resource quotas and the ability to share information with fellow analysts on which datasets to use. It also allowed analysts to submit jobs to the grid with nearly the same command as a local computation would require.
The lesson is that power users have little sympathy for inefficient access models. Early acceptance of the fact that they likely know more about the domain problem than the system architects do averted a disaster, a disaster that could have slowed the discovery of new physics.
The last conclusion…
The most amazing thing of all is that the LHC was conceived in the mid-eighties and designed in the nineties, all with full knowledge of how much data would be generated. One thing they didn't know was that another CERN creation, the World Wide Web, would foster the likes of Google, Facebook and Amazon, and with them the introduction of commercial Big Data. Today commodity software such as Hadoop and cloud platforms such as Amazon Web Services serve as a commercial version of the Grid. Who knows what the LHC computing model would look like if we designed it today!
That said, there are many big data lessons still to be learned from the LHC experiments; it is indeed not just a physics experiment but just as much a computing experiment.