Replicating Particle Collisions at CERN with Kubeflow

Published in Kubeflow · 4 min read · Jul 30, 2019

When protons collide 100 meters underground at the Large Hadron Collider (LHC), humans are not physically present to observe the results firsthand. Even if this were possible, we lack the biological hardware to observe anything interesting at this speed and scale. To understand what’s happening, we need to slow things down and zoom in visually.

ATLAS detector at the LHC, as viewed from underground. Credit: Michelle Casbon

Replicating particle collisions

Replicating particle collisions is a complex task that involves reconstructing data points from an astoundingly large array of sensors. It is challenging because of the vast amounts of data involved and the large-scale computations that result.

To facilitate interpretation, this data needs to be presented in an easily consumable format that can be rerun on demand. Sofia Vallecorsa’s team has built a set of tools that simulate events inside the detectors. These tools support a variety of frameworks and runtimes and are used by physicists and engineers across many different teams at CERN.

Traditional approach

Full-scale Monte Carlo simulations have traditionally been the workhorse of CERN physicists for crunching such large amounts of data. To see an example of this in action, watch Ricardo Rocha & Lukas Heinrich run tens of thousands of Monte Carlo simulations as they rediscover the Higgs boson onstage at KubeCon. While this approach provides a high degree of accuracy, it is time-consuming and compute-intensive. To prepare for LHC hardware upgrades that will produce far larger data volumes, engineering teams are investigating alternative approaches.

Reframing the problem

By reframing the problem in a deep learning-friendly way, Sofia’s team has developed a much faster alternative. Instead of painstakingly reproducing each particle interaction, they use 3D convolutional Generative Adversarial Networks (GANs) to produce a close approximation. This works in the same way that realistic photos are produced without complicated ray-tracing from a virtual 3D scene (ever heard of thiscatdoesnotexist.com?). The key is in interpreting detector results as a 3D map of pixels, or simply put, an image. This opens up the ability to use GANs to build simulations much more quickly. They expected to see trade-offs with this approach, but so far they have observed similar accuracies to the highly resource-intensive full-scale Monte Carlo approach.
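To make the "detector results as an image" idea concrete, here is a minimal sketch of how energy deposits could be binned into a 3D grid of pixels. It is an illustration only, not the team's code: the `Hit` layout, the `hits_to_image` and `total_energy` helpers, the grid shape, and the sample values are all assumptions made up for this example.

```python
from typing import List, Tuple

# Hypothetical hit format: integer cell indices (x, y, z) plus deposited energy.
Hit = Tuple[int, int, int, float]
Image3D = List[List[List[float]]]


def hits_to_image(hits: List[Hit], shape: Tuple[int, int, int]) -> Image3D:
    """Accumulate energy deposits into a dense 3D grid of 'pixels' (voxels)."""
    nx, ny, nz = shape
    image = [[[0.0 for _ in range(nz)] for _ in range(ny)] for _ in range(nx)]
    for x, y, z, energy in hits:
        # Ignore hits that fall outside the chosen grid.
        if 0 <= x < nx and 0 <= y < ny and 0 <= z < nz:
            image[x][y][z] += energy
    return image


def total_energy(image: Image3D) -> float:
    """Sum the energy stored in every voxel."""
    return sum(value for plane in image for row in plane for value in row)


# Made-up hits, purely for illustration (not real detector data).
hits = [(0, 0, 0, 1.5), (1, 2, 0, 0.5), (1, 2, 0, 0.25), (3, 3, 1, 2.0)]
image = hits_to_image(hits, shape=(4, 4, 2))
```

Once the data is in this form, each event is just a 3D array, which is the kind of input a 3D convolutional network can consume or a generator can be trained to produce.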

Training at scale

Once they built this model, they were faced with the challenge of where to train it. Training GANs at scale is tricky since convergence is not guaranteed and training runs often fail to converge. They began by deploying to Intel-based supercomputing architectures. Details are described in this paper, but essentially this involved coordinating with a partner institution to run on their highly specialized resources. Applying for a grant and figuring out a custom deployment for every training run does not scale, so they needed a way to run their code on a wider variety of clusters.

This is where Kubeflow comes in. They started by training their 3DGAN on an on-prem OpenStack cluster with 4 GPUs. To verify that they were not introducing overhead by using Kubeflow, they ran training first with native containers, then on Kubernetes, and finally on Kubeflow using the MPI operator. They then moved to an Exoscale cluster with 32 GPUs and ran the same experiments, recording only negligible performance overhead. This was enough to convince them that they had discovered a flexible, versatile means of deploying their models to a wide variety of physical environments.
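The overhead comparison described above boils down to simple arithmetic: take the native-container run as the baseline and express each other run as a relative slowdown. The sketch below shows that calculation. The `relative_overhead` helper is hypothetical, and the timings are placeholder numbers chosen for illustration, not measurements from the team's experiments.

```python
def relative_overhead(baseline_seconds: float, measured_seconds: float) -> float:
    """Return the slowdown of a run relative to the baseline (0.01 means 1% slower)."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (measured_seconds - baseline_seconds) / baseline_seconds


# Placeholder timings in seconds per epoch. These are NOT results from the article.
timings = {
    "native containers": 100.0,
    "kubernetes": 101.0,
    "kubeflow mpi operator": 102.0,
}

baseline = timings["native containers"]
overheads = {name: relative_overhead(baseline, t) for name, t in timings.items()}

for name, overhead in overheads.items():
    print(f"{name}: {overhead:+.1%}")
```

Running the same comparison on each cluster, as the team did on 4 and then 32 GPUs, shows whether the orchestration layers add a cost that grows with scale.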

Beyond the portability that they gained from Kubeflow, they were especially pleased with how straightforward it was to run their code. As part of the infrastructure team, Ricardo plugged Sofia’s existing Docker image into Kubeflow’s MPI operator. Ricardo gave Sofia all the credit for building a scalable model, whereas Sofia credited Ricardo for scaling her team’s model. Thanks to components like the MPI operator, Sofia’s team can focus on building better models and Ricardo can empower other physicists to scale their own models.
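For readers curious what "plugging a Docker image into the MPI operator" can look like, here is a rough sketch that builds an MPIJob manifest as a Python dictionary. Everything specific in it is an assumption: the image name, job name, training command, and replica counts are invented, and the `kubeflow.org/v1` API version and field layout may not match the operator release the team used in 2019, so check them against the version installed on your cluster.

```python
import json
from typing import Any, Dict, List


def build_mpijob(name: str, image: str, command: List[str],
                 workers: int, gpus_per_worker: int = 1) -> Dict[str, Any]:
    """Build an MPIJob manifest: one launcher that runs mpirun, plus GPU workers."""
    launcher = {"name": "launcher", "image": image, "command": command}
    worker = {
        "name": "worker",
        "image": image,
        "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
    }
    return {
        "apiVersion": "kubeflow.org/v1",  # assumption: differs across operator releases
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "slotsPerWorker": gpus_per_worker,
            "mpiReplicaSpecs": {
                "Launcher": {
                    "replicas": 1,
                    "template": {"spec": {"containers": [launcher]}},
                },
                "Worker": {
                    "replicas": workers,
                    "template": {"spec": {"containers": [worker]}},
                },
            },
        },
    }


# Hypothetical image and command; the real ones are not given in the article.
manifest = build_mpijob(
    name="gan-training",
    image="registry.example.com/3dgan:latest",
    command=["mpirun", "-np", "4", "python", "train.py"],
    workers=4,
)
print(json.dumps(manifest, indent=2))
```

The point of the design is that the training image stays unchanged. Only the manifest around it describes how many workers and GPUs to use, which is what makes the same code portable across clusters.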

Model evolution

Their next step is to train a new and improved version of their 3DGAN. With a single GPU, the initial model takes on the order of 1–2 days to train. The improved version takes closer to a week. This time, they are using Google Cloud Platform to remove the need to negotiate access to partner resources. Since their tooling already runs on Kubeflow, they have on-demand access to much larger numbers of hardware accelerators. They are most excited about exploring TPUs to see if they can run training even faster.

Building their tooling on Kubeflow has opened up exciting new capabilities for CERN physicists, since they have support for model training in a variety of physical environments. On-demand access to resources that are capable of processing data so quickly is a game-changer since it enables Sofia’s team to expand the supported types of detectors and provide best practices for other approaches that replace classical methods with ML.