Data Science in (Astro)Particle Physics and the Bridge to Industry

Hugo Ferreira
Hugo Ferreira’s blog
6 min read · Mar 21, 2018

Trying to bring the physics and industry communities together

Source: LIP

Last week, besides going to the PyData Lisbon meetup with Siraj Raval, I also attended Data Science in (Astro)Particle Physics and the Bridge to Industry, an event organized by LIP (Laboratório de Instrumentação e Física Experimental de Partículas) and IDPASC (International Doctorate Network in Particle Physics, Astrophysics and Cosmology), which took place from the 12th to the 16th of March in Lisbon.

The objective of the event was not only to show the synergies between experimental High-Energy Physics (and fundamental research in general) and Data Science, but also to create links with industry, which aren’t always in place at universities and public research units.

My background is more in theoretical and mathematical Physics (my undergraduate days learning experimental Physics in underground laboratories are long gone), but the scope of the event was wide enough that several other master’s students, PhD students and young postdocs from other areas of Physics also participated.

Just outside the event’s venue.

The event was divided into two parts. During the first three days, there was a School with morning lectures on Probability, Statistics and Machine Learning and hands-on sessions in the afternoon. The last two days consisted of a Symposium with the participation of invited companies and institutions which use Data Science in their activities.

School

Venue of the School

The first three days were dedicated to learning the basics of Probability, Statistics and Machine Learning that students and young researchers may not have come across and which are essential for a data scientist. We had lectures in the morning and hands-on sessions in the afternoon. The course slides and other materials can be found on the School website.

The Probability and Statistics lectures were given by Glen Cowan from Royal Holloway, University of London. Using examples from Particle Physics, and after a brief review of probability theory, he discussed statistical tests, multivariate methods, hypothesis testing, p-values, parameter estimation and confidence intervals, among other topics. Even though I had studied most of these concepts in my undergraduate years, it was quite helpful to revise the material and see it from a slightly different point of view.

In the first afternoon’s hands-on session, we had to design a statistical test to discover a signal process (such as dark matter) by counting events in a detector. In the process, we had to establish null and alternative hypotheses, compute p-values and make use of Poisson distributions.
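To give an idea of the kind of exercise involved, here is a minimal sketch of such a counting-experiment test in Python (the expected background and observed count below are made-up numbers, not the actual exercise data): under the background-only null hypothesis the observed count follows a Poisson distribution, and the p-value is the probability of counting at least as many events as we actually saw.

```python
# Minimal sketch of a counting-experiment test (illustrative numbers only):
# under the background-only (null) hypothesis we expect b events on average,
# and we ask how likely it is to count n_obs or more events by chance alone.
from scipy.stats import norm, poisson

b = 5.0      # expected background events (assumed for illustration)
n_obs = 16   # observed number of events (assumed for illustration)

# p-value: P(n >= n_obs) under the background-only hypothesis
p_value = poisson.sf(n_obs - 1, mu=b)

# equivalent one-sided Gaussian significance Z
significance = norm.isf(p_value)

print(f"p-value = {p_value:.2e}  (Z = {significance:.2f} sigma)")
```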

Glen Cowan during a Probabilities and Statistics lecture

The Machine Learning lectures were given by Lorenzo Moneta, a Senior Physicist at CERN, who gave an exhaustive overview of several ML algorithms and concepts, such as regression, classification, regularization, support vector machines, decision trees and neural networks. On the computing side, the framework used was the Toolkit for Multivariate Data Analysis with ROOT (TMVA), an ML environment for ROOT, a modular scientific software framework widely used at CERN.
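To illustrate what the TMVA workflow looks like from Python, here is a rough sketch of booking and training a boosted decision tree through PyROOT; the file, tree and variable names are placeholders rather than the actual tutorial inputs.

```python
import ROOT

# Rough sketch of a TMVA classification workflow from PyROOT; file, tree
# and variable names below are placeholders, not the actual tutorial inputs.
ROOT.TMVA.Tools.Instance()

out_file = ROOT.TFile("tmva_output.root", "RECREATE")
factory = ROOT.TMVA.Factory("TMVAClassification", out_file,
                            "!V:!Silent:AnalysisType=Classification")

loader = ROOT.TMVA.DataLoader("dataset")
loader.AddVariable("var1", "F")   # placeholder input variables
loader.AddVariable("var2", "F")

in_file = ROOT.TFile.Open("events.root")           # placeholder input file
loader.AddSignalTree(in_file.Get("sig_tree"), 1.0)
loader.AddBackgroundTree(in_file.Get("bkg_tree"), 1.0)
loader.PrepareTrainingAndTestTree(ROOT.TCut(""),
                                  "SplitMode=Random:NormMode=NumEvents:!V")

# Book a boosted decision tree, one of the methods covered in the lectures
factory.BookMethod(loader, ROOT.TMVA.Types.kBDT, "BDT",
                   "NTrees=200:MaxDepth=3")

factory.TrainAllMethods()
factory.TestAllMethods()
factory.EvaluateAllMethods()
out_file.Close()
```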

It was certainly ambitious to discuss the underlying maths and the implementation of so many ML algorithms in only three lectures, and I’m sure it was a bit overwhelming for those who had never had any exposure to ML before. I personally would have preferred a slower pace, even if that meant dropping some of the more advanced topics.

Lorenzo Moneta during a Machine Learning lecture

All of this culminated in the Data Challenge: we had to organize ourselves into teams of 2 or 3 people and apply an ML algorithm to CERN data, with the objective of distinguishing jets originating from gluons from those originating from quarks (if you have no idea what I’ve just said, maybe this Wikipedia page will help). The data came as ROOT files and the challenge could be solved either in ROOT/C++ or in PyROOT, a Python extension module that lets us interact with any ROOT class from the Python interpreter. We also had to present our results on the last day of the School, which meant we only had a few hours in the afternoons to start the challenge from scratch and put together a small presentation.
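Just to show what working with PyROOT looks like in practice, here is a minimal sketch of opening a ROOT file and looping over a TTree; the file, tree and branch names are placeholders, not the actual challenge dataset.

```python
import ROOT

# Minimal sketch of reading a ROOT file with PyROOT; the file, tree and
# branch names are placeholders, not the actual challenge dataset.
f = ROOT.TFile.Open("jets.root")
tree = f.Get("jet_tree")

# PyROOT lets us iterate over a TTree directly; each branch is exposed
# as an attribute of the current entry.
for event in tree:
    print(event.jet_pt, event.jet_eta)

f.Close()
```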

Unfortunately, we were supposed to arrive with our systems ready for the challenge, with all the dependencies installed or with a Docker image, but some of us (including me) never received the email with this information. Moreover, I hadn’t used ROOT since my second year of undergraduate studies and I had only recently started learning Python, so I spent most of the time downloading and setting things up (imagine a few dozen people doing the same over the event’s WiFi) and trying to work out how to read the CERN data into my Python environment.

Instructions to set up a Docker image for the Data Challenge

In the end, the Data Challenge was helpful for a beginner like me to get an idea of the variety of tools being used to apply ML methods. It’s unlikely that I’ll ever use ROOT in a professional environment, but the whole experience was enlightening.

Symposium

Coffee break during the Symposium

During the last two days we had the Symposium, with several companies and institutes presenting how they use Data Science in their activities. The talks were 30 minutes long and were grouped into four topic areas across the two days. The agenda and some of the slides can be found on the Symposium website.

To give an idea of the variety of talks given, here is a list of some of the companies and institutes which presented their work:

  • LIP
  • IST & FCUL
  • BNP Paribas
  • Willis Towers Watson
  • James
  • IBM
  • Altice Labs
  • Vodafone
  • MatchProfiler
  • Nielsen
  • World Bank
  • amplemarket
  • Booking.com
  • GMV
  • Bosch
  • Siemens
  • Fraunhofer
  • NeuroPsyCAD

Hugo Ferreira, CEO and founder of NeuroPsyCAD

I had the opportunity to talk to many of the speakers and hear more about their everyday work, their own journey into Data Science (many of them are former scientists) and also about the existing job and internship opportunities in Lisbon and elsewhere.

Ultimately, I think this event was quite successful in bringing the physics and “real world” communities together and having them share their work. Many of the speakers outside academia were surprised to find that physicists are using many of the same methods and technologies and that they also have big data to deal with (data doesn’t get much bigger than that collected by CERN!).

I’d like to thank Lorenzo Cazon from LIP for organizing this event and I hope it isn’t the last one.
