GSOC2022@ML4SCI| Graph Neural Networks for End-to-End Particle Identification with the CMS Experiment

Xin Yi
7 min read · Sep 30, 2022

Hi! Welcome to my journey in Google Summer of Code (GSoC) 2022 with Machine Learning for Science (ML4SCI). In this post, I will introduce my work during the program and share my personal experience. Here is the repository for my projects.



In recent years, convolutional neural networks have been successfully applied to particle identification tasks. Despite their success, existing image-based methods suffer from data sparsity and overfit easily when the dataset is relatively small. Motivated by this problem, this project represents particle data as graphs and searches for the most suitable graph representation. We then build various graph neural networks for end-to-end tau identification using cutting-edge methods from current graph deep learning research, analyze model performance from multiple perspectives, benchmark model inference on GPU, and provide user guidance through documentation and code. The project offers insights into the application of graph neural networks to tau particle identification and is expected to help the ML4SCI community conduct research in future work.

End-to-End Deep Learning (E2E)

Identification and reconstruction of single particles, jets, and event topologies of interest in collision events are crucial components of searches for new physics at the Large Hadron Collider (LHC). In the CMS experiment, the End-to-End Deep Learning (E2E) project is dedicated to the development of these reconstruction and identification tasks using cutting-edge deep learning techniques.

The main focus of this project is the development of end-to-end graph neural networks for tau particle identification and the CMSSW inference engine for use in reconstruction algorithms in offline and high-level trigger systems of the CMS experiment.

Graph Representation


The Boosted Top Tau Particle dataset contains X_jets, y, m0, and pt columns, where y is the label for the tau sample and X_jets holds 125x125 jet image matrices with 8 channels (pt, d0, dz, ECAL, HCAL, BPIX1, BPIX2, and BPIX3). More specifically:

  • (pt, d0, dz) are tracker layers.
  • ECAL and HCAL are electromagnetic and hadronic calorimeter deposits respectively.
  • BPIX are the pixel layers, whose integer values represent the number of collisions occurring in each pixel layer.

The tracker layers, ECAL, and HCAL channels are already scaled, so there’s no need to normalize them.

Representing Jet Images as Graphs

Given the nature of particle data and of the pixels in a jet image, the jet image can be represented as an unordered graph.


To best concentrate the information of a jet image in a graph, pixels that are blank across all X_jets channels are discarded. For example, a pixel whose 8 channels read (0,0,0,0,0,0,0,0) is dropped, but one reading (0,0,0,0,1,0,0,0) is preserved. The choice of channels used in the node features leads to two ways of representing nodes:

  • Full channels with tracker layers, ECAL, HCAL, and BPIX layers (8 features).
  • Channels with tracker layers, ECAL, and HCAL (5 features).
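As a minimal sketch of this node-building step (not the project's actual preprocessing code; the function name and toy image are illustrative), blank pixels can be masked out and the surviving pixels turned into node positions and features:

```python
import numpy as np

def image_to_nodes(x_jets, use_bpix=True):
    """Convert a (125, 125, 8) jets image into graph nodes.

    A pixel becomes a node only if at least one of its channels is
    non-zero; pixels blank across all channels are discarded.
    Channel order assumed: pt, d0, dz, ECAL, HCAL, BPIX1-3.
    """
    if not use_bpix:
        # keep only the 5 tracker/calorimeter channels
        x_jets = x_jets[..., :5]
    mask = np.any(x_jets != 0, axis=-1)   # True where any channel fired
    pos = np.argwhere(mask)               # (num_nodes, 2) pixel coordinates
    feats = x_jets[mask]                  # (num_nodes, C) node features
    return pos, feats

# toy image: two non-blank pixels out of 125x125
img = np.zeros((125, 125, 8))
img[10, 20, 5] = 1.0   # a BPIX1-only hit
img[30, 40, 0] = 0.5   # a pt hit
pos, feats = image_to_nodes(img)                 # 2 nodes, 8 features each
pos5, feats5 = image_to_nodes(img, use_bpix=False)  # BPIX-only pixel dropped
```

Note how the BPIX-only pixel disappears in the 5-feature representation, since none of its remaining channels is non-zero.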

The figure below shows the distribution of the number of nodes per graph in the dataset.

Distribution of the Number of Nodes


Because the physical meaning of edges in high-energy particle data is not as clear as in social networks or molecular biology, there are many ways to define edges and connectivity.

One common practice is to make the graph fully connected. However, given the distribution of node counts plotted above, this adds too much computational complexity and memory cost, so representing a jet image as a fully connected unordered graph is not a wise choice.
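To make the cost concrete, here is a back-of-the-envelope comparison for a hypothetical jet graph with 500 nodes (the node count is illustrative, not taken from the dataset):

```python
# Edge counts: fully connected graph vs k-nearest-neighbor graph (k = 15)
n, k = 500, 15                 # hypothetical node count and neighbor count
fully_connected = n * (n - 1)  # every ordered pair of distinct nodes
knn = n * k                    # k outgoing edges per node
print(fully_connected, knn)    # 249500 vs 7500 edges, roughly 33x fewer
```

The quadratic growth of the fully connected graph is what makes it impractical for message passing over jets with many hit pixels.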

The edges (connectivity of nodes) used here can be defined in two ways:

Static Graph

  • Defined by K Nearest Neighbors.
  • Defined by Radius Neighbors.

Dynamic Graph

  • The connectivity of the nodes is not fixed when the node data is fed into the model. In each round of message passing, the connectivity is recomputed from the latest node values and defined by k nearest neighbors. A typical example is Dynamic Edge Convolution.
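The two static constructions can be sketched in a few lines of numpy (in practice one would use `knn_graph` and `radius_graph` from torch_cluster / PyTorch Geometric; these plain-numpy versions are only to show the idea):

```python
import numpy as np

def knn_edges(pos, k):
    """Connect each node to its k nearest neighbors (Euclidean distance)."""
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # no self-loops
    nbrs = np.argsort(d, axis=1)[:, :k]      # k closest nodes per row
    src = np.repeat(np.arange(len(pos)), k)
    return np.stack([src, nbrs.ravel()])     # (2, N*k) edge index

def radius_edges(pos, r):
    """Connect every pair of distinct nodes within distance r."""
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    src, dst = np.nonzero(d <= r)
    return np.stack([src, dst])

# three nearby nodes and one far-away node
pos = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
e_knn = knn_edges(pos, 2)        # every node gets exactly 2 neighbors
e_rad = radius_edges(pos, 1.5)   # only the three close nodes are linked
```

Note the difference: the k-NN graph guarantees a fixed degree per node (even the isolated node gets connected), while the radius graph can leave distant nodes with no edges at all.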

The detailed implementation of each method can be found here.

Model Architecture

Common Settings

To find which graph architecture works best for this project, I keep all model settings other than the architecture the same.

Optimizer: Adam with adaptive learning rate.

Loss function: Cross Entropy Loss

Max Epoch: 20

Batch Size: 32
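Putting the common settings together, a minimal PyTorch sketch might look like the following (the two-layer head, learning rate, and scheduler parameters are illustrative assumptions; "adaptive learning rate" is realized here with `ReduceLROnPlateau`, one common choice):

```python
import torch

# Hypothetical stand-in for a GNN: any model with 5-channel inputs
# and 2 output classes (tau / non-tau) would slot in here.
model = torch.nn.Sequential(torch.nn.Linear(5, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 2))
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# shrink the learning rate when the monitored loss plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=2)

MAX_EPOCHS, BATCH_SIZE = 20, 32
x = torch.randn(BATCH_SIZE, 5)            # dummy batch of 5-channel features
y = torch.randint(0, 2, (BATCH_SIZE,))    # dummy binary labels
for step in range(1):                     # one illustrative training step
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())           # would normally use validation loss
```

In the real runs, the loop iterates over the graph dataset for up to 20 epochs with batch size 32.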

Graph Convolution

The graph neural network operator from the “Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks” paper.

Graph Attention

The graph attentional operator from the “Graph Attention Networks” paper.

Graph SAGE

The GraphSAGE operator from the “Inductive Representation Learning on Large Graphs” paper.

Dynamic Edge Convolution

The dynamic edge convolutional operator from the “Dynamic Graph CNN for Learning on Point Clouds” paper (see torch_geometric.nn.conv.EdgeConv), where the graph is dynamically constructed using nearest neighbors in the feature space.
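The core of EdgeConv is the per-edge message mlp([x_i, x_j - x_i]) followed by a max over each node's neighbors; the "dynamic" part rebuilds the k-NN graph from the current feature space before every layer. A simplified numpy sketch (a single linear layer plus ReLU stands in for the MLP; shapes and weights are illustrative):

```python
import numpy as np

def edge_conv(x, edge_index, weight):
    """One EdgeConv layer: max-aggregate mlp([x_i, x_j - x_i]) per node."""
    src, dst = edge_index                              # edges (i -> neighbor j)
    msg_in = np.concatenate([x[src], x[dst] - x[src]], axis=1)
    msg = np.maximum(msg_in @ weight, 0)               # linear + ReLU "MLP"
    out = np.full((x.shape[0], weight.shape[1]), -np.inf)
    np.maximum.at(out, src, msg)                       # max over neighbors
    return out

def knn_in_feature_space(x, k):
    """Dynamic part: rebuild k-NN edges from the *current* features."""
    d = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]
    src = np.repeat(np.arange(len(x)), k)
    return np.stack([src, nbrs.ravel()])

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 5))                # 6 nodes, 5-channel features
w = rng.normal(size=(10, 8))               # hypothetical layer weights
h = edge_conv(x, knn_in_feature_space(x, k=3), w)   # (6, 8) new features
```

Stacking such layers, each recomputing `knn_in_feature_space` on its own input, gives the dynamic-graph behavior described above; the torch_geometric `DynamicEdgeConv` operator does exactly this with a learnable MLP.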


Performance Analysis


I combine different values of k for the nearest-neighbor graph, different channel sets for the node features, and different model architectures, then train the models on the same amount of data (9.6k samples) with the same common settings. The figure below plots model performance, where the Y axis is test accuracy.

From this plot, we find that GraphSAGE with 5 channels and k=15 is the best among the static-graph models, and is competitive with the Dynamic Edge Convolution model.

The table below shows the best Test AUC of each model architecture with corresponding graph representation.

Key Takeaways

  • From the node features' point of view, using all channels is not ideal. After dropping the BPIX layers, the models perform better on the 5-channel data (tracker layers, ECAL, and HCAL). One possible explanation is that the BPIX layers carry the same information as the tracker layers, only in a different form, and the repeated information hurts model robustness.
  • The models perform better as k increases up to a point, after which performance drops as k keeps growing. The best k for the nearest-neighbor graph depends on the model architecture. With r=1e-3, the radius-neighbor graph reaches its best Test AUC but still cannot match the k-nearest-neighbor graph.
  • When data is insufficient, CNN models overfit the dataset easily, whereas this phenomenon is much less pronounced for GNN models. This advantage makes GNNs stand out in relatively small dataset scenarios.

Future Work and Final Thoughts

This summer, I spent a fruitful time with the ML4SCI E2E community. Still, there are many things I haven't been able to explore because of time constraints. Here are some directions for future experiments:

  • Train models on a more comprehensive tau particle dataset with 13 channels.
  • Represent jet images as weighted graphs, and introduce more initial edge features to the models.
  • The current graph architectures rely on manual, empirical design; a recent technique called Neural Architecture Search can automatically learn the best graph architecture during training and may get closer to the optimal model.

Looking back on my journey through the summer, I'd say it was amazing. Everything was new to me: challenging topics in high-energy physics, collaborating with a global team, and the sense of achievement each time a model improved. Most importantly, I built up my confidence in tackling unfamiliar subjects and learning them quickly.

Finally, I would like to thank Professor Sergei Gleyzer for his support and advice in every group meeting, Shravan Chaudhari for informative instructions on dataset usage, and Purva Chaudhari for resolving my confusion when I started the program. Thanks also to Google and the GSoC organizers for this precious opportunity!

Important Links