Optimization of GPU Tracking Pipeline for Acts GPU R&D — Part 1

Chamodya Kavishka
Jul 26, 2022


Introduction

Acts is a track reconstruction software toolkit for high energy physics experiments. With the potentially increased number of particle interactions in the future High Luminosity Large Hadron Collider (HL-LHC) experiments, track reconstruction time will also increase. Therefore, Acts GPU R&D (Research and Development) is conducted under traccc, vecmem and detray to accelerate track reconstruction. Vecmem provides memory management tools for convenient GPU memory management and caching allocators, Detray is a geometry builder which translates the CPU geometry into a GPU one (I have not fully got my head around this one yet), and finally Traccc demonstrates the GPU tracking pipeline.

My focus is on improving the throughput of the Traccc pipeline and benchmarking the results. This is achieved by using the caching allocators provided by Vecmem, and CUDA-MPS or CUDA-MIG, which are two ways to improve concurrent GPU utilization.

Let's understand the use case briefly

Particles colliding at very high speeds break up into other (smaller) particles which are projected outwards. These projected particles hit very sensitive detector planes; the interactions between the particles and the detector planes (known as cells) are recorded and ultimately used to reconstruct the particle tracks. In reality many such particles are produced, and they may interact with each other (like a chain reaction), producing a very large number of events. Therefore improving the throughput (events/time) of these algorithms is important.

How can CUDA Multi-Process Service (MPS) Improve Throughput?

The Nvidia documentation explains this a lot more clearly than I can, but let me provide the necessary information briefly.

Nvidia uses concurrent scheduling (kernels get scheduled at the same time when possible) for kernels from work queues of the same process (same CUDA context). However, it uses time-sliced scheduling for kernels from multiple processes (different CUDA contexts). Therefore multiple processes cannot utilize the GPU simultaneously.

To put it in simple terms (maybe slightly inaccurate though 😉), the multi-process server acts as a middleman, receiving kernels from multiple processes and submitting them to the GPU as if they all came from the same CUDA context.

Imagine a program occupying just 10% of your GPU's resources: theoretically, you can run 10 such processes simultaneously with MPS to occupy ~100% of your GPU. This is the main advantage of using MPS.
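As a rough back-of-the-envelope model of that claim (my own simplification, not from the Nvidia docs), assume each process keeps a fixed fraction of the GPU busy: under time-slicing the processes serialize, while under MPS they overlap until the GPU saturates.

```python
def speedup_with_mps(n_procs, gpu_fraction_per_proc):
    """Idealized throughput gain of MPS over time-sliced scheduling.

    Under time-slicing, n processes take n times as long as one.
    Under MPS, they overlap, so total time only grows once the
    combined occupancy exceeds 100% of the GPU.
    """
    time_sliced = n_procs                                 # n full time slices
    mps = max(1.0, n_procs * gpu_fraction_per_proc)       # overlap until saturated
    return time_sliced / mps

# A kernel occupying 10% of the GPU: 10 concurrent processes
# can, in theory, finish in the time of one.
print(speedup_with_mps(10, 0.10))  # -> 10.0
print(speedup_with_mps(20, 0.10))  # -> 10.0 (GPU saturates at 10 processes)
```

Of course, this ignores host-side work and data transfers, which is exactly the limitation discussed in the Challenge section below.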


It is also good to know about Nvidia's Hyper-Q (available on Kepler and later architectures), which allows kernels to be submitted to the GPU through multiple hardware work queues simultaneously. Imagine a 1-lane highway from the host to the device versus a 32-lane highway (the highways here being work queues).

MPS documentation : https://docs.nvidia.com/deploy/mps/index.html

Traccc Pipeline

This entire pipeline can be broken down into sub-components as follows:

  1. Clusterization groups cells that are adjacent to each other.
  2. Measurement creation calculates the weighted average position of each cluster.
  3. Spacepoint formation converts these measurements' positions from local to global space.
  4. Seeding finds sets of 3 spacepoints that belong to the same or adjacent bins.
  5. Track parameter estimation does a global to local transform.
Figure-01 (https://github.com/acts-project/traccc)
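The first three steps can be illustrated with a tiny, self-contained sketch. This is not the traccc implementation (which runs these stages as CUDA kernels over real detector geometry); the adjacency rule, the charge weighting and the local-to-global transform here are all simplified assumptions for illustration.

```python
def clusterize(cells):
    """Group cells that are adjacent on the module (8-connectivity), via flood fill."""
    by_pos = {(c[0], c[1]): c for c in cells}   # (x, y) -> (x, y, charge)
    seen, clusters = set(), []
    for pos in by_pos:
        if pos in seen:
            continue
        stack, cluster = [pos], []
        seen.add(pos)
        while stack:
            x, y = stack.pop()
            cluster.append(by_pos[(x, y)])
            for dx in (-1, 0, 1):               # visit the 8 neighbouring pixels
                for dy in (-1, 0, 1):
                    nb = (x + dx, y + dy)
                    if nb in by_pos and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(cluster)
    return clusters

def measurement(cluster):
    """Weighted average of the cell positions (weights = deposited charge)."""
    total = sum(w for _, _, w in cluster)
    return (sum(x * w for x, _, w in cluster) / total,
            sum(y * w for _, y, w in cluster) / total)

def spacepoint(meas, module_origin):
    """Local -> global: here just a translation by a made-up module position."""
    return (module_origin[0] + meas[0], module_origin[1] + meas[1], module_origin[2])

cells = [(0, 0, 1.0), (0, 1, 3.0), (5, 5, 2.0)]   # (x, y, charge)
clusters = clusterize(cells)
print(len(clusters))               # -> 2 (the two touching cells form one cluster)
print(measurement(clusters[0]))    # -> (0.0, 0.75)
```

The real clusterization is the interesting part to parallelize, since flood fill is inherently sequential per cluster; the CUDA version works quite differently.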

Definitions that will be helpful to understand the pipeline

Cells are particle interactions on the detector plane. Each cell contains information about the local position on the detector plane. A single particle interaction produces multiple cells.
Clusters are groups of cells that are adjacent to each other on the detector plane.
Measurements are the weighted-average local positions of the cells in a cluster.
Spacepoints are the local positions of measurements transformed into global positions. These are the input to the seeding algorithm.
Binning groups spacepoints into a 2D grid. A bin is a section of the detector geometry which contains a number of spacepoints.
Seeds are sets of three spacepoints that may belong to the same or adjacent bins and might be produced by the same particle.
Prototracks are produced by track parameter estimation, which is a global to local transformation on the surface.
A track is the set of all spacepoints produced by the same particle.
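To make the binning definition concrete, here is a toy sketch. This is my own simplification: the actual Acts/traccc grid is built in detector coordinates and is considerably more involved, and the bin counts and ranges below are made up.

```python
import math

def bin_spacepoints(spacepoints, phi_bins=8, z_bins=4, z_range=(-100.0, 100.0)):
    """Toy 2D binning: group global-space points by azimuthal angle (phi) and z."""
    grid = {}
    z_lo, z_hi = z_range
    for x, y, z in spacepoints:
        phi = math.atan2(y, x)  # azimuthal angle in [-pi, pi]
        p = min(int((phi + math.pi) / (2 * math.pi) * phi_bins), phi_bins - 1)
        q = min(int((z - z_lo) / (z_hi - z_lo) * z_bins), z_bins - 1)
        grid.setdefault((p, q), []).append((x, y, z))
    return grid

# Two nearby points fall into the same bin; the third lands elsewhere.
grid = bin_spacepoints([(1.0, 0.0, 0.0), (1.0, 0.1, 0.0), (-1.0, 0.0, 50.0)])
print(len(grid))  # -> 2
```

The point of binning is that seeding then only has to combine spacepoints from the same or neighbouring bins, instead of trying every triplet in the event.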


Okay.. so how does MPS help Traccc?

Generally, a single instance of Traccc computes several events sequentially, and each event undergoes the pipeline shown in Figure-01.
Shown below is how multiple processes would execute 10 events each.

Figure-2 Multiple processes computing 10 events each

The algorithms in the Traccc pipeline are parallelized using CUDA, and a single event does not fully occupy the GPU resources. Moreover, collision events are independent of each other, hence multiple processes can run simultaneously on different event data, improving the overall throughput.

Challenge

However, obtaining a higher throughput just by using CUDA-MPS is not as easy as it seems. In a recent version of Traccc the timings for one event are as follows:

$ build/bin/traccc_seq_example_cuda --detector_file=tml_detector/trackml-detector.csv --digitization_config_file=tml_detector/default-geometric-config-generic.json --cell_directory=tml_full/ttbar_mu200/ --events=1 --run_cpu=0 --input-binary
Running build/bin/traccc_seq_example_cuda tml_detector/trackml-detector.csv tml_full/ttbar_mu200/ 1
==> Statistics ...
- read 92504 spacepoints from 13565 modules
- created 321488 cells
- created 92504 meaurements
- created 92504 spacepoints
- created (cpu) 0 seeds
- created (cuda) 15424 seeds
==> Elpased time ...
wall time 0.681589
file reading (cpu) 0.0591937
clusterization_time (cpu) 0.0182984
spacepoint_formation_time (cpu) 0.00120419
clusterization and sp formation (cuda) 0.0172306
seeding_time (cpu) 4.544e-06
seeding_time (cuda) 0.0521068
tr_par_esti_time (cpu) 7.4e-08
tr_par_esti_time (cuda) 0.0018714

Roughly, the kernel execution times sum up to ~70 ms (excluding data copies), which is only a small fraction of the wall time. Therefore it is not realistic to expect a major increase in throughput just by running processes simultaneously, which is why reducing the sequentially running portions and the data copies is essential to get a better effect from MPS.
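This is essentially Amdahl's law applied to the run above. Even in the (hypothetical) best case where MPS perfectly overlaps all kernel time across processes, the remaining host-side work bounds the throughput gain:

```python
def max_mps_speedup(wall_time, overlappable):
    """Amdahl-style bound: only the GPU-kernel fraction overlaps under MPS;
    file reading, host-side work and copies stay effectively serial."""
    serial = wall_time - overlappable
    return wall_time / serial

# Numbers from the traccc run above: ~70 ms of kernels in a ~0.68 s wall time.
print(round(max_mps_speedup(0.68, 0.07), 2))  # -> 1.11
```

In other words, at most ~11% more throughput from kernel overlap alone, which is why shrinking the serial part matters so much.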

Vecmem Caching Allocators

Vecmem provides convenient GPU memory allocation. There are several upstream memory resources derived from the C++ Polymorphic Memory Resource (PMR), and downstream memory resources (caching allocators). For example, using the managed_memory_resource upstream resource we can allocate unified memory (it calls cudaMallocManaged). Given that each Traccc instance iterates over several events sequentially, memory needs to be allocated many times from the upstream resource, which takes a lot of time. This is an expensive operation, and major improvements can be obtained if we eliminate it (see the results using the contiguous memory resource).

This can be eliminated by using the Vecmem caching allocators, which reuse previously allocated memory. Two such caching allocators were tested on an older version of Traccc (there is no notable difference in wall time with the recent one).

For instance, the contiguous memory resource allocates a block of memory of a fixed (hard-coded) size, and it is deallocated only when the program ends; hence it is not scalable. For these two reasons it was initially decided that the binary page memory resource would be used, but testing it resulted in much worse timings than running without a caching allocator 😢, so the improvement that can be obtained by using the contiguous memory resource is shown in the charts below.
(After a recent investigation I think the contiguous memory resource can be modified to suit our needs.)
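The idea behind a caching allocator can be sketched in a few lines. This is a conceptual Python analogue, not the vecmem C++ implementation: freed blocks go into a free list keyed by size and are handed out again, instead of going back to the expensive upstream resource on every event.

```python
class CachingAllocator:
    """Conceptual caching allocator: reuse freed blocks instead of
    asking the (expensive) upstream resource again."""

    def __init__(self, upstream_alloc):
        self.upstream_alloc = upstream_alloc   # e.g. would wrap cudaMalloc
        self.free_blocks = {}                  # size -> list of cached blocks
        self.upstream_calls = 0

    def allocate(self, size):
        cached = self.free_blocks.get(size)
        if cached:
            return cached.pop()                # reuse: no upstream call
        self.upstream_calls += 1
        return self.upstream_alloc(size)

    def deallocate(self, block, size):
        self.free_blocks.setdefault(size, []).append(block)  # cache for reuse

alloc = CachingAllocator(lambda size: bytearray(size))
for _ in range(100):                 # e.g. one allocation per event
    buf = alloc.allocate(1 << 20)
    alloc.deallocate(buf, 1 << 20)
print(alloc.upstream_calls)  # -> 1 (every event after the first reuses the block)
```

A real implementation also has to deal with varying allocation sizes across events, which is exactly where the fixed-size contiguous memory resource falls short.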

The wall time using the managed memory resource is 2.512 s, while with the contiguous memory resource it is 0.9645 s (roughly a 2.6x speedup).

Note that the first event includes the CUDA initialization overhead, hence the abnormally high clusterization and spacepoint formation time.

Benchmarks Done to Date

Benchmarks were done by varying the number of processes and the number of events each process computes, and measuring the wall time for each job. This procedure was applied to the CPU algorithm and to the CUDA algorithm with and without MPS. As mentioned previously, there is no major improvement in throughput using MPS.

Code: https://github.com/Chamodya-ka/traccc/tree/mps-test-13-07

Procedure of benchmarking:

Iterate e from 10 -> E, where e is the number of events a single process computes.

Iterate n from 1 -> N; at every iteration start n processes, each computing e events, and set an appropriate CPU affinity using taskset.

#!/bin/bash
# Run up to max_proc concurrent CUDA processes, e.g. n in range 1 -> 32.
# Benchmark in increments of 10 events per process, e.g. i in range 10 -> 150.
max_proc=2
max_events=150
increment=10
cores=1   # number of cores (sockets)
threads=1 # number of threads per core
gpu=1
path='../data'
log=0
log_path=""
while getopts l:c:t:p:e:g:d: flag;
do
    case "${flag}" in
        c) cores=${OPTARG};;
        t) threads=${OPTARG};;
        p) max_proc=${OPTARG};;
        e) max_events=${OPTARG};;
        g) gpu=${OPTARG};;
        d) path=${OPTARG};;
        l) log=${OPTARG};;
    esac
done
echo "logs : $log"
if [ $log != 0 ]; then
    Tstart=$(date "+%s")
    mkdir ./kernel_logs_$Tstart/
fi
echo "$max_proc $max_events";
for ((i = 10; i <= max_events; i += increment))
do
    echo "starting to benchmark with $i events per process";
    for ((j = 1; j <= max_proc; j++))
    do
        echo "starting new run with $j processes";
        if [ $log != 0 ]; then
            mkdir -p ./kernel_logs_$Tstart/$i
            mkdir ./kernel_logs_$Tstart/$i/$j
            log_path="./kernel_logs_$Tstart/$i/$j"
            ./benchmark_kernels.sh -p"$path" -n$j -e$i -c$cores -t$threads -g$gpu -l"$log_path"
        else
            ./benchmark_cuda.sh -p"$path" -n$j -e$i -c$cores -t$threads -g$gpu
        fi
        sleep 1
    done
done

And

#!/bin/bash
num_proc=1 # number of processes expected to run concurrently
events=1   # number of events each process will compute
cores=1    # number of cores (sockets)
threads=1  # number of threads per core
datapath=""
numgpus=1
log_dir=""
while getopts n:e:c:t:p:g:l: flag;
do
    case "${flag}" in
        n) num_proc=${OPTARG};;
        e) events=${OPTARG};;
        c) cores=${OPTARG};;
        t) threads=${OPTARG};;
        p) datapath=${OPTARG};;
        g) numgpus=${OPTARG};;
        l) log_dir=${OPTARG};;
    esac
done
echo "$datapath"
echo "number of processes : $num_proc";
echo "number of events : $events";
echo "log path $log_dir"
export TRACCC_TEST_DATA_DIR=$datapath
Tstart=$(date "+%s.%3N")
for ((i = 0; i < num_proc; i++))
do
    # get processor id
    p=$((($i % ($cores * $threads))))
    echo " processor id $p";
    # get gpu id
    gpu_id=$(($i % $numgpus))
    echo " gpu $gpu_id";
    if [ -z $log_dir ]; then
        CUDA_VISIBLE_DEVICES=$gpu_id taskset -c $p ../build/bin/traccc_cuda_example --detector_file=tml_detector/trackml-detector.csv --digitization_config_file=tml_detector/default-geometric-config-generic.json --cell_directory=tml_full/ttbar_mu200/ --events=$events --input-binary &
    else
        CUDA_VISIBLE_DEVICES=$gpu_id taskset -c $p nvprof -o $log_dir/$i ../build/bin/traccc_cuda_example --detector_file=tml_detector/trackml-detector.csv --digitization_config_file=tml_detector/default-geometric-config-generic.json --cell_directory=tml_full/ttbar_mu200/ --events=$events --input-binary &
    fi
done
wait
Tend=$(date "+%s.%3N")
elapsed=$(echo "scale=3; $Tend - $Tstart" | bc)
python3 log_data.py $num_proc $events $elapsed $cores $threads cuda
echo "Elapsed: $elapsed s"

Similarly, another script is used to run the CPU algorithm. For the CPU algorithm, benchmarking was done with and without turbo boost.

I am currently in the process of collecting and analyzing individual kernel execution times to observe how kernel execution time is affected by the number of MPS clients.

Code: https://github.com/Chamodya-ka/traccc/tree/mps-test-22-07

Other Related Contributions

Added clusterization and spacepoint formation algorithms using common kernels to CUDA. This is required for benchmarking the entire pipeline (except Kalman filtering).

PR : https://github.com/acts-project/traccc/pull/209

Ported the clusterization and spacepoint formation algorithms from SYCL to CUDA (this did not get merged due to a structural change in Traccc). It was a good learning experience.

PR : https://github.com/acts-project/traccc/pull/206

Created a couple of issues in the Vecmem library that I came across while testing:

https://github.com/acts-project/vecmem/issues/182
https://github.com/acts-project/vecmem/issues/180
