Maximizing GPU Utilization via Data Loading Parallelization and Distributed Data Parallel

Ronald Teo
Thomson Reuters Labs
Apr 12, 2023

Interest in Large Language Models (LLMs) has exploded since ChatGPT was released, sparking the imagination of the masses with its ability to respond to user queries in a human-like manner. The downside of ChatGPT and other LLMs like it, however, is the huge amount of time and money required to train them. It is therefore imperative to train large language models as efficiently as possible to make working with them feasible.

Even with a smaller model such as the widely used BERT, which has five times fewer parameters than ChatGPT, training still poses significant challenges in terms of cost and time.

This article discusses the steps we took to reduce the time and cost of pre-training a BERT model on our domain-specific data. The training required 50 epochs, and we expected to need many experimental iterations to surpass our baseline benchmark. Each epoch took 15 hours, resulting in an estimated 31 days per experiment. At that rate, even a handful of experiments would have been unfeasible, taking at least half a year to complete, not to mention tens of thousands of dollars!

After we identified and rectified the training bottleneck, each epoch took only 30 minutes. This reduced the time needed to fully train a model from 31 days to 1 day. As a result, we were able to perform the task in a time-efficient and cost-effective manner.

One of the main drivers of our success was improved GPU utilization, achieved by removing data loading and training bottlenecks. In this article, I will explain how we identified the issues and the steps we took to fix them.

DL Training Workflow

Before diving in deeper, it is important to understand Deep Learning (DL) workloads. DL workloads are data and compute-intensive, and they exhibit specific performance patterns that distinguish them from other workloads.

Workload components consist of Data Loading, Data Preprocessing, and Training.

Data Loading

Data needs to be loaded from storage to the GPU. To avoid bottlenecks:

  • Data should be pre-fetched and loaded asynchronously, as sketched after this list.
  • The files should be in an appropriate file format (e.g. TFRecord, to avoid issues such as the “small file problem”).
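
For example, with TFRecord shards a tf.data input pipeline can read files in parallel and prefetch upcoming batches asynchronously so the GPU never waits on storage. This is a minimal, generic sketch; the file pattern and batch size are placeholders, not from our setup:

import tensorflow as tf

# Placeholder shard pattern; point this at your own TFRecord shards.
files = tf.data.Dataset.list_files("data/train-*.tfrecord")

dataset = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)  # read shards in parallel
    .shuffle(10_000)               # shuffle within a memory-bounded buffer
    .batch(256)                    # placeholder batch size
    .prefetch(tf.data.AUTOTUNE)    # keep the next batches ready while the GPU trains on the current one
)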

Data Preprocessing

Transformations such as data augmentation are performed on the fly and often involve large amounts of data. These operations are commonly performed on the CPU asynchronously while training is performed on the GPU. A potential CPU bottleneck can occur when the GPU is waiting for the next batch of data still being processed by the CPU.
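
To make this concrete, here is a minimal PyTorch sketch showing preprocessing that runs on the CPU inside a Dataset. The toy tensors and the trivial stand-in augmentation are placeholders, not taken from our project:

import torch
from torch.utils.data import Dataset

class AugmentedDataset(Dataset):
    """Toy example: augmentation happens on the fly, on the CPU, inside __getitem__."""

    def __init__(self, images, labels):
        self.images = images    # e.g. a float tensor of shape [N, C, H, W]
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        x = self.images[idx]
        x = x + 0.01 * torch.randn_like(x)    # cheap stand-in for a real augmentation
        if torch.rand(1).item() < 0.5:
            x = torch.flip(x, dims=[-1])      # random horizontal flip
        return x, self.labels[idx]

dataset = AugmentedDataset(torch.randn(100, 3, 224, 224), torch.zeros(100, dtype=torch.long))

When a data loader runs these __getitem__ calls in worker processes (covered later in this article), the CPU prepares the next batch while the GPU trains on the current one; the bottleneck appears when the CPU cannot keep up.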

Training

During training, inter-GPU communications and synchronizations can impact overall performance. Examples of potential bottlenecks are:

  • Transfers between host and device
    - Saving model checkpoint files
    - Computing and saving metrics
    - Copying new data batches (see the sketch after this list)
  • Distributed training synchronization across GPUs
    - Gradient accumulation
    - Parameter updates
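
As a concrete illustration of the batch-copying item above: host-to-device copies only overlap with GPU compute when the source tensor sits in pinned (page-locked) host memory and the copy is issued as non-blocking. Here is a minimal sketch of that pattern, not taken from our training code:

import torch

device = torch.device("cuda")

batch = torch.randn(256, 512)                    # a batch in ordinary (pageable) host memory
batch = batch.pin_memory()                       # move it into page-locked memory
batch_gpu = batch.to(device, non_blocking=True)  # copy returns immediately and overlaps with GPU compute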

GPU utilization is directly related to the amount of data the GPUs are able to process in parallel. Therefore, the batch size should be set as high as possible without running out of GPU memory.
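
With the HuggingFace Trainer used later in this article, the relevant knob is per_device_train_batch_size; if a bigger batch no longer fits, gradient_accumulation_steps can simulate a larger effective batch. A minimal sketch with placeholder values, not our actual settings:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-pretraining",      # placeholder output path
    per_device_train_batch_size=64,     # raise this until you are just below the GPU memory limit
    gradient_accumulation_steps=2,      # optional: larger effective batch without more GPU memory
)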

Steps Taken to Resolve BERT Pre-training Bottlenecks

Identifying the Issue

The first steps in solving an issue are recognizing that it exists and identifying its cause. To do that, we reviewed the system resource utilization charts to diagnose problem areas.

System Utilization Metrics

The charts indicated the following:

  • Aggregate GPU utilization was around 300 out of a possible 800 (8 GPUs × 100% each) on our ml.p3.16xlarge [8 V100 GPUs] instance. In other words, multiple GPUs were active, but none were fully utilized.
  • Aggregate GPU memory utilization was only around 400 out of 800.
  • CPU utilization was low.

Based on the charts, we could rule out being GPU or CPU bound, which meant we had ample room to improve utilization. We also assumed that data preprocessing wasn’t the issue, since the CPUs were underutilized. And because GPU utilization and GPU memory utilization were similarly low, we did not suspect the batch size as the culprit. Given this evidence, we suspected a training and data loading issue.

Data Parallel (DP) to Distributed Data Parallel (DDP)

The distributed training strategy we had been using was DataParallel (DP), which is known to cause workload imbalance. This is due to the additional GPU synchronization it requires compared to the DistributedDataParallel (DDP) strategy. A more detailed explanation can be found here.

In order to switch from DP to DDP, we modified our training launch script from

python train_experimental_model.py 

to

python -m torch.distributed.launch --nproc_per_node 8 train_experimental_model.py

# set nproc_per_node = number of GPUs
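
(On newer PyTorch releases, torchrun is the recommended replacement for torch.distributed.launch.) Under the hood, each launched process wraps the model in DistributedDataParallel and reads its own shard of the data; with the HuggingFace Trainer this happens automatically. The following is a minimal, generic PyTorch sketch of that pattern with a toy model and dataset, not our actual training script:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# One process per GPU. The launcher supplies each process's local rank
# (via the LOCAL_RANK env var, or a --local_rank argument on older launchers).
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# Toy stand-ins for the real model and dataset.
model = torch.nn.Linear(128, 2).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])     # gradients are all-reduced across GPUs
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))

# DistributedSampler gives each process a different, non-overlapping shard of the data.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for epoch in range(1):
    sampler.set_epoch(epoch)                    # reshuffle the shards each epoch
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                         # triggers the gradient all-reduce across GPUs
        optimizer.step()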

After switching from DP to DDP, we were able to improve our GPU utilization from roughly 300 to just under 700 (out of 800), as shown below.

DDP GPU Utilization

Increase the Number of Data Loader Workers to Load Data into Pinned Memory

We suspected that the GPUs were waiting for the data to be loaded from CPU memory to GPU memory, and our data loader couldn’t keep up.

Therefore, we passed the following parameters to our HuggingFace trainer.

from transformers import TrainingArguments

training_args = TrainingArguments(dataloader_pin_memory=True, dataloader_num_workers=8)

# set dataloader_num_workers to less than the number of available CPU cores

See the HuggingFace documentation for more details on the TrainingArguments class. Essentially, the code above changes data loading from a serial process to a parallel one. This makes the data readily available in pinned memory for the GPU to consume, thus removing the data loading bottleneck. A more detailed explanation can be found here.
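
For readers who are not using the HuggingFace Trainer, the equivalent with a plain PyTorch DataLoader looks roughly like this (the toy dataset and batch size are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real training dataset.
train_dataset = TensorDataset(torch.randn(10_000, 512), torch.zeros(10_000, dtype=torch.long))

loader = DataLoader(
    train_dataset,
    batch_size=64,
    num_workers=8,      # 8 worker processes load and preprocess batches in parallel
    pin_memory=True,    # batches land in page-locked host memory, ready for fast GPU copies
)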

image credit: telesense.co

Parallel Data Loading

After this modification, we were able to push our utilization past the 700 mark (out of 800).

Our Final, Optimal Utilization of Resources

Conclusion

*Cost comparison based on AWS p3 and g4 instance prices ($28.15/hr and $4.89/hr, respectively)

With these optimizations in place, we were able to fully utilize the computing resources in our training job. This turned our pre-training task from an unfeasible 31 days per experiment (model training time) to just 1 day. Besides saving time, using our cloud resources optimally drastically reduced the cost of doing the work.

This article covers a very narrow aspect of troubleshooting and solving bottleneck issues. We used the HuggingFace framework here, but the concepts of pinned memory, parallel data loading, and distributed data parallelism are common across most frameworks. You should consult your specific framework’s documentation on how to best take advantage of them.


The troubleshooting described in this article was performed manually by examining resource utilization graphs. To automate the process, one could explore Amazon SageMaker Profiler.
