Data Parallelism in Machine Learning Training

Soonmo Seong
Published in Cloud Villains · 7 min read · Apr 12, 2024

In the era of Generative AI, a distributed training system is essential for anyone who wants to work with these models, because they are typically too large to train efficiently on a single device. The first concept of distributed training you should know is data parallelism: a very large dataset is split across multiple GPUs, and each GPU holds a copy of the machine learning model so that training proceeds in parallel.

On a single GPU, the full training state is available locally: model parameters, optimizer state, gradients, and so on. With data parallelism, the training set is split into mini-batches that are evenly distributed across the GPUs. Each GPU trains the model on only a subset of the total dataset, so the model state on each GPU drifts slightly apart from the others. To make the replicas converge while training, the model state must be regularly reconciled across all GPUs, either synchronously or asynchronously.

Data Parallelism


As I mentioned, data parallelism is a core technique for distributed training of large machine learning models. The basic idea is to split the training dataset across multiple GPUs, where each GPU maintains a full copy of the model. During each training iteration:

  1. The dataset is divided into small batches, and each batch is assigned to a different GPU.
  2. Each GPU independently computes the gradients for its assigned batch using its own copied model.
  3. The gradients from all GPUs are then aggregated.
  4. The aggregated gradients are used to update the model parameters on all GPUs.

This approach allows training to scale nearly linearly with the number of GPUs, since each GPU processes a different batch of data in parallel. The key benefit is that you can harness the aggregate compute power and memory of many GPUs to train large models far faster than a single GPU could. Note that each GPU still holds a full copy of the model, so data parallelism by itself does not let the model grow beyond a single GPU's memory.
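To make the four steps concrete, here is a minimal sketch of a data-parallel training loop using PyTorch's DistributedDataParallel, which shards the data across processes and averages gradients automatically during the backward pass. The tiny linear model, random dataset, and hyperparameters are placeholders; the script assumes it is launched with torchrun (one process per GPU) on a machine with NCCL available.

```python
# Minimal data-parallel training loop with PyTorch DistributedDataParallel.
# Launch with:  torchrun --nproc_per_node=<num_gpus> train_dp.py
# The tiny linear model and random dataset are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Every GPU holds a full copy of the model (step 2 in the list above).
    model = DDP(torch.nn.Linear(32, 1).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    # DistributedSampler assigns each GPU a disjoint shard of the data (step 1).
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    for epoch in range(3):
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()     # local gradients, then DDP all-reduces them (steps 2-3)
            optimizer.step()    # identical update applied on every GPU (step 4)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```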

Synchronous vs. Asynchronous Updates

We already touched on the need to regularly update the model state across all GPUs. There are two main approaches for this:

Synchronous updates

In this approach, the gradients from all GPUs are aggregated, and the model parameters are updated simultaneously on all GPUs after each iteration. This keeps the model state consistent across GPUs, but it can limit training throughput if communication between GPUs is slow. In other words, every GPU either exchanges its updates with all other GPUs or reports them to a central parameter server that redistributes them. Because updates are applied simultaneously, the model state stays in sync on all GPUs.
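For illustration, here is the synchronous scheme written out by hand, with an explicit all-reduce that averages each parameter's gradient across GPUs before the optimizer step. It assumes torch.distributed has already been initialized, for example as in the earlier sketch; DistributedDataParallel performs essentially this averaging for you during backward().

```python
# The synchronous scheme written out by hand: compute local gradients, average
# them across all GPUs, then apply the same update everywhere. Assumes
# torch.distributed is already initialized (e.g., as in the sketch above).
import torch
import torch.distributed as dist

def synchronous_step(model: torch.nn.Module, loss: torch.Tensor,
                     optimizer: torch.optim.Optimizer) -> None:
    optimizer.zero_grad()
    loss.backward()                                        # local gradients only
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum over all GPUs
            p.grad /= world_size                           # -> average gradient
    optimizer.step()                                       # same step on every GPU
```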

Asynchronous updates

Here, each GPU updates its local model copy independently, without waiting for the other GPUs. The parameter updates are then shared asynchronously with the other GPUs, typically through a parameter server architecture. This can improve training throughput, but it can also lead to inconsistent model states across GPUs, which may hurt convergence.

Put differently, if gradient updates are sent to all other nodes or to a central server and applied immediately, scaling becomes the problem. As the number of GPUs increases, a parameter server inevitably becomes a bottleneck; without a parameter server, network congestion becomes the issue instead. Either way, training ends up slower than expected even though many GPUs are in use.
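As a toy, single-process illustration (not a real distributed setup), the sketch below uses Python threads as stand-in workers and a hypothetical ParameterServer class. Workers pull parameters and push gradients whenever they are ready, with no barrier, so some gradients are computed from stale parameters, which is exactly the staleness issue discussed next.

```python
# Toy, single-process illustration of asynchronous updates via a parameter
# server. Threads stand in for GPUs; ParameterServer is a hypothetical class,
# not a real library API.
import threading
import numpy as np

class ParameterServer:
    def __init__(self, dim: int, lr: float = 0.1):
        self.params = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self) -> np.ndarray:
        with self.lock:
            return self.params.copy()

    def push(self, grad: np.ndarray) -> None:
        with self.lock:
            self.params -= self.lr * grad    # applied immediately, no barrier

def worker(server: ParameterServer, shard_target: np.ndarray, steps: int = 200):
    for _ in range(steps):
        params = server.pull()               # may already be stale
        grad = params - shard_target         # gradient of 0.5 * ||params - target||^2
        server.push(grad)

server = ParameterServer(dim=4)
shards = [np.full(4, t) for t in (1.0, 2.0, 3.0, 4.0)]   # one toy "shard" per worker
threads = [threading.Thread(target=worker, args=(server, s)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(server.params)   # hovers near the shard mean (~2.5), noisy due to async ordering
```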

Choosing between synchronous and asynchronous updates often involves a trade-off between training throughput and model convergence. Synchronous updates generally lead to more stable training, while asynchronous updates can be faster but may require more careful hyperparameter tuning.

The Challenges of Asynchronous Updates

As discussed above, the key issue with asynchronous updates is the potential for inconsistent model states across the GPUs. When each GPU updates its local model copy independently, without waiting for the others, the following problems can arise:

  1. Staleness: The parameter updates made by one GPU may not reach the other GPUs in time. As a result, some GPUs train on “stale” model parameters that are out of sync with the latest updates.
  2. Divergence: The independent updates on each GPU can cause the model parameters to diverge, leading to inconsistent internal states across the replicas. This can negatively impact the model’s ability to converge to a stable solution.
  3. Bottlenecks: Without a central parameter server, communicating parameter updates across the network can cause congestion and become a performance bottleneck, especially as the number of GPUs increases.

However, AllReduce handles these issues efficiently. Ring-AllReduce is one of a family of decentralized collective algorithms known as AllReduce; it organizes the nodes in a directed, one-way ring.

How AllReduce Works

There are multiple ways to implement AllReduce. The two most obvious ones are an all-to-all exchange, where every GPU sends its full gradient to every other GPU and sums locally, and a reduce-then-broadcast scheme, where every GPU sends its gradient to a single GPU (say GPU_1), which aggregates the gradients and broadcasts the result back.

These are the easiest AllReduce algorithms to think of; however, each has a problem. The first wastes bandwidth on redundant communication, and in the second GPU_1 becomes a bottleneck as the number of GPUs increases.
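The small NumPy sketch below simulates both naive schemes with four "GPUs" (plain Python lists); the comments point out the traffic pattern that makes each one scale poorly.

```python
# NumPy sketch of the two naive AllReduce schemes with 4 simulated GPUs.
# "Sending" is just a Python list read here; the comments note the traffic
# pattern that makes each scheme scale poorly.
import numpy as np

grads = [np.array([1.0, 2.0, 3.0, 4.0]) * (i + 1) for i in range(4)]  # per-GPU gradients

# 1) All-to-all: every GPU receives every other GPU's full gradient and sums
#    locally. Each GPU transfers N - 1 full gradients: redundant communication.
all_to_all = [sum(grads) for _ in grads]

# 2) Reduce-to-one + broadcast: everyone sends to GPU_1, which sums the
#    gradients and broadcasts the result. GPU_1's link becomes the bottleneck
#    as the number of GPUs grows.
total_on_gpu1 = sum(grads)
reduce_broadcast = [total_on_gpu1.copy() for _ in grads]

# Both schemes produce the same aggregate on every GPU.
assert all(np.array_equal(a, b) for a, b in zip(all_to_all, reduce_broadcast))
print(all_to_all[0])   # [10. 20. 30. 40.]
```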

How Ring-AllReduce Works

Despite the hurdles of these naive AllReduce algorithms, Ring-AllReduce provides an elegant solution to the same problem. The key benefits of Ring-AllReduce are:

  1. Efficient communication: Ring-AllReduce organizes the GPUs in a ring topology, allowing the gradients to be aggregated efficiently, with each GPU only needing to communicate with its immediate neighbors in the ring.
  2. Decentralization: The ring-based communication pattern of Ring-AllReduce eliminates the need for a centralized parameter server. This avoids the bottleneck that can occur with a parameter server architecture.
  3. Synchronization: This type of AllReduce operation ensures that the parameter updates are synchronized across all GPUs, maintaining a consistent model state and avoiding the divergence issues that can arise with fully asynchronous updates.

By using Ring-AllReduce, a distributed training system keeps communication efficient and throughput high while avoiding the downsides of fully asynchronous updates (inconsistent model state and divergence). The synchronization provided by Ring-AllReduce helps stabilize the training process and improve convergence.

Ring-AllReduce has two steps, Scatter-Reduce and AllGather.

Scatter-Reduce

Before looking at how this step communicates, assume there are 4 GPUs, each holding an identical copy of a small model whose gradient has 4 elements (one per neuron). After computing gradients, each GPU has its own local gradient; for example, the model on GPU_1 has the gradient elements A1, B1, C1 and D1.

To aggregate the gradients with Scatter-Reduce, every GPU sends one element of its gradient to the next GPU in the ring and adds the element it receives to its own copy. Repeating this for 3 (that is, N - 1) steps leaves each GPU holding the complete sum of exactly one gradient element, as the simulation below shows.
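Here is a NumPy simulation of the Scatter-Reduce phase for this 4-GPU example. The concrete gradient values are made up, grads[0] plays the role of [A1, B1, C1, D1], and "sending" is just an array update, but the chunk indexing follows the ring pattern described above.

```python
# NumPy simulation of the Scatter-Reduce phase with 4 GPUs and a 4-element
# gradient per GPU; grads[0] plays the role of [A1, B1, C1, D1], grads[1] of
# [A2, B2, C2, D2], and so on.
import numpy as np

N = 4
grads = [np.array([10.0 * (g + 1) + c for c in range(N)]) for g in range(N)]

for step in range(N - 1):
    # In step k, GPU g sends chunk (g - k) mod N to GPU (g + 1) mod N,
    # which adds the received value to its own copy of that chunk.
    sends = [(g, (g - step) % N, grads[g][(g - step) % N]) for g in range(N)]
    for g, chunk, value in sends:
        grads[(g + 1) % N][chunk] += value

# Each GPU now owns the complete sum of exactly one chunk: GPU g holds the
# total of chunk (g + 1) mod N, e.g. GPU_1 owns the sum of the second elements.
for g in range(N):
    print(f"GPU_{g + 1}:", grads[g])
```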

AllGather

After Scatter-Reduce is done, each GPU holds the complete sum of one gradient element gathered from all the GPUs. In this example, GPU_1 ends up with the sum of the second elements from all the GPUs.

AllGather then makes every GPU hold the same aggregated gradient. Each GPU passes the fully summed element it owns to the next GPU in the ring (for example, the first GPU sends its summed element to the second GPU), and after another 3 steps all GPUs have identical, fully aggregated gradients.
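Putting the two phases together, here is a self-contained NumPy sketch of the full Ring-AllReduce. The gradient values are arbitrary, "communication" is simulated with array copies, and the result is checked against a direct sum of the per-GPU gradients.

```python
# Self-contained NumPy sketch of full Ring-AllReduce: Scatter-Reduce followed
# by AllGather, checked against a direct sum.
import numpy as np

def ring_allreduce(grads):
    n = len(grads)
    grads = [g.astype(float).copy() for g in grads]

    # Phase 1: Scatter-Reduce. After n - 1 steps, GPU g owns the complete
    # sum of chunk (g + 1) mod n.
    for step in range(n - 1):
        sends = [(g, (g - step) % n, grads[g][(g - step) % n]) for g in range(n)]
        for g, chunk, value in sends:
            grads[(g + 1) % n][chunk] += value

    # Phase 2: AllGather. Each GPU forwards the fully reduced chunk it holds;
    # the receiver overwrites (no further summing). After n - 1 more steps,
    # every GPU has every chunk's total.
    for step in range(n - 1):
        sends = [(g, (g + 1 - step) % n, grads[g][(g + 1 - step) % n]) for g in range(n)]
        for g, chunk, value in sends:
            grads[(g + 1) % n][chunk] = value

    return grads

gpu_grads = [np.array([1.0, 2.0, 3.0, 4.0]) * (i + 1) for i in range(4)]
result = ring_allreduce(gpu_grads)
expected = sum(gpu_grads)
assert all(np.array_equal(r, expected) for r in result)
print(result[0])   # [10. 20. 30. 40.] on every simulated GPU
```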

Conclusion

We discussed distributed training for large-scale generative AI models, focusing on the concept of data parallelism. Key points:

  1. Data parallelism is a core technique for distributed training, where the dataset is split across multiple GPUs, each with a full copy of the model.
  2. There are two main approaches for updating the model state across GPUs:
    - Synchronous updates: Gradients are aggregated and model parameters are updated simultaneously on all GPUs.
    - Asynchronous updates: Each GPU updates its local model independently, sharing updates asynchronously.
  3. Asynchronous updates can improve training throughput but may lead to issues like staleness, divergence, and bottlenecks due to inconsistent model states across GPUs.
  4. Ring-AllReduce, a decentralized AllReduce algorithm, provides an efficient solution by organizing GPUs in a ring topology. This allows efficient communication between neighboring GPUs, eliminates the need for a central parameter server, and keeps parameter updates synchronized so that the model state stays consistent.
  5. Ring-AllReduce has two main steps. Scatter-Reduce aggregates gradients by passing elements between neighboring GPUs. AllGather shares the aggregated gradients across all GPUs.

In short, data parallelism is a foundation of distributed training, asynchronous updates bring real challenges, and Ring-AllReduce offers an efficient, decentralized way to keep model replicas in sync.
