9 libraries for parallel & distributed training/inference of deep learning models

ML Blogger
6 min read · Oct 3, 2022


In this blog we will cover a few basics of large model training before jumping to the list of available libraries. If you are already familiar with these basics, feel free to skip ahead to the list of libraries below.

Basics of training large models

Large deep learning models require a significant amount of memory to train. Memory is needed to store intermediate activations, weights, gradients and so on during training. Some models can be trained only with a very small batch size on a single GPU, while others do not fit on a single GPU at all. This makes training inefficient, and in some cases impossible.

There are two main approaches to training large-scale deep learning models:

  • Data parallelism
  • Model parallelism

We will first discuss these core concepts along with recent advances and enhancements to them, and then look at some of the available libraries which implement these methods.

Data parallelism

The simplest scenario for data parallelism is when the model fits completely into the memory of a single GPU. Even then, we may be limited in the batch size we can train with, which slows training down.

The solution is to run a separate instance of the model on each GPU, with each instance processing a different batch of data.

Each instance of the model is initialised with the same parameters, but during the forward pass a different batch of data is fed to each instance. The gradients from all instances are then gathered and a single gradient update is computed. Finally, the updated model parameters are broadcast back to every instance.
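As a concrete illustration, here is a minimal sketch of this workflow using PyTorch's DistributedDataParallel. The model, data and hyperparameters are toy placeholders, and the script is assumed to be launched with one process per GPU (e.g. via torchrun).

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU, launched e.g. with: torchrun --nproc_per_node=4 train.py
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Placeholder model; every replica starts from the same parameters
model = torch.nn.Linear(1024, 10).cuda(rank)
model = DDP(model, device_ids=[rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):
    # Each rank sees a different batch (random data stands in for a real loader)
    x = torch.randn(32, 1024, device=f"cuda:{rank}")
    y = torch.randint(0, 10, (32,), device=f"cuda:{rank}")

    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()      # DDP averages gradients across all replicas here
    optimizer.step()     # every replica applies the same update
    optimizer.zero_grad()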

Model parallelism

Model parallelism becomes necessary when the model doesn’t fit on a single GPU. It then becomes necessary to split the model over multiple GPUs to train it.

By splitting the model across multiple GPUs we can train a model that does not fit on any single GPU. The issue with this naive approach is that compute is used inefficiently: at any point in time only one GPU is actively working while the others sit idle. (Pipeline parallelism mitigates this by splitting each batch into micro-batches that flow through the GPUs concurrently.)
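A minimal sketch of this naive model parallelism in plain PyTorch (assuming two GPUs and toy layer sizes) is shown below; note how activations are moved between devices and how one GPU waits while the other computes.

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """A toy model split by hand across two GPUs."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        h = self.part1(x.to("cuda:0"))      # GPU 0 works, GPU 1 idles
        return self.part2(h.to("cuda:1"))   # GPU 1 works, GPU 0 idles

model = TwoGPUModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 1024)
y = torch.randint(0, 10, (32,), device="cuda:1")  # labels live where the output lives

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()      # gradients flow back across the device boundary automatically
optimizer.step()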

Enhancements

There are various advances and enhancements to the above two approaches that make training and inference more efficient, such as pipeline parallelism, tensor parallelism, and sharding of optimizer states, gradients and parameters (as in ZeRO and FSDP). The libraries below implement many of these.

Libraries

1. Megatron-LM

Megatron-LM is a library for training large transformer language models, developed by the Applied Deep Learning Research team at NVIDIA. It is aimed at research into training large language models at scale.

Megatron supports model parallelism (tensor, sequence and pipeline) and multi-node pre-training of transformer models. It currently supports BERT, GPT and T5 models.

2. DeepSpeed

DeepSpeed is a deep learning library by Microsoft. It has been used to train large models like Megatron-Turing NLG 530B & BLOOM.

DeepSpeed innovates in three areas

  • Training
  • Inference
  • Compression

DeepSpeed offers the following benefits

  • Train/Inference dense or sparse models with billions or trillions of parameters
  • Achieve excellent system throughput and efficiently scale to thousands of GPUs
  • Train/Inference on resource constrained GPU systems
  • Achieve unprecedented low latency and high throughput for inference
  • Achieve extreme compression for an unparalleled inference latency and model size reduction with low costs
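As a rough, hedged sketch of what using DeepSpeed looks like, the snippet below hands a toy PyTorch model to deepspeed.initialize with a minimal ZeRO stage 2 / fp16 configuration; all model and config values are illustrative placeholders, and real jobs are typically launched with the deepspeed launcher.

import torch
import deepspeed

model = torch.nn.Linear(1024, 10)   # placeholder model

# Minimal illustrative config: fp16 training with ZeRO stage 2 partitioning
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# The returned engine handles data parallelism, ZeRO sharding and mixed precision
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

x = torch.randn(32, 1024, device=model_engine.device, dtype=torch.half)
y = torch.randint(0, 10, (32,), device=model_engine.device)

loss = torch.nn.functional.cross_entropy(model_engine(x), y)
model_engine.backward(loss)   # engine-managed backward pass (handles loss scaling)
model_engine.step()           # engine-managed optimizer step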

3. FairScale

FairScale (by Facebook Research) is a PyTorch extension library for high performance and large scale training. The vision of FairScale is as follows:

  • Usability — Users should be able to understand and use FairScale APIs with minimum cognitive overload.
  • Modularity — Users should be able to combine multiple FairScale APIs as part of their training loop seamlessly.
  • Performance — FairScale APIs provide the best performance in terms of scaling and efficiency.

FairScale supports FullyShardedDataParallel (FSDP), which is the recommended way to scale the training of large neural networks.

FSDP workflow (figure from https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/)
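A small, hedged sketch of wrapping a toy model with FairScale's FSDP is shown below; it assumes a torchrun-style launch with one process per GPU.

import torch
import torch.distributed as dist
from fairscale.nn import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# FSDP shards parameters, gradients and optimizer state across ranks,
# gathering full parameters only while a layer is actually being computed
model = FSDP(torch.nn.Linear(1024, 10).cuda(rank))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # built after wrapping

x = torch.randn(32, 1024, device=f"cuda:{rank}")
y = torch.randint(0, 10, (32,), device=f"cuda:{rank}")

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()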

4. ParallelFormers

Parallelformers is a library based on Megatron-LM. It is integrated very well with the Hugging Face Transformers library: models from Transformers can be parallelised with a single line of code. Currently it supports only inference.

from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

# Load the model on the CPU; parallelize() takes care of moving it to the GPUs
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

# Shard the model across 2 GPUs in fp16, with detailed logging
parallelize(model, num_gpus=2, fp16=True, verbose='detail')
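After the parallelize call the model can be used much like any other Transformers model; a rough, illustrative generation call (prompt and decoding parameters are arbitrary) might look like this:

inputs = tokenizer("Parallel inference is", return_tensors="pt")
# depending on your version/setup you may need to move the inputs to the GPU yourself
outputs = model.generate(**inputs, max_length=32, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))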

5. ColossalAI

Colossal-AI provides a set of parallel components which we can use to implement our own distributed/parallel training. It implements a range of parallelisation strategies and enhancements, including data, pipeline and (multi-dimensional) tensor parallelism, sequence parallelism and ZeRO-style sharding.
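The exact API has changed across releases, but a rough sketch of the launch/initialize workflow documented around the time of writing looks like the following; the model, optimizer, criterion and data are toy placeholders, and the signatures may differ in newer versions.

import colossalai
import torch
from torch.utils.data import DataLoader, TensorDataset

# Parallel settings (tensor/pipeline parallel sizes, ZeRO, etc.) would normally go here
config = {}

# Assumes the script was started with torchrun / the colossalai launcher
colossalai.launch_from_torch(config=config)

model = torch.nn.Linear(1024, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
train_dataloader = DataLoader(
    TensorDataset(torch.randn(256, 1024), torch.randint(0, 10, (256,))), batch_size=32
)

# initialize wraps everything into an engine that applies the configured strategies
engine, train_dataloader, _, _ = colossalai.initialize(
    model, optimizer, criterion, train_dataloader
)

for x, y in train_dataloader:
    x, y = x.cuda(), y.cuda()
    engine.zero_grad()
    loss = engine.criterion(engine(x), y)
    engine.backward(loss)
    engine.step()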

6. Alpa

Alpa is a system for training and serving large-scale neural networks. Key features of Alpa are as follows

  • Automatic Parallelization. Alpa automatically parallelizes users’ single-device code on distributed clusters with data, operator, and pipeline parallelism (a rough sketch follows this list).
  • Excellent Performance. Alpa achieves linear scaling on training models with billions of parameters on distributed clusters.
  • Tight Integration with Machine Learning Ecosystem. Alpa is backed by open-source, high-performance, and production-ready libraries such as Jax, XLA, and Ray.
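As a rough sketch of the first point, the hedged example below decorates a plain single-device JAX training step with alpa.parallelize; the loss, parameters and data are toy placeholders.

import alpa
import jax
import jax.numpy as jnp

# For multi-host clusters, alpa.init(cluster="ray") would be called first

# A training step written as if for a single device; Alpa decides how to
# distribute it with data, operator and pipeline parallelism
@alpa.parallelize
def train_step(params, batch):
    def loss_fn(p):
        preds = batch["x"] @ p["w"] + p["b"]
        return jnp.mean((preds - batch["y"]) ** 2)
    grads = jax.grad(loss_fn)(params)
    # Plain SGD update, written in single-device style
    return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)

params = {"w": jnp.zeros((1024, 10)), "b": jnp.zeros((10,))}
batch = {"x": jnp.ones((32, 1024)), "y": jnp.ones((32, 10))}
params = train_step(params, batch)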

7. Hivemind

Hivemind is a library for decentralized deep learning with PyTorch across the Internet. Its intended usage is training one large model on hundreds of computers from different universities, companies, and volunteers.

Its key features are:

  • Distributed training without a master node: Distributed Hash Table allows connecting computers in a decentralized network.
  • Fault-tolerant backpropagation: forward and backward passes succeed even if some nodes are unresponsive or take too long to respond.
  • Decentralized parameter averaging: iteratively aggregate updates from multiple workers without the need to synchronize across the entire network (paper).
  • Train neural networks of arbitrary size: parts of their layers are distributed across the participants with the Decentralized Mixture-of-Experts (paper).
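A rough, hedged usage sketch based on the hivemind quickstart is shown below; the model, run name and batch-size numbers are placeholders, and in a real run several peers would join the same run_id over the network.

import torch
import hivemind

model = torch.nn.Linear(1024, 10)                 # placeholder model
base_opt = torch.optim.SGD(model.parameters(), lr=0.01)

# Start (or join) a DHT so peers can find each other without a master node
dht = hivemind.DHT(start=True)

# Wrap the local optimizer; updates are averaged with the other peers in the run
opt = hivemind.Optimizer(
    dht=dht,
    run_id="demo_run",            # peers with the same run_id train together
    optimizer=base_opt,
    batch_size_per_step=32,       # samples this peer contributes per step
    target_batch_size=4096,       # global batch size that triggers averaging
    use_local_updates=True,
)

x, y = torch.randn(32, 1024), torch.randint(0, 10, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
opt.zero_grad()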

8. OneFlow

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient. With OneFlow, it is easy to:

  • program a model with PyTorch-like API
  • scale a model to n-dimensional parallel/distributed execution with the Global View API (a small sketch follows this list)
  • accelerate/deploy a model with the Static Graph Compiler.
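A small, hedged sketch of the first two points is given below: a PyTorch-like local model, plus a global tensor split across two GPUs. The placement/sbp values assume the script is launched on two ranks and are only illustrative.

import oneflow as flow
import oneflow.nn as nn

# PyTorch-like local usage
model = nn.Linear(1024, 10)
x = flow.randn(32, 1024)
y = model(x)

# Global view: the same logical tensor, split along dim 0 across two GPUs
# (run with one process per rank, e.g. via python -m oneflow.distributed.launch)
placement = flow.placement("cuda", ranks=[0, 1])
x_global = x.to_global(placement=placement, sbp=flow.sbp.split(0))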

9. Mesh-Tensorflow

According to the GitHub page: Mesh TensorFlow (mtf) is a language for distributed deep learning, capable of specifying a broad class of distributed tensor computations. The ‘mesh’ here refers to an interconnected network of processors or compute devices.
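A tiny, hedged sketch of what “specifying distributed tensor computations” looks like in mtf, loosely adapted from the README (the dimension names and sizes are arbitrary):

import mesh_tensorflow as mtf
import tensorflow.compat.v1 as tf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Named dimensions; a separate mesh layout decides which of these get split
# across devices and which are replicated
batch_dim = mtf.Dimension("batch", 64)
io_dim = mtf.Dimension("io", 1024)
hidden_dim = mtf.Dimension("hidden", 4096)

# Bring an ordinary TF tensor into the mesh as a named-dimension tensor
x = mtf.import_tf_tensor(mesh, tf.random.normal([64, 1024]), [batch_dim, io_dim])
w = mtf.get_variable(mesh, "w", [io_dim, hidden_dim])

# einsum over named dimensions; mtf inserts any communication that is needed
h = mtf.relu(mtf.einsum([x, w], output_shape=[batch_dim, hidden_dim]))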
