Yandex Publishes YaLM 100B. It’s the Largest GPT-Like Neural Network in Open Source

Mikhail Khrushchev
Published in Yandex · Jun 23, 2022 · 12 min read

In recent years, large-scale transformer-based language models have become the pinnacle of neural networks used in NLP tasks. They grow in scale and complexity every month, but training such models requires millions of dollars, the best experts, and years of development. That’s why only major IT companies have access to this state-of-the-art technology. However, researchers and developers all over the world need access to these solutions. Without new research, growth in this field could wane. The only way to avoid this is by sharing best practices with the developer community.

We’ve been using the YaLM family of language models in our Alice voice assistant and Yandex Search for more than a year now. Today, we have made our largest YaLM model yet, with 100 billion parameters, available for free. It took us 65 days to train the model on a pool of 800 A100 graphics cards and 1.7 TB of online texts, books, and countless other sources. We have published the model and helpful materials on GitHub under the Apache 2.0 license, which permits both research and commercial use. It is currently the world’s largest GPT-like neural network freely available for English.

In this article, we’ll share not only the model, but our experience training it. You might think that with a supercomputer, training large-scale models is a piece of cake. Unfortunately, it’s not. Here, we’ll tell you how we managed to train such a huge language model and how we reduced the training time by half without compromising stability. Many of the things described below can be applied to training smaller models as well.

How to Accelerate Model Training

In the context of large-scale neural networks, a 10% gain in training speed can save you a week’s worth of runtime on a high-value cluster. Here, we’ll tell you how to more than double the training speed.

Training iterations usually consist of the following steps:

  • Prepare the batch
  • Run forward propagation: calculate the activation and loss functions
  • Run backward propagation: calculate gradients
  • Run the step stage to update the model’s weights
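For reference, here is a minimal PyTorch sketch of such an iteration; the model, data loader, optimizer, and loss function are placeholders:

```python
def train_one_epoch(model, loader, optimizer, loss_fn, device="cuda"):
    model.train()
    for inputs, targets in loader:                    # prepare the batch
        inputs, targets = inputs.to(device), targets.to(device)
        logits = model(inputs)                        # forward: activations...
        loss = loss_fn(logits, targets)               # ...and the loss
        optimizer.zero_grad()
        loss.backward()                               # backward: gradients
        optimizer.step()                              # step: update the weights
```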

Let’s look at how you can accelerate these stages.

Look for Bottlenecks

To see how your training time is used, you should use a profiler. In PyTorch, the torch.autograd.profiler module does this (see the article). Here is an example of a trace that we obtained from the profiler:

This trace was produced by a small 12-layer neural network. You can see the forward stages at the top and the backward stages at the bottom. What’s wrong with this trace? One operation takes too long: about 50% of the entire training time. It turned out that we had forgotten to change the token embedding size when copying the training configuration for our large model, which resulted in an oversized matrix multiplication at the end of the network. By reducing the embedding size, we significantly sped up training.

The profiler also helped us find more serious problems, so we recommend using it often.
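A minimal sketch of collecting such a trace with torch.autograd.profiler; the model, batch, and loss function here are placeholders:

```python
from torch.autograd import profiler

def profile_one_step(model, batch, targets, loss_fn):
    with profiler.profile(use_cuda=True) as prof:
        loss = loss_fn(model(batch), targets)
        loss.backward()
    # print the most expensive operations and dump a trace viewable in chrome://tracing
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
    prof.export_chrome_trace("trace.json")
```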

Use Fast Data Types

The first thing that affects the speed of training and inference is the data type used to store the model and run calculations. We use four data types:

  1. Single-precision format, fp32: A regular float format. It’s very accurate, but takes up four bytes, slowing down computations. This is the type your PyTorch model has by default.
  2. Half-precision format, fp16: A 16-bit data type that is much faster than fp32, consuming half as much memory.
  3. bfloat16, another 16-bit type: Compared to fp16, it provides 3 bits less for the mantissa and 3 bits more for the exponent. As a result, the format can accept broader value ranges, but it suffers from precision loss in numerical operations.
  4. TensorFloat format, tf32: A 19-bit data type that combines the exponent from bf16 and the mantissa from fp16. It consumes the same four bytes as fp32 but is much faster.

On A100 and newer graphics cards, 16-bit types are 5× faster than fp32 and 2.5× faster than tf32. If you use an A100 card, then tf32 is always used instead of fp32, unless you explicitly specify otherwise.

On older graphics cards, bf16 and tf32 aren’t supported, and fp16 is only about twice as fast as fp32. Still, that’s a huge gain in performance. It almost always makes sense to run computations in half precision or bf16, even though this approach has its drawbacks; we’ll discuss those later.
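As an illustration, this is roughly how the fast data types are enabled in PyTorch; this is a sketch, not our training code, and the tiny Linear layer and batch shape are just stand-ins:

```python
import torch

# allow TF32 for matmuls and convolutions on Ampere and newer GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model = torch.nn.Linear(1024, 1024).cuda()    # stand-in for a real network
x = torch.randn(8, 1024, device="cuda")

# run the forward pass in bf16 (use torch.float16 on pre-Ampere cards)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)
```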

Accelerating Operations on GPU

This article gives a good explanation of GPU operations and ways to accelerate them. Here, we’ll quote a couple of basic ideas from there.

Utilize the GPU Completely

First, let’s understand what executing a single CUDA kernel on the GPU looks like. The factory analogy from that article helps illustrate this:

Your GPU has a warehouse (memory), and a factory (compute). When executing a kernel, the compute requests relevant data from the memory, calculates the result, and writes it back to the memory.

What happens if your factory is running at half of its capacity?

Just like in a real factory, half of your GPU’s resources sit idle. How do you fix this during training? The simplest way is to increase the batch size.

For small models, increasing the batch size N times can increase training throughput several-fold, even though each iteration itself takes longer. For large-scale models with billions of parameters, a bigger batch size also gives a small gain.

Reduce Memory Interaction

The second idea from the article is as follows. Suppose we have three kernels that process the same data in a pipeline:

In this case, time is spent not only on computation, but also on accessing memory, and these accesses come at a cost. To reduce their number, you can fuse your kernels:

How? There are several ways:

1. Use torch.jit.script. By applying this simple decorator, you can compile the function’s code into a single kernel. In the code below, we have fused three operations: a tensor add, dropout, and another tensor add.
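A sketch of such a fused function; the name and signature are illustrative, in the spirit of the JIT-fused bias-dropout-add used in Megatron-LM:

```python
import torch
import torch.nn.functional as F

@torch.jit.script
def add_dropout_add(x: torch.Tensor, bias: torch.Tensor,
                    residual: torch.Tensor, prob: float, training: bool) -> torch.Tensor:
    # tensor add + dropout + another tensor add, fused by the TorchScript JIT
    out = F.dropout(x + bias, p=prob, training=training)
    return residual + out
```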

This approach gave us a 5% increase in training speed.

2. You can write your own CUDA kernels. That way, you can not only fuse your operations, but also optimize memory usage and avoid unnecessary operations. However, writing this code requires very specific knowledge, and developing the kernel might be too expensive.

3. Or you can use ready-made CUDA kernels. Let’s have a quick look at kernels in the Megatron-LM and DeepSpeed libraries (we use them a lot):

  • Attention softmax with a triangular mask provides 20–100% acceleration. The speed gain is especially high on small networks when you use fp32 in your computations.
  • Attention softmax with an arbitrary mask provides up to 90% acceleration.
  • Fused LayerNorm is a fused version of LayerNorm in fp32. We haven’t used it, but it should also provide a gain in speed.
  • DeepSpeed Transformers is a fully fused transformer unit. It provides acceleration, but is extremely difficult to scale up and maintain, so we don’t use it.

By using different kinds of fused kernels, we’ve sped up the training process by more than 1.5 times.

Dropouts

If you have a lot of data and the model doesn’t overfit with dropout == 0, disable dropouts! This increased our computing speed by 15%.
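One simple way to do this without touching the model definition, assuming the model uses standard nn.Dropout modules:

```python
import torch.nn as nn

def disable_dropout(model: nn.Module) -> None:
    # set the probability of every Dropout layer to zero
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = 0.0
```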

Case With Multiple GPUs

What changes when you run multiple GPUs? Now, our process looks like this:

  • Prepare the batch
  • Forward
  • Backward
  • all_reduce gradients: Average out the gradients across your graphics cards to combine their resources
  • Step: Update the model’s weights

Averaging all the gradients takes time. Each GPU must send and receive at least as many gradient values as there are parameters in your network. Let’s see how we can significantly accelerate this stage and the step stage.

Communications

How do optimal communications work? The NVIDIA NCCL library that we use works out the communication topology at initialization and lets GPUs communicate over the network without any CPU intermediaries. This ensures maximum communication speed. Here’s an NVIDIA article about this library.

In code, it looks like this:
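What follows is a sketch rather than our exact code: it uses torch.distributed with the NCCL backend, and the gradient-averaging helper and the usual torchrun environment setup are assumptions.

```python
import torch
import torch.distributed as dist

# NCCL backend: GPUs exchange tensors directly, with no CPU intermediaries
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

def average_gradients(model: torch.nn.Module) -> None:
    # after backward(): average the gradients across all workers
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
```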

NCCL communications are very fast, but even with them, the all_reduce stage still takes a lot of time. ZeRO helps us accelerate it even further.

ZeRO

ZeRO stands for Zero Redundancy Optimizer.

The left part of the picture shows standard training on multiple GPUs. In the standard scheme, every process stores all of the parameters, the optimizer states, and the averaged gradients. This costs us a lot of memory.

A high-level ZeRO flowchart is shown on the right. We assign a group of parameters to each process. For these parameters, the process always stores the values and the optimizer states, and only this process may update them. This saves a huge amount of memory that can now be spent on larger batches. However, it adds a new stage: all_gather of the weights. Before running the forward and backward stages, each process has to collect all of the network’s parameters. The complexity of the operations after the gradient calculation is now as follows:

  • all_reduce gradients: O(N), where N is the number of parameters.
  • step: O(N/P), where P is the number of processes. This is already decent acceleration.
  • all_gather parameters: O(N).

You can see that we have accelerated one stage, but at the cost of adding new, heavy operations. So how do we accelerate them? Simple: run them asynchronously!

Gather your layers asynchronously one after the other during the forward stage:

  1. We gather the first layer for all processes.
  2. While gathering the second layer, we run the forward stage for the first layer.
  3. While gathering the third layer, we run the forward stage for the second layer.

And so on until we’ve finished all the forward stages. You can speed up the backward stage almost the same way.

This produced an 80% gain in speed for our models! Even on smaller models (100M on 16 GPUs), we saw an acceleration of 40–50%. This approach requires a fairly fast network, but if you have one, you can significantly speed up training on multiple GPUs.
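We won’t reproduce our exact setup here, but parameter partitioning with communication overlap is available off the shelf in DeepSpeed’s ZeRO implementation. A rough configuration sketch, with keys as documented by DeepSpeed and values that are purely illustrative:

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)    # stand-in for a real network

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,               # partition optimizer states, gradients, and parameters
        "overlap_comm": True,     # overlap communication with forward/backward compute
        "contiguous_gradients": True,
    },
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```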

Result

We have applied four approaches to our training process:

  • We fused some of our operations: +5% speed
  • We used the attention softmax kernel with a triangular mask: +20–80%
  • We disabled dropout: +15%
  • We applied ZeRO: +80%

Not bad. Let’s move on.

Dealing With Divergence

A long iteration isn’t the only obstacle to training a really large model. It may seem that, if you have plenty of computing power, you can simply start training the model, go on vacation for two months, and have a ready-made model waiting for you when you get back. However, models of this scale are quite fragile and prone to divergence. What is divergence and how do we control it?

What Is Divergence?

Let’s say you launched your training, looked at the charts, and saw that loss is decreasing. Day one, day two, day three, still decreasing. Then, on the morning of the fourth day, you look at the loss chart and it looks like this:

Loss is higher now than it was a few hours after you started training. Moreover, the model has literally forgotten everything it knew. This is irrecoverable: days of training down the drain.

What happened?

First Observations

We noticed three things:

1. The LAMB optimizer is much less prone to divergence than Adam.

2. By reducing the learning rate, we can overcome the divergence issue. But it’s not that simple:

  • To select the lr parameter properly, you need to restart training many times.
  • Decreasing lr often slows down training: for example, in one of our runs, a twofold decrease in lr led to a 30% slowdown.

3. fp16 is more prone to divergence than fp32. This was mainly due to overflows of fp16 values in our activations and gradients: the maximum value representable in fp16 is about 65,504, and overflow results in NaN loss values.

Thermometers

One of the things that helped us keep training running for a long time was “thermometers”. We measured the maxima and minima of activations in various parts of the network, as well as the global norm of the gradients. Here is an example of thermometer values for a training run that diverged:

You can see that starting at about 14,000 iterations, the maxima of the matmul in attention began to grow sharply. This growth is the cause of the divergence. If you roll the training back to iteration 13,000 and either skip the faulty batches that caused the divergence or decrease the learning rate, you can significantly reduce the likelihood of a repeat divergence.

This approach has two downsides:

  1. It doesn’t eliminate 100% of divergence.
  2. You waste precious time rolling back the training. This is certainly better than totally discarding a divergent training, but still.
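For reference, here is a minimal sketch of how such thermometers can be attached with PyTorch forward hooks; this is an illustration, not our exact instrumentation:

```python
import torch

def attach_thermometers(model: torch.nn.Module, stats: dict) -> None:
    # record the min/max of every module's output on each forward pass
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                stats[name] = (output.min().item(), output.max().item())
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

def global_grad_norm(model: torch.nn.Module) -> float:
    # global L2 norm of all gradients: another useful thermometer
    norms = [p.grad.norm() for p in model.parameters() if p.grad is not None]
    return torch.norm(torch.stack(norms)).item() if norms else 0.0
```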

Later, we introduced some tricks that reduced the likelihood of divergence to the point where we could train a multitude of models of various sizes, including 100B.

Stabilizations. BFloat 16

BFloat 16 doesn’t overflow even with large gradient and activation values, which makes it a good choice for storing weights and running computations. Unfortunately, it isn’t precise enough, so errors can accumulate across arithmetic operations, slowing down training or causing yet another kind of divergence.

To compensate for this, we compute the following layers and operations in tf32 (or fp32 on older graphics cards):

  • Softmax in attention (here’s where our kernels came in handy), softmax on tokens before the loss function.
  • All LayerNorm functions.
  • All operations with residual connections: this is how we avoided accumulating errors as the signal travels deeper into the network.
  • all_reduce of gradients that we mentioned earlier.

All these stabilizations slowed down training by only 2%.
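As an illustration of what running an operation in fp32 looks like in practice, here is a minimal sketch of an fp32 LayerNorm and an fp32 residual add for a bf16 model; the class names are illustrative, not our exact modules:

```python
import torch
import torch.nn.functional as F

class FP32LayerNorm(torch.nn.LayerNorm):
    # run LayerNorm in fp32 even when the rest of the model is in bf16
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.layer_norm(x.float(), self.normalized_shape,
                           self.weight.float(), self.bias.float(), self.eps)
        return out.to(x.dtype)

class FP32Residual(torch.nn.Module):
    # add the residual branch in fp32 to avoid accumulating bf16 rounding errors
    def __init__(self, block: torch.nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x.float() + self.block(x).float()).to(x.dtype)
```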

Stabilizations. LayerNorm

The articles about BERT and GPT used the approach that is now known as post-LayerNorm (left in the picture). However, in terms of stability and convergence rate on large models, pre-LayerNorm (right in the picture) performed admirably. So, for our models, we use pre-LayerNorm.

BigScience enlightened us about an unexpected stabilization option: by introducing layernorm at the very beginning of the network, after embeddings, you can also significantly decrease the likelihood of collapse.
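Here is a schematic pre-LayerNorm block in PyTorch, together with the extra LayerNorm after the embeddings; the sizes are illustrative, and the causal attention mask is omitted for brevity:

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    # pre-LayerNorm: normalize the inputs of attention and MLP, keep the residual path clean
    def __init__(self, hidden: int, heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))

# the extra stabilization: a LayerNorm right after the token embeddings
embeddings = nn.Sequential(nn.Embedding(50_000, 1024), nn.LayerNorm(1024))
```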

Stabilizations. Curriculum Learning

We also adopted the approach from the Curriculum learning article. We ultimately want to train the network with a large batch size and a long sequence length, but we start with a small batch and short sequences, then gradually increase both as training progresses.

This approach has two benefits:

  1. The loss function drops quite quickly at the very beginning, regardless of how many tokens the model sees at each iteration. Since we reduce the amount of computation at the start of training, we get through this initial stage much faster.
  2. The authors of the article claim that this approach makes training more stable.
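A sketch of such a schedule; the linear ramp and the specific numbers are illustrative, not the schedule we actually used:

```python
def curriculum(step: int, total_steps: int,
               min_seq: int = 64, max_seq: int = 2048,
               min_batch: int = 32, max_batch: int = 1024):
    # linearly grow sequence length and batch size over the first 10% of training
    warmup = max(1, total_steps // 10)
    frac = min(1.0, step / warmup)
    seq_len = int(min_seq + frac * (max_seq - min_seq))
    batch_size = int(min_batch + frac * (max_batch - min_batch))
    return seq_len, batch_size
```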

Stabilizations. Summary

We have implemented the following approaches:

  1. We adopted bf16 as the main type for weights
  2. We ran precision-critical computations in tf32
  3. We introduced pre-LayerNorm
  4. We put LayerNorm immediately after embeddings
  5. We used Curriculum learning

As a result, we have been training models of various sizes without divergence for more than six months. These stabilizations helped us train a model with 100 billion parameters, which we are now happy to share with the developer and research community.
