Four ways to increase batch size in deep neural network training

Karina Ovchinnikova
Deelvin Machine Learning
Nov 16, 2021

In deep neural network training, the batch size, like other hyperparameters, is a value that has to be selected. The larger the batch size, the more GPU memory is required, so the maximum possible batch size is limited by the available GPU memory.

In this article, we consider several techniques that can increase the maximum possible batch size. They affect the accuracy, learning rate, and convergence of the model in different ways. In our tests we measure the average training time of one epoch, the amount of GPU memory occupied, the best accuracy reached during training, and the epoch at which it was achieved. All tests were carried out on a computer with the following configuration:

CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz (16 cores)
GPU: GeForce RTX 2080 Ti, 12 GB
Driver version 470.57.02 / CUDA 11.4.2 / CUDNN 8.2.4.15
Docker version 20.10.9
PyTorch version 1.10.0

The experiment code is available on GitHub. The experiments are based on a binary classification task (cats vs. dogs) and the ResNet-50 architecture. In each experiment, training was carried out for 15 epochs with the SGD optimizer (momentum = 0.9, learning rate = 0.001, weight decay = 1e-4).
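For reference, here is a minimal sketch of the baseline training setup described above; train_loader stands in for the cats vs. dogs data loader, which is omitted here.

import torch
import torchvision

model = torchvision.models.resnet50(num_classes=2).cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=1e-4)

for epoch in range(15):
    for images, labels in train_loader:  # cats vs. dogs DataLoader (not shown)
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()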

cuDNN auto-tuner

The cuDNN auto-tuner selects the fastest implementations of convolutional layers for the specific hardware at the beginning of training, but this requires additional memory. To disable the cuDNN auto-tuner, add the line

torch.backends.cudnn.benchmark = False

Table 1. Comparison of training with cuDNN auto-tuner enabled and disabled

The maximum batch size without using any techniques (n exp = 0) is 102. After disabling the cuDNN auto-tuner (n exp = 1), the batch size increased to 111. At the same time, the average training time for one epoch increased by almost 5 seconds.

Automatic Mixed Precision (AMP)

AMP chooses the set of operations that can safely be performed in FP16 so that training remains stable. Moving to FP16 reduces GPU memory usage and speeds up data transfers and math operations, especially on GPUs with Tensor Cores. Both a native implementation (torch.cuda.amp) and an implementation in the Apex library are available for PyTorch. Apex provides a wider variety of options; however, the documentation on the NVIDIA website says that “torch.cuda.amp is the future-proof alternative, and offers a number of advantages over Apex AMP”, so it is better to opt for the native implementation.

To use the PyTorch implementation of AMP, one needs to create a GradScaler before the training loop, run the forward pass under autocast, and then perform the backward pass and the optimizer step through the scaler.
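A minimal sketch of these changes to the training loop, reusing the names from the baseline sketch above:

scaler = torch.cuda.amp.GradScaler()      # created once, before the training loop

for images, labels in train_loader:
    images, labels = images.cuda(), labels.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # forward pass and loss in mixed precision
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()         # backward pass on the scaled loss
    scaler.step(optimizer)                # unscales gradients, skips the step if they overflow
    scaler.update()                       # adjusts the scale factor for the next iteration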

Table 2. Comparison of training with different settings for amp

The use of AMP allowed us to increase the batch size almost two-fold. Although the largest batch size was achieved using the Apex library with the “O3” setting (n exp = 5), torch.cuda.amp is the best option in terms of the average time per epoch and the achieved accuracy (n exp = 2). The Apex website also states that “O3 may not achieve the stability of the true mixed precision options O1 and O2”.

Gradient checkpointing

This approach is based on a simple idea: instead of storing the intermediate results of the computational graph, we recompute them during the backward pass. This reduces the amount of GPU memory consumed but increases the computational cost. The model is split into segments, and only the inputs of each segment are saved during the forward pass. During the backward pass, the forward computations are repeated within each segment in order to compute the gradients.

Gradient checkpointing with manual settings

PyTorch allows you to either automatically split a model (or part of a model) into a specified number of segments using torch.utils.checkpoint.checkpoint_sequential, or specify the start of each segment manually using torch.utils.checkpoint.checkpoint. It should be noted that this technique cannot be applied to models containing Dropout layers.

The easiest way to try gradient checkpointing for ResNet-50 is to apply torch.utils.checkpoint.checkpoint to each of the layers self.layer1, self.layer2, self.layer3, and self.layer4 (n exp = 6), or torch.utils.checkpoint.checkpoint_sequential to their sequence with a different number of segments (n exp = 7 and n exp = 8).
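A minimal sketch of such a manual setup for the torchvision ResNet-50; the forward pass is reimplemented outside the model, and only one of the two variants should be active at a time:

import torch
from torch.utils.checkpoint import checkpoint, checkpoint_sequential

def forward_with_checkpointing(model, x, segments=2):
    # Stem of ResNet-50: run as usual, without checkpointing
    x = model.maxpool(model.relu(model.bn1(model.conv1(x))))
    # Variant 1 (n exp = 6): checkpoint each residual stage separately
    # x = checkpoint(model.layer1, x)
    # x = checkpoint(model.layer2, x)
    # x = checkpoint(model.layer3, x)
    # x = checkpoint(model.layer4, x)
    # Variant 2 (n exp = 7, 8): split the sequence of stages into `segments` segments
    stages = torch.nn.Sequential(model.layer1, model.layer2, model.layer3, model.layer4)
    x = checkpoint_sequential(stages, segments, x)
    x = model.avgpool(x)
    return model.fc(torch.flatten(x, 1))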

Optimal gradient checkpointing

Modern deep neural networks are fairly large computational graphs, and it is difficult to determine into how many segments, and exactly how, the graph should be divided in order to maximize the reduction in GPU memory consumption. The article proposes an algorithm for finding the optimal partition of an arbitrary computational graph into segments. The authors report that for ResNet-50 they were able to reduce memory consumption by 63%. The implementation of the algorithm from the article is available on GitHub.

To use this technique, before training you need to search for the optimal segmentation of the model.

During training, the output is obtained not by calling model directly, but via run_segment:
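A rough sketch of the two steps is shown below. The exact API of the repository is not reproduced here: find_segments and run_segment are illustrative placeholders (the placeholder search simply returns the four residual stages), while the real search analyses the traced computational graph.

import torch
from torch.utils.checkpoint import checkpoint

def find_segments(model):
    # Placeholder for the optimal-checkpointing search from the repository.
    return [model.layer1, model.layer2, model.layer3, model.layer4]

segments = find_segments(model)  # step 1: run the search once, before training

def run_segment(x):
    # Step 2: forward pass through the found segments; activations inside each
    # segment are recomputed during the backward pass.
    x = model.maxpool(model.relu(model.bn1(model.conv1(x))))  # stem, kept outside
    for seg in segments:
        x = checkpoint(seg, x)
    return model.fc(torch.flatten(model.avgpool(x), 1))

outputs = run_segment(images)  # instead of outputs = model(images)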

Table 3. Comparison of training with and without gradient checkpointing

It can be seen from the table that manual adjustment gives only a small increase in the batch size, while Optimal Gradient Checkpointing (n exp = 9) allows you to increase it more than two-fold, with an insignificant slowdown of the training process.

Gradient accumulation

Gradient accumulation is one of the simplest techniques for increasing the batch size. The mini-batch size is fixed (in our case, the maximum batch size that fits into GPU memory). Loss and gradients are calculated for each mini-batch, and the gradients are accumulated until accumulation_steps mini-batches have been processed. Only then are the model weights updated and the gradients reset. As a result, the effective batch size equals accumulation_steps * mini_batch_size.
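A minimal sketch of the corresponding training loop, again reusing the names from the baseline sketch:

accumulation_steps = 4  # effective batch size = accumulation_steps * mini_batch_size

optimizer.zero_grad()
for step, (images, labels) in enumerate(train_loader):
    images, labels = images.cuda(), labels.cuda()
    loss = criterion(model(images), labels)
    # Scale the loss so that the accumulated gradient matches the average over the large batch
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # update the weights once per accumulation_steps mini-batches
        optimizer.zero_grad()  # reset the accumulated gradients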

Table 4. Comparison of training with and without Gradient accumulation

Gradient accumulation allows you to increase the batch size almost indefinitely with little effort, but the convergence process will differ from the case when the entire batch fits into GPU memory. This is due to the fact that the model may contain batch-specific operations, such as BatchNorm.

Conclusions

We have tested 4 techniques for increasing the maximum batch size. Their combined use made it possible to increase the batch size from 102 to 960. As the results presented in the tables demonstrate, an increase in the batch size affects the accuracy on the validation set and the convergence of training, so the best accuracy is achieved at different epochs. For this task, the best accuracy and speed were obtained with n exp = 3, with batch size = 199. However, we cannot claim that the settings from n exp = 3 will also give the best result for other data and network architectures.

The batch size affects the convergence process and the generalization ability of the model, that is, the difference between the accuracy of the model on the validation and test sets. The larger the batch size, the worse the generalization ability of the model (article), but the faster the training process. Therefore, in order to achieve the desired compromise between the batch size, the training speed, and the accuracy of the model, the hyperparameters have to be re-selected. It is also important to remember that when the batch size increases, you may need to change the optimizer settings or even the optimizer itself (PyTorch performance tuning guide, page 5) in order to achieve better accuracy.

This project was conducted by Deelvin. Check out our Deelvin Machine Learning blog for more articles on deep neural network training.
