PyTorch performance tuning in action

Denis Ryabokon
Deelvin Machine Learning
12 min read · Sep 21, 2021

PyTorch is a Machine Learning (ML) framework whose popularity is growing fast among deep learning researchers and engineers. One of its key advantages is access to a wide range of tools for training different types of networks: fully connected, convolutional, recurrent, and others. Yet despite the convenience of this tool, there remains the challenge of accelerating neural network training. Reducing training time makes it possible to run more experiments in a given amount of time and, potentially, to obtain a better model.

Recently, NVIDIA engineers released the PyTorch Performance Tuning Guide video, in which they discuss several techniques that can accelerate a PyTorch training pipeline. In our research, we decided to test the proposed techniques on our own training pipeline and to evaluate which of them lead to acceleration and by how much.

For the experiments, we chose a training pipeline for Unet with a resnet34 backbone that solves a binary semantic segmentation task on the Supervisely Person Segmentation Dataset. For each technique, we provide a brief description, the measurement results on our pipeline, and the acceleration it enables.

The experimental methodology is as follows:

  1. In each experiment, we add a new technique to the best pipeline configuration.
  2. The result of each experiment is the average time of the training epoch over 30 epochs.
  3. If adding a technique decreases the average training epoch time, we add it to the pipeline configuration and consider the result the new best. Otherwise, the tested technique is not added to the best pipeline configuration and we move on to the next experiment.

All experiments were conducted on a computer with the following configuration:

  • CPU: Intel(R) Core(TM) i9–9900K CPU @ 3.60GHz (16 cores)
  • GPU: 2 x GeForce RTX 2070 SUPER (8 GB each)
  • Disk: SSD 500 GB
  • NVIDIA config: Driver version 455.32.00 / CUDA 11.1 / cuDNN 8.0.8
  • Docker version 20.10.6
  • PyTorch version 1.8.0

The code for the experiments is available on GitHub.

So, let’s get started!

Experiment 1: Enable async data loading

The first technique discussed in the video is setting num_workers>0 and pin_memory=True in the PyTorch DataLoader. According to the PyTorch documentation:

  • If num_workers=0 (the default value), data loading and augmentation run synchronously with training in the main process. As a result, the main training process has to wait for the data to become available before continuing execution.
  • Setting num_workers>0 enables data loading and augmentation to run asynchronously in separate worker processes. This allows the data to be prepared in advance so that the training process does not wait for it. The best value of num_workers depends on where the training data is stored and on the CPU and GPU characteristics.
  • Setting pin_memory=True can accelerate copying from host to GPU memory. The default value is False.
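
As a minimal sketch (with a toy in-memory dataset standing in for the segmentation dataset used in our pipeline), the relevant DataLoader settings look like this:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real segmentation dataset.
images = torch.randn(256, 3, 256, 256)
masks = torch.randint(0, 2, (256, 1, 256, 256)).float()
dataset = TensorDataset(images, masks)

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=6,    # load and augment batches in 6 background worker processes
    pin_memory=True,  # page-locked host memory speeds up host-to-GPU copies
)

for batch_images, batch_masks in loader:
    # non_blocking=True only helps when the source tensors are pinned
    batch_images = batch_images.cuda(non_blocking=True)
    batch_masks = batch_masks.cuda(non_blocking=True)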

In our experiment, the average time (and std) of the training epoch was measured for different combinations of num_workers (from 0 to 16 with the interval 2) and pin_memory (True or False). The results are presented in Figure 1.

Figure 1. Average training epoch time for different values of num_workers and pin_memory params.

Figure 1 shows that increasing num_workers reduces the average training time for one epoch. At the same time, changing the pin_memory parameter from False to True with a fixed num_workers does not significantly affect training time.

With the baseline pipeline configuration (num_workers=0, pin_memory=False), the average epoch time was 279.064 seconds. In our experiment, the average training epoch time was reduced to 61.9682 seconds with num_workers=6 and pin_memory=True.

Experiment 2: Disable debugging API

PyTorch provides a rich debugging API, which the authors recommend disabling before the final training run to accelerate the pipeline. It is important to note that the debugging API is disabled by default, which you can verify with the following calls:

import torch

torch.is_anomaly_enabled()          # False
torch.autograd._profiler_enabled()  # False

Since our code does not use the debugging API and it is disabled by default, this technique does not accelerate our pipeline.

Nevertheless, to demonstrate that the debugging API slows down the pipeline, we conducted an experiment in which anomaly detection was added to the best pipeline configuration from Experiment 1. The measurement results are shown in Table 1.
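
Anomaly detection can be turned on globally for the whole training run, for example:

import torch

# Enable autograd anomaly detection (adds extra checks and slows training down)
torch.autograd.set_detect_anomaly(True)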

Table 1. Average training epoch time for a pipeline with anomaly detection and without it.

Table 1 shows that the average epoch time for the pipeline with anomaly detection enabled is 7.5 seconds longer than for the same pipeline with anomaly detection disabled. This confirms that disabling the debugging API for the final training leads to acceleration.

Experiment 3: Set bias to False for convolutions followed by a batch normalization layer

PyTorch provides convolutional layers (1d, 2d, and 3d) with the default argument bias=True. However, if a BatchNorm layer immediately follows the convolution, the bias is not needed and can be set to bias=False: the first operation in BatchNorm is subtraction of the mean, which cancels out the bias. Besides, bias=True uses additional GPU kernels for its computations.
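
A minimal example of such a block:

import torch.nn as nn

# BatchNorm subtracts the per-channel mean right after the convolution,
# so any bias added by the convolution is cancelled and can be dropped.
block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)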

In our Unet experiments, the encoder uses a pretrained resnet34, in which bias=False is already set for such convolutional layers. The decoder contains 12 convolutional layers for which bias=True was used in the previous experiments. In this experiment, we change the bias of these 12 layers from True to False and compare the result with the best time obtained in Experiment 1.

Table 2. Average training epoch time for Unet with bias=True and bias=False for convolutional layers followed by batch normalization.

According to Table 2, this technique gives a small gain in the average training epoch time of about 0.7046 seconds. Most likely, this is because the decoder has only 12 convolutional layers followed by batch normalization, while the backbone already uses bias=False. This technique would probably give a greater acceleration in a network with more convolutional layers whose bias can be switched from True to False.

Experiment 4: Change the way to zero out gradients

According to the video, the traditional zeroing of gradients via `optimizer.zero_grad()` or `model.zero_grad()` is simple to implement but has several disadvantages:

  • This way of zeroing gradients launches a separate CUDA kernel for each parameter, which is inefficient.
  • The backward pass then updates gradients with the “+=” operator, which first reads the existing (zeroed) values, and this read is unnecessary.

NVIDIA engineers proposed a more efficient option that avoids the above disadvantages:

...
for param in model.parameters():
    param.grad = None
...
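
As a side note, recent PyTorch versions (1.7+, if I recall correctly) expose the same behavior through a flag on the optimizer, so a sketch of the equivalent call would be:

optimizer.zero_grad(set_to_none=True)  # sets .grad to None instead of filling it with zeros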

We ran an experiment in which we compared the average training epoch time depending on how the gradients were zeroed. The results of the experiment are shown in Table 3.

Table 3. Average training epoch time for traditional gradient zeroing and the technique proposed by NVIDIA engineers.

Table 3 shows that switching from the traditional method to the proposed one did not accelerate the current pipeline: the average epoch time increased by 0.9072 seconds. At the same time, we noticed a peculiarity on the memory consumption graph: with the approach proposed by NVIDIA engineers, memory consumption increased to 98.822%. The GPU memory consumption for both gradient zeroing methods is shown in Figure 2.

Figure 2. GPU memory usage of the training pipeline for both ways of zeroing out gradients.

As a result, it turned out that this technique increases memory consumption and does not provide acceleration for the training pipeline.

Experiment 5: Gradient checkpointing + increase batch size

The next technique to be tested is gradient checkpointing. Its main idea is to reduce GPU memory consumption by:

  • Saving the results of only a part of the operations during the forward pass, rather than all of them.
  • Performing extra computations during the backward pass whenever a value that was not saved is needed.

We want to use this technique to increase the batch size for the model and check whether a larger batch size can reduce the average epoch time despite the extra computations during the backward pass.

PyTorch provides two functions for gradient checkpointing:

  1. `torch.utils.checkpoint.checkpoint_sequential` — performs gradient checkpointing for a sequential model; you only need to specify the model, the number of segments, and the model input.
  2. `torch.utils.checkpoint.checkpoint` — gives the user more control over where checkpointing is done, but this makes the implementation more complicated.
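
A minimal sketch of the first option, with a toy sequential model standing in for an encoder block:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy sequential model standing in for an encoder block.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
)

x = torch.randn(4, 3, 256, 256, requires_grad=True)

# Split the model into 3 segments: only activations at segment boundaries are stored,
# everything in between is recomputed during the backward pass.
out = checkpoint_sequential(model, 3, x)
out.mean().backward()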

Finding an optimal (or close to optimal) partition is a non-trivial task that deserves a separate article. For this reason, we used the simplest approach and tried to apply checkpoint_sequential to different encoder/decoder blocks with different numbers of segments. The best result achieved for our architecture with this method was a reduction in memory consumption from 6576 MB to 6057 MB, which made it possible to increase the batch size by only 1. To free more memory for the current Unet configuration, one probably needs to use `torch.utils.checkpoint.checkpoint` and more detailed research, which is beyond the scope of this article.

Table 4 compares the average training epoch time of the pipeline from Experiment 3 without gradient checkpointing at batch sizes 16 and 24 (the maximum batch size for that configuration) with the average training epoch time with gradient checkpointing at batch size 25 (the maximum batch size for that configuration). We also measured the average training epoch time without gradient checkpointing for batch sizes 8 and 12.

Table 4. Average training epoch time of different batch sizes with gradient checkpointing and without it.

According to Table 4, increasing the batch size for our pipeline configuration increases the average training epoch time. This could be due to the upsampling in Unet, which is implemented via `torch.nn.functional.interpolate`, for which the time to process a batch of size n is roughly n times the time for one sample. Thus, increasing the batch size did not pay off because of the architecture of the network used in the experiments. It should also be noted that for batch sizes below 16 the average training epoch time increases, so batch size 16 is optimal in terms of speed, even though it utilizes only 84% of the GPU memory (with 100% utilization of the cores).

The question of whether gradient checkpointing can increase the batch size enough to yield acceleration, despite the extra computation during the backward pass, remains open and requires more in-depth research.

Experiment 6: Enable cuDNN auto-tuner

Enabling the cuDNN auto-tuner launches a short benchmark that selects the fastest convolution implementation for the specific hardware. Using the cuDNN auto-tuner can affect the reproducibility of the experiments, since different convolution implementations can be chosen on different runs even on the same hardware.

To enable cuDNN auto-tuner in PyTorch, before the training loop, add the following line:

torch.backends.cudnn.benchmark = True

We ran an experiment comparing the average training epoch time for a pipeline with and without cuDNN auto-tuner enabled. The measurement results are shown in Table 5.

Table 5. Average training epoch time for pipeline with cuDNN auto-tuner and without it.

Table 5 shows that using the cuDNN auto-tuner reduces the average training epoch time by 5.178 seconds. In addition, it is worth noting that the cuDNN auto-tuner required additional GPU memory, as Figure 3 demonstrates.

Figure 3. GPU memory usage for training pipeline with cuDNN auto-tuner enabled and without it.

Experiment 7: Enable mixed precision training

Mixed precision training is a powerful technique that reduces the memory needed for training neural networks (so heavier architectures can be trained), speeds up data transfer operations, and performs math faster on Tensor Cores (supported by NVIDIA GPUs based on the Volta, Turing, and Ampere architectures). All of these improvements come from performing most operations in FP16 while keeping a minimal amount of information in FP32, with the same accuracy as pure FP32 training.

In our experiment, we took the configuration from Experiment 6 and added mixed precision training using the NVIDIA apex library, which provides two implementations of this technique (opt levels O1 and O2). Table 6 compares the average training epoch time with and without the mixed precision implementations.
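
A minimal sketch of the apex-based setup (assuming model, optimizer, criterion, and loader are already constructed elsewhere in the pipeline and the model is on the GPU):

from apex import amp

# Wrap the model and optimizer; opt_level can be "O1" or "O2"
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

for images, masks in loader:
    optimizer.zero_grad()
    loss = criterion(model(images.cuda()), masks.cuda())
    # Scale the loss so FP16 gradients do not underflow
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()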

Table 6. Average training epoch time with mixed precision training and without it.

From the results presented in Table 6, it can be seen that mixed precision training decreased the average training epoch time by 8.4093 seconds (for the O2 implementation).

At the same time, the GPU memory consumption (Figure 4) and GPU kernel utilization (Figure 5) plots show that only about 50% of the memory is used and the average core utilization is about 70%.

Figure 4. GPU memory usage for mixed precision training, batch size = 16.
Figure 5. GPU kernel utilization for mixed precision training, batch size = 16.

Low GPU kernel utilization and memory usage signal that not all resources are being used and the pipeline can be accelerated further. Therefore, we increased the batch size from 16 to 32 and the number of workers from 6 to 12. The measurement results for the new configuration are shown in Table 7.

Table 7. Average training epoch time using both mixed precision implementations, batch size = 32 and num workers = 12.

As a result of using mixed precision training, increasing the batch size to 32 and num_workers to 12, we managed to reduce the average training epoch time by 16.7267 seconds.

Experiment 8: Use FusedAdam instead of Adam

The apex library offers not only automatic mixed precision functionality but also a number of optimized, reusable building blocks, among them the FusedAdam, FusedSGD, and other optimizers. These optimizers have the same functionality as the standard ones from `torch.optim`, but their implementation is faster thanks to fusing the optimizer update into a single CUDA kernel.
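
As a sketch, FusedAdam is essentially a drop-in replacement (assuming apex is installed and the model parameters already live on the GPU):

from apex.optimizers import FusedAdam

# Same interface as torch.optim.Adam, but the update step runs as a single fused CUDA kernel
optimizer = FusedAdam(model.parameters(), lr=1e-3)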

In our experiment, we compare the average training epoch time of the pipeline with the standard Adam optimizer and with the FusedAdam optimizer. The comparison results are shown in Table 8.

Table 8. Average training epoch time using Adam and FusedAdam optimizers.

Table 8 shows that replacing Adam with FusedAdam reduced the average epoch time by only 0.5812 seconds.

Experiment 9: Enable distributed training

If you have multiple GPUs, you can take advantage of distributed training. There are two approaches to it: model parallelism and data parallelism. As demonstrated in one of our previous posts, for the Unet configuration used in the experiments, data parallelism gives greater acceleration than model parallelism compared with single-GPU training. Therefore, based on this result, we only consider data parallelism in our experiment.

With data parallelism, the dataset is divided into N equal parts, where N is the number of GPUs involved in training. Each GPU stores a copy of the model, which is trained on its part of the dataset, and gradients are exchanged between the copies at each training iteration. Ideally, because each copy of the model is trained on its own part of the data, the epoch time can decrease by a factor of N, but there is a bottleneck: the transfer of gradients between GPUs. The rate of this exchange depends both on the size of the gradients (i.e. on the network architecture) and on the speed of the channel between the GPUs.

The computer used for the experiments has 2 GPUs installed, so network transmission delays do not play a role. Table 9 shows the results of switching to distributed training with `DistributedDataParallel` for the pipeline configuration from Experiment 8.
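
A minimal per-process sketch of the data-parallel setup (assuming one process is launched per GPU by a launcher that sets the usual distributed environment variables, including LOCAL_RANK; build_model and train_dataset are placeholders for the pipeline's own code):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")        # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)          # placeholder for the Unet constructor
model = DDP(model, device_ids=[local_rank])     # gradients are all-reduced across GPUs

sampler = DistributedSampler(train_dataset)     # each process sees its own shard of the data
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler,
                    num_workers=12, pin_memory=True)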

Table 9. Average training epoch time with `DistributedDataParallel` on 2 GPUs and without it.

According to Table 9, switching to `DistributedDataParallel` reduced the average training epoch time by 3.5055 seconds, which is significantly less than the expected improvement (an average epoch time of roughly 20 seconds). This result can be explained by the slow transfer of gradients between GPUs. There are other approaches to accelerating distributed training, but they are not discussed in this article.

Summary

Figure 6. Average training epoch time as each technique is added on top of the previous ones. The numbers at the top show by how many times the training epoch time decreased after adding the corresponding technique.

To sum up, in our experiments we were able to reduce the average training epoch time by a factor of 7.912 relative to the baseline.

Figure 6 shows how each of the techniques contributed to the overall acceleration of the training pipeline:

  • The greatest contribution to acceleration came from setting num_workers and from switching to mixed precision training. At the same time, for a fixed num_workers, different values of the pin_memory parameter made no noticeable difference in speed.
  • Enabling the cuDNN auto-tuner and switching to distributed training also had a positive effect, although distributed training gave less acceleration than expected.
  • Switching to FusedAdam and setting bias to False for convolutions followed by a batch normalization layer brought minimal acceleration. The slight speedup from the second technique can be explained by the small number (12) of convolutional layers to which it could be applied.

The following techniques failed to offer improvement for various reasons:

  • Disabling the debugging API did not lead to acceleration because it was not used in the pipeline and is disabled by default.
  • Zeroing gradients with param.grad = None instead of optimizer.zero_grad() gave no speedup and increased memory consumption.
  • Gradient checkpointing combined with a larger batch size requires deeper research.

It is worth noting that, apart from gradient checkpointing, the alternative way of zeroing gradients was the only technique that gave no speed boost. Among the techniques that did help, async data loading, the cuDNN auto-tuner, and mixed precision training stand out as ones that should give a noticeable speedup in almost any pipeline. The speedup from the other techniques depends largely on the pipeline and network architecture, as well as on the speed of data exchange between GPUs (specifically for distributed training).

That’s all, I hope the results of my experiments will be useful to you. Feel free to ask questions and stay tuned for new posts on our Deelvin ML blog.
