How Parallelization and Large Batch Size Improve the Performance of Deep Neural Networks

Ankit Kumar
13 min read · Nov 13, 2021


Large batch sizes had until recently been viewed as a deterrent to good accuracy. However, recent studies show that increasing the batch size can significantly reduce training time while maintaining a considerable level of accuracy. In this blog, we draw on our inferences from four such technical papers.

1. Training ResNet-50 on ImageNet in 15 Mins

In order to optimize the learning algorithm and ensure that accuracy does not drop with large mini-batches, the following techniques were used:

Technical Details

A. RMSprop Warm-up:

The RMSprop warm-up phase is used to address the optimization difficulty at the start of training. The update rule shown below utilizes both Stochastic Gradient Descent (SGD) and the RMSprop optimization algorithm.

Update Rule with SGD and RMSprop

The network uses both SGD and RMSprop during training: training starts as pure RMSprop and then gradually transitions into pure SGD.

Transition Rule from RMSprop to SGD

As seen in the equation above, the transition to SGD is exponential, then linear, and then constant (at a value of one, indicating pure SGD). This smooth, ELU-like transition function ensures that moving from RMSprop to SGD does not negatively impact the training of the network.
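To make the blending concrete, here is a minimal NumPy sketch of the idea. The breakpoints, hyperparameter values, and function names are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def transition_coeff(t, t_exp=100, t_lin=200):
    """Blend factor in [0, 1]: exponential rise, then linear rise, then
    constant at 1 (pure SGD). Breakpoints t_exp/t_lin are hypothetical."""
    if t < t_exp:
        return 1.0 - np.exp(-t / t_exp)                 # exponential phase
    if t < t_lin:
        a0 = 1.0 - np.exp(-1.0)                         # value where the exponential phase ends
        return a0 + (1.0 - a0) * (t - t_exp) / (t_lin - t_exp)  # linear phase
    return 1.0                                          # constant phase: pure SGD

def blended_step(w, g, state, t, lr_sgd=0.1, lr_rms=3e-4, rho=0.99, eps=1e-8):
    """One parameter update that interpolates between an RMSprop step and a
    plain SGD step according to the transition coefficient."""
    state["v"] = rho * state["v"] + (1 - rho) * g * g   # RMSprop second moment
    rms_step = lr_rms * g / (np.sqrt(state["v"]) + eps)
    sgd_step = lr_sgd * g
    a = transition_coeff(t)
    return w - ((1 - a) * rms_step + a * sgd_step)

# Example: a single update on a toy parameter vector at step t=50.
w = np.zeros(4)
state = {"v": np.zeros(4)}
w = blended_step(w, np.ones(4), state, t=50)
```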

B. Slow-Start Learning Rate Schedule

Optimization issues are more frequent in the early part of training because the state of the model is changing rapidly, so a small learning rate is used for a long initial phase to address this. The learning rate for RMSprop (used during the initial part of training) is kept very small at 0.0003, while the learning rate of SGD (used during the later part of training) is reduced after a set number of epochs.
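A step-decay schedule of this kind might look like the following sketch; the milestone epochs and decay factor are assumptions for illustration, not the paper's exact values:

```python
def sgd_lr(epoch, base_lr=0.1, milestones=(30, 60, 80), gamma=0.1):
    """Step decay for the SGD phase: multiply the LR by gamma at each
    milestone epoch. Milestones and gamma are illustrative choices."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

RMSPROP_LR = 3e-4  # kept small and constant during the slow-start phase
```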

C. Batch Normalization without Moving Averages

An interesting observation presented in this paper is that with large mini-batch sizes, the moving averages of the mean and variance are not an accurate representation of the actual mean and variance. To overcome this, only the last mini-batch is considered instead of moving averages, and all-reduce communication is used to average the statistics across all the workers in the data-parallel setup.
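The cross-worker averaging can be sketched as follows. Here `allreduce_mean` is a hypothetical stand-in for an NCCL all-reduce, and averaging per-worker variances directly is a simplification that ignores differences between worker means:

```python
import numpy as np

def allreduce_mean(tensors):
    """Stand-in for an all-reduce: every worker ends up with the average."""
    return sum(tensors) / len(tensors)

def synced_batch_stats(worker_batches):
    """Batch-norm statistics from the current mini-batch only: each worker's
    mean/variance is averaged across workers instead of a moving average."""
    means = [b.mean(axis=0) for b in worker_batches]
    variances = [b.var(axis=0) for b in worker_batches]  # simplification, see above
    return allreduce_mean(means), allreduce_mean(variances)

# Example: 4 workers, each holding a local mini-batch of 8 samples x 3 features.
batches = [np.random.randn(8, 3) for _ in range(4)]
mean, var = synced_batch_stats(batches)
```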

Apart from these key optimizations to the learning algorithm, the following combination of software and hardware was used to perform the experiment:

Chainer — Open source deep-learning framework

NCCL — NVIDIA Collective Communications Library, for multi-node GPU communication

Half-Precision Floating Point — to reduce the communication overhead during the all-reduce phase (illustrated in the sketch below)

1024 NVIDIA Tesla P100 GPUs — for distributed training
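The half-precision trick is simple to illustrate with a hypothetical NumPy simulation: gradients are cast to float16 before communication (halving the bytes on the wire) and cast back afterwards. A real system would do this inside NCCL rather than in NumPy:

```python
import numpy as np

# One gradient tensor per worker (toy sizes).
grads = [np.random.randn(1000).astype(np.float32) for _ in range(4)]

# Cast to float16 before communication: half the bytes on the wire.
compressed = [g.astype(np.float16) for g in grads]

# The all-reduce (here a plain sum) runs on the compressed tensors,
# and the result is cast back to float32 for the weight update.
averaged = sum(c.astype(np.float32) for c in compressed) / len(compressed)
```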

Experimental Results

Experimental Result of training ResNet-50 with 32k Mini-Batch

As the table above demonstrates, ResNet-50 was trained for 90 epochs in under 15 minutes with a mini-batch size of 32k, reaching an accuracy of 74.9%. This accuracy is comparable to previously benchmarked results while taking considerably less time to train.

Observations and Insights

The key takeaway from this paper is that it is feasible to train deep neural networks with extremely large batch sizes and still maintain competitive accuracy while significantly reducing training time. The paper also demonstrates how the learning algorithm can be optimized and used in conjunction with distributed training to achieve this level of performance.

2. Image Classification at Supercomputer Scale

This paper can be viewed as an extension of the previous paper we discussed. In it, the authors propose system-related optimizations to train ResNet-50 on ImageNet in under 2.2 minutes with no drop in accuracy.

Technical Details

  1. TPU Pod: A TPU pod is a multi-chip configuration: a TPU v2 pod contains 256 chips, while a TPU v3 pod contains 1024 chips. The key point here is that increasing the number of chips in a pod also increases its mixed-precision throughput in petaFLOPS.
  2. Mixed Precision: The authors used bfloat16 (a 16-bit floating-point format) to perform convolution operations. This enables higher throughput with minimal loss in accuracy.
  3. Learning Rate Schedule: A linearly scaled learning rate is used (the learning rate scales linearly with the batch size; see the sketch after this list). This linear learning rate scaling was proposed by Alex Krizhevsky of Google in 2014.
  4. Layer-Wise Adaptive Rate Scaling (LARS): LARS uses different learning rates for different layers based on the norm of the weights and the norm of the gradients. This allowed the batch size to be scaled up from 8192 (the limit with linear learning rate scaling alone) to 32k.
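The linear scaling rule itself is a one-liner; the base values below follow the common convention of LR 0.1 at batch size 256:

```python
def linearly_scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: grow the learning rate in proportion to the batch size."""
    return base_lr * batch_size / base_batch

print(linearly_scaled_lr(8192))  # 3.2 -- the batch-8192 base LR cited later in this post
```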

Input Pipeline Optimization

The throughput of the overall network is dependent on the throughput of the input pipeline and hence it is equally important to optimize the input pipeline. The authors of this paper present the following optimizations to the input pipeline:

  1. Dataset Sharding and Caching: This is essentially data parallelization, in which the dataset is distributed across multiple learners. In practice it is not feasible to store all of the data on a single machine (or host), so data parallelization is used.
  2. Prefetch: Prefetch allows the data for the next batch to be processed by the input pipeline concurrently, while the model pipeline is training on the current batch.
  3. Fused JPEG Decode: For the purpose of dataset augmentation (to reduce variance), only the relevant part(s) of the input image are decoded, reducing computation overhead.
  4. Parallel Data Parsing: This extends the first optimization (dataset sharding and caching): a multi-core CPU parallelizes the data-parsing operation across several CPU threads. A sketch combining all four optimizations follows this list.
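Here is a minimal tf.data sketch that combines the four optimizations; the file pattern, record schema, and fixed crop window are illustrative assumptions rather than the paper's actual pipeline:

```python
import tensorflow as tf

def parse_and_decode(record):
    # Hypothetical record schema with a single JPEG-encoded image feature.
    features = tf.io.parse_single_example(
        record, {"image": tf.io.FixedLenFeature([], tf.string)})
    # Fused decode-and-crop: only the cropped region of the JPEG is decoded.
    # A real augmentation pipeline would sample the crop window per image.
    return tf.image.decode_and_crop_jpeg(
        features["image"], crop_window=[0, 0, 224, 224], channels=3)

def make_input_pipeline(file_pattern, batch_size, num_workers, worker_index):
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    ds = files.shard(num_workers, worker_index)                  # 1. sharding
    ds = ds.interleave(tf.data.TFRecordDataset,
                       num_parallel_calls=tf.data.AUTOTUNE)      # 4. parallel parsing
    ds = ds.cache()                                              # 1. caching
    ds = ds.shuffle(10_000)
    ds = ds.map(parse_and_decode,
                num_parallel_calls=tf.data.AUTOTUNE)             # 3. fused decode
    ds = ds.batch(batch_size, drop_remainder=True)
    return ds.prefetch(tf.data.AUTOTUNE)                         # 2. prefetch
```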

Experimental Results

Validation Accuracy vs Effective Batch Size

It was observed that, while using Batch Normalization, the effective batch size per worker (in data parallelization) has a key impact on validation accuracy. If the effective batch size falls below 32, there is a significant reduction in validation accuracy, as shown in the graph above. For this experiment, the effective batch size per replica was 64.

Effect of input pipeline optimizations

The graph above shows the effect of the input pipeline optimizations (caching, prefetch, fused JPEG decoding, and data parallelization) through controlled additions and controlled ablations (subtractions). As we observe from the graph, data parallelization is the most important optimization for the input pipeline: removing it reduces throughput by almost 50%!

Final Performance Evaluation

Using the system and algorithm-related optimizations discussed above, this table shows the final performance of the ResNet-50 model. The authors were able to train ResNet-50 on ImageNet to a validation accuracy of 76.3% in just under 2.2 minutes. We observe that changing the optimizer from RMSprop+SGD to LARS leads to a significant reduction in training time while maintaining accuracy.

Observations and Insights

This paper demonstrated significant system- and input-pipeline-related optimizations to reduce the training time of deep neural networks. We also observed how using LARS as the optimizer can significantly reduce training time without compromising accuracy.

3. Does the training set batch size affect the performance of Convolutional Neural Networks?

In this section, I will try to answer the above question. One of the most critical parts of deep learning is finding accurate hyperparameters, and here we discuss the effects of batch size on the performance of Convolutional Neural Networks.

I would like to discuss a research paper that gives us deep insight into the relationship between training set batch size and the performance of CNNs.

Impact of Training Set Batch Size on the Performance of Convolutional Neural Networks for Diverse Datasets

MNIST Dataset

The benchmark image classification problem is conducted on the MNIST and CIFAR-10 datasets to estimate network training performance. The MNIST dataset consists of a database of handwritten digits from zero to nine. The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 different classes.

CIFAR-10 Dataset

Drawing on related work, two sequences of batch size values were selected: powers of two and multiples of ten.

In addition, a different CNN architecture was applied to each dataset. The first task was to examine the influence of batch size on the MNIST dataset, for which a well-known CNN architecture, LeNet, was used.

LeNet

To conduct the testing on the CIFAR-10 dataset, a neural network with five convolutional layers was used, with normalization layers added as well. The selected models were implemented using the machine learning framework TensorFlow v1.3.0.

The models are trained using SGD with a learning rate of 0.001 and 0.0001 for the MNIST and CIFAR-10 datasets, respectively. The performance was evaluated as an average over 5,000 and 10,000 training iterations for the MNIST and CIFAR-10 datasets, respectively.
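A setup like this is easy to reproduce. Below is a hedged Keras sketch of a LeNet-style model trained on MNIST with SGD at the stated learning rate, sweeping a few power-of-two batch sizes; the exact layer details and single-epoch training budget are my simplifications, not the paper's configuration:

```python
import tensorflow as tf

def lenet():
    """LeNet-style CNN for 28x28 grayscale MNIST digits (a close variant,
    not necessarily the exact architecture used in the paper)."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(6, 5, activation="tanh", padding="same",
                               input_shape=(28, 28, 1)),
        tf.keras.layers.AveragePooling2D(),
        tf.keras.layers.Conv2D(16, 5, activation="tanh"),
        tf.keras.layers.AveragePooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120, activation="tanh"),
        tf.keras.layers.Dense(84, activation="tanh"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0

for batch_size in (16, 64, 256, 1024):   # a subset of the paper's power-of-two sweep
    model = lenet()
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=1, verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"batch={batch_size:5d}  test_acc={acc:.4f}")
```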

Fig.1 The testing accuracy of the trained CNN on the MNIST dataset

From Fig. 1 we can conclude that the larger the batch size, the smoother the curve. The lowest and noisiest curve corresponds to a batch size of 16, whereas the smoothest curve corresponds to a batch size of 1024.

Fig.2 The testing accuracy of the trained CNN on the CIFAR-10 dataset

From Fig. 2 we can observe that the smoothness of the curve is approximately the same for all batch size values. The lowest curve corresponds to a batch size of 16, and the highest corresponds to a batch size of 1024.

Comparing the two figures, we can conclude that the testing-accuracy curves are noisy on the MNIST dataset and smooth on the CIFAR-10 dataset, across batch size values from 16 to 1024.

In addition, we can see from Table 1 that the testing accuracy for both datasets increased as the batch size increased. Training time follows a related trend: analyzing Table 2, we can infer that the higher the batch size, the more time is required to train the network.

4. How to achieve efficient large-batch training?

We know that using data parallelism across multiple GPUs helps speed up the training of large networks. To make full use of the computational power of each GPU, we need to increase the batch size, especially in the case of stochastic-gradient-based methods. However, it is not easy to retain the accuracy of the network while increasing the batch size, so we are going to discuss how effective large-batch training can be achieved.

I would like to discuss a research paper that gives us profound insight into the policies that can overcome these initial optimization difficulties.

Scaling SGD Batch Size to 32K for ImageNet Training

ImageNet

We know that Deep Neural Networks (DNNs) perform significantly better than conventional machine learning techniques for complicated applications like computer vision and natural language processing. However, their efficiency is limited by their time-consuming training. For example, training the AlexNet model on ImageNet with one NVIDIA K20 GPU needs six days to achieve 58% top-1 accuracy.

As discussed earlier, increasing the number of GPUs can help speed up the training process, and to utilize the available resources we need to increase the batch size. However, increasing the batch size often leads to a significant loss in test accuracy. Hence, we need policies that increase the batch size while maintaining the same accuracy as the baseline (e.g., batch size = 256).

Gradient Descent

One method that can help us balance a larger batch size against baseline accuracy is to control the SGD learning rate during the training process. For example, in ImageNet training with ResNet-152, researchers managed to achieve the same 77.8% accuracy when increasing the batch size from 256 to 5120 via the linear scaling rule. However, when increasing the batch size beyond 1024 to train AlexNet on ImageNet, existing methods (like linear scaling or sqrt scaling) do not work. Adding batch normalization to the AlexNet model does improve Batch-4096 accuracy from 53.1% to 58.9%, though.

To increase large-batch AlexNet's test accuracy and enable large-batch training for general networks and datasets, Layer-wise Adaptive Rate Scaling (LARS) can be used. LARS is a learning rate policy that uses different learning rates for different layers based on the norm of the weights (||w||) and the norm of the gradients (||∇w||). By using LARS, the same accuracy can be achieved when increasing the batch size from 128 to 8192 for the AlexNet model.
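A minimal NumPy sketch of a single LARS step for one layer is shown below; the trust coefficient and weight decay values are illustrative, and momentum is omitted for brevity:

```python
import numpy as np

def lars_update(w, grad, global_lr, trust_coeff=0.001, weight_decay=5e-4):
    """One LARS step for a single layer: the local learning rate is set by
    the ratio ||w|| / ||g||, so every layer adapts its own step size."""
    g = grad + weight_decay * w                       # L2-regularized gradient
    local_lr = trust_coeff * np.linalg.norm(w) / (np.linalg.norm(g) + 1e-9)
    return w - global_lr * local_lr * g

# Example: layers with very different weight scales get different local LRs.
w_small, w_big = np.full(10, 0.01), np.full(10, 1.0)
g = np.full(10, 0.1)
w_small = lars_update(w_small, g, global_lr=1.0)
w_big = lars_update(w_big, g, global_lr=1.0)
```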

ImageNet Dataset by ResNet50 Model with poly learning rate (LR) rule

Facebook used the multistep learning rate rule, a warm-up strategy, and the linear-scaling learning rate. Specifically, batch-8192's base LR is 3.2, which is 32 times (8192/256) the batch-256 base LR. During the first five epochs, they gradually increase the learning rate from 0.1 to 3.2; this is called the warm-up phase. At the 30th, 60th, and 80th epochs, they multiply the learning rate by 0.1. Like Facebook's paper, the authors here use the warm-up and linear-scaling strategies for the learning rate, but they push the learning rate higher and use the poly rule rather than the multistep rule to update it. From the figure above, we understand that an 8k batch size can achieve the same accuracy as a 256 batch size within the same 90 epochs.
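The schedule just described can be sketched as a small function: linear scaling sets the peak LR, a five-epoch warm-up ramps up to it, and polynomial decay takes over afterwards. The decay power is an illustrative assumption:

```python
def lr_schedule(epoch, batch_size, base_lr=0.1, base_batch=256,
                warmup_epochs=5, total_epochs=90, power=2.0):
    """Warm-up followed by polynomial ('poly') decay. power=2 is a common
    choice for the poly rule; treat it as an illustrative assumption."""
    peak_lr = base_lr * batch_size / base_batch        # linear scaling: 3.2 at batch 8192
    if epoch < warmup_epochs:
        # Ramp linearly from base_lr up to the scaled peak LR.
        return base_lr + (peak_lr - base_lr) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak_lr * (1.0 - progress) ** power

print(lr_schedule(0, 8192))    # 0.1  (start of warm-up)
print(lr_schedule(5, 8192))    # 3.2  (peak after warm-up)
print(lr_schedule(90, 8192))   # 0.0  (end of training)
```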

Now I would like to discuss the different approaches that were used to train AlexNet on ImageNet.

Training large-batch AlexNet with linear scaling and warm-up schemes for the LR

Linear Scaling and Warmup Schemes for LR: The baseline was Batch-512 BVLC-AlexNet, which achieved 0.588 test accuracy in 100 epochs. The target was to achieve the baseline accuracy using Batch-4096 and Batch-8192 in 100 epochs. We can infer from the table that with linear scaling alone, Batch-4096 AlexNet does not converge even at LR = 0.01. Even after combining linear scaling with warm-up schemes for the LR, they were not able to get the desired results. So we can conclude that linear scaling and warm-up schemes alone are not enough for large-batch AlexNet training.

Batch Normalization for Large-Batch Training: Evaluating different techniques to enable large-batch training for AlexNet, it was found that only Batch Normalization improves the accuracy. Although the result using Batch Normalization was promising, there was still a one percent accuracy gap between Batch-512 and Batch-4096, and for Batch-8192 the gap was even larger.

ImageNet Dataset by ResNet50 Model with poly learning rate (LR) rule

Layer-wise Adaptive Rate Scaling (LARS) for Large-Batch Training: To further improve the accuracy of large-batch AlexNet, a new rule for updating the learning rate was introduced. We know that the standard SGD algorithm uses the same LR for all layers. However, experiments showed that different layers may need different LRs. So the Layer-wise Adaptive Rate Scaling (LARS) learning rate scheme was designed to improve large-batch training accuracy. By using LARS, the batch size can be scaled up to 32K for ImageNet-1k training with the ResNet50 model. We can deduce from the graph that the same accuracy can be achieved with batch sizes ranging from 256 to 32k.

But why pursue a large batch size when we can use a small one?

As we have seen, batch sizes ranging from 256 to 32k give the same accuracy. So why bother with large batch sizes? I would like to answer this question using the table below.

The speed and time for AlexNet-BN

We can see from the table above that for a batch size of 512, the training time is approximately 5 hours 22 minutes on 4 GPUs, whereas on 8 GPUs the training time increases to approximately 6 hours 10 minutes. The training time increases because, with a batch size of 512, the model cannot fully utilize the available computational resources. The model with a batch size of 4096, by contrast, runs more than twice as fast on 8 GPUs as on 4 GPUs. Hence, it is necessary to increase the batch size to utilize the computational resources efficiently.
