How Parallelization and Large Batch Size Improve the Performance of Deep Neural Networks.
Until recently, a large batch size was viewed as a deterrent to good accuracy. However, recent studies show that increasing the batch size can significantly reduce training time while maintaining a comparable level of accuracy. In this blog, we draw on insights from four such technical papers.
1. Training ResNet-50 on ImageNet in 15 Mins
To ensure that accuracy does not drop with a large mini-batch, the following techniques were used to optimize the learning algorithm:
Technical Details
A. RMSprop Warm-up:
The RMSprop warm-up phase addresses the optimization difficulty at the start of training. The update rule combines Stochastic Gradient Descent (SGD) with the RMSprop optimization algorithm: training starts as pure RMSprop and then gradually transitions into pure SGD.
In the paper's update rule, the transition to SGD is exponential, then linear, and then constant (at a value of one, indicating pure SGD). This smooth transition, shaped like an Exponential Linear Unit (ELU), ensures that moving from RMSprop to SGD doesn't have a negative impact on the training of the network.
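To make the exponential, then linear, then constant shape concrete, here is a minimal Python sketch of such a schedule. The names `sgd_weight`, `blended_step`, `center`, and `period`, the default constants, and the idea of mixing the two update values as a simple convex combination are all assumptions for illustration, not the paper's exact update rule.

```python
import math

def sgd_weight(epoch, center=10.0, period=5.0):
    """Weight given to SGD at a (possibly fractional) epoch, in [0, 1].

    `center` and `period` are hypothetical constants chosen for illustration.
    """
    if epoch < center:
        # Exponential phase: starts close to zero and reaches 0.5 at `center`.
        return 0.5 * math.exp(2.0 * (epoch - center) / period)
    if epoch < center + period / 2.0:
        # Linear phase: continues with the slope the exponential had at `center`.
        return 0.5 + (epoch - center) / period
    # Constant phase: pure SGD.
    return 1.0

def blended_step(sgd_step, rmsprop_step, epoch):
    """Mix two scalar update values (assumed form: a convex combination)."""
    a = sgd_weight(epoch)
    return a * sgd_step + (1.0 - a) * rmsprop_step
```

With these assumed constants, the weight is close to zero at epoch 0 (almost pure RMSprop), exactly 0.5 at epoch 10, and 1.0 from epoch 12.5 onward. The exponential and linear pieces share the same value and slope where they meet, which is what keeps the hand-over smooth.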
B. Slow-Start Learning Rate Schedule
Optimization difficulties are most frequent early in training, when the state of the model is changing rapidly, so a small learning rate is used for a long initial phase. The learning rate for RMSprop (used during the initial part of the training) is kept very small at 0.0003, while the learning rate for SGD (used during the later part of the training) is reduced after a set number of epochs.
C. Batch Normalization without Moving Averages
An interesting observation presented in this paper is that with large mini-batch sizes, the moving averages of the mean and variance are not an accurate representation of the actual mean and variance. To overcome this issue, only the statistics of the last mini-batch are used instead of moving averages, and all-reduce communication is used to average them across all the workers in the data-parallel setup.
Apart from these key optimizations to the learning algorithm, the following combination of software and hardware was used to perform the experiment:
- Chainer: open-source deep learning framework
- NCCL: NVIDIA Collective Communications Library, for multi-node GPU communication
- Half-precision floating point: to reduce the communication overhead during the all-reduce phase
- 1024 NVIDIA Tesla P100 GPUs: for distributed training
Experimental Results
As the table above shows, ResNet-50 was trained for 90 epochs in under 15 minutes with a mini-batch size of 32k, reaching an accuracy of 74.9%. This accuracy is comparable to previously benchmarked accuracies and takes considerably less time to achieve.
Observations and Insights
The key takeaway from this paper is that it is feasible to train deep neural networks with extremely large batch sizes and still maintain comparable accuracy while significantly reducing training time. This paper also demonstrates how learning algorithms can be optimized and used in conjunction with distributed training to achieve this level of performance.
2. Image Classification at Supercomputer Scale
This paper can be viewed as an extension of the previous paper we discussed. In this paper, the authors propose system-related optimizations to train ResNet-50 on ImageNet in under 2.2 minutes with no drop in accuracy.
Technical Details
- TPU Pod: A TPU pod is a configuration of interconnected TPU chips. A TPU v2 pod contains 256 chips and a TPU v3 pod contains 1024 chips. The key point here is that as the number of chips in the pod increases, so does the mixed-precision throughput in petaFLOPS.
- Mixed Precision: The authors used bfloat16 (a 16-bit floating-point format) to perform convolution operations. This enables higher throughput with minimal loss in accuracy.
- Learning Rate Schedule: A linearly scaled learning rate is used (the learning rate scales linearly with the batch size). This linear learning rate scaling was proposed by Alex Krizhevsky of Google in 2014.
- Layer-Wise Adaptive Rate Scaling (LARS): LARS uses different learning rates for different layers based on the norm of the weights and the norm of the gradients. This allowed the authors to scale the batch size up from 8192 (with linear learning rate scaling) to 32k.
Input Pipeline Optimization
The throughput of the overall network is dependent on the throughput of the input pipeline and hence it is equally important to optimize the input pipeline. The authors of this paper present the following optimizations to the input pipeline:
- Dataset Sharding and Caching: This is essentially data parallelization in which the dataset is distributed across multiple learners. In practice, it is not feasible to store all of the data on a single machine (or host), and hence data parallelization is used.
- Prefetch: Prefetch allows the input pipeline to process the data for the next batch while the model pipeline is training on the current batch.
- Fused JPEG Decode: For the purpose of dataset augmentation (in order to reduce variance), only the relevant part(s) of the input image are decoded to reduce the computation overhead.
- Parallel Data Parsing: This is an extension of the first optimization technique described above (dataset sharding and caching) where a multi-core CPU parallelizes the data parsing operation across several CPU threads.
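The prefetch idea from the list above can be sketched in a few lines of standard-library Python. This is only an illustration of overlapping input preparation with training. It is not the pipeline the authors used, and the names `prefetch`, `load_batches`, and `buffer_size` are hypothetical.

```python
import queue
import threading

def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread prepares the next ones."""
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            buf.put(batch)  # blocks while the buffer is full
        buf.put(done)

    # Daemon thread so an abandoned generator does not keep the process alive.
    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is done:
            return
        yield item

def load_batches(n):
    # Hypothetical stand-in for reading, decoding, and augmenting images.
    for i in range(n):
        yield [i, i + 1]

for batch in prefetch(load_batches(3)):
    # A real training step would run here while the next batch is being prepared.
    print(batch)
```

The bounded queue is the design choice that matters: it lets the producer run ahead of the consumer by at most `buffer_size` batches, so memory use stays fixed. For brevity, the sketch does not forward exceptions raised in the producer thread.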
Experimental Results
It was observed that when using Batch Normalization, the effective batch size per worker (in data parallelization) has a key impact on the validation accuracy. If the effective batch size falls below 32, there is a significant reduction in the validation accuracy, as shown in the graph above. For this experiment, the effective batch size per replica was 64.
The graph above shows the effect of the input pipeline optimizations (caching, prefetch, fused JPEG decode, and data parallelization) through controlled additions and controlled ablations (subtractions). As we observe from the graph, data parallelization is the most important optimization for the input pipeline. Removing data parallelization reduces the throughput by almost 50%!
Using the system and algorithm-related optimizations discussed above, this table shows the final performance of the ResNet-50 model. The authors were able to train ResNet-50 on ImageNet with a validation accuracy of 76.3% in just under 2.2 minutes! We observe that changing the optimizer from RMSprop+SGD to LARS leads to a significant reduction in training time while maintaining accuracy.
Observations and Insights
This paper demonstrated significant system (or input pipeline) related optimizations to reduce the training time of Deep Neural Networks. We also observed how using LARS as an optimizer can significantly reduce the training time without compromising on accuracy.
Does the training set batch size affect the performance of Convolutional Neural Networks?
In this blog, I would like to try answering the above question. One of the most critical parts of deep learning is finding good hyperparameters. We are going to discuss the effects of batch size on the performance of Convolutional Neural Networks (CNNs).
I would like to discuss a research paper that gives us deep insight into the relationship between training set batch size and the performance of a CNN.
Impact of Training Set Batch Size on the Performance of Convolutional Neural Networks for Diverse Datasets
The benchmark image classification experiments are conducted on the MNIST and CIFAR-10 datasets to estimate network training performance. The MNIST dataset is a database of handwritten digits from zero to nine. The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 different classes.
Based on related work, two sequences of batch size values were selected: powers of two and multiples of ten.
In addition to this, a different CNN architecture was used for each dataset. The first task was to examine the influence of batch size on the MNIST dataset, for which a well-known CNN architecture called LeNet was used.
For the tests on the CIFAR-10 dataset, a neural network with five convolutional layers was used, with normalization layers added. The selected models were implemented using the machine learning framework TensorFlow v1.3.0.
The models were trained using SGD with a learning rate of 0.001 and 0.0001 for the MNIST and CIFAR-10 datasets, respectively. The performance was evaluated as an average over 5,000 and 10,000 iterations for the MNIST and CIFAR-10 datasets, respectively.
From Fig. 1 we can conclude that the larger the batch size, the smoother the curve. The lowest and noisiest curve corresponds to a batch size of 16, whereas the smoothest curve corresponds to a batch size of 1024.
From Fig. 2 we can observe that the smoothness of the curve is approximately the same for all the batch size values. The lowest curve corresponds to a batch size of 16, and the highest corresponds to a batch size of 1024.
Comparing the two figures, we can conclude that the curves describing the testing accuracy are noisy on the MNIST dataset and smooth on the CIFAR-10 dataset. The curves cover batch sizes from 16 to 1024.
In addition to this, we can see from Table 1 that the testing accuracy for both datasets increased as the batch size increased. The training time follows a similar trend: analyzing Table 2, we can infer that the larger the batch size, the more time is required to train the network.
How to achieve efficient large-batch training?
Hello everyone, we know that using data parallelism on multiple GPUs helps speed up the training of large networks. To make full use of the computational power of each GPU, we need to increase the batch size, especially in the case of stochastic gradient-based methods. However, it is not easy to retain the accuracy of the network while increasing the batch size. So we are going to discuss how we can achieve effective large-batch training.
I would like to discuss a research paper that gives us profound insight into the policies that can overcome initial optimization difficulties.
Scaling SGD Batch Size to 32K for ImageNet Training
We know that Deep Neural Networks (DNNs) perform significantly better than conventional machine learning techniques for complicated applications like computer vision and natural language processing. However, their usefulness is limited by how long they take to train. For example, training AlexNet on ImageNet with one NVIDIA K20 GPU takes six days to reach 58% top-1 accuracy.
As discussed earlier, we know that increasing the number of GPUs can help speed up the training process. To utilize the available resources, we need to increase the batch size. However, increasing the batch size often leads to a significant loss in test accuracy. Hence, we need policies that let us increase the batch size while maintaining the same accuracy as the baseline (e.g. batch size = 256).
One of the methods that can help us balance a larger batch size against baseline accuracy is to control the SGD learning rate during the training process. For example, in ImageNet training with ResNet-152, the same 77.8% accuracy was achieved when the batch size was increased from 256 to 5120 using the linear scaling rule. However, when the batch size is increased beyond 1024 to train AlexNet on ImageNet, existing methods (like linear scaling or sqrt scaling) do not work. That said, adding batch normalization to the AlexNet model can improve Batch-4096 accuracy from 53.1% to 58.9%.
To increase large-batch AlexNet's test accuracy and extend large-batch training to general networks or datasets, Layer-wise Adaptive Rate Scaling (LARS) can be used. LARS is a learning rate policy that uses different learning rates for different layers based on the norm of the weights (||w||) and the norm of the gradients (||∇w||). By using LARS, the same accuracy can be achieved when increasing the batch size from 128 to 8192 for the AlexNet model.
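As a rough illustration of a per-layer learning rate built from ||w|| and ||∇w||, here is a minimal Python sketch. The exact form used (a trust coefficient times ||w|| divided by ||∇w|| plus a weight-decay term), the default values, and the function names are assumptions made for this example. Momentum is left out to keep it short.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))

def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0, eps=1e-9):
    """Layer-wise rate from ||w|| and ||grad w|| (assumed form, hypothetical defaults)."""
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    # eps guards against division by zero when the gradient vanishes.
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)

def lars_sgd_step(layers, grads, global_lr):
    """Plain SGD step (no momentum) with one LARS-scaled learning rate per layer."""
    new_layers = []
    for w, g in zip(layers, grads):
        lr = global_lr * lars_local_lr(w, g)
        new_layers.append([wi - lr * gi for wi, gi in zip(w, g)])
    return new_layers
```

In this sketch, a layer whose gradient norm is large relative to its weight norm automatically gets a smaller step, and a layer with small gradients gets a larger one. That is the intuition behind using one rate per layer instead of a single rate for the whole network.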
Facebook's paper used a multistep learning rate rule, a warm-up strategy, and linear learning rate scaling. Specifically, the base LR for batch 8192 is 3.2, which is 32 times (8192/256) the base LR for batch 256. During the first five epochs, the learning rate is gradually increased from 0.1 to 3.2; this is called the warm-up phase. At the 30th, 60th, and 80th epochs, the learning rate is updated as η = 0.1 × η. Like Facebook's paper, the authors of this paper use warm-up and linear scaling for the learning rate, but they push the learning rate higher and use the poly rule rather than the multistep rule to update it. From the figure above, we can see that a batch size of 8k can achieve the same accuracy as a batch size of 256 in the same 90 epochs.
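The numbers in the paragraph above can be reproduced with a short Python sketch of the two schedules. The function names are hypothetical, and the poly exponent (`power=2.0`) and the per-epoch (rather than per-iteration) granularity are assumptions for illustration; only the 0.1 starting rate, the factor of 8192/256, the five warm-up epochs, and the milestones at epochs 30, 60, and 80 come from the text.

```python
def scaled_base_lr(batch_size, ref_lr=0.1, ref_batch=256):
    """Linear scaling rule: the LR grows in proportion to the batch size."""
    return ref_lr * batch_size / ref_batch

def multistep_lr(epoch, batch_size, warmup_epochs=5, milestones=(30, 60, 80), start_lr=0.1):
    """Linear warm-up to the scaled LR, then multiply by 0.1 at each milestone."""
    peak = scaled_base_lr(batch_size)
    if epoch < warmup_epochs:
        # Warm-up: interpolate from start_lr to the scaled peak LR.
        return start_lr + (peak - start_lr) * epoch / warmup_epochs
    drops = sum(1 for m in milestones if epoch >= m)
    return peak * (0.1 ** drops)

def poly_lr(epoch, peak_lr, total_epochs=90, power=2.0):
    """Poly rule: smooth decay from peak_lr to zero (the exponent is an assumed value)."""
    return peak_lr * (1.0 - epoch / total_epochs) ** power
```

For a batch size of 8192, `scaled_base_lr` gives 32 times the batch-256 rate, and `multistep_lr` climbs from 0.1 at epoch 0 to that peak by epoch 5 before dropping by a factor of ten at each milestone. The poly rule replaces those abrupt drops with a continuous decay that reaches zero at the final epoch.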
Now I would like to discuss the different approaches that were used to train AlexNet on ImageNet.
Linear Scaling and Warmup schemes for LR: The baseline was Batch-512 BVLC-AlexNet, which achieved 0.588 test accuracy in 100 epochs. The target was to achieve the baseline accuracy using Batch-4096 and Batch-8192 in 100 epochs. We can infer from the table that with linear scaling alone, Batch-4096 AlexNet does not converge even at LR = 0.01. Even after using both linear scaling and warmup schemes for LR, the authors were not able to get the desired results. So, we can conclude that linear scaling and warmup schemes alone are not enough for large-batch AlexNet training.
Batch Normalization for Large-Batch Training: Of the different techniques examined to enable large-batch training for AlexNet, only Batch Normalization was found to improve the accuracy. Although the results with Batch Normalization were promising, there was still a one percent accuracy gap between batch 512 and batch 4096, and for batch 8192 the gap was even larger.
Layer-wise Adaptive Rate Scaling (LARS) for Large-Batch Training: To further improve the accuracy of large-batch AlexNet, a new rule for updating the learning rate was introduced. We know that the standard SGD algorithm uses the same LR for all layers. However, experiments showed that different layers may need different LRs. So, the Layer-wise Adaptive Rate Scaling (LARS) learning rate scheme was designed to improve large-batch training accuracy. By using LARS, the batch size can be scaled up to 32K for ImageNet-1k training with the ResNet-50 model. We can deduce from the graph that the same accuracy can be achieved with batch sizes ranging from 256 to 32k.
But why achieve a large batch size when we can use a small batch size?
As we have seen, batch sizes ranging from 256 to 32k give the same accuracy. So why bother using large batch sizes? I would love to answer this question using the table below.
We can see from the above table that for a batch size of 512, the training time is approximately 5 hours 22 minutes on 4 GPUs, whereas on 8 GPUs the training time increases to approximately 6 hours and 10 minutes. The training time increases because the model with a batch size of 512 cannot fully utilize the available computational resources. In contrast, the model with a batch size of 4096 runs more than two times faster on 8 GPUs than on 4 GPUs. Hence, it is necessary to increase the batch size to utilize the computational resources efficiently.
[1]: Scaling SGD Batch Size to 32K for ImageNet Training. Yang You, Igor Gitman, Boris Ginsburg.
[2]: A Review on Conventional Machine Learning vs Deep Learning
[3]: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
[4]: Image Classification at Supercomputer Scale
[5]: Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes
[6]: Impact of Training Set Batch Size on the Performance of Convolutional Neural Networks for Diverse Datasets. Pavlo M. Radiuk, Khmelnitsky National University, Ukraine.
[7]: MNIST Dataset
[8]: CIFAR-10 Dataset
[9]: TensorFlow