Effect of Batch Size on Neural Net Training

Daryl Chang
Deep Learning Experiments
May 25, 2020

co-authored with Apurva Pathak

Welcome to the first installment in our Deep Learning Experiments series, where we run experiments to evaluate commonly-held assumptions about training neural networks. Our goal is to better understand the different design choices that affect model training and evaluation. To do so, we come up with questions about each design choice and then run experiments to answer them.

In this article, we seek to better understand the impact of batch size on training neural networks. In particular, we will cover the following:

  • What is batch size?
  • Why does batch size matter?
  • How do small and large batches perform empirically?
  • Why do large batches tend to perform worse, and how can we close the performance gap?

What is batch size?

Neural networks are trained to minimize a loss function of the following form:

Figure 1: Loss function. Adapted from Keskar et al [1].

where

  • theta represents the model parameters
  • m is the number of training data examples
  • each value of i represents a single training data example
  • J_i represents the loss function applied to a single training example
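
Written out, the loss in Figure 1 takes the following form (a reconstruction based on the definitions above and the notation in Keskar et al [1]):

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} J_i(\theta)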

Typically, this is done using gradient descent, which computes the gradient of the loss function with respect to the parameters and takes a step in the direction of the negative gradient. Stochastic gradient descent computes the gradient on a subset of the training data, B_k, as opposed to the entire training dataset.

Figure 2: Stochastic gradient descent update equation. Adapted from Keskar et al [1].

B_k is a batch sampled from the training dataset, and its size can vary from 1 to m (the total number of training data points) [1]. This is typically referred to as mini-batch training with a batch size of |B_k|. We can think of these batch-level gradients as approximations of the ‘true’ gradient, the gradient of the overall loss function with respect to theta. We use mini-batches because training tends to converge more quickly: we don’t need to make a full pass through the training data before updating the weights.
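
Concretely, the update in Figure 2 can be written as (again reconstructed from the definitions above):

\theta_{k+1} = \theta_k - \alpha_k \, \frac{1}{|B_k|} \sum_{i \in B_k} \nabla_\theta J_i(\theta_k)

where alpha_k is the learning rate at step k and the gradient is averaged over the examples in the batch B_k.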

Why does batch size matter?

Keskar et al note that stochastic gradient descent is sequential and uses small batches, so it cannot be easily parallelized [1]. Using larger batch sizes would allow us to parallelize computations to a greater degree, since we could split up the training examples between different worker nodes. This in turn could significantly speed up model training.

However, larger batch sizes, while able to achieve similar training error as smaller batch sizes, tend to generalize worse to test data [1]. The gap between the train and test error is referred to as the ‘generalization gap.’

Thus, the ‘holy grail’ is to achieve the same test error as small batch sizes using large batch sizes. This would allow us to significantly speed up training without sacrificing model accuracy.

How are the experiments set up?

Similar to the first article on optimizers, we will train a neural net using different batch sizes and compare their performance.

  • Dataset: we use the Cats and Dogs dataset, which consists of 23,262 images of cats and dogs, split about 50/50 between the two classes. Since the images are differently-sized, we resize them all to the same size. We use 20% of the dataset as validation data (dev set) and the rest as training data.
  • Evaluation metric: we use the binary cross-entropy loss on the validation data as our primary metric to measure model performance.
Figure 3: Sample images from Cats vs Dogs dataset
  • Base model: we also define a base model that is inspired by VGG16, where we apply (convolution -> max-pool) blocks repeatedly, using ReLU as the activation function for the convolutions. We then flatten the output volume, feed it into two fully-connected layers, and finish with a one-neuron layer with a sigmoid activation, producing an output between 0 and 1 that tells us whether the model predicts a cat (0) or a dog (1). (A code sketch of this setup follows the list below.)
Figure 4: Base model architecture
  • Training: we use SGD with a learning rate of 0.01. (Note that this learning rate is different from the previous article.) We train until the validation loss fails to improve over 100 iterations.
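
To make the setup concrete, here is a minimal Keras sketch of the kind of model and training configuration described above. The function name, number of convolutional blocks, filter counts, dense layer sizes, and input resolution are illustrative assumptions, not the exact configuration used in the experiments.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_base_model(input_shape=(128, 128, 3)):  # input resolution is an assumption
    model = models.Sequential()
    # Repeated (convolution -> max-pool) blocks with ReLU activations;
    # the number of blocks and filter counts here are illustrative
    model.add(layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                            input_shape=input_shape))
    model.add(layers.MaxPooling2D((2, 2)))
    for filters in (64, 128):
        model.add(layers.Conv2D(filters, (3, 3), padding='same', activation='relu'))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    # Two fully-connected layers, then a single sigmoid unit (cat = 0, dog = 1)
    model.add(layers.Dense(256, activation='relu'))
    model.add(layers.Dense(128, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss='binary_crossentropy')
    return model

# Stop training when the validation loss fails to improve, per the setup above
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=100,
                                                  restore_best_weights=True)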

How does batch size affect training?

Let’s try out different batch sizes in the wild with our Cats vs Dogs dataset and see how each performs!

Figure 5: Training and validation loss curves for different batch sizes
Figure 6: Best losses attained by each batch size
Figure 7. Left: mean time per epoch. Middle: number of epochs until validation loss converged. Right: overall training time until validation loss converged.

From the above graphs, we can conclude that the larger the batch size:

  • The slower the training loss decreases.
  • The higher the minimum validation loss.
  • The less time it takes to train per epoch.
  • The more epochs it takes to converge to the minimum validation loss.

Let’s go through these one by one. First, in large batch training, the training loss decreases more slowly, as shown by the difference in slope between the red line (batch size 256) and blue line (batch size 32).

Second, large batch training achieves worse minimum validation losses than the small batch sizes. For example, batch size 256 achieves a minimum validation loss of 0.395, compared to 0.344 for batch size 32.

Third, each epoch of large batch size training takes slightly less time — 7.7 seconds for batch size 256 compared to 12.4 seconds for batch size 32, which reflects the lower overhead associated with loading a small number of large batches, as opposed to loading many small batches sequentially. This time difference would be even more pronounced if we had parallelized training using multiple GPUs.

However, large batch training takes more epochs to converge to a minimizer — 958 for batch size 256, 158 for batch size 32. Because of this, large batch training took longer overall: batch size 256 took almost four times as long as 32! Note that we did not parallelize training here — if we had, then large batch training might have trained as quickly as small batch training.

What happens if we parallelize the training runs? To answer this, we parallelized training across four GPUs using MirroredStrategy in TensorFlow:

import tensorflow as tf

with tf.distribute.MirroredStrategy().scope():
    # Create, compile, and fit the model as usual
    # ...

MirroredStrategy copies all of the model’s variables to each GPU and distributes the forward/backward pass computation for a batch across all the GPUs. It then combines the gradients from each GPU using all-reduce and applies the result to each GPU’s copy of the model. Essentially, it divides up the batch and assigns each chunk to a GPU.

We found that parallelization made small-batch training slightly slower per epoch, whereas it made large-batch training faster — for batch size 256, each epoch took 3.97 seconds, down from 7.70 seconds. However, even with the per-epoch speedup, it fails to match batch size 32 in terms of overall training time — when we multiply by the overall number of epochs (958), we get a total training time of ~3700 seconds, which is still much greater than the 1915 seconds for batch size 32.

Figure 8: mean time per epoch, when parallelizing across 4 GPUs.

So far, large batch training doesn’t look like it’s worth the trouble: it takes longer to train overall and achieves worse training and validation loss. Why is this the case? Is there any way that we can close the performance gap?

Why do the smaller batch sizes perform better?

Keskar et al propose an explanation for the performance gap between small and large batch sizes: training with small batch sizes tends to converge to flat minimizers that vary only slightly within a small neighborhood of the minimizer, whereas large batch sizes converge to sharp minimizers, which vary sharply [1]. Flat minimizers tend to generalize better, since they are more robust to changes between the training and test sets [1].

Figure 9: Conceptual illustration of flat and sharp minima, taken from Keskar et al [1].

Additionally, they found that small batch size training finds minimizers farther away from the initial weights, compared to large batch size training. They explain that small batch size training may introduce enough noise for training to exit the loss basins of sharp minimizers and instead find flat minimizers that may be farther away.

Let’s validate these hypotheses.

Hypothesis 1: The small batch minimizer is farther from the initial weights compared to the large batch minimizer.

We first measure the Euclidean distance between the initial weights and the minimizers found by each model.
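
As a rough sketch of this measurement (assuming the initial and trained weights are available as lists of NumPy arrays, e.g. via a Keras model’s get_weights(); the model names below are hypothetical):

import numpy as np

def weight_distance(initial_weights, final_weights):
    # Euclidean (L2) distance between two sets of model weights
    squared = sum(np.sum((w0 - w1) ** 2)
                  for w0, w1 in zip(initial_weights, final_weights))
    return float(np.sqrt(squared))

# e.g. weight_distance(initial_model.get_weights(), trained_model.get_weights())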

Figure 10: Distance from initial weights
Figure 11: Distance from initial weights by layer, comparison of batch size 32 and 256

Indeed, we find that, generally speaking, the larger the batch size, the closer the minimizer is to the initial weights (with the exception of batch size 128, which ends up farther from the initial weights than batch size 64). We also see in Figure 11 that this holds across the different layers of the model.

Why does large batch training end up closer to the initial weights? Is it taking smaller update steps? Let’s find out by measuring the epoch distance — i.e. the distance between the weights at the end of epoch i and the weights at the start of epoch i — for batch sizes 32 and 256.

Figure 12. Left: epoch distances by batch size. Right: ratio of epoch distances.

The first plot above shows that the larger batch sizes do indeed traverse less distance per epoch. The batch 32 training epoch distance varies from 0.15 to 0.4, while for batch 256 training it is around 0.02–0.04. In fact, as we can see in the second plot, the ratio of the epoch distances increases over time!

But why does large batch training traverse less distance per epoch? Is it because we have fewer batches, and therefore fewer updates per epoch? Or is it because each batch update traverses less distance? Or, is the answer a combination of both?

To answer this question, let’s measure the size of each batch update.

Figure 13: Distribution of batch update sizes
Median batch update norm for batch size 32: 3.3e-3
Median batch update norm for batch size 256: 1.5e-3
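
One way to collect these per-update norms (a sketch, not necessarily how our runs were instrumented) is a Keras callback that snapshots the weights before each batch and records the L2 norm of the change afterwards:

import numpy as np
import tensorflow as tf

class BatchUpdateNorm(tf.keras.callbacks.Callback):
    # Records the L2 norm of the weight change applied by each batch update
    def on_train_begin(self, logs=None):
        self.update_norms = []

    def on_train_batch_begin(self, batch, logs=None):
        self._before = [w.copy() for w in self.model.get_weights()]

    def on_train_batch_end(self, batch, logs=None):
        after = self.model.get_weights()
        squared = sum(np.sum((a - b) ** 2) for a, b in zip(after, self._before))
        self.update_norms.append(float(np.sqrt(squared)))

# After model.fit(..., callbacks=[cb]), np.median(cb.update_norms) gives the median update size.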

We can see that each batch update is smaller when the batch size is larger. Why would this be the case?

To understand this behavior, let’s set up a dummy scenario, where we have two gradient vectors a and b, each representing the gradient for one training example. Let’s think about how the average batch update size for batch size=1 compares to that of batch size=2.

Figure 14: Comparison of update steps between batch size 1 (a+b) and batch size 2 ((a+b)/2)

If we use a batch size of one, we will take a step in the direction of a, then b, ending up at the point represented by a+b. (Technically, the gradient for b would be recomputed after applying a, but let’s ignore that for now). This results in an average batch update size of (|a|+|b|)/2 — the sum of the batch update sizes, divided by the number of batch updates.

However, if we use a batch size of two, the batch update is instead represented by the vector (a+b)/2 — the red arrow in Figure 14. Thus, the average batch update size is |(a+b)/2| / 1 = |a+b|/2.

Now, let’s compare the two average batch update sizes:

Figure 15: Comparison of average batch update size for batch size 1 and batch size 2.
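
Written out, the derivation in Figure 15 amounts to the following (a reconstruction):

\text{avg. update}_{\text{batch size}=1} - \text{avg. update}_{\text{batch size}=2} = \frac{|a| + |b|}{2} - \frac{|a + b|}{2}
\geq 0, \quad \text{since } |a + b| \leq |a| + |b|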

In the last line, we used the triangle inequality to show that the average batch update size for batch size 1 is always greater than or equal to that of batch size 2.

Put another way, in order for the average batch update size for batch size 1 and batch size 2 to be equal, the vectors a and b have to point in the same direction, since that is when |a| + |b| = |a+b|. We can extend this argument to n vectors — only when all n vectors point in the same direction are the average batch update sizes for batch size=1 and batch size=n the same. However, this is almost never the case, since the gradient vectors are unlikely to point in exactly the same direction.

Figure 16: Minibatch update equation

If we return to the minibatch update equation in Figure 16, we are in some sense saying that as we scale up the batch size |B_k|, the magnitude of the sum of the gradients scales up comparatively less quickly. This is due to the fact that the gradient vectors point in different directions, and thus doubling the batch size (i.e. the number of gradient vectors to sum together) does not double the magnitude of the resulting sum of gradient vectors. At the same time, we are dividing by a denominator |B_k| that is twice as large, resulting in a smaller update step overall.

This could explain why the batch updates for larger batch sizes tend to be smaller — the sum of gradient vectors becomes larger, but cannot fully offset the larger denominator |B_k|.

Hypothesis 2: Small batch training finds flatter minimizers

Let’s now measure the sharpness of both minimizers, and evaluate the claim that small batch training finds flatter minimizers. (Note that this second hypothesis can coexist with the first one — they are not mutually exclusive.) To do so, we borrow two methods from Keskar et al.

In the first one, we plot the training and validation loss along a line between a small batch minimizer (batch size 32) and a large batch minimizer (batch size 256). This line is described by the following equation:

Figure 17: Linear interpolation between small batch minimizer and large batch minimizer. From Keskar et al [1].

where x_l* is the large batch minimizer and x_s* is the small batch minimizer, and alpha is a coefficient between -1 and 2.
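
Based on this description, the interpolated weights take the form (reconstructed from Keskar et al’s notation):

x(\alpha) = \alpha \, x_l^* + (1 - \alpha) \, x_s^*

so that alpha = 0 gives the small batch minimizer, alpha = 1 gives the large batch minimizer, and we plot the loss f(x(\alpha)) as alpha sweeps from -1 to 2.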

Figure 18: Interpolation between small batch minimizer (alpha=0) and large batch minimizer (alpha=1). The large batch minimizer is much ‘sharper.’

As we can see in the plot, the small batch minimizer (alpha=0) is much flatter than the large batch minimizer (alpha=1), which varies much more sharply.

Note that this is a rather simplistic way of measuring sharpness, since it only considers one direction. Thus, Keskar et al propose a sharpness metric that measures how much the loss function varies in a neighborhood around a minimizer. First, we define the neighborhood as follows:

Figure 19: Constraint box within which to maximize the loss. From Keskar et al [1].

where epsilon is a parameter defining the size of the neighborhood and x is the minimizer (the weights).

Then, we define the sharpness metric as the maximum loss in this neighborhood around the minimizer:

Figure 20: Sharpness metric definition. From Keskar et al [1].

where f is the loss function, with the inputs being the weights.
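
Putting Figures 19 and 20 together, the two definitions can be reconstructed as:

C_\epsilon = \{\, z : |z_i| \leq \epsilon \, (|x_i| + 1) \ \text{for all } i \,\}
\text{sharpness}(x, \epsilon) = \max_{z \in C_\epsilon} f(x + z)

(The full metric in Keskar et al [1] additionally subtracts f(x) and normalizes by 1 + f(x); the simplified form above follows the description in the text.)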

With the definitions above, let’s compute the sharpness of the minimizers at various batch sizes, with an epsilon value of 1e-3:

Figure 21: Sharpness score by batch size

This shows that the large batch minimizers are indeed sharper, as we saw in the interpolation plot.

Lastly, let’s try plotting the minimizers with a filter-normalized loss visualization, as formulated by Li et al [2]. This type of plot chooses two random direction vectors with the same dimensions as the model weights, and then normalizes each convolutional filter (or neuron, in the case of FC layers) in those directions to have the same norm as the corresponding filter in the model weights. This ensures that the sharpness of a minimizer is not affected by the magnitudes of its weights. It then plots the loss along these two directions, with the center of the plot being the minimizer we wish to characterize. (A sketch of the normalization step follows below.)
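
As a rough sketch of the filter normalization step (assuming the weights and a random direction are lists of NumPy arrays whose last axis indexes filters/neurons; the handling of bias vectors below is a common choice, not necessarily the one used for Figure 22):

import numpy as np

def filter_normalize(direction, weights):
    # Rescale each filter/neuron in `direction` so its norm matches the
    # corresponding filter/neuron in `weights`
    normalized = []
    for d, w in zip(direction, weights):
        d = d.copy()
        if d.ndim > 1:  # conv kernels and dense matrices: last axis indexes filters/neurons
            for j in range(d.shape[-1]):
                d[..., j] *= np.linalg.norm(w[..., j]) / (np.linalg.norm(d[..., j]) + 1e-10)
        else:           # bias vectors: zero out the direction, a common simplification
            d[:] = 0.0
        normalized.append(d)
    return normalized

# The loss surface is then evaluated at weights + alpha * d1 + beta * d2 over a 2D grid.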

Figure 22: 2D filter-normalized plots for batch size 32 (left) and 256 (right)

Again, we can see from the contour plots that the loss varies more sharply for the large batch minimizer.

Can we improve performance on large batch sizes by increasing the learning rate?

In Hypothesis 1, we saw that both update size and update frequency per epoch were lower for large batch size, and in Hypothesis 2, we saw that the large batch size fails to explore as large a region as the small batch size. Knowing this, can we make large batch training perform better by simply increasing the learning rate?

This approach has been suggested previously, for example by Goyal et al [3]:

Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.

Let’s try this out, with batch sizes 32, 64, 128, and 256. We will use a base learning rate of 0.01 for batch size 32, and scale accordingly for the other batch sizes.
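
In code, the scaling is a one-liner (a sketch):

base_batch_size, base_lr = 32, 0.01
for batch_size in (32, 64, 128, 256):
    lr = base_lr * batch_size / base_batch_size  # linear scaling rule
    print(batch_size, lr)  # 32 -> 0.01, 64 -> 0.02, 128 -> 0.04, 256 -> 0.08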

Figure 23: Training and validation loss for different batch sizes, with adjusted learning rates
Figure 24: Minimum training and validation losses by batch size

Indeed, we find that adjusting the learning rate does eliminate most of the performance gap between small and large batch sizes. Now, batch size 256 achieves a validation loss of 0.352 instead of 0.395 — much closer to batch size 32’s loss of 0.345.

How does increasing the learning rate affect the training time? Since large batch training can now converge in roughly the same number of iterations as small batch training, as seen in the left plot in Figure 25, it now takes less time overall to train — 2197 seconds for batch size 256, compared to 3156 for batch size 32. The speedup is even more pronounced if we parallelize across 4 GPUs.

Figure 25. Left: number of training epochs until validation loss converges. Right: Overall training time until convergence.

Does this mean that the large batch sizes are now converging to flat minimizers? If we plot the sharpness scores, we can see that adjusting the learning rate does indeed make the large batch minimizers flatter:

Figure 26: Comparison of sharpness with and without the learning rate adjustment

Interestingly, although adjusting the learning rate makes the large batch minimizers flatter, they are still sharper than the smallest batch size’s minimizer (sharpness scores of roughly 4–7, compared to 1.14). Why this is the case remains a question for future investigation.

Are the larger batch size training runs now ending up as far from the initial weights as the small batch ones?

Figure 27: Distance from initial weights by batch size, before and after adjustment

The answer is, for the most part, yes. If we look at the plot above, adjusting the learning rate helps close the gap between batch size 32 and the other batch sizes in terms of distance from initial weights. (Note that 128 seems to be an anomaly where increasing the learning rate decreased the distance — why this is the case remains open for future investigation.)

Is it always the case that small batch training will outperform large batch training?

Given the observations above, and the literature, we might expect that small batch training would always outperform large batch training if we hold the learning rate constant. In fact, this is not true, as we can see when we use learning rate 0.08:

Figure 28: Validation loss by batch size on a higher learning rate

Here, we see that batch size 64 in fact outperforms batch size 32! This is because the learning rate and batch size are closely linked — small batch sizes perform best with smaller learning rates, while large batch sizes do best on larger learning rates. We can see this phenomenon below:

Figure 29: Effect of learning rate on val loss for different batch sizes.

We see that learning rate 0.01 is the best for batch size 32, whereas 0.08 is the best for the other batch sizes.

Thus, if you notice that large batch training is outperforming small batch training at the same learning rate, this may indicate that the learning rate is larger than optimal for the small batch training.

Conclusion

So, what does this all mean? What can we take away from these experiments?

Linear scaling rule: when the minibatch size is multiplied by k, multiply the learning rate by k. Although we initially found large batch sizes to perform worse, we were able to close most of the gap by increasing the learning rate. We saw that this is because larger batch sizes apply smaller batch updates, as the gradient vectors within a batch point in different directions and partially cancel one another.

When the right learning rate is chosen, larger batch sizes can train faster, especially when parallelized. With large batch sizes, we are less limited by the sequential nature of SGD updates, as we do not encounter the overhead associated with sequentially loading many small batches into memory. We can also parallelize the computations across training examples.

However, when the learning rate is not adjusted upward for larger batch sizes, then large batch training may take even longer than small batch training because it requires more training epochs to converge. Thus, you need to adjust the learning rate in order to realize the speedup from larger batch sizes and parallelization.

Large batch sizes, even with adjusted learning rates, performed slightly worse in our experiments, but more data is needed to determine whether larger batch sizes perform worse in general. We still observe a slight performance gap between the smallest batch size (val loss 0.343) and the largest one (val loss 0.352). Some have suggested that small batches have a regularizing effect because they introduce noise into the updates that helps training escape the basins of attraction of suboptimal local minima [1]. However, the results from these experiments suggest that the performance gap is relatively small, at least for this dataset. This suggests that as long as you find the right learning rate for your batch size, you can concentrate on other aspects of training that may have a greater impact on performance.

Github

The code and figures can be found here.

References

  1. Keskar, Nocedal, Mudigere, Smelyanskiy, and Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. https://arxiv.org/pdf/1609.04836.pdf
  2. Li, Xu, Taylor, Studer, and Goldstein. Visualizing the Loss Landscape of Neural Nets. https://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets.pdf.
  3. Goyal, Dollar, Girshick, Noordhuis, Wesolowski, Kyrola, Tulloch, Jia, and He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf.
