Influence of large batches

My deep learning notes

Mastafa Foufa
Oct 26, 2022

“This could explain why the batch updates or larger batch sizes tend to be smaller — the sum of gradient vectors becomes larger, but cannot fully offset the larger denominator |B_k|”

Context: Models in NLP are using more and more data. They are also using bigger batches. Why are they using bigger batches?

Take RoBERTa as an example: they train their model with a batch size of 2K samples. In their experiments, perplexity decreases as the batch size increases up to 2K samples, and then goes back up beyond that. Since perplexity is the inverse probability of a sequence, normalized by its length, the lower it is, the better.

Perplexity is used as an intrinsic evaluation metric for language models.
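To make that definition concrete, here is a minimal sketch of how perplexity falls out of the per-token probabilities a model assigns to a sequence (the probabilities below are made up for illustration):

```python
import math

# Hypothetical per-token probabilities assigned by some language model
# to the tokens of one sequence (values made up for illustration).
token_probs = [0.2, 0.1, 0.05, 0.3]

# Perplexity = inverse probability of the sequence, normalized by its length,
# i.e. exp of the average negative log-probability per token.
avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_neg_log_prob)

print(perplexity)  # lower is better
```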

Cool medium post about perplexity: https://towardsdatascience.com/perplexity-in-language-models-87a196019a94

Some of my notes in this space below.

We know how things work. We start by creating our training data. We divide this training data into batches, and for each batch we update the parameters so as to get closer to a good local minimum of the overall loss function. This is done by moving in the opposite direction of "the slope" of the loss function, which is described in a multidimensional space by the gradient of this loss function.
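As a minimal sketch of that loop (a toy linear-regression setup made up for illustration, not any particular NLP model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy linear target (made up for illustration).
X = rng.normal(size=(1000, 5))
true_theta = rng.normal(size=5)
y = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(5)        # parameters we want to learn
learning_rate = 0.1
batch_size = 32

for epoch in range(10):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        X_b, y_b = X[batch], y[batch]
        # Gradient of the mean squared error on this batch.
        grad = 2 * X_b.T @ (X_b @ theta - y_b) / len(batch)
        # Move in the opposite direction of the "slope".
        theta -= learning_rate * grad
```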

We also remember that our data is not perfect. So working with batches allows us, by the law of large numbers, to converge towards the expected loss over the true underlying population. Remember, we only work with estimates here, because we only have a sample of the true underlying population.

The debate is around the value-add of having a large batch size |B_k|.

theta <- theta - learning_rate * (1/|B_k|) * sum_i(grad_i(theta))

The sum of the gradients, evaluated at the parameter values at time t, is taken over all observations sampled in the batch.
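In code, if we stack each grad_i(theta) as a row of a matrix (the gradient values below are placeholders), the update is just:

```python
import numpy as np

# per_sample_grads[i] stands for grad_i(theta) for the i-th observation
# of the batch; shape (B_k, n_params). Values are placeholders.
per_sample_grads = np.random.default_rng(1).normal(size=(64, 10))
theta = np.zeros(10)
learning_rate = 0.01

# theta <- theta - learning_rate * (1/|B_k|) * sum_i grad_i(theta)
theta = theta - learning_rate * per_sample_grads.sum(axis=0) / len(per_sample_grads)
# equivalently: theta -= learning_rate * per_sample_grads.mean(axis=0)
```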

What I know, like I said above, is that the larger |B_k| is, the closer I get to the expected value of the slope, given a word sampled from the true underlying population and the current parameters theta.

Let’s compare a situation where we have B_k and 2*B_k.

sum_grad(B_k)/B_k < sum_grad(2*B_k)/(2*B_k)

if and only if 2*sum_grad(B_k) < sum_grad(2*B_k)

So, compared to a batch of size B_k, for the new update to be larger we would need sum_grad(2*B_k) to be more than twice the initial sum.
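To get a feel for this condition, here is a toy numerical check (synthetic gradients, a shared direction plus noise, all made up for illustration) comparing the averaged update for a batch of size B_k with the one for the same batch extended to 2*B_k:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy per-sample gradients: a shared "true" direction plus noise.
true_grad = np.array([1.0, -2.0, 0.5])
noise = 5.0

def sample_grads(n):
    return true_grad + noise * rng.normal(size=(n, 3))

B_k = 64
grads_small = sample_grads(B_k)                              # batch of size B_k
grads_large = np.vstack([grads_small, sample_grads(B_k)])    # same batch extended to 2*B_k

update_small = grads_small.sum(axis=0) / B_k
update_large = grads_large.sum(axis=0) / (2 * B_k)

# The doubled batch gives a larger averaged update only if the extra samples
# contribute more than the first B_k did, i.e. 2*sum_grad(B_k) < sum_grad(2*B_k).
print(np.linalg.norm(update_small), np.linalg.norm(update_large))
```

The second batch reuses the first B_k gradients, so the comparison matches the inequality above exactly.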

In general, it's hard to tell what would happen. But if I push things to the extreme, with B_k large, the condition looks like this:

2*(sum_grad(B_k)/B_k) < sum_grad(2*B_k)/B_k

The asymptotic behavior looks something like this: both sum_grad(B_k)/B_k and sum_grad(2*B_k)/(2*B_k) converge towards the expected gradient, so each side tends to twice that expected gradient and the condition becomes

2*Expected_grad < 2*Expected_grad -> Impossible

In other words, this is impossible. It also means that, once the batch size is big enough, scaling it up further does not give us larger updates.
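One way to see this empirically (a toy simulation with synthetic gradients, not RoBERTa): as the batch size grows, the averaged gradient settles towards the expected gradient, so the size of the update stops growing.

```python
import numpy as np

rng = np.random.default_rng(3)
true_grad = np.array([1.0, -2.0, 0.5])   # stands in for the expected gradient
noise = 5.0

for batch_size in [8, 32, 128, 512, 2048]:
    grads = true_grad + noise * rng.normal(size=(batch_size, 3))
    avg = grads.mean(axis=0)
    # The average converges to the expected gradient; its norm does not
    # keep growing as the batch size is scaled up.
    print(batch_size, np.linalg.norm(avg))
```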

The authors of the quote explain it in a different way that I am not fully aligned with: they say that the gradients across the larger batch point in different directions, so that their sum cannot fully offset the larger denominator |B_k|.

OK. So whichever way we look at it, we kind of sense that when we increase the batch size, we get smaller updates to the parameters of our model under gradient descent.

Hum, is that a good thing or not?

I actually can't tell just like that whether smaller updates are good for a language model. My intuition is that if I am not too confident about the gradient values, i.e. not confident about my training data, then it makes sense not to give them too much credit, and hence to move slowly along the loss curve.

That said, this control over the step size can also be achieved through the learning rate, and my intuition above largely comes from my past readings on how to set the learning rate.

Let's keep thinking. How can I explain that the RoBERTa authors got a better (lower) perplexity by increasing their batch size?

Well, firstly, like I said earlier, the larger your batch, the closer you get to the expected value of your pseudo-slope (the gradient of your loss function given the current parameters of your model). I guess it is better to move in the opposite direction of the true pseudo-slope rather than of a bad estimate of it, right?

Time’s up, I’ll keep thinking about it another day. ;)
