Influence of large batches

My deep learning notes

Mastafa Foufa
4 min read · Oct 27, 2022

Lately I have been preparing my lectures for EPITA Paris and started thinking about the influence of large batches. Many recent NLP models are trained with larger batches and on more data.

My previous article on large batches: https://medium.com/@mastafa.foufa/influence-of-large-batches-5f1d8a00891c

For example, RoBERTa is trained on 160 GB of data vs. 16 GB for BERT. The authors note that RoBERTa is trained on more data, with larger batches, and for longer.

Where is this analysis coming from?

When I read that they train their model with larger batches, it caught my attention because I was pretty sure I had read somewhere that larger batches may lead to a drop in performance.

In the last article, I just shared my raw notes, as I didn't have time to dig deeper for resources. This time, alongside those notes, I have an article about sharp and flat minimizers. The authors conclude that large batches tend to drive the loss function toward local minima that are sharp minimizers, whereas small mini-batches tend to find flat minimizers.

I’ll try to understand that better and share my notes in the next few lines.

The resource is the following: On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (Keskar et al., ICLR 2017).
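To get an intuition for what "sharp vs. flat" means in practice, here is a minimal toy sketch in PyTorch. It is my own illustration, not the paper's experimental protocol: train the same small network with a small and a large batch size, then perturb the weights with random noise and measure how much the training loss increases. The synthetic data, network size, learning rate, and noise scale are all arbitrary assumptions chosen for illustration.

```python
# Toy sketch (my own, not the paper's setup): compare a small-batch and a
# large-batch run of the same network, then probe "sharpness" by measuring
# how much the training loss grows when the weights are randomly perturbed.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic binary-classification data (hypothetical, for illustration only).
X = torch.randn(4096, 20)
y = (X[:, :5].sum(dim=1) > 0).float()

def make_model():
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

def train(batch_size, epochs=30, lr=0.05):
    """Train the same architecture with plain SGD at a given batch size."""
    model = make_model()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        perm = torch.randperm(len(X))
        for i in range(0, len(X), batch_size):
            idx = perm[i:i + batch_size]
            opt.zero_grad()
            loss = loss_fn(model(X[idx]).squeeze(-1), y[idx])
            loss.backward()
            opt.step()
    return model

def sharpness_proxy(model, sigma=0.01, trials=20):
    """Average increase in training loss under random weight perturbations.
    A rough stand-in for the sharpness measure discussed in the paper."""
    loss_fn = nn.BCEWithLogitsLoss()
    with torch.no_grad():
        base = loss_fn(model(X).squeeze(-1), y).item()
        increases = []
        for _ in range(trials):
            saved = [p.clone() for p in model.parameters()]
            for p in model.parameters():
                p.add_(sigma * torch.randn_like(p))  # perturb in place
            increases.append(loss_fn(model(X).squeeze(-1), y).item() - base)
            for p, s in zip(model.parameters(), saved):
                p.copy_(s)  # restore original weights
    return sum(increases) / trials

for bs in (32, 2048):
    model = train(bs)
    print(f"batch size {bs}: sharpness proxy = {sharpness_proxy(model):.4f}")
```

The idea is that if the large-batch run ends in a sharper minimum, the same perturbation should increase its loss more. That sharpness is what the authors connect to the generalization gap observed with large-batch training.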

One of the comments on the last article suggested that I also approach things from a computational side, i.e. that…

