Influence of large batches

My deep learning notes

Mastafa Foufa
5 min read · Nov 2, 2022

Hi again, it’s Wednesday morning and I’ll be sharing my personal notes on large batches in deep learning.

Have a read of my two previous articles on this topic:

https://medium.com/@mastafa.foufa/influence-of-large-batches-5f1d8a00891c

https://medium.com/@mastafa.foufa/influence-of-large-batches-ba0ad9894f11

My main resource here:

ON LARGE-BATCH TRAINING FOR DEEP LEARNING: GENERALIZATION GAP AND SHARP MINIMA

In the last article, we talked about flat and sharp minima and saw that large batches might lead to sharp minima. This is problematic because, at testing time, even a slight move away from the minimizer found during training leads to a huge increase in loss. In other words, the model learns to minimize the loss locally but has a hard time doing so at testing time on unseen data points.

Loss function at test time around the “optimal” parameters for a flat minimum (left side) and a sharp minimum (right side). For the flat minimum, the loss on a new test sample stays close to its value at training time, whereas for the sharp minimum the test loss takes enormous values.
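To make this picture a bit more concrete before we move on, here is a minimal PyTorch sketch of my own (not a metric from the paper): train a tiny model, then perturb its trained weights with small random noise and watch how much the loss grows. At a flat minimum the loss barely moves; at a sharp minimum even a tiny perturbation can blow it up. The toy data, the model and the epsilon values are all assumptions for illustration.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data and a tiny MLP, standing in for a real trained model.
X = torch.randn(256, 10)
y = X.sum(dim=1, keepdim=True) + 0.1 * torch.randn(256, 1)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

# Quick training loop so the weights end up near some minimum.
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

@torch.no_grad()
def perturbed_loss(epsilon, n_trials=10):
    # Average loss after adding Gaussian noise of scale epsilon to every parameter.
    originals = [p.detach().clone() for p in model.parameters()]
    losses = []
    for _ in range(n_trials):
        for p, p0 in zip(model.parameters(), originals):
            p.copy_(p0 + epsilon * torch.randn_like(p0))
        losses.append(loss_fn(model(X), y).item())
    # Restore the trained weights before returning.
    for p, p0 in zip(model.parameters(), originals):
        p.copy_(p0)
    return sum(losses) / len(losses)

print("unperturbed loss:", round(loss_fn(model(X), y).item(), 4))
for eps in (0.01, 0.05, 0.1):
    print(f"epsilon={eps}: average perturbed loss = {perturbed_loss(eps):.4f}")

The faster the perturbed loss grows with epsilon, the sharper the minimum the model is sitting in, which is exactly the behaviour the figure above describes.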

Below is where I left off last time; let’s take it from there:

It would be great to understand how we can ultimately end up in a sharp minimum vs. a flat minimum and get some intuition ourselves.

We have twenty minutes to go through that and try to get at least some intuition about the underlying phenomenon.

