Accumulate Gradients and You Are Good To Go

A simple answer to “I can’t increase the batch size because of limited GPU memory”

Harshit Sharma
3 min read · Jun 4, 2023

Many times while training deep neural networks, we wish we had a bigger GPU to accommodate a BIG batch size. But we don’t, and that leads to:

  • slower convergence and
  • lower accuracy

Let’s revisit the typical training loop in PyTorch, sketched below. We will use it as the starting point for understanding Gradient Accumulation.
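Here is a minimal, self-contained sketch of such a loop; the toy model, optimizer, loss function, and random data are illustrative assumptions, not part of the original post.

```python
import torch
from torch import nn

# Toy setup (illustrative only)
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Fake "big" batches of 64 samples each
batches = [(torch.randn(64, 10), torch.randint(0, 2, (64,))) for _ in range(10)]

for inputs, targets in batches:
    optimizer.zero_grad()             # clear gradients from the previous batch
    outputs = model(inputs)           # forward pass
    loss = loss_fn(outputs, targets)
    loss.backward()                   # compute gradients
    optimizer.step()                  # update parameters immediately, every batch
```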

Let’s understand the solution, Gradient Accumulation, in 3 steps:

Step 1: Divide the BIG batch into smaller batches

Dividing here just means keeping the batch size small enough that it fits in GPU memory.
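As a toy calculation (the numbers are purely illustrative), the effective batch size is the mini-batch size that fits in memory multiplied by the number of accumulation steps:

```python
# Illustrative numbers only
desired_batch_size = 192   # the "big" batch we wish we could train with
mini_batch_size = 64       # the largest batch that fits in GPU memory
accumulation_steps = desired_batch_size // mini_batch_size  # -> 3
```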

Step 2: Accumulate Gradients

This is the core step.

We can’t fit a large batch size, but how about we keep track of the gradients calculated for each small batch and act on them together? The idea is to keep accumulating gradients across mini-batches until we reach a chosen number of accumulation steps.

In the example below, we accumulate gradients for 3 steps.

How to implement this in code? Simply don’t call “optimizer.step()” and “optimizer.zero_grad()” after every backward pass. Since loss.backward() adds new gradients into the existing .grad buffers, the gradients of each trainable parameter keep accumulating on their own, as the tiny sketch below shows.
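A minimal demonstration of this behaviour (the toy tensor is an illustrative assumption): calling backward() twice without zeroing the gradients makes PyTorch add the new gradients to the existing ones.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

loss1 = (w * 2).sum()
loss1.backward()
print(w.grad)  # tensor([2.])

loss2 = (w * 3).sum()
loss2.backward()
print(w.grad)  # tensor([5.])  <- 2 + 3, gradients accumulated
```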

Step 3: Update parameter values using Accumulated gradients

Once we have accumulated gradients for “accumulation steps” mini-batches, we go ahead and do the parameter update with optimizer.step(), and then reset the gradients with optimizer.zero_grad() before starting the next accumulation cycle (the complete loop is sketched below).

But there is a catch

Won’t the gradients be huge since we are accumulating them?

They will be, and that is why we need to normalize them by the number of accumulation steps. In practice, we simply divide each mini-batch loss by the accumulation steps before calling backward(), which scales every accumulated gradient by the same factor.
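Putting the three steps and the normalization together, here is a minimal, self-contained sketch of the full recipe; the toy model, random data, and the choice of 3 accumulation steps are illustrative assumptions.

```python
import torch
from torch import nn

# Toy setup (illustrative only)
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Small mini-batches of 4 samples each that fit in GPU memory
mini_batches = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(6)]
accumulation_steps = 3

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(mini_batches):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

    # Normalize so the accumulated gradient matches the average over the "big" batch
    loss = loss / accumulation_steps
    loss.backward()  # backward() adds into each parameter's .grad

    # Update parameters only once every `accumulation_steps` mini-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # apply the accumulated gradients
        optimizer.zero_grad()  # reset .grad buffers for the next cycle
```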

Summary: accumulate gradients over several small mini-batches, normalize the loss by the number of accumulation steps, and call optimizer.step() (followed by optimizer.zero_grad()) only once per accumulation cycle. This mimics training with a much larger batch on the same GPU.

Hope you enjoyed this quick read!

Originally published at Intuitive Shorts.

Follow Intuitive Shorts (a free Substack newsletter) to read quick and intuitive summaries of ML/NLP/DS concepts.
