An Easy and Useful Guide to Batch Gradient Descent

Akmel Syed
Dec 27, 2020 · 6 min read

What is Batch Gradient Descent?

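The question you’re probably asking right now is, “what is batch gradient descent and how does it differ from normal gradient descent?” Batch gradient descent, as the term is used in this article, splits the training data up into smaller chunks (batches) and performs a forward propagation and a backpropagation for each batch. This allows us to update our weights multiple times in a single epoch. (A note on naming: most of the literature calls this mini-batch gradient descent and reserves “batch gradient descent” for a single update computed on the whole dataset. I’ll stick with “batch gradient descent” for the chunked version throughout.)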
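To make the idea concrete, here is a minimal sketch of how one epoch could be split into batches. The helper name iterate_batches and its parameters are my own, made up for illustration; it only produces the indices for each batch, not the training step itself.

```python
import random

def iterate_batches(n_samples, batch_size, shuffle=True, seed=None):
    """Yield lists of sample indices, one list per batch (hypothetical helper)."""
    indices = list(range(n_samples))
    if shuffle:
        # Shuffle so that each batch is a random sample of the data
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

batches = list(iterate_batches(1000, 64, seed=0))
print(len(batches))      # 16 batches, so 16 weight updates per epoch instead of 1
print(len(batches[-1]))  # 40, the last batch holds whatever is left over
```

Each list of indices would be used to slice the features and labels before a forward and backward pass.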

What Are the Benefits?

Performing calculations on small batches of the data, rather than on all our data at once, is beneficial in a few ways:

  1. It’s less straining on memory. Think about a million 4K images: holding all of them in memory at once is extremely taxing.
  2. Because we’re performing multiple weight updates in a single epoch, we’re able to converge (get to the bottom of our hill) in fewer epochs.
  3. Splitting up our data into batches means our model only looks at a small sample of our data at every iteration (a random sample, if the data is shuffled). The noise this adds to each update often helps the model generalize better. Better generalization = less chance of overfitting.
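To put rough numbers on the memory point, here is a back-of-the-envelope calculation. It assumes uncompressed 8-bit RGB images at 3840 x 2160, which is my assumption and not something taken from a real dataset.

```python
# Assumption: one 4K image = 3840 x 2160 pixels, 3 channels, 1 byte per channel
width, height, channels = 3840, 2160, 3
bytes_per_image = width * height * channels

all_images = bytes_per_image * 1_000_000   # the whole dataset at once
one_batch = bytes_per_image * 64           # a single batch of 64

print(f"one image:      {bytes_per_image / 1e6:.1f} MB")   # 24.9 MB
print(f"million images: {all_images / 1e12:.1f} TB")       # 24.9 TB
print(f"batch of 64:    {one_batch / 1e9:.1f} GB")         # 1.6 GB
```

A batch of 64 fits comfortably on a single machine; the full dataset does not.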

Why Does It Work?

One of the questions I had when I first came across batch gradient descent was, “we’re asked to gather as much data as we can only to break that data up into small chunks? I don’t get it… ”

I’m going to go over an example (with code) to show why breaking our data into smaller chunks actually works.

Before I show the example, we’re going to have to import a few libraries.

Now that we’ve imported our libraries, using sklearn, we’re going to make an example dataset. It’s going to be a regression line made up of 1000 points.
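A minimal sketch of how such a dataset could be generated with sklearn is shown below. The noise level and random seed are assumptions on my part; only the 1000 points come from the description above.

```python
from sklearn.datasets import make_regression

# Assumed settings: a single feature, some Gaussian noise and a fixed seed
X, y = make_regression(n_samples=1000, n_features=1, noise=10.0, random_state=42)

print(X.shape, y.shape)  # (1000, 1) (1000,)
```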

Let’s look at our example dataset by using matplotlib to plot it.

Cool. It looks exactly like we expected it to look. 1000 points and a regression line.

Now, something I want to show is that when we take a random sample of 64 points (i.e., our batch size), our random sample is a good representation of our full dataset.

To see this in action, let’s plot 10 different sets, each containing 64 randomly sampled points.

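# Draw 10 random batches of 64 indices and plot each one
# (assumes random, matplotlib.pyplot as plt, and the X and y created earlier)
for _ in range(10):
    rand_indices = random.sample(range(1000), k=64)
    plt.scatter(X[rand_indices], y[rand_indices])
    plt.show()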

I hope it’s making sense. Although we’re only plotting 64 random points, those 64 points give us a very good understanding of the shape and direction of the 1000 points. The argument batch gradient descent makes is that given a good representation of a problem (this good representation is assumed to be present when we have a lot of data), a small random batch (e.g., 64 data points) is sufficient to generalize our larger dataset.
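The same argument can be checked with numbers instead of plots. The sketch below uses a synthetic line built with NumPy (an assumption, it is not the sklearn dataset from above) and compares the slope fitted on all 1000 points with the slopes fitted on 10 random batches of 64.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: y = 3x plus a little Gaussian noise
X = rng.normal(size=1000)
y = 3.0 * X + rng.normal(scale=0.5, size=1000)

# Least-squares slope using every point
full_slope = np.polyfit(X, y, 1)[0]

# Least-squares slope using only 64 randomly chosen points, repeated 10 times
batch_slopes = []
for _ in range(10):
    idx = rng.choice(1000, size=64, replace=False)
    batch_slopes.append(np.polyfit(X[idx], y[idx], 1)[0])

print(f"full dataset slope: {full_slope:.3f}")
print("batch slopes:", np.round(batch_slopes, 3))
```

Every batch should land close to the slope of the full dataset, which is what lets a gradient computed on 64 points stand in for the gradient on all 1000.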

Implementing Batch Gradient Descent

Now that we’ve gone over the what and the why, let’s go over the how. We’ll end this article off with how to implement batch gradient descent in code.

Let’s start off by importing a few useful libraries.

Next, let’s import our dataset and do a little bit of preprocessing on it. The dataset we’ll be working with is the Pima Indians Diabetes dataset. We’ll import it, split it into a train and test set and then standardize both the train and the test sets, while converting them into PyTorch tensors.

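# Assumes df is a pandas DataFrame that already holds the Pima Indians Diabetes data,
# with the label in its last column, 'Outcome'
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df[df.columns[:-1]].values
y = df['Outcome'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fit the scaler on the training set only, then reuse its statistics on the test set
scaler = StandardScaler()
X_train = torch.tensor(scaler.fit_transform(X_train))
X_test = torch.tensor(scaler.transform(X_test))
y_train = torch.tensor(y_train)
y_test = torch.tensor(y_test)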

Now, we’re going to need our neural network. We’ll build a feed-forward neural network with a single hidden layer of 4 nodes.
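A minimal sketch of what such a network could look like in PyTorch follows. The 8 input features match the Pima Indians Diabetes dataset; the ReLU activation and the sigmoid on the output are assumptions on my part (the sigmoid is there so the output can be fed straight into a BCE loss).

```python
import torch
from torch import nn

class Model(nn.Module):
    def __init__(self, n_features=8, n_hidden=4):
        super().__init__()
        self.hidden = nn.Linear(n_features, n_hidden)  # hidden layer with 4 nodes
        self.out = nn.Linear(n_hidden, 1)              # single output node

    def forward(self, x):
        x = torch.relu(self.hidden(x))       # assumed activation
        return torch.sigmoid(self.out(x))    # probability of the positive class

model = Model()
probs = model(torch.randn(5, 8))   # 5 made-up rows with 8 features each
print(probs.shape)                 # torch.Size([5, 1])
```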

Let’s create a function to show accuracy as a metric (our loss is BCE). I like doing this because BCE isn’t really human readable, but accuracy is very human friendly. We’ll also set up a few variables to reuse.
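One way such an accuracy function could be written is sketched below. The 0.5 threshold and the exact signature are assumptions; the only thing taken from the article is that it is called as accuracy(y_pred, y_true).

```python
import torch

def accuracy(y_pred, y_true):
    """Fraction of correct predictions, given probabilities and 0/1 labels."""
    # Turn probabilities into hard 0/1 predictions (assumed threshold of 0.5)
    predicted = (y_pred.reshape(-1) >= 0.5).long()
    return (predicted == y_true.reshape(-1).long()).float().mean().item()

y_pred = torch.tensor([[0.9], [0.2], [0.6], [0.4]])
y_true = torch.tensor([1, 0, 0, 0])
print(accuracy(y_pred, y_true))  # 0.75, three of the four predictions are right
```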

epochs = 1000+1
print_epoch = 100
lr = 1e-2

Our print_epoch variable just tells our code how often we want to see our metrics (i.e., BCE and accuracy).

Let’s instantiate our Model class and set our loss (BCE) and optimizer.
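As a rough sketch, that setup could look like the following. The nn.Sequential network is only a stand-in for the Model class so that the snippet runs on its own, and plain SGD is an assumption: the choice of optimizer isn’t stated here, and Adam or another optimizer would slot in the same way.

```python
import torch
from torch import nn

lr = 1e-2

# Stand-in for the Model class: 8 inputs, 4 hidden nodes, 1 sigmoid output
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 1), nn.Sigmoid())

# Binary cross-entropy on probabilities, matching a sigmoid output
BCE = nn.BCELoss()

# Assumed optimizer: plain stochastic gradient descent
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
```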

Awesome, we can finally train our model. Let’s first do it without batch gradient descent and then with. It’ll help us compare.

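train_loss, test_loss = [], []

for epoch in range(epochs):
    # Training: a single forward and backward pass over the whole training set
    y_pred = model(X_train.float())
    loss = BCE(y_pred, y_train.reshape(-1,1).float())
    train_loss.append(loss.item())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if(epoch % print_epoch == 0):
        print('Train: epoch: {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, train_loss[-1], accuracy(y_pred, y_train)))

    # Evaluation: no gradients are needed on the test set
    with torch.no_grad():
        y_pred = model(X_test.float())
        loss = BCE(y_pred, y_test.reshape(-1,1).float())
    test_loss.append(loss.item())

    if(epoch % print_epoch == 0):
        print('Test: epoch: {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, test_loss[-1], accuracy(y_pred, y_test)))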

As expected, the results aren’t great. Without batch gradient descent, each epoch gives us only one weight update, so 1000 epochs isn’t that much for such a complex dataset.

Let’s rerun it, except this time, with batch gradient descent. We’ll reinstantiate our Model class and reset our loss (BCE) and optimizer. We’ll also set our batch size to 64.
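The training loop also needs to know how many batches there are in each set. A minimal sketch of that calculation is below; the helper n_batches is hypothetical, and the row counts of 514 and 254 are what a 0.33 test split would give if the dataset has 768 rows.

```python
import math

batch_size = 64

def n_batches(n_samples, batch_size):
    # Round up so the leftover samples still get their own (smaller) batch
    return math.ceil(n_samples / batch_size)

train_batches = n_batches(514, batch_size)  # assumed number of training rows
test_batches = n_batches(254, batch_size)   # assumed number of test rows

print(train_batches, test_batches)  # 9 4
```

Rounding up matters: with 514 training rows, rounding down would silently drop the last 2 rows from every epoch.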

Great. Now that we have that done, let’s run it and see the difference.

for epoch in range(epochs):

    iteration_loss = 0.
    iteration_accuracy = 0.

    # Training: one weight update per batch
    for i in range(train_batches):
        beg = i*batch_size
        end = (i+1)*batch_size

        y_pred = model(X_train[beg:end].float())
        loss = BCE(y_pred, y_train[beg:end].reshape(-1,1).float())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        iteration_loss += loss.item()
        iteration_accuracy += accuracy(y_pred, y_train[beg:end])

    if(epoch % print_epoch == 0):
        print('Train: epoch: {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, iteration_loss/(i+1), iteration_accuracy/(i+1)))

    iteration_loss = 0.
    iteration_accuracy = 0.

    # Evaluation: average the metrics over the test batches, no gradients needed
    with torch.no_grad():
        for i in range(test_batches):
            beg = i*batch_size
            end = (i+1)*batch_size

            y_pred = model(X_test[beg:end].float())
            loss = BCE(y_pred, y_test[beg:end].reshape(-1,1).float())

            iteration_loss += loss.item()
            iteration_accuracy += accuracy(y_pred, y_test[beg:end])

    if(epoch % print_epoch == 0):
        print('Test: epoch: {0} - loss: {1:.5f}; acc: {2:.3f}'.format(epoch, iteration_loss/(i+1), iteration_accuracy/(i+1)))

Interesting. Before we get into the results, you’ll see that the code is similar, but we have a few extra elements. You’ll see that our loss and accuracy are actually now an average of all of the batches in the epoch. Also, I have an extra for loop both in the training and evaluation. This loop is what allows us to iterate through our data, splitting it into batches of size 64.
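Slicing by index like this walks through the batches in the same order every epoch. If you want the batches reshuffled each epoch, one alternative (not what the loop above does) is PyTorch’s DataLoader. The sketch below uses made-up tensors of an assumed size, named X_demo and y_demo, purely to show the mechanics.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Made-up stand-ins for the training tensors: 514 rows, 8 features
X_demo = torch.randn(514, 8)
y_demo = torch.randint(0, 2, (514,))

# shuffle=True draws the batches in a new random order every epoch
loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=64, shuffle=True)

sizes = [len(xb) for xb, yb in loader]
print(len(loader), sizes[-1])  # 9 2: nine batches, the last one holding 2 rows
```

Inside a training loop, "for xb, yb in loader:" would replace the manual beg and end bookkeeping.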

In terms of the result, you’ll see that it significantly outperforms training our model without batch gradient descent. In the same number of epochs, our model jumped from 66% accuracy on the test set to 74% and our BCE went from 0.62 to 0.51.

Choosing the Right Batch Size

The question that bothered me for a long time, and that you’re probably asking yourself right now, is: how do we choose the right batch size? A lot of research has been done on this topic. In practice, batch sizes are usually powers of 2 (e.g., 32, 64, 128, etc.), typically 2^5 or larger. It depends a lot on your data as well. If each data point is expensive to hold in memory (e.g., 4K images), then a smaller batch size may be a better idea. I usually stick to 64, but you can try a few different sizes and see how each one affects training on your dataset.
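Trying a few sizes can be as simple as the sketch below. It is a toy experiment on a synthetic line with a hand-written linear model in NumPy, not the diabetes model from above, and the function train_linear and all of its settings are assumptions chosen for illustration.

```python
import numpy as np

def train_linear(X, y, batch_size, epochs=5, lr=0.05, seed=0):
    """Fit y = w*x + b with mini-batches and return the final MSE on all the data."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)              # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = w * X[idx] + b - y[idx]
            # Gradient of the batch MSE with respect to w and b
            w -= lr * 2.0 * np.mean(err * X[idx])
            b -= lr * 2.0 * np.mean(err)
    return float(np.mean((w * X + b - y) ** 2))

# Assumed toy data: y = 3x + 1 plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.normal(size=1000)
y = 3.0 * X + 1.0 + rng.normal(scale=0.5, size=1000)

# A batch size of 1000 means one update per epoch, i.e. no batching at all
results = {bs: train_linear(X, y, bs) for bs in (32, 64, 128, 1000)}
for bs, mse in results.items():
    print(f"batch_size={bs:>4}: final MSE={mse:.4f}")
```

With the number of epochs held fixed, the smaller batch sizes get many more weight updates, so they should end up with a noticeably lower error than the run that uses all 1000 points per update.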

As always, you can run the code in Google Colab.

A Coder’s Guide to AI

You don’t need a PhD to dive into AI
