How Should I Feed my Gradient Descent?

Taraqur Rahman
The Biased Outliers
Nov 25, 2022

Gradient Descent is always hungry for data. How should we feed the data to the gradient descent? There are three common ways to feed in data for Gradient Descent (GD): Batch, Stochastic, and Mini-Batch.

Is there a best one? Why do machine learning practitioners lean towards Mini-Batch GD? We will experiment with all three and dig into the advantages and disadvantages of each for ourselves.

INTRODUCTION

Batch Gradient Descent is when we feed the ENTIRE dataset to calculate the gradients. Once the gradients are calculated, the GD will take a step towards the minimum.

Stochastic Gradient Descent is when we only feed ONE data point to calculate the gradient. The GD will use that gradient to take a step. Then it will select another data point and repeat the process until it goes through all the data.

Mini-Batch Gradient Descent is when we feed a SMALL BATCH of data to calculate the gradient. This technique splits the data into batches and feeds one batch at a time to the GD until it goes through all the data.
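To make the three strategies concrete, here is a minimal NumPy sketch (not the code used for this blog) of one training epoch of linear regression under each feeding strategy. The learning rate lr, the gradient_step and run_epoch helpers, and the mean-squared-error loss are assumptions made for illustration; the only thing that changes between Batch, Stochastic, and Mini-Batch GD is the batch_size.

import numpy as np

# Hypothetical helper: one gradient step of linear regression (y_hat = XW + b)
# on a batch, using the mean-squared-error loss.
def gradient_step(X_batch, y_batch, W, b, lr=0.01):
    n = len(X_batch)
    error = np.dot(X_batch, W) + b - y_batch      # prediction error on this batch
    grad_W = (2 / n) * np.dot(X_batch.T, error)   # gradient of MSE w.r.t. W
    grad_b = (2 / n) * error.sum()                # gradient of MSE w.r.t. b
    return W - lr * grad_W, b - lr * grad_b

# One pass (epoch) over the data, feeding `batch_size` points at a time:
# Batch GD      -> batch_size = len(X)
# Stochastic GD -> batch_size = 1
# Mini-Batch GD -> anything in between (e.g. 64)
def run_epoch(X, y, W, b, batch_size):
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        W, b = gradient_step(X[start:stop], y[start:stop], W, b)
    return W, b

# Example usage for Mini-Batch GD:
# W, b = run_epoch(X_train, y_train, W_init, b_init, batch_size=64)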

DATA

We will work on a simple regression problem using synthesized data. The features are sampled uniformly at random. The targets are a linear combination of those features plus some noise (sampled from a normal distribution). The training and validation sets will each contain 128,000 data points.

import numpy as np

# Setting the seed for reproducibility
np.random.seed(0)

# Dataset dimensions (128,000 observations; the feature count is an example value)
number_of_observations = 128_000
number_of_features = 1

# Creating observations
X_train = np.random.rand(number_of_observations, number_of_features)
X_valid = np.random.rand(number_of_observations, number_of_features)

# Instantiating the true parameters W and b
W = np.round(np.random.rand(number_of_features, 1) * 10, 0)
b = np.round(np.random.rand(1, 1), 0)

# Creating some Gaussian noise
noise = np.random.randn(number_of_observations, 1)

# Creating y by doing XW + b plus the noise
y_train = np.dot(X_train, W) + b + noise
y_valid = np.dot(X_valid, W) + b + noise

# Printing the true coefficients
print(f"True W: {W[0][0]} \nTrue b: {b[0][0]}")

Visually, there is a linear relationship between X and y with some noise. (No surprise, since we generated the data with a linear model XD.)

EXPERIMENT

Now for the experiment, we train three simple linear regression models (implemented in TensorFlow). Each model has the same architecture; the only difference is how we feed the data into the GD algorithm. For Batch GD, we feed all the data at once. For Stochastic GD, we feed one data point at a time. For Mini-Batch GD, we feed 64 data points at a time. The models are trained for up to 100 epochs with early stopping. A rough sketch of this setup is shown below, followed by the results after training:
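The sketch below is an assumed reconstruction of that setup, not the exact code used for the blog (that is linked at the end). The SGD optimizer, learning rate, and early-stopping patience are all placeholder choices; the only thing that differs between the three runs is the batch_size passed to model.fit.

import tensorflow as tf

# One linear model per feeding strategy; only batch_size differs between runs.
def make_model(n_features):
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(n_features,))])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
    return model

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)

histories = {}
for name, batch_size in [("batch", len(X_train)),   # Batch GD: all points per step
                         ("stochastic", 1),          # Stochastic GD: one point per step
                         ("mini_batch", 64)]:        # Mini-Batch GD: 64 points per step
    model = make_model(X_train.shape[1])
    histories[name] = model.fit(X_train, y_train,
                                validation_data=(X_valid, y_valid),
                                epochs=100,
                                batch_size=batch_size,
                                callbacks=[early_stop],
                                verbose=0)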

In the plot of the fitted models, we can see that Batch GD (orange) hasn't learned enough within the 100 epochs. Stochastic GD (green) and Mini-Batch GD (purple) are almost identical, and both seem to represent the data well. Let us look at the loss and the time it took to finish training.

Based on the Losses for Batch Type (left), Stochastic GD converged in the fewest epochs (7, to be exact), followed by Mini-Batch GD at 32 epochs. Batch GD came last: it was not able to converge within 100 epochs.

However, if we look at the Elapsed Time for Training (right), Stochastic GD took the longest to finish training. Even though it needed the fewest epochs to converge, it still took the most wall-clock time. And that makes sense: we feed one data point at a time, calculate its gradient, update the parameters, pick another data point, and repeat. Convergence happens in few epochs, but calculating one gradient at a time is computationally expensive. Batch GD computes its gradients the fastest per epoch (thanks to vectorization), but it needs many epochs to converge because it must pass over the data many times. Mini-Batch GD converges almost as fast as Stochastic GD and computes almost as fast as Batch GD.
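A quick back-of-the-envelope count of parameter updates per epoch, using the 128,000 training points from the DATA section, makes that trade-off concrete:

n = 128_000  # training set size from the DATA section

# Number of parameter updates performed in one epoch for each strategy
for name, batch_size in [("Batch", n), ("Stochastic", 1), ("Mini-Batch", 64)]:
    print(f"{name} GD: {n // batch_size} updates per epoch")

# Batch GD:      1 update per epoch (one big, vectorized gradient)
# Stochastic GD: 128,000 updates per epoch (tiny, unvectorized gradients)
# Mini-Batch GD: 2,000 updates per epoch (a balance of the two)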

CONCLUSION

In summary, Batch GD is computationally the cheapest per epoch but needs the most epochs to converge. Stochastic GD converges in the fewest epochs, but spends the most wall-clock time calculating and applying gradients one point at a time. Mini-Batch GD falls in between: it converges in few epochs and in a short amount of time. Hopefully this blog visually explains why machine learning practitioners lean towards Mini-Batch Gradient Descent (and sometimes Batch).

Code used for this blog (GitHub)

A special thanks to Benedict Emoekabu for helping out with this blog!
