Stochastic gradient descent (SGD) is one of the most popular and widely used optimizers in Data Science. If you have ever implemented a Machine Learning or Deep Learning algorithm, chances are you have come across SGD or one of its variations. Optimizers are needed whenever you implement these algorithms, but let’s see why SGD in particular is important and why you need to know about it.
To understand and appreciate what SGD does, we first need to understand Gradient Descent, since all of SGD’s variations are based on it.
Gradient Descent algorithm
In simple words, the Gradient Descent (GD) algorithm helps us reach a local minimum of a function f. In Machine Learning problems, f is called the cost function or loss function. We minimize this loss function iteratively to get the parameters that give the least possible error in our predictions.
Note: Above curve is of MSE loss function. In real-life Machine Learning problems, we don’t usually get a convex curve as above.
Mathematically, the GD update can be written as:

W := W − η·ΔW

Here, W is a parameter of the model, ΔW is the partial derivative of the loss function with respect to W, and η is the learning rate.
We start at a random point on the curve by initializing the weights (parameters) with random values. We then compute the gradient at that point and move in the direction opposite to the gradient (hence the minus sign) by an amount η·ΔW. We keep doing this until we reach the minimum-loss point on the curve, i.e. when the error starts increasing.
Note: The negative sign in the Gradient Descent formula comes from the Taylor series. Our goal is to decrease the error as much as possible at each step. The first-order Taylor term is a dot product, whose cosine factor ranges over [-1, 1]; the minimum value, -1, is attained when the angle between the step direction and the gradient is 180°. So we decrease the error the most when we move in the direction opposite to the gradient. For a more detailed explanation, visit the first link in the References section.
Again Note: The collection of partial derivatives of a function with respect to each of its parameters is called the gradient.
Let’s break down the algorithm into steps:
- Take the partial derivative of the loss function with respect to each parameter.
- Initialize the parameters with random values.
- Plug the parameter values into the derivatives.
- Update the parameters using the above equation.
- Go back to step 3 and repeat until there are no more (or only negligible) updates to the parameters, or you reach the maximum number of iterations (commonly something like 10,000).
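The steps above can be sketched in a few lines of Python. This is a minimal, illustrative loop for the toy function f(w) = w², whose derivative is 2w; the learning rate, tolerance, and iteration cap are arbitrary choices for the sketch:

```python
# Minimal gradient-descent loop for the toy function f(w) = w^2,
# whose derivative is 2w. Hyperparameters are illustrative choices.
def gradient_descent(grad, w0, lr=0.1, max_iters=10_000, tol=1e-8):
    w = w0                       # step 2: random-ish initial value
    for _ in range(max_iters):
        step = lr * grad(w)      # steps 3-4: eta * dL/dw at current w
        w -= step                # move opposite the gradient
        if abs(step) < tol:      # negligible update: stop
            break
    return w

w_min = gradient_descent(lambda w: 2 * w, w0=5.0)  # approaches 0
```

For f(w) = w² the minimum is at w = 0, so the loop shrinks w toward zero until the update becomes negligible.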
Now, let’s take an example to make this clearer. We’ll use Mean Squared Error (MSE) as the loss function. For N data points, the equation is:

L = (1/N) · Σ (y_i − y-hat_i)², summing i from 1 to N
Note: MSE is also called L2 loss function
where ‘y’ is the actual output and ‘y-hat’ is the predicted value. Let’s take Linear Regression with 2 parameters as our prediction model in this example; then ‘y-hat’ is given by:

y-hat = w·x + b

where ‘w’ and ‘b’ are the parameters and ‘x’ is the feature. So now, plugging ‘y-hat’ into the MSE, we get:

L = (1/N) · Σ (y_i − (w·x_i + b))²
Taking the partial derivatives with respect to the parameters, we get:

∂L/∂w = −(2/N) · Σ (y_i − (w·x_i + b))·x_i
∂L/∂b = −(2/N) · Σ (y_i − (w·x_i + b))
Note: The derivative depends on your choice of the loss function. We have chosen MSE as our loss function. There are other loss functions like Mean Absolute Error (MAE, also called L1 loss), Cross-Entropy loss (used for classification problems) and more.
So now, step 1 is done. Next, we initialize the parameters with random values and compute the partial derivatives (the gradient). Then we update the parameter values:

w_(t+1) = w_t − η·(∂L/∂w)
b_(t+1) = b_t − η·(∂L/∂b)
Note that for t=0, ‘w’ and ‘b’ had random values.
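To make the update loop concrete, here is a short sketch of full-batch GD for this linear model; the dataset, initial values, learning rate, and iteration count are made-up illustrations, not values from the article:

```python
import numpy as np

# Full-batch GD for y-hat = w*x + b with MSE loss on a tiny
# made-up dataset (values chosen only for illustration).
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])    # underlying relationship: y = 2x
N = len(x)

w, b = 0.5, 0.0                  # stand-ins for random initial values
lr = 0.05                        # learning rate (eta)

for t in range(1000):
    y_hat = w * x + b
    grad_w = -(2 / N) * np.sum((y - y_hat) * x)   # dL/dw
    grad_b = -(2 / N) * np.sum(y - y_hat)         # dL/db
    w -= lr * grad_w             # move opposite the gradient
    b -= lr * grad_b
# w and b end up close to 2 and 0, the true parameters
```

Note that every iteration of the loop touches all N data points, which is exactly the cost discussed next.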
Problem with Gradient Descent
Now, you can see from the partial derivative equations that a single update of each parameter requires going over all N data points (the summation over i). Suppose you are working with a dataset that involves thousands of features and millions of data points (not uncommon in Machine Learning problems). Then each parameter update requires thousands of derivatives, and each derivative involves millions of terms. That is around a billion calculations per step, which is computationally very expensive and really slows down the process.
So how do we solve it?
Stochastic Gradient Descent
If you have a solid understanding of GD, then SGD is actually a very simple concept. The word ‘stochastic’ means randomly determined. In SGD, we randomly choose one point from the dataset, compute the gradient at that point, update the parameters, and repeat until convergence (when the error is no longer decreasing).
That means, instead of going over all the data points in the dataset, we take just one data point per step. So, for a million data points, each step computes a millionth as many terms, which is great. But as the Yin and Yang symbol suggests, everything has a dark side too, including SGD. So let’s see the flip side of SGD by comparing it with GD.
Note: There are other names for SGD, like online GD, iterative GD. They all mean the same thing.
SGD vs. GD
The main difference is, of course, the number of data points processed before each update of the parameters: one in the case of SGD, and all of them in the case of GD. In short, GD goes over the entire dataset once (the same as one epoch) before each update, whereas SGD randomly takes just one data point per update. A few things follow from this:
- In SGD, each update is fast to compute since we use one data point at a time, but it takes many more updates to converge. The gradient we get in SGD is not the true gradient, only an approximation of it, because we do not use all the data points to compute it. In GD, the gradient is the true gradient, so we know that if we move opposite to it, the error WILL decrease at each step. In short: fast computation, slow convergence.
- Since we use only one data point per update, SGD is heavily affected by outliers. If we randomly pick an outlier, it takes some extra updates to get back on track toward convergence, and another outlier can throw it off track again. In short, SGD is sensitive to outliers.
- The chance of overfitting is lower with SGD, as choosing data points at random makes it more generalized. For 1 million data points, the probability of a particular point being chosen on any one update is 1 in 1,000,000, and the probability of that same point being chosen on two given updates is 1 in 1,000,000,000,000, an absurdly small number (although it can happen). Overfitting happens when the model, instead of learning how to answer the question, memorizes the answer, so it performs poorly when a different question on the same concept is asked. It occurs when the model sees the same data points repeatedly and adjusts its parameters to fit exactly those points instead of learning the underlying pattern, so the error on new data points (the test set) is large. By choosing a different data point for each update, we make it harder for the model to memorize answers. In short, SGD is more generalized and less prone to overfitting.
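The "approximate gradient" point above can be seen numerically. This short sketch, using made-up synthetic data, compares the full-batch gradient of the MSE with a few single-point gradients:

```python
import numpy as np

# Compares the true (full-batch) gradient of the MSE w.r.t. w with
# single-point gradients, on made-up synthetic data (y ~ 2x + 1).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 1000)
y = 2 * x + 1 + rng.normal(0, 0.1, 1000)
w, b = 0.0, 0.0                           # some current parameter values

def grad_w(idx):
    """Gradient of the MSE w.r.t. w over the points in idx."""
    err = y[idx] - (w * x[idx] + b)
    return -(2 / len(idx)) * np.sum(err * x[idx])

true_g = grad_w(np.arange(len(x)))        # gradient over all N points
single = [grad_w(rng.integers(len(x), size=1)) for _ in range(5)]
# Each single-point gradient differs from true_g (it is noisy),
# but they tend to point in the same direction on average.
```

Because each single-point gradient only approximates the true one, individual SGD steps can move in slightly wrong directions, which is exactly why SGD needs more updates to converge.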
There is another variation of GD which is a compromise between both SGD and GD, which is called mini-batch GD.
In mini-batch GD we update the parameters based on small samples of size ‘b’ from the dataset. So instead of taking 1 data point (as in SGD) or the complete dataset (as in GD), we randomly take ‘b’ data points, calculate the gradient over them, and update the parameters, where 1 < b < N (N being the size of the dataset).
Observing carefully, we can see that SGD is just a special case of mini-batch GD where b = 1. And when b = N, it is GD.
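This relationship can be made explicit in code. Here is a hedged sketch of one epoch of mini-batch GD for the same linear model, with made-up defaults; the intercept is named `bias` to avoid clashing with the batch size `b`:

```python
import numpy as np

# One epoch of mini-batch GD for y-hat = w*x + bias with MSE loss.
# With b = 1 this reduces to SGD; with b = len(x) it is plain GD.
def minibatch_epoch(x, y, w, bias, lr=0.05, b=2, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    idx = rng.permutation(len(x))            # shuffle the dataset
    for start in range(0, len(x), b):
        sel = idx[start:start + b]           # next mini-batch of size <= b
        err = y[sel] - (w * x[sel] + bias)
        w -= lr * (-(2 / len(sel)) * np.sum(err * x[sel]))
        bias -= lr * (-(2 / len(sel)) * np.sum(err))
    return w, bias
```

Repeating this epoch on data generated as y = 2x drives w toward 2 and the intercept toward 0, whatever batch size is chosen.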
Note: GD is also called as batch gradient descent.
Mini-batch GD usually converges a little faster (in terms of iterations) than SGD, as its gradient approximation is closer to the true gradient than SGD’s. But since data points can be repeated across batches, mini-batch GD is somewhat more prone to overfitting than SGD.
The size of the mini-batch ‘b’ depends on the size of the dataset. In practice, small powers of two such as 32, 64, 128 or 256 are common choices; a reasonably large batch also dilutes the effect of any single outlier, so that even in the worst case an update is not thrown off by outliers alone.
Last Note: GD and all its variations suffer from the local minima problem.
You can have a look at the implementation of SGD below:
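The original code embed is not reproduced here; in its place, here is a minimal NumPy sketch of SGD for the linear-regression example, one randomly chosen point per update (hyperparameters and the stopping rule are simplified assumptions):

```python
import numpy as np

# Minimal SGD for y-hat = w*x + b with MSE loss:
# one randomly chosen data point per parameter update.
def sgd(x, y, lr=0.01, epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    w, b = rng.normal(), rng.normal()     # random initialization
    for _ in range(epochs * len(x)):
        i = rng.integers(len(x))          # pick one point at random
        err = y[i] - (w * x[i] + b)
        w -= lr * (-2 * err * x[i])       # single-point dL/dw
        b -= lr * (-2 * err)              # single-point dL/db
    return w, b
```

A full implementation would also track the loss each epoch and stop once it no longer decreases; the fixed iteration count keeps the sketch short.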
So far we have seen GD, SGD and mini-batch GD. There are further improvements on these algorithms that involve the concepts of momentum and adaptive learning rates. These algorithms include Momentum-based Gradient Descent, Nesterov Accelerated Gradient Descent, AdaGrad, RMSProp, AdaDelta and Adam.
We usually don’t use plain GD alone because it is slow. Adam and momentum-based SGD are among the most commonly used optimizers.
You may be wondering why there are so many algorithms. That’s because we are always looking for improvement; that’s actually a great thing, and it is how we have come so far.
The point of this article was to give the intuition behind the SGD algorithm along with a basic understanding.
Last Note (Oh, have I said it before?): Even though we don’t use GD now, notice that GD is how it all started and is the principle behind all its variations.
Thank you so much for reading!
References
- Deep Learning NPTEL by Mitesh Khapra
- Gradient Descent by StatQuest with Josh Starmer
- Stochastic Gradient Descent by StatQuest with Josh Starmer
In the next blog, I will talk about optimizers with momentum, adaptive learning rate and also a thing called cyclic SGD.