What is Gradient Descent? Reduce Loss Function with Gradient Descent

Preethi Thakur
7 min readOct 9, 2022

--

Gradient Descent

Gradient Descent is a popular optimization technique where the general idea is to tweak(adjusting till we get optimal result) parameters iteratively in order to minimize the cost function.

Gradient Descent

It measures the local gradient of the error function with respect to the parameter vector θ, and it goes in the direction of the descending gradient. Once the gradient is zero, you have reached a minimum. Gradient Descent is useful when you have a very large dataset.

How Gradient Descent works?

So the process is, you will start by filling θ with random values, this is called random initialization, and then you improve it gradually, taking one tiny step at a time, at each step you are attempting to decrease the cost function until the algorithm converges to a minimum.

The sum of the squared errors are calculated for each pair of input and output values. A learning rate η is used as a scale factor and the coefficients are updated in the direction towards minimizing the error. The process is repeated until a minimum sum squared error is achieved or no further improvement is possible.

Learning Rate

An important parameter in Gradient Descent is the step size, this is determined by the learning rate η hyperparameter. The learning rate has to be appropriate, otherwise your algorithm will take forever (let’s say really long time!!) to reach the minimum. See how learning rate affects the model.

  • If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time.
  • If the learning rate is too high, you might jump across and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.

Global minimum, Local minimum

Not all cost functions are parabolic(bowl structure). There may be holes, ridges, plateaus, and irregular terrains, due to which convergence to the minimum might get difficult.

If the random initialization starts the algorithm on the left, then it will converge to a local minimum, which is not as good as the global minimum. If it starts on the right, then it will take a very long time to cross the plateau, and if you stop too early you will never reach the global minimum. Global minimum will give the optimal coefficients.

When using Gradient Descent, you should ensure that all features have a similar scale. So, before going ahead with Gradient Descent, better scale the features by using Scikit learn library class StandardScaler, MinMaxScaler, RobustScaler.

There are three different methods in Gradient Descent which we can use to get the optimal coefficients. They are:

  • Batch Gradient Descent
  • Stochastic Gradient Descent
  • Mini — Batch Gradient Descent

Batch Gradient Descent

In Batch Gradient Descent, we compute the gradient of the cost function. We do this by taking partial derivation of the cost function, which is Mean Square Error(MSE) in this example. What is partial derivation? We calculate the amount of the cost function that will change when we change coefficient θj, just a little bit.

While studying about cost function, we already came up with MSE as the cost function for our linear model.

Cost Function

Here,

We take the partial derivation on above cost function with respect to θj we will derive following equation.

Partial derivative of the cost function

This can also be re - written as

Partial derivative of the cost function

We can compute all partial derivates with respect to θ1,θ2…..θjn at one go. The gradient vector below ∇θMSE(θ),contains all the partial derivatives of the cost function of each model parameter(θ, this is also called as weight or coefficient).

Gradient vector of the cost function

Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting ∇θMSE(θ) from θ. This is where the learning rate η comes into play: multiply the gradient vector by η to determine the size of the downhill step.

Gradient Descent step

Let’s look at a quick implementation of this algorithm:

eta = 0.1 # learning rate
n_iterations = 1000
m = 100
theta = np.random.randn(2,1) # random initialization
for iteration in range(n_iterations):
gradients = 2/m * X_b.T.dot(X_b.dot(theta) — y)
theta = theta — eta * gradients
resulting θ
>>> theta
array([[4.21509616],
[2.77011339]])

Gradient descent has given us the coefficients. What will happen when we try with various learning rates?

Gradient Descent with various learning rates

On the left, the learning rate is too low: the algorithm will eventually reach the solution, but it will take a long time. In the middle, the learning rate looks pretty good: in just a few iterations, it has already converged to the solution. On the right, the learning rate is too high: the algorithm diverges, jumping all over the place and actually getting further and further away from the solution at every step.

How to set the number of iterations? A simple solution is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes tiny, because this happens when Gradient Descent has (almost) reached the minimum.

Batch Gradient Descent Disadvantage

The main problem with Batch Gradient Descent is that, it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. It is excellent for convex or relatively smooth error manifolds but not recommended for large datasets as the computation takes lot of time and hence will end up being expensive.

Stochastic Gradient Descent

Unline Batch Gradient, Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance. Hence, this makes the algorithm much faster since it has very little data to manipulate at every iteration(epochs). It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration

Due to its stochastic (random) nature, instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down. So once the algorithm stops, the final parameter values are good, but not optimal.

When the cost function is very irregular, this can help the algorithm jump out of local minimum, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does. But because of this irregularity the algorithm can never settle at the minimum. One solution to this problem is to gradually reduce the learning rate. The steps start out large, which helps make quick progress and escape local minima, then get smaller and smaller, allowing the algorithm to settle at the global minimum.

To perform Linear Regression using SGD with Scikit-Learn, you can use the SGDRegressor() class, which defaults to optimizing the squared error cost function. The following code runs for maximum 1000 epochs (max_iter=1000). epochas are number of iterations taken to reach global minimum.

from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
>>> sgd_reg.intercept_, sgd_reg.coef_
(array([4.24365286]), array([2.8250878]))

Mini-batch Gradient Descent

Mini-batch Gradient Descent is a combination of both Batch and Stochastic Gradient Descent. At each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Minibatch GD computes the gradients on small random sets of instances called minibatches.

The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization.

The algorithm’s progress in parameter space is less erratic than with SGD, especially with fairly large mini-batches. As a result, Mini-batch Gradient Descent will end up walking around a bit closer to the minimum than SGD. But, it might be harder for it to escape from local minimum.

The above figure shows the paths taken by the three Gradient Descent algorithms during training. They all end up near the minimum, Batch GD’s path stops at the minimum, while both Stochastic GD and Mini-batch GD continue to jump around. We can see Batch GD took lot of time to take each step. Stochastic GD and Mini-batch GD would actually reach the minimum if we use a good learning rate.

Liked what you read? Click here to read more interesting topics on Machine Learning

--

--