Implementation of Stochastic Gradient Descent

Aishwarya Valse
Published in Analytics Vidhya
Jun 29, 2020 · 4 min read

The purpose of this post is to explain the maths behind gradient descent. Most of us use gradient descent in machine learning, but we also need to understand the maths behind it.

As a fresher, when I was learning stochastic gradient descent, I found it a little complex. Here, I try to make it simpler for those who want to know how it works.

My focus in this post is to demonstrate the mathematics behind gradient descent.

First, a quick refresher: what is gradient descent?

Gradient Descent: An optimization technique used to find the coefficients of a function that minimize the output error.

Gradient Descent Procedure:

  1. Initialize the values of the coefficients (0.0 or small random values)
  2. Calculate the cost by substituting the coefficients into the function
  3. Calculate the partial derivative of the total error with respect to each weight
  4. Update the values of the coefficients
  5. Repeat the above steps until the cost reaches 0.0 or no further improvement in cost can be achieved (a minimal code sketch of this loop follows the list)
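As a rough sketch, the five steps map onto a short Python loop. This is a minimal illustration, not a production implementation; cost and grad are placeholder functions you would supply for your own model:

def gradient_descent(cost, grad, w, alpha=0.01, tol=1e-9, max_iters=10_000):
    # Step 1 happens in the caller: w arrives already initialized.
    prev_cost = cost(w)                      # step 2: cost with current coefficients
    for _ in range(max_iters):
        w = w - alpha * grad(w)              # steps 3-4: step opposite the gradient
        new_cost = cost(w)
        if abs(prev_cost - new_cost) < tol:  # step 5: stop when no improvement
            break
        prev_cost = new_cost
    return w

# Toy usage: minimize f(w) = (w - 3)^2, whose minimum is at w = 3
w = gradient_descent(cost=lambda w: (w - 3) ** 2,
                     grad=lambda w: 2 * (w - 3),
                     w=0.0)                  # step 1: initialize the coefficient
print(w)  # close to 3.0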

We will look at the two main variants of gradient descent before we jump to the implementation of stochastic gradient descent.

Methods of Gradient Descent
  • Batch Gradient Descent: In batch gradient descent, the loss is computed for every training sample in the training set, and the coefficients are updated only once per pass, after all the training examples have been evaluated.
  • Stochastic Gradient Descent: When we have a large amount of data, we can use a variation of gradient descent called stochastic gradient descent. In SGD, the coefficients are updated after each training instance, rather than once at the end of a batch of instances. The sketch below contrasts the two update schedules.
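To make the difference concrete, here is a small Python sketch of the two update schedules. The dataset, weights, and gradient function are made up for illustration (simple linear regression with squared error):

import numpy as np

def gradient(w, x, y):
    # Per-sample gradient of (1/2) * (a + b*x - y)^2 for w = [a, b]
    error = w[0] + w[1] * x - y
    return np.array([error, error * x])

dataset = [(0.0, 0.1), (0.5, 0.4), (1.0, 0.9)]  # hypothetical (x, y) pairs
alpha, epochs = 0.01, 5

# Batch gradient descent: one update per pass over the data
w = np.array([0.45, 0.75])
for _ in range(epochs):
    avg_grad = sum(gradient(w, x, y) for x, y in dataset) / len(dataset)
    w = w - alpha * avg_grad                 # update once, after all samples

# Stochastic gradient descent: one update per training instance
w = np.array([0.45, 0.75])
for _ in range(epochs):
    for x, y in dataset:
        w = w - alpha * gradient(w, x, y)    # update immediately, per sample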

A few tips for getting the most out of the gradient descent algorithm:

Plot Cost Values: Collect and plot the cost value calculated by the algorithm at each iteration, as in the snippet below. If the cost does not decrease from one iteration to the next, reduce your learning rate.
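For example, assuming matplotlib is available, a few lines are enough to see the trend (the cost values here are made up):

import matplotlib.pyplot as plt

# Suppose each gradient descent iteration appended its cost to this list
costs = [0.52, 0.41, 0.33, 0.27, 0.23]    # hypothetical values for illustration

plt.plot(range(1, len(costs) + 1), costs, marker="o")
plt.xlabel("Iteration")
plt.ylabel("Cost")
plt.title("Cost per iteration")
plt.show()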

Learning Rate: The learning rate should be a small value such as 0.1, 0.001, or 0.0001. Try different values and see which one works best.

Rescale Inputs: Rescale all your input variables to the same range, such as between 0 and 1.

Stochastic Gradient Descent is a widely used algorithm in machine learning. Here I demonstrate how to use stochastic gradient descent to learn the coefficients for a linear regression model by minimizing the error on a training dataset.

Let’s take a look at the example below:

Data:

[Image: sample data table]

I have used Excel to show the calculations.

  1. Rescale all your input variables to the same range, such as between 0 and 1. Here, we use min-max standardization to rescale the data.

Calculate the minimum and maximum value of each variable.

[Image: minimum and maximum values of the variables]

Standardize the data using the formula below (min-max standardization):

X new = (X - X min) / (X max - X min)

Apply the same formula to Y. The rescaled values are the “X new” and “Y new” used in the steps that follow.

[Image: min-max standardization applied to our data]
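A minimal sketch of this rescaling step in Python; the raw numbers here are hypothetical, since the original table is an image:

import numpy as np

def min_max_scale(values):
    # (x - min) / (max - min): maps the smallest value to 0, the largest to 1
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

X = np.array([10, 12, 14, 16, 18])   # made-up raw X values
Y = np.array([20, 25, 23, 30, 28])   # made-up raw Y values

X_new = min_max_scale(X)   # 10 -> 0.0, 18 -> 1.0
Y_new = min_max_scale(Y)   # 20 -> 0.0, 30 -> 1.0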

2. Let’s start by initializing random weights: a = 0.45, b = 0.75.

3. Compute ŷ

Consider the second observation (picked at random) from the newly scaled data:

X new = 0.222222222, Y new = 0.097087379

ŷ = a + b * X new

ŷ = 0.45 + 0.75 * 0.222222222 = 0.616667

4. Compute the loss

We use the squared-error loss E = (1/2)(ŷ - y)², which is driven by the signed error:

Error = ŷ - y new = 0.6166666 - 0.0970873 = 0.519579
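Steps 3 and 4 are a couple of lines of Python; the values match the hand calculation above:

a, b = 0.45, 0.75                        # initial weights from step 2
x_new, y_new = 0.222222222, 0.097087379  # the scaled observation

y_hat = a + b * x_new                    # prediction: 0.6166666665
error = y_hat - y_new                    # signed error: 0.5195792875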

5. Compute partial differentials

Calculate the partial derivative of the loss with respect to each weight. With E = (1/2)(ŷ - y new)² and ŷ = a + b * X new, the chain rule gives:

The partial derivative of the loss w.r.t. a (the intercept)

= ∂E/∂a

= (ŷ - y new) * ∂(a + b * X new - y new)/∂a

= ŷ - y new

= 0.519579

The partial derivative of the loss w.r.t. b (the slope)

= ∂E/∂b

= (ŷ - y new) * ∂(a + b * X new - y new)/∂b

= (ŷ - y new) * X new

= 0.519579 * 0.222222222

= 0.115462064

(Note the signs: since ŷ overshoots y new, both partial derivatives are positive, so gradient descent will decrease both weights.)
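In code, the gradients are one line each. The finite-difference check at the end is an extra sanity test (not part of the original walkthrough) that confirms the analytic derivative:

a, b = 0.45, 0.75
x_new, y_new = 0.222222222, 0.097087379
error = (a + b * x_new) - y_new      # 0.519579, from step 4

grad_a = error                       # ∂E/∂a = (ŷ - y new)          ≈ 0.519579
grad_b = error * x_new               # ∂E/∂b = (ŷ - y new) * X new  ≈ 0.115462

# Numerical check of ∂E/∂a via central differences
E = lambda a_, b_: 0.5 * (a_ + b_ * x_new - y_new) ** 2
eps = 1e-6
numeric_grad_a = (E(a + eps, b) - E(a - eps, b)) / (2 * eps)   # ≈ 0.519579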

6. Update the weights

Update ‘a’

Use a small learning rate (alpha = 0.01):

New a = old a - alpha * ∂E/∂a

= 0.45 - 0.01 * 0.519579

= 0.444804

Update ‘b’

New b = old b - alpha * ∂E/∂b

= 0.75 - 0.01 * 0.115462064

= 0.748845

We have just finished our first iteration of stochastic gradient descent, and our updated weights are a = 0.444804 and b = 0.748845. Repeat this process for the remaining instances in our dataset; the sketch below wraps one full pass into a function.
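Here is a minimal end-to-end sketch of one pass of SGD for simple linear regression, following the steps above exactly; run on the single worked observation, it reproduces the updated weights:

def sgd_epoch(X, Y, a, b, alpha=0.01):
    # One pass over the data, updating a and b after every training instance
    for x, y in zip(X, Y):
        y_hat = a + b * x       # step 3: prediction
        error = y_hat - y       # step 4: signed error
        a -= alpha * error      # step 6: a = a - alpha * ∂E/∂a
        b -= alpha * error * x  # step 6: b = b - alpha * ∂E/∂b
    return a, b

a, b = 0.45, 0.75
a, b = sgd_epoch([0.222222222], [0.097087379], a, b)
print(a, b)   # ~0.444804, ~0.748845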

7. Iteration: 5

Let’s jump ahead. The image below lists all the updated values of the coefficients over 5 iterations.

[Image: table of coefficient and error values over 5 iterations]

The yellow box in the image shows the error for each set of coefficients: the error decreases with each iteration. We can plug the coefficients with the least error into our simple linear regression model to make predictions for each point in our dataset.

I hope the implementation of linear regression with SGD is clear now. I suggest you try the above calculations yourself.

Thank you for reading.

Happy Learning!!! :)
