Gradient Descent (with a little bit of scary maths, just a pinch worth)

Ishan Shishodiya
Published in ml-concepts.com
Apr 8, 2022 · 7 min read

Buckle up Buckaroo because this one is gonna be a long one (and a tricky one too).

This article is a lot more “mathy” than most of my articles, as it tries to cover the concepts behind a Machine Learning algorithm called Linear Regression.

If you don’t know what Linear Regression is, go through this article once. It will help you understand the basics behind Linear Regression without getting into any complex mathematics.

I have already made a Google Colab notebook covering this topic, so if you want to follow along, click here. The notebook shows you how to recreate this algorithm from scratch, so it’ll be a very good learning experience if you are new to the field of Machine Learning.

Flavors of Linear Regression

Simple Linear Regression

Simple Linear Regression is the type of Linear Regression where we only use one variable to predict new outputs.

The equation of a straight line is -

Simple Linear Regression (Image 1)

Here m or β1 is the slope of the line (for that variable) and c or β0 is the intercept.
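
Written out with the β notation used in the rest of this article, that line is simply:

y = β0 + β1·x

(which is the familiar y = mx + c with the letters renamed)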

Multiple Linear Regression

If there were more than two terms in the equation above, we would be dealing with Multiple Linear Regression, and the equation for the line would look something like this,

Multiple Linear Regression (Image 2)

Multiple Linear Regression is the type of Linear Regression where we use multiple variables to predict new outputs.

In the above case x1,x2,…,xn are the multiple independent variables (feature variables) that affect y, the dependent variable (target variable).

β1,β2,…,βn are the coefficients or the slope for that feature variable, while β0 is still the intercept.
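
Written out in the same plain notation, that equation is:

y = β0 + β1·x1 + β2·x2 + … + βn·xn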

Using the formula for Simple Linear Regression, we can plot a straight line that can try to capture the trend between two variables. All we would need are the values for β0 and β1.

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

Let’s take β0 as 0.55 and β1 as 3. Just two random numbers.

Plotting x, y, and a line with a random slope and intercept values on a graph (Image 3)
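
As a rough sketch of how a plot like this could be produced with matplotlib (the notebook’s exact code and styling may differ):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

b0, b1 = 0.55, 3                          # our two "random" values for the intercept and slope
y_pred = [b0 + b1 * xi for xi in x]       # points on the line for each x

plt.scatter(x, y, label="actual data")
plt.plot(x, y_pred, color="red", label="our line")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()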

As you can see, the line we plotted is not very accurate and strays from the actual data by quite a lot. Basically, it has a lot of error.

But how much exactly??????

SSE, MSE, RMSE, and MAE

SSE, MSE, RMSE, and MAE are just different ways of calculating the errors in our predictions.

SSE

SSE stands for Sum of Squared Errors. It basically is the sum of the squares of the difference between the predicted and the actual value.

SSE (Image 4)

y_i = Actual y value for that value of x.
y_i_cap = Value predicted by the line we plotted for that value of x.
n = Number of total y values.

Here we take the square of y_i − y_i_cap because for some values this difference is positive, while for others it is negative. Since we want the total error, we need to sum these numbers up, but adding positive and negative numbers would cancel out part of the error. To avoid this we square y_i − y_i_cap, which only gives us positive numbers.
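
In plain notation, SSE can be written as:

SSE = Σ (y_i − y_i_cap)², with the sum running over i = 1, …, n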

MSE

MSE stands for Mean Squared Error. It is the average of the squares of the differences between the predicted and the actual values.

If you didn’t understand the sentence above, I hope this formula makes everything clearer.

(Image 5)
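
In other words, it is just the SSE divided by the number of data points:

MSE = (1/n) · Σ (y_i − y_i_cap)²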

RMSE

Because MSE squares the errors, the value it gives is not on the same scale as the data, so we won’t get an easily interpretable error value if we only use MSE. To fix this we can take the square root of MSE. This is called RMSE, or Root Mean Squared Error.

RMSE (Image 6)
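
That is, it’s simply the square root of the MSE formula:

RMSE = √MSE = √( (1/n) · Σ (y_i − y_i_cap)² )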

MAE

We can also take the absolute value of y_i − y_i_cap (by applying the modulus) to avoid getting negative error values, and then average those. This type of error is called MAE, or Mean Absolute Error.

MAE (Image 7)
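
In plain notation:

MAE = (1/n) · Σ |y_i − y_i_cap|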

For the line we plotted, MSE, RMSE, and MAE are 8.5, 2.92, and 2.55 respectively.
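
You can verify these numbers with a few lines of Python (a quick sketch, not necessarily the notebook’s exact code):

from math import sqrt

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
b0, b1 = 0.55, 3

y_pred = [b0 + b1 * xi for xi in x]                # predictions from our line
errors = [yi - ypi for yi, ypi in zip(y, y_pred)]  # y_i - y_i_cap for each point

mse = sum(e ** 2 for e in errors) / len(errors)
rmse = sqrt(mse)
mae = sum(abs(e) for e in errors) / len(errors)
print(round(mse, 2), round(rmse, 2), round(mae, 2))  # 8.5 2.92 2.55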

We need to minimize these errors to get much more accurate predictions. The ideal value would be 0, but that’s extremely rare in practice.

Cost Function

A cost function is a formula that is used to calculate the cost (errors/loss) for certain values of the original function.

In simple terms, a cost function just measures the errors in our prediction.

Sometimes the Cost Function is also called the Loss Function.

In our case, the Cost Function is the MSE itself. You can take the Cost Function as RMSE or MAE too, but I am going with MSE because it amplifies the errors a bit by squaring them.
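As a tiny sketch, an MSE Cost Function for our straight line could look something like this (the function name cost is my own choice, not necessarily what the notebook uses):

def cost(b0, b1, x, y):
    """MSE of the line y = b0 + b1*x over the given data points."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)

print(cost(0.55, 3, [1, 2, 3, 4, 5], [3, 5, 7, 9, 11]))  # ≈ 8.5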

Gradient Descent

Let’s write each of these formulas as a mathematical function -

1. Function for predicted values.

(Image 8)

2. Function for errors.

(Image 9)

3. Function for Cost Function.

(Image 10)
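
Written out in plain notation (with MSE as our Cost Function), these three functions are roughly:

y_i_cap = β0 + β1·x_i                                  (prediction)
e_i = y_i − y_i_cap                                    (error)
J(β0, β1) = (1/n) · Σ (y_i − (β0 + β1·x_i))²           (Cost Function, i.e. the MSE)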

For the highest accuracy possible we want J(β0, β1) ≈ 0.

Phew, that was a lot, but we are halfway there now!

Now let’s plot the errors we get for different values of β0 and β1.

β0 vs MSE (Cost Function) (Image 11)
β1 vs MSE (Cost Function) (Image 12)

From the graphs above we can see that the Slope vs MSE curve and the Intercept vs MSE curve are both parabolas, which for certain values of the slope and intercept get very close to 0.

Let’s plot a 3D graph for an even clearer picture.

Slope vs Intercept vs MSE (Cost Function) (Image 13)

(That’s a cool-looking graph ngl!)

Here we can clearly see that for certain values of β0 and β1, the value of J(β0,β1) is at its minimum.

Those are the values we want for our linear equation as they would yield us the highest accuracy possible.
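
If you want to reproduce a surface like the one above, one way is to evaluate the MSE over a grid of intercept and slope values (a sketch, with grid ranges I picked arbitrarily):

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 5, 7, 9, 11])

b0_vals = np.linspace(-5, 7, 100)   # candidate intercepts
b1_vals = np.linspace(-2, 6, 100)   # candidate slopes
B0, B1 = np.meshgrid(b0_vals, b1_vals)

# MSE for every (intercept, slope) pair on the grid
J = np.zeros_like(B0)
for i in range(B0.shape[0]):
    for j in range(B0.shape[1]):
        J[i, j] = np.mean((y - (B0[i, j] + B1[i, j] * x)) ** 2)

ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(B0, B1, J, cmap="viridis")
ax.set_xlabel("β0 (intercept)")
ax.set_ylabel("β1 (slope)")
ax.set_zlabel("MSE")
plt.show()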

Convergence Theorem

Gradient Descent (Image 14)

As you can see in the image above, we start from a particular coefficient value and take small steps towards the value that minimizes the error. The size of these steps is determined by the learning rate.

Because we are taking small steps in order to converge to the minimum, this whole process is referred to as the convergence theorem.

The formula is something like this -

Convergence Theorem (Image 15)

Here,

  1. β0 is the intercept of the linear equation.
  2. β1 is the slope of the linear equation.
  3. L is the learning rate.
  4. (∂J/∂β0) is the slope (partial derivative) of the function J(β0,β1) with respect to β0.
  5. (∂J/∂β1) is the slope (partial derivative) of the function J(β0,β1) with respect to β1.
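
Using the symbols from the list above, each iteration roughly does the following, repeated until the changes become negligibly small:

β0 = β0 − L · (∂J/∂β0)
β1 = β1 − L · (∂J/∂β1)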

With each iteration, we would be getting closer to the ideal value of β0 and β1.

If you feel a bit confused about what this was, click here for a better understanding. In the video, a different Cost Function is used, but don’t worry about that. The Cost Function can be anything and varies from method to method; you just have to make sure that it accurately measures the errors.

Okay cool. Now we are one step closer to implementing gradient descent in code. All we need to do is figure out what the values for (∂J/∂β0) and (∂J/∂β1) are and we are good to go.

(Image 16)

First let’s find ∂J/∂β0,

(Image 17)

Next, let’s find ∂J/∂β1,

(Image 18)

Suppose there were a β2 in the equation (which can happen in Multiple Linear Regression); then (∂J/∂β2) would be,

(Image 19)

For any βn in the equation, the ∂J/∂βn would be,

(Image 20)
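
Deriving them from the MSE Cost Function above gives the following (in plain notation, where x_n_i is the value of the n-th feature for the i-th data point):

∂J/∂β0 = −(2/n) · Σ (y_i − y_i_cap)
∂J/∂β1 = −(2/n) · Σ x_i · (y_i − y_i_cap)
∂J/∂βn = −(2/n) · Σ x_n_i · (y_i − y_i_cap)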

Now that we have the values of ∂J/∂β0 and ∂J/∂β1, we can easily feed them into the formula in Image 15.

With each iteration, we would be getting one step closer to the minimum Cost Function value.

To see how all this can be implemented in code, click here.
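
If you just want a rough idea of what that code looks like before opening the notebook, here is a minimal sketch (names like learning_rate and epochs are my own choices, and the notebook’s implementation may differ):

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

b0, b1 = 0.0, 0.0        # start from arbitrary values of the intercept and slope
learning_rate = 0.01     # L in the convergence theorem formula
epochs = 10000           # number of gradient descent iterations
n = len(x)

for _ in range(epochs):
    y_pred = [b0 + b1 * xi for xi in x]
    # partial derivatives of the MSE with respect to b0 and b1
    d_b0 = -(2 / n) * sum(yi - ypi for yi, ypi in zip(y, y_pred))
    d_b1 = -(2 / n) * sum(xi * (yi - ypi) for xi, yi, ypi in zip(x, y, y_pred))
    b0 -= learning_rate * d_b0
    b1 -= learning_rate * d_b1

print(b0, b1)  # ends up very close to 1 and 2, since y = 1 + 2x fits this data exactly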

If you feel like you understood Gradient Descent, try deriving the formula for Multiple Linear Regression. It won’t be much different from what you did in this article, and it’s a fun exercise to practice the concepts you just learned.
