Regression Talks-Part II

Linear Regression using Gradient Descent

Bilwa Gaonker
TheLeanProgrammer
6 min read · Jul 4, 2021


In the last article, we talked about the basics of Linear Regression and wrote some Python code using the amazing sklearn library. But Bilwa, something felt missing for some reason?

Like, we did use a function and got the best fitting line, but there was no feeling to it. We have so many questions: How does that function work? How is the process of predicting the slope and intercept of the line done? What was the cost function used behind closed doors to get this output?

Woah Woah, slow down a bit 😮. Hopefully, this article will answer all the questions you all asked because we’ll be talking about Linear Regression using Gradient Descent. We’ll pick up the same data as the last article as it’ll be easier to relate to and understand.

Understanding Gradient Descent

I am pretty sure that we know uni-variate linear regression aims to find the best fitting line. I did mention the term ‘cost function’ in the last article, and defined it as well! Yeah, let us define it again for our own convenience, and also see how Gradient Descent is defined.

Cost function: It is also referred to as the loss function. It gives out the error and tells us how far we are from the values of slope and intercept that give the best fit.

The gradient descent algorithm is an iterative optimization algorithm for finding the minimum of a function (here, the cost function).

In simpler words, gradient descent is finding the best values for slope and intercept by minimizing a loss function.
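Before we apply it to our line, here is a tiny, generic sketch of the idea in Python (the function f(x) = (x − 3)², the starting point, and the learning rate are all made up just to show the "walk downhill" behaviour, not anything from our dataset):

# Minimal, illustrative gradient descent: minimize f(x) = (x - 3)**2,
# whose derivative is f'(x) = 2*(x - 3).
x = 0.0            # starting guess
lr = 0.1           # learning rate (step size)
for _ in range(50):
    grad = 2 * (x - 3)    # slope of f at the current x
    x = x - lr * grad     # step downhill, against the slope
print(x)           # ~3.0, the minimum of f

Each step moves x a little in the direction that decreases f, and the step size shrinks as the slope flattens near the minimum. Our loss function below plays the role of f, and the slope and intercept play the role of x.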

Now, let me explain how gradient descent works, alright?

Explaining with Math…

Notice the steps of the gradient descent closely:

📍 First, let us find the difference between the actual value of y and the predicted value of y.

📍 The usual go-to for measuring error in statistics is squaring the differences (we square them here so that mischievous negative values don’t fool us into thinking the total error is zero!)

📍 Now we find the mean of the squares to get our beautiful loss function.

Loss function: E = (1/n) Σᵢ (yᵢ − ŷᵢ)²

Loss function written in expanded form (substituting ŷᵢ = m·xᵢ + c): E = (1/n) Σᵢ (yᵢ − (m·xᵢ + c))²

📍 Let us set the learning rate of this algorithm to 0.001, slope = 0, and intercept = 0 (the same values we will use in the code below).

Initializing the values: m = 0, c = 0, learning rate L = 0.001

Learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function.

📍 Learning rate values usually lie between 0 and 1. Too small a value increases the number of iterations required to get the best output, and if the value is too big, the algorithm will keep oscillating across the curve, never really reaching the minimum.

📍 Now, we know that finding the minimum of a function requires taking its derivative with respect to the variable. Hence we calculate the partial derivatives of the loss function with respect to m and c.

Sliding into Dm’s (bad joke, sorry), the derivative w.r.t slope: Dm = (−2/n) Σᵢ xᵢ(yᵢ − ŷᵢ)

Derivative w.r.t intercept: Dc = (−2/n) Σᵢ (yᵢ − ŷᵢ)

📍 The next step is to update the slope (m) and intercept (c): m = m − L·Dm and c = c − L·Dc, where L is the learning rate.

📍 The final step is to keep repeating the previous steps until the loss is ideally 0 or close to 0. (A worked example of a single update is sketched right after this list.)
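To see one pass of these steps with actual numbers, here is a minimal sketch on made-up toy data (the data, the starting values, and the learning rate are assumptions purely for illustration; the real dataset comes in the coding section below):

import numpy as np

# Toy data, purely illustrative (true line: y = 2x)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

m, c, L = 0.0, 0.0, 0.01   # start at zero with a small learning rate
n = len(x)

# One iteration of gradient descent, exactly the steps listed above
y_pred = m * x + c                          # predictions with current m, c
Dm = (-2 / n) * np.sum(x * (y - y_pred))    # derivative w.r.t slope
Dc = (-2 / n) * np.sum(y - y_pred)          # derivative w.r.t intercept
m, c = m - L * Dm, c - L * Dc               # update step

print(Dm, Dc)   # -18.666..., -8.0
print(m, c)     # 0.1866..., 0.08

After just one update, m and c have both nudged upward, towards the true values; repeating this many times drives the loss down towards zero.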

Understanding it with an example…

Finding the best values for m and c is like finding the perfect pair of jeans 👖 (Yeah, before you all judge! Let me explain)

Suppose you are trying various sizes and brands of jeans to attain the perfect waist size, length, etc. (best fit, in short). I am assuming you were size 34 (best fit taken from an old purchase).

But various factors have changed: you worked out, tried getting fit, and want a new pair of jeans to show your outcome off! The loss function in our jeans case is the difference between your waist size and the jeans’ waist size.

I know you guys are smart enough to check the waist size and jump to the preferred size, but in order to understand machine learning, you need to think dumber (perspective of a machine).

The first step would be trying size 34; we aim to reduce the error in waist size (length is usually okay depending on the type of jeans). Now, this size is too loose, so go three sizes down, to 31.

Squaring the jeans error is pointless since it’s already positive (let us say we are using Manhattan distance to find the error). Our learning rate was to take big jumps in size when the fit was too loose, and smaller steps as we approached the minimum error. The number of iterations is the number of times we tried different jeans.

Okay, this size is loose, but we reduced the error by half and are closer to finding the best pair! Let's go :) Now let us go down one size at a time…30…(closer)…29…28 (Voila! This one fits great).

We ended up with a cool pair of jeans and showed off our fit selves. Now it's time to show off the coding skills!

Let’s code!

This part will seem pretty easy since I have explained the algorithm stepwise :)

Importing all the required libraries and then preprocessing the data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Preprocessing the data
data = pd.read_csv('LRdata1.txt', header=None, names=['Population', 'Profit'])
print(data.head())
X=data.iloc[:,0]
Y=data.iloc[:,1]
Output for the above code

Plotting the data (yes, we are using the same dataset from the last article).

plt.scatter(X,Y, color='green', marker='*')
plt.xlabel('Population of City in 10000s')
plt.ylabel('Profit in $10000s')
plt.title('Scatter Plot of Training Data')
plt.show()

As said above, we find the derivatives and update the values of slope and intercept accordingly.

# Let's build the model we learnt
m = 0
c = 0
L = 0.001        # learning rate
epochs = 5000    # no. of iterations
n = len(X)
for i in range(epochs):
    Y_pred = m*X + c                      # predictions with current m, c
    Dm = (-2/n)*sum(X*(Y - Y_pred))       # the derivatives we saw in theory
    Dc = (-2/n)*sum(Y - Y_pred)
    m = m - L*Dm                          # update slope
    c = c - L*Dc                          # update intercept
print(m, c)

Epochs are the number of iterations; be careful with them, as too many can cause overfitting! (You don’t want very tight jeans where you aren’t able to breathe.)

Slope and intercept values printed by the code above
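One way to be careful with the number of epochs is to watch the loss as you train. Here is a small sketch (an illustrative variation on the loop above, not part of the original code) that records the loss every epoch and stops early once it has flattened out; the 1e-9 threshold is an arbitrary choice:

# Track the loss per epoch and stop early when it stops improving
losses = []
m, c = 0, 0
for i in range(epochs):
    Y_pred = m*X + c
    loss = ((Y - Y_pred) ** 2).mean()     # MSE at this epoch
    losses.append(loss)
    if i > 0 and abs(losses[-2] - loss) < 1e-9:   # loss has flattened out
        break
    Dm = (-2/n)*sum(X*(Y - Y_pred))
    Dc = (-2/n)*sum(Y - Y_pred)
    m = m - L*Dm
    c = c - L*Dc

plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('Loss curve')
plt.show()

If the curve has gone flat long before the last epoch, you can safely train for fewer iterations.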

So yay, we found the line! Time to get all the predicted values (according to this y=m*X+c) and plot them.

# Making predictions
Y_pred = m*X + c
plt.scatter(X,Y, color='green', marker='*')
plt.plot(X, Y_pred, color='red')
plt.xlabel("Population of City in 10000's")
plt.ylabel("Profit in $10000's ")
plt.title("Linear Regression Fit")
plt.show()
Yes! This is how the gradient descent algorithm works :)

If you remember, this is how the line looked last time when we used the LinearRegression() function. So that’s all for this article, I guess! In further articles, we’ll be looking at more types of regression. Stay tuned for more regression talks like this! You can connect with me on LinkedIn if you have any queries related to my articles.

Don’t forget to follow The Lean Programmer Publication for more such articles, and subscribe to our newsletter tinyletter.com/TheLeanProgrammer
