Linear Regression From Scratch in Python WITHOUT Scikit-learn

Sindhu Seelam
Published in Geek Culture
7 min read · May 18, 2021

In this tutorial, I’ll go over a brief introduction to one of the most commonly used machine learning algorithms, Linear Regression, and then we’ll learn how to implement it using the least-squares method from scratch in Python, without scikit-learn. We’ll also look at the interpretation of R-squared in regression analysis and how it can be used to measure the goodness of fit of a regression model.

Linear Regression is a type of predictive analysis algorithm that models a linear relationship between the dependent variable (y) and the independent variable (x).

Based on the given data points, we try to plot a straight line that fits the points the best. The equation of a straight line is shown below:

y = c + m*x

where,
x: input data points (independent variable)
y: predicted value, the dependent variable (supervised learning)

The model gets the best-fit regression line by finding the best values of m and c.
m: slope of the regression line
c: intercept (bias), the point where the estimated regression line crosses the y-axis

Cost Function (J)-

As explained above, our goal is to find the regression line, or best-fit line, that has the least difference (error/residual) between the predicted value and the actual value. This is where the cost function comes into the picture: we use it to find the values of (c, m) that minimize the error between the predicted y value (ŷ) and the true y value (y).

Image Source: Linear Regression By Real Python

The cost function (J) of linear regression is the Mean Squared Error (MSE) between the predicted y value (ŷ) and the true y value (y); taking its square root gives the Root Mean Squared Error (RMSE).

Mean Squared Error (MSE)-

Given our simple linear equation y = c + m*x, we can calculate MSE as:

J = (1/N) * Σ (yᵢ − ŷᵢ)², with ŷᵢ = c + m*xᵢ

Where,

  • 𝑁 is the total number of observations (data points)
  • yᵢ is the actual value of an observation and ŷᵢ is the predicted value
  • J is the cost function which is the mean squared error in this case
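To make the definition concrete, here’s a minimal NumPy sketch of this cost function (the helper name mse is my own):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)
```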

Python Implementation from Scratch:

It’s time to learn the mathematical implementation of the algorithm. For this tutorial, I’ll be working with a simple data set of x and corresponding y values as shown below.

Let’s calculate the mean of x and y; we’ll denote them as x̅ & y̅.
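For concreteness, here’s a small NumPy sketch. The x values {1, 2, 3, 4, 5} appear later in this article; the y values {3, 4, 2, 4, 5} are an assumption taken from the referenced Edureka example, chosen to be consistent with the means and results worked out below.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 4, 2, 4, 5])  # assumed from the Edureka example

x_mean = x.mean()  # x̅ = 3.0
y_mean = y.mean()  # y̅ = 3.6
```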

See, our goal is to predict the best-fit regression line using the least-squares method. To find it, we first need the equation of such a line. So if y = c + m*x, then ‘m’ is the slope, which is the change in y divided by the change in x.

Change in x is the difference between actual input value xᵢ and x̅, and similarly change in y is the difference between label yᵢ and y̅.

Below is the mathematical representation of m:

m = Σ (xᵢ − x̅)(yᵢ − y̅) / Σ (xᵢ − x̅)²

So, moving ahead, according to the formula for ‘m’, what we’ll do is calculate (xᵢ − x̅) and (yᵢ − y̅) for each data point in our very simple dataset.

Image Source: Linear Regression by Edureka

Now that we have all the individual elements of our formula ready, we’ll compute the summations in the numerator and denominator and find the final value of ‘m’.

Image Source: Linear Regression by Edureka

Because the least-squares regression line always passes through the point (x̅, y̅), we can substitute the means into y = c + m*x and solve for c:

3.6 = 0.4 * 3 + c

c = 2.4
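Continuing with the sample arrays from the earlier snippet, both estimates fit in a couple of lines of NumPy:

```python
# m = Σ(xᵢ − x̅)(yᵢ − y̅) / Σ(xᵢ − x̅)²
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # 0.4

# The line passes through (x̅, y̅), so c = y̅ − m·x̅
c = y_mean - m * x_mean  # 2.4
```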

Now, for m = 0.4 & c = 2.4, let’s predict ŷ for all input values x = {1, 2, 3, 4, 5}:

y = 0.4*1 + 2.4 = 2.8

y = 0.4*2 + 2.4 = 3.2

y = 0.4*3 + 2.4 = 3.6

y = 0.4*4 + 2.4 = 4.0

y = 0.4*5 + 2.4 = 4.4

Now if we plot them, the line passing through all these predicted y values and cutting the y-axis at 2.4 is our regression line.

Image by Author

Now our job is to calculate the distance between the actual and predicted values and reduce it. In other words, we have to reduce the error between the actual and predicted values. The line with the least error will be the line of linear regression.

So this is how it works internally (a toy sketch follows this list):

  1. It iterates over a number of candidate values for ‘m’.
  2. For each one, it forms the equation of the line y = c + m*x. As the value of m changes, the line changes.
  3. After every iteration, it calculates the predicted values according to that line and compares the distance between the actual & predicted values.
  4. The line whose predicted values have the least distance from the actual values is chosen as the regression line.
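Here’s a toy sketch of that search, reusing the sample arrays from earlier. (This brute-force scan is only for illustration; in practice the slope comes from the closed-form least-squares formula above or from gradient descent.)

```python
best_m, best_err = None, float("inf")
for m_candidate in np.arange(-2.0, 2.0, 0.01):   # try many slope values
    c_candidate = y_mean - m_candidate * x_mean  # line through (x̅, y̅)
    y_pred = m_candidate * x + c_candidate
    err = np.sum((y - y_pred) ** 2)              # total squared distance
    if err < best_err:
        best_m, best_err = m_candidate, err

print(round(best_m, 2))  # 0.4, matching the least-squares slope
```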

R-squared value:

Now that we’ve found the best-fit regression line, it’s time to measure its goodness of fit, i.e. to check how well our model is performing.

R-squared value is a statistical measure of how close the data are to the fitted regression line.

It is also known as the coefficient of determination or coefficient of multiple determination.

Here’s how we calculate the R-squared value:

R² = Σ (yₚᵣₑ𝒹 − y̅)² / Σ (yᵢ − y̅)²

where yₚᵣₑ𝒹 is the predicted y value, y̅ is the mean, and yᵢ is the actual value.

Basically, we’re summing the squared differences between each predicted value and the mean, then dividing by the sum of the squared differences between each actual value and the mean.

Summing these squared differences across all the data points gives us the final R-squared value:

Image Source: Linear Regression by Edureka

With our sample data, that works out to approximately 0.3 (1.6 / 5.2 ≈ 0.31).

It means that our data points are far away from the regression line. Well, that’s no good, is it?
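Here’s the same calculation in code, continuing with the sample arrays and the m and c computed earlier:

```python
y_pred = m * x + c                       # [2.8, 3.2, 3.6, 4.0, 4.4]
ss_reg = np.sum((y_pred - y_mean) ** 2)  # 1.6
ss_tot = np.sum((y - y_mean) ** 2)       # 5.2
print(ss_reg / ss_tot)                   # ≈ 0.31
```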

Basically, the higher the R-squared value, the better the model’s performance: as the R-squared value increases, the distance of the actual points from the regression line decreases.

Implementation in Python:

Now that we’ve learned the theory behind linear regression & the R-squared value, let’s move on to the coding part. I’ll be using Python and Google Colab.

I’ll be working with a simple dataset called headbrain from Kaggle. It has 237 rows and 4 columns, i.e. 237 observations and 4 attributes. We have to predict the brain weight (grams) of an individual based on the given head size (cm³).

Step-1:

Import the necessary libraries: pandas, NumPy & matplotlib.
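A minimal version of this step:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```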

Step-2:

Import the CSV file as a pandas DataFrame.

This is our sample dataset:
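Here’s a sketch of this step; the file name headbrain.csv is my assumption, so adjust it to whatever your Kaggle download is called:

```python
df = pd.read_csv("headbrain.csv")  # file name assumed
print(df.shape)   # (237, 4)
print(df.head())
```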

Step-3:

We expect a linear relationship between head size and brain weight. The next step is to collect our X and Y values: X will hold the head size values and Y the brain weight values. We also need the values of ‘m’ and ‘c’, so for that we first find the means of X & Y.
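Something like the following, where the column names are assumed from the Kaggle headbrain dataset:

```python
# Feature (head size) and target (brain weight); column names assumed
X = df["Head Size(cm^3)"].values
Y = df["Brain Weight(grams)"].values

# Means needed for the least-squares formulas
x_mean = np.mean(X)
y_mean = np.mean(Y)
```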

Step-4:

Calculate m & c using the formulas we discussed earlier in this article.
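Translating those formulas directly into NumPy:

```python
# m = Σ(xᵢ − x̅)(yᵢ − y̅) / Σ(xᵢ − x̅)²
m = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)

# c = y̅ − m·x̅ (the line passes through the means)
c = y_mean - m * x_mean

print("m =", m, " c =", c)
```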

Step-5:

Now that we have our m & c, let’s plot the input points and the regression line.
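One way to draw that plot:

```python
plt.scatter(X, Y, label="Data points")
plt.plot(X, m * X + c, color="red", label="Regression line")
plt.xlabel("Head Size (cm^3)")
plt.ylabel("Brain Weight (grams)")
plt.legend()
plt.show()
```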

Image by Author

Step-6:

Now it’s time to measure how good our model is. For this, as discussed above, we will calculate the R-squared value and evaluate our linear regression model. If you need a refresher, the formula is R² = Σ (yₚᵣₑ𝒹 − y̅)² / Σ (yᵢ − y̅)².
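A direct translation of that formula:

```python
Y_pred = m * X + c

# R² = Σ(ŷᵢ − y̅)² / Σ(yᵢ − y̅)²
ss_reg = np.sum((Y_pred - y_mean) ** 2)
ss_tot = np.sum((Y - y_mean) ** 2)
print("R-squared:", ss_reg / ss_tot)
```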

The above code prints the R-squared value.

And that was linear regression implemented from scratch, without using the sklearn library.


If you can’t be bothered with all this math and theory and would prefer a neater method, the sklearn library has an inbuilt linear regression class you can use. Here’s the code snippet for that:
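A sketch along those lines, reusing the X and Y arrays from the steps above:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# scikit-learn expects a 2-D feature matrix, hence the reshape
X_2d = X.reshape(-1, 1)

model = LinearRegression()
model.fit(X_2d, Y)

Y_pred = model.predict(X_2d)
print("m =", model.coef_[0], " c =", model.intercept_)
print("R-squared:", r2_score(Y, Y_pred))
```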

End Notes:

In this tutorial, we’ve learned the theory behind the linear regression algorithm and implemented it from scratch, without using the inbuilt linear model from sklearn.

You can find the data and iPython notebook here on my GitHub.

References:

[1] Linear Regression by Edureka on YouTube

[2] https://www.geeksforgeeks.org/ml-linear-regression/

Connect with me on LinkedIn and Twitter for more tutorials and articles on Machine Learning, Statistics, and Deep Learning.


Transitioning ML/AI Engineer. I’m passionate about learning & writing about my journey into the AI world. https://www.linkedin.com/in/sindhuseelam/