Machine Learning Linear Regression project from scratch (without library)

Anar Abiyev
Published in Analytics Vidhya
7 min read · Mar 12, 2021

Linear Regression is a supervised machine learning algorithm used to predict values within a continuous range, rather than classifying them into categories.

In this article you can find an implementation of Univariate Linear Regression in Python, without using any machine learning library. The code is explained step by step, together with the mathematical background.

Outline

  • Theoretical background
  • Python code
  • Summary

Background

You may think, "I can drive a car without knowing how the engine works." True, but what if the engine develops a minor fault while you are on the highway? Would you wait hours for roadside service, or fix the problem yourself in a few minutes? The same applies to machine learning algorithms. If you understand what happens under the hood, you can fix problems on your own, or maybe even build a better engine.

Knowing the math behind any algorithm will give you 100% control over the algorithm. While building a Machine Learning model, you may need to modify the algorithm in order to get the best model out of the data given to you. In such cases, it is essential to know how the algorithm works in the background to make any improvements.

I have already published two articles about the mathematical theory behind Linear Regression. The first one, Univariate Linear Regression, explains the basics of the algorithm with simple examples. It is good practice to start with Univariate Linear Regression, as it is the simplest version of Linear Regression. Once you have the fundamentals, you can have a look at the second article, Multivariate Linear Regression. MLR follows the same concept as ULR, but is used for more complex datasets (more than one input feature):

  1. Univariate Linear Regression — the basic information needed to start with.
  2. Multivariate Linear Regression — the more complex form of Linear Regression.

Python code

The code comes in two parts. The first part covers how to read and visualize the dataset. After that, we will write the functions of the algorithm itself (cost function, gradient descent, etc.).

Before diving into the code step by step, I want to mention that you can find all the code and the dataset in my GitHub account.

Part 1. Reading and visualizing the dataset

To build any machine learning model you need a dataset, and to build a successful model you should visualize that dataset first. Visualization gives you a clear understanding of the data and an initial idea of which algorithm to use.

The dataset is provided as a csv file, and the pandas library is used to read it. The read_csv function loads the dataset into a variable called "data".

The head() function returns the first 5 rows of the dataset.

import pandas as pd

data = pd.read_csv("train.csv")  # read the csv file
print(data.head())               # first 5 rows of the dataset
Output:

    x          y
0  24  21.549452
1  50  47.464463
2  15  17.218656
3  38  36.586398
4  87  87.288984

Note: keep in mind that the .py and .csv files should be in the same directory for the code above to work; otherwise you have to pass the full path to where your csv file is stored:

data = pd.read_csv(r"Full path\Filename.csv")

After the csv file is read, the x and y values should be stored in separate variables so that we can work with them. This can be done in several ways, for example with iloc and loc (pandas indexers), or by referring to a column directly by name, which is what we will use in this example.

X = data['x']
Y = data['y']

Here, X and Y are pandas Series. A Series carries much more machinery than a plain Python list, and since our functions will loop over the samples one by one in pure Python, element-wise access is faster on plain lists. For this reason, we convert X and Y from pandas Series to Python lists:

X = X.tolist()
Y = Y.tolist()

For visualization, the matplotlib library is used. It has a wide range of functions for customizing a plot; in this example I will use a few of them. For more about matplotlib, check the link in the references.

import matplotlib.pyplot as plt
plt.scatter(X, Y)
plt.grid()
plt.xlabel("x values")
plt.ylabel("y values")
plt.show()

The output is a scatter plot of the x and y values.

Every dot in the graph represents one sample from the dataset. As the output shows, the points fall roughly along a line, so the Linear Regression algorithm is quite appropriate for this dataset.

Part 2. Body of the algorithm

Note: in this section I will only briefly describe the functions. For a detailed mathematical explanation of each one, have a look at the articles mentioned above, or find them in the references.

As we start to code the body of the algorithm, it is worth stating what the variables stand for. Keep in mind that the hypothesis (the equation of the line) is

y = w*x + b

where w is the slope (weight) and b is the intercept (bias). Apart from that, alpha is the learning rate and n_iter is the number of iterations.

Cost function

The cost function (also called the loss function) measures how well the line fits the data. The lower its value, the better the fit: if all the samples lie exactly on the line, the cost is zero; if the samples are far from the line, the cost is high.
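In symbols, with the hypothesis y = w*x + b, the cost used here is the halved mean squared error over the N samples:

```latex
J(w, b) = \frac{1}{2N} \sum_{i=1}^{N} \bigl( (w x_i + b) - y_i \bigr)^2
```

The factor 1/2 is a common convention that cancels the 2 produced when differentiating the square.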

def cost_function(X, Y, w, b):
    N = len(X)
    total_error = 0.0
    for i in range(N):
        total_error += ((w*X[i] + b) - Y[i])**2  # squared residual of sample i
    return total_error / (2*float(N))
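As a quick sanity check (a minimal, self-contained sketch that restates the function), the cost should be exactly zero on data that lies perfectly on a line, and positive otherwise:

```python
def cost_function(X, Y, w, b):
    N = len(X)
    total_error = 0.0
    for i in range(N):
        total_error += ((w * X[i] + b) - Y[i]) ** 2  # squared residual
    return total_error / (2 * float(N))

X = [1.0, 2.0, 3.0]
Y = [3.0, 5.0, 7.0]  # exactly y = 2x + 1

print(cost_function(X, Y, 2.0, 1.0))  # 0.0, a perfect fit
print(cost_function(X, Y, 0.0, 0.0))  # positive, the line misses every point
```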

Gradient descent

We start the algorithm with arbitrary initial values of w and b (usually zero), so at first the cost function returns some high value. We therefore have to optimize w and b to reduce it, and gradient descent is the algorithm used for this. In each iteration, gradient descent updates the values of w and b so that the line fits the data better.
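Concretely, the update implemented in the code corresponds to the standard partial derivatives of the cost:

```latex
\frac{\partial J}{\partial w} = -\frac{1}{N}\sum_{i=1}^{N} x_i \bigl( y_i - (w x_i + b) \bigr),
\qquad
\frac{\partial J}{\partial b} = -\frac{1}{N}\sum_{i=1}^{N} \bigl( y_i - (w x_i + b) \bigr)

w \leftarrow w - \alpha \frac{\partial J}{\partial w},
\qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
```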

# alpha - learning rate
# N - number of samples in the dataset
def gradient_descent(X, Y, w, b, alpha):
    dl_dw = 0.0
    dl_db = 0.0
    N = len(X)
    for i in range(N):
        dl_dw += -1*X[i] * (Y[i] - (w*X[i] + b))
        dl_db += -1*(Y[i] - (w*X[i] + b))
    w = w - (1/float(N)) * dl_dw * alpha
    b = b - (1/float(N)) * dl_db * alpha
    return w, b
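As a minimal check (restating both functions so the snippet runs on its own), a single gradient step with a small learning rate should lower the cost on a near-linear toy dataset:

```python
def cost_function(X, Y, w, b):
    N = len(X)
    return sum(((w * X[i] + b) - Y[i]) ** 2 for i in range(N)) / (2 * float(N))

def gradient_descent(X, Y, w, b, alpha):
    dl_dw, dl_db = 0.0, 0.0
    N = len(X)
    for i in range(N):
        dl_dw += -X[i] * (Y[i] - (w * X[i] + b))
        dl_db += -(Y[i] - (w * X[i] + b))
    w = w - (1 / float(N)) * dl_dw * alpha
    b = b - (1 / float(N)) * dl_db * alpha
    return w, b

X = [1.0, 2.0, 3.0, 4.0]
Y = [3.1, 4.9, 7.2, 8.8]  # roughly y = 2x + 1

before = cost_function(X, Y, 0.0, 0.0)
w, b = gradient_descent(X, Y, 0.0, 0.0, 0.01)
after = cost_function(X, Y, w, b)
print(before, ">", after)  # the cost drops after one step
```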

Train function

Gradient descent runs many times during training, once per iteration. The cost decreases over the iterations, and it is good practice to print its value now and then: after a certain point the cost stops changing, or changes by an extremely small amount. Running gradient descent over and over beyond that point is useless, so you can decrease the number of iterations in the next run.

Hence, we combine all these actions into one function, called the train function: setting the number of iterations, choosing how often (in this example, every 400th iteration) to print the value of the cost function, and calling gradient descent.

def train(X, Y, w, b, alpha, n_iter):
    for i in range(n_iter):
        w, b = gradient_descent(X, Y, w, b, alpha)
        if i % 400 == 0:
            print("iteration:", i, "cost:", cost_function(X, Y, w, b))
    return w, b

Predict function

The predict function is the simplest of the functions in Linear Regression. After gradient descent has found w and b, it simply calculates and returns the value of y for a given x.

def predict(x, w, b):
    return x*w + b

How to call functions

As the train function calls gradient descent itself, calling the train function alone is enough:

w, b = train(X, Y, 0.0, 0.0, 0.0001, 7000)

Here,

  • the two 0.0s are the initial values of w and b;
  • 0.0001 is the learning rate (you can increase or decrease it to suit your dataset);
  • 7000 is the number of iterations.
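To illustrate the learning-rate point above, here is a hypothetical experiment reusing the same two functions: a small rate lowers the cost, while a rate that is too large for this data makes the cost blow up instead:

```python
def cost_function(X, Y, w, b):
    N = len(X)
    return sum(((w * X[i] + b) - Y[i]) ** 2 for i in range(N)) / (2 * float(N))

def gradient_descent(X, Y, w, b, alpha):
    dl_dw, dl_db, N = 0.0, 0.0, len(X)
    for i in range(N):
        dl_dw += -X[i] * (Y[i] - (w * X[i] + b))
        dl_db += -(Y[i] - (w * X[i] + b))
    return w - alpha * dl_dw / N, b - alpha * dl_db / N

X = [float(x) for x in range(20)]
Y = [2.0 * x + 1.0 for x in X]

for alpha in (0.001, 0.05):  # a small vs. a too-large learning rate
    w = b = 0.0
    for _ in range(10):
        w, b = gradient_descent(X, Y, w, b, alpha)
    print(alpha, cost_function(X, Y, w, b))  # 0.05 ends up far worse
```

If the printed cost grows from iteration to iteration instead of shrinking, the learning rate is too large for the scale of your inputs.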

After the train function returns values for w and b, you can check the result with the help of the predict function:

x_new = 50.0
y_new = predict(x_new, w, b)
print(y_new)
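Putting it all together, here is a self-contained sketch that trains on synthetic data (y = 2x + 1) instead of train.csv, so the whole pipeline can be verified in one run; the cost printing is dropped for brevity:

```python
def cost_function(X, Y, w, b):
    N = len(X)
    return sum(((w * X[i] + b) - Y[i]) ** 2 for i in range(N)) / (2 * float(N))

def gradient_descent(X, Y, w, b, alpha):
    dl_dw, dl_db, N = 0.0, 0.0, len(X)
    for i in range(N):
        dl_dw += -X[i] * (Y[i] - (w * X[i] + b))
        dl_db += -(Y[i] - (w * X[i] + b))
    return w - alpha * dl_dw / N, b - alpha * dl_db / N

def train(X, Y, w, b, alpha, n_iter):
    for _ in range(n_iter):
        w, b = gradient_descent(X, Y, w, b, alpha)
    return w, b

def predict(x, w, b):
    return x * w + b

X = [float(x) for x in range(20)]
Y = [2.0 * x + 1.0 for x in X]       # perfectly linear data

w, b = train(X, Y, 0.0, 0.0, 0.001, 30000)
print(round(w, 2), round(b, 2))      # close to 2.0 and 1.0
print(predict(50.0, w, b))           # close to 101.0
```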

Summary

To sum up, supervised machine learning offers a broad range of algorithms, and Linear Regression is among the most widely used. In this article, we looked at how to implement it in Python from scratch. Each section of the code was explained one by one, and references for the background were provided from my previous articles.

The next article will be about how to implement Linear Regression using the sklearn library.

References

Thank you.

