Simple Linear Regression Using Least Squares From Scratch

Dhamodaran Babu · Published in Analytics Vidhya · Aug 25, 2020 · 5 min read

‘We all walk before we run,’ and the same holds here: the basics come before the big things. Simple linear regression is one of the very basics of machine learning. In this post, we will implement linear regression from scratch using the statistical technique of least squares.

Introduction to Simple Linear Regression

Simple linear regression is a method for representing the relationship between a dependent variable (Y) and a single independent variable (X). In machine learning terms, it is expressed as y = wx + b, where w is the weight of the feature x and b is the bias. In mathematics, the same equation is written as y = mx + c, with slope m and intercept c.
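In code, the model is nothing more than a line. Here is a minimal sketch with made-up values for the weight and bias, purely for illustration:

#A minimal sketch of the prediction rule (illustrative values only)
w, b = 0.5, 1.0    # weight (slope) and bias (intercept), normally learned from data
x = 10.0           # a single input value
y_hat = w * x + b  # prediction: 6.0
print(y_hat)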

Now we are going to implement this from scratch with the least squares methodology, using Python. The following libraries will be used:

  • numpy
  • pandas
  • matplotlib

In this example, we will predict the percentage of savings an employee makes from their income, based on their experience (in months) at a company. You can find the complete code and the dataset we used in my GitHub repo.

The data should be preprocessed before fitting the model; the preprocessing is also implemented from scratch and explained below.

#Reading the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_excel('Experience vs savings %.xlsx')
dataset.plot(x='Experience(in months)', y='% of savings from income',
             kind='scatter', title='Experience VS % of Savings')

We have imported the required packages and read the dataset, an Excel spreadsheet, using pandas' read_excel() function. The dataset is stored as a DataFrame in the variable dataset. We then make a scatter plot to get an intuition or overview of the data.
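Before going further, it can also help to peek at the raw data. This quick inspection step is not in the original post, but it uses only standard pandas calls on the DataFrame loaded above:

#Optional: quick inspection of the loaded DataFrame
print(dataset.shape)       # (rows, columns)
print(dataset.head())      # first five rows
print(dataset.describe())  # summary statistics per column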

#Checking for null values
df = dataset.copy()
print("Test for null values in the dataset : {}".format(df.isnull().values.any()))
dependent_var = df.iloc[:, 1].values
independent_var = df.iloc[:, 0].values

#Output:
Test for null values in the dataset : False

In this section, a copy of the original dataset DataFrame is created and checked for the presence of null values. In our case, the dataset did not have any. Finally, the data is split into the dependent variable and the independent variable.

Min-Max Scaling

The data is scaled using the min-max scaling technique, which maps the highest value to 1, the lowest value to 0, and every value in between proportionally into the range (0, 1): x_scaled = (x − x_min) / (x_max − x_min). Scaling keeps the features in comparable, well-behaved numeric ranges, which helps keep the model's calculations fast and stable.
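A toy array makes the mapping concrete (a quick illustration, not from the original post):

#Min-max scaling on a toy array
import numpy as np

data = np.array([2.0, 4.0, 10.0])
scaled = (data - data.min()) / (data.max() - data.min())
print(scaled)  # [0.   0.25 1.  ] -> min maps to 0, max to 1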

#Scaling the data using the min-max scaling technique
class Scaler:
    def __init__(self):
        self.min = None
        self.max = None

    def scale(self, data):
        # Remember min/max from the first call so that any later data
        # is scaled with the same parameters.
        if self.min is None and self.max is None:
            self.min = data.min()
            self.max = data.max()
        return (data - self.min) / (self.max - self.min)

    def reverse_scaling(self, data):
        # Map scaled values back to the original range.
        return (data * (self.max - self.min)) + self.min

xscaler = Scaler()
yscaler = Scaler()
x = xscaler.scale(independent_var)
y = yscaler.scale(dependent_var)

In this snippet of code, a Scaler class is defined with an initialiser and two methods, scale and reverse_scaling. The scale() method scales the data to the range (0, 1) using the min-max scaling technique, while reverse_scaling() converts scaled values back to their original range. The independent_var and dependent_var arrays are scaled and stored as x and y.
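A small round-trip check, assuming the variables defined above, confirms that reverse_scaling undoes scale (this check is my addition, not part of the original post):

#Sanity check: scaling maps into [0, 1] and is reversible
print(x.min(), x.max())                                          # 0.0 1.0
print(np.allclose(xscaler.reverse_scaling(x), independent_var))  # True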

#Splitting the dataset into train and test sets
def splitter(x, y, train_size=0.75, seed=None):
    np.random.seed(seed)
    # Pair up x and y so they are shuffled together.
    data = np.concatenate([x.reshape(-1, 1), y.reshape(-1, 1)], axis=1)
    np.random.shuffle(data)
    split = int(len(data) * train_size)
    xtrain = data[:split, 0]
    ytrain = data[:split, 1]
    xtest = data[split:, 0]
    ytest = data[split:, 1]
    return xtrain, ytrain, xtest, ytest

xtrain, ytrain, xtest, ytest = splitter(x, y, train_size=0.85, seed=101)

The splitter() function randomly shuffles the whole dataset and splits it into train and test sets of the desired proportion; the seed argument makes the split reproducible.
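A quick check (not in the original post) that the split behaves as expected:

#With train_size=0.85, roughly 85% of the samples land in the train set
print(len(xtrain), len(xtest))
assert len(xtrain) + len(xtest) == len(x)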

#Method of least squares
def least_squares(x, y):
    xmean = x.mean()
    ymean = y.mean()
    # Closed-form solution for the slope and intercept.
    num = ((x - xmean) * (y - ymean)).sum(axis=0)
    den = ((x - xmean) ** 2).sum(axis=0)
    weight = num / den
    bias = ymean - (weight * xmean)
    return weight, bias

def predict(x, weight, bias):
    # predict() is called below but was missing from the original
    # snippet; it simply evaluates the fitted line y = wx + b.
    return (weight * x) + bias

weight, bias = least_squares(xtrain, ytrain)
print("weight :{} , bias : {}".format(weight, bias))
ypred = yscaler.reverse_scaling(predict(xtest, weight, bias))
ytrue = yscaler.reverse_scaling(ytest)

#Output:
weight :0.8931 , bias : 0.0818
m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad c = \bar{y} - m\,\bar{x}

Method of Least Squares

The above code snippet implements the method of least squares using the above mathematical expression, where m is the weight (also called the slope) and c is the bias (also called the intercept). The training x and y data are used to find the weight and bias of the regression equation, and the model then makes predictions on the test set so that its performance can be evaluated.
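As a sanity check (my addition, not part of the original post), NumPy's built-in np.polyfit with degree 1 solves the same least squares problem and should agree with our result up to floating-point error:

#Cross-checking the closed-form fit against NumPy
w_np, b_np = np.polyfit(xtrain, ytrain, 1)
print("numpy weight : {:.4f} , bias : {:.4f}".format(w_np, b_np))
#Expected to match: weight : 0.8931 , bias : 0.0818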

Root Mean Square Error and R-squared Value (Model Performance Metrics)

The Root Mean Square Error (RMSE) measures how far, on average, the actual values lie from the predicted values on the fitted regression line. The square of the RMSE gives the Mean Squared Error (MSE); for zero-mean errors, the RMSE can be interpreted as the standard deviation of the prediction errors and the MSE as their variance. The R-squared value is the proportion of the dataset's total variance that is explained by the model.
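For reference, the two headline metrics computed below are defined as:

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}, \qquad R^2 = 1-\frac{\sum_i \left(y_i-\hat{y}_i\right)^2}{\sum_i \left(y_i-\bar{y}\right)^2}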

#Model performance metrics
def mse(true, pred):
    return np.mean((pred - true) ** 2)

def rmse(true, pred):
    return mse(true, pred) ** 0.5

def mae(true, pred):
    return np.mean(abs(pred - true))

def r_squared(true, pred):
    # 1 - (residual sum of squares / total sum of squares)
    true_mean = true.mean()
    tot = ((true - true_mean) ** 2).sum(axis=0)
    obs = ((true - pred) ** 2).sum(axis=0)
    return 1 - (obs / tot)

print("MSE : ", mse(ytrue, ypred))
print("RMSE : ", rmse(ytrue, ypred))
print("MAE : ", mae(ytrue, ypred))
print("R-squared Value", r_squared(ytrue, ypred))

#Output:
MSE : 0.02091
RMSE : 0.1446
MAE : 0.1197
R-squared Value 0.9174

The model we fitted performs quite well. Its RMSE of 0.1446 means that, on average, the actual values lie about 0.1446 units away from the predicted values, and the R-squared value of 91.74% indicates that the model explains 91.74% of the total variability in the dataset. The following graph shows the least squares regression line fitted to the data.

[Figure: Fitted Least Squares Regression Model]
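The original figure is not reproduced here, but a minimal sketch of how it could be recreated, assuming the variables defined earlier, looks like this:

#Sketch: plotting the data and the fitted line in original units
line_x = np.linspace(x.min(), x.max(), 100)  # inputs in scaled space
line_y = weight * line_x + bias              # predictions in scaled space
plt.scatter(xscaler.reverse_scaling(x), yscaler.reverse_scaling(y), label='Data')
plt.plot(xscaler.reverse_scaling(line_x), yscaler.reverse_scaling(line_y),
         color='red', label='Fitted line')
plt.xlabel('Experience (in months)')
plt.ylabel('% of savings from income')
plt.title('Fitted Least Squares Regression Model')
plt.legend()
plt.show()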
