Linear Regression — Implementation from scratch

Parth Dhameliya
6 min read · Mar 8, 2020


Linear Regression is a supervised learning algorithm. It models a linear function between an independent variable (X) and a dependent variable (y), and is one of the simplest algorithms in machine learning.

Let's walk through the training process step by step. To understand Linear Regression, consider an example: a house-price data-set in which we need to predict the price of a house from its size.

Model Representation:

Input variable or feature (X) : size of house (sq. feet)

Output Variable or Target (Y) : Price of House

m : Number of Training examples

n : Number of features

X(i) : the i-th training example; for e.g., X(1) is 2104 and X(2) is 1416, and similarly for y(i)

Training Data-set of house price prediction

Hypothesis or Prediction function: We will denote the prediction function as H(Θ).

H(Θ) = Θ0*X0(i) + Θ1*X1(i)

Here X0(i)=1.

Rather than computing H(Θ) = Θ0*X0(i) + Θ1*X1(i) in a loop over training examples, we are going to use vectorization, so the computation becomes:

H(Θ) = X * Θ

Overall Matrix Conversion of data-set into features and target

Here matrix X contains the feature X1 = size in sq. feet, and Y contains the target variable, price ($) in 1000's. X0 is a vector of ones.

X*theta

where X has dimension (m, n+1) and theta has dimension (n+1, 1), n being the number of features. Here the number of features is 1, i.e., size in sq. feet, and after including the column vector of ones (X0) the dimension of X becomes (m, n+1). A small sketch of this vectorized computation is shown below.
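As a minimal sketch (the house sizes and theta values below are made up, purely for illustration), the loop form and the matrix form give identical results:

import numpy as np

sizes = np.array([2104., 1416., 1534.])                  # hypothetical sizes in sq-feet, m = 3
m = sizes.shape[0]
X = np.hstack((np.ones((m,1)), sizes.reshape(-1,1)))     # shape (m, n+1), with X0 = 1
theta = np.array([[0.5],[0.1]])                          # shape (n+1, 1), arbitrary values

# loop version: H(theta) = theta0*X0(i) + theta1*X1(i) for each example
h_loop = np.array([[theta[0,0]*X[i,0] + theta[1,0]*X[i,1]] for i in range(m)])

# vectorized version: H(theta) = X * theta
h_vec = X @ theta

print(np.allclose(h_loop, h_vec))                        # True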

Θ is initially set to zero; we want to find values of theta such that H(Θ), i.e., the predicted y, has minimum error with respect to the actual y.

If we have more than one feature, e.g., size of house and no. of bedrooms, to predict the price of the house, then our matrix will be [X0 X1 X2], where X0 = ones vector, X1 = size of house, and X2 = no. of bedrooms, as sketched below.
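A minimal sketch of building such a matrix (the sizes and bedroom counts are made up, only to illustrate the shape):

import numpy as np

size = np.array([2104., 1416., 1534.])              # X1: size of house (hypothetical values)
bedrooms = np.array([3., 2., 3.])                   # X2: no. of bedrooms (hypothetical values)
m = size.shape[0]
X = np.column_stack((np.ones(m), size, bedrooms))   # [X0 X1 X2] with X0 = ones vector
print(X.shape)                                      # (3, 3), i.e. (m, n+1) with n = 2 features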

We are going to find the values of theta using the gradient descent algorithm.

Before getting into gradient descent, it is important to understand the cost function, as it is used within gradient descent.

Cost Function: To measure the error between the actual y and the predicted y (H(Θ)), we need an error metric, i.e., the cost function J(Θ).

J(Θ) = (1/(2m)) * sum( (H(Θ) - y)² )

m : Number of Training examples (rows)

H(Θ) : predicted y
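As a quick worked example with made-up numbers: if m = 2, the predictions are H(Θ) = [2, 3] and the actual values are y = [1, 5], then J(Θ) = (1/(2*2)) * ((2-1)² + (3-5)²) = (1/4) * (1 + 4) = 1.25.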

Gradient Descent: So we have our hypothesis function and we have a way of measuring how well it fits the data. Now we need to estimate the parameters Θ in the hypothesis function. That's where gradient descent comes in.

We update Θ using the rule given below:

Θ = Θ -alpha*grad

alpha : Learning rate

grad : (1/m) * ((H(Θ)-y).T*(X)).T

By doing partial derivative of J(Θ) w.r.t Θ :

grad = ∂J(Θ) ∕∂Θ = (1/m) * ((H(Θ)-y).T*(X)).T

T = Transpose of matrix
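Note on dimensions: (H(Θ)-y) has shape (m, 1) and X has shape (m, n+1), so ((H(Θ)-y).T * X).T has shape (n+1, 1), the same shape as Θ, which is what makes the update Θ = Θ - alpha*grad well defined.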

We repeat (Θ = Θ - alpha*grad) until convergence, i.e., until the cost J(Θ) reaches its minimum.

J(theta) vs theta

alpha: alpha is a hyper-parameter called the learning rate. The learning rate should be neither too small nor too large. If alpha is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge. If alpha is too small, gradient descent can be slow. Usually, trying different values of alpha such as [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3] gives an idea of which one is appropriate, as in the sketch below.
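A minimal sketch of this kind of search, using synthetic data (the data-generating line and the candidate alphas here are arbitrary choices, just to illustrate the idea):

import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 2, size=(m,1))
y = 2 + 3*x + rng.normal(0, 0.5, size=(m,1))   # hypothetical data: y = 2 + 3x + noise
X = np.hstack((np.ones((m,1)), x))             # add X0 = ones column

def final_cost(alpha, iterations=200):
    # run gradient descent with the given learning rate and return the final cost J(theta)
    theta = np.zeros((2,1))
    for _ in range(iterations):
        h = X @ theta
        g = (1/m) * (X.T @ (h - y))
        theta = theta - alpha*g
    h = X @ theta
    return float((1/(2*m)) * np.sum((h - y)**2))

for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    print(alpha, final_cost(alpha))            # pick the alpha that drives the cost down fastest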

If we want to know whether the algorithm is working, we simply plot J(Θ) vs. the number of iterations.

If the cost decreases with the iterations, then our gradient descent algorithm is working fine.

Now we are going to implement it on a real-world data-set: cab-fare data, where we predict fare_amount using distance and passenger_count. For easy implementation we use a cleaned version of the data. So far we have seen one feature; in this implementation we are going to use two features.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Importing the necessary packages: pandas for loading data, numpy for numerical computation, and matplotlib for plotting and visualization.

df=pd.read_csv("cab_fare.csv")
df.head()

Now splitting into features (X) and target (Y). Here the features are distance and passenger_count, and the target is fare_amount.

X=df.drop(['fare_amount'],axis=1)
Y=df['fare_amount']

After splitting, I am going to convert them into matrices.

X = np.matrix(X)
Y = np.matrix(Y).T
print(X.shape)
print(Y.shape)

Output:
(16049, 2)
(16049, 1)

Adding X0, i.e., a column of ones, to the X matrix:

m, n = X.shape
X0 = np.ones((m,1))
X = np.hstack((X0,X))
print(X)
print(X.shape)

Output:
matrix([[1.        , 5.39388167, 1.        ],
        [1.        , 0.01413355, 6.        ],
        [1.        , 5.52371893, 1.        ],
        ...,
        [1.        , 1.61797653, 1.        ],
        [1.        , 2.46648061, 1.        ],
        [1.        , 1.49828041, 1.        ]])
(16049, 3)

Now initializing theta as zeros with dimension (n+1, 1):

theta = np.zeros((n+1,1))
print(theta)
theta.shape

Output:
[[0.]
 [0.]
 [0.]]
(3, 1)

Now creating a prediction function for H(theta) = X*theta

def predict(X,theta):
    # hypothesis: H(theta) = X * theta
    h_theta = X*theta
    return h_theta

predict(X,theta)

Output:
matrix([[0.],
        [0.],
        [0.],
        ...,
        [0.],
        [0.],
        [0.]])

Now creating a cost function to compute the cost, i.e.,

J(θ) = 1/2m *sum( (H(Θ)-y)² )

def ComputeCost(h_theta,Y):
    m = Y.shape[0]
    J = (1/(2*m))*np.dot((h_theta-Y).T,(h_theta-Y))
    return J

h_theta = predict(X,theta)   # compute the predictions first, since ComputeCost needs them
ComputeCost(h_theta,Y)

Output:
matrix([[107.22753437]])

Now creating a grad function to compute the gradient for gradient descent, i.e.,

grad = (1/m) * ((H(Θ)-y).T*(X)).T

def grad(X,h_theta,Y):
    m = Y.shape[0]
    grad = (1/m)*((h_theta-Y).T*X).T
    return grad

grad(X,h_theta,Y)

Output:
matrix([[-11.26354165],
        [-68.00912846],
        [-18.66983613]])

Now we are going to set the hyper-parameters for running gradient descent

learning_rate = 0.003
iterations = 100
j_history = []   # j_history is used to save the cost per iteration

Having defined all the functions above, we are now ready to run gradient descent. We are going to run it for 100 epochs (iterations) with a learning rate of 0.003.

for i in range(iterations):
    h_theta = predict(X,theta)
    cost = ComputeCost(h_theta,Y)
    g = grad(X,h_theta,Y)                    # gradient from the function defined above
    theta = theta - learning_rate*g
    j_history = np.append(j_history,cost)    # save the cost for this iteration
    print("Epoch :",i,"Cost :",cost)

Output:
Epoch : 0 Cost : [[107.22753437]]
Epoch : 1 Cost : [[92.53213508]]
Epoch : 2 Cost : [[80.07457421]]
Epoch : 3 Cost : [[69.51396671]]
Epoch : 4 Cost : [[60.56135501]]
Epoch : 5 Cost : [[52.97179875]]
Epoch : 6 Cost : [[46.53766965]]
Epoch : 7 Cost : [[41.0829676]]
Epoch : 8 Cost : [[36.45850279]]
Epoch : 9 Cost : [[32.53781157]]
Epoch : 10 Cost : [[29.21369461]]
.................................
.................................
.................................
Epoch : 90 Cost : [[10.48357878]]
Epoch : 91 Cost : [[10.48127383]]
Epoch : 92 Cost : [[10.47898373]]
Epoch : 93 Cost : [[10.47670814]]
Epoch : 94 Cost : [[10.47444675]]
Epoch : 95 Cost : [[10.47219927]]
Epoch : 96 Cost : [[10.4699654]]
Epoch : 97 Cost : [[10.46774491]]
Epoch : 98 Cost : [[10.46553753]]
Epoch : 99 Cost : [[10.46334303]]

As we can see, the cost decreases with each iteration. Since we saved the cost per iteration in j_history, we can use it for visualization.

x = np.linspace(0,iterations,iterations)
plt.plot(x,j_history,color='r')
plt.xlabel('No. of iterations')
plt.ylabel('cost function')
plt.title('decreasing of cost function')

prediction = predict(X,theta)
prediction

Output:
matrix([[15.01125407],
        [ 5.49539759],
        [15.33739383],
        ...,
        [ 5.52651334],
        [ 7.65788078],
        [ 5.22584724]])

error = ComputeCost(prediction,Y)
error

Output:
matrix([[10.46116118]])

print("RMSE of Multi-linearRegression : ",np.sqrt((np.square(prediction-Y)).mean()))   # RMSE metric

Output:
RMSE of Multi-linearRegression :  4.574092516603985
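As a quick usage sketch (the distance and passenger count below are made-up values, only to show how the learned theta would be applied to a new example):

x_new = np.matrix([[1., 3.2, 2.]])    # X0 = 1, distance = 3.2, passenger_count = 2 (hypothetical)
fare_estimate = predict(x_new, theta)
print(fare_estimate)                  # predicted fare_amount for this trip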

Now we can conclude with a simple process diagram, as follows:

To learn about logistic regression, you can follow the link given below:

https://medium.com/@pdhameliya3333/logistic-regression-implementation-from-scratch-3dab8cf134a8
