Multivariate Linear Regression from Scratch in Python

Pytholabs Research
Feb 1, 2019 · 6 min read


Multivariate Linear Regression

Multiple Linear Regression is Linear Regression where the input has multiple features (variables).

Model Representation

Similar to Simple Linear Regression, we have an input variable (X) and an output variable (Y). But here the input variable has n features. Therefore, we can represent this linear model as follows:

Y = β0 + β1x1 + β2x2 + … + βnxn

xi is the ith feature of the input variable. By introducing x0 = 1, we can rewrite this equation.

Y = β0x0 + β1x1 + β2x2 + … + βnxn

where x0 = 1.

Now we can write this equation in matrix form: Y = Xβ, where X is the m × (n + 1) matrix of input examples (including the leading column of 1s) and β is the vector of coefficients [β0, β1, …, βn].
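As a quick sketch of this representation in NumPy (the numbers below are purely illustrative):

import numpy as np

# 3 examples, 2 features, with a leading column of ones for x0
X = np.array([[1.0, 2.0, 5.0],
              [1.0, 3.0, 1.0],
              [1.0, 0.5, 4.0]])
B = np.array([0.3, 1.2, -0.7])   # β0, β1, β2 (made-up values)

Y_hat = X.dot(B)                 # Y = Xβ, one prediction per example
print(Y_hat)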

Gradient Descent

Gradient Descent is an optimization algorithm. The gradient is the rate of change, or the slope, of the curve; in the figure, the change in Cost (J) between a and b is much larger than between c and d.

Our aim with this algorithm is to find the lowest point of the curve, i.e. the values of β for which Cost (J) is lowest, by descending gradually.

Step 1

Initialize the values β0, β1, …, βn. In this case we will initialize them with 0.
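In NumPy this initialization is just a zero vector of length n + 1 (one entry per coefficient, including β0); here n = 4 is assumed purely for illustration:

import numpy as np

n = 4                 # number of features (assumed for illustration)
B = np.zeros(n + 1)   # β0, β1, ..., βn all start at 0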

Step 2

Pick a value for the learning rate α. The learning rate determines how big the step would be on each iteration.

  • If α is very small, it will take a long time to converge and become computationally expensive.
  • If α is too large, it may overshoot the minimum and fail to converge.

Therefore, plot the cost function against different values of α and pick the value of α just before the first one that fails to converge, so that we have a fast learning algorithm that still converges (see figure 2; a code sketch of this sweep follows the list below).

Figure 2: Gradient descent with different learning rates.
  • The most commonly used rates are: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3.
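A sketch of that sweep, assuming the batch_gradient_descent function and the X_train / y_train arrays built later in the Implementation section, might look like this:

import numpy as np
from matplotlib import pyplot as plt

# Compare cost curves for the commonly used learning rates
alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]

for a in alphas:
    B0 = np.zeros(X_train.shape[1])   # fresh zero coefficients for each run
    _, history = batch_gradient_descent(X_train, y_train, B0, a, 100)
    plt.plot(history, label=f'alpha = {a}')

plt.xlabel('Iteration')
plt.ylabel('Cost J')
plt.legend()
plt.show()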

STEP 3

Make sure to scale the features (X) if they are on very different scales. If we don’t scale the data, the level curves (contours) will be narrower and taller, which means it will take a longer time to converge (see figure 3).

Figure 3: Gradient descent: normalized versus unnormalized level curves.

Scale the data to have μ = 0 and σ = 1. Each feature value is scaled as x := (x − μ) / σ, where μ and σ are that feature’s mean and standard deviation.
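A minimal from-scratch version of this scaling, assuming X is a NumPy array with one column per feature, could be:

import numpy as np

def standardize(X):
    mu = X.mean(axis=0)      # per-feature mean
    sigma = X.std(axis=0)    # per-feature standard deviation
    return (X - mu) / sigma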

STEP 4

Iteratively update each coefficient:

βj := βj − α · ∂J(β)/∂βj   (simultaneously for all j = 0, …, n)

This is the procedure. Here α is the learning rate.

This operation means we are taking the partial derivative of the cost with respect to each βj. This is called the gradient.

In step 4 we change the values of βj in a direction that reduces our cost function, and the gradient gives the direction in which we want to move. Eventually we reach the minimum of our cost function. But we don’t want to change the values of βj drastically, because we might miss the minimum; that’s why we need the learning rate.

The above animation illustrates the Gradient Descent method.

We can’t use the rule above directly in our computational model; first we have to apply some math (differentiating the cost function).

Doing so, we get the final formula to update each parameter βj and reach the optimal values iteratively:
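Concretely, for the MSE cost used in the implementation below, differentiating gives:

J(β) = (1/2m) · Σ (h(x) − y)²

∂J/∂βj = (1/m) · Σ (h(x) − y) · xj

βj := βj − (α/m) · Σ (h(x) − y) · xj

where h(x) = Xβ is the hypothesis and the sums run over all m training examples.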

The algorithm will converge when the slope (gradient) becomes 0, since then the update βj := βj − α · 0 leaves βj unchanged.

We will be using the Batch Gradient Descent algorithm; for details on all the variants see Annex A.

Implementation

STEP 1:

Importing the Dataset (Combined Cycle Power Plant):

import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
%matplotlib inline

data = pd.read_excel('energy_data.xlsx')
print(data.head(5))

We have 4 features (temperature, exhaust vacuum, ambient pressure, relative humidity) and 1 target variable (energy output). Therefore, we have to predict the energy output from the given features.

STEP 2:

# separating data
X = data.iloc[:, :4].values
y = data.iloc[:, -1].values

Scale / Standardize the features:


from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)

STEP 3:

Defining the MSE cost function:

def cost_function(X, Y, B):
    # Mean squared error cost (with the conventional 1/2 factor)
    m = len(Y)
    J = np.sum((X.dot(B) - Y) ** 2) / (2 * m)
    return J

In the Batch Gradient Descent function we take the following parameters:

X: feature matrix

Y: array of target values

B: initial values of the coefficients

alpha: learning rate

iterations: max no. of iterations for the algorithm

def batch_gradient_descent(X, Y, B, alpha, iterations):
    cost_history = [0] * iterations
    m = len(Y)

    for iteration in range(iterations):
        # Hypothesis values
        h = X.dot(B)
        # Difference b/w hypothesis and actual Y
        loss = h - Y
        # Gradient calculation
        gradient = X.T.dot(loss) / m
        # Changing values of B using the gradient
        B = B - alpha * gradient
        # New cost value
        cost = cost_function(X, Y, B)
        cost_history[iteration] = cost

    return B, cost_history

STEP 4:

Splitting training and testing sets:

m = 7000
f = 2

X_train = X[:m, :f]
X_train = np.c_[np.ones(len(X_train), dtype='int64'), X_train]
y_train = y[:m]

X_test = X[m:, :f]
X_test = np.c_[np.ones(len(X_test), dtype='int64'), X_test]
y_test = y[m:]

here, m = the number of samples included in the training set used to train the model; the remaining samples are assigned to the testing set (training = 7000, testing = 2568)

and f = the number of features used (the first 2 features)

STEP 5:

Initializing the coefficients (β0, β1, …, βn) and training the model:

# Initial Coefficients
B = np.zeros(X_train.shape[1])
alpha = 0.005
iter_ = 2000
newB, cost_history = batch_gradient_descent(X_train, y_train, B, alpha, iter_)

here, alpha is the learning rate, set to 0.005, and iter_ is the total number of iterations for gradient descent

newB holds the optimized coefficients (β0, β1, …, βn) after training the model, and cost_history records the cost at each iteration

Now, after training, let’s plot the results:
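A minimal way to plot the recorded cost_history returned above is:

plt.plot(range(iter_), cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost J')
plt.show()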

The cost reduces as the number of iterations increases.

Evaluating / Testing the model:

Let’s make predictions using the test set:

y_ = pred(X_test, newB)
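Note that pred is not defined earlier in the article; a minimal helper consistent with the hypothesis h = Xβ used during training would be:

def pred(X, B):
    # predictions are just the dot product of features and learned coefficients
    return X.dot(B)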

Now check how well it predicts, using the r2 score:

def r2(y_, y):
    # coefficient of determination: 1 - SSR/SST
    sst = np.sum((y - y.mean()) ** 2)
    ssr = np.sum((y_ - y) ** 2)
    r2 = 1 - (ssr / sst)
    return r2

# ----------------
r2(y_, y_test)

The r2 score comes out to be 0.91404488821296 … that’s pretty good!

Single predictions:

ans_ = pred(X_test[3], newB)
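To sanity-check this single prediction (index 3 is just an arbitrary example), we can compare it with the corresponding actual target:

print('predicted :', ans_)
print('actual    :', y_test[3])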
