Multivariate Linear Regression from scratch

The following code implements multivariate linear regression from scratch, building on where we left off in Simple Linear Regression (https://medium.com/@surajsubramanian2000/simple-linear-regression-from-scratch-9e073185cdab).

Real datasets rarely have a single explanatory variable, so representing our hypothesis as a function of multiple features is both powerful and often necessary. That is why multivariate linear regression is so widely used.
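Concretely, the hypothesis becomes h(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ, or simply h = Xθ in matrix form once we prepend a column of ones to X for the bias term θ₀. This matrix form is what the code below works with.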

Importing necessary packages

  • numpy: for representing vectors and matrices,
  • pandas: for reading csv files, and
  • sklearn: for splitting the dataset into train and test sets, preprocessing (label encoding), and computing errors.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()

Next, we load the csv file and drop the columns that are not needed. The dataset can be downloaded from https://github.com/SurajSubramanian/MachineLearning/blob/master/ML-Programs/MultivariateLinearRegression/FuelConsumptionCo2.csv

df = pd.read_csv("FuelConsumptionCo2.csv")

We’ll take a quick look at the first few rows of the dataset:
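df.head()

Next, let's check how many distinct model years the data covers: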
df['MODELYEAR'].unique() # returns array([2014])

There is only one MODELYEAR value in the entire dataset, so it carries no information and makes no sense to include as a feature. We'll drop it:

df = df.drop(['MODELYEAR'], axis=1)

We use the labelencoder to assign a unique integer to each category in the categorical columns, since we can't do arithmetic on string-valued variables.

for col in ['MAKE', 'MODEL', 'VEHICLECLASS', 'TRANSMISSION', 'FUELTYPE']:
    df[col] = labelencoder.fit_transform(df[col])

Next, we filter out the features that have low correlation with the target variable, CO2EMISSIONS:

cor = df.corr()
cor_target = abs(cor["CO2EMISSIONS"])
# Selecting features highly correlated with the target
relevant_features = cor_target[cor_target > 0.5]
relevant_features
# Output:
ENGINESIZE 0.874154
CYLINDERS 0.849685
FUELCONSUMPTION_CITY 0.898039
FUELCONSUMPTION_HWY 0.861748
FUELCONSUMPTION_COMB 0.892129
FUELCONSUMPTION_COMB_MPG 0.906394
CO2EMISSIONS 1.000000
Name: CO2EMISSIONS, dtype: float64

Dropping the columns with low correlation, i.e. the columns whose correlation with CO2EMISSIONS is below 0.5:

df = df.drop(['MAKE', 'MODEL', 'VEHICLECLASS', 'FUELTYPE', 'TRANSMISSION'], axis=1)
df.head()

We then apply min-max normalization, rescaling every column to the range [0, 1]. This step is really important: if you skip it and run the remaining cells, gradient descent will struggle to converge and you'll get an absurd value for the MSE.
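The normalization cell itself was embedded as a gist in the original post; a minimal equivalent, assuming every column (including the target CO2EMISSIONS) is rescaled by its minimum and maximum, is:

# Min-max normalization: rescale each column to the range [0, 1]
df = (df - df.min()) / (df.max() - df.min())

Since the target is normalized along with the features, the MSE we report at the end is a fraction of the target's range rather than an error in the original CO2EMISSIONS units.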

Finally, we separate the features from the target:

y, x = df["CO2EMISSIONS"], df.drop(["CO2EMISSIONS"], axis=1)

Splitting the dataset into train and test sets

x_, y_ = x.to_numpy(), y.to_numpy()
# Prepend a column of ones to serve as the bias (intercept) feature
x_ = np.append(arr=np.ones((len(x_), 1)), values=x_, axis=1)

x_train, x_test, y_train, y_test = train_test_split(x_, y_, test_size=0.2, random_state=21)

Implementing the training loop

a = 0.01  # learning rate

# x_train already includes the bias column of ones, so we can use it directly
X = np.asarray(x_train)
theta = np.zeros((X.shape[1], 1))
Y = y_train.reshape(-1, 1)

h = np.dot(X, theta)
h.shape # returns (853, 1)
cost = np.sum(np.dot(np.transpose(h-Y), (h-Y))) * (1/(2*X.shape[0]))

The gradientDescent function and the subsequent code are similar to the ones we used for Simple Linear Regression, except that here we represent theta as an array of weights rather than individual scalars.
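Concretely, each iteration computes the predictions h = Xθ and the cost J(θ) = (1/2m)·Σ(h − y)², then updates every weight simultaneously via θ := θ − (α/m)·Xᵀ(h − y), where m is the number of training examples and α is the learning rate a defined above.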

One thing I usually do is examine and play with the shapes of all the matrices and arrays involved, and work out how they can be combined to produce the required resultant matrix.
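For example, with the variables defined above you can verify that everything combines into the shapes the update rule needs:

print(X.shape, theta.shape, Y.shape)
# (853, 7) (7, 1) (853, 1): X·theta is (853, 1), matching Y
print(np.dot(X.T, np.dot(X, theta) - Y).shape)
# (7, 1): the gradient has the same shape as theta, as it must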

def gradientDescent(theta, X):
    h = np.dot(X, theta)
    cost = np.sum(np.sum((h-Y)**2)) * (1/(2*X.shape[0]))
    temp = theta - np.dot(X.T, h-Y) * (a/X.shape[0])
    theta = temp
    return (theta, X, cost)

theta = np.ones(theta.shape)  # re-initialize the weights before training
for i in range(0, 10000):
    (theta, X, cost) = gradientDescent(theta, X)
    if i % 1000 == 0:
        print(cost)
print(theta)

Predicting the y values for the test dataset and evaluating our model with the MSE function:

X_test = np.asarray(x_test)  # x_test already includes the bias column
mean_squared_error(np.dot(X_test, theta), y_test) # returns 0.05530668773441035

You can get the complete code from my github repository https://github.com/SurajSubramanian/MachineLearning/tree/master/ML-Programs/MultivariateLinearRegression

You can achieve even better results using scikit-learn's built-in LinearRegression!
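As a point of comparison, here is a minimal sketch using scikit-learn's LinearRegression on the same split (it fits its own intercept, so the bias column already present in x_train is redundant but harmless):

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train, y_train)
mean_squared_error(model.predict(x_test), y_test)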

Thanks for reading :)
