# Multivariate Linear Regression from scratch

The following code implements Multivariate Linear Regression from scratch, building on where we left off in Simple Linear Regression (https://medium.com/@surajsubramanian2000/simple-linear-regression-from-scratch-9e073185cdab).

Representing our hypothesis as a function of multiple features makes the model much more expressive, which is why multivariate Linear Regression is such a widely used algorithm.

**Importing necessary packages**

**numpy** for representing vectors, **pandas** for reading csv files, and **sklearn** for splitting the dataset into train and test sets, preprocessing, label encoding and computing errors.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
```

Next we load the csv file and delete columns that are not needed. The dataset can be downloaded from https://github.com/SurajSubramanian/MachineLearning/blob/master/ML-Programs/MultivariateLinearRegression/FuelConsumptionCo2.csv

`df = pd.read_csv("FuelConsumptionCo2.csv")`

We’ll start by examining the dataset:

`df['MODELYEAR'].unique() # returns array([2014])`

There is only one MODELYEAR in the data, so it carries no information and makes no sense as a parameter. We’ll drop it:

`df = df.drop(['MODELYEAR'], axis=1)`

We use the labelencoder to assign unique integer values to all the categorical columns, since we can't operate on string-valued variables.

```python
df['MAKE'], df['MODEL'], df['VEHICLECLASS'], df['TRANSMISSION'], df['FUELTYPE'] = \
    labelencoder.fit_transform(df['MAKE']), labelencoder.fit_transform(df['MODEL']), \
    labelencoder.fit_transform(df['VEHICLECLASS']), labelencoder.fit_transform(df['TRANSMISSION']), \
    labelencoder.fit_transform(df['FUELTYPE'])
```

Next, we filter out the variables that have low correlation with the target variable, CO2EMISSIONS.

```python
cor = df.corr()
cor_target = abs(cor["CO2EMISSIONS"])

# Selecting highly correlated features
relevant_features = cor_target[cor_target > 0.5]
relevant_features
```

Output:

```
ENGINESIZE                  0.874154
CYLINDERS                   0.849685
FUELCONSUMPTION_CITY        0.898039
FUELCONSUMPTION_HWY         0.861748
FUELCONSUMPTION_COMB        0.892129
FUELCONSUMPTION_COMB_MPG    0.906394
CO2EMISSIONS                1.000000
Name: CO2EMISSIONS, dtype: float64
```

Dropping the columns with low correlation, i.e. those whose correlation with the target is below 0.5:

```python
df = df.drop(['MAKE', 'MODEL', 'VEHICLECLASS', 'FUELTYPE', 'TRANSMISSION'], axis=1)
df.head()
```

We use min-max normalization to bring every feature into the same range. This step is really important: if you skip it and run the remaining cells, you will get an absurdly large value for the MSE.
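The normalization snippet itself isn't shown above, so here is a minimal sketch of min-max scaling, assuming every remaining column of the dataframe is numeric (the helper name `min_max_normalize` is mine, not from the original notebook):

```python
import pandas as pd

def min_max_normalize(frame):
    # Rescale each column to [0, 1]: (x - min) / (max - min)
    return (frame - frame.min()) / (frame.max() - frame.min())

# Tiny demo frame standing in for the real dataset
demo = pd.DataFrame({"ENGINESIZE": [2.0, 3.5, 5.0],
                     "CYLINDERS": [4, 6, 8]})
scaled = min_max_normalize(demo)
print(scaled["ENGINESIZE"].tolist())  # → [0.0, 0.5, 1.0]
```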

`y,x = df["CO2EMISSIONS"], df.drop(["CO2EMISSIONS"], axis=1)`

Splitting the dataset into train and test sets:

```python
x_, y_ = x.to_numpy(), y.to_numpy()

# The column of ones for the intercept is prepended when the design
# matrix X is built below, so we don't need to add it here as well.
x_train, x_test, y_train, y_test = train_test_split(x_, y_, test_size=0.2, random_state=21)
```

Setting up the learning rate, the design matrix, the parameter vector and the initial cost before the training loop:

```python
a = 0.01  # learning rate

# Build the design matrix X: prepend a 1 to every row for the intercept term
X = []
for row in x_train:
    r = [1]
    for item in row:
        r.append(item)
    X.append(r)
X = np.asmatrix(X)

theta = np.zeros((X[0].size, 1))
Y = y_train.reshape(-1, 1)

h = np.dot(X, theta)
h.shape  # returns (853, 1)

cost = np.sum(np.dot(np.transpose(h - Y), (h - Y))) * (1 / (2 * X.shape[0]))
temp = np.zeros(theta.shape)
```

The gradientDescent function and the subsequent code are similar to what we used for Simple Linear Regression, except that here theta is a vector of parameters rather than a single value.

One thing I usually do is examine and play with the shapes of all the matrices and arrays to see how they can be combined to produce the required resultant matrix.
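For reference, the vectorized quantities the code computes can be written compactly, with $m$ training rows and learning rate $\alpha$ (the variable `a`):

```latex
h = X\theta , \qquad
J(\theta) = \frac{1}{2m} \, (X\theta - y)^\top (X\theta - y) , \qquad
\theta \leftarrow \theta - \frac{\alpha}{m} \, X^\top (X\theta - y)
```

Each term matches a line of the code: the hypothesis `h`, the `cost`, and the update stored in `temp`.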

```python
def gradientDescent(theta, X):
    # Y and a (the learning rate) are taken from the enclosing scope
    h = np.dot(X, theta)
    cost = np.sum(np.sum((h - Y)**2)) * (1 / (2 * X.shape[0]))
    temp = theta - np.dot(X.T, h - Y) * (a / X.shape[0])
    theta = temp
    return (theta, X, cost)

theta = np.ones(theta.shape)

for i in range(0, 10000):
    (theta, X, cost) = gradientDescent(theta, X)
    if i % 1000 == 0:
        print(cost)

print(theta)
```

Predicting the y values for the test dataset and evaluating our model with the MSE function:

```python
X_test = []
for row in x_test:
    r = [1]
    for item in row:
        r.append(item)
    X_test.append(r)

mean_squared_error(np.dot(X_test, theta), y_test) # returns 0.05530668773441035
```

You can get the complete code from my github repository https://github.com/SurajSubramanian/MachineLearning/tree/master/ML-Programs/MultivariateLinearRegression

You can achieve even better results using the built-in sklearn implementation!
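A sketch of that sklearn route, shown on synthetic data to keep the example self-contained (in the real notebook you would fit on `x_train`/`y_train` instead):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the fuel-consumption data: y = 3*x0 + 2*x1 + 5
rng = np.random.default_rng(0)
features = rng.random((200, 2))
target = 3 * features[:, 0] + 2 * features[:, 1] + 5

model = LinearRegression()  # fits the intercept by default, no manual ones column
model.fit(features, target)

mse = mean_squared_error(target, model.predict(features))
# With noiseless synthetic data the fit is essentially exact
print(model.coef_, model.intercept_, mse)
```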

Thanks for reading :)