Multivariate Linear Regression from scratch

The following code implements Multivariate Linear Regression from scratch, building on where we left off in Simple Linear Regression.

Representing our hypothesis as a function of multiple features is both powerful and often necessary, which is why multivariate Linear Regression is so widely used.

Importing necessary packages

  • numpy: for representing vectors,
  • pandas: for reading CSV files, and
  • sklearn: for splitting the dataset into train and test sets, label encoding, and computing errors.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()

Next we load the csv file and delete columns that are not needed. The dataset can be downloaded from

df = pd.read_csv("FuelConsumptionCo2.csv")

We’ll view the dataset

Gist of the dataset
df['MODELYEAR'].unique() # returns array([2014])

There is only one MODELYEAR value, so it makes no sense to include it as a feature. We'll drop it:

df = df.drop(['MODELYEAR'], axis=1)

We use labelencoder to assign unique integer values to each categorical column, since we can't operate on variables that are strings.

for col in ['MAKE', 'MODEL', 'VEHICLECLASS', 'TRANSMISSION', 'FUELTYPE']:
    df[col] = labelencoder.fit_transform(df[col])

Next, we filter out the variables that have less correlation with the target parameter — CO2EMISSIONS

cor = df.corr()
cor_target = abs(cor["CO2EMISSIONS"])
# Selecting highly correlated features
relevant_features = cor_target[cor_target > 0.5]
print(relevant_features)
# Output (truncated):
# CYLINDERS    0.849685
# Name: CO2EMISSIONS, dtype: float64
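To see what this filter does, here is a toy example with made-up column names (not the real dataset):

```python
import pandas as pd

# Toy frame: 'a' tracks the target exactly, 'b' is noise
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [5, 1, 4, 2],
    "target": [2, 4, 6, 8],
})
# Absolute correlation of every column with the target
cor_target = abs(toy.corr()["target"])
relevant = cor_target[cor_target > 0.5]
print(relevant.index.tolist())  # ['a', 'target'] — 'b' is filtered out
```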

Dropping the columns with low correlation, i.e. the columns with correlation below 0.5:

df = df.drop(['MAKE', 'MODEL', 'VEHICLECLASS', 'FUELTYPE', 'TRANSMISSION'], axis=1)
df.head()

We use min-max normalization. This step is really important: you could skip it and still run all the other cells, but you would get an absurdly large MSE.
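The normalization cell itself isn't reproduced here; a minimal sketch of min-max scaling over a DataFrame (toy values, not the real dataset) could look like:

```python
import pandas as pd

toy = pd.DataFrame({"ENGINESIZE": [1.5, 2.0, 3.5], "CYLINDERS": [4, 4, 6]})
# Min-max normalization: rescale every column to the [0, 1] range
toy = (toy - toy.min()) / (toy.max() - toy.min())
print(toy["ENGINESIZE"].tolist())  # [0.0, 0.25, 1.0]
```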

Gist of dataset after preprocessing
y,x = df["CO2EMISSIONS"], df.drop(["CO2EMISSIONS"], axis=1)

Splitting the dataset into train and test sets

x_, y_ = x.to_numpy(), y.to_numpy()
# Prepend a column of ones so the first entry of theta acts as the intercept
x_ = np.append(arr = np.ones((len(x_), 1)).astype(int), values = x_, axis = 1)

x_train,x_test,y_train,y_test = train_test_split(x_,y_, test_size = 0.2, random_state=21)
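As a quick sanity check with made-up numbers, the ones column is prepended like this:

```python
import numpy as np

toy = np.array([[2.0, 4.0], [3.5, 6.0]])
# Prepend a column of ones; each row becomes [1, x1, x2]
toy = np.append(arr=np.ones((len(toy), 1)), values=toy, axis=1)
print(toy[:, 0])  # [1. 1.]
print(toy.shape)  # (2, 3)
```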

Implementing the training loop

# x_train already contains the bias column added above, so we can use it directly
X = np.asmatrix(x_train)
theta = np.zeros((X.shape[1], 1))
Y = y_train.reshape(-1, 1)

h = np.dot(X, theta)
h.shape # returns (853, 1)
cost = np.sum(np.multiply((h - Y), (h - Y))) * (1 / (2 * X.shape[0]))
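To make the cost formula concrete, here is a tiny hand-checkable example (numbers made up):

```python
import numpy as np

h = np.array([[2.0], [4.0]])  # hypothetical predictions
Y = np.array([[1.0], [3.0]])  # hypothetical targets
m = h.shape[0]
# Mean of squared errors, halved: sum((h - Y)^2) / (2m)
cost = np.sum((h - Y)**2) * (1 / (2 * m))
print(cost)  # (1 + 1) / 4 = 0.5
```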

The gradientDescent function and the subsequent code are similar to what we used for Simple Linear Regression, except that here theta is an array.

One thing I usually do is examine and play with the shapes of all the matrices and arrays to see how they can be combined into the required resultant matrix.

a = 0.01 # learning rate (tune by experiment)

def gradientDescent(theta, X):
    h = np.dot(X, theta)
    cost = np.sum((h - Y)**2) * (1 / (2 * X.shape[0]))
    theta = theta - np.dot(X.T, h - Y) * (a / X.shape[0])
    return (theta, X, cost)

theta = np.zeros(theta.shape)
for i in range(10000):
    (theta, X, cost) = gradientDescent(theta, X)
    if i % 1000 == 0:
        print(cost)
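To convince yourself the update rule works, you can run the same loop on toy data where the answer is known (y = 1 + 2x):

```python
import numpy as np

# Bias column already prepended, as in the real pipeline
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Y = np.array([[1.0], [3.0], [5.0]])
theta = np.zeros((2, 1))
a = 0.1  # learning rate (assumed for this toy run)

for _ in range(5000):
    h = np.dot(X, theta)
    theta = theta - np.dot(X.T, h - Y) * (a / X.shape[0])

print(theta.ravel())  # close to [1. 2.]
```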

Finally, we predict the y values for the test set and evaluate our model with the MSE function.

# x_test already contains the bias column
X_test = np.asmatrix(x_test)
mean_squared_error(np.dot(X_test, theta), y_test) # returns 0.05530668773441035

You can get the complete code from my GitHub repository.

You can achieve better results using the built-in sklearn function!
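For comparison, a minimal sketch with sklearn's built-in LinearRegression (toy data here, not the real features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])  # y = 1 + 2x

model = LinearRegression()  # fits the intercept itself, no ones column needed
model.fit(X_train, y_train)
print(model.intercept_, model.coef_)  # ~1.0 [2.0]
```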

Thanks for reading :)
