# Multivariate Linear Regression from scratch

The following code implements Multivariate Linear Regression from scratch, building on where we left off in Simple Linear Regression (https://medium.com/@surajsubramanian2000/simple-linear-regression-from-scratch-9e073185cdab).

Representing the hypothesis as a function of multiple features lets us model far richer relationships than a single predictor can. This is why multivariate Linear Regression is so widely used and so powerful.
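Concretely, with a bias term the hypothesis is h(x) = θ₀ + θ₁x₁ + … + θₙxₙ, which vectorizes to a single dot product. A minimal sketch (the weights and feature values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical weights: a bias plus three feature coefficients
theta = np.array([1.0, 2.0, 0.5, -1.0])

# One sample, with a leading 1 so the bias term is absorbed into the dot product
x = np.array([1.0, 3.0, 4.0, 2.0])

# Vectorized hypothesis: h(x) = theta^T x
h = np.dot(theta, x)
print(h)  # 1 + 2*3 + 0.5*4 - 1*2 = 7.0
```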

Importing necessary packages

• numpy: for representing vectors
• pandas: for reading CSV files
• sklearn: for splitting the dataset into train and test sets, preprocessing, encoding, and computing errors
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
```

Next we load the csv file and delete columns that are not needed. The dataset can be downloaded from https://github.com/SurajSubramanian/MachineLearning/blob/master/ML-Programs/MultivariateLinearRegression/FuelConsumptionCo2.csv

```python
df = pd.read_csv("FuelConsumptionCo2.csv")
```

We’ll view the dataset

```python
df['MODELYEAR'].unique()  # returns an array containing a single year
```

There is only one MODELYEAR value, so it carries no information and it makes no sense to include it as a feature. We'll drop it:

```python
df = df.drop(['MODELYEAR'], axis=1)
```

We use a LabelEncoder to map each categorical column to integer codes, since we can't operate on string-valued variables.

```python
for col in ['MAKE', 'MODEL', 'VEHICLECLASS', 'TRANSMISSION', 'FUELTYPE']:
    df[col] = labelencoder.fit_transform(df[col])
```

Next, we identify the variables that correlate strongly with the target variable, CO2EMISSIONS:

```python
cor = df.corr()
cor_target = abs(cor["CO2EMISSIONS"])

# Selecting highly correlated features
relevant_features = cor_target[cor_target > 0.5]
relevant_features

# Output:
# ENGINESIZE                  0.874154
# CYLINDERS                   0.849685
# FUELCONSUMPTION_CITY        0.898039
# FUELCONSUMPTION_HWY         0.861748
# FUELCONSUMPTION_COMB        0.892129
# FUELCONSUMPTION_COMB_MPG    0.906394
# CO2EMISSIONS                1.000000
# Name: CO2EMISSIONS, dtype: float64
```

Dropping the columns with low correlation, i.e. those whose correlation with the target is below 0.5:

```python
df = df.drop(['MAKE', 'MODEL', 'VEHICLECLASS', 'FUELTYPE', 'TRANSMISSION'], axis=1)
df.head()
```

We use min-max normalization. This step is really important: if you skip it and run the remaining cells, you will end up with an absurdly large MSE.
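A minimal sketch of min-max scaling, shown here on a toy DataFrame for illustration (the same expression applies to the full `df`); each column is mapped linearly onto [0, 1]:

```python
import pandas as pd

# Toy frame standing in for df; the real columns come from the CSV above
df = pd.DataFrame({"ENGINESIZE": [2.0, 3.5, 5.0],
                   "CO2EMISSIONS": [196.0, 250.0, 304.0]})

# Min-max normalization: (value - column min) / (column max - column min)
df = (df - df.min()) / (df.max() - df.min())
print(df)
```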

```python
y, x = df["CO2EMISSIONS"], df.drop(["CO2EMISSIONS"], axis=1)
```

Splitting the dataset into train and test sets. We also prepend a column of ones so the bias term can be folded into the weight vector:

```python
x_, y_ = x.to_numpy(), y.to_numpy()

# Prepend a column of ones for the bias term
x_ = np.append(arr=np.ones((len(x_), 1)).astype(int), values=x_, axis=1)

x_train, x_test, y_train, y_test = train_test_split(x_, y_, test_size=0.2, random_state=21)
```

Implementing the training loop

```python
a = 0.01  # learning rate

# Copy x_train row by row into X
X = []
for row in x_train:
    r = []
    for item in row:
        r.append(item)
    X.append(r)
X = np.asmatrix(X)

theta = np.zeros((X.shape[1], 1))  # one weight per column, including the bias
Y = y_train.reshape(-1, 1)

h = np.dot(X, theta)
h.shape  # returns (853, 1)

cost = np.sum(np.dot(np.transpose(h - Y), (h - Y))) * (1 / (2 * X.shape[0]))
temp = np.zeros(theta.shape)
```

The gradientDescent function and the subsequent code are similar to the ones we used for Linear Regression, except that here we represent theta as an array.

One thing I usually do is examine the shapes of all the matrices and arrays involved, and work out how they can be combined to produce the required resultant matrix.
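For instance, a quick sanity check on the shapes in the gradient descent step (the sizes below are illustrative, matching the 853-row training set):

```python
import numpy as np

m, n = 853, 7  # illustrative sizes: samples x columns (including the bias column)
X = np.ones((m, n))
theta = np.zeros((n, 1))
Y = np.zeros((m, 1))

h = X @ theta          # (m, n) @ (n, 1) -> (m, 1): one prediction per sample
grad = X.T @ (h - Y)   # (n, m) @ (m, 1) -> (n, 1): same shape as theta, as required
print(h.shape, grad.shape)
```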

```python
def gradientDescent(theta, X):
    h = np.dot(X, theta)
    cost = np.sum(np.sum((h - Y)**2)) * (1 / (2 * X.shape[0]))
    temp = theta - np.dot(X.T, h - Y) * (a / X.shape[0])
    theta = temp
    return (theta, X, cost)

oldCost = 0
theta = np.ones(theta.shape)
for i in range(0, 10000):
    (theta, X, cost) = gradientDescent(theta, X)
    if i % 1000 == 0:
        print(cost)
        print(theta)
```

Predicting the y values for the test dataset and evaluating our model with the MSE function:

```python
X_test = []
for row in x_test:
    r = []
    for item in row:
        r.append(item)
    X_test.append(r)

mean_squared_error(np.dot(X_test, theta), y_test)  # returns 0.05530668773441035
```

You can get the complete code from my github repository https://github.com/SurajSubramanian/MachineLearning/tree/master/ML-Programs/MultivariateLinearRegression

You can achieve better results using the built-in sklearn function!
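For comparison, a minimal sketch of the sklearn approach, shown here on synthetic data so it runs standalone (on the real dataset you would fit on `x_train`/`y_train` instead):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the fuel-consumption features:
# y = 3*x1 + 2*x2 + 5, noise-free so the fit recovers the weights exactly
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + 5

model = LinearRegression()  # handles the bias term internally, no column of ones needed
model.fit(X, y)
print(model.coef_, model.intercept_)  # ≈ [3. 2.] and 5.0
```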