ML from Scratch: Linear Regression Model with NumPy
Your Complete Guide to Linear Regression
In this project, we will see how to create a machine learning model that uses the Multiple Linear Regression algorithm.
The main focus of this project is to explain how linear regression works, and how you can code a linear regression model from scratch using the awesome NumPy module.
Of course, you can create a linear regression model using scikit-learn with just 3–4 lines of code, but coding your own model from scratch is far more awesome than relying on a library that does everything for you while you sit and watch.
Not only that, coding a custom model means you have full control over what the model does and how it handles the data you feed it. This allows for more flexibility during training, and you can tweak the model to make it more robust and responsive to real-world data as needed during future re-training or in production.
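For reference, here is roughly what the library-based version looks like; a minimal sketch, assuming the same dataset and feature columns we will use later in this project:
# A minimal sketch of the scikit-learn route, shown only for comparison
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_csv("FuelConsumptionCo2.csv")
X = df[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']]
y = df['CO2EMISSIONS']
model = LinearRegression().fit(X, y)  # fits the weights and bias for us
print(model.coef_, model.intercept_)  # the learned weights and bias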
In this project, our model will be used to predict the CO₂ emissions of a vehicle based on features such as its engine size and fuel consumption.
Let’s start working on the project.
Making the Necessary Imports
First, we will import the necessary PyData modules.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Now, let us import our data set. The data set contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada.
df = pd.read_csv("FuelConsumptionCo2.csv")
print(df.head())
Here’s the link for the data set. I will also share the link to the Github repo containing the Jupyter notebook and the dataset at the end of this project.
Here’s how it looks in my Jupyter notebook:
The following are the columns present in our data set.
- MODELYEAR e.g. 2014
- MAKE e.g. Acura
- MODEL e.g. ILX
- VEHICLE CLASS e.g. SUV
- ENGINE SIZE e.g. 4.7
- CYLINDERS e.g. 6
- TRANSMISSION e.g. A6
- FUEL CONSUMPTION in CITY (L/100 km) e.g. 9.9
- FUEL CONSUMPTION in HWY (L/100 km) e.g. 8.9
- FUEL CONSUMPTION COMB (L/100 km) e.g. 9.2
- CO2 EMISSIONS (g/km) e.g. 182
Data Wrangling & Feature Selection
One of the most important steps in any data science project is pre-processing the data. This involves cleaning the data, typecasting some columns as required, conversion of categorical variables and standardizing/normalizing the data as per the project requirements.
For our project, the first step of the pre-processing is going to be checking whether we need to typecast the data type of any feature/target variable.
print(df.dtypes)
We get the following output:
As we can see, there’s no need to typecast any column.
The second step in our data wrangling process is analyzing whether the features need to be standardized or not. For that, let us have a look at the descriptive analysis of the dataframe.
print(df.describe())
As we can see, all the potential features are of the same order of magnitude in terms of scale, so we needn't standardize any feature.
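Had the scales differed widely, a simple z-score standardization with pandas/NumPy would do the job. A minimal sketch of what that could look like (not applied in this project), using the feature columns we select in the next step as an example:
# Hypothetical sketch: z-score standardization (not applied in this project)
cols = ['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']
df[cols] = (df[cols] - df[cols].mean()) / df[cols].std()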
For this project, the features we will be choosing are ENGINESIZE, CYLINDERS & FUELCONSUMPTION_COMB and the target variable is CO2EMISSIONS.
df = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
print(df.head())
Our next step is checking the number of NaN (null) values in the dataframe.
for i in df.columns:
    print(df[i].isnull().value_counts())
As we can see, there are no null values in our dataframe. So the data is perfect for training the model.
Data Visualization & Analysis
First, we will have a look at the correlation of the features and the target variables.
print(df.corr())
This table shows a strong positive correlation between the features and the target variable. Remember, a strong correlation is a good thing for a linear regression model.
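Since we imported seaborn earlier, we can also visualize the same correlation matrix as a heatmap; a quick optional sketch:
# Optional: visualize the correlation matrix as a seaborn heatmap
sns.heatmap(df.corr(), annot=True, cmap='Blues')
plt.show()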
Now, let us visualize the plots of different features against the target variable. This will allow us to get an idea whether the features show a linear relation with the target variable or not.
fig, a = plt.subplots(1, 3, figsize=(18, 5))
a[0].scatter(df['ENGINESIZE'], df['CO2EMISSIONS'], color='c')
a[0].set_title('Engine Size vs CO2 Emissions')
a[0].set_xlabel('Engine Size (L)')
a[1].scatter(df['CYLINDERS'], df['CO2EMISSIONS'], color='m')
a[1].set_title('No. of Cylinders vs CO2 Emissions')
a[1].set_xlabel('No. of Cylinders')
a[2].scatter(df['FUELCONSUMPTION_COMB'], df['CO2EMISSIONS'], color='b')
a[2].set_title('Fuel Consumption vs CO2 Emissions')
a[2].set_xlabel('Fuel Consumption (L/100km)')
fig.text(0.08, 0.5, 'CO2 Emissions', va='center', rotation='vertical')
plt.show()
As we can see, the features show a considerable linear relationship with the target. Hence, we can use them for training the model.
Linear Regression Model from Scratch
Linear regression uses the following mathematical formula for prediction of a dependent variable using an independent variable.
y = wx + b
Here,
- y- Dependent variable(s)
- x- Independent variable(s)
- w- Weight(s) associated with the independent variable(s)
- b- Biases for the given lin-reg equation
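To make this concrete, here is a made-up numerical example with two features: if w = [10, 5], x = [2.0, 9.2] and b = 30, then ŷ = 10(2.0) + 5(9.2) + 30 = 96. Our model will do the same thing, only with three features and with weights and a bias learned from the data.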
The following is the process for developing a linear regression model.
- Splitting the data set into training and testing sets. For the sake of simplicity however, we will be skipping this step in our custom model.
- Assigning random weights and biases to the model and then calculating the predicted dependent variable, ŷ, on the basis of these random weights and biases.
- Using a loss function to calculate the total information loss, i.e., the total inaccuracy within our model. In our example, we will be using the Mean Squared Error (MSE) loss function.
- Our next step is to reduce the total MSE of our model. For this, we will be using the Stochastic Gradient Descent (SGD) function, which is one of the most popular optimizer algorithms used in regression models. We will discuss the SGD function in detail while coding the optimizer function.
- We will update the model weights and biases based on our optimizer algorithm, then retrain the model. This is a recurrent process that will keep on repeating until we achieve an optimum model with low information loss.
First, let’s cast the features to a NumPy array, features.
features = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']].to_numpy()  # Converts the feature columns to a NumPy array
print(features)
Now, let us cast the target column to a NumPy array, target.
target = df['CO2EMISSIONS'].to_numpy()  # Converts the target column to a NumPy array
print(target)
Since we have 3 independent variables (features), we will have 3 weights. Let's generate an array weights of 3 small random values.
weights = np.random.rand(3)  # Generates a NumPy array of three small random floats
print(weights)
Since we have a single target variable, we will have just one bias, b. We will also create an array bias equal to the length of features array having bias b for each element.
b = np.random.rand(1)  # Generates a NumPy array with a single small random float
bias = np.array([b[0] for i in range(len(features))])
print(bias)
Now, we will define our model function that uses the weights, bias and independent variables to calculate ŷ.
def linearRegr(features, weights, bias):
    """Calculates the y_hat predicted values using the given weights, independent variables, and bias.
    Args:
        features: Matrix of independent variable values
        weights: Matrix/array of weights associated with each independent variable
        bias: Bias array for the model
    Returns:
        Array/matrix of predicted values
    """
    # Broadcasts the bias value across all rows and adds it to the weighted sum of the features
    y_hat = weights.dot(features.transpose()) + np.array([bias[0] for i in range(len(features))])
    return y_hat
Now, let us run the function once to see the results we’ll get.
y_hat = linearRegr(features, weights, b)
print(y_hat)
Now, we will define the MSE function to calculate the total loss of our model.
def meanSqrError(y, y_hat):
    """Calculates the total mean squared error.
    Args-
        y: Array of actual target values
        y_hat: Array of predicted target values
    Returns-
        total mean squared error
    """
    MSE = np.sum((y - y_hat) ** 2) / len(y)
    return MSE
Let us now calculate the information loss based on the y_hat values we got earlier.
print('Total error- {}'.format(meanSqrError(target, y_hat)))
As we can see, our model is currently massively inaccurate and we need to optimize it.
Optimizing the Model
Now comes the most important step in linear regression: formulating the SGD function. This is a slightly more advanced topic than the basic functions we have covered up to this point. It requires some knowledge of differential calculus, partial differentiation to be specific. I have tried to explain this in the image below; however, if you don't get it, I'd strongly suggest getting familiar with the mathematics behind Machine Learning (calculus, statistics and probability, linear algebra) before proceeding any further.
Image source- Adarsh Menon-Medium.com
Once we have calculated the gradients, we will update the parameters as follows.
- m = m - α Dm
- c = c - α Dc
Here,
- E- Total mean squared error
- m- Weights associated with features
- c- Model bias
- y- Array of actual target values
- ŷ- Predicted target values
- Dm- Partial derivative of E w.r.t. weights m
- Dc- Partial derivative of E w.r.t. bias c
- α- Learning rate, i.e., size of the step that the optimizer function takes.
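Concretely, for n data points, the standard partial derivatives of the mean squared error are:
- Dm = -(2/n) Σ x(y - ŷ), computed separately for each feature's weight
- Dc = -(2/n) Σ (y - ŷ)
These are exactly the quantities that our gradient-calculation function will compute for each of the three features and for the bias.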
Once we have the new updated values of the weights and biases, we will calculate the loss again. We will repeat the process for n epochs, i.e., number of cycles and plot the loss values after each epoch. To keep the code clean, I will create a separate function for calculation of gradients.
def gradient(target, features, weights, bias):
    """Returns the gradients (slopes) for the weights and the bias
    """
    m = len(features)
    target_pred = linearRegr(features, weights, bias)
    loss = target - target_pred  # y - y_hat
    # Gradient calculation for model bias
    grad_bias = np.array([-2/m * np.sum(loss)])
    grad_weights = np.ones(3)
    # Gradient calculation for the first feature
    feature_0 = np.array([feature[0] for feature in features])
    grad_weights[0] = -2/m * np.sum(loss * feature_0)
    # Gradient calculation for the second feature
    feature_1 = np.array([feature[1] for feature in features])
    grad_weights[1] = -2/m * np.sum(loss * feature_1)
    # Gradient calculation for the third feature
    feature_2 = np.array([feature[2] for feature in features])
    grad_weights[2] = -2/m * np.sum(loss * feature_2)
    return grad_bias, grad_weights
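As a quick sanity check, we can evaluate the gradients once with our current random weights and bias and confirm the shapes of what comes back:
# Quick sanity check: one gradient evaluation with the current random parameters
grad_bias, grad_weights = gradient(target, features, weights, bias)
print(grad_bias.shape, grad_weights.shape)  # expect (1,) and (3,)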
Now, let us write the SGD function that will return the updated weights and biases so we can formulate our final model.
def stochGradDesMODIFIED(learning_rate, epochs, target, features, weights, bias):
    """Performs stochastic gradient descent optimization on the model.
    Args-
        learning_rate- Size of the step the function will take during optimization
        epochs- No. of iterations the function will run for on the model
        target- Actual emission values
        features- Matrix of independent variables
        weights- Weights associated with each feature
        bias- Model bias
    Returns-
        return_dict = {'weights': weights, 'bias': bias[0], 'MSE': total_MSE_new, 'MSE_list': MSE_list}
    """
    MSE_list = []
    for i in range(epochs):
        # Compute the gradients for the current weights and bias
        grad_bias, grad_weights = gradient(target, features, weights, bias)
        # Take one optimization step
        weights -= grad_weights * learning_rate
        bias -= grad_bias * learning_rate
        # Recompute the predictions and the loss with the updated parameters
        new_pred = linearRegr(features, weights, bias)
        total_MSE_new = meanSqrError(target, new_pred)
        MSE_list.append(total_MSE_new)
    return_dict = {'weights': weights, 'bias': bias[0], 'MSE': total_MSE_new, 'MSE_list': MSE_list}
    return return_dict
Finally, we have the optimizer function for our linear regression model. Let us run the function now and store the values for further use.
model_val = stochGradDesMODIFIED(0.001, 2000, target, features, weights, bias)
print("Weights- {}\nBias- {}\nMSE- {}".format(model_val['weights'], model_val['bias'], model_val['MSE']))
The initial MSE was around 65,000 while the current MSE is around 680. We can see from the results that our model has significantly improved.
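Earlier we said we would plot the loss values after each epoch. Since the optimizer also returns MSE_list, a quick way to visualize the training curve is:
# Plot the MSE recorded after each epoch to see how the loss decreased
plt.plot(model_val['MSE_list'])
plt.xlabel('Epoch')
plt.ylabel('MSE')
plt.title('MSE vs Epochs')
plt.show()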
Finally, we will write the model function that uses the updated model weights and biases to predict the target values.
def LinearRegressionModel(model_val, feature_list):
    """Predicts the CO2 emission values of the vehicle
    Args-
        model_val- The dictionary returned by the stochGradDesMODIFIED function. Contains the model weights and bias
        feature_list- An array of the independent variables
    Returns-
        co2_emission- Emission prediction for the given set of features
    """
    co2_emission = np.sum(model_val['weights'] * feature_list) + model_val['bias']
    return co2_emission
Testing and Evaluation
As a test run, we will now test our model on the following data.
feature_list = [2.0, 4, 8.5]
The actual target value for the data is 196. Let’s see how our model fares.
target_value = 196
feature_list = [2.0, 4, 8.5]
predicted_value = LinearRegressionModel(model_val, feature_list)
print(predicted_value)
The original target value was 196 for the given data point. As we can see, our model did a fairly good job at making the prediction, considering this is a model implementation from scratch. You can further improve the model by tweaking a few things or running more optimization epochs. However, too much optimization can lead to overfitting, which is equally bad, since an overfit model becomes practically unusable on real-world data.
Now, to check the accuracy of our model, we will calculate its r-squared (R²) score. The formula is: R² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)², where ȳ is the mean of the actual target values. The following function implements it.
def r2_score(target, prediction):
    """Calculates the r2 score of the model
    Args-
        target- Actual values of the target variable
        prediction- Predicted values, calculated using the model
    Returns-
        r2- r-squared score of the model
    """
    r2 = 1 - np.sum((target - prediction) ** 2) / np.sum((target - target.mean()) ** 2)
    return r2
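To get this score, we run the trained model over the full feature matrix (reusing the linearRegr function with the optimized parameters) and pass the predictions to r2_score; a minimal sketch:
# Predict over the whole dataset with the optimized weights and bias, then score the model
predictions = linearRegr(features, model_val['weights'], np.array([model_val['bias']]))
print('R2 score of the model- {}'.format(r2_score(target, predictions)))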
As we can see, our model explains around 83% of the variability of the response data around its mean, which is fairly good. However, there is always room for improvement in a Machine Learning model!
With this, we come to an end for our project.
I am going to write a series of blog posts where we will work on similar projects, coding new ML models from scratch and working hands-on with real-world datasets and problem statements.
I am just a novice in the field of Machine Learning and Data Science so any suggestions and criticism will really help me improve.
Hit that follow and stay tuned for more ML stuff!
Link to GitHub repo for dataset and Jupyter notebook-