A step-by-step guide to Simple and Multiple Linear Regression in Python

Build and evaluate SLR and MLR machine learning models in Python

Nikhil Adithyan
CodeX
7 min read · Oct 10, 2020


Image by Pixabay on Pexels

Linear Regression

‘Linear regression is a statistical model that examines the linear relationship between two (Simple Linear Regression) or more (Multiple Linear Regression) variables: a dependent variable and one or more independent variables. A linear relationship means that when one (or more) of the independent variables increases (or decreases), the dependent variable increases (or decreases) too.’

Python Implementation

There are two main ways to build a linear regression model in Python: using the “statsmodels” package or the “scikit-learn” package. In this article, we’ll be building SLR and MLR models with both statsmodels and scikit-learn to predict the CO2 emissions of cars. Before building our models, it is necessary to import and process the data and identify the variables for our regression.

Step — 1: Importing and Processing the Data
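As a minimal sketch of this step, assuming the data comes as a CSV file: the few rows below are made-up placeholders and the column names are hypothetical, chosen only so the snippet runs on its own. With the real file, the call would be pd.read_csv with the path to that file.

```python
import io

import pandas as pd

# Placeholder rows standing in for the real CSV file.
# Every value is made up and every column name is hypothetical.
sample_csv = """make,vehicle_class,engine_size,fuel_city,fuel_hwy,fuel_comb,co2
BrandA,COMPACT,2.0,9.9,6.7,8.5,196
BrandA,SUV,3.5,12.6,9.0,11.0,255
BrandB,MID-SIZE,2.4,10.5,7.4,9.1,212
BrandC,PICKUP,5.3,15.8,11.2,13.7,320
"""

# With a real file this would be: df = pd.read_csv("path/to/your_file.csv")
df = pd.read_csv(io.StringIO(sample_csv))

# Show the first rows to get a feel for the structure of the data.
print(df.head())
```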

Output:

Image by Author

Now, we have an idea of what our dataset is about. Next, it is necessary to have a look at a statistical summary of our dataset.

Output:

Image by Author

Now, we have a clear idea of both the structure and the statistical summary of our dataset. Next, we have to remove the character (text) columns, which a regression model cannot use directly.
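One possible way to do this, sketched on a tiny made-up DataFrame with hypothetical column names, is to keep only the numeric columns. Whether the original code used select_dtypes or an explicit drop is an assumption.

```python
import pandas as pd

# Tiny made-up frame with hypothetical column names, for illustration only.
df = pd.DataFrame({
    "make": ["BrandA", "BrandB", "BrandC"],
    "vehicle_class": ["COMPACT", "SUV", "PICKUP"],
    "engine_size": [2.0, 3.5, 5.3],
    "fuel_comb": [8.5, 11.0, 13.7],
    "co2": [196, 255, 320],
})

# Keep only the numeric columns; the text columns are left out.
numeric_df = df.select_dtypes(include="number")

# Equivalent here: df.drop(columns=["make", "vehicle_class"])
print(numeric_df.dtypes)
```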

So, we have cleaned and processed our data and we are now ready for some visualizations in order to find some linear relationships between variables.

Step — 2: Finding Linear Relationships

With CO2 emissions as the dependent variable, we have to find positive or negative linear relationships using scatter plots. The variables that show such a relationship are then used to build our SLR and MLR models. For statistical visualizations, the seaborn library is a good choice, so let’s import it.

Just as we looked at a statistical summary of our data, let’s now look at a visual summary of the relationships between the variables.

Output:

Image by Author

Now, we are going to plot each independent variable, one at a time, against our dependent variable, CO2 emissions, to find linear relationships between them. Let’s do it in Python!

(i) Engine size / CO2 emissions:
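A scatter plot like this can be sketched with a small helper. The helper name plot_against_co2, the column names, and the synthetic data are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: made-up values, hypothetical column names.
rng = np.random.default_rng(0)
engine_size = rng.uniform(1.0, 6.0, 300)
co2 = 105.0 + 38.0 * engine_size + rng.normal(0.0, 18.0, 300)
df = pd.DataFrame({"engine_size": engine_size, "co2": co2})


def plot_against_co2(data, column):
    """Scatter one independent variable against the dependent variable."""
    fig, ax = plt.subplots(figsize=(6, 4))
    sns.scatterplot(data=data, x=column, y="co2", ax=ax)
    ax.set_title(f"{column} / co2")
    return ax


ax = plot_against_co2(df, "engine_size")
```

The same helper can be called with any other numeric column name to check that variable against CO2 emissions.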

Output:

Image by Author

By plotting the Engine size variable against our dependent variable, we can observe a positive linear relationship. Hence, we can take engine size as an independent variable for our model.

(ii) Fuel Consumption Comb (L/100 km) / CO2 emissions:

Output:

Image by Author

Similar to Engine size, Fuel Consumption Comb (L/100 km) also shows a positive linear relationship. Hence, it can be taken as an independent variable for our model.

(iii) Fuel Consumption Hwy (L/100 km) / CO2 emissions:

Output:

Image by Author

As Fuel Consumption Hwy (L/100 km) plotted against CO2 emissions reveals a positive linear relationship, it can also be used as an independent variable for building our model.

(iv) Fuel Consumption City (L/100 km) / CO2 Emissions:

Output:

Image by Author

Like all the above variables, Fuel Consumption City (L/100 km), when plotted against CO2 emissions, shows a positive linear relationship. So, this can also be considered as an independent variable for our model.

Now, we have four independent variables that can be used to train and build our regression model. Without wasting a moment, let’s build our machine learning model in Python!

SLR Model

To build a Simple Linear Regression (SLR) model, we need one independent variable and one dependent variable. For our SLR model, we are going to take Engine size as the independent variable and, of course, CO2 emissions as the dependent variable. Let’s define our variables in Python:

As I said before, we will build the model using statsmodels first, followed by scikit-learn.

(i) Statsmodels:

Code Explanation: First, we imported our primary package, “statsmodels.api”. Next, we defined a variable “slr_model” to store our Ordinary Least Squares (OLS) model, and finally, we stored our fitted model in a variable “slr_reg”.

Now let’s see the results of our model’s performance.

Output:

Image by Author

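When analyzing our results summary, we can notice that the R-squared of the model is 0.943 (94.3%). Note, however, that this model was fitted without a constant (intercept) term, and in that case statsmodels reports the uncentered R-squared, which is usually much higher than the ordinary R-squared and cannot be compared with it directly. So, this value on its own does not prove that the model is ready for real-world problems.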

(ii) Scikit-learn:

Just as we used the OLS model in statsmodels, in scikit-learn we are going to use the ‘LinearRegression’ model, together with the ‘train_test_split’ function to split our data into training and testing sets. Let’s do it in Python!
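A minimal sketch of this step on synthetic stand-in data (made-up values, hypothetical column names). The names “lr” and “yhat”, the 30% test size and “random_state = 0” follow the article; everything else is an assumption. The last lines repeat the split with the same seed to show that the result is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: made-up values, hypothetical column names.
rng = np.random.default_rng(0)
engine_size = rng.uniform(1.0, 6.0, 300)
co2 = 105.0 + 38.0 * engine_size + rng.normal(0.0, 18.0, 300)
df = pd.DataFrame({"engine_size": engine_size, "co2": co2})

X_var = df[["engine_size"]]
y_var = df["co2"]

# Shuffle and split: 70% for training, 30% for testing, with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X_var, y_var, test_size=0.3, random_state=0
)

lr = LinearRegression()     # fits an intercept by default
lr.fit(X_train, y_train)    # learn from the training data only
yhat = lr.predict(X_test)   # predictions for the unseen test data

# The same seed gives exactly the same split again.
_, X_test_again, _, _ = train_test_split(
    X_var, y_var, test_size=0.3, random_state=0
)
same_split = X_test.index.equals(X_test_again.index)
print("Same split with the same seed:", same_split)
```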

Code Explanation: First, we import our primary tools, “LinearRegression” and “train_test_split”. Using the train_test_split function, we split the data into a training set and a testing set, with the testing set making up 30% of the original dataset. Inside train_test_split, I’ve passed “random_state = 0”, which fixes the seed of the random shuffle so that the same train/test split is produced every time the code is run. Next, we store our linear model in the variable “lr” and fit it to the training data. Finally, we store the predicted values in the variable “yhat”.

Now, to inspect our scikit-learn model, we are going to look at its slope and intercept, which together define the fitted line, and then calculate the R-squared value of the model. Let’s do it in Python!
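As a sketch, the slope and intercept can be read from a fitted LinearRegression through its coef_ and intercept_ attributes. The data below is synthetic (made-up values, hypothetical column names), so the printed numbers will not match the article’s output.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: made-up values, hypothetical column names.
rng = np.random.default_rng(0)
engine_size = rng.uniform(1.0, 6.0, 300)
co2 = 105.0 + 38.0 * engine_size + rng.normal(0.0, 18.0, 300)
df = pd.DataFrame({"engine_size": engine_size, "co2": co2})

X_train, X_test, y_train, y_test = train_test_split(
    df[["engine_size"]], df["co2"], test_size=0.3, random_state=0
)
lr = LinearRegression().fit(X_train, y_train)

slope = lr.coef_[0]        # one coefficient per independent variable
intercept = lr.intercept_  # value of the line at engine_size = 0
print(f"co2 = {slope:.2f} * engine_size + {intercept:.2f}")

# Plugging the values into the line equation reproduces lr.predict().
manual = slope * X_test["engine_size"].to_numpy() + intercept
matches = np.allclose(manual, lr.predict(X_test))
print("Manual line matches predict():", matches)
```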

Output:

Image by Author

Now let’s calculate the R-squared value of our scikit-learn model. Follow the code below:
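A minimal sketch using scikit-learn’s r2_score on the test set, again on synthetic stand-in data (made-up values, hypothetical column names). Evaluating on the test set is an assumption about how the article’s value was computed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: made-up values, hypothetical column names.
rng = np.random.default_rng(0)
engine_size = rng.uniform(1.0, 6.0, 300)
co2 = 105.0 + 38.0 * engine_size + rng.normal(0.0, 18.0, 300)
df = pd.DataFrame({"engine_size": engine_size, "co2": co2})

X_train, X_test, y_train, y_test = train_test_split(
    df[["engine_size"]], df["co2"], test_size=0.3, random_state=0
)
lr = LinearRegression().fit(X_train, y_train)
yhat = lr.predict(X_test)

# Actual values first, predicted values second.
r2 = r2_score(y_test, yhat)
print("R-squared :", r2)

# lr.score() computes the same quantity in one call.
r2_from_score = lr.score(X_test, y_test)
```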

Output:

R-squared : 0.7162770226132333

We can notice that the R-squared value of the scikit-learn model is different from that of the statsmodels model. The main reason is that we didn’t add a constant (intercept) term to the independent variable in the statsmodels model, so statsmodels reported the uncentered R-squared, while scikit-learn’s LinearRegression fits an intercept by default. In addition, the scikit-learn model was fitted on only the training portion of the data. In the upcoming MLR model, we will add a constant to the independent variables in statsmodels.

We have successfully created our SLR model using both the statsmodels package and the scikit-learn package. Now let’s dive into building the Multiple Linear Regression (MLR) model.

MLR Model

To build a Multiple Linear Regression (MLR) model, we need more than one independent variable and one dependent variable. For our MLR model, we are going to take the four independent variables found above and, of course, CO2 emissions as the dependent variable. Let’s define our variables in Python:

Remember that adding more and more independent variables to the model might result in “overfitting”. In our CO2 data, we have only a small number of attributes, but with larger datasets, we must be more cautious about picking independent variables. So, it is highly recommended to choose only the independent variables that are relevant to the dependent variable.

(i) Statsmodels:

You can notice that we have added a constant (intercept) term to our independent variables. Now that we have fitted our model, let’s view the results summary.

Output:

Image by Author

When analyzing our results summary, we can notice that the R-squared of the model is 0.874 (87.4%), and this value is obtained with the constant term included in the model. In other words, the model explains about 87% of the variation in CO2 emissions, which makes it a reasonable starting point for real-world cases.

(ii) Scikit-learn

The code implementation and the functions used are the same as for the SLR model, except that we pass extra attributes as independent variables.

To check the accuracy of the scikit-learn model, we can calculate the R-squared score, and we can also introduce a new method: a distribution plot. First, let’s calculate the R-squared value in Python:
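A minimal sketch of the scikit-learn MLR model on synthetic stand-in data (made-up values, hypothetical column names). The only change from a single-variable model is that the feature table now has four columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def make_demo_data(n=300, seed=42):
    """Synthetic stand-in for the car data: made-up values, hypothetical names."""
    rng = np.random.default_rng(seed)
    engine = rng.uniform(1.0, 6.0, n)
    city = 5.0 + 2.0 * engine + rng.normal(0.0, 1.0, n)
    hwy = 4.0 + 1.2 * engine + rng.normal(0.0, 0.8, n)
    comb = 0.55 * city + 0.45 * hwy + rng.normal(0.0, 0.2, n)
    co2 = 23.0 * comb + rng.normal(0.0, 8.0, n)
    return pd.DataFrame({"engine_size": engine, "fuel_city": city,
                         "fuel_hwy": hwy, "fuel_comb": comb, "co2": co2})


df = make_demo_data()
X_var = df[["engine_size", "fuel_comb", "fuel_hwy", "fuel_city"]]
y_var = df["co2"]

X_train, X_test, y_train, y_test = train_test_split(
    X_var, y_var, test_size=0.3, random_state=0
)

lr = LinearRegression()
lr.fit(X_train, y_train)
yhat = lr.predict(X_test)

# One coefficient per independent variable, in column order.
for name, coef in zip(X_var.columns, lr.coef_):
    print(f"{name}: {coef:.3f}")
print("intercept:", lr.intercept_)
```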

Output:

R-Squared : 0.8655946234480003

We can observe that the R-squared value of the scikit-learn model is close to that of the statsmodels model, whose value is 0.87. This is because we added a constant to the independent variables while building the statsmodels MLR model, so both models now include an intercept.

The second method to check the accuracy of the MLR scikit-learn model is by constructing a distribution plot by combining the predicted values and the actual values. Follow the code to produce a distribution plot:

Output:

Image by Author

This distribution plot reveals that our predicted values follow the actual values quite closely, but some outliers can still be noticed. This is because we have built only a very basic linear regression model, which is too simple to predict every outcome precisely.

Final Thoughts!

We have successfully run through a whole series of steps to build and evaluate SLR and MLR models in Python and, of course, we have achieved our goal. Apart from SLR and MLR, there is much more to discover in regression, like Polynomial and Non-polynomial regression, Ridge, and so on. In this article, we have evaluated our models using just a few methods, but there are more to dive into. Also, the math behind Linear Regression is an ocean of formulas. Even though there are powerful packages in Python that handle the formulas for you, you can’t always depend on them. Learning and gaining a good insight into the math will be worthwhile. I hope this article helps you, and never stop learning. If you missed any of the code sections, don’t worry: I’ve provided the full code below.

Happy Machine Learning!

Full code:

Nikhil Adithyan
CodeX

Founder @BacktestZone (https://www.backtestzone.com/), a no-code backtesting platform | Top Writer | Connect with me on LinkedIn: https://bit.ly/3yNuwCJ