Linear Regression explained with an Example
In this article, I am going to explain the process that I followed to build a linear regression model for a particular dataset.
Firstly, I imported the commonly used libraries: NumPy, Pandas, Matplotlib, and Seaborn. Then I imported LinearRegression to build the linear regression model and train_test_split to divide the dataset into training and testing data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
Before I go ahead with the process, let’s first take a look at the dataset. The dataset has the following columns and we are required to express a relationship between the car’s mpg and the rest of the attributes that are given in the dataset.
The variables along with their short forms are as follows:
mpg -> MPG (miles per gallon)
cyl -> cylinders
displ -> engine displacement
hp -> horsepower
weight -> vehicle weight
accel -> time to accelerate from 0 to 60 mph
yr -> model year
origin -> origin of car (1. American, 2. European, 3. Japanese)
Exploratory Data Analysis
In order to understand how the various attributes are related to the target variable, I plotted the target variable against each of the other variables. I used seaborn to plot each pair, with a fitted line, on a two-dimensional plane.
1. Relationship between mpg and acceleration
In this graph, we can see that most of the points are not near the regression line, which implies that there is only a weak positive correlation between the two variables. This can be confirmed by checking the correlation value using the corr() method.
2. Relationship between mpg and hp: there is a strong negative correlation between the two variables, so mpg falls as horsepower rises.
In a similar way, I plotted graphs to check the relationship between mpg and the other variables, and here is the summary of the same in a tabular format.
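The correlation check mentioned above can be sketched as follows. This is only an illustration: the rows are made up and merely reuse the column names from this article, so the numbers will not match the real dataset.

```python
import pandas as pd

# Made-up rows that reuse the article's column names; NOT the real dataset.
df = pd.DataFrame({
    "mpg":    [18.0, 15.0, 24.0, 27.0, 31.0, 14.0],
    "hp":     [130, 165, 95, 88, 65, 220],
    "weight": [3504, 3693, 2372, 2130, 1773, 4354],
    "accel":  [12.0, 11.5, 15.0, 14.5, 19.0, 9.0],
})

# Pearson correlation of each column with the target,
# sorted from most negative to most positive.
corr_with_mpg = df.corr()["mpg"].drop("mpg").sort_values()
print(corr_with_mpg)
```

A value close to -1 or +1 indicates a strong linear relationship, while a value near 0 indicates a weak one.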
Removing Unnecessary Columns
The next thing to check is whether all the columns are useful for building the model. The column that contains the names of the cars is not going to be used in the model, so I removed that column. The dataset now contains 8 columns.
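Dropping a column can be sketched like this. The column name car_name is an assumption made for illustration, since the actual name of that column is not given here, and the rows are made up.

```python
import pandas as pd

# Made-up rows; "car_name" is a hypothetical column name.
df = pd.DataFrame({
    "mpg": [18.0, 24.0],
    "cyl": [8, 4],
    "car_name": ["car a", "car b"],
})

# Remove the column that will not be used by the model.
df = df.drop(columns=["car_name"])
print(df.columns.tolist())
```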
Dealing with Categorical Values
The column ‘origin’ contains the origin of the car in the form of numbers, where 1 stands for America, 2 for Europe, and 3 for Japan. If the model is built directly on these codes, it will treat them as ordinary numbers that have an order and a size. For example, it would read them as 1 < 2 < 3, which means that the cars of Japan are greater than those of Europe and the cars of Europe are greater than those of America. This doesn’t make sense at all! Therefore I replaced 1 with America, 2 with Europe, and 3 with Japan.
Well, there is another problem! Can you see what it is?
We only need columns that contain numerical data. The column origin now violates this, so something needs to be done. This is where the ‘get_dummies’ function comes into the picture. This function picks the different categories within the column of interest and creates one new column per category, named in the format ‘original column name_category’. Then, for each row, the column that matches the row’s category is marked as 1 while the other two columns are marked as 0. The function’s role will be much clearer upon looking at the below code’s output.
Tadaaa, the problem is solved!
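The two steps above (replacing the codes with labels, then creating dummy columns) can be sketched as follows. The rows are made up, and the exact labels and arguments used in the original code are assumptions.

```python
import pandas as pd

# Made-up rows; origin uses the 1/2/3 coding described in the article.
df = pd.DataFrame({"mpg": [18.0, 26.0, 31.0, 15.0], "origin": [1, 2, 3, 1]})

# Step 1: replace the numeric codes with readable labels.
df["origin"] = df["origin"].replace({1: "America", 2: "Europe", 3: "Japan"})

# Step 2: create one 0/1 column per category, named "origin_<category>".
df = pd.get_dummies(df, columns=["origin"], dtype=int)
print(df)
```

Each row ends up with exactly one 1 across the three new columns, so no ordering between the origins is implied.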
Now let’s see if there are any missing values in this dataset.
Dealing with missing values
There might be a couple of columns that have missing values in them. It is important to handle these, as missing values cause several problems. Firstly, the absence of certain values can bias our results. It can also lead to completely wrong or unacceptable conclusions. Some ways to handle missing values include filling them with the mean, median, or mode of the column, or dropping the rows that contain them. Dropping data points should be our last option, as we could lose a lot of important information by doing this.
I first checked if there are any missing values using the isna() method. Along with this, I used the sum() method to count the number of NaNs in each column, if there are any.
From the output, it is evident that there are no missing values. I then checked the datatypes to confirm that what I saw is right. The dtypes attribute can be used to determine the data type of each of the columns.
We can see the kind of values that each column has and confirm it with the above output. All of the datatypes seem to be fine.
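The checks above can be sketched as follows. The rows are made up, and one value is deliberately left missing so that the count is not zero. Unlike the real dataset, this toy frame therefore needs a fix, and filling with the median is shown as one of the options listed earlier.

```python
import numpy as np
import pandas as pd

# Made-up rows; one hp value is deliberately missing for illustration.
df = pd.DataFrame({
    "mpg": [18.0, 24.0, 31.0],
    "hp":  [130.0, np.nan, 65.0],
    "cyl": [8, 4, 4],
})

# Count the NaNs in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Inspect the data type of each column.
print(df.dtypes)

# One possible fix: fill the missing value with the column median.
df["hp"] = df["hp"].fillna(df["hp"].median())
```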
I used the pairplot function to interpret the relationship between the variables in our dataset. Upon using only the first 7 columns (as the last 3 columns are not required for plotting graphs), I get the following pairplots.
Note: The above screenshot does not show the complete result.
We can see that with just one line of code, seaborn is able to plot multiple graphs to show the relationship between all the variables. The ‘diag_kind’ parameter sets the type of plot drawn on the diagonal. If it is set to kde, we get smooth curves like the ones shown above. If we do not specify diag_kind, seaborn picks the diagonal type automatically, which gives histograms for a plain call like this one.
Another important point to note is that in any pairplot, the upper half shows the same pairs of variables as the lower half, only with the axes swapped. So whatever inference we get from the upper half of the plot is the same as the one that we get from the lower half, and you can consider either side of the diagonal for interpretation purposes.
To make this concrete, I picked a few of the graphs and interpreted them.
If we take a look at the graph between weight and mpg, we can see that there is a negative correlation. There is a positive correlation between weight and horsepower. In a similar way, the relationship between the other variables can be interpreted.
We are done with getting the data ready for building our model.
Building the Model
Dividing the data into dependent and independent variables
The first thing that is to be done is to divide the data into dependent and independent variables. In our dataset, we would like to predict the values of “mpg” and this mpg depends upon all the other attributes such as cylinders, horsepower etc. Therefore mpg is our dependent variable while the other variables are independent.
“mpg” is stored in y, while all the other attributes are stored in X.
Splitting the dataset into training and testing data
Then I split the data into training and testing data, with 30% of the data in the testing dataset and the rest in the training dataset. The data points in X that are going to be used in training are stored in X_train and those that are going to be used in testing are stored in X_test. Similarly, I stored the data points of y in y_train and y_test. I also set a random seed (random_state) to ensure that the same rows end up in the training and testing sets every time I run the code.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
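To see the last two steps together, here is a small sketch that separates X and y and then splits them. The 20 rows are randomly generated and only borrow a few column names from this article.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with 20 rows; NOT the article's dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mpg": rng.uniform(10, 40, 20),
    "cyl": rng.choice([4, 6, 8], 20),
    "hp": rng.uniform(60, 200, 20),
})

X = df.drop(columns=["mpg"])  # independent variables
y = df["mpg"]                 # dependent variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)  # 14 training rows, 6 testing rows

# Using the same random_state again reproduces exactly the same split.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.30, random_state=1)
```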
Making the Model Learn
The next step is to make my model learn from the training dataset. Upon initializing the model, I made the model learn from X_train and y_train using the ‘fit’ method.
The fitted model has the form y = m1*x1 + m2*x2 + m3*x3 + ... + c, with one term for each column of X, where m1, m2, m3, ... are the coefficients and c is the y-intercept. I used the ‘coef_’ attribute to get the coefficients of the various columns.
I used the intercept_ attribute to find the value of the y-intercept.
The equation is mpg = -0.23250(cyl) + 0.0245(dis) - 0.0014(hp) … - 19.75. Once we input the values of the variables, the target variable, mpg, will be predicted. Therefore, a mathematical model has been built to establish a relationship between the target variable ‘mpg’ and the other variables.
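Fitting the model and reading off its equation can be sketched as follows. The data is synthetic and noise-free, generated from a known equation (y = 2*x1 - 3*x2 + 5), so we can see that coef_ and intercept_ recover it. The variable name model and the column names x1 and x2 are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 2*x1 - 3*x2 + 5.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.uniform(0, 10, 30), "x2": rng.uniform(0, 10, 30)})
y = 2.0 * X["x1"] - 3.0 * X["x2"] + 5.0

model = LinearRegression()
model.fit(X, y)

# coef_ holds one coefficient per column of X, in column order.
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.3f}")

# intercept_ is the constant term c.
print(f"intercept: {model.intercept_:.3f}")
```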
Score of the model
It's also important to check how well the model is performing, so I checked the score of the training and testing data separately.
print("Score of training data is :" + str(Training_data_score))
print("Score of testing data is :" + str(Testing_data_score))
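The training data has a score of 82%, which is pretty decent. Assuming the score comes from LinearRegression’s score method, it is the coefficient of determination (R²), so the model explains about 82% of the variation in mpg on the training data. The testing data’s score, on the other hand, tells how well the model applies what it has learnt from the training dataset to data it has not seen. The testing data has a score of 82.4%, which is good enough, though there is still room for improvement.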
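How such scores might be computed can be sketched as follows, assuming they come from the model’s score method. The data is synthetic, so the values will differ from the 82% and 82.4% reported above, and the names training_score and testing_score are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with some noise; NOT the article's dataset.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = LinearRegression().fit(X_train, y_train)

# score() returns R^2 for regression models.
training_score = model.score(X_train, y_train)
testing_score = model.score(X_test, y_test)
print("Score of training data is :" + str(training_score))
print("Score of testing data is :" + str(testing_score))

# The same number can be obtained from the predictions via r2_score.
testing_r2 = r2_score(y_test, model.predict(X_test))
```

When the two scores are close to each other, as they are in this article, it suggests that the model is not simply memorizing the training data.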
Predicting the target value
The final thing that is left to do is to predict the value of ‘mpg’ for data points present in the X_test. The model has learnt from several cars present in the training dataset and now it's time to use whatever it has learnt to give the required predicted results. I used the predict function to do the same.
In the below snapshot, let’s take the values present in the 1st row. The car has 4 cylinders, an engine displacement of 8 cubic inches, hp of 97, a vehicle weight of 2506 lbs, takes 14.5 secs to accelerate from 0 to 60 mph, has model year 72, and has its origin as Asia. Taking all of these parameters into consideration, the model has predicted its mpg to be 23.83. In a similar way, the model has predicted the mpg for all the rows present in X_test.
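The prediction step can be sketched as follows. The data is synthetic and X_new stands in for X_test. The sketch also shows that predict simply plugs each row into the fitted equation, that is, multiplies the values by the coefficients and adds the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with three features; NOT the article's dataset.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(40, 3))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + 2.0 * X[:, 2] + 4.0

model = LinearRegression().fit(X, y)

X_new = X[:5]  # stand-in for X_test
predictions = model.predict(X_new)

# The same predictions computed by hand from the fitted equation.
manual = X_new @ model.coef_ + model.intercept_
print(predictions)
```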