Multiple Linear Regression Implementation in Python

Harshita Yadav
Machine Learning with Python
6 min read · May 7, 2021

In the last blog, we learned what Linear Regression is, the assumptions of Linear Regression, and how to implement Simple Linear Regression in Python.

In this blog, we will learn about the Multiple Linear Regression Model and its implementation in Python.

Multiple Linear Regression

Multiple Linear Regression is an extension of Simple Linear Regression: instead of a single predictor, it uses two or more predictor variables to predict the response variable. It models the linear relationship between one continuous dependent variable and several independent variables by fitting the best linear relationship between them.

It has two or more independent variables (X) and one dependent variable (Y), where Y is the value to be predicted. Thus, it is an approach for predicting a quantitative response using multiple features.

Equation: Y = β0 + β1X1 + β2X2 + β3X3 + … + βnXn + e

Y = Dependent variable / Target variable

β0 = Intercept of the regression line

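β1, β2, β3, …, βn = Regression coefficients (slopes). Each βi is the expected change in Y for a one-unit increase in Xi when the other predictors are held constant, and its sign tells whether Y increases or decreases with Xi

X1, X2, X3, …, Xn = Independent variables / Predictor variables

e = Error term (the part of Y that the predictors do not explain)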

Example: Predicting sales based on the money spent on TV, Radio, and Newspaper for marketing. In this case, there are three independent variables, i.e., money spent on TV, Radio, and Newspaper for marketing, and one dependent variable, i.e., sales, that is the value to be predicted.
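
To make the equation concrete, here is a minimal sketch that evaluates it for one observation with NumPy. The coefficient values and the spending amounts are hypothetical, chosen only for illustration; they are not taken from the dataset used later in this blog.

```python
import numpy as np

# Hypothetical coefficients, chosen only to illustrate the equation
beta_0 = 5.0                           # intercept (β0)
betas = np.array([0.05, 0.10, 0.01])   # slopes for TV, Radio, Newspaper (β1..β3)

# One hypothetical observation: money spent on TV, Radio, Newspaper
spend = np.array([100.0, 20.0, 10.0])

# Y = β0 + β1*X1 + β2*X2 + β3*X3 (the error term is left out of a point prediction)
predicted_sales = beta_0 + np.dot(betas, spend)
print(predicted_sales)
```

The dot product is just the sum β1X1 + β2X2 + β3X3, so adding more predictors only means making both arrays longer.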

Multiple Linear Regression Implementation using Python

Problem statement: Build a Multiple Linear Regression Model to predict sales based on the money spent on TV, Radio, and Newspaper for advertising.

Importing the Libraries

#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

numpy: NumPy stands for Numerical Python. It is a Python package for creating and processing single-dimensional and multi-dimensional arrays.

pandas: Pandas provides high-performance data structures and tools for data manipulation in Python.

matplotlib: Matplotlib is a library used for data visualization. It is mainly used for basic plotting. Visualization using Matplotlib generally consists of bars, pies, lines, scatter plots, and so on.

seaborn: Seaborn is a library for making statistical graphics, built on top of Matplotlib. It offers a variety of plot types, needs less code for common plots, and comes with attractive default themes. It is used to summarize data visually and to show how the data is distributed.

Reading the Dataset

#Reading the dataset
dataset = pd.read_csv("advertising.csv")

The dataset is in the CSV (Comma-Separated Values) format. Hence, we use pd.read_csv() to read the dataset.
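
Before modelling, it usually helps to take a quick look at what was loaded. The sketch below reads a few hypothetical rows from an in-memory string instead of advertising.csv so that it runs on its own; the values are made up, and only the column names follow the ones used in this blog.

```python
import io
import pandas as pd

# Hypothetical rows in the same column layout as advertising.csv
csv_text = """TV,Radio,Newspaper,Sales
100.0,20.0,10.0,12.0
50.0,5.0,30.0,8.0
200.0,40.0,15.0,20.0
"""

# pd.read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)           # (number of rows, number of columns)
print(sample.dtypes)          # column types inferred from the data
print(sample.isnull().sum())  # count of missing values per column
```

The same three checks can be run on the real dataset by replacing sample with dataset.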

dataset.head()
Figure: Sales Dataset

Equation: Sales = β0 + (β1 * TV) + (β2 * Radio) + (β3 * Newspaper) + e

Setting the values for independent (X) variable and dependent (Y) variable

#Setting the value for X and Y
x = dataset[['TV', 'Radio', 'Newspaper']]
y = dataset['Sales']

Splitting the dataset into train and test set

#Splitting the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 100)

from sklearn.model_selection import train_test_split: It is used for splitting data arrays into two subsets: for training data and testing data. With this function, you don’t need to divide the dataset manually.

We need to split our dataset into training and testing sets. We do this with train_test_split from the sklearn.model_selection module. A common choice is to keep 70% of the data for training and the remaining 30% for testing.

test_size: This parameter specifies the size of the testing dataset. A float between 0.0 and 1.0 is the proportion of the data to put in the test set, while an int is the absolute number of test samples. If it is left as None, it is set to the complement of train_size, and if train_size is also None, it defaults to 0.25.

random_state: This parameter controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
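
The effect of these two parameters can be checked on stand-in data. This is a small sketch that assumes a synthetic array of 200 rows and 3 features (not the advertising dataset) and splits it twice with the same settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 3 features
rng = np.random.default_rng(0)
features = rng.random((200, 3))
target = rng.random(200)

# test_size=0.3 keeps 30% of the rows for testing
split_a = train_test_split(features, target, test_size=0.3, random_state=100)
split_b = train_test_split(features, target, test_size=0.3, random_state=100)

x_tr, x_te, y_tr, y_te = split_a
print(x_tr.shape, x_te.shape)

# The same random_state gives the same split on every call
same_split = np.array_equal(split_a[1], split_b[1])
print(same_split)
```

Changing random_state to a different int, or leaving it as None, would generally produce a different set of test rows.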

Implementing the linear model

#Fitting the Multiple Linear Regression model
from sklearn.linear_model import LinearRegression
mlr = LinearRegression()
mlr.fit(x_train, y_train)

from sklearn.linear_model import LinearRegression: It is used to perform Linear Regression in Python.

To build a linear regression model, we create an instance of the LinearRegression class and train it on x_train and y_train using its fit() method. After this step, the variable mlr holds the fitted model.
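
One way to see what fit() does is to train on data where the true relationship is known. The sketch below assumes synthetic data generated from hypothetical coefficients plus a little noise, and checks that the fitted model recovers values close to them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from known (hypothetical) coefficients
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(300, 3))
true_intercept = 4.0
true_coefs = np.array([0.05, 0.10, 0.00])
noise = rng.normal(0, 0.5, size=300)
y_demo = true_intercept + X_demo @ true_coefs + noise

# Fit by ordinary least squares, as LinearRegression does
demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

print(demo_model.intercept_)  # should land near true_intercept
print(demo_model.coef_)       # should land near true_coefs
```

The estimates are not exact because of the noise, but they get closer to the true values as the number of rows grows.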

Model Equation

#Intercept and Coefficient
print("Intercept: ", mlr.intercept_)
print("Coefficients:")
list(zip(x, mlr.coef_))
Figure: Intercept & Coefficients

Regression Equation: Sales = 4.3345 + (0.0538 * TV) + (1.1100 * Radio) + (0.0062 * Newspaper) + e

From the above-obtained equation for the Multiple Linear Regression Model, we can see that the intercept is 4.3345. This means that if the money spent on TV, Radio, and Newspaper for advertisement is 0, the estimated average sales will be 4.3345. Each coefficient gives the expected change in sales for a single rupee increase in the money spent on that medium, with the money spent on the other two held constant: 0.0538 for TV, 1.1100 for Radio, and 0.0062 for Newspaper.
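
This reading of the intercept and coefficients can be verified with plain arithmetic. The sketch below plugs in the intercept and coefficients exactly as reported in the regression equation; the helper predict_sales and the spending levels are hypothetical and exist only for this illustration.

```python
import numpy as np

# Intercept and coefficients as reported in the regression equation
intercept = 4.3345
coefs = np.array([0.0538, 1.1100, 0.0062])  # TV, Radio, Newspaper

def predict_sales(tv, radio, newspaper):
    """Evaluate the fitted equation for one combination of spending."""
    return intercept + np.dot(coefs, [tv, radio, newspaper])

# Hypothetical spending levels; only TV changes between the two calls
base = predict_sales(100.0, 20.0, 10.0)
one_more_tv = predict_sales(101.0, 20.0, 10.0)

print(predict_sales(0.0, 0.0, 0.0))  # zero spending gives the intercept
print(one_more_tv - base)            # difference is the TV coefficient, up to rounding
```

Because the model is linear, the difference is the same no matter which base spending levels are chosen.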

Prediction on the test set

#Prediction of test set
y_pred_mlr = mlr.predict(x_test)
#Predicted values
print("Prediction for test set: {}".format(y_pred_mlr))
Figure: Predicted values

Once we have fitted (trained) the model, we can make predictions using its predict() method. We pass x_test to this method and compare the predicted values, stored in y_pred_mlr, with the y_test values to check how accurate the predictions are.
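
For a linear model, predict() simply applies the fitted equation to every row. Here is a minimal sketch on synthetic data (an assumption, not the advertising dataset) showing that the manual calculation and predict() agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with three features
rng = np.random.default_rng(1)
X_small = rng.uniform(0, 50, size=(40, 3))
y_small = 2.0 + X_small @ np.array([0.3, 0.2, 0.1]) + rng.normal(0, 1.0, 40)

model = LinearRegression().fit(X_small, y_small)

# Apply intercept + coefficients to each row by hand
manual = model.intercept_ + X_small @ model.coef_

# Compare with the values returned by predict()
matches = np.allclose(manual, model.predict(X_small))
print(matches)
```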

Actual values and the predicted values

#Actual value and the predicted value
mlr_diff = pd.DataFrame({'Actual value': y_test, 'Predicted value': y_pred_mlr})
mlr_diff.head()
Figure: Actual and the Predicted values

Evaluating the Model

#Model Evaluation
from sklearn import metrics
meanAbErr = metrics.mean_absolute_error(y_test, y_pred_mlr)
meanSqErr = metrics.mean_squared_error(y_test, y_pred_mlr)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, y_pred_mlr))
print('R squared: {:.2f}'.format(mlr.score(x,y)*100))
print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)
Figure: Evaluation Metrics

from sklearn import metrics: It provides metrics for evaluating the model.

R Squared: R Squared is the coefficient of determination. It measures the proportion of the variance in the dependent variable that the model explains. The value obtained here is 90.11, which means the model explains 90.11% of the variance in Sales. Note that mlr.score(x, y) computes this on the full dataset (training and test rows together), not on the test set alone.

Mean Absolute Error: Mean Absolute Error is the average of the absolute differences between the actual values and the predicted values. The lower the value, the better the model's performance. A mean absolute error of 0 means that the model predicts the outputs perfectly. The mean absolute error obtained for this particular model is 1.227, so the predictions are off by about 1.227 sales units on average.

Mean Square Error: Mean Square Error is the average of the squared differences between the actual and predicted values. Because the errors are squared, large errors are penalized more heavily, and the result is in squared sales units. The lower the value, the better the model's performance. The mean square error obtained for this particular model is 2.636.

Root Mean Square Error: Root Mean Square Error is the square root of the Mean Square Error, which brings the value back to the same units as Sales. It can be read as the typical size of a prediction error. The lower the value, the better the model's performance. The root mean square error obtained for this particular model is 1.623.
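
The four metrics are easy to compute by hand, which helps in understanding what each one measures. The sketch below uses a handful of hypothetical actual and predicted values (not the results of this model) and compares the manual formulas with the functions in sklearn.metrics.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted values, for illustration only
actual = np.array([10.0, 12.0, 15.0, 20.0, 8.0])
predicted = np.array([11.0, 11.5, 14.0, 18.0, 9.5])

errors = actual - predicted

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # back in the units of the target

# R squared = 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print(mae, metrics.mean_absolute_error(actual, predicted))
print(mse, metrics.mean_squared_error(actual, predicted))
print(rmse)
print(r2, metrics.r2_score(actual, predicted))
```

Each pair of printed numbers should agree, which confirms that the library functions implement the formulas described above.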

Conclusion

The Multiple Linear Regression model performs well: it explains 90.11% of the variance in Sales, and the mean absolute error, mean square error, and root mean square error are all low.

Hey guys! I’m Harshita. I’m a Data Science student and trying to contribute a bit to the community by sharing my knowledge. Please share this with someone you know who is trying to learn Machine Learning. I would appreciate your comments, suggestions, or feedback. Thank you.

Email Id: harshita.1128@gmail.com

LinkedIn: www.linkedin.com/in/harshita-11

Github: www.github.com/Harshita0109

MSc Data Science student at Christ (Deemed to be University)