Simple Linear Regression
Linear Regression Implementation in Python

Harshita Yadav
Machine Learning with Python

Linear regression is a supervised learning technique used to solve regression problems. Regression is the process of finding a model that predicts a continuous value from its input variables, for example temperature, price, sales, salary, or age.

Linear regression is mainly used to find a linear relationship between the target and one or more predictors. In other words, it predicts the dependent variable (target) by fitting the best linear relationship to the independent variables (predictors). It is also used for forecasting and for measuring how strongly variables are related, although a fitted relationship on its own does not prove cause and effect.

Assumptions of Linear Regression

  • The independent variables (predictors) should be linearly related to the dependent variable (target): A linear relationship is one where a change in one variable is accompanied by a proportional increase or decrease in the other. Linearity can be checked with visualization techniques such as a scatter plot, pair plot, or heatmap.
  • The data should be Normally Distributed: A normal distribution is a probability distribution that is symmetric about the mean, so values near the mean occur more often than values far from it. In graph form, it appears as a bell curve. Normality can be checked with visualization techniques such as a Q-Q plot or a histogram. Strictly speaking, this requirement matters for the residuals rather than for the raw variables.
  • There should be little or no Multicollinearity present in the data: Multicollinearity occurs when a model has several independent variables that are correlated not just with the response variable but also with each other. It can be detected with a correlation matrix, the variance inflation factor (VIF), etc.
  • The mean of the residuals should be zero: A residual is the difference between the actual value and the predicted value. A mean of zero indicates that the model does not systematically over-predict or under-predict; it does not mean that every individual residual is zero.
  • The residuals obtained should be Normally Distributed: This can be checked using the Q-Q Plot on the residuals.
  • The variance of the residuals should be the same throughout the data (homoscedasticity): This can be checked with a residuals vs. fitted plot.
  • There should be little or no auto-correlation present in the data: Auto-correlation occurs when the residuals are not independent of each other. It can be detected using the Durbin-Watson test, an ACF plot, etc.
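
Two of the residual checks listed above can be sketched in plain Python. This is only an illustration on made-up numbers (the x and y values are hypothetical, not the advertising dataset), using the standard least-squares formulas for a single predictor and the usual definition of the Durbin-Watson statistic.

```python
# Toy data (hypothetical values, not the advertising dataset).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Least-squares slope and intercept for a single predictor.
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
intercept = y_mean - slope * x_mean

# Residual = actual value - predicted value.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# With an intercept in the model, the residuals average to (numerically) zero.
residual_mean = sum(residuals) / n

# Durbin-Watson statistic: values near 2 suggest little auto-correlation,
# values toward 0 or 4 suggest positive or negative auto-correlation.
durbin_watson = sum(
    (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, n)
) / sum(r ** 2 for r in residuals)

print("Mean of residuals:", round(residual_mean, 10))
print("Durbin-Watson:", round(durbin_watson, 3))
```

On a real dataset, the same checks would normally be run with a statistics library and paired with the plots mentioned in the list.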

Types of Linear Regression

Simple Linear Regression

Simple Linear Regression finds the linear relationship between two continuous variables. It uses one independent variable to predict a dependent variable by fitting the best straight line through the data.

It has only one independent variable (X) and one dependent variable (Y), where Y is the value to be predicted. Thus, it is an approach for predicting a quantitative response using a single feature.

Equation: Y = β0 + β1X + e

Where,

Y = Dependent variable / Target variable

β0 = Intercept of the regression line

β1 = Slope of the regression line, i.e., the change in Y for a one-unit increase in X; its sign tells whether the line is increasing or decreasing

X = Independent variable / Predictor variable

e = Error

Example: Predicting sales based on the money spent on TV for marketing. In this case, there is only one independent variable, i.e., money spent on TV for marketing, and one dependent variable, i.e., sales, that is the value to be predicted.
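
To make the equation concrete, here is a minimal sketch of how β0 and β1 can be estimated by ordinary least squares for a single predictor. The function name and the four data points are hypothetical, chosen so the points lie exactly on Y = 1 + 2X and the error term is zero.

```python
def fit_simple_linear_regression(x, y):
    """Return (beta0, beta1) that minimise the sum of squared errors."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # beta1 = covariance(x, y) / variance(x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = sxy / sxx
    # The fitted line always passes through (x_mean, y_mean).
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


# Hypothetical points that lie exactly on Y = 1 + 2X.
tv_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

beta0, beta1 = fit_simple_linear_regression(tv_spend, sales)
print(beta0, beta1)  # 1.0 2.0
```

With noisy real data the points will not sit on the line, and the leftover differences are the error term e.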

Simple Linear Regression Implementation using Python

Problem statement: Build a Simple Linear Regression Model to predict sales based on the money spent on TV for advertising.

Importing the Libraries

# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

numpy: NumPy stands for Numerical Python, a Python package for computing with and processing single-dimensional and multi-dimensional arrays.

pandas: Pandas provides high-performance data manipulation in Python.

matplotlib: Matplotlib is a library used for data visualization. It is mainly used for basic plotting. Visualization using Matplotlib generally consists of bars, pies, lines, scatter plots, and so on.

seaborn: Seaborn is a library used for making statistical graphics. It provides a variety of plot types, needs less code for common plots, and comes with attractive default themes. It is used to summarize data in visualizations and show the data’s distribution.

Reading the Dataset

# Reading the dataset
dataset = pd.read_csv("advertising.csv")

The dataset is in the CSV (Comma-Separated Values) format. Hence, we use pd.read_csv() to read the dataset.

dataset.head()

Figure: Sales Dataset

Since our problem involves only the Sales and TV columns, we do not need the Radio and Newspaper columns. Therefore, we can drop those columns.

# Dropping the unnecessary columns
dataset.drop(columns=['Radio', 'Newspaper'], inplace=True)
dataset.head()

Figure: Required Columns

Equation: Sales = β0 + β1*TV + e

Setting the values for independent (X) variable and dependent (Y) variable

# Setting the value for X and Y
x = dataset[['TV']]
y = dataset['Sales']

Splitting the dataset into train and test set

# Splitting the dataset
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=100)

from sklearn.model_selection import train_test_split: It is used for splitting data arrays into two subsets: for training data and testing data. With this function, you don’t need to divide the dataset manually.

We need to split our dataset into training and testing sets. We’ll do this by importing train_test_split from the sklearn.model_selection module. A common choice is to keep 70% of the data in the training set and the remaining 30% in the test set.

test_size: This parameter specifies the size of the test set, either as a fraction of the data (a float between 0 and 1) or as an absolute number of rows (an int). If it is left unset, it defaults to the complement of train_size; if train_size is also unset, it is set to 0.25.

random_state: This parameter controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
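
To see why a fixed seed gives a reproducible split, here is a small hand-rolled sketch using only the standard library. The shuffle_split helper is hypothetical and is not how scikit-learn implements train_test_split (its rounding and shuffling differ); it only illustrates the idea of a seeded shuffle followed by a cut.

```python
import random


def shuffle_split(rows, test_size=0.3, seed=100):
    """Shuffle a copy of rows with a seeded generator, then cut it in two."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


rows = list(range(200))  # stand-in for dataset rows; 200 is an arbitrary count

train_a, test_a = shuffle_split(rows, test_size=0.3, seed=100)
train_b, test_b = shuffle_split(rows, test_size=0.3, seed=100)

print(len(train_a), len(test_a))  # 140 60
print(test_a == test_b)           # True: same seed, same split
```

Changing the seed changes which rows land in each subset, but the subset sizes stay the same.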

Implementing the linear model

# Fitting the Linear Regression model
from sklearn.linear_model import LinearRegression

slr = LinearRegression()
slr.fit(x_train, y_train)

from sklearn.linear_model import LinearRegression: It is used to perform Linear Regression in Python.

To build a linear regression model, we create an instance of the LinearRegression class and train it on x_train and y_train using its fit() method. The variable slr now holds the fitted model.

Model Equation

# Intercept and Coefficient
print("Intercept: ", slr.intercept_)
print("Coefficient: ", slr.coef_)

Figure: Intercept & Coefficient

Regression Equation: Sales = 6.948 + 0.054 * TV

From the equation obtained above, the intercept is 6.948: if the money spent on TV advertising is 0, the estimated average sales are 6.948. The coefficient is 0.054: each one-rupee increase in TV advertising spend raises the estimated sales by 0.054.
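
As a quick illustration, the regression equation can be evaluated by hand. The sketch below uses the rounded intercept and coefficient reported above; the predict_sales helper and the TV budgets are hypothetical, and a fitted model's predict() would use the unrounded values.

```python
INTERCEPT = 6.948  # rounded value reported above
SLOPE = 0.054      # rounded value reported above


def predict_sales(tv_spend):
    """Estimated sales for a given TV advertising spend."""
    return INTERCEPT + SLOPE * tv_spend


for tv in (0, 100, 200):  # hypothetical TV budgets
    print(tv, round(predict_sales(tv), 3))
```

For example, a TV budget of 100 gives 6.948 + 0.054 * 100 = 12.348.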

Prediction on the test set

# Prediction on the test set
y_pred_slr = slr.predict(x_test)

# Predicted values
print("Prediction for test set: {}".format(y_pred_slr))

Figure: Predicted values

Once we have fitted (trained) the model, we can make predictions using the predict() method. We pass x_test to this method and compare the predicted values, y_pred_slr, with the y_test values to check how accurate the predictions are.

Actual values and the predicted values

# Actual values and the predicted values
slr_diff = pd.DataFrame({'Actual value': y_test, 'Predicted value': y_pred_slr})
slr_diff.head()

Figure: Actual and Predicted values

Line of Best Fit

# Line of best fit
plt.scatter(x_test, y_test)
plt.plot(x_test, y_pred_slr, color='red')
plt.show()

Figure: Line of Best Fit

The straight line above is the best-fitting line found by the model, drawn over the test points.

Evaluating the Model

# Model evaluation
from sklearn import metrics

meanAbErr = metrics.mean_absolute_error(y_test, y_pred_slr)
meanSqErr = metrics.mean_squared_error(y_test, y_pred_slr)
rootMeanSqErr = np.sqrt(meanSqErr)
# Note: score(x, y) reports R squared on the full dataset (train and test rows);
# slr.score(x_test, y_test) would report it on the held-out test set only.
print('R squared: {:.2f}'.format(slr.score(x, y) * 100))
print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)

Figure: Evaluation Metrics

from sklearn import metrics: It provides metrics for evaluating the model.

R Squared: R squared is the coefficient of determination. It measures the proportion of the variance in the target that the model explains. The value here is 81.10, which indicates that about 81.10% of the variation in Sales is explained by the money spent on TV advertising.

Mean Absolute Error: Mean Absolute Error is the average of the absolute differences between the actual values and the predicted values. The lower the value, the better the model’s performance; a mean absolute error of 0 means that the model predicts every output perfectly. The mean absolute error obtained for this model is 1.648, which is reasonably low, though it should be judged against the typical size of the Sales values.

Mean Square Error: Mean Square Error is the average of the squared differences between the actual and predicted values. Because the errors are squared, large errors are penalized more heavily. The lower the value, the better the model’s performance. The mean square error obtained for this model is 4.077.

Root Mean Square Error: Root Mean Square Error is the square root of the Mean Square Error. It can be read as the typical size of a prediction error and is expressed in the same units as the target, which makes it easier to interpret than the Mean Square Error. The lower the value, the better the model’s performance. The root mean square error obtained for this model is 2.019 (the square root of 4.077).
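
The four metrics can also be computed by hand, which makes their definitions easier to compare. The following is a minimal sketch in plain Python on made-up numbers; the regression_metrics helper and the actual/predicted values are hypothetical and are not taken from the advertising dataset.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE and R squared from two equal-length lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    # R squared = 1 - (residual sum of squares / total sum of squares)
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


actual = [10.0, 12.0, 14.0, 16.0]     # hypothetical true values
predicted = [11.0, 11.0, 15.0, 15.0]  # hypothetical model outputs

scores = regression_metrics(actual, predicted)
print(scores)  # {'mae': 1.0, 'mse': 1.0, 'rmse': 1.0, 'r2': 0.8}
```

Here every prediction is off by exactly 1, so MAE, MSE, and RMSE are all 1.0, while R squared is 0.8 because the squared errors (4 in total) are small compared with the spread of the actual values (20 in total).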

Conclusion

The Simple Linear Regression model performs well: it explains about 81.10% of the variation in Sales, and the mean absolute error, mean square error, and root mean square error are all low.

In the next blog, we will learn about the Multiple Linear Regression Model. Till then, stay tuned!

Hey guys! I’m Harshita. I’m a Data Science student and trying to contribute a bit to the community by sharing my knowledge. Please share this with someone you know who is trying to learn Machine Learning. I would appreciate your comments, suggestions, or feedback. Thank you.

Email Id: harshita.1128@gmail.com

LinkedIn: www.linkedin.com/in/harshita-11

Github: www.github.com/Harshita0109


MSc Data Science student at Christ (Deemed to be University)