Simple Linear Regression and Multiple Linear Regression Analysis with the Statsmodels Library in Python.

akinsoji hammed
Published in devcareers
4 min read · Sep 12, 2019

Regression analysis is one of the most common machine learning models; it is widely used to perform fitting and prediction based on historical (retrospective) data extracted from a particular operation. Regression analysis models the effect of independent variables on a dependent variable, usually in linear form. Simple linear regression models a linear relationship between a dependent variable and one independent variable, while multiple linear regression models a linear relationship between a dependent variable and two or more independent variables.

Since linear regression is so widely used, five basic assumptions should be checked before choosing it:

  1. Linearity: the relationship should appear as a roughly straight line on a scatter plot of the dependent (target) variable against the independent variable. If the pattern of points is not straight, a non-linear model should be used instead.
  2. No Endogeneity: the error term should not be correlated with the independent variables. This assumption is commonly violated by omitted-variable bias, where a relevant variable left out of the model biases the estimated coefficients; checking the p-values of the independent variables after fitting helps reveal such problems.
  3. Normality and Homoscedasticity: the errors should be normally distributed with zero mean, and their variance should be constant (equal) across all values of the independent variables.
  4. No Autocorrelation: the residuals should not be correlated with one another. A common check is the Durbin-Watson statistic, which ranges from 0 to 4; values near 2 indicate no autocorrelation, while values below 1 or above 3 are a cause for concern.
  5. No Multicollinearity: no independent variable should be (nearly) a linear combination of the other independent variables. Highly correlated regressors inflate the standard errors of the coefficients and make their individual p-values unreliable.

Using linear regression to predict the price of vehicles based on mileage, model and age.

First, import all the necessary libraries: numpy, pandas, seaborn, statsmodels and matplotlib.

# importing libraries
import pandas as pd
import seaborn as sns
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import matplotlib.pyplot as plt

Load the Excel file using pandas:

df = pd.read_excel("Use motor data.xlsx")
df
The data set

Next, perform descriptive statistics on the data:

# descriptive statistics on the data
df.describe()

Next, select the dependent and independent variables from the data. Sell price is the dependent variable and mileage is the independent variable, so simple linear regression is considered.

# X1 is independent variable
X1 = df['Mileage']
# Y is dependent variable that will be predicted based on X1
Y = df['Sell Price($)']

Checking the linearity and homoscedasticity of the selected variables is part of the linear regression assumptions. This is done with a scatter plot: a red reference line drawn through the points indicates linearity; if the points do not follow a straight line, linear regression does not fit the data.

sns.scatterplot(x=X1, y=Y)
# red reference line through the cloud of points
plt.plot([90000, 20000], [15000, 40000], 'r')

The other linear regression assumptions will be checked after the initial fit of the model.

Fit the linear regression using Ordinary Least Squares (OLS) from statsmodels.

# add a constant column so the model fits an intercept
X = sm.add_constant(X1)
# fitting the model
results = sm.OLS(Y, X).fit()

Print the model summary to check how well the model fits:

#summary of the model
results.summary()
Summary of the linear regression.

From the summary tables above, Prob(F-statistic) (highlighted in yellow) is significant because its value is less than the significance levels of both 0.01 and 0.05. The p-value of the independent variable (highlighted in blue) is also significant, which suggests no endogeneity. However, the Durbin-Watson value below 1 indicates autocorrelation in the residuals. Since the model uses a single independent variable, the autocorrelation may not be critical here; otherwise, the independent variable could be transformed to better fit the model.

Given these results, the first fit may be accepted, or the model may be optimized further to increase its efficiency.

Perform multiple linear regression on the same data by considering more than one independent variable.

# here car model and mileage are used to predict the price of the vehicle
Xk = df[['Car Model', 'Mileage']]
Ym = df['Sell Price($)']
Xm = sm.add_constant(Xk)
# fitting the model for multiple regression
Km = sm.OLS(Ym, Xm).fit()
Km.summary()

From the summary above, the Durbin-Watson value shows there is autocorrelation in the residuals; hence, transformation of the independent variables is required.

Github repo https://github.com/harksodje/linear-regression-with-statsmodel/tree/master/regression%20with%20statsmodel

follow me on twitter: @aksodje4real, medium: akinsoji hammed github: @harksodje
