All about Linear Regression Assumptions Part 2
Linear regression is a basic concept to start with in machine learning. In this blog, let's look into the assumptions we need to make while using a linear regression model. Without much ado, let's begin with a multiple linear regression dataset.
To get a basic understanding of linear regression assumptions, please refer to Part 1.
Data set: https://www.kaggle.com/datasets/awaiskaggler/insurance-csv/data
Import all the libraries and read the data set.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
data = pd.read_csv('insurance.csv')
data.head()
Here ‘charges’ is our target variable and the rest are the independent variables. We need to predict the insurance charges based on the person's age, sex, BMI, number of children, whether the customer is a smoker or non-smoker, and the region in which he/she resides.
data.info()
Assumptions
1. Linearity
There should be a linear relationship between the independent variables and the dependent variable. A pairplot can be used to check the linear relationship visually.
p = sns.pairplot(data, x_vars=['age', 'sex', 'bmi', 'children', 'smoker', 'region'], y_vars='charges', height=3, aspect=0.7)
From the above graph, age and BMI are somewhat linearly related to charges. The rest of the features are categorical variables, hence we need not consider them when checking linearity.
2. No multicollinearity
We need to ensure there is no collinearity among the independent variables. If any pair of independent variables is highly correlated, keep only one of them when running the model. Use either a heatmap or the Pearson correlation to find multicollinearity.
data1 = data.drop(['charges'], axis = 1)
plt.figure(figsize=(10,10))
p = sns.heatmap(data1.corr(numeric_only=True), annot=True, cmap='RdYlGn', square=True)
data.corr(method='pearson', numeric_only=True)
The correlation between two variables tells us both the direction and the strength with which they vary together. It ranges from -1 to 1. If the value is close to 1, the variables are strongly correlated in the positive direction; if it's close to -1, they are strongly correlated in the negative direction.
From the above matrix, it's evident that there is no collinearity since the magnitudes are low, hence we need not exclude any variable. As a rule of thumb, an absolute correlation greater than 0.8 implies a strong correlation.
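As a quick programmatic check (a minimal sketch building on the data1 frame created above, with 0.8 as an assumed threshold), we can flag any pair of numeric independent variables whose absolute correlation exceeds the rule of thumb:
# Flag pairs of numeric independent variables with |correlation| > 0.8
corr = data1.corr(numeric_only=True).abs()
high_corr_pairs = [(corr.columns[i], corr.columns[j], corr.iloc[i, j])
                   for i in range(len(corr.columns))
                   for j in range(i + 1, len(corr.columns))
                   if corr.iloc[i, j] > 0.8]
print(high_corr_pairs)  # an empty list means no strong multicollinearity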
Only linearity and multicollinearity can be checked without running the model. The remaining assumptions require the residuals, so let us run the linear regression model first.
Running Linear Regression model
cat_col = ['sex','children','smoker','region']
data_en = pd.get_dummies(data = data, prefix= 'ohe', prefix_sep = '-', columns = cat_col, drop_first = True, dtype = 'int8')
x = data_en.drop(["charges"],axis=1)
y = data_en.charges
sc = StandardScaler()
X = sc.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0,test_size=0.25)
regr = LinearRegression()
regr.fit(X_train,y_train)
y_pred = regr.predict(X_train)
print("R squared: {}".format(r2_score(y_true=y_train,y_pred=y_pred)))
Adj_r2 = 1 - (1 - r2_score(y_train, y_pred)) * (len(y_train) - 1) / (len(y_train) - X.shape[1] - 1)
print("Adjusted R squared: {}".format(Adj_r2))
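For reference, the adjusted R² computed above follows the standard formula, where n is the number of training samples and p is the number of predictors:
Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)
Unlike plain R², it penalizes the score for adding predictors that do not improve the fit.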
3. Mean of Residuals is approximately zero
Residuals are the differences between the true and the predicted values.
residuals = y_train.values-y_pred
mean_residuals = np.mean(residuals)
print("Mean of Residuals {}".format(round( mean_residuals,2)))
Here the mean of residuals is approximately zero, hence this assumption holds good. (In fact, for ordinary least squares with an intercept term, the training residuals always average to zero by construction, so this check mainly confirms the model was fitted correctly.)
4. Check for Homoscedasticity of residual v/s predicted target variable
Homoscedasticity means that the residuals have constant variance. Plot the residuals against the predicted values and check that no pattern is visible.
p = sns.scatterplot(x = y_pred, y = residuals)
plt.xlabel('y_pred/predicted values')
plt.ylabel('Residuals')
p = plt.title('Residuals vs fitted values plot for homoscedasticity check')
The above graph doesn't provide enough insight into homoscedasticity, so let us use the Goldfeld-Quandt test to check it.
Assumptions of Goldfeld-Quandt Test
- data is normally distributed.
Null and Alternate Hypothesis of Goldfeld-Quandt Test
- Null Hypothesis: Heteroscedasticity is not present.
- Alternate Hypothesis: Heteroscedasticity is present.
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(residuals, X_train)
lzip(name, test)
Here the p-value is less than 0.05, hence we reject the null hypothesis and conclude that heteroscedasticity is present. This is not ideal for linear regression, but let us still continue checking the other assumptions.
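As a cross-check (a minimal sketch, not part of the original workflow, reusing the residuals and X_train from above), the Breusch-Pagan test in statsmodels tests the same null hypothesis of homoscedasticity:
# Breusch-Pagan test: null hypothesis is that the residuals are homoscedastic
bp_names = ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']
bp_test = sms.het_breuschpagan(residuals, sm.add_constant(X_train))
lzip(bp_names, bp_test)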
5. Check for Normality of residuals
Ensure that the residuals are normally distributed.
p = sns.histplot(residuals, kde=True)
p = plt.title('Normality of error terms/residuals')
The residuals are almost normally distributed. In reality, it's not possible to get a perfectly normal distribution.
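As an additional visual check (a minimal sketch using statsmodels, not part of the original article), a Q-Q plot compares the residual quantiles against a theoretical normal distribution; points lying close to the reference line suggest approximate normality:
# Q-Q plot of residuals against a fitted normal distribution
sm.qqplot(residuals, line='45', fit=True)
p = plt.title('Q-Q plot of residuals')
plt.show()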
6. No autocorrelation of residuals
Residuals should be independent of each other; there shouldn't be any autocorrelation (a repeated pattern).
Autocorrelation can be checked using the time-series concepts ACF and PACF.
If you are new to time series, kindly refer to any ACF and PACF blog on Medium.
sm.graphics.tsa.plot_acf(residuals, lags=40)
plt.show()
# partial autocorrelation
sm.graphics.tsa.plot_pacf(residuals, lags=40)
plt.show()
The results show no signs of autocorrelation since there are no significant spikes outside the confidence-interval region in the ACF and PACF plots (the spike at lag 0 is always 1 and can be ignored).
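Another quick numeric check (a minimal sketch, not in the original article) is the Durbin-Watson statistic; a value close to 2 indicates little or no autocorrelation in the residuals, while values toward 0 or 4 indicate positive or negative autocorrelation respectively.
from statsmodels.stats.stattools import durbin_watson
# Durbin-Watson statistic for the training residuals (~2 means no autocorrelation)
print("Durbin-Watson statistic: {}".format(durbin_watson(residuals)))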
Conclusion
Most of the assumptions of linear regression hold good in this case (homoscedasticity being the exception), hence we can conclude that linear regression can be used to predict the target variable.
EndNote:
I hope you enjoyed the article and got a clear picture of the assumptions of linear regression. Please drop your suggestions or queries in the comment section.
Would love to catch you on LinkedIn. Mail me here for any queries.
Happy reading!!!!
I believe in the power of continuous learning and sharing knowledge with the community. Your contributions are invaluable in helping me create meaningful content and resources that benefit everyone. Join me on this journey of exploration and innovation in the fascinating world of data science by donating to Buy Me a Coffee.