How to perform simple linear regression using R and Python, including diagnostic checking and extracting insights
What is simple linear regression?
Simple linear regression, or SLR, is a statistical method used to describe the relationship between two variables, a dependent variable and an independent variable, to measure how strong that relationship is, and to estimate the value of the dependent variable at a given value of the independent variable, using a linear (straight-line) fit. If there is more than one independent variable, the model is called multiple linear regression, or MLR, which is the extension of SLR. In this post, we will only discuss SLR.
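In its general form, the SLR model fits a straight line that can be written as:
dependent_variable = intercept + coefficient x independent_variable + error
where the intercept and the coefficient (the slope of the line) are the two numbers the regression estimates from the data.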
What are dependent and independent variables?
For example, suppose we want to estimate the relationship between parents' height and children's height. In this case, the question would be, "How big is the effect of parents' height on children's height?", because children get their DNA from their parents, not the other way around. Because of that, we can say that the parents' height is the independent variable and the children's height is the dependent variable, which "depends" on the parents' height.
That means the dependent variable is the one whose value we want to estimate based on the value of the independent variable.
What are the assumptions that need to be fulfilled when performing SLR?
SLR is a parametric method, which means it makes certain assumptions about the data that need to be fulfilled before and after the regression is performed.
- [Before performing a linear regression] The relationship between the independent and dependent variables is linear (straight-line).
- [After performing a linear regression] The residuals, plotted against the fitted values, scatter around a roughly straight, flat line (no systematic pattern).
- [After performing a linear regression] The residuals follow a normal distribution.
- [After performing a linear regression] The residuals are homoscedastic, which means their variance is equally spread across the fitted values. If this is not fulfilled, the residuals are called heteroscedastic.
- [After performing a linear regression] No data point has high leverage, which means there are no extreme values that change the model significantly when included or excluded.
Let’s dive into the practice!
In this case, we will use data I randomly generated, which consists of 299 rows of children's height, father's height, and mother's height. Please download the data here if you want to use it. We want to estimate the effect of father_height on children_height.
Checking the linearity assumption
One of the most common methods is to plot the independent variable against the dependent variable; the pattern between the variables should be approximately linear. If it is not linear but the variables are still strongly correlated, a linear approach may not be our best choice.
Linearity check using R
# import ggplot2 for data visualization
library(ggplot2)
# read the file
df <- read.csv("children_parent_height.csv")
str(df)
# scatter plot with linear trend between father and children height
ggplot(aes(x = father_height, y = children_height), data = df) +
geom_point() +
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE)
Linearity check using Python
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# read the file
df = pd.read_csv("children_parent_height.csv")
df.info()
# scatter plot with linear trend between father and children height
plt.scatter(df['father_height'], df['children_height'])
slope, intercept = np.polyfit(df['father_height'], df['children_height'], 1)
plt.plot(df['father_height'], slope * df['father_height'] + intercept)
plt.show()
Linearity check conclusion
As we can see, whether we use R or Python, the visualization is similar and the conclusion is the same: the relationship between children_height and father_height is linear enough for us to perform linear regression.
We also find that the correlation between the variables is strong and positive. Please read my previous post about correlation if you do not yet understand what that means.
What can we do if the relationship is not linear?
We need to change our method from linear regression to non-linear regression, which we will not discuss here. But if we still want to go with linear regression, we have to transform our data to produce a linear (straight-line) relationship before fitting the linear regression model. One framework for choosing which power to use for the transformation is Tukey's transformation ladder.
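As a rough illustration of such a transformation, here is a minimal Python sketch with made-up data (the column names and the exponential relationship are hypothetical, not taken from our height dataset): a log transform, one rung of Tukey's ladder, straightens a curved relationship so a linear fit becomes reasonable.
import numpy as np
import pandas as pd
# hypothetical data where y grows exponentially with x (not our height dataset)
example = pd.DataFrame({'x': np.arange(1, 51)})
example['y'] = np.exp(0.1 * example['x'])
# a log transform of y (one rung of Tukey's transformation ladder)
# straightens the relationship, so fitting log_y ~ x with SLR becomes reasonable
example['log_y'] = np.log(example['y'])
print(example.head())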
Creating an SLR model
SLR using R
# check correlation between children_height and father_height
cor(df$father_height, df$children_height)
# create an SLR model between children_height and father_height
slr <- lm(children_height ~ father_height, data = df)
summary(slr)
SLR using Python
# calculate correlation between children_height and father_height
print('Correlation: {0}.'.format(np.corrcoef(df['children_height'], df['father_height'])[0,1]))
# import sm from statsmodels.api
import statsmodels.api as sm
# extract the column x (independent) and y (dependent)
x = df['father_height'].values.reshape(-1, 1)
y = df['children_height'].values.reshape(-1, 1)
x2 = sm.add_constant(x)
slr = sm.OLS(y, x2)
slr2 = slr.fit()
print(slr2.summary())
SLR model conclusion
As we can see, the output of the SLR model in R and Python is the same; only the display format differs. Let's see what information we can extract from the summary of the linear model above.
Notes: [r] provided in R output only, [p] provided in Python output only, [a] provided in both R and Python.
- The correlation between children_height and father_height is strong and positive, equal to 0.8232801. [a]
- The residuals range from -11.5655 to 7.5174 with a median of -0.0296, which means the prediction errors of the model are centered around 0. [r]
- The intercept is 26.7819 with a p-value of 0.000, which means the intercept is significant enough to be used in the model (p-value < 0.05). [a]
- The father_height coefficient is 0.8453 with a p-value of 0.000, which means the coefficient is significant enough to be used in the model (p-value < 0.05). [a]
- The R-squared and adjusted R-squared are 0.678 and 0.677 respectively, which means the father_height variable explains around 67.8% of the variation in children_height (maximum 100%); that is a very good number considering we only use a single variable! [a]
- The p-value of the model is less than 2.2e-16 (practically 0), which means the model is significant enough to be used to predict children_height using father_height (p-value < 0.05). [r]
- The condition number is displayed, which gives an indication of multicollinearity. This information will be very useful when we are working on MLR. [p]
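If we prefer to pull these numbers out of the fitted model programmatically instead of reading the printed summary, here is a minimal Python sketch using the slr2 results object fitted above (these are standard statsmodels results attributes):
# extract the key quantities from the fitted statsmodels results object
print('Intercept and slope:', slr2.params)
print('p-values:', slr2.pvalues)
print('R-squared:', slr2.rsquared)
print('Adjusted R-squared:', slr2.rsquared_adj)
print('Model F-test p-value:', slr2.f_pvalue)
In R, the same information is available from the model object, for example through coef(slr) and summary(slr).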
SLR model prediction
Based on what we did previously, we have got an SLR model like this:
children_height = 26.7819 + 0.8453 x father_height
For example, suppose we get a new sample of father_height, say a father whose height is 170 cm. Using the model above, we can predict his children's height like this:
children_height = 26.7819 + 0.8453 x 170 cm = 170.4829 cm
Nice, the predicted children are taller than the father! Now, for the second example, suppose the new father's height is 180 cm. Let's try again!
children_height = 26.7819 + 0.8453 x 180 cm = 178.9359 cm
Hmm, that's strange: now the father is taller than his predicted children. Let's try a third example, with a father whose height is 175 cm.
children_height = 26.7819 + 0.8453 x 175 cm = 174.7094 cm
Okay, somewhere between a father_height of 170 cm and 175 cm, the predicted children's height stops being taller than the father's. Setting 26.7819 + 0.8453 x father_height equal to father_height gives a break-even point of roughly 173 cm. That gives us the following information:
The effect of the father's height gets relatively smaller as the father gets taller, and from a father_height of roughly 173 cm upwards, the predicted children's height is no longer taller than the father's.
Also, for any father_height, the model says the father contributes 84.53% of his own height to the predicted children_height, plus a constant 26.7819 cm. That's how you extract insights from a model, instead of only generating a model that can predict. Cool, right? 😎
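Instead of plugging the numbers in by hand, we can also let the fitted model compute these predictions. A minimal Python sketch using the slr2 results object from above, reusing the same hypothetical father heights of 170, 175, and 180 cm:
# predict children_height for new father_height values with the fitted model
new_fathers = np.array([170.0, 175.0, 180.0]).reshape(-1, 1)
new_x = sm.add_constant(new_fathers, has_constant='add')  # add the intercept column
predictions = slr2.predict(new_x)
for f, p in zip(new_fathers.ravel(), predictions):
    print(f'father_height = {f} cm -> predicted children_height = {p:.4f} cm')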
Visualize the prediction using R
# load jtools package
library(jtools)
# visualize the prediction with the interval
effect_plot(slr, pred = father_height, interval = TRUE, plot.points = TRUE)
Visualize the prediction using Python
# visualize the SLR prediction
ypred = slr2.predict(x2)
fig, ax = plt.subplots()
ax.plot(x, y, "o", label = "father_height x children_height")
ax.plot(x, ypred, "r", label = "SLR prediction")
ax.legend(loc = "best")
plt.show()
How to check the SLR assumptions?
There are several graphical and numerical methods to check whether the assumptions listed earlier are fulfilled.
Assumptions diagnostic checking using R
# load ggfortify package
library(ggfortify)
# load the diagnostic checking graphs
autoplot(slr)
- [Residuals vs Fitted, top-left] As we can see, the blue line is not stable; ideally it stays flat around 0. In this case, the assumption of the linearity of the residuals is not fulfilled. The plot also flags data points 26, 113, and 114 as possible outliers.
- [Normal Q-Q, top-right] The standardized residuals mostly follow the 45-degree line. In this case, the assumption of the normality of the residuals is fulfilled. Again, data points 26, 113, and 114 are flagged as possible outliers.
- [Scale-Location, bottom-left] The blue line shows that the residuals are not equally spread; most of the points lie above the line, which means the assumption of the homoscedasticity of the residuals is not fulfilled. Data points 26, 113, and 114 are flagged as possible outliers here as well.
- [Residuals vs Leverage, bottom-right] The goal of this check is to identify data points with high leverage. As a rule of thumb, standardized residuals larger than 3 in absolute value are possible outliers (James et al. 2014). In our plot, no data point exceeds Cook's distance, which means the assumption that no data point has exceptional influence on the model is fulfilled. Points 24, 26, and 28 have the highest influence, but they do not exceed Cook's distance.
Assumptions diagnostic checking using Python
# import seaborn for the residual plot
import seaborn as sns
# residuals vs fitted values
slr_fitted_y = slr2.fittedvalues
plot = sns.residplot(x = slr_fitted_y, y = 'children_height', data = df,
lowess = True, scatter_kws = {'alpha': 0.5},
line_kws = {'color': 'red', 'lw': 1, 'alpha': 0.8})
plot.set_title('Residuals vs Fitted')
plot.set_xlabel('Fitted values')
plot.set_ylabel('Residuals')
The conclusion is the same as the one shown in R.
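The Normal Q-Q panel from R's autoplot has no direct counterpart in the snippet above. If we want a similar visual normality check in Python, a minimal sketch using statsmodels' qqplot on the residuals could look like this:
# Q-Q plot of the residuals against a fitted normal distribution
fig_qq = sm.qqplot(slr2.resid, line='45', fit=True)
fig_qq.suptitle('Normal Q-Q')
plt.show()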
# residuals vs leverage plot
fig2 = sm.graphics.influence_plot(slr2, criterion="cooks")
fig2.tight_layout(pad=1.0)
The display is quite different from the one shown in R. To identify data points with high influence in the influence plot, look for:
- A high Cook's distance, shown by BIG and BLUE points
- Higher leverage on the X-axis
- Higher or lower values on the Y-axis (studentized residuals)
The conclusion is the same as with R: not a single point has a Cook's distance high enough to be identified as having exceptional influence on the model.
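If we want the numbers behind that plot rather than eyeballing it, here is a minimal sketch using the influence measures statsmodels exposes on the fitted results; the threshold of 3 for the studentized residuals follows the rule of thumb quoted in the R section above:
# pull the influence measures from the fitted model
influence = slr2.get_influence()
cooks_d, _ = influence.cooks_distance                  # Cook's distance per observation
student_resid = influence.resid_studentized_internal   # internally studentized residuals
# flag possible outliers and influential points
print("Max Cook's distance:", cooks_d.max())
print('Points with |studentized residual| > 3:', np.where(np.abs(student_resid) > 3)[0])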
# partial-regression plot to check homoscedasticity and non-linearity
fig3 = sm.graphics.plot_partregress_grid(slr2)
fig3.tight_layout(pad=1.0)
- The left panel is between the constant and children_height. The values are not spread equally above and below the line, which means the homoscedasticity assumption is not fulfilled (same conclusion as in R).
- The right panel is between father_height and children_height. The values are not spread equally above and below the line, which means the homoscedasticity assumption is not fulfilled (same conclusion as in R; a numerical check is sketched after this list).
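If we prefer a numerical check over a visual one, a common complementary option (not shown in the original R output) is the Breusch-Pagan test for heteroscedasticity; a minimal sketch on our fitted model, where the null hypothesis is that the residual variance is constant:
# Breusch-Pagan test: a small p-value suggests heteroscedasticity
from statsmodels.stats.diagnostic import het_breuschpagan
bp_stat, bp_pvalue, f_stat, f_pvalue = het_breuschpagan(slr2.resid, slr2.model.exog)
print('Breusch-Pagan LM p-value:', bp_pvalue)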
# normality test of the residuals
name = ['Jarque-Bera', 'ChiSquared twotail', 'Skew', 'Kurtosis']
test = sm.stats.jarque_bera(slr2.resid)
print(name, test)
# normality test of the residuals using the Omnibus test
name = ['ChiSquared', 'Twotail probability']
test = sm.stats.omni_normtest(slr2.resid)
print(name, test)
- The p-value of the Jarque-Bera test is 0.3442 and the p-value of the Omnibus test is 0.2623.
- Both p-values are greater than 0.05, which means the assumption of the normality of the residuals is fulfilled, because the null hypothesis that "the residuals are normally distributed" is not rejected.
What can we do if one or more of the assumptions are not fulfilled?
- If there are data points identified as outliers or as having an exceptional influence on the model, we need to take care of those data points first. The easiest solution is to remove them from our dataset.
- If it is not possible to remove the data points, for example because the sample size is already small, we can replace their values with the average of the dataset computed without the problematic data points.
- If there are no data points identified as outliers, we have to transform the dataset, using, for example, a log transformation or a square-root transformation; see Tukey's transformation ladder.
- After applying these fixes, we can rebuild the SLR model from the start and hope all that effort makes every diagnostic check pass (a rough sketch of removing points and refitting is shown below). 😅
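As a rough sketch of the first option plus a refit in Python: the row indices below are hypothetical placeholders; the 26, 113, and 114 labels reported by R are based on its own row numbering, so match them to your pandas index before dropping anything.
# drop hypothetical problematic rows and refit the SLR model
rows_to_drop = [26, 113, 114]  # placeholder indices, adjust after your own diagnostics
df_clean = df.drop(index=rows_to_drop)
x_clean = sm.add_constant(df_clean['father_height'])
y_clean = df_clean['children_height']
slr_clean = sm.OLS(y_clean, x_clean).fit()
print(slr_clean.summary())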
Can we use the SLR model for the values outside the data range?
The short answer is “No”. 😶
The long answer is “Nooooooooooooooooooooooooooooooooooooo!”. 😆
The long answer is no, because we cannot know for certain that the linear pattern and the assumptions still hold for values outside the existing data range. 👍
For your learning after reading this post
- Some assumptions were not fulfilled, and I have written down the steps we need to take to solve those problems. Please try them if you have some spare time. Learning by doing is almost always (if not always) the best way to understand.
- The language is up to you; choose whichever you feel more comfortable with, R or Python. But in my personal opinion, in this case, I prefer R, because it shows more of the things that need to be checked and done, and makes them easier to generate.
- Maybe some of you are wondering whether this is really how to create an SLR model. Based on my knowledge (I come from a Statistics background), yes, it is. But if you want to implement this SLR in a machine-learning solution, some of the things we talked about in this post are not relevant.
- We did not do exploratory data analysis (EDA) in this post because the post would become too long. If you are interested in how to do exploratory data analysis more easily and quickly, please read my previous post about that here.
- We also did not try to split the dataset into training and testing sets, which is a common step in creating a machine-learning model. Please try splitting the dataset yourself before creating the SLR model (a rough sketch is given after this list).
- For MLR, I will try to cover it in another post, because it will be a whole different discussion compared to this one.
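For the train/test point above, here is a minimal Python sketch of one way to split the data before fitting; the 80/20 ratio and the random_state value are arbitrary choices, not from the original post.
# split the data into training and testing sets (80/20, arbitrary choice)
train = df.sample(frac=0.8, random_state=42)
test = df.drop(index=train.index)
# fit the SLR model on the training set only
x_train = sm.add_constant(train['father_height'])
slr_train = sm.OLS(train['children_height'], x_train).fit()
# evaluate on the held-out test set
x_test = sm.add_constant(test['father_height'], has_constant='add')
pred_test = slr_train.predict(x_test)
print('Test RMSE:', np.sqrt(np.mean((test['children_height'] - pred_test) ** 2)))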
Conclusion
Which one do I have to learn, R or Python? In my humble opinion, why choose if we can learn both? Both languages have pros and cons, so I will just keep learning both, get all the pros, and the cons will not feel like cons anymore. 😜
Is SLR used in real-world cases? Maybe not directly, but we still need to learn it, because usually the number of independent variables is more than one (even hundreds), which is why we also need to learn about MLR. And to understand MLR, we have to understand SLR first. 👌
Thank you for reading!
Yay, I finally did it again and made a new post about data analytics. I'm so glad I'm still trying to be consistent in doing this. I hope it's useful for the many people who want to learn about simple linear regression, or SLR. If you have reached this point, I'm grateful that you wanted to read and learn about how to perform simple linear regression using R and Python, including diagnostic checking and extracting insights, through my post. I hope you enjoyed this post and learned how to use it in your journey as a data analytics/science professional.
I am learning to write, and mistakes are unavoidable even when I try my best. If you find any problems/mistakes, please let me know!