Step by step procedure to perform Simple Linear Regression with Statistical Analysis in R

Pranjal Pandey · Published in Analytics Vidhya · Jul 11, 2020 · 7 min read
1. Introduction

Hello, this is my first article on Medium. In this article, I am going to show you how Simple Linear Regression, a very basic Linear Regression algorithm frequently used in Machine Learning, is applied to solve real-life problems. For this purpose, I'll use a data set in which I have to predict the Salary of an individual on the basis of one explanatory variable, Years of Experience. I am going to use a Kaggle R-Notebook to do this task. You may use RStudio on your PC to work offline.

2. Loading Data set

Link to download the data set for offline use — https://www.kaggle.com/rohankayan/years-of-experience-and-salary-dataset/download

Run the following command to create an R object that will contain the whole data set -

# Loading Data
data = read.csv("../input/years-of-experience-and-salary-dataset/Salary_Data.csv" , header = T)

Now the data is successfully loaded in the current session and stored in the object “data”.
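
If you want to confirm that the file was read correctly, a quick optional structure check can be run (this is just a sanity check, not part of the original workflow) -

# Quick check on the structure and size of the loaded data
str(data)
dim(data)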

3. Loading Required Libraries

library(tidyverse)
library(caret)
library(MASS)
library(lmtest)
library(olsrr)
library(broom)
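
If any of these packages is not already installed on your machine, install it once before loading (MASS ships with base R, so installing it is usually unnecessary) -

# One-time installation of the required packages (run only if missing)
install.packages(c("tidyverse", "caret", "MASS", "lmtest", "olsrr", "broom"))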

4. Exploring Data set

# Random Inspection of Dataset
sample_n(data, 5)   # See the following output
Output-1

In the given data set, the Target “Salary” and the Predictor “YearsExperience” are both quantitative variables. Hence, we can apply Regression to the given data set. Now,

Question : Which type of Regression (Linear or Non-Linear) should we apply?

My suggestion : First of all, try Linear Regression, since it is simpler than Non-Linear. If it does not fit properly, then go for Non-Linear Regression.

Next Question : There are many types of Linear Regression. Which will be better for this data set?

Answer : Since I have only one Predictor, the possibilities are to fit either Simple Linear Regression or Polynomial Regression in one variable (generally an Orthogonal Polynomial, to avoid multi-collinearity).

Question : Which one of these two Linear Regressions will perform better?

Answer : Plot a scatter diagram of the Predictor (on the X-axis) against the Target (on the Y-axis). This will give a good first idea.

Let’s do it as follows -

# Scatter Plot
plot(data$YearsExperience,data$Salary)
Output-2

The above plot shows a linear relationship between Predictor and Target, so Simple Linear Regression should perform well.

Further, if you think that a Polynomial model may perform better than Simple Linear Regression, fit the Polynomial model anyway: if the polynomial model is not the best for the given data set, the higher-order terms will automatically turn out to be insignificant, which in turn shows that the best model for the given data set is Simple Linear Regression. (I will show you the same at a later stage.)

SLR (Simple Linear Regression) is based on certain assumptions, which are as follows -

  1. Linearity : There must be a linear relationship between Predictor and Target.
  2. Normality : Errors must be normally distributed.
  3. Homoscedasticity : Errors must have constant variance.
  4. No auto-correlation : Errors must be uncorrelated (i.e., errors must be independent).
  5. Errors must have mean zero.

The above assumptions must be satisfied, because if any of them is violated, we cannot be sure about the validity and reliability of the obtained results. In such a situation, either try to satisfy the assumptions or adopt some other regression method to get reliable results.

Some of these assumptions are checked before fitting the model and some after fitting, as the case may be. This will become clearer in the following steps.
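
As a quick preview of these checks, base R can draw the standard diagnostic plots for a fitted model in a single call. The sketch below assumes the lm object named “model” that we will create in Step 6, so run it only after fitting -

# Base-R diagnostic plots for a fitted lm object:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))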

5. Checking the Linearity assumption by -

  1. Visualization : Scatter plot between Predictor and Target helps us. (Already plotted above; linearity was found.)
  2. Statistical Test : Pearson correlation test helps. Use the “cor.test()” function.

Let’s try it.

# Correlation Test
cor.test(data$YearsExperience , data$Salary)
Output-3

The p-value is far below 0.05 and the sample correlation is 0.9782416. Hence, a highly significant correlation (a strong linear relationship) exists in the population.
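
If you prefer to work with these numbers programmatically instead of reading them off the printed output, the test object can be stored first (the object name “ct” is only illustrative) -

# Storing the test result to access its components directly
ct <- cor.test(data$YearsExperience , data$Salary)
ct$estimate   # sample correlation coefficient
ct$p.value    # p-value of the test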

The rest of the assumptions will be checked after fitting the model.

6. Fitting Model

Before fitting, split the given data set into train and test sets, then fit the model and obtain its summary as follows -

# Randomly split the data into training and test set
set.seed(123)
training.samples <- data$Salary %>%
  createDataPartition(p = 0.6, list = FALSE)
train.data <- data[training.samples, ]
test.data <- data[-training.samples, ]
# Fit model
model <- lm(Salary ~ YearsExperience , data = train.data)
summary(model)
Output-4

The above output shows that -

  1. Both coefficients are statistically significant, since their p-values < 0.05.
  2. The model as a whole is also statistically significant, since its p-value (2.098e-13) < 0.05.
  3. Adjusted R-squared: 0.9504. This shows that 95.04% of the variation available in the given data set (in Salary) is explained by this Simple Linear Regression model. The remaining ~5% of the variation in Salary is due to some other predictors (such as post or age of the individuals) or due to random causes.
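
These quantities can also be pulled directly from the fitted model, which is handy if you want to reuse them later. A small sketch (tidy() comes from broom, which is already loaded) -

# Coefficient table as a tidy data frame
tidy(model)
# 95% confidence intervals for the intercept and slope
confint(model)
# Adjusted R-squared extracted from the model summary
summary(model)$adj.r.squared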

7. Checking the Remaining Assumptions

Normality assumption by -

  1. Visualization : Q-Q Plot of residuals helps us.
  2. Statistical Test : Shapiro-Wilk test on the residuals helps us. Use the “shapiro.test()” function.

Let’s try it.

# Visualization
qqnorm(model$residuals)
qqline(model$residuals , col = "steelblue", lwd = 2)
# Statistical Test
shapiro.test(model$residuals)
Output-5

Since the p-value (0.1123) > 0.05, we fail to reject the null hypothesis, i.e., the errors are normally distributed.

Homoscedasticity assumption by -

  1. Visualization : Residuals vs fitted values Plot helps us.
  2. Statistical Test : Breusch-Pagan test helps us. Use the bptest() function (from the lmtest package).

Let’s do it.

# Visualization
plot(model$fitted.values , model$residuals)
# Statistical Test
bptest(model)
Output-6

Since the p-value (0.0633) > 0.05, we fail to reject the null hypothesis, i.e., the errors have constant variance.

Auto-correlation assumption by -

  1. Visualization : Plot of residuals vs the order in which the observations were taken.
  2. Statistical Test : Durbin-Watson test helps. Use the dwtest() function (from the lmtest package).

Let’s try it.

# Visualization
plot(seq_along(model$residuals) , model$residuals)   # residuals in order of observation
# Statistical Test
dwtest(model, alternative = c("two.sided"))
Output-7

Since the p-value (0.1349) > 0.05, we fail to reject the null hypothesis, i.e., the errors are uncorrelated.

Assumption of Mean of errors -

# Calculating mean of errors
mean(model$residuals)
Output-8

The above output shows that mean of errors is approximately zero.

Now all the assumptions of Simple Linear Regression have been satisfied.

One more task is to detect whether there is any influential point in the given data set, because if an influential point is present, the results obtained above may become unreliable. Hence, influential points need to be detected.

Influential point detection by -

  1. Visualization : Cook’s distance plot helps us.
  2. Statistical Measure : Cook’s distance helps.

Run the following command.

# Visualization (Cook's distance plot)
plot(model , 4)
# Statistical Measure
model_dm <- augment(model)
# Checking Highest Cook's distance
max(model_dm$.cooksd)
Output-9

Maximum cook’s distance is 0.186480962574431

Since no observation in the training data has a Cook's distance greater than 0.5, there is no influential point in the given data set.
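
If you want an explicit check instead of reading the maximum off the output, you can flag any training observation whose Cook's distance crosses the 0.5 cutoff (there should be none here) -

# Rows of the training data whose Cook's distance exceeds 0.5
which(model_dm$.cooksd > 0.5)
# Alternatively, with base R on the fitted model directly
which(cooks.distance(model) > 0.5)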

8. Making Predictions

Now the Simple Linear Regression model is ready. Apply it to the test data set to check its performance on unseen data, i.e., determine how well this model will perform on data it has not seen.

This will again be done by two methods -

  1. Visualization : A predicted salary vs actual salary plot for the test data set helps.
  2. Statistical Measures : R² , RMSE and MAE help.
# Making prediction
prediction <- model %>% predict(test.data)
# Visualization
plot(test.data$Salary , prediction)
abline(lm(prediction ~ Salary, data = test.data), col = "blue")
# Statistical Measure
data.frame( R2 = R2(prediction, test.data$Salary),
RMSE = RMSE(prediction, test.data$Salary),
MAE = MAE(prediction, test.data$Salary))
Output-10

We obtain R² = 0.9701604, which indicates a very good fit.
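
To see what these caret helpers compute, the same three measures can be written out by hand from their definitions (the names “actual” and “errors” below are only for illustration). Note that caret's R2() uses the squared correlation between predictions and observations by default, so it can differ slightly from the traditional 1 - SS_res/SS_tot value -

# Computing the three metrics manually from their definitions
actual <- test.data$Salary
errors <- actual - prediction
sqrt(mean(errors^2))                                  # RMSE
mean(abs(errors))                                     # MAE
cor(actual, prediction)^2                             # R-squared as squared correlation
1 - sum(errors^2) / sum((actual - mean(actual))^2)    # traditional R-squared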

Since this result is based on only one test set, we cannot be sure that the model will perform equally well on all unseen data. To be more confident in this respect, we will use K-fold cross-validation to test the performance of the model on different test sets.

This will be done as follows -

# Define training control
set.seed(123)
train.control <- trainControl(method = "repeatedcv",
number = 4, repeats = 3)
# Train the model
model_cv <- train(Salary ~ YearsExperience , data = data, method="lm",
trControl = train.control)
# Summarize the results
print(model_cv)
Output-11

Great Result !

On average, this Simple Linear Regression model captures 97.29% of the variability available in the target (Salary). That is, 97.29% of the variability in Salary is due to the predictor “YearsExperience”. The rest of the variability is due to some other causes (such as post or age of the individuals, etc.) or due to random causes.
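
The per-resample performance behind this average is also stored in the trained object, in case you want to inspect how stable the model is across folds -

# Average performance across all resamples
model_cv$results
# Performance of the final model on each individual resample
model_cv$resample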

Now, fit a polynomial model, as I said earlier, as follows -

# Fitting Orthogonal polynomial model
model_poly <- lm(Salary ~ poly(YearsExperience,2) , data = train.data )
# Summarizing model
summary(model_poly)
Output-12

The above output shows that the second-order term of the predictor is not statistically significant, since its p-value (0.531) > 0.05. Hence, don't fit the polynomial model; go with Simple Linear Regression only.
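
The same conclusion can be reached with a formal F-test comparing the two nested models (both were fitted on train.data above) -

# F-test comparing the simple linear model with the degree-2 polynomial model
anova(model, model_poly)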

Want to see my Kaggle R-Notebook : Click here

Thanks for reading my first article.

If you find any mistake or have any suggestions, please let me know; you are always welcome.
