Simple Linear Regression from Scratch using R Software

Teena Mary
Budding Data Scientist
10 min read · Feb 4, 2020

In this blog, we will discuss what linear regression is from the basics, and we will also work through a data set to understand how statistical analysis can be done in R on data that follow a linear regression model. So, let us begin.

Index

  1. What is Linear Regression?
  2. Understanding Simple Linear Regression using R Software
  • Basic Descriptive Analysis
  • Model Confirmation
  • Outliers Checking
  • Checking if the response variable follows normal distribution
  • Correlation Analysis
  • Model Building and Analysis
  • Performing Linear Diagnostics on the Model
  • Model Validation

What is Linear Regression?

Linear Regression Graph

Regression analysis is one of the most widely used statistical techniques for understanding and modelling the relationship between two or more variables. There are two kinds of variables in regression analysis: the predictor (independent) variable and the response (dependent) variable.

Linear regression is one of the most commonly used predictive modelling techniques. When the response variable increases or decreases linearly as the predictor variable changes, the data follow a linear regression model. Such a model is used to predict the value of a response variable from one or more predictor variables. There are two kinds of linear regression: simple linear regression (one predictor) and multiple linear regression (several predictors).

The general equation for a simple linear regression model is given by:

y = ax + b + ε

where y is the response variable, x is the predictor variable, a and b are the regression coefficients (slope and intercept) and ε is the error term that captures measurement error and unexplained variation. The regression coefficients a and b can be estimated by the method of least squares or by maximum likelihood. By the Gauss-Markov theorem, the least squares estimates are the Best Linear Unbiased Estimators (BLUE) of the regression coefficients.
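As a minimal sketch of what least squares actually computes, the slope and intercept of a simple linear regression have closed-form estimates; the toy data below are made up purely for illustration:

x = c(1, 2, 3, 4, 5) # hypothetical predictor values
y = c(2.1, 3.9, 6.2, 7.8, 10.1) # hypothetical responses
a_hat = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) # slope estimate
b_hat = mean(y) - a_hat * mean(x) # intercept estimate
c(slope = a_hat, intercept = b_hat) # matches coef(lm(y ~ x))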

There are a few assumptions to be satisfied when a linear model is formulated:

  • The response variable (y) and the predictor variable (x) should have a linear relationship.
  • The errors must be uncorrelated.
  • The expectation of the errors must be zero, i.e., E(ε) = 0.
  • The variance of the errors must be constant; this property is called homoscedasticity.

Linear regression has two main aims: (a) to examine whether the predictor variables predict the response variable well, and (b) to identify which predictor variables are most significant in predicting the outcome.

Creating a simple linear regression model involves four main steps:

  1. With the available data, an initial guess about the regression model is made using a scatter plot, to identify whether the relationship is linear or non-linear.
  2. The parameters are estimated using a method such as least squares or maximum likelihood estimation.
  3. The model is fitted using the estimated parameters, and we verify whether the initial model (from step 1) was specified in the correct form.
  4. The final step is model validation, where we check whether the model predicts well, i.e., whether the predicted and the actual values are similar.

Understanding Simple Linear Regression using R Software

R is a free programming language and a powerful statistical tool that can be used for developing statistical software, data analysis and visualisation. For our analysis, we will use R to explore the data and fit a model to predict the dependent variable.

Data: Spend-Sales data set

The data that will be used is spend and sales data, where the spend is the predictor (independent) variable and sales is the response (dependent) variable.

i) Basic Descriptive Analysis

First, we import the data set into R and find the summary of the data set.

a = read.csv("C:/Users/admin/Desktop/data.csv", header = TRUE)
summary(a)
Summary Statistics

Using summary, we get the basic descriptive statistics of the variables, such as the mean, median, minimum and maximum. Here, we can see that the minimum value of the spend variable is 1000 and the maximum is 15000, whereas for sales the minimum and maximum are 9914 and 158484 respectively. The average values of spend and sales are 6542 and 70870 respectively.

ii) Model Confirmation

Now, in order to study the relationship between the dependent and independent variables, we plot the two variables against each other in a scatter plot.

plot(a$Spend, a$Sales, main = "Relationship between Sales and Spend", xlab = "Spend", ylab = "Sales", col = "red", pch = 19)
Spend and Sales variables follow a linear trend

From the graph, we can conclude that the spend and sales variables follow a linearly increasing trend, so a simple linear regression model should fit the given data set.
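As an optional extra check, base R's scatter.smooth() overlays a lowess smoother on the same scatter plot, which makes any curvature easier to spot:

scatter.smooth(a$Spend, a$Sales, main = "Sales ~ Spend", xlab = "Spend", ylab = "Sales") # lowess line over the points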

iii) Outliers Checking

Now we will check whether the data contains any outliers by drawing box plots using the following code.

boxplot(a$Sales, main = "Boxplot of Sales", xlab = "Sales", col = "orange", border = "brown")
boxplot(a$Spend, main = "Boxplot of Spend", xlab = "Spend", col = "orange", border = "brown")
Box plot doesn't show any significant outliers

In a box plot, points lying beyond the whiskers (which extend 1.5 times the inter-quartile range beyond the quartiles) are flagged as outliers. We can see that neither Spend nor Sales has any such outliers that could distort the analysis.
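We can also confirm this programmatically: boxplot.stats() returns the points beyond the whiskers, so an empty result means no outliers were flagged:

boxplot.stats(a$Sales)$out # outliers in Sales (empty if none)
boxplot.stats(a$Spend)$out # outliers in Spend (empty if none)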

iv) Checking if the response variable follows normal distribution

It is essential to check whether the response variable follows a normal distribution. We do that by plotting the density of the Sales variable.

install.packages("e1071") # provides the skewness() function
plot(density(a$Sales), main = "Density Plot: Sales", ylab = "Frequency", sub = paste("Skewness:", round(e1071::skewness(a$Sales), 2)))
The dependent variable is approximately normally distributed

We can see that the Sales variable is approximately normal but positively skewed, with a skewness of 0.56.
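If we want a formal test to go with the density plot, the Shapiro-Wilk test in base R is one option; a p-value above 0.05 means there is no strong evidence against normality:

shapiro.test(a$Sales) # Shapiro-Wilk normality test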

v) Correlation Analysis

Now we check the correlation between the two variables using cor(a$Spend, a$Sales), which gives 0.9988322, an almost perfect positive correlation between the two variables.
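cor() gives only the point estimate; cor.test() adds a significance test and a confidence interval for the same correlation:

cor.test(a$Spend, a$Sales) # Pearson correlation with p-value and 95% CI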

vi) Model Building and Analysis

So, after ascertaining that the data follow a simple linear regression model, we can now formulate the mathematical equation that fits the model by using:

model = lm(a$Spend ~ a$Sales)
print(model)

So, we get the following output:

Data Model
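If we want the coefficients programmatically rather than reading them off the printed output, base R provides accessors (a small sketch, reusing the model object above):

coef(model) # intercept and slope of the fitted line
confint(model) # 95% confidence intervals for the coefficients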

We can see that the intercept of the fitted model is -114.67027 and the slope is 0.09392. One important caveat: in lm(), the variable to the left of the ~ is the response, so model = lm(a$Spend ~ a$Sales) actually regresses Spend on Sales, and the fitted equation is 'Spend = 0.09392 * (Sales) - 114.67027' rather than 'Sales = 0.09392 * (Spend) - 114.67027'. To model Sales as the response, as intended, the call would be lm(Sales ~ Spend, data = a). Next, we compare the two variables with a paired t-test and look at its p-value.

t.test(a$Spend, a$Sales, paired = TRUE, alternative = "two.sided")
Checking for significance using the p-value

Here, the p-value is 0.000221, which is less than the 0.05 level of significance, so the null hypothesis is rejected and we conclude that the means of the two variables differ significantly. Note, though, that a paired t-test compares the means of Spend and Sales; the significance of the regression itself is better judged from the coefficient t-test and F-test that summary(model) reports. We find the summary of the model using summary(model) and get the following output.

Summary statistics for the model

From the above output, we can infer the following:

  1. Residuals are essentially the difference between the actual observed response values and the response values that the model predicted. The Residuals section of the model output breaks them down into five summary points. When assessing how well the model fits the data, we should look for a distribution that is symmetrical around zero across these points. For our model, the distribution of the residuals is roughly symmetrical between -293.22 and 312.02, so the model predicts the response variable well.
  2. The COEFFICIENTS section shows the intercept and slope of the fitted line. The intercept estimate, -114.67027, is the predicted value of the response when the predictor is 0. The slope estimate is 9.392e-02, with a standard error of 1.437e-03, which measures the uncertainty in that estimate. The slope's t-value is 65.378, which is very far from 0 and shows that the relationship is statistically significant.
  3. The Residual Standard Error is a measure of the quality of a linear regression fit. Theoretically, every linear model is assumed to contain an error term ε. Because of this error term, we cannot perfectly predict the response variable from the predictor. The residual standard error is the average amount by which the response deviates from the true regression line; here it is 217.5 on 10 degrees of freedom.
  4. The R-squared (R²) statistic provides a measure of how well the model fits the actual data: it is the proportion of variation in the response explained by its linear relationship with the predictor. The value obtained is 0.9977, which means the linear model explains 99.77% of that variation (a hand computation of R² follows this list as a cross-check).
  5. The further the F-statistic is from 1, the stronger the evidence of a relationship. Here, the F value is 4274 on 1 and 10 degrees of freedom, which is far larger than 1; hence there is a strong relationship between the sales and spend variables.
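As mentioned in point 4, here is a small cross-check of R² computed by hand from the residuals (this reuses the model object above, whose response, as coded, is a$Spend):

rss = sum(residuals(model)^2) # residual sum of squares
tss = sum((a$Spend - mean(a$Spend))^2) # total sum of squares of the response
1 - rss / tss # should reproduce the reported R-squared of ~0.9977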

vii) Performing Linear Diagnostics on the Model

Linear diagnostics are used to evaluate the model assumptions and to investigate whether there are observations with a large, undue influence on the analysis. This can be done using plot(model), which produces a set of four graphs.
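By default, plot(model) draws the four graphs one at a time, prompting between them; a small sketch to view all four in a single panel:

par(mfrow = c(2, 2)) # arrange plots in a 2 x 2 grid
plot(model) # all four diagnostic plots at once
par(mfrow = c(1, 1)) # reset the plotting layout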

a) Residual vs Fitted Graph: If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model would be more appropriate. Here, the residuals are spread roughly equally around a horizontal line without distinct patterns, so a linear regression model fits the data set well.

Residual vs Fitted Graph

b) Normal Q-Q Graph: This plot checks whether the residuals are normally distributed. Here, we can infer that the residuals are normally distributed, since they lie well along the straight dashed line.

Normal Q-Q Graph

c) Scale-Location Graph: This plot shows whether the residuals are spread equally along the range of the predictor. Here, the residuals are equally distributed and the red line is roughly horizontal; hence the variance is constant, i.e., the data are homoscedastic.

Scale-Location Graph

d) Residual vs Leverage Graph: This graph is used to find influential points in the data. Any point lying beyond the Cook's distance boundary (the red dashed line) is an influential point. Here, one data point lies beyond that line, so there is one influential point in the data, and the results may be affected by it.

Residual vs Leverage Graph
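To identify the influential observation by row number rather than by eye, Cook's distances can be extracted directly; the 4/n cutoff below is a common rule of thumb, not a fixed rule:

cooksd = cooks.distance(model) # Cook's distance for each observation
which(cooksd > 4 / nrow(a)) # rows exceeding the 4/n rule-of-thumb cutoff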

viii) Model Validation

For model validation, we create training and testing data sets and check the accuracy of the predictions. This can be done using the following code:

# Create the training and test data (60:40)
set.seed(100) # setting seed to reproduce results of random sampling
rows = sample(nrow(a)) # randomly order the row indices
data = a[rows, ] # shuffle the data
split = round(nrow(data) * 0.60) # identify the row to split on
train = data[1:split, ] # create the training set
test = data[(split + 1):nrow(data), ] # create the test set

Now we fit the model on the training data set and try to predict the values of the response variable for the test set.

linmod = lm(Spend ~ Sales, data = train) # build the model
Pred = predict(linmod, test)
linmod
Pred

We get the following output:

Predicted Model

The printed coefficients give the fitted line 'Spend = 0.09207 * (Sales) + 39.01667'; note that, as before, the formula places Spend on the left of the ~, so Spend is the response and the predicted values shown are on the Spend scale.

We check for the accuracy of the model using:

actuals_preds = data.frame(Actual_Value = test$Sales, Predicted_Value = Pred)
actuals_preds
Actual vs Predicted Values

These are the actual values and the predicted values. We now find the min-max accuracy as well as the mean absolute percentage error (MAPE) of the predictions.

This can be done using the following code:

min_max_accuracy = mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
min_max_accuracy

#0.09266136

mape = mean(abs(actuals_preds$Predicted_Value - actuals_preds$Actual_Value) / actuals_preds$Actual_Value)
mape

#0.9073386

From the above values, the min-max accuracy is 0.09266136, i.e., about 9.27%, and the mean absolute percentage error (MAPE) is 0.9073386, i.e., about 90.7% error. Taken at face value, this says the model is very bad at predicting the response. Given the near-perfect R² found earlier, however, the likely culprit is the reversed formula: predict(linmod, test) returns values on the Spend scale, while Actual_Value holds Sales, so the two columns differ by roughly a factor of ten. Before concluding that the model fits poorly, it is worth refitting with Sales as the response, as sketched below.
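A hedged cross-check, reusing the train and test sets from above:

linmod2 = lm(Sales ~ Spend, data = train) # Sales as the response this time
Pred2 = predict(linmod2, test) # predictions are now on the Sales scale
ap2 = data.frame(Actual_Value = test$Sales, Predicted_Value = Pred2)
mean(apply(ap2, 1, min) / apply(ap2, 1, max)) # min-max accuracy
mean(abs(ap2$Predicted_Value - ap2$Actual_Value) / ap2$Actual_Value) # MAPE

With matching scales, both measures should improve dramatically if the underlying linear relationship is as strong as the R² suggests.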

How to improve?

To improve the accuracy of a linear model, we can apply transformations such as log (or, for bounded responses, logit) transformations, or we can repeat the modelling on different training samples, which will give different coefficient estimates. We then select the model with the most suitable parameters and the best accuracy among all the candidate models.
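For instance, a minimal sketch of a log-log refit (whether it actually helps depends on the data; it is shown only to illustrate the mechanics, and assumes the train and test sets from above):

logmod = lm(log(Sales) ~ log(Spend), data = train) # fit on the log scale
exp(predict(logmod, test)) # back-transform predictions to the original Sales scale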

