Simple Linear Regression In R Programming

Joyeeta Dey
5 min read · Jun 18, 2022


Linear regression is a basic and commonly used type of predictive analysis. The overall idea of regression is to examine two things:

  1. Does a set of predictor variables do a good job of predicting an outcome (dependent) variable?
  2. Which variables, in particular, are significant predictors of the outcome variable, and in what way do they impact it (indicated by the magnitude and sign of the beta estimates)?

The most common types of linear regressions used are:

  1. Simple linear regression — one independent (predictor) variable,
  2. Multiple linear regression — more than one independent variable (see the sketch below).
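
In R, the distinction shows up directly in the model formula. A minimal sketch, assuming a hypothetical data frame df with columns y, x1, and x2:

# Hypothetical data frame df with columns y, x1, x2
fit_simple   <- lm(y ~ x1, data = df)       # simple: one predictor
fit_multiple <- lm(y ~ x1 + x2, data = df)  # multiple: several predictors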

Simple Linear Regression

Simple Linear regression involves the use of one dependent variable and one independent variable. Mathematically,

y = a+b*x

where y is the dependent variable, x is the independent variable, b is the regression coefficient (slope) and a is a constant (intercept).
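
For a single predictor, the least-squares estimates have simple closed forms: b = cov(x, y) / var(x) and a = mean(y) - b * mean(x). A quick sketch on made-up toy data:

# Toy data, purely illustrative
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
b <- cov(x, y) / var(x)      # slope
a <- mean(y) - b * mean(x)   # intercept
c(a = a, b = b)              # same values as coef(lm(y ~ x))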

Let’s understand this better with an example. We will use the ‘marketing’ dataset, which is available in R through the datarium package. To load and inspect it:

rm(list = ls())               # clear the workspace
install.packages("datarium")  # install the package (only needed once)
library(datarium)             # load the package
data("marketing")             # load the marketing dataset
str(marketing)                # structure of the data
head(marketing)               # first six rows
[Output: structure and first rows of the marketing dataset]

Suppose we want to predict future sales based on the advertising budget spent on YouTube.

Before jumping into regression analysis, let’s visualize the relationship using ggplot2:

library(ggplot2)
ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Syntax analysis:

  1. x = youtube, the independent (predictor) variable we predict from,
  2. y = sales, the dependent variable we want to predict,
  3. geom_point() — plots the data points,
  4. geom_smooth(method = 'lm') — overlays the fitted linear model,
  5. se = FALSE — hides the standard-error band; set it to TRUE to see the confidence band around the fit.
[Figure: scatter plot of sales vs. youtube with the fitted regression line]

Next, we check the correlation between the x and y variables. Correlation coefficients indicate the strength of the linear relationship between two variables, x and y, and always lie in the range [-1, 1]. For linear regression, the closer the coefficient is to -1 or 1, the stronger the linear relationship, and the more accurately we can predict y from x.

#finding correlation
cor(marketing$youtube, marketing$sales)
[Output: correlation coefficient between youtube and sales]

Here, we can see that the variables are strongly correlated. Now, we can build our linear regression model.

#building a linear model
model <- lm(sales ~ youtube, data = marketing)
model

We get the values of a and b: a = 8.43911 (intercept) and b = 0.04754 (slope).
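
The coefficients can also be pulled out programmatically with coef(), which is handy when reusing them in later calculations:

coef(model)             # named vector: (Intercept) and youtube
coef(model)["youtube"]  # the slope b on its own

For a closer look, we can check the detailed summary of the linear model: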

#model summary
summary(model)

Since the p-value of the model is less than 0.05, the conventional significance threshold, the relationship between youtube and sales is statistically significant and we can accept the model.
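
If you need the numbers themselves rather than the printed summary, the coefficient table (including the p-values) can be extracted from the summary object:

summary(model)$coefficients                          # estimates, std. errors, t and p-values
summary(model)$coefficients["youtube", "Pr(>|t|)"]   # p-value for the slope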

Predicting values using the model

Now comes the ultimate application for which the model was built. We will predict the value of sales if the advertising budget for youtube is 200.

#predicting a particular outcome
youtube <- 200
new_dt <- data.frame(youtube)
pred_int_pt <- predict(model, new_dt)
pred_int_pt
[Output: predicted sales for youtube = 200]
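
As a sanity check, this is just y = a + b*x with the fitted coefficients: 8.43911 + 0.04754 × 200 ≈ 17.95.

# Manual check: y = a + b*x, using the fitted coefficients
coef(model)[1] + coef(model)[2] * 200   # ≈ 17.95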

The confidence interval of the model

A confidence interval is the estimate plus and minus the variation in that estimate: the range of values the true mean response is expected to fall within, at a stated level of confidence, if the sampling were repeated. To check the 95% confidence interval of the prediction:

#confidence interval on the expected outcome
conf_int_pt <- predict(model, new_dt, level = 0.95, interval = "confidence")
conf_int_pt
[Output: fitted value with lower and upper 95% confidence bounds]
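
predict() also supports interval = "prediction", which returns the wider interval for an individual new observation rather than for the mean response:

# Prediction interval for a single new observation (wider than the confidence interval)
predict(model, new_dt, level = 0.95, interval = "prediction")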

Now, let’s evaluate the model.

1. Residual Standard Error (RSE)

#RSE
sigma(model)

RSE (Residual Standard Error) is the estimate of the standard deviation of the irreducible error (the error which can’t be reduced even if we knew the true regression line; hence, irreducible). In simpler words, it is the average deviation between the actual outcome and the true regression line.

In order for the model to have higher accuracy, the RSE should be as small as possible.
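
Under the hood, sigma() is just the square root of the residual sum of squares divided by the residual degrees of freedom. Computing it by hand makes the definition concrete:

# RSE by hand: sqrt(residual sum of squares / residual degrees of freedom)
sqrt(sum(residuals(model)^2) / df.residual(model))   # equals sigma(model)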

2. Prediction Error Rate

#percentage error (RSE relative to the mean outcome)
sigma(model)*100/mean(marketing$sales)

This expresses the RSE as a percentage of the average sales, which gives an intuitive relative-error reading of model quality. Note that this is not quite the mean absolute percentage error (MAPE), which averages the absolute percentage errors over all observations.
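
For comparison, here is a minimal sketch of MAPE itself, computed on the training data with the model fitted above:

# MAPE: mean absolute percentage error over all observations
fitted_sales <- predict(model, marketing)
mean(abs((marketing$sales - fitted_sales) / marketing$sales)) * 100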

3. Root Mean Squared Error

The most common metric for evaluating linear regression performance is the root mean squared error, or RMSE. It measures how far the model’s predictions fall, on average, from the actual observed values, so a high RMSE is “bad” and a low RMSE is “good”.

library(caret)   # provides RMSE()
library(dplyr)   # provides the %>% pipe
#Make predictions on the full dataset
pred <- model %>%
  predict(marketing)
head(pred)
#model performance
rmse <- RMSE(pred, marketing$sales)
rmse
[Output: RMSE of the model]
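
If you would rather not load caret just for this, RMSE is easy to compute directly:

# RMSE by hand: square root of the mean squared prediction error
sqrt(mean((predict(model, marketing) - marketing$sales)^2))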

Visualising the model

Residuals of the model

#Evaluating the linearity assumption via the residuals
par(mar = c(2, 2, 2, 2))
plot(model, 1)   # residuals vs. fitted values

If the residuals show a systematic pattern, a log transform of the outcome can help. Visualizing the model after taking the log of sales:

model1 <- lm(log(sales) ~ youtube, data = marketing)
plot(model1, 1)
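
Base R can also draw all four standard diagnostic plots at once, which is a quick way to check the remaining assumptions (normality of residuals, constant variance, influential points):

# All four standard diagnostic plots for an lm fit
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))   # reset the plotting layout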

Hope you learned something new about linear regression in R programming. For more R basics, check out my other stories. Connect with me on LinkedIn.
