R Applications — Part 2: Multiple Linear Regression

Burak Dilber
Published in Data Science Earth
6 min read · Feb 23, 2021

Hello everyone!

In the first part of this R applications series, we modeled a data set with simple linear regression. In this second part, I will cover multiple linear regression analysis.

Multiple linear regression is used when we want to explain a dependent variable with more than one independent variable. The multiple linear regression model can be written as:

Y = β0 + β1X1 + β2X2 + … + βnXn + ε

Our aim here is to predict the dependent variable Y using n independent variables X1, …, Xn. Each independent variable has its own beta parameter (slope), and β0 is the intercept. The error term ε (epsilon) is the part of Y that the independent variables do not explain.
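To make the roles of the betas and the error term concrete, here is a minimal sketch on simulated data (not the article's data set). The variable names x1 and x2, the "true" coefficients and the noise level are all made up for illustration. It generates Y from a model of the form described above and checks that lm() recovers values close to the chosen betas.

```r
# Simulated illustration: every number below is an assumption, not real data.
set.seed(42)
n_obs <- 500                     # number of observations
x1  <- runif(n_obs, 0, 10)       # first independent variable
x2  <- runif(n_obs, 0, 5)        # second independent variable
eps <- rnorm(n_obs, sd = 1)      # error term (epsilon)

# Y = beta0 + beta1 * x1 + beta2 * x2 + epsilon
y <- 2 + 0.5 * x1 - 1.5 * x2 + eps

fit <- lm(y ~ x1 + x2)
coef(fit)                        # estimates should land near 2, 0.5 and -1.5
```

The estimates are not exactly equal to the chosen betas because of the error term; with more observations they tend to get closer.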

Let’s start with a sample data set. The datarium package in R includes a data set called marketing. The code for loading it into R is shown below.

install.packages("datarium")
library(datarium)
data(marketing)

First, let’s get to know the data set. It has 3 independent variables: youtube, facebook and newspaper. The dependent variable is sales. The aim is to investigate how advertising on these different platforms affects sales. Let’s get the descriptive statistics of the data set:

Descriptive Statistics

Descriptive statistics can be obtained using the summary() function in R:

summary(marketing)
Descriptive statistics

Here we can read off the descriptive statistics for each variable. For example, the mean of youtube is 176.45. Let’s create a matrix plot to examine the relationships between the variables.

Matrix Plot

In R, a matrix plot can be created using the plot() function.

plot(marketing)
Matrix Plot

Here we can see the relationships between the variables. For example, the relationship between youtube and sales is close to linear.
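A matrix plot is read by eye, so it can help to put numbers next to it. The sketch below computes a correlation matrix with cor(). It runs on a simulated stand-in data frame (sim) with the same column names as marketing, so it works without datarium installed. The coefficients used to generate sim are my own assumptions, not estimates from the real data.

```r
# Simulated stand-in for the marketing data (assumed values, for illustration).
set.seed(1)
n_obs <- 200
sim <- data.frame(
  youtube   = runif(n_obs, 0, 350),
  facebook  = runif(n_obs, 0, 60),
  newspaper = runif(n_obs, 0, 130)
)
sim$sales <- 3.5 + 0.045 * sim$youtube + 0.19 * sim$facebook +
  rnorm(n_obs, sd = 2)

cor_mat <- cor(sim)      # pairwise Pearson correlations
round(cor_mat, 2)
```

With the real data loaded, cor(marketing) gives the same kind of table for the article's variables.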

Let’s apply multiple linear regression and interpret the output.

Multiple Linear Regression Model

The multiple linear regression model and its summary output are shown below:

model<-lm(sales~.,data=marketing)
summary(model)
Summary of the model

In R’s summary output, significance is marked with asterisks next to each p value: the more asterisks, the smaller the p value, and *** marks p < 0.001. Here, youtube and facebook carry *** and are the variables that best explain the dependent variable. We can state this as a hypothesis test on each coefficient:

H0: βi = 0 (the variable has no effect on sales)
H1: βi ≠ 0 (the variable has an effect on sales)

The p value is below 0.05 for the youtube and facebook variables, so H0 is rejected and we can say that these variables are significant. The newspaper variable is not significant. The R squared value is 0.89. In other words, 89% of the total variation in sales is explained by the youtube, facebook and newspaper variables. This is a good R squared value for this model. However, the newspaper variable may need to be omitted. To understand this better, we can use stepwise regression.
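The R squared and p values discussed above do not have to be read off the printed summary; they can be pulled out of the model object. The sketch below also recomputes R squared by hand as 1 minus the ratio of residual to total sum of squares. It uses a simulated stand-in (sim) rather than the real marketing data, and the generating coefficients are assumptions.

```r
# Simulated stand-in for the marketing data (assumed values, for illustration).
set.seed(1)
n_obs <- 200
sim <- data.frame(
  youtube   = runif(n_obs, 0, 350),
  facebook  = runif(n_obs, 0, 60),
  newspaper = runif(n_obs, 0, 130)
)
sim$sales <- 3.5 + 0.045 * sim$youtube + 0.19 * sim$facebook +
  rnorm(n_obs, sd = 2)

fit <- lm(sales ~ ., data = sim)

# R squared = 1 - (residual sum of squares / total sum of squares)
rss <- sum(residuals(fit)^2)
tss <- sum((sim$sales - mean(sim$sales))^2)
r2_manual <- 1 - rss / tss

# The same number, and the coefficient p values, taken from summary()
r2_summary <- summary(fit)$r.squared
p_values   <- coef(summary(fit))[, "Pr(>|t|)"]

c(manual = r2_manual, summary = r2_summary)
p_values < 0.05    # TRUE where H0: beta = 0 is rejected at the 5% level
```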

Stepwise Regression

Forward Selection

With forward selection, we can see which variables should be in the model. The R code is shown below:

install.packages("olsrr")
library(olsrr)
stepw_forward<-ols_step_forward_p(model)
stepw_forward
Forward selection

Looking at this output, we can say that the variables that entered the model are youtube and facebook. Another stepwise method is backward elimination.

Backward Elimination

With this method, we can decide which variables to remove from the model. The R code and output are shown below:

stepw_backward<-ols_step_backward_p(model)
stepw_backward
Backward elimination

According to the output, the newspaper variable should be removed from the model.
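As a cross-check that needs only base R, the models with and without newspaper can be compared directly with a partial F-test via anova(). This is not what the olsrr stepwise functions do internally. It is a different, simpler check of the same question. The data frame sim is a simulated stand-in in which newspaper has no effect by construction. That is an assumption of the sketch, not a finding about the real data.

```r
# Simulated stand-in: newspaper plays no role in generating sales (assumed).
set.seed(1)
n_obs <- 200
sim <- data.frame(
  youtube   = runif(n_obs, 0, 350),
  facebook  = runif(n_obs, 0, 60),
  newspaper = runif(n_obs, 0, 130)
)
sim$sales <- 3.5 + 0.045 * sim$youtube + 0.19 * sim$facebook +
  rnorm(n_obs, sd = 2)

full    <- lm(sales ~ youtube + facebook + newspaper, data = sim)
reduced <- lm(sales ~ youtube + facebook, data = sim)

# Partial F-test of H0: the newspaper coefficient is zero
cmp <- anova(reduced, full)
cmp

# When a single variable is dropped, F equals the squared t statistic
f_stat <- cmp$F[2]
t_stat <- coef(summary(full))["newspaper", "t value"]
c(F = f_stat, t_squared = t_stat^2)
```

A large p value in the anova table means the smaller model fits about as well, which supports dropping the variable.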

Model 2

model2<-lm(sales~youtube+facebook,data=marketing)
summary(model2)
Summary of the model

Both variables are significant and the R squared value is 0.89. Now let’s check the assumptions.

par(mfrow=c(2,2))
plot(model2)
Residual analysis

Looking at the Q-Q plot, the residuals do not appear to be normally distributed. We cannot say anything definitive about homogeneity of variance from the plots. The Shapiro-Wilk test and the studentized Breusch-Pagan test can be applied to check normality and homogeneity of variance, respectively. Let’s run these tests in R.

library(lmtest)  # provides bptest(); run install.packages("lmtest") first if needed
shapiro.test(model2$residuals)
bptest(sales~youtube+facebook,data=marketing)
Tests

According to the Shapiro-Wilk normality test, the p value is below 0.05, so the residuals are not normally distributed. According to the studentized Breusch-Pagan test, the variances can be considered homogeneous. So let’s apply a transformation and refit the model to bring the residuals closer to a normal distribution.
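To see what the Breusch-Pagan test is checking, here is a hand-rolled sketch in base R. It regresses the squared residuals on the same predictors and uses n times the R squared of that auxiliary regression as a chi-squared statistic. To my understanding this is the studentized (Koenker) form that bptest() reports by default, but treat that equivalence as an assumption and compare against lmtest on your own data. The data frame sim is again a simulated stand-in.

```r
# Simulated stand-in for the marketing data (assumed values, for illustration).
set.seed(1)
n_obs <- 200
sim <- data.frame(
  youtube  = runif(n_obs, 0, 350),
  facebook = runif(n_obs, 0, 60)
)
sim$sales <- 3.5 + 0.045 * sim$youtube + 0.19 * sim$facebook +
  rnorm(n_obs, sd = 2)

fit <- lm(sales ~ youtube + facebook, data = sim)

# Auxiliary regression: squared residuals on the same predictors
sim$e2 <- residuals(fit)^2
aux <- lm(e2 ~ youtube + facebook, data = sim)

bp_stat <- nrow(sim) * summary(aux)$r.squared   # LM statistic = n * R^2
bp_df   <- length(coef(aux)) - 1                # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(statistic = bp_stat, df = bp_df, p.value = bp_p)
```

A small p value suggests that the residual variance changes with the predictors, which is the situation the article calls a lack of variance homogeneity.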

Model 3

We can apply a square root transformation to the youtube variable. The R code for this is shown below.

youtube_new<-sqrt(marketing$youtube)

Thus, we applied the square root transformation to the youtube variable. Let’s create the model.

model3<-lm(marketing$sales~youtube_new+marketing$facebook)
summary(model3)
Summary of the model
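A side note on the code style: the transformation can also be written inside the formula, with the data frame passed through data =. The sketch below suggests, on a simulated stand-in (sim, with assumed coefficients), that both styles give the same estimates. The formula version has the advantage that predict() can be called with new advertising budgets.

```r
# Simulated stand-in with a square-root effect of youtube (assumed values).
set.seed(1)
n_obs <- 200
sim <- data.frame(
  youtube  = runif(n_obs, 0, 350),
  facebook = runif(n_obs, 0, 60)
)
sim$sales <- 1 + 0.9 * sqrt(sim$youtube) + 0.19 * sim$facebook +
  rnorm(n_obs, sd = 1)

# Style used in the article: transform first, then reference columns with $
youtube_new <- sqrt(sim$youtube)
fit_a <- lm(sim$sales ~ youtube_new + sim$facebook)

# Alternative: transform inside the formula and pass data =
fit_b <- lm(sales ~ sqrt(youtube) + facebook, data = sim)

# Same estimates, but fit_b can also predict for new budgets
pred <- predict(fit_b, newdata = data.frame(youtube = 100, facebook = 20))
pred
```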

We see that the variables are significant and the R squared value has increased. Let’s check the assumptions:

par(mfrow=c(2,2))
plot(model3)
Residual analysis

Now the residuals appear to be normally distributed. Let’s apply the tests to confirm this.

shapiro.test(model3$residuals)
bptest(marketing$sales~youtube_new+marketing$facebook)
Tests

Yes, now the residuals are normally distributed. However, we have a small problem: homogeneity of variance is now violated. Looking at the residual plots, observation 131 stands out as an outlier. So let’s continue our analysis by removing this observation.
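Besides reading the diagnostic plots, a suspicious observation can be located numerically with standardized residuals and Cook's distance. In the sketch below, the data are simulated and an artificial outlier is planted in row 131 on purpose, only to show that the check finds it. This says nothing about why row 131 is unusual in the real marketing data.

```r
# Simulated stand-in with a planted outlier in row 131 (assumed values).
set.seed(1)
n_obs <- 200
sim <- data.frame(
  youtube  = runif(n_obs, 0, 350),
  facebook = runif(n_obs, 0, 60)
)
sim$sales <- 1 + 0.9 * sqrt(sim$youtube) + 0.19 * sim$facebook +
  rnorm(n_obs, sd = 1)
sim$sales[131] <- sim$sales[131] - 50      # artificial outlier

fit <- lm(sales ~ sqrt(youtube) + facebook, data = sim)

std_res <- rstandard(fit)                     # standardized residuals
flagged <- unname(which.max(abs(std_res)))    # row with the largest one
flagged
cooks.distance(fit)[flagged]                  # its influence on the fit
```

On real data it is worth checking why an observation is unusual before dropping it, since removing points changes the question the model answers.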

Model 4

First, we remove observation 131 from the data set and again apply the square root transformation to the youtube variable. The R code is shown below.

marketing_new<-marketing[-c(131),]
youtube_x<-sqrt(marketing_new$youtube)

Let’s create our model:

model4<-lm(marketing_new$sales~youtube_x+marketing_new$facebook)
summary(model4)
Summary of the model

We see that the R squared value increases and that the variables are significant. Now let’s check the assumptions.

par(mfrow=c(2,2))
plot(model4)
Residual analysis
shapiro.test(model4$residuals)
bptest(marketing_new$sales~youtube_x+marketing_new$facebook)
Tests

The p values are above 0.05 for both tests, so the normality and homogeneity of variance assumptions are satisfied.

As a result, the 4th model is the most successful of the multiple linear regression models we built. Thus, we learned how to model this data set with multiple linear regression when there are multiple independent variables. You can share your thoughts and criticisms about my analysis in the comment section. See you in the next part…

Have a nice day :)
