R Applications — Part 1: Simple Linear Regression

Burak Dilber
Published in Data Science Earth
Feb 23, 2021

Hello everyone, and welcome to the R applications series!

For those who want to work on data science, the first part of this applications series, in which various analyses will be carried out using the R programming language, starts with simple linear regression analysis. Enjoy your reading :)

Regression analysis is one of the methods that examines the relationship between two or more variables. One of these variables is called the dependent variable; the others are called independent variables. The aim is to predict the dependent variable by using the independent variables. If the dependent variable is to be explained with a single independent variable, this is called simple linear regression. For example, let's examine the relationship between the size of a house and its price. The two variables are described below:

Independent variable: The size of the house (in square meters)

Dependent variable: The price of the house (in 1000 TL)

Let's enter the data for these variables into R. The code is shown below:

house_price<-c(245,312,279,308,199,219,405,324,319,255)
house_size<-c(130,148,157,174,102,143,218,227,132,157)

To get a general idea of the relationship between the variables, we should first plot them. The plot is shown below:

plot(house_size,house_price)
[Figure: Relationship Between Variables]
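
For readability, axis labels can also be added to the same plot (the units are the ones assumed above):

# The same scatter plot with explicit axis labels
plot(house_size, house_price,
     xlab = "House size (square meters)",
     ylab = "House price (1000 TL)")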

Looking at the graph, we can say that there is a nearly linear relationship between the size of the house and its price. Let's model the relationship between these variables with simple linear regression analysis. First, let's give some information about the model. In statistics, different methods are used to model the relationship between variables; one of them is the simple linear regression model. The function for this model is shown below:

y = β₀ + β₁x + ε

Here y is the dependent variable, x is the independent variable, β₀ and β₁ are the parameters (intercept and slope), and ε is the random error term.

Let's dwell on this for a moment. The linear regression model can also be used when the relationship between the variables is non-linear. What is meant by linearity here is that the model is linear in its parameters (the beta parameters). Below are examples of linear regression models:
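
1. y = β₀ + β₁x + ε
2. y = β₀ + β₁x² + ε
3. y = β₀ + β₁log(x) + ε

Each of these is linear in the parameters β₀ and β₁, even though the second and third are non-linear in x. (The quadratic form is the one referred to below; the logarithmic form is one further typical example.)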

So where are models like the second and third functions used? As just mentioned, when the relationship between the variables is non-linear, a linear regression model can still be applied; however, the variables must be transformed. For example, if there is a quadratic relationship between the variables, the values of the independent variable are squared and the analysis continues with the squared values. In this case, the linear regression model is the second function.
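
As a quick illustration (a hypothetical sketch, not part of the house-price analysis itself), such a transformation can be written directly inside the lm() formula:

# Hypothetical example: a model that is quadratic in the predictor
# but still linear in the parameters. I() tells R to square the
# variable before the model is fitted.
quadratic_model <- lm(house_price ~ I(house_size^2))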

Let's go back to our own data and create a linear regression model. For this, the lm() function is used in R:

model<-lm(house_price~house_size)

After creating the model, we can look at the summary of the results. To do this, the summary() function is used in R:

summary(model)
[Output: Summary of the Model]

By looking at this output, we can get a lot of information about the model. First, the "Residuals" section gives us information about the residuals, that is, the error terms: the differences between the house price values in the data set and the predicted house price values. We can see the smallest, largest and median values of the residuals here.

We can find information about the parameters in the "Coefficients" section. The first column shows the estimated values of the parameters, and the second column shows their standard errors. The t and p values are shown in the third and fourth columns, respectively; these values are used to test the hypothesis of whether each parameter is equal to zero or not.

In the last part of the output, the standard error of the residuals and the degrees of freedom are shown. The "Multiple R-squared" value is called the coefficient of determination: it is the portion of the total variation in the dependent variable that is explained by the variation in the independent variable. Here, about 58% of the total variation in house price is explained by the change in house size. For models with more than one independent variable, interpretation is generally based on the "Adjusted R-squared" value instead; this value decreases when a useless variable is added to the model. Finally, the "F-statistic" and its "p-value" are used to test the hypothesis of whether the slope parameter is equal to zero or not.
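
If needed, the individual pieces of this output can also be extracted programmatically (a small sketch using the standard accessors of a summary.lm object):

# Extract parts of the model summary directly
summary(model)$coefficients   # estimates, standard errors, t and p values
summary(model)$r.squared      # multiple R-squared
summary(model)$adj.r.squared  # adjusted R-squared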

So how are the estimated values for house prices obtained? If you remember, we talked about a regression model above. When we drop the error term from that model and plug in the estimated parameter values, we obtain the predicted house price values. So the function can be written like this:
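
ŷ = β₀ + β₁x

Here ŷ denotes the predicted house price, and β₀ and β₁ are replaced by their estimated values.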

Here, we can fill in the parameter values by looking at the output where the summary results are listed.
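
With the estimates for this data set (rounded to two decimals; see the "Coefficients" section of the summary output), the fitted equation is approximately:

house_price ≈ 98.11 + 1.19 × house_size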

When the house size values are plugged into this function, we reach the predicted house prices: 10 predicted values are obtained for the 10 observations. The R code for this is shown below:

fitted(model)
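
The same numbers can be reproduced by hand from the estimated coefficients (a quick consistency check):

# Compute the predictions directly from the coefficients;
# the values should match those returned by fitted(model).
coef(model)[1] + coef(model)[2] * house_size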

When we add the residuals to these predicted values, we recover the actual house price values. The R code for the residuals is shown below:

model$residuals
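
Indeed, combining the two components gives back the observed data exactly (as.numeric() just drops the observation names that R attaches to the fitted values):

# Fitted values plus residuals reproduce the observed house prices
all.equal(as.numeric(fitted(model) + model$residuals), house_price)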

Now let's draw the regression line on the plot. We can add the regression line by using the abline() function in R:

abline(model)
[Figure: Regression Line]

We have seen how to perform linear regression analysis up to this stage. However, there is another important step in this analysis: checking the assumptions. The assumptions of linear regression must hold in order to obtain realistic values. Let's list these assumptions item by item:

  • The relationship between the variables is linear in the parameters.
  • The observations should be independent of each other.
  • The residuals should be normally distributed.
  • The residuals should have homogeneous variance (homoscedasticity).

In order to check these assumptions, residual analysis is performed in R. The code and output are shown below:

par(mfrow=c(2,2))
plot(model)
[Figure: Residual Analysis]

In the upper-left graph, the residuals should be scattered around zero; looking at this graph, we can say that they are spread fairly evenly around zero. The upper-right plot is the Q-Q plot of the residuals, used to check normality; here the points lie close to the line, so we can say that the residuals satisfy the normality assumption. The lower-left graph is used to check variance homogeneity: with it, we can assess whether the residuals have equal variance along the regression line. Finally, the lower-right graph is used to identify influential observations, that is, observations that might need to be removed from the data set to improve the model. Since the number of observations in our data set is small, there is no need to remove any for now.

However, we may not be able to judge these assumptions very clearly by looking at the graphs alone; graphical interpretation is especially difficult in more complex data sets. For this reason, some statistical tests are needed. The Shapiro-Wilk test can be used for normality and the studentized Breusch-Pagan test for variance homogeneity. To apply the Breusch-Pagan test in R, the "lmtest" package must be installed and loaded. The R functions and outputs are shown below.

shapiro.test(model$residuals)
install.packages("lmtest")
library(lmtest)
bptest(house_price~house_size)
[Output: Shapiro-Wilk and Breusch-Pagan Test Results]

First, hypotheses are established and interpreted for normality and then for variance homogeneity.

H0: Residuals are normally distributed.

H1: Residuals are not normally distributed.

According to the Shapiro-Wilk normality test, the p value is above 0.05, so the H0 hypothesis cannot be rejected. We can say that the residuals are normally distributed.

H0: There is variance homogeneity among residuals.

H1: There is no variance homogeneity among residuals.

According to the studentized Breusch-Pagan test, the p value is above 0.05, so the H0 hypothesis cannot be rejected. It can be concluded that there is variance homogeneity among the residuals.
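
As a side note, the influential-observation check from the diagnostic plots can also be made numerical using Cook's distance (a sketch; the 4/n cutoff used here is just a widely used rule of thumb, not part of this analysis):

# Cook's distance for each observation; values well above 4/n are
# often flagged as potentially influential.
cooks_d <- cooks.distance(model)
which(cooks_d > 4 / length(house_price))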

According to these analyses, we see that the assumptions of our regression model are satisfied. We can say that this model is suitable for linear regression analysis.

In cases where the assumptions are not satisfied, we can try to satisfy them by applying some transformations to the variables.
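
For example (an illustrative sketch only; it is not needed for this data set), log-transforming the dependent variable is a common first attempt when the residual variance grows with the fitted values:

# Hypothetical example: refit the model with a log-transformed
# dependent variable, then re-check the assumptions as above.
log_model <- lm(log(house_price) ~ house_size)
par(mfrow = c(2, 2))
plot(log_model)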

Up to this point, we have learned how to apply simple linear regression analysis using the R programming language and how to interpret the outputs. However, when we look at the relationship between variables, there may be more than one independent variable that affects the dependent variable. In this case, one of the analysis methods to apply is multiple linear regression. In the second part of the R applications series, I will discuss multiple linear regression analysis.

Have a nice day :)
