Supervised Learning in R: Regression Diagnosis

Fatih Emre Ozturk, MSc
Jun 30, 2023


When we fit a linear regression model to a data set, several problems can arise. The set of procedures we use to detect and address them is called regression diagnosis. Regression diagnosis assesses the validity of a model: it involves checking the model’s underlying statistical assumptions, examining the structure of the model, and looking for observations that are poorly represented by the model or that have a disproportionately large effect on its predictions.

In the previous post of the supervised learning series, we explored how to fit a line, regression evaluation metrics, interpreting regression output, and feature selection methods. This post assumes you know those topics and focuses on diagnosing the linear regression model.

The most common problems in linear regression fall under six topics: non-linearity of the response-predictor relationship, correlation of residuals, non-constant variance of residuals, outliers, high-leverage points, and collinearity. In this post, I will explain each problem and show how to diagnose it in R.

Non-linearity

For the linear regression model to be trustworthy, there must be a straight-line relationship between the predictor(s) and the response variable. If there is no such relationship, our predictions become suspect. To identify non-linearity, we can check residual plots. A residual plot places the fitted (predicted) values on the x-axis and the residuals on the y-axis.

In a residual plot, what we must look for is any trend. For instance, a “U” shape indicates that the relationship is not linear, which is something we do not want. The following plot is a good example of this:

Assume that we observe non-linearity in our model. What should we do? We can use non-linear transformations of the predictor(s) X. For example, X², X³, √X, or log X might solve the non-linearity problem.
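As a minimal sketch (y, x1, x2, and some_data are placeholder names, not columns of a real data set), such transformations go directly into the lm() formula:

# hypothetical example: transformed predictors inside an lm() formula
# y, x1, x2, and some_data are placeholders
fit <- lm(y ~ log(x1) + I(x2^2), data = some_data)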

As an example for non-linearity and all of the other problems, I will use the Carseats data set from the ISLR package. I built the following model with the following features:

library(ISLR)   # the Carseats data set lives in the ISLR package
df <- Carseats
model <- lm(Sales ~ Advertising + Price + Age + CompPrice + Income + Population, data = df)

We can now check the residual plot with the following code:

# residuals vs. fitted values
plot(model$fitted.values, model$residuals, xlab = "Fitted Values", ylab = "Residuals")

Alternatively, R’s built-in diagnostic plots (which include a residuals-vs-fitted panel) can be drawn as follows:

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(model)

As can easily be seen from both plots, there is no clear pattern in the residuals. Thus, there is no non-linearity problem for this model.

Correlation of Residuals

Another assumption of the linear regression model is that the residuals should not be correlated. If the residuals are correlated, the estimated standard errors will underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be, and p-values will be lower than they should be. This distorts every conclusion we draw from the model.
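Besides inspecting residual plots, a common numerical check for correlation between successive residuals is the Durbin-Watson test from the lmtest package; a minimal sketch (this check is an addition, not part of the original workflow above):

# Durbin-Watson test for autocorrelation of residuals
# H0: the residuals are uncorrelated
lmtest::dwtest(model)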

Non-Constant Variance of Error Terms

Another important assumption of the linear regression model is that the residuals have constant variance. Since hypothesis tests, standard errors, and confidence intervals all rely on this assumption, it is very important.

There are two ways to identify this problem. The first is, again, the residual plot. Non-constant variance, or heteroscedasticity, shows up in a residual plot as a funnel shape. The following plot is a good example of a funnel-shaped residual plot.

Another way to identify heteroscedasticity is the Breusch-Pagan test. The hypotheses of the test are as follows:

H0: Errors have constant variance.

Ha: Errors do not have constant variance.

According to the test result, we want a p-value greater than the level of significance, so that we fail to reject the null hypothesis of constant variance.

When heteroscedasticity is observed, it is recommended to transform the response variable (for example with a log or square-root transformation) or to use the weighted least squares method.
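A minimal weighted least squares sketch, assuming the residual spread grows with the fitted values (the weighting scheme below is just one common heuristic, not a step from the original analysis):

# estimate how the residual spread changes with the fitted values,
# then down-weight observations with a larger estimated spread
spread_fit <- lm(abs(residuals(model)) ~ fitted(model))
w <- 1 / pmax(fitted(spread_fit), 1e-6)^2   # guard against non-positive fitted spread
wls_model <- lm(Sales ~ Advertising + Price + Age + CompPrice + Income + Population,
                data = df, weights = w)
summary(wls_model)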

In R, the Breusch-Pagan test can be found in the lmtest package. The following code shows how it is used.

# loading the lmtest package for the Breusch-Pagan test
library(lmtest)
bptest(Sales ~ Advertising + Price + Age + CompPrice + Income + Population, data = df)
 studentized Breusch-Pagan test

data: Sales ~ Advertising + Price + Age + CompPrice + Income + Population
BP = 0.88444, df = 6, p-value = 0.9896
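Since the p-value (0.9896) is far above any usual significance level, we fail to reject the null hypothesis: there is no evidence of non-constant variance in this model.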

Outliers

An outlier in a linear regression model is an observation whose response value is far from the value predicted by the model. It usually does not have much effect on the fitted line; however, it can cause other problems, such as a decline in R² or an increase in RMSE.

Again, residual plots can be used to identify outliers. In the following illustration, the outlier marked with a red point is clearly visible. In practice, however, it might not be so easy to detect. To address this problem, we can check a studentized or standardized residual plot, computed by dividing each residual by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are flagged as possible outliers.

We need to be careful when dealing with outliers. Observations that are believed to have occurred due to an error in data collection, and observations whose studentized residuals are greater than 3 in absolute value, can be removed.

The following code shows how outliers are identified in R.

# calculating standardized residuals (residuals scaled by their standard deviation)
standardized.residuals <- model$residuals / sd(model$residuals)

# drawing the standardized residual plot with reference lines at 0 and +/- 3
plot(model$fitted.values, standardized.residuals, xlab = "Predictions", ylab = "Standardized Residuals")
abline(h = 0)
abline(h = -3)
abline(h = 3)

Since there is no observation whose standardized residual is greater than 3 in absolute value, there is no outlier in this example.
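R also ships with rstudent(), which returns leverage-adjusted studentized residuals directly; a quick numerical check along the same lines (this snippet is an addition, not part of the code above):

# studentized residuals via R's built-in rstudent()
stud.res <- rstudent(model)
which(abs(stud.res) > 3)   # row indices of possible outliers; integer(0) means none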

Leverages

As I mentioned, outliers are observations whose response values are far from the values predicted by the model. In contrast, high-leverage observations are those with unusual values of the predictor variable(s). Another difference between outliers and high-leverage points is their impact on the fitted line: high-leverage observations have a sizable impact on the estimated regression line.

Since the beginning of this section I have been talking about high leverage, but how is it measured? To quantify an observation’s leverage, the leverage statistic is computed; a large value indicates high leverage. For a model with p predictors and n observations, the average leverage is (p + 1)/n, so observations whose leverage greatly exceeds this value deserve attention.
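In R, the leverage statistic is given by hatvalues(); a minimal sketch, using twice the average leverage as a common rule-of-thumb cut-off (the same threshold as the vertical line in the plot below):

# leverage statistic for each observation
lev <- hatvalues(model)
mean(lev)        # average leverage, equal to (p + 1) / n
2 * mean(lev)    # a common rule-of-thumb cut-off for high leverage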

It is also possible to identify high-leverage points by plotting the leverage statistic against the standardized residuals. The following code and output show how this is done.

# standardized residuals (as computed above)
st.res <- model$residuals / sd(model$residuals)
# leverage statistic vs. standardized residuals, with reference lines
plot(hatvalues(model), st.res)
abline(h = c(-2, 2), v = 2 * mean(hatvalues(model)))

In this example, there is no problematic high-leverage observation. However, if there were any, it would appear beyond the reference lines, in the upper-right or lower-right region of the plot.

Collinearity

Collinearity refers to a close relationship between two or more predictor variables. This is another problematic issue: if two predictors are closely related, it is difficult to separate their individual effects on the response variable. Collinearity also inflates the standard errors of the affected coefficients, which in turn shrinks their t-statistics and makes it harder to establish the predictors’ significance.

There are two ways to detect collinearity. The first is to check the correlation matrix of the predictor variables: if a pair of variables is highly correlated, there is a collinearity problem in the model. However, this is not always enough, because collinearity can also exist among three or more variables, a situation called multicollinearity, which a correlation matrix cannot reveal. For that case, we can check the variance inflation factor (VIF). It is the ratio of the variance of a coefficient when fitting the full model to its variance when the predictor is fit on its own. The smallest possible VIF is 1; as a rule of thumb, a VIF exceeding 5 or 10 indicates a multicollinearity problem.
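For the correlation-matrix check, the predictors used in this model can be inspected directly (a small sketch using the columns already in df):

# correlation matrix of the predictors used in the model
predictors <- c("Advertising", "Price", "Age", "CompPrice", "Income", "Population")
round(cor(df[, predictors]), 2)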

There are also two ways to deal with a collinearity problem. The first is to drop one of the problematic variables from the regression model. The second is to combine the problematic variables into a single variable, although this may not be appropriate for every data set.

In R, the car package has a vif function for checking collinearity in a model. All you need to do is pass it the object containing the fitted linear regression model.

# loading library
library(car)
# checking the variance inflation factor
vif(model)
Advertising       Price         Age   CompPrice      Income  Population 
   1.084517    1.534319    1.015966    1.549623    1.011341    1.090368 

As you can see from the output, there are no variables with a VIF value anywhere near 5 or 10, so there is no collinearity problem in this model.

Just like always:

“In case I don’t see ya, good afternoon, good evening, and good night!”

Reference and Further Reading

James, Gareth, et al. An Introduction to Statistical Learning: with Applications in R. Springer, 2013, pp. 92-103.

