General Validation of Linear Model Assumptions

Burak Tiras
Aug 28, 2022


Load the dataset, do some data cleaning stuff, build the model, run the results BAM BAM BAM!!!

Easy, right?

Nope. Not that easy.

Well, at least it shouldn’t be that simple. All of the steps mentioned above are indeed obligatory, yes. But once you get into machine learning, it demands some extra work before you build your model. Not that complicated, but certainly mandatory. If you skip that part, at the end of the day you will still have a model that seemingly works…

…but that could not be further from the truth.

Assumptions, my dear friend, assumptions.

Morpheus lingers around the room and looks into your eyes:

Assumptions are everywhere. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work… when you go to church… when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth.

Construct scene from Matrix (1999), Wachowskis.

The assumptions that we’re gonna talk about today are not that complicated, and no, we will not talk about the fact that the world is actually a digital image constructed upon our assumptions about it.

The assumptions that we’re gonna talk about today are statistical assumptions. Before building your model, there are certain a priori conditions that must be validated. Without testing them, your model is statistically garb-, I mean, your model might be inaccurate, so to speak.

Linear Model Assumptions

Let’s look into the assumptions regarding linear models. There are four assumptions that must be met, which are:

  1. Linearity (Obvious)
  2. Normality (Obvious as well)
  3. Homoscedasticity, i.e. constant variance of the residuals (Man what the f-)
  4. Independence (Your errors must be independent, and your predictor variables must not have collinearity issues.)
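For context, these checks can also be eyeballed by hand with base R’s diagnostic plots. A minimal sketch on any fitted lm object, using the built-in mtcars dataset as a stand-in (nothing package-specific here):

m0 <- lm(mpg ~ wt, data = mtcars)  # any fitted lm object works
plot(m0, which = 1)  # Residuals vs Fitted: checks linearity
plot(m0, which = 2)  # Normal Q-Q: checks normality of residuals
plot(m0, which = 3)  # Scale-Location: checks homoscedasticity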

That’s right, you must check these one by one before building your model. One by one. Each of them. Yes.

Joking.

Luckily, you and I are blessed with an R package that can check whether the model satisfies the above assumptions or not. How beautiful, isn’t it?

The package that I’m referring to is:

gvlma

You can access its CRAN page by clicking on it. It was developed by Edsel A. Pena and Elizabeth H. Slate and is currently maintained by Elizabeth H. Slate.

One simple function and it’s done.

How to Install

install.packages("gvlma")

How to Deploy

library(gvlma)

Build Your Model

We will use the built-in Orange dataset to predict circumference from age.

View(Orange)
m <- lm(circumference ~ age, data = Orange)

Validate the Assumptions

Use the gvlma() function to conduct the validation process.

validation_m <- gvlma(m)
summary(validation_m)

Explore the Results

As you can see, we have a green light. All assumptions are accepted.

Model results. Image by Author

Check the Model

Since our assumptions are satisfied and suitable for a linear model, it’s time to look into model results.
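Pulling up those results is just the standard summary() call on the lm object we built earlier:

summary(m)  # coefficients, standard errors, R-squared, F-statistic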

Model results. Image by Author

Plot the Validation Summary

As we have a linear regression model with quite a high R-squared, let’s honor it with the gvlma package by plotting the validation_m object, so that we can further investigate the assumption checks.

To visualise our plot we’ll use a gvlma function:

plot.gvlma(validation_m)
Validation Summary Plot. Image by Author

Yeah, that was all. That’s what we get when R’s simplicity meets talented statisticians. Special thanks to the authors of this package.

P.S: You can dive deep into collinearity validation by checking VIF scores.
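As a quick sketch of that (assuming the car package is installed; it is not part of base R), VIF scores only make sense with two or more predictors, so here is a hypothetical multi-predictor model on the built-in mtcars dataset:

library(car)  # provides vif(); install.packages("car") if missing
m2 <- lm(mpg ~ wt + hp + disp, data = mtcars)  # illustrative model
vif(m2)  # common rule of thumb: values above 5-10 suggest collinearity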

Take care,

Burak.

https://www.linkedin.com/in/burak-tiras/

Further Reading:
https://cran.r-project.org/web/packages/gvlma/gvlma.pdf
https://www.statology.org/linear-regression-assumptions/
