Model Diagnostics and Selection — 9

Vivekanandan Srinivasan · Published in Analytics Vidhya · Dec 12, 2019

If you have not read part 8 of the R data analysis series, kindly go through the following article, where we discussed Multiple Linear Regression — 8.

The contents of this article are a gist of a couple of books I was introduced to during my IIM-B days.

R for Everyone — Jared P. Lander

Practical Data Science with R — Nina Zumel & John Mount

All the code blocks discussed in the article are available as R markdown in the GitHub link.

To see all the articles written by me, kindly use the link: Vivek Srinivasan.

Building a model can be a never-ending process in which we constantly improve the model by adding interactions, taking away variables, doing transformations and so on. However, at some point, we need to confirm that we have the best model at the time or even a good model. That leads to the question: How do we judge the quality of a model?

In almost all cases the answer has to be: in relation to other models. This could be an analysis of residuals, the results of an ANOVA test, a Wald test, a drop in deviance, the AIC or BIC score, cross-validation error or bootstrapping.

Residuals

One of the first-taught ways of assessing model quality is an analysis of the residuals, which are the differences between the actual response and the fitted values, the values predicted by the model. This is a direct result of the formulation discussed in our previous article, where the errors, akin to residuals, are normally distributed. The basic idea is that if the model is appropriately fitted to the data, the residuals should be normally distributed as well. To see this, we start with the housing data, to which we fit a regression, and visualize it with a coefficient plot.

# read the housing data and give the columns descriptive names
housing <- read.table("../data/housing.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
names(housing) <- c("Neighborhood", "Class", "Units", "YearBuilt", "SqFt", "Income", "IncomePerSqFt", "Expense",
                    "ExpensePerSqFt", "NetIncome", "Value", "ValuePerSqFt", "Boro")
# drop buildings with an extreme number of units
housing <- housing[housing$Units < 1000, ]
# fit the regression and inspect the summary
house1 <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data = housing)
summary(house1)

A quick way to grab the coefficients from a model is to either use the coef function or extract them from the model object using the $ operator, as we discussed in our previous article and as shown below.
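Both approaches return the same named vector of estimates (a quick illustration, not part of the original code):

# both return the same named vector of coefficient estimates
coef(house1)
house1$coefficients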

library(coefplot)
coefplot(house1)

For linear regression, three important residual plots are fitted values against residuals, the Q-Q plot and the histogram of residuals. The first is easy enough with ggplot2. Fortunately, ggplot2 has a handy trick for dealing with lm models: we can use the model as the data source, and ggplot2 “fortifies” it, creating new columns for easy plotting.

library(ggplot2)
# see what a fortified lm model looks like
head(fortify(house1))

The plot of residuals versus fitted values, shown below, is at first glance disconcerting, because the pattern in the residuals shows that they are not as randomly dispersed as desired.

# plot residuals against fitted values
ggplot(house1, aes(x = .fitted, y = .resid)) +
    geom_point() +
    geom_hline(yintercept = 0) +
    geom_smooth(se = FALSE) +
    labs(x = "Fitted Values", y = "Residuals")

However, further investigation reveals that this is due to the structure that Boro gives to the data, and the fit is fine, as coloring the points by Boro shows.

# the same plot, with points colored by Boro to show the structure
ggplot(house1, aes(x = .fitted, y = .resid)) +
    geom_point() +
    geom_hline(yintercept = 0) +
    geom_smooth(se = FALSE) +
    labs(x = "Fitted Values", y = "Residuals") +
    geom_point(aes(color = Boro))

Next up is the Q-Q plot. If the model is a good fit, the standardized residuals should all fall along a straight line when plotted against the theoretical quantiles of the normal distribution.

ggplot(house1, aes(sample=.stdresid)) + stat_qq() + geom_abline()

Q-Q plot for house1. The tails drift away from the ideal theoretical line, indicating that we do not have the best fit.

Another diagnostic is a histogram of the residuals. The histogram shows that the residuals are not normally distributed, meaning that our model is not an entirely correct specification.

ggplot(house1, aes(x=.resid)) + geom_histogram()

Model Selection

All of this measuring of model fit only really makes sense when comparing multiple models, because all of these measures are relative. So we will fit a number of models in order to compare them to each other.

house2 <- lm(ValuePerSqFt ~ Units * SqFt + Boro, data = housing)
house3 <- lm(ValuePerSqFt ~ Units + SqFt * Boro + Class, data = housing)
house4 <- lm(ValuePerSqFt ~ Units + SqFt * Boro + SqFt * Class, data = housing)
house5 <- lm(ValuePerSqFt ~ Boro + Class, data = housing)

As usual, our first step is to visualize the models together using multiplot from the coefplot package. The result shows that Boro and certain condominium types are the only variables with a significant effect on ValuePerSqFt.

multiplot(house1, house2, house3, house4, house5, pointSize=2)

ANOVA For Model Selection

While we do not promote using ANOVA for a multisample test, we do believe it serves a useful purpose in testing the relative merits of different models. Simply passing multiple model objects to anova will return a table of results, including the residual sum of squares (RSS), which is a measure of error; the lower, the better.

anova(house1, house2, house3, house4, house5)

This shows that the fourth model, house4, has the lowest RSS, meaning it is the best model of the bunch. The problem with RSS is that it always improves when an additional variable is added to the model. This can lead to excessive model complexity and overfitting.
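As a quick illustration (not in the original article), adding a pure-noise column to the house4 specification still reduces the RSS slightly, even though the new variable carries no information; the column name and seed below are assumptions for the demonstration.

# illustrative sketch: even a meaningless predictor lowers the RSS
set.seed(1234)                             # assumed seed, for reproducibility
housingNoise <- housing                    # copy so the original data stays untouched
housingNoise$noise <- rnorm(nrow(housingNoise))
houseNoise <- lm(ValuePerSqFt ~ Units + SqFt * Boro + SqFt * Class + noise, data = housingNoise)
sum(residuals(house4)^2)                   # RSS of house4
sum(residuals(houseNoise)^2)               # RSS with the noise variable, slightly lower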

AIC And BIC

Another metric, which penalizes model complexity, is the Akaike Information Criterion (AIC). As with RSS, the model with the lowest AIC — even negative values — is considered optimal. The BIC (Bayesian Information Criterion) is a similar measure where, once again, lower is better. The formula for AIC is

AIC = -2 ln(L) + 2p

where ln(L) is the maximized log-likelihood and p is the number of coefficients in the model. As the model improves, the log-likelihood gets bigger, and because that term is negated the AIC gets lower. However, adding coefficients increases the AIC; this penalizes model complexity. The formula for BIC is similar, except that instead of multiplying the number of coefficients by 2 it multiplies it by the natural log of the number of rows, n:

BIC = -2 ln(L) + p ln(n)
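As a quick sanity check (not part of the original code), we can compute AIC and BIC for house1 by hand from its log-likelihood and compare them with R's built-in AIC and BIC functions; note that R counts the estimated residual variance as an extra parameter.

# verify the AIC and BIC formulas on house1
ll <- logLik(house1)                       # maximized log-likelihood
p <- attr(ll, "df")                        # number of estimated parameters (including the residual variance)
n <- nobs(house1)                          # number of rows used in the fit
c(manual = -2 * as.numeric(ll) + 2 * p, builtin = AIC(house1))
c(manual = -2 * as.numeric(ll) + log(n) * p, builtin = BIC(house1))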

The AIC and BIC for our models are calculated using the AIC and BIC functions, respectively.

AIC(house1, house2, house3, house4, house5)
BIC(house1, house2, house3, house4, house5)

Again, it is quite clear from the above table that model house4 is better than the other models, since it has the lowest AIC and BIC scores.

Cross-Validation

Residual diagnostics and model tests such as ANOVA and AIC are a bit old fashioned and came along before modern computing horsepower. The preferred method to assess model quality — at least by most data scientists — is cross-validation, sometimes called k-fold cross-validation. The data is broken into k (usually five or ten) non-overlapping sections. Then a model is fitted on k − 1 sections of the data and used to make predictions on the kth, held-out section.

This is repeated k times until every section has been held out for testing once and included in model fitting k − 1 times. Cross-validation provides a measure of the predictive accuracy of a model, which is largely considered a good means of assessing model quality.
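To make the mechanics concrete, here is a minimal hand-rolled five-fold loop for the house1 specification; this is an illustrative sketch rather than part of the original code, and the seed and fold assignment are assumptions.

# a minimal hand-rolled 5-fold cross-validation for the house1 specification
set.seed(1234)                                         # assumed seed, for reproducibility
k <- 5
folds <- sample(rep(1:k, length.out = nrow(housing)))  # randomly assign each row to one of k folds
cvErrors <- numeric(k)
for (i in 1:k) {
    trainData <- housing[folds != i, ]                 # fit on the other k - 1 folds
    testData <- housing[folds == i, ]                  # predict on the held-out fold
    fit <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data = trainData)
    preds <- predict(fit, newdata = testData)
    cvErrors[i] <- mean((testData$ValuePerSqFt - preds)^2)  # mean squared error for this fold
}
mean(cvErrors)                                         # overall cross-validation error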

There are a number of packages and functions that assist in performing cross-validation. Each has its own limitations or quirks, so rather than going through a number of incomplete functions, we show one that works well for generalized linear models (including linear regression) and then build a generic framework that can be used for an arbitrary model type.

The boot package by Brian Ripley has cv.glm for performing cross-validation. As the name implies, it works only for generalized linear models, which will suffice for a number of situations.

library(boot)
# refit house1 using glm instead of lm
houseG1 <- glm(ValuePerSqFt ~ Units + SqFt + Boro,
               data = housing, family = gaussian(link = "identity"))
# ensure it gives the same results as lm
identical(coef(house1), coef(houseG1))

The results from cv.glm include delta, which has two numbers: the raw cross-validation error based on the cost function (in this case the mean squared error, which is a measure of correctness for an estimator) across all the folds, and the adjusted cross-validation error. The second number compensates for not using leave-one-out cross-validation, which is like k-fold cross-validation except that each fold consists of all but one data point, with that single point held out. Leave-one-out is very accurate but highly computationally intensive.

houseCV1 <- cv.glm(housing, houseG1, K=5)
houseCV1$delta

While we got a nice number for the error, it helps us only if we can compare it to other models, so we run the same process for the other models we built, rebuilding them with glm first.

# build the same models with glm
houseG2 <- glm(ValuePerSqFt ~ Units * SqFt + Boro, data = housing)
houseG3 <- glm(ValuePerSqFt ~ Units + SqFt * Boro + Class, data = housing)
houseG4 <- glm(ValuePerSqFt ~ Units + SqFt * Boro + SqFt * Class, data = housing)
houseG5 <- glm(ValuePerSqFt ~ Boro + Class, data = housing)
# run cross-validation on each model
houseCV2 <- cv.glm(housing, houseG2, K = 5)
houseCV3 <- cv.glm(housing, houseG3, K = 5)
houseCV4 <- cv.glm(housing, houseG4, K = 5)
houseCV5 <- cv.glm(housing, houseG5, K = 5)
# create a data.frame of the error values and add the model names
cvResults <- as.data.frame(rbind(houseCV1$delta, houseCV2$delta, houseCV3$delta,
                                 houseCV4$delta, houseCV5$delta))
names(cvResults) <- c("Error", "Adjusted.Error")
cvResults$Model <- sprintf("houseG%s", 1:5)
cvResults

Once again, the fourth model, houseG4, is the superior model. Let us visualize how much ANOVA, AIC and cross-validation agree on the relative merits of the different models. The scales are all different but the shapes of the plots are identical.

# measure with ANOVA
cvANOVA <- anova(houseG1, houseG2, houseG3, houseG4, houseG5)
cvResults$ANOVA <- cvANOVA$`Resid. Dev`
# measure with AIC
cvResults$AIC <- AIC(houseG1, houseG2, houseG3, houseG4, houseG5)$AIC
# make the data.frame suitable for plotting
library(reshape2)
cvMelt <- melt(cvResults, id.vars = "Model", variable.name = "Measure", value.name = "Value")
# plot every measure against the models
ggplot(cvMelt, aes(x = Model, y = Value)) +
    geom_line(aes(group = Measure, color = Measure)) +
    facet_wrap(~Measure, scales = "free_y") +
    theme(axis.text.x = element_text(angle = 90, vjust = .5)) +
    guides(color = FALSE)

Plots for cross-validation error (raw and adjusted), ANOVA and AIC for housing models. The scales are different, as they should be, but the shapes are identical, indicating that houseG4 truly is the best model.

Determining the quality of a model is an important step in the model-building process. This can take the form of traditional tests of fit such as ANOVA or more modern techniques like cross-validation. The bootstrap is another means of determining model uncertainty, especially for models where confidence intervals are impractical to calculate; we have not discussed it in this article and will return to it when we cover Random Forest and Gradient Boosting algorithms. All of these measures can help decide which variables are included in a model and which are excluded.

Logistic Regression Using R — 10 — will be published soon.

Do share your thoughts and support by commenting and sharing the article among your peer groups.
