# Model Diagnostics and Selection — 9

If you have not read part 8 of the R data analysis series kindly go through the following article where we discussedMultiple Linear Regression — 8.

The contents in the article are gist from a couple of books that I got introduced during my IIM-B days.

R for Everyone — Jared P. Lander

Practical Data Science with R — Nina Zumel & John Mount

All the code blocks discussed in the article are present in the form of R markdown in theGithub link.

To see all the articles written by me kindly use the link,Vivek Srinivasan.

Building a model can be a never-ending process in which we constantly improve the model by adding interactions, taking away variables, doing transformations and so on. However, at some point, we need to confirm that we have the best model at the time or even a good model. That leads to the question: How do we judge the `quality `

of a model?

In almost all cases the answer has to be: in relation to other models. This could be an analysis of `residuals`

, the results of an `ANOVA `

test, a `Wald `

test, `drop-in deviance`

, the `AIC `

or `BIC `

score, `cross-validation`

error or `bootstrapping`

.

## Residuals

One of the first-taught ways of assessing model quality is an analysis of the `residuals`

, which is the difference between the actual response and the fitted values, values predicted by the model. This is a direct result of the formulation discussed in our previous article where the errors, akin to residuals, are `normally `

distributed. The basic idea is that if the model is appropriately fitted to the data, the residuals should be normally distributed as well. To see this, we start with the housing data to which we fit regression and visualize with a coefficient plot.

`housing <- read.table("../data/housing.csv",sep = ",", header = TRUE,stringsAsFactors = FALSE)`

names(housing) <- c("Neighborhood", "Class", "Units", "YearBuilt","SqFt", "Income", "IncomePerSqFt", "Expense",

"ExpensePerSqFt", "NetIncome", "Value","ValuePerSqFt", "Boro")

housing <- housing[housing$Units < 1000, ]

house1 <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data=housing)

summary(house1)

A quick way to grab the coefficients from a model is to either use the `coef `

function or get them from the model using the `$`

operator on the model object as we discussed in our previous article.

`library(coefplot)`

coefplot(house1)

For linear regression, three important residual plots are `fitted values against residuals`

, `Q-Q plots`

and the `histogram of residuals.`

The first is easy enough with ggplot2. Fortunately, ggplot2 has a handy trick for dealing with `lm `

models. We can use the model as the data source and ggplot2 “fortifies” it, creating new columns, for easy plotting.

`library(ggplot2)`

# see what a fortified lm model looks like

head(fortify(house1))

The plot of `residuals `

versus `fitted values`

as shown below at first glance disconcerting, because the pattern in the residuals shows that they are not as randomly dispersed as desired.

`ggplot(aes(x=.fitted, y=.resid), data = house1) +`

geom_point() +

geom_hline(yintercept = 0) +

geom_smooth(se = FALSE) +

labs(x="Fitted Values", y="Residuals")

However, further investigation reveals that this is due to the structure that `Boro `

gives the pattern to the data and it will be fine.

`ggplot(aes(x=.fitted, y=.resid), data = house1) +`

geom_point() +

geom_hline(yintercept = 0) +

geom_smooth(se = FALSE) +

labs(x="Fitted Values", y="Residuals") +

geom_point(aes(color=Boro))

Next up is the `Q-Q plot`

. If the model is a good fit, the standardized residuals should all fall along a straight line when plotted against the theoretical quantiles of the normal distribution.

`ggplot(house1, aes(sample=.stdresid)) + stat_qq() + geom_abline()`

`Q-Q plot`

for house1. The tails drift away from the ideal theoretical line, indicating that we do not have the best fit.

Another diagnostic is a `histogram `

of the `residuals`

. The histogram is not normally distributed, meaning that our model is not an entirely correct specification.

`ggplot(house1, aes(x=.resid)) + geom_histogram()`

## Model Selection

All of this measuring of model fit only really makes sense when comparing multiple models, because all of these measures are relative. So we will fit a number of models in order to compare them to each other.

`house2 <- lm(ValuePerSqFt ~ Units * SqFt + Boro, data=housing)`

house3 <- lm(ValuePerSqFt ~ Units + SqFt * Boro + Class,data=housing)

house4 <- lm(ValuePerSqFt ~ Units + SqFt * Boro + SqFt*Class,data=housing)

house5 <- lm(ValuePerSqFt ~ Boro + Class, data=housing)

As usual, our first step is to visualize the models together using `multiplot `

from the `coefplot `

package. The result shows that `Boro `

is the only variable with a significant effect on `ValuePerSqFt `

as do certain `condominium `

types.

`multiplot(house1, house2, house3, house4, house5, pointSize=2)`

## ANOVA For Model Selection

While we do not promote using `ANOVA `

for a multisample test, we do believe it serves a useful purpose in testing the relative merits of different models. Simply passing multiple model objects to `anova `

will return a table of results including the residual sum of squares `(RSS)`

, which is a measure of error, the lower the better.

`anova(house1, house2, house3, house4, house5)`

This shows that the fourth model, `house4`

, has the lowest `RSS`

, meaning it is the best model of the bunch. The problem with `RSS `

is that it always improves when an additional variable is added to the model. This can lead to excessive model complexity and overfitting.

## AIC And BIC

Another metric, which penalizes model complexity, is the `Akaike Information Criterion (AIC)`

. As with `RSS`

, the model with the lowest `AIC `

— even negative values — is considered optimal. The` BIC (Bayesian Information Criterion)`

is a similar measure where, once again, lower is better. The formula for AIC is

where `ln(L)`

is the maximized log-likelihood and `p`

is the number of coefficients in the model. As the model improves the log-likelihood gets bigger, and because that term is negated the `AIC`

gets lower. However, adding coefficients increases the AIC; this penalizes model complexity. The formula for `BIC `

is similar, except that instead of multiplying the number of coefficients by 2 it multiplies it by the natural log of the number of rows. This is seen in

The `AIC `

and `BIC `

for our models are calculated using the `AIC `

and `BIC `

functions, respectively.

`AIC(house1, house2, house3, house4, house5)`

BIC(house1, house2, house3, house4, house5)

Again it is pretty much visible from the above table, model `house4 `

is better than other models since it has the least `AIC `

and `BIC `

score.

## Cross-Validation

Residual diagnostics and model tests such as `ANOVA `

and `AIC `

are a bit old fashioned and came along before modern computing horsepower. The preferred method to assess model quality — at least by most data scientists — is cross-validation, sometimes called` k-fold cross-validation`

. The data is broken into k (usually five or ten) non-overlapping sections. Then a model is fitted on` k − 1`

section of the data, which is then used to make predictions based on the `kth`

section.

This is repeated `k`

times until every section has been held out for testing once and included in model fitting` k − 1`

times. Cross-validation provides a measure of the predictive accuracy of a model, which is largely considered a good means of assessing model quality.

There are a number of packages and functions that assist in performing `cross-validation`

. Each has its own limitations or quirks, so rather than going through a number of incomplete functions, we show one that works well for `generalized linear models`

(including linear regression), and then build a generic framework that can be used generally for an arbitrary model type.

The boot package by Brian Ripley has `cv.glm`

for performing `cross-validation `

on. As the name implies, it works only for generalized linear models, which will suffice for a number of situations.

library(boot)

#refit house1 using glm instead of lm

houseG1 <- glm(ValuePerSqFt ~ Units + SqFt + Boro,data=housing, family=gaussian(link="identity"))# ensure it gives the same results as lm

identical(coef(house1), coef(houseG1))

The results from `cv.glm`

include delta, which has two numbers, the raw `cross-validation error`

based on the cost function (in this case the mean squared error, which is a measure of correctness for an estimator) for all the folds and the `adjusted cross-validation error`

. This second number compensates for not using leave-one-out cross-validation, which is like `k-fold cross-validation`

except that each fold is the all but one data point with one point held out. This is very accurate but highly computationally intensive.

`houseCV1 <- cv.glm(housing, houseG1, K=5)`

houseCV1$delta

While we got a nice number for the error, it helps us only if we can compare it to other models, so we run the same process for the other models we built, rebuilding them with `glm`

first.

# Building multiple models

houseG2 <- glm(ValuePerSqFt ~ Units * SqFt + Boro, data=housing)

houseG3 <- glm(ValuePerSqFt ~ Units + SqFt * Boro + Class,data=housing)

houseG4 <- glm(ValuePerSqFt ~ Units + SqFt * Boro + SqFt*Class,data=housing)

houseG5 <- glm(ValuePerSqFt ~ Boro + Class, data=housing)# run cross-validation

houseCV2 <- cv.glm(housing, houseG2, K=5)

houseCV3 <- cv.glm(housing, houseG3, K=5)

houseCV4 <- cv.glm(housing, houseG4, K=5)

houseCV5 <- cv.glm(housing, houseG5, K=5)## Creating a dataframe of model Error values and adding model name

cvResults <- as.data.frame(rbind(houseCV1$delta, houseCV2$delta, houseCV3$delta, houseCV4$delta,houseCV5$delta))

names(cvResults) <- c("Error", "Adjusted.Error")

cvResults$Model <- sprintf("houseG%s", 1:5)cvResults

Once again, the fourth model, `houseG4`

, is the superior model. Let us visualize how much `ANOVA`

, `AIC `

and `cross-validation`

agree on the relative merits of the different models. The scales are all different but the shapes of the plots are identical.

# measure with ANOVA

cvANOVA <-anova(houseG1, houseG2, houseG3, houseG4, houseG5)

cvResults$ANOVA <- cvANOVA$`Resid. Dev`# measure with AIC

cvResults$AIC <- AIC(houseG1, houseG2, houseG3, houseG4, houseG5)$AIC# make the data.frame suitable for plotting

library(reshape2)

cvMelt <- melt(cvResults, id.vars="Model", variable.name="Measure",value.name="Value")# Plotting Results

ggplot(cvMelt, aes(x=Model, y=Value)) +

geom_line(aes(group=Measure, color=Measure)) +

facet_wrap(~Measure, scales="free_y") +

theme(axis.text.x=element_text(angle=90, vjust=.5)) +

guides(color=FALSE)

Plots for `cross-validation error`

(raw and adjusted), `ANOVA `

and `AIC `

for housing models. The scales are different, as they should be, but the shapes are identical, indicating that `houseG4 `

truly is the best model.

Determining the quality of a model is an important step in the model building process. This can take the form of traditional tests of fit such as `ANOVA `

or more modern techniques like `cross-validation`

. The `bootstrap `

is another means of determining model uncertainty, especially for models where confidence intervals are impractical to calculate. We have not discussed the bootstrap technique in this article. We will discuss it when we see `Random Forest`

and `Gradient Boosting`

algorithms. These can all be shaped by helping select which variables are included in a model and which are excluded.

Logistic Regression Using R— 10 — will be published soon

Do share your thoughts and support by commenting and sharing the article among your peer groups.