Model Diagnostics and Selection — 9
If you have not read part 8 of the R data analysis series, kindly go through the preceding article, where we discussed Multiple Linear Regression — 8.
The contents of this article are a gist of a couple of books that I was introduced to during my IIM-B days.
R for Everyone — Jared P. Lander
Practical Data Science with R — Nina Zumel & John Mount
All the code blocks discussed in the article are available as R Markdown at the GitHub link.
To see all the articles written by me kindly use the link, Vivek Srinivasan.
Building a model can be a never-ending process in which we constantly improve the model by adding interactions, taking away variables, doing transformations and so on. However, at some point, we need to confirm that we have the best model at the time, or even a good model. That leads to the question: How do we judge the quality of a model?
In almost all cases the answer has to be: in relation to other models. This could be an analysis of residuals, the results of an ANOVA test, a Wald test, drop-in deviance, the AIC or BIC score, cross-validation error or bootstrapping.
Residuals
One of the first-taught ways of assessing model quality is an analysis of the residuals, which are the differences between the actual response and the fitted values, the values predicted by the model. This is a direct result of the formulation discussed in our previous article, where the errors, akin to residuals, are normally distributed. The basic idea is that if the model is appropriately fitted to the data, the residuals should be normally distributed as well. To see this, we start with the housing data, to which we fit a regression and visualize it with a coefficient plot.
housing <- read.table("../data/housing.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
names(housing) <- c("Neighborhood", "Class", "Units", "YearBuilt", "SqFt", "Income", "IncomePerSqFt", "Expense",
                    "ExpensePerSqFt", "NetIncome", "Value", "ValuePerSqFt", "Boro")
housing <- housing[housing$Units < 1000, ]
house1 <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data=housing)
summary(house1)
A quick way to grab the coefficients from a model is to either use the coef function or extract them from the model object with the $ operator, as we discussed in our previous article.
library(coefplot)
coefplot(house1)
For linear regression, three important residual plots are: fitted values against residuals, Q-Q plots and the histogram of the residuals.
The first is easy enough with ggplot2. Fortunately, ggplot2 has a handy trick for dealing with lm models. We can use the model as the data source, and ggplot2 "fortifies" it, creating new columns, for easy plotting.
library(ggplot2)
# see what a fortified lm model looks like
head(fortify(house1))
The plot of residuals versus fitted values, shown below, is at first glance disconcerting, because the pattern in the residuals shows that they are not as randomly dispersed as desired.
ggplot(aes(x=.fitted, y=.resid), data = house1) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(se = FALSE) +
labs(x="Fitted Values", y="Residuals")
However, further investigation reveals that the pattern is due to the structure that Boro gives to the data, and the fit is fine.
ggplot(aes(x=.fitted, y=.resid), data = house1) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(se = FALSE) +
labs(x="Fitted Values", y="Residuals") +
geom_point(aes(color=Boro))
Next up is the Q-Q plot. If the model is a good fit, the standardized residuals should all fall along a straight line when plotted against the theoretical quantiles of the normal distribution.
ggplot(house1, aes(sample=.stdresid)) + stat_qq() + geom_abline()
Q-Q plot for house1. The tails drift away from the ideal theoretical line, indicating that we do not have the best fit.
Another diagnostic is a histogram of the residuals. The histogram below is not normally distributed, meaning that our model is not an entirely correct specification.
ggplot(house1, aes(x=.resid)) + geom_histogram()
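As an aside, base R can draw the standard diagnostic plots directly from an lm object via its plot method. A minimal sketch on simulated data (the same call works on house1):

```r
# Simulated data purely for illustration; any lm fit behaves the same way.
set.seed(42)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
fit <- lm(y ~ x, data = d)

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a grid
plot(fit)             # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))
```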
Model Selection
All of this measuring of model fit only really makes sense when comparing multiple models, because all of these measures are relative. So we will fit a number of models in order to compare them to each other.
house2 <- lm(ValuePerSqFt ~ Units * SqFt + Boro, data=housing)
house3 <- lm(ValuePerSqFt ~ Units + SqFt * Boro + Class, data=housing)
house4 <- lm(ValuePerSqFt ~ Units + SqFt * Boro + SqFt * Class, data=housing)
house5 <- lm(ValuePerSqFt ~ Boro + Class, data=housing)
As usual, our first step is to visualize the models together using multiplot from the coefplot package. The result shows that Boro and certain condominium types are the only variables with a significant effect on ValuePerSqFt.
multiplot(house1, house2, house3, house4, house5, pointSize=2)
ANOVA For Model Selection
While we do not promote using ANOVA for a multisample test, we do believe it serves a useful purpose in testing the relative merits of different models. Simply passing multiple model objects to anova will return a table of results, including the residual sum of squares (RSS), which is a measure of error; the lower, the better.
anova(house1, house2, house3, house4, house5)
This shows that the fourth model, house4, has the lowest RSS, meaning it is the best model of the bunch. The problem with RSS is that it always improves when an additional variable is added to the model. This can lead to excessive model complexity and overfitting.
AIC And BIC
Another metric, which penalizes model complexity, is the Akaike Information Criterion (AIC). As with RSS, the model with the lowest AIC, even among negative values, is considered optimal. The BIC (Bayesian Information Criterion) is a similar measure where, once again, lower is better. The formula for AIC is

AIC = -2ln(L) + 2p

where ln(L) is the maximized log-likelihood and p is the number of coefficients in the model. As the model improves, the log-likelihood gets bigger, and because that term is negated, the AIC gets lower. However, adding coefficients increases the AIC; this penalizes model complexity. The formula for BIC is similar, except that instead of multiplying the number of coefficients by 2 it multiplies it by the natural log of the number of rows:

BIC = -2ln(L) + p ln(n)

where n is the number of rows in the data.
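The formulas can be checked against R's built-in AIC and BIC functions. One caveat: R's logLik counts the estimated error variance as an extra parameter, so p below is taken from the df attribute rather than just the number of coefficients. A quick sketch on simulated data:

```r
# Simulated data purely for illustration.
set.seed(7)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)

ll <- as.numeric(logLik(fit))   # maximized log-likelihood, ln(L)
p  <- attr(logLik(fit), "df")   # parameters as counted by R (coefficients + sigma)
n  <- nobs(fit)                 # number of rows used in the fit

aic_manual <- -2 * ll + 2 * p
bic_manual <- -2 * ll + p * log(n)

all.equal(aic_manual, AIC(fit))  # TRUE
all.equal(bic_manual, BIC(fit))  # TRUE
```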
The AIC and BIC for our models are calculated using the AIC and BIC functions, respectively.
AIC(house1, house2, house3, house4, house5)
BIC(house1, house2, house3, house4, house5)
Again, it is pretty much visible from the above tables that model house4 is better than the other models, since it has the lowest AIC and BIC scores.
Cross-Validation
Residual diagnostics and model tests such as ANOVA and AIC are a bit old-fashioned and came along before modern computing horsepower. The preferred method to assess model quality, at least among most data scientists, is cross-validation, sometimes called k-fold cross-validation. The data is broken into k (usually five or ten) non-overlapping sections. Then a model is fitted on k − 1 sections of the data and used to make predictions on the kth section.
This is repeated k times, until every section has been held out for testing once and included in model fitting k − 1 times. Cross-validation provides a measure of the predictive accuracy of a model, which is largely considered a good means of assessing model quality.
There are a number of packages and functions that assist in performing cross-validation. Each has its own limitations or quirks, so rather than going through a number of incomplete functions, we show one that works well for generalized linear models (including linear regression), and then build a generic framework that can be used for an arbitrary model type.
The boot package by Brian Ripley has cv.glm for performing cross-validation. As the name implies, it works only for generalized linear models, which will suffice for a number of situations.
library(boot)

# refit house1 using glm instead of lm
houseG1 <- glm(ValuePerSqFt ~ Units + SqFt + Boro,
               data=housing, family=gaussian(link="identity"))

# ensure it gives the same results as lm
identical(coef(house1), coef(houseG1))
The results from cv.glm include delta, which has two numbers: the raw cross-validation error based on the cost function (in this case the mean squared error, which is a measure of correctness for an estimator) across all the folds, and the adjusted cross-validation error. The second number compensates for not using leave-one-out cross-validation, which is like k-fold cross-validation except that each fold consists of all but one data point, with that one point held out. Leave-one-out is very accurate but highly computationally intensive.
houseCV1 <- cv.glm(housing, houseG1, K=5)
houseCV1$delta
While we got a nice number for the error, it helps us only if we can compare it to other models, so we run the same process for the other models we built, rebuilding them with glm first.
# Building multiple models
houseG2 <- glm(ValuePerSqFt ~ Units * SqFt + Boro, data=housing)
houseG3 <- glm(ValuePerSqFt ~ Units + SqFt * Boro + Class, data=housing)
houseG4 <- glm(ValuePerSqFt ~ Units + SqFt * Boro + SqFt * Class, data=housing)
houseG5 <- glm(ValuePerSqFt ~ Boro + Class, data=housing)

# run cross-validation
houseCV2 <- cv.glm(housing, houseG2, K=5)
houseCV3 <- cv.glm(housing, houseG3, K=5)
houseCV4 <- cv.glm(housing, houseG4, K=5)
houseCV5 <- cv.glm(housing, houseG5, K=5)

# create a data.frame of model error values and add the model name
cvResults <- as.data.frame(rbind(houseCV1$delta, houseCV2$delta, houseCV3$delta,
                                 houseCV4$delta, houseCV5$delta))
names(cvResults) <- c("Error", "Adjusted.Error")
cvResults$Model <- sprintf("houseG%s", 1:5)
cvResults
Once again, the fourth model, houseG4, is the superior model. Let us visualize how much ANOVA, AIC and cross-validation agree on the relative merits of the different models. The scales are all different, but the shapes of the plots are identical.
# measure with ANOVA
cvANOVA <- anova(houseG1, houseG2, houseG3, houseG4, houseG5)
cvResults$ANOVA <- cvANOVA$`Resid. Dev`

# measure with AIC
cvResults$AIC <- AIC(houseG1, houseG2, houseG3, houseG4, houseG5)$AIC

# make the data.frame suitable for plotting
library(reshape2)
cvMelt <- melt(cvResults, id.vars="Model", variable.name="Measure", value.name="Value")

# plot the results
ggplot(cvMelt, aes(x=Model, y=Value)) +
  geom_line(aes(group=Measure, color=Measure)) +
  facet_wrap(~Measure, scales="free_y") +
  theme(axis.text.x=element_text(angle=90, vjust=.5)) +
  guides(color=FALSE)
Plots of cross-validation error (raw and adjusted), ANOVA and AIC for the housing models. The scales are different, as they should be, but the shapes are identical, indicating that houseG4 truly is the best model.
Determining the quality of a model is an important step in the model-building process. This can take the form of traditional tests of fit, such as ANOVA, or more modern techniques like cross-validation. The bootstrap is another means of determining model uncertainty, especially for models where confidence intervals are impractical to calculate. We have not discussed the bootstrap technique in this article; we will cover it when we look at Random Forest and Gradient Boosting algorithms. All of these techniques help in deciding which variables are included in a model and which are excluded.
Logistic Regression Using R — 10 — will be published soon
Do share your thoughts and support by commenting and sharing the article among your peer groups.