If you have not read part 8 of the R data analysis series, kindly go through the following article, where we discussed Multiple Linear Regression — 8.
The contents of this article are a gist of a couple of books I was introduced to during my IIM-B days.
R for Everyone — Jared P. Lander
Practical Data Science with R — Nina Zumel & John Mount
All the code blocks discussed in the article are available as an R markdown file at the Github link.
To see all the articles written by me kindly use the link, Vivek Srinivasan.
Building a model can be a never-ending process in which we constantly improve the model by adding interactions, taking away variables, doing transformations and so on. However, at some point, we need to confirm that we have the best model at the time, or even a good model. That leads to the question: How do we judge the quality of a model?
In almost all cases the answer has to be: in relation to other models. This could be an analysis of residuals, the results of an ANOVA test, a drop in deviance, the AIC or BIC score, the cross-validation error or bootstrapping.
One of the first-taught ways of assessing model quality is an analysis of the residuals, which are the differences between the actual response and the fitted values, the values predicted by the model. This is a direct result of the formulation discussed in our previous article, where the errors, akin to residuals, are normally distributed. The basic idea is that if the model is appropriately fitted to the data, the residuals should be normally distributed as well. To see this, we start with the housing data, to which we fit a regression and visualize it with a coefficient plot.
housing <- read.table("../data/housing.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
names(housing) <- c("Neighborhood", "Class", "Units", "YearBuilt", "SqFt", "Income", "IncomePerSqFt", "Expense",
                    "ExpensePerSqFt", "NetIncome", "Value", "ValuePerSqFt", "Boro")
housing <- housing[housing$Units < 1000, ]
house1  <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data = housing)
A quick way to grab the coefficients from a model is to either use the coef function or extract them from the model object with the $ operator, as we discussed in our previous article.
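As a quick, self-contained illustration (on the built-in mtcars data rather than the housing data), both approaches return the same named vector:

```r
# fit a simple linear model on the built-in mtcars data
mod <- lm(mpg ~ wt + hp, data = mtcars)

# two equivalent ways to extract the coefficients
coef(mod)
mod$coefficients

identical(coef(mod), mod$coefficients)  # TRUE
```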
For linear regression, three important residual plots are: fitted values against residuals, Q-Q plots and the histogram of residuals. The first is easy enough with ggplot2. Fortunately, ggplot2 has a handy trick for dealing with lm models: we can use the model as the data source, and ggplot2 "fortifies" it, creating new columns for easy plotting.
# see what a fortified lm model looks like
head(fortify(house1))
The plot of residuals against fitted values, shown below, is at first glance disconcerting, because the pattern in the residuals shows that they are not as randomly dispersed as desired.
ggplot(aes(x = .fitted, y = .resid), data = house1) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) +
  labs(x = "Fitted Values", y = "Residuals")
However, further investigation reveals that this is due to the structure that Boro gives to the data, and the fit is fine. Coloring the points by Boro makes this visible.
ggplot(aes(x = .fitted, y = .resid), data = house1) +
  geom_point(aes(color = Boro)) +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) +
  labs(x = "Fitted Values", y = "Residuals")
Next up is the Q-Q plot. If the model is a good fit, the standardized residuals should all fall along a straight line when plotted against the theoretical quantiles of the normal distribution.
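The same idea can be checked numerically: the Q-Q plot compares the sample quantiles of the standardized residuals (rstandard) against theoretical normal quantiles, and their correlation should be close to 1 for a good fit. A minimal sketch on the built-in mtcars data:

```r
mod <- lm(mpg ~ wt + hp, data = mtcars)

# standardized residuals, as used on the y-axis of the Q-Q plot
sr <- rstandard(mod)

# theoretical normal quantiles vs. sample quantiles, without plotting
qq <- qqnorm(sr, plot.it = FALSE)

# correlation close to 1 indicates the points hug the straight line
cor(qq$x, qq$y)
```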
ggplot(house1, aes(sample=.stdresid)) + stat_qq() + geom_abline()
Q-Q plot for house1. The tails drift away from the ideal theoretical line, indicating that we do not have the best fit.
Another diagnostic is a histogram of the residuals. The histogram is not normally distributed, meaning that our model is not an entirely correct specification.
ggplot(house1, aes(x=.resid)) + geom_histogram()
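If a visual check feels too subjective, a formal normality test such as shapiro.test can complement the histogram; this is an optional extra beyond the original workflow, shown here on built-in data:

```r
# illustrative model on the built-in mtcars data
mod <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test of the residuals; a small p-value suggests
# the residuals are not normally distributed
shapiro.test(residuals(mod))
```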
All of this measuring of model fit only really makes sense when comparing multiple models, because all of these measures are relative. So we will fit a number of models in order to compare them to each other.
house2 <- lm(ValuePerSqFt ~ Units * SqFt + Boro, data = housing)
house3 <- lm(ValuePerSqFt ~ Units + SqFt * Boro + Class, data = housing)
house4 <- lm(ValuePerSqFt ~ Units + SqFt * Boro + SqFt * Class, data = housing)
house5 <- lm(ValuePerSqFt ~ Boro + Class, data = housing)
As usual, our first step is to visualize the models together using multiplot from the coefplot package. The result shows that Boro is the only variable with a significant effect on ValuePerSqFt, as do certain levels of Class.
multiplot(house1, house2, house3, house4, house5, pointSize=2)
ANOVA For Model Selection
While we do not promote using ANOVA for a multisample test, we do believe it serves a useful purpose in testing the relative merits of different models. Simply passing multiple model objects to anova will return a table of results including the residual sum of squares (RSS), which is a measure of error; the lower, the better.
anova(house1, house2, house3, house4, house5)
This shows that the fourth model, house4, has the lowest RSS, meaning it is the best model of the bunch. The problem with RSS is that it always improves when an additional variable is added to the model. This can lead to excessive model complexity and overfitting.
AIC And BIC
Another metric, which penalizes model complexity, is the Akaike Information Criterion (AIC). As with RSS, the model with the lowest AIC — even negative values — is considered optimal. The BIC (Bayesian Information Criterion) is a similar measure where, once again, lower is better. The formula for AIC is

AIC = -2ln(L) + 2p

where ln(L) is the maximized log-likelihood and p is the number of coefficients in the model. As the model improves, the log-likelihood gets bigger, and because that term is negated the AIC gets lower. However, adding coefficients increases the AIC; this penalizes model complexity. The formula for BIC is similar, except that instead of multiplying the number of coefficients by 2 it multiplies it by the natural log of the number of rows, n:

BIC = -2ln(L) + p·ln(n)
The AIC and BIC for our models are calculated using the AIC and BIC functions, respectively.
AIC(house1, house2, house3, house4, house5)
BIC(house1, house2, house3, house4, house5)
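To see the formulas in action, AIC and BIC can be recomputed by hand from logLik and compared with R's built-in functions. Note that for lm, R counts the estimated error variance as an extra parameter, so p below is the number of coefficients plus one:

```r
mod <- lm(mpg ~ wt + hp, data = mtcars)  # illustrative model on built-in data

ll <- as.numeric(logLik(mod))  # maximized log-likelihood
p  <- length(coef(mod)) + 1    # coefficients plus the error variance
n  <- nrow(mtcars)             # number of rows

manual_aic <- -2 * ll + 2 * p
manual_bic <- -2 * ll + log(n) * p

all.equal(manual_aic, AIC(mod))  # TRUE
all.equal(manual_bic, BIC(mod))  # TRUE
```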
Again, it is pretty much visible from the above tables that model house4 is better than the other models, since it has the lowest AIC and BIC.
Residual diagnostics and model tests such as AIC are a bit old-fashioned and came along before modern computing horsepower. The preferred method to assess model quality — at least by most data scientists — is cross-validation, sometimes called k-fold cross-validation. The data is broken into k (usually five or ten) non-overlapping sections. Then a model is fitted on k − 1 sections of the data, which is then used to make predictions on the held-out section. This is repeated k times, until every section has been held out for testing once and included in model fitting k − 1 times. Cross-validation provides a measure of the predictive accuracy of a model, which is largely considered a good means of assessing model quality.
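The procedure described above is simple enough to sketch by hand in base R. This is only an illustration on the built-in mtcars data, not a replacement for the packaged functions used below:

```r
set.seed(42)
k <- 5

# assign each row of mtcars to one of k non-overlapping folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# for each fold: fit on the other k-1 folds, predict the held-out fold
fold_mse <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])
  pred <- predict(fit, newdata = mtcars[folds == i, ])
  mean((mtcars$mpg[folds == i] - pred)^2)
})

# overall cross-validation error: average MSE weighted by fold size
weighted.mean(fold_mse, as.vector(table(folds)))
```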
There are a number of packages and functions that assist in performing
cross-validation. Each has its own limitations or quirks, so rather than going through a number of incomplete functions, we show one that works well for
generalized linear models (including linear regression), and then build a generic framework that can be used generally for an arbitrary model type.
The boot package by Brian Ripley has cv.glm for performing cross-validation. As the name implies, it works only for generalized linear models, which will suffice for a number of situations.
# refit house1 using glm instead of lm
houseG1 <- glm(ValuePerSqFt ~ Units + SqFt + Boro, data = housing, family = gaussian(link = "identity"))

# ensure it gives the same results as lm
identical(coef(house1), coef(houseG1))
The results from cv.glm include delta, which has two numbers: the raw cross-validation error based on the cost function (in this case the mean squared error, which is a measure of correctness for an estimator) across all the folds, and the adjusted cross-validation error. The second number compensates for not using leave-one-out cross-validation, which is like k-fold cross-validation except that each fold consists of all but one data point, with that one point held out. This is very accurate but highly computationally intensive.
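As an aside, for ordinary least-squares fits leave-one-out cross-validation does not actually require refitting n models: it has a closed form based on the hat values (leverage). A small check against cv.glm on built-in data (this shortcut applies to least squares, not to GLMs in general):

```r
library(boot)  # ships with R; provides cv.glm

g <- glm(mpg ~ wt + hp, data = mtcars)  # gaussian glm, same fit as lm

# closed-form LOOCV error for least squares: mean of (residual / (1 - leverage))^2
loocv <- mean((residuals(g) / (1 - hatvalues(g)))^2)

# cv.glm with no K argument does full leave-one-out; delta[1] is the raw error
cv <- cv.glm(mtcars, g)
all.equal(loocv, as.numeric(cv$delta[1]))  # TRUE
```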
houseCV1 <- cv.glm(housing, houseG1, K = 5)
houseCV1$delta
While we got a nice number for the error, it helps us only if we can compare it to other models. So we run the same process for the other models we built, rebuilding them with glm.
# Building multiple models
houseG2 <- glm(ValuePerSqFt ~ Units * SqFt + Boro, data = housing)
houseG3 <- glm(ValuePerSqFt ~ Units + SqFt * Boro + Class, data = housing)
houseG4 <- glm(ValuePerSqFt ~ Units + SqFt * Boro + SqFt * Class, data = housing)
houseG5 <- glm(ValuePerSqFt ~ Boro + Class, data = housing)

# run cross-validation
houseCV2 <- cv.glm(housing, houseG2, K = 5)
houseCV3 <- cv.glm(housing, houseG3, K = 5)
houseCV4 <- cv.glm(housing, houseG4, K = 5)
houseCV5 <- cv.glm(housing, houseG5, K = 5)

# Creating a dataframe of model error values and adding model names
cvResults <- as.data.frame(rbind(houseCV1$delta, houseCV2$delta, houseCV3$delta, houseCV4$delta, houseCV5$delta))
names(cvResults) <- c("Error", "Adjusted.Error")
cvResults$Model <- sprintf("houseG%s", 1:5)
cvResults
Once again, the fourth model, houseG4, is the superior model. Let us visualize how well ANOVA, AIC and cross-validation agree on the relative merits of the different models. The scales are all different, but the shapes of the plots are identical.
# measure with ANOVA
cvANOVA <- anova(houseG1, houseG2, houseG3, houseG4, houseG5)
cvResults$ANOVA <- cvANOVA$`Resid. Dev`

# measure with AIC
cvResults$AIC <- AIC(houseG1, houseG2, houseG3, houseG4, houseG5)$AIC

# make the data.frame suitable for plotting
cvMelt <- melt(cvResults, id.vars = "Model", variable.name = "Measure", value.name = "Value")

# Plotting Results
ggplot(cvMelt, aes(x = Model, y = Value)) +
  geom_line(aes(group = Measure, color = Measure)) +
  facet_wrap(~Measure, scales = "free_y") +
  theme(axis.text.x = element_text(angle = 90, vjust = .5))
Plots of the cross-validation error (raw and adjusted), ANOVA and AIC for the housing models. The scales are different, as they should be, but the shapes are identical, indicating that houseG4 truly is the best model.
Determining the quality of a model is an important step in the model-building process. This can take the form of traditional tests of fit such as ANOVA or more modern techniques like cross-validation. The bootstrap is another means of determining model uncertainty, especially for models where confidence intervals are impractical to calculate. We have not discussed the bootstrap technique in this article; we will cover it when we see Random Forest and Gradient Boosting algorithms. All of these techniques help in deciding which variables are included in a model and which are excluded.
Logistic Regression Using R— 10 — will be published soon
Do share your thoughts and support by commenting and sharing the article among your peer groups.