Model selection 101, using R

A quick and dirty walkthrough of simple model selection using R


What are we doing?

Since this is a very introductory look at model selection, we assume the data you’ve acquired has already been cleaned and scrubbed and is ready to go. Data cleaning is a whole subject in and of itself and is actually the primary time-sink of any Data Scientist. Go to the end of this article if you want to download the data for yourself and follow along!

Hit that ‘clap’ if you enjoy this kind of content and want more (this isn’t monetized, I just want to gauge interest!)

Edit: I’ve made a “sequel” to this article about visualizing and plotting the model we find if you want to check that out after reading this one!:


Let’s look at the pipeline:

This is the skeleton I use for creating a simple LM or GLM:

  1. Create a base-model using all available variables and data
  2. Factorize categorical variables if R didn’t do the job
  3. Add relevant power-transformations
  4. Add relevant variable interaction
  5. Remove insignificant variables with relevant testing criteria
    - Repeat steps 3–5 until you’ve exhausted your options
  6. Remove any outlying datapoints
  7. Evaluate your model
  8. Visualize your findings

1. Creating a base-model

Let’s start by setting up a workspace and loading our data. In this example we’re working with a dataset describing the employment status of women based on whether or not they’re foreigners, the amount of government-entitled support (log-transformed), age, years of education and the number of children (split across two variables, ‘young.children’ and ‘school.children’):

rm(list=ls())                                        # Clear the current R environment
setwd("C:/R/Workspace")                              # Set up your workspace
dat <- read.table("employment_data.txt", header = T) # Load the data
str(dat)
summary(dat)

which outputs:

The first thing we notice is that our response variable is binary, which suggests a binomial distribution, which in turn means we’ll have to fit a GLM instead of a traditional LM:

fit.1 <- glm(employed == "yes" ~ ., data = dat, family = binomial)
summary(fit.1)

This is the most general fit we can use by default: it fits a binomial model (family = binomial) with respect to the response variable “employed” taking the value “yes”, using every variable in the dataset (~ .), and gives us the following output:

Right, so there are a few problems right off the bat: we don’t like seeing p-values above 0.05, much less above 0.1. But before we recklessly remove those variables, let’s check for variable interactions and power-transformations first!


2. Factorizing categorical variables

Let’s consider the possibility that there’s a categorical difference between having no children and having any number of children greater than zero. To capture this we add categorical variables for having 0 children, ‘factor(young.children == 0)’ and ‘factor(school.children == 0)’, as well as a combined factor for not having any children at all, ‘factor(young.children + school.children == 0)’.

We can update our fit with the new variables:

tempfit <- update(fit.1, . ~ . + factor(young.children == 0)
                             + factor(school.children == 0)
                             + factor(young.children + school.children == 0))
summary(tempfit)

So we’ve already improved our model a bit in terms of AIC, from 1066.8 to 1050.2! Let’s take a look at our continuous variables and watch for possible power-transformations:


3. Adding relevant power-transformations

A neat way to look for potential power-transformations in a binomial model is to use this custom function:

logit.plot <- function(x, y, h, link='logit', xlab=deparse(substitute(x)),
                       yname=deparse(substitute(y)), ylab, rug=T, data, ...){
  # If a data argument is supplied, re-evaluate the call inside that data frame
  if(!missing(data)){
    call <- match.call()
    dataPos <- match("data", names(call))
    return(invisible(with(data, eval(call[-dataPos]))))
  }
  if (length(levels(factor(y))) != 2) stop('y must be binary')
  succes <- levels(factor(y))[2]
  if (missing(ylab)) ylab <- paste(binomial(link)$link, ' P{', yname, '=', succes, '|', xlab, '}', sep='', collapse='')
  if (is.factor(y)) y <- as.numeric(y == levels(y)[2])
  # Estimate E[y|x] with a Nadaraya-Watson kernel smoother and plot it on the link scale
  x.seq <- seq(min(x), max(x), length=101)
  smooth <- binomial(link)$linkfun(ksmooth(x, y, 'normal', h, x.points=x.seq)$y)
  plot(smooth ~ x.seq, ylab=ylab, xlab=xlab, type='l', ...)
  if (rug) rug(x)
  invisible(xy.coords(x = x.seq, y = smooth, xlab = xlab, ylab = ylab))
}

It’s actually decently simple: it plots link(E[y|x]) against x, with E[y|x] estimated using a Nadaraya-Watson kernel regression estimate.

It takes the following arguments:
x — your explanatory variable
y — the binary outcome variable
h — bandwidth for the Nadaraya-Watson kernel regression estimate
data — self-explanatory
... — whatever additional arguments you want to pass to plot()

Using this function and iterating over different bandwidths, we get the following kinds of plots:

for(i in seq(2.5, 10, length.out = 6))
  logit.plot(x = age, y = employed == 'yes', h = i, data = dat, main = paste0("bandwidth = ", i))

In this example with ‘age’ we can see that the function begins to smooth out at a bandwidth around 7. The graph could approximate a 2nd or maybe 3rd degree polynomial; the same is true for ‘education’ and ‘gov.support’, but for simplicity we’ll treat all three as 2nd degree polynomials:

tempfit <- update(tempfit, . ~ . + I(age^2)
                               + I(education^2)
                               + I(gov.support^2))
summary(tempfit)

Quite the improvement in terms of AIC, from 1050.2 to 1017.7! Lots of insignificant variables though!

This is our first attempt at a “full” model, so let’s define it as ‘fit.2’ and continue.

fit.2 <- tempfit

4. Adding variable interaction

The easiest way to check for variable interactions is the R function ‘add1’. It’s simply a matter of defining a scope to test and which test to use relative to the original model. F-tests are usually only relevant for LM and AOV models, so we can ‘safely’ ignore that testing criterion; we’ll instead use a χ²-test (Chisq or Chi²):

add1(fit.2, scope = .~. + .^2, test = "Chisq")

This scope simply asks to test the current model (.~.) plus interactions between existing variables (+ .^2). This will output a lot of interactions, some with statistically significant p-values, but it can be annoying to sort through manually, so let’s order the list so the lowest p-values end up at the top:

add1.test <- add1(fit.2, scope = .~. + .^2, test = "Chisq")
add1.test[order(add1.test$'Pr(>Chi)'),]

Right, so it seems like there might be an interaction between the foreigner and age variables. One thing to consider before simply adding the interaction with the lowest p-value is whether or not it makes sense in the context of our current model. Right now age² is actually the most significant variable in our model, so we might argue that adding the interaction between foreigner and age² is more intuitive; for simplicity we’ll stick with the foreigner:age interaction.

Let’s test for more interactions after adding the foreigner:age interaction:

add1.test <- add1(update(fit.2, .~. + foreigner:age), scope = .~. + .^2, test = "Chisq")
add1.test[order(add1.test$'Pr(>Chi)'),]

Now it seems like there’s a significant interaction in foreigner:factor(young.children + school.children == 0).

After a few rounds of this we end up seeing no new statistically significant interactions. By the end we’ve added the following interactions: 
+ foreigner:age 
+ foreigner:factor(young.children + school.children == 0) 
+ age:school.children 
+ gov.support:factor(young.children == 0)

So let’s update our fit with the new variable interactions as follows:

fit.3 <- update(fit.2, . ~ .
                + foreigner:age
                + foreigner:factor(young.children + school.children == 0)
                + age:school.children
                + gov.support:factor(young.children == 0))
summary(fit.3)

A small improvement in AIC once more, from 1017.7 to 1006.9.


5. Removing insignificant variables

This process is quite similar to step 4. We’ll simply use the drop1 function in R instead of add1, and since we’re seeking to remove variables rather than add them, we look for the highest p-value instead of the lowest (still using the χ²-test as our criterion):

drop1.test <- drop1(fit.3, test = "Chisq")
drop1.test[rev(order(drop1.test$'Pr(>Chi)')),]

This tells us mostly the same as our model summary: gov.support² definitely doesn’t seem to be statistically significant, so we’ll remove that first, and so on and so forth. We end up removing the following variables:
- gov.support²
- young.children
- education
- education²

After those have been removed we see that all the remaining variables are statistically significant, so lets update our fit by removing the variables listed above:

fit.4 <- update(fit.3, . ~ .
                - I(gov.support^2)
                - young.children
                - education
                - I(education^2))
summary(fit.4)

Nice, another improvement in AIC, although marginal and insignificant. The main advantage of this model over the previous one is the added simplicity that comes with the reduced number of explanatory variables!

One might wonder: “Why aren’t we removing the gov.support variable? It’s clearly insignificant when looking at the summary of our model!” This is due to the principle of marginality, which prohibits us from removing an insignificant variable if that variable has a significant interaction with another, like gov.support:factor(young.children == 0).

You might argue that removing gov.support would benefit the simplicity of the model, given that it’s clearly insignificant and that the interaction with the young.children == 0 variable is only marginally significant (p = 0.435). However, upon further inspection, removing gov.support from the model causes the variable interaction to split into two separate terms for TRUE and FALSE, so we don’t gain any simplicity; the AIC, all other coefficients, and the null and residual deviance stay exactly the same. On that account I chose to leave it in the model.
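If you want to verify that for yourself, here’s a quick sketch (fit.4.alt is just an illustrative name):

fit.4.alt <- update(fit.4, . ~ . - gov.support) # drop the gov.support main effect
summary(fit.4.alt)            # the interaction now shows up as separate TRUE/FALSE terms
c(AIC(fit.4), AIC(fit.4.alt)) # the AIC should be unchanged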

Doing an add1 and a drop1 test on our new and improved model shows that there are no new significant interactions and that all current variables are significant, so we’re done! The final fit is:

glm(employed == "yes" ~ foreigner
    + gov.support
    + age
    + school.children
    + factor(young.children == 0)
    + factor(school.children == 0)
    + factor(young.children + school.children == 0)
    + I(age^2)
    + foreigner:age
    + foreigner:factor(young.children + school.children == 0)
    + age:school.children
    + gov.support:factor(young.children == 0), family = binomial, data = dat)

6. Removing outliers

So now that we have a model we’re satisfied with, we can look for outliers that negatively affect the model.

Using the “car” package (https://www.rdocumentation.org/packages/car/versions/1.0-2) we can use the influencePlot() and outlierTest() functions to find potential outliers:
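A minimal sketch of those calls, assuming the car package is installed:

library(car)
influencePlot(fit.4) # bubble plot of studentized residuals vs. hat-values, sized by Cook's distance
outlierTest(fit.4)   # Bonferroni p-value for the largest absolute studentized residual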

We see that datapoint 416 is flagged as an outlier in both tests. We can take a look at the point in a few of our plots to gauge whether or not to remove it:

par(mfrow = c(2, 2)) # 2 rows, 2 columns
plot(fit.4, col = ifelse(as.integer(rownames(dat)) == 416, "red", "black"), pch = 20)

It seems like this point could very well be skewing our model a bit. Note that we should actually be using Pearson residuals to gauge our model’s fit, so the fact that we don’t have anything close to a straight line in the upper-left plot is fine; Q-Q plots are also irrelevant for this kind of model.

Let’s try removing the point and take a look at the new fit:

final.fit <- update(fit.4, data = dat[-c(416),])
summary(final.fit)

Down to almost 1000 AIC from the original 1067. This isn’t really a relevant comparison, though, since the two AIC values come from different sets of data (we removed point 416). To compare fairly, we would have to conclude that 416 was an outlier in the initial model as well, remove it there, and then compare the AIC of the initial model without point 416 to that of our final fit without point 416.
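One way to make that comparison fair (fit.1.no416 is just an illustrative name):

fit.1.no416 <- update(fit.1, data = dat[-c(416), ]) # naive model refit without point 416
AIC(fit.1.no416) # AIC of the naive model on the reduced data
AIC(final.fit)   # AIC of the final model on the same data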

Looking at another round of influencePlot() and outlierTest(), we find that datapoint 329 is acting out as well; however, looking at the actual plots we see that we can’t really justify removing it like we could with 416. This is our final fit.


7. Model evaluation

So now that we have a final fit where we can’t confidently add or remove any interactions, variables or other transformations, it’s time to evaluate whether our model actually fits our data and whether there’s even a statistically significant difference between our final fit and the first “naive” fit.

Let’s start by taking a look at our fitted values vs. the Pearson residuals:

p.resid <- residuals(final.fit, type = "pearson") # Pearson residuals
fit.val <- fitted(final.fit)                      # fitted values
par(mfrow = c(1, 1)) # 1 row, 1 column
plot(p.resid ~ fit.val)
lines(ksmooth(y = p.resid, x = fit.val, bandwidth = .1, kernel = 'normal'), col = 'blue')
abline(h = 0)

This is a pretty decent fit. Let’s take a look at the fit against the major explanatory variables as well:
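A minimal sketch of how such plots could be produced, reusing p.resid from above (the choice of variables and the bandwidth here are my own):

dat.f <- dat[-c(416), ]   # the data used for final.fit
par(mfrow = c(1, 3))
for (v in c("age", "education", "gov.support")) {
  x <- dat.f[[v]]
  plot(p.resid ~ x, xlab = v, ylab = "Pearson residuals")
  lines(ksmooth(x = x, y = p.resid, bandwidth = diff(range(x)) / 5, kernel = "normal"), col = "blue")
  abline(h = 0)
}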

With the exception of ‘gov.support’ everything looks quite nice. The reason for the bend in ‘gov.support’ seems to be a single outlier in which someone was granted a substantially lower amount of support than all our other data points.

Checking for potential overfitting

Overfitting is the bane of all statistical modelling: how can you make sure your model isn’t just fitting to the exact data you fed it? The goal is to make a model that generalizes and doesn’t just cater to the data at hand. So how do we test whether our model is overfitting on our data?

A popular metric to look at is the delta values generated through cross-validation. We can calculate these using the cv.glm function from the ‘boot’ package and compare our final fit to our first!

library(boot)
cv.glm(dat[-c(416),], final.fit, K = 13)$delta
cv.glm(dat[-c(416),], update(fit.1, data = dat[-c(416),]), K = 13)$delta

In the code above we’re using k-fold cross-validation with k = 13 (since 13 is a factor of 871, the length of our data after removing the outlier), which means we’re splitting our data into 13 ‘chunks’.

The delta value is a vector wherein the first component is the raw cross-validation estimate of prediction error and the second component is the adjusted cross-validation estimate (designed to compensate for the bias introduced by not using an exhaustive testing method such as leave-one-out).

Running the code above yields the following delta values; note that these are subject to some random variance, so you might not get the exact same values:

The prediction error is lower for the final fit, even when testing with cross-validation. Thus we can assume that our model hasn’t been overfitting on our data!

The final piece of the puzzle

So now we’ve concluded that our model is actually a pretty decent fit for our data, but is it statistically significantly different from the “naive” model without any transformations and variable interactions? We can use an ANOVA test for this; we just have to remove the same data point in both fits:
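A sketch of that test, refitting the naive model on the data without point 416:

# Compare the naive fit and the final fit on the same data via a chi-squared ANOVA
anova(update(fit.1, data = dat[-c(416), ]), final.fit, test = "Chisq")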

There’s definitely a significant difference between the two fits, we can happily conclude that our hard work has paid off!


8. Model visualization

Now that we’ve gotten ourselves a model, how do we actually visualize and interpret what it says about the relationships in our data?

Take a look at the following walk-through, which uses the same data and model as this article!


Closing remarks

Remember to leave a ‘clap’ if you enjoyed the article or want more of the sort. This isn’t eligible for the “members program”, so I don’t get paid for your claps; I just want to gauge the interest. Thanks!

Please keep in mind that this is purely introductory and that this isn’t an exhaustive analysis or conclusion! If we were more rigorous in our pursuit we would’ve incorporated cross-validation tests and ANOVA tests on each new iteration of our model, i.e. whenever we add a new variable, interaction or power-transformation.

Feel free to message me if you have any questions, and please correct me if you feel like I missed something or did something wrong. Do keep in mind that this is supposed to serve as an introduction to modelling in R; I’m well aware that this process is highly simplified compared to more advanced methods!

If you want to try your luck with this same dataset give it a go here: https://github.com/pela15ae/statmod/blob/master/employment_data.txt

The data is in Danish, so to convert the headers and categorical values to English, run this piece of code:

names(dat) <- c("employed", "foreigner", "gov.support", "age", "education", "young.children", "school.children")
levels(dat$employed)[1] <- "yes"
levels(dat$employed)[2] <- "no"
levels(dat$foreigner)[1] <- "yes"
levels(dat$foreigner)[2] <- "no"

Thanks for reading!