# Model selection 101, using R

Quick and dirty markup of simple model selection using R

#### What‘re we doing?

Since this is a very introductory look at model selection **we assume the data you’ve acquired has already been cleaned, scrubbed and ready to go.** Data cleaning is a whole subject in and of itself and is actually the primary time-sink of any Data Scientist. Go to the end of this article if you want to download the data for yourself and follow along!

Hit that ‘**clap**’ if you enjoy this kind of content and want more *(this **isn’t **monetized, I just want to gauge interest!)*

**Edit**: I’ve made a “*sequel*” to this article about visualizing and plotting the model we find if you want to check that out after reading this one!:

#### Lets look at the pipeline:

This is the skeleton I use for creating a simple LM or GLM:

**Create a base-model**using all available variables and data**Factorize categorical variables**if R didn’t do the job**Add relevant power-transformations****Add relevant variable interaction****Remove insignificant variables**with relevant testing criteria

- Repeat step 3–5 until you’ve exhausted your options**Remove any outlying datapoints****Evaluate your model****Visualizing your findings**

### 1. Creating a base-model

Lets start by setting up a workspace and loading our data. In this example we’re working on a dataset describing **employment-status of women based on whether or not you’re a foreigner, the amount of government-entitled support (log-transformed), age, years of education and the number of children** (spread in two categorical variables *‘young.children’* and ‘*school.children*’):

rm(list=ls()) # “Clear current R environment”

setwd(“C:/R/Workspace”) # Setting up your workspace

dat <- read.table(“employment_data.txt”, header=T) # Load that data

str(dat)

summary(dat)

which outputs:

The first thing we notice is that our response-variable is binomial (obviously) suggesting that we have a binomial distribution which means we’ll have to fit a **GLM **instead of a traditional **LM**:

fit.1 <- glm(employed == "yes" ~ ., data = dat, family = binomial)

summary(fit.1)

This fit is the most general fit we can use by default, it fits a binomial model (*family = binomial*) with respect to the response-variable “employed” having a value “yes”, using every variable in the dataset (*~ .*) giving us the following output:

Right, so a few problems right off the bat, we don’t like seeing **p-values** above **0.05**, much less above **0.1**, but before we recklessly remove them lets check for variable interactions and power-transformations first!

### 2. Factorizing categorical variables

Lets consider the possibility that there’s a categorical difference between not having any children and actually having any number of children larger than zero, thus we add the categorical variables for having** 0 children**: *‘factor(young.children == 0)’*,* ‘factor(school.children == 0)’* and a combined factor for not having any children at all *‘factor(young.children + school.children == 0)’*

We can update our fit with the new variables:

tempfit <- update(fit.1, .~. + factor(young.children == 0)

+ factor(school.children == 0)

+ factor(young.children +

school.children == 0))

summary(tempfit)

So we’ve already improved our model a bit in terms of **AIC **from **1066.8** to **1050.2**! Lets take a look at our continuous variables and watch for possible power-transformations:

### 3. Adding relevant power-transformations

A neat way to look for potential power-transformations of a binomial distribution is using this custom function:

logit.plot <- function(x, y, h, link=’logit’, xlab=deparse(substitute(x)), yname=deparse(substitute(y)), ylab, rug=T, data, …){

if(!missing(data)){

call <- match.call()

dataPos <- match(“data”,names(call))

return(invisible(with(data, eval(call[-dataPos]))))

}

if (length(levels(factor(y)))!=2) stop(‘y must be binary’)

succes <- levels(factor(y))[2]

if (missing(ylab)) ylab <- paste(binomial(link)$link,’ P{‘,yname,’=’,succes,’|’,xlab,’}’, sep=’’, collapse=’’)

if (is.factor(y)) y <- as.numeric(y==levels(y)[2])

x.seq <- seq(min(x),max(x),length=101)

smooth <- binomial(link)$linkfun(ksmooth(x, y, ’normal’, h, x.points=x.seq)$y)

plot(smooth~x.seq, ylab=ylab, xlab=xlab, type=’l’,…)

if (rug) rug(x)

invisible(xy.coords(x = x.seq, y = smooth, xlab = xlab, ylab = ylab))

}

It’s actually decently simple, it plots the **link(E[y|x])** against **x **with E[y|x] estimated using a** Nadaraya-Watson** kernel regression estimate:

This take the following arguments:**x **— Your explanatory variables**y **— The binary outcome variable**h **— Bandwidth for the Nadaraya-Watson kernel regression estimate**data **— Self explanatory**… **— Whatever additional arguments you want to pass to plot()

Using this function iterating over different bandwidths we get the following kinds of plot:

for(i in seq(2.5,10,length.out = 6))

logit.plot(x = age, y = employed == ‘yes’, h = i, data = dat, main = paste0(“bandwidth = “,i))

In this example with ‘age’ we can see that the function begins smoothing around a bandwidth around 7, the graph could approximate a 2. or maybe 3. degree polynomial, the same is true for ‘education’ and ‘gov.support’ but for simplicity we’ll consider the case of all three taking shape as 2. degree polynomials:

tempfit <- update(tempfit, .~. + I(age^2)

+ I(education^2)

+ I(gov.support^2))

summary(tempfit)

Quite the improvement in terms of **AIC**, from **1050.2 **to **1017.7**! Lots of insignificant variables though!

This is our first attempt at a “full” model, so lets define this as our ‘fit.2’ and continue.

fit.2 <- tempfit

**4. Adding variable interaction**

The easiest way to check for variable interaction is using the R-function ‘**add1**’, this is simply the case of defining a scope to test and which test to use when testing relative to the original model. F-tests are usually only relevant for **LM **and **AOV **models so we can ‘safely’ ignore that testing criteria, we’ll instead be using a ** χ²-test** (Chisq or Chi²):

add1(fit.2, scope = .~. + .^2, test=”Chisq”)

This scope is simply asking to test the current model (.~.) plus interaction between existing variables (+ .^2), this will output a lot of interactions, some with statistically significant P-values, but it can be annoying to manually sort through, so lets **sort the list** so we get the lowest **P-values** on the top:

add1.test <- add1(fit.2, scope = .~. + .^2, test=”Chisq”)

add1.test[order(add1.test$’Pr(>Chi)’),]

Right, so it seems like there might be an interaction between the **foreigner **and **age** variables. One thing to consider before simply adding the interaction with the lowest P-value is whether or not this makes sense in context of our current model, right now **age²** is actually the most significant variable in our model so we might argue that adding the interaction between **foreigner **and **age²** is more intuitive, for simplicity we’ll stick with the **foreigner:age **interaction.

Lets test for more interactions **after **adding the variable interaction **foreigner:age**:

add1.test <- add1(update(fit.2, .~. + foreigner:age), scope = .~. + .^2, test=”Chisq”)

add1.test[order(add1.test$’Pr(>Chi)’),]

Now it seems like there’s a significant interaction in **foreigner:factor(young.children + school.children == 0)**

After a few rounds of this we end up seeing no new statistically significant interactions, by the end we’ve added the following interactions:

+** foreigner:age**

+ **foreigner:factor(young.children + school.children == 0)**

+ **age:school.children**

+ **gov.support:factor(young.children == 0)**

So lets update our fit with the new variable interactions as follows:

fit.3 <- update(fit.2, .~.

+ foreigner:age

+ foreigner:factor(young.children + school.children == 0)

+ age:school.children

+ gov.support:factor(young.children == 0))

summary(fit.3)

A small improvement in **AIC **once more, from **1017.7** to **1006.9**.

**5. Removing insignificant variables**

This process is quite similar to the last one in step 4. We’ll simply be using the **drop1 **function in R now instead of **add1, **and due to us seeking to remove instead of appending variables we seek the highest **P-value** instead of the lowest (we’ll still use ** χ²-test** as our criteria):

drop1.test <- drop1(fit.3, test=”Chisq”)

drop1.test[rev(order(drop1.test$’Pr(>Chi)’)),]

This tells us mostly the same as our model-summary, **gov.support²** definitely dosn’t seem to be statistically significant, so we’ll remove that first and so on and so forth, we end up removing the following variables:

- **gov.support²**

- **young.children**

-** education**

- **education²**

After those have been removed we see that all the remaining variables are statistically significant, so lets update our fit by removing the variables listed above:

fit.4 <- update(fit.3, .~.

— I(gov.support^2)

— young.children

— education

— I(education^2))

summary(fit.4)

Nice, another improvement in **AIC **although **marginal and insignificant**, the main advantage of this model over our previous is the **added simplicity** inherent in the reduced number of explanatory variables!

One might wonder *“Why aren’t we removing the **gov.support** variable? It’s clearly insignificant when looking at the summary of our model!” *this is due to the **principle of marginality** which prohibits us from removing an insignificant variable if said variable has a significant interaction with another, like **gov.support:factor(young.children == 0).**

You might argue that removing ** gov.support** would be beneficial to the simplicity of the model given that

**it’s clearly insignificant**and that the interaction with the young.children == 0 variable is only

**marginally significant (**, however upon further inspection when removing

*p*= 0.435)**from the model the variable-interaction**

*gov.support***splits into two variables**for

**TRUE**and

**FALSE**, thus not giving us any added simplicity, the

**AIC**,

**all other coefficients**as well as the

**null**- and

**residual deviance**stays the

**exact same**and by that account I close to leave it in the model.

Doing an **add1 **and a **drop1 **test on our new and improved model shows us there’re no new interactions that are significant and that all current variables are significant so we’re done! **The final fit is**:

glm(employed == “yes” ~ foreigner

+ gov.support

+ age

+ school.children

+ factor(young.children == 0)

+ factor(school.children == 0)

+ factor(young.children + school.children == 0)

+ I(age^2)

+ foreigner:age

+ foreigner:factor(young.children + school.children == 0)

+ age:school.children

+ gov.support:factor(young.children == 0), family = binomial, data = dat)

**6. Removing outliers**

So now that we have a model we’re satisfied with we can look for outliers that negatively effect the model.

Using the “car” package https://www.rdocumentation.org/packages/car/versions/1.0-2 we can use the influencePlot() and outlierTest() functions to find potential outlier:

We see that the datapoint 416 is classified as an outlier in both tests, we can take a look at the point in a few of our plots to gauge whether or not to remove it:

par(mfrow = c(2, 2)) # 2 row(s), 2 column(s)

plot(fit.4, col=ifelse(as.integer(rownames(dat))==416, “red”, “black”), pch=20)

It seems like this could very well be screwing a bit with our model, note that we should actually be using **pearson residuals** to gauge our models fit so the fact that we don’t have anything close to a straight line in the upper left plot is fine, **Q-Q plots** are irrelevant for this kind of model as well.

Lets try removing the point and take a look at the new fit:

final.fit <- update(fit.4, data = dat[-c(416),])

summary(final.fit)

Down to almost **1000** **AIC** from the original **1067, **this isn’t really a relevant measure of performance when **comparing the AIC of two different sets of data** (since we removed point 416), we would actually have to conclude that 416 was an outlier in the initial model as well, remove it and then compare the AIC value of the initial model without point 416 to our final fit without point 416 as well**.**

Looking at another round of **influencePlot() **and **outlierTest()** we find that **datapoint 329** is acting out as well, however looking at the actual plots we see that we can’t really justify a removal of the data like we could with **416**.** This is our final fit.**

### 7. Model evaluation

So now that we have a final fit where we can’t confidently add or remove any interactions variables and other transformations **it’s time to evaluate if our model actually fits our data and if there’s even a statistically significant difference between our final fit and the first “naive” fit.**

Lets start by taking a look at our fitted values vs. the Pearson residuals:

par(mfrow = c(1, 1)) # 2 row(s), 2 column(s)

plot(p.resid ~ fit.val)

lines(ksmooth(y = p.resid, x = fit.val, bandwidth=.1, kernel=’normal’), col=’blue’)

abline(h=0)

This is a pretty decent fit, lets take a look at the fit for the major explanatory variables as well:

With the exception of ‘*gov.support*’ everything looks quite nice, also the reason for the bend in ‘*gov.support*’ seems to be a single outlier in which someone was granted a substantially lower amount of support compared to all our other points of data.

#### Checking for potential overfitting

**Overfitting **is the bane of all statistical modelling, how can you make sure your model isn’t just fitting to the exact data you fed it? **The goal is to make a model which generalizes** and doesn’t just cater to the current data at hand. So how do we test if our model is overfitting on our data?

A popular metric to test is the delta value generated through **Cross-Validation**, we can calculated these by using the **cv.glm** function from the ‘**boot**’ package and compare our final fit to our first!

cv.glm(dat[-c(416),], final.fit, K = 13)$delta

cv.glm(dat[-c(416),], update(fit.1, data = dat[-c(416),]), K = 13)$delta

In the code above we’re using **k-fold cross-validation **with **k = 13** (since 13 is a factor of 871 which is the length of our data after removing the outlier) this means we’re splitting our data in 13 ‘chunks’.

The delta values is a vector wherein the first component is **the raw cross-validation estimate of prediction error** and the second component is **the adjusted cross-validation estimate** (designed to compensate for the **bias **introduced by not using an exhaustive testing methods such as ** leave-one-out**)

Running the code above yields the following delta-values, note that these are subject to some random variance so you might not get the exact same values:

The prediction error is **lower **for the **final fit**, even when testing with cross-validation. Thus we can assume that our model hasn’t been **overfitting **on our data!

#### The final piece of the puzzle

So now we’ve concluded that our model is actually a pretty decent fit for our data, but is it a statistically significant difference from the “naive” model without any transformations and variable interactions? We can use an **ANOVA **test for this, we just have to remove the same point of data in both fits:

There’s definitely a significant difference between the two fits, we can happily conclude that our hard work has paid off!

### 8. Model visualization

Now that we’ve gotten ourselves a model, how do we actually **visualize **and **interpret **what it says about **the relationships in our data**?

Take a look at the following walk-trough which uses **the same data and model **as this article!:

### Closing remarks

Remember to leave a ‘**clap**’ if you enjoyed the article or want more of the sort, this isn’t eligible for the *“members-program”* so I don’t get paid for your claps, I just want to gauge the interest, thanks!

Please keep in mind that this is **purely introductory** and that **this isn’t an exhaustive analysis or conclusion**! If we were more rigorous in our pursuit we would’ve incorporated **Cross-Validation** tests and **ANOVA** tests on **each new iteration** of our model, ie. whenever we add a new variable, interaction or power-transformation.

Feel free to message me if you have any questions and please correct me if you feel like I missed something or did something wrong, do keep in mind that this is suppose to serve as an introduction to modelling in R, I’m well aware that this process is highly simplified compared to more advanced methods!

If you want to try your luck with this same dataset give it a go here: https://github.com/pela15ae/statmod/blob/master/employment_data.txt

The data is Danish so to convert the headers and categorical values to English run this piece of code:

names(dat) <- c(“employed”, “foreigner”, “gov.support”, “age”, “education”, “young.children”, “school.children”)

levels(dat$employed)[1] <- “yes”

levels(dat$employed)[2] <- “no”

levels(dat$foreigner)[1] <- “yes”

levels(dat$foreigner)[2] <- “no”

Thanks for reading!