Predicting Employee Turnover using R

Employee turnover is a major challenge for companies today, especially when the labor market is competitive and certain skills are in high demand. When an employee leaves, not only is that person's productivity lost, but the productivity of many others is impacted. Finding a replacement can take months of time and effort from hiring managers and recruiting staff, who are forced to take time away from the work they could otherwise be doing. When a replacement is finally found, it takes weeks to months before that employee is fully onboarded and working at full capacity. By some estimates, it costs about 20% of an employee's salary to replace a position, and the cost can be much higher (over 200%) for the kinds of highly skilled and educated positions in demand today.

Retention of valued employees makes good business sense. Traditional approaches such as employee engagement surveys and proper training for managers are important components of a good workforce planning strategy. Now that ‘big data’ has arrived, the insights it provides are becoming an increasingly valuable part of that strategy.

This blog presents a relatively simple machine learning approach, using R, to harness workforce data to understand a company's employee turnover and to predict future turnover before it happens, so that action can be taken now, before it's too late.

GOALS:

  • Create a model that accurately predicts which employees will leave
  • Identify key factors related to employee churn

Dataset

The dataset contains fictitious data on terminations, from the Employee Attrition Kaggle competition. For each of 10 years, it shows which employees are active and which were terminated. Lots of details and ways to explore the data can be found in Lyndon Sundmark's fine tutorial.

In this analysis, R libraries are intentionally introduced and loaded at the point they are needed, to make it easier for readers to understand which libraries are required for specific portions of the analysis.

1. Let’s first look at the data

# load data
emp <- read.csv("MFG10YearTerminationData_Kaggle.csv", header = TRUE)
emp$termreason_desc <- as.factor(gsub("Resignaton", "Resignation", emp$termreason_desc)) # correct misspelling in original Kaggle dataset
# basic EDA
dim(emp) # number of rows & columns in data
summary(emp)  # summary stats

Data Summary

The dim function (dimension, above) shows that the dataset has 49,653 rows and 18 columns. The summary statistics reveal that there are about 7,000 employee IDs with records across years from 2006–15. The variables include hire and termination dates; birthdate, age, and gender; length of service; city, store, and department names; job titles; and status, status year, and termination type and reason. This list of variables is more limited than typically available to companies, but gives us enough to build and test some models.
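
If you'd also like to see each column's type alongside the summary, a quick optional check on the same emp dataframe lays them out:

# compact structure: class and first few values of every column
str(emp)
length(unique(emp$EmployeeID)) # roughly 7,000 distinct employees
range(emp$STATUS_YEAR)         # records span 2006-2015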

First, let’s calculate how many employees leave each year:

# explore status/terminations by year
library(tidyr) # data tidying (e.g., spread)
library(data.table) # data table manipulations (e.g., shift)
library(dplyr) # data manipulation w dataframes (e.g., filter)
status_count <- with(emp, table(STATUS_YEAR, STATUS))
status_count <- spread(data.frame(status_count), STATUS, Freq)
status_count$previous_active <- shift(status_count$ACTIVE, 1L, type = "lag")
status_count$percent_terminated <- 100*status_count$TERMINATED / status_count$previous_active
status_count

We can see that from 2006 to 2015 this company had between 4,445 and 5,215 active employees, and between 105 and 253 terminations per year. The termination rate jumped from about 2% in 2014 to almost 5% in 2015.

Let’s see the breakdown of employee terminations each year, by termination reason:

# create a dataframe of the subset of terminated employees
terms <- as.data.frame(emp %>% filter(STATUS=="TERMINATED"))
# plot terminations by reason
library(ggplot2)
ggplot() +
  geom_bar(aes(y = ..count.., x = STATUS_YEAR, fill = termreason_desc),
           data = terms, position = position_stack()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

A couple of findings jump out. First, we see a spike in layoffs in 2014–15, compared with no layoffs before 2014. The data also show that many of the terminations are retirements, especially in 2006–10. Resignations increase after 2010, pointing to some downward shift in the desirability of working for the company.

2. Modeling — Terminations

OK, let’s start modeling to see how well we can predict terminations. We begin by selecting the variables to include in the model (as ‘term_vars’).

Here we use all years before 2015 (2006–14) as the training set, with the last year (2015) as the test set. We will start with a basic CART (Classification and Regression Tree) decision tree:

# select variables to be included in model predicting terminations
term_vars <- c("age", "length_of_service", "city_name", "department_name",
               "job_title", "store_name", "gender_full", "BUSINESS_UNIT", "STATUS")
# load libraries
library(rattle)     # graphical interface for data science in R
library(magrittr)   # for %>% and %<>% operators
library(rpart)      # recursive partitioning for decision trees
library(rpart.plot) # decision tree plotting
# Partition the data into training and test sets
emp_term_train <- subset(emp, STATUS_YEAR < 2015)
emp_term_test <- subset(emp, STATUS_YEAR == 2015)
set.seed(99) # set the random seed so that results are repeatable
# Decision tree model
rpart_model <- rpart(STATUS ~ .,
                     data = emp_term_train[term_vars],
                     method = "class",
                     parms = list(split = "information"),
                     control = rpart.control(usesurrogate = 0,
                                             maxsurrogate = 0))
# Plot the decision tree
rpart.plot(rpart_model, roundint = FALSE, type = 3)

Termination Factors

Age is the most important variable, largely because many terminations are retirements of employees 65 years or older. Among employees under 65, men 60 or older were the most likely to leave, many through the layoffs of 2014–15.
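
As a quick, optional sanity check (not part of the original flow), we can score the tree on the 2015 hold-out set. This is a minimal sketch that assumes the rpart_model and emp_term_test objects created above:

library(caret) # for confusionMatrix()
rpart_pred <- predict(rpart_model, newdata = emp_term_test[term_vars], type = "class")
confusionMatrix(data = rpart_pred,
                reference = factor(emp_term_test$STATUS, levels = levels(rpart_pred)),
                positive = "TERMINATED")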

Let’s plot the age distribution of terminated versus active employees:

# plot terminated & active by age
library(caret) # data viz, streamlined training of predictive models, and
               # ML packages including gbm (generalized boosted regression models)
# column 6 of emp is age
featurePlot(x = emp[, 6], y = emp$STATUS, plot = "density",
            auto.key = list(columns = 2), labels = c("Age (years)", ""))

This plot shows that the majority of terminations involve older employees at or near retirement age. But it also shows a peak among the youngest employees, mainly those in their 20s; as we'll see below, these are largely resignations.

Most companies are interested in identifying the employees, especially their top performers, who are at risk of leaving voluntarily. So let's focus the analysis on that employee segment.

3. Modeling — Resignations

To predict future resignations (voluntary terminations), we need to create a ‘resigned’ variable:

# create separate variable for voluntary_terminations
emp$resigned <- ifelse(emp$termreason_desc == "Resignation", "Yes", "No")
emp$resigned <- as.factor(emp$resigned) # convert to factor (from character)
summary(emp$resigned)

We can see that there are only 385 resignations compared to 49,268 non-resignations.

This is a highly imbalanced dataset, so machine learning models will have difficulty identifying the rare class. For example, a random forest model (not shown, for brevity) run on this data had a recall of 0, meaning the model failed completely: none of the employees who resigned in 2015 were correctly identified. There are a variety of options for adjusting for this imbalance, such as up-sampling the minority class, down-sampling the majority class, or using an algorithm to create synthetic data based on feature-space similarities among minority samples (read here for more); a quick sketch of the simpler resampling alternatives follows the ROSE code below. Here, we use the ROSE (Random Over Sampling Examples) package to create a more balanced dataset.

# Subset the data again into train & test sets: all years before 2015
# (2006-14) as the training set, and the last year (2015) as the test set
emp_train <- subset(emp, STATUS_YEAR < 2015)
emp_test <- subset(emp, STATUS_YEAR == 2015)
library(ROSE) # "Random Over Sampling Examples"; generates synthetic, balanced samples
emp_train_rose <- ROSE(resigned ~ ., data = emp_train, seed = 125)$data
# Table to show the balanced dataset sample sizes
table(emp_train_rose$resigned)
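
ROSE is just one option. For comparison, here is a quick sketch of the simpler alternatives mentioned above, using caret's upSample() and downSample() on the same emp_train data (shown for illustration only; the ROSE-balanced data is what we use below):

library(caret) # provides upSample() & downSample()
predictors <- emp_train[, setdiff(names(emp_train), "resigned")]
up_train   <- upSample(x = predictors, y = emp_train$resigned, yname = "resigned")
down_train <- downSample(x = predictors, y = emp_train$resigned, yname = "resigned")
table(up_train$resigned)   # both classes grown to the majority-class size
table(down_train$resigned) # both classes shrunk to the minority-class size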

Random Forest Model

Now that the dataset is relatively balanced between resigned and non-resigned employees, let's run a Random Forest model. Random Forest is a popular approach for a variety of reasons. In brief, it randomly samples the data, builds a decision tree on that sample, and repeats the process until many (here, 500) trees have been generated. Together the trees form a 'forest', and the model outputs the mode of the classes predicted by the individual trees. More details on the random forest approach can be found here.

library(randomForest) # random forest modeling
# Select variables (res_vars) for the model to predict 'resigned'
res_vars <- c("age", "length_of_service", "city_name", "department_name",
              "job_title", "store_name", "gender_full", "BUSINESS_UNIT", "resigned")
set.seed(222)
emp_res_rose_RF <- randomForest(resigned ~ .,
                                data = emp_train_rose[res_vars],
                                ntree = 500, importance = TRUE,
                                na.action = na.omit)
varImpPlot(emp_res_rose_RF, type = 1,
           main = "Variable Importance (Accuracy)",
           sub = "Random Forest Model")
var_importance <- importance(emp_res_rose_RF)
emp_res_rose_RF # view results & confusion matrix

Random Forest Model — Results

This model does a fairly good job. It correctly identified about 84% (Recall = 0.836) of employees who resigned. With Precision = 0.809, most of the employees the model identified as likely to resign were true positives (they did resign).
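
For reference, both metrics fall straight out of the model's out-of-bag confusion matrix; here is a minimal sketch using the emp_res_rose_RF object created above:

cm <- emp_res_rose_RF$confusion[, 1:2]           # rows = actual, columns = predicted
recall    <- cm["Yes", "Yes"] / sum(cm["Yes", ]) # TP / (TP + FN)
precision <- cm["Yes", "Yes"] / sum(cm[, "Yes"]) # TP / (TP + FP)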

However, the real test of a model’s success is its performance on the test dataset because models can ‘overfit’ the training dataset. So, let’s see how well the model predicts employees who resigned in 2015:

# generate predictions based on test data ("emp_test")
emp_res_rose_RF_pred <- predict(emp_res_rose_RF, newdata = emp_test)
confusionMatrix(data = emp_res_rose_RF_pred,
                reference = emp_test$resigned,
                positive = "Yes", mode = "prec_recall")

Here, Recall = 77%: 20 out of 26 employees who resigned in 2015 were correctly predicted. But Precision = 0.019, so only about 2% of those identified as 'at risk' actually resigned. We'll discuss these results after running the Gradient Boost model and comparing the two approaches below.

Gradient Boost Model

We use Gradient Boost as our second modeling approach. Like Random Forest, it is an ensemble method that builds individual decision trees, but it does so very differently. Instead of building full decision trees on data subsamples in parallel, Gradient Boost builds models sequentially: after the first decision trees are built, the next trees focus their training on the observations that earlier trees predicted poorly. Unlike Random Forest trees, Gradient Boost decision trees can be very shallow, even a single split. The method then combines these weak learners into a strong overall prediction.

Gradient Boost is more prone to overfitting (where the model fits the training set too closely and performs worse on other datasets), but it often outperforms Random Forest, especially when its parameters are tuned appropriately. More details can be found here.
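
For example, caret can search over gbm's main tuning parameters via a tuning grid. The grid below is a sketch with illustrative values (we don't use it in the basic model that follows):

# hypothetical tuning grid for gbm; pass to train() via tuneGrid = gbm_grid
gbm_grid <- expand.grid(interaction.depth = c(1, 3, 5), # depth of each tree
                        n.trees = c(100, 300, 500),     # number of boosting iterations
                        shrinkage = c(0.01, 0.1),       # learning rate
                        n.minobsinnode = 10)            # min observations per terminal node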

Let’s run a basic Gradient Boost model, and see how it performs when predicting the test dataset:

## Use the caret library for gbm on the ROSE-balanced dataset
# Results are not shown for the sake of brevity
set.seed(432)
objControl <- trainControl(method = "cv", number = 3,
                           returnResamp = "none",
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE)
emp_res_rose_caretgbm <- train(resigned ~ ., data = emp_train_rose[res_vars],
                               method = "gbm",
                               trControl = objControl,
                               metric = "ROC",
                               preProc = c("center", "scale"))
# summary(emp_res_rose_caretgbm) # outputs variable importance list & plot
emp_res_rose_caretgbm_preds <- predict(object = emp_res_rose_caretgbm,
                                       emp_test[res_vars],
                                       type = "raw")
confusionMatrix(data = emp_res_rose_caretgbm_preds,
                reference = emp_test$resigned,
                positive = "Yes", mode = "prec_recall")

Gradient Boost Model — Results

16 of the 26 employees in the test dataset who resigned were correctly identified (Recall = 61.5%). Of the 1,033 employees identified as at risk, 16 resigned (Precision = 1.55%).

Let’s compare the performance of the Gradient Boost vs Random Forest approaches:
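
# side-by-side comparison of 2015 test-set performance
# (values taken from the confusionMatrix outputs above)
data.frame(Model     = c("Random Forest", "Gradient Boost"),
           Recall    = c(0.770, 0.615),   # 20/26 vs 16/26 resignations caught
           Precision = c(0.019, 0.0155))  # share of flagged employees who resigned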

These results underscore the difficulty of predicting rare events, especially those involving something as complex as human decisions. But remember that this is a fictitious dataset, in which only 26 out of 4,961 employees resigned in 2015, or 0.5%.

In real companies in the US, the voluntary turnover rate was 12.8% in 2016, and as high as 20.7% in some sectors, such as the hospitality industry. While retirements account for some of that turnover, most of it is caused by employees leaving for jobs at other companies.

One-third (33%) of leaders at companies with 100-plus employees are currently looking for jobs, according to one article on employee retention.

Bottom line: the turnover-risk model should work even better on 'real' data, especially at a company where some segment has high voluntary turnover.

4. Actions to Take

Armed with these results, at least two actions can be taken to improve employee experience and retention:

  1. Identify employees most at risk of resigning in the future, so that steps can be taken now to improve employee engagement.
  2. Identify factors causing high levels of resignations, and take corrective action.

Action 1 — Identify ‘At Risk’ Employees

Let’s use the Random Forest model (which performed better) to identify the active employees most likely to resign in the future (here, the 2015 employees):

# Calculate predicted probabilities of resigning for each employee
emp_res_rose_RF_pred_probs <- predict(emp_res_rose_RF, emp_test, type = "prob")
Employees_flight_risk <- as.data.frame(cbind(emp_test$EmployeeID,
                                             emp_res_rose_RF_pred_probs))
Employees_flight_risk <- rename(Employees_flight_risk, EmployeeID = V1)
Employees_flight_risk <- arrange(Employees_flight_risk, desc(Yes))
head(Employees_flight_risk)

Armed with a list of the employees at greatest risk of resigning, leadership can take steps to get ahead of the problem. For example, we can ask the managers of 'at risk' employees to recommend what can be done to improve the engagement and experience of their direct report(s). Perhaps an employee is overdue for a promotion, hasn't been recognized for their work in some time, or is at risk of burnout and needs more support. Exploring these cases can often reveal more systemic problems, or issues within specific departments. Discovering them now can save a great deal of future work and cost.

Action 2 — Identify Factors Related to Resigning Employees

The Gradient Boost model uncovered specific job titles and department names, as well as age and length of service, as factors that weighed heavily in the model. Here are the top dozen factors, selected from the longer list generated by summary(emp_res_rose_caretgbm) above. Following that, let's plot how resignations vary by age, job_title, and department_name.

# plot resigned by age (column 6 of emp is age)
featurePlot(x = emp[, 6], y = emp$resigned, plot = "density",
            auto.key = list(columns = 2), labels = c("Age (years)", ""))
# plot terminations by reason & job_title
ggplot() +
  geom_bar(aes(y = ..count.., x = job_title, fill = termreason_desc),
           data = terms, position = position_stack()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
# plot terminations by reason & department
ggplot() +
  geom_bar(aes(y = ..count.., x = department_name, fill = termreason_desc),
           data = terms, position = position_stack()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

Action 2 — Results

We can see from these plots that resignations tend to come from younger employees, mainly those in their 20s.

Resignations are particularly high in certain job_titles, such as Cashier and Shelf Stocker, and in certain departments, such as Customer Service, Dairy, and Processed Foods. This data can lead to direct action:

  • What about these departments and roles leads to high levels of resignations?
  • Are there issues with management?
  • Do employees in these roles have the resources they need to perform at their best?
  • What about the employee experience can be improved?

Take-Home Message

This post shows a relatively simple example of how to develop an employee turnover model focused on employees at risk of leaving a company. Real employee datasets typically have higher voluntary turnover rates and many more variables available for developing a model. They are likely to identify real factors leading to unnecessarily high — and costly — turnover.

  • Armed with this data, companies can address the main factors causing their best employees to leave, and do something about it now.
  • Armed with predictions of ‘at risk’ employees, managers can take steps to improve their employees’ experiences and engagement before they start thinking about leaving.

Win-win!

If you found this useful, please show your support by clapping below and share it with others. Comments, questions, and suggestions are welcomed!

The full code can be found on Github here, along with ‘extra code’ for analyses that supported this blog and might be useful to others. For more data science resources, check out my website!