Predicting Employee Turnover using R

Employee turnover is a major challenge for companies today, especially when the labor market is competitive and certain skills are in high demand. When an employee leaves, not only is the productivity of that person lost, but the productivity of many others is impacted. Finding replacements can take months of time and effort on the part of hiring managers and recruiting staff, who are then forced to take time away from the work they could be doing. When a replacement is finally found, it takes weeks to months for that employee to be completely onboarded and working at full capacity. By some estimates, it costs about 20% of an employee’s salary to replace a position, and costs can be much higher — over 200% — for highly skilled and educated positions like many in demand today.

Retention of valued employees makes good business sense. Traditional approaches such as employee engagement surveys and proper training for managers are important components of a good workforce planning strategy. Now that ‘big data’ has arrived, the insights it provides are becoming an increasingly valuable part of that strategy.

This blog presents a relatively simple machine learning approach, using R, for harnessing workforce data to understand a company’s employee turnover, and to predict future turnover before it happens so that action can be taken while there is still time.


The goals of this analysis are to:

  • Predict employee turnover before it happens
  • Identify key factors related to employee churn


In this analysis, R libraries are intentionally introduced and loaded at the point they are needed, to make it easier for readers to understand which libraries are required for specific portions of the analysis.

1. Let’s first look at the data

# load data
emp <- read.csv("MFG10YearTerminationData_Kaggle.csv", header = TRUE)
emp$termreason_desc <- as.factor(gsub("Resignaton", "Resignation", emp$termreason_desc)) # correct misspelling in original Kaggle dataset
# basic EDA
dim(emp) # number of rows & columns in data
summary(emp)  # summary stats

Data Summary

First, let’s calculate how many employees leave each year:

# explore status/terminations by year
library(tidyr) # data tidying (e.g., spread)
library(data.table) # data table manipulations (e.g., shift)
library(dplyr) # data manipulation w dataframes (e.g., filter)
status_count <- with(emp, table(STATUS_YEAR, STATUS))
status_count <- spread(data.frame(status_count), STATUS, Freq)
status_count$previous_active <- shift(status_count$ACTIVE, 1L, type = "lag")
status_count$percent_terminated <- 100*status_count$TERMINATED / status_count$previous_active

We can see that from 2006 to 2015 this company had between 4445 and 5215 active employees, and between 105 and 253 terminations. The termination rate jumped from about 2% in 2014 to almost 5% in 2015.

Let’s see the breakdown of employee terminations each year, by termination reason:

# create a dataframe of the subset of terminated employees
terms <- emp %>% filter(STATUS == "TERMINATED")
# plot terminations by reason
library(ggplot2) # plotting
ggplot() +
  geom_bar(aes(y = ..count.., x = STATUS_YEAR, fill = termreason_desc),
           data = terms, position = position_stack()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

A couple of findings jump out. First, we see a spike in layoffs in 2014–15, compared with no layoffs before 2014. The data also show that many of the terminations are retirements, especially in 2006–10. Resignations increase after 2010, pointing to a downward shift in the desirability of working for the company.

2. Modeling — Terminations

Here we use all years before 2015 (2006–14) as the training set, with the last year (2015) as the test set. We will start with a basic CART (Classification and Regression Tree) decision tree:

# select variables to be included in model predicting terminations
term_vars <- c("age","length_of_service","city_name", "department_name","job_title","store_name","gender_full","BUSINESS_UNIT","STATUS")
# import libraries
library(rpart) # CART decision tree models
library(rattle) # graphical interface for data science in R
library(magrittr) # for %>% and %<>% operators
library(rpart.plot) # decision tree plotting
# Partition the data into training and test sets
emp_term_train <- subset(emp, STATUS_YEAR < 2015)
emp_term_test <- subset(emp, STATUS_YEAR == 2015)
set.seed(99) # fix the random seed so that results are repeatable
# Decision tree model
rpart_model <- rpart(STATUS ~ .,
                     data = emp_term_train[term_vars],
                     method = "class",
                     parms = list(split = "information"),
                     control = rpart.control(usesurrogate = 0,
                                             maxsurrogate = 0))
# Plot the decision tree
rpart.plot(rpart_model, roundint = FALSE, type = 3)

Termination Factors

Let’s plot the age distribution of terminated versus active employees:

# plot terminated & active by age
library(caret) # data viz & streamlined training of predictive models,
# including gbm (generalized boosted regression models)
featurePlot(x = emp[, 6], y = emp$STATUS, plot = "density",
            auto.key = list(columns = 2), labels = c("Age (years)", ""))

This plot shows that the majority of terminations are older employees at or near retirement age. But it also shows a peak in resignations among the youngest employees, mainly those in their 20s.

Most companies are interested in identifying the employees, especially their top performers, who are at risk of leaving voluntarily. So, let’s focus the analysis on that employee segment.

3. Modeling — Resignations

# create separate variable for voluntary_terminations
emp$resigned <- ifelse(emp$termreason_desc == "Resignation", "Yes", "No")
emp$resigned <- as.factor(emp$resigned) # convert to factor (from character)

We can see that there are only 385 resignations compared to 49,268 non-resignations.

This is a highly imbalanced dataset, so machine learning models will have difficulty identifying the rare class. For example, a random forest model (not shown, for brevity) run on this data had a recall of 0: none of the employees who resigned in 2015 were correctly identified, so the model completely failed at its goal. There are a variety of ways to adjust for this imbalance, such as up-sampling the minority class, down-sampling the majority class, or using an algorithm to create synthetic data based on feature-space similarities among minority samples (see here for more). Here, we use the ROSE (Random Over Sampling Examples) package to create a more balanced dataset.
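To illustrate one of these alternatives, here is a minimal base-R sketch of down-sampling the majority class. The toy data frame below is hypothetical, used only to show the mechanics:

```r
# Toy imbalanced data: 5 'Yes' rows and 100 'No' rows (hypothetical example)
set.seed(42)
toy <- data.frame(resigned = factor(c(rep("Yes", 5), rep("No", 100))),
                  age = sample(20:65, 105, replace = TRUE))
# Down-sample each class to the size of the smallest class
minority_n <- min(table(toy$resigned))
balanced <- do.call(rbind, lapply(split(toy, toy$resigned),
                                  function(d) d[sample(nrow(d), minority_n), ]))
table(balanced$resigned) # both classes now have 5 rows
```

Down-sampling discards information from the majority class, which is why synthetic-sample methods like ROSE are often preferred on smaller datasets.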

# Subset the data again into train & test sets. Here we use all years before 2015 (2006-14) as the training set, with the last year (2015) as the test set
emp_train <- subset(emp, STATUS_YEAR < 2015)
emp_test <- subset(emp, STATUS_YEAR == 2015)
library(ROSE) # "Random Over Sampling Examples"; generates synthetic balanced samples
emp_train_rose <- ROSE(resigned ~ ., data = emp_train, seed=125)$data
# Tables to show balanced dataset sample sizes
table(emp_train$resigned) # original (imbalanced) training set
table(emp_train_rose$resigned) # ROSE-balanced training set

Random Forest Model

library(randomForest) # random forest modeling
# Select variables (res_vars) for the model to predict 'resigned'
res_vars <- c("age", "length_of_service", "city_name", "department_name",
              "job_title", "store_name", "gender_full", "BUSINESS_UNIT", "resigned")
emp_res_rose_RF <- randomForest(resigned ~ .,
                                data = emp_train_rose[res_vars],
                                ntree = 500, importance = TRUE,
                                na.action = na.omit)
# Plot variable importance
varImpPlot(emp_res_rose_RF, type = 1,
           main = "Variable Importance (Accuracy)",
           sub = "Random Forest Model")
var_importance <- importance(emp_res_rose_RF)
emp_res_rose_RF # view results & confusion matrix

Random Forest Model — Results

However, the real test of a model’s success is its performance on the test dataset because models can ‘overfit’ the training dataset. So, let’s see how well the model predicts employees who resigned in 2015:

# generate predictions based on test data ("emp_test")
emp_res_rose_RF_pred <- predict(emp_res_rose_RF, newdata = emp_test)
confusionMatrix(data = emp_res_rose_RF_pred,
                reference = emp_test$resigned,
                positive = "Yes", mode = "prec_recall")

Here Recall = 77%: 20 of the 26 employees who resigned in 2015 were correctly predicted. But Precision = 0.019, meaning only about 2% of those flagged as ‘at risk’ actually resigned. We’ll discuss these results after running the Gradient Boost model and comparing the two approaches below.
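To make these metrics concrete, here is the arithmetic behind them. The true-positive and false-negative counts come from the results above; the false-positive count is approximate, back-calculated from the reported precision:

```r
tp <- 20   # resignations correctly flagged
fn <- 6    # resignations missed (26 total resignations in 2015)
fp <- 1030 # approximate false positives, back-calculated from precision ~0.019
recall <- tp / (tp + fn)    # 20/26 -> ~0.77
precision <- tp / (tp + fp) # 20/1050 -> ~0.019
round(c(recall = recall, precision = precision), 3)
```

Low precision with decent recall is a common trade-off when a model is rebalanced toward catching a rare class: it casts a wide net and flags many employees who will not actually resign.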

Gradient Boost Model

Gradient boosting is more prone to overfitting (fitting the training set so closely that the model performs worse on other datasets), but it often outperforms random forests, especially when its parameters are tuned appropriately. More details can be found here.

Let’s run a basic Gradient Boost model, and see how it performs when predicting the test dataset:

## Using caret library for gbm on the ROSE balanced dataset
# Results are not shown for sake of brevity
objControl <- trainControl(method = "cv", number = 3,
                           summaryFunction = twoClassSummary,
                           classProbs = TRUE)
emp_res_rose_caretgbm <- train(resigned ~ ., data = emp_train_rose[res_vars],
                               method = "gbm",
                               trControl = objControl,
                               metric = "ROC",
                               preProc = c("center", "scale"))
# summary(emp_res_rose_caretgbm) # outputs variable importance list & plot
emp_res_rose_caretgbm_preds <- predict(object = emp_res_rose_caretgbm,
                                       newdata = emp_test,
                                       type = "raw")
confusionMatrix(data = emp_res_rose_caretgbm_preds,
                reference = emp_test$resigned,
                positive = "Yes", mode = "prec_recall")

Gradient Boost Model — Results

Let’s compare the performance of the Gradient Boost vs Random Forest approaches:

These results underscore the difficulty of predicting rare events, especially those involving something as complex as human decisions. But remember that this is a synthetic dataset, in which only 26 of 4,961 employees (0.5%) resigned in 2015.

In real companies in the US, the voluntary turnover rate was 12.8% in 2016, and as high as 20.7% in some sectors, such as the hospitality industry. While retirements account for some of that turnover, most of it is caused by employees leaving for jobs at other companies.

One third (33%) of leaders at companies with 100+ employees are currently looking for jobs, according to one article on employee retention.

Bottom line: a turnover-risk model like this one should perform better on real data, especially at a company where a segment of employees has high voluntary turnover.

4. Actions to Take

  1. Identify employees most at risk of resigning in the future, so that steps can be taken now to improve employee engagement.
  2. Identify factors causing high levels of resignations, and take corrective action.

Action 1 — Identify ‘At Risk’ Employees

# Calculate predicted probabilities of resigning for each employee
emp_res_rose_RF_pred_probs <- predict(emp_res_rose_RF, emp_test, type = "prob")
Employees_flight_risk <- as.data.frame(cbind(emp_test$EmployeeID,
                                             emp_res_rose_RF_pred_probs))
Employees_flight_risk <- rename(Employees_flight_risk,
                                EmployeeID = V1)
Employees_flight_risk <- arrange(Employees_flight_risk, desc(Yes))

Armed with a list of the employees at greatest risk of resigning, leadership can take steps to get ahead of the problem. For example, we can ask managers of ‘at risk’ employees to recommend what can be done to improve the engagement and experience of their direct report(s). Perhaps they are overdue for a promotion, haven’t been recognized for their work in some time, or are at risk of burnout and need more support. Exploring these cases can often reveal more systemic problems, or issues with specific departments. Discovering these now can save a great deal of future work and cost.
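With the ranked list in hand, pulling out the employees above a chosen risk threshold is a one-liner. A base-R sketch on a toy version of the flight-risk table, with an arbitrary 0.8 cutoff for illustration:

```r
# Toy flight-risk table: EmployeeID with predicted probability of resigning
flight_risk <- data.frame(EmployeeID = 1:6,
                          Yes = c(0.95, 0.91, 0.84, 0.40, 0.22, 0.05))
at_risk <- subset(flight_risk, Yes >= 0.8) # arbitrary example cutoff
at_risk$EmployeeID # employees to discuss with their managers
```

In practice, the cutoff should reflect how many follow-up conversations managers can realistically hold, and how costly a missed resignation is relative to a false alarm.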

Action 2 — Identify Factors Related to Resigning Employees

# plot resigned by age
featurePlot(x = emp[, 6], y = emp$resigned, plot = "density",
            auto.key = list(columns = 2), labels = c("Age (years)", ""))
# plot terminations by reason & job_title
ggplot() +
  geom_bar(aes(y = ..count.., x = job_title, fill = termreason_desc),
           data = terms, position = position_stack()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
# plot terminations by reason & department
ggplot() +
  geom_bar(aes(y = ..count.., x = department_name, fill = termreason_desc),
           data = terms, position = position_stack()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

Action 2 — Results

Resignations are particularly high in certain job_titles, such as Cashier and Shelf Stocker, and in certain departments, such as Customer Service, Dairy, and Processed Foods. This data can lead to direct action:

  • What about these departments and roles leads to high levels of resignations?
  • Are there issues with management?
  • Do employees in these roles have the resources they need to perform at their best?
  • What about the employee experience can be improved?

Take-Home Message

  • Armed with this data, companies can address the main factors causing their best employees to leave, and do something about it now.
  • Armed with predictions of ‘at risk’ employees, managers can take steps to improve their employees’ experiences and engagement before they start thinking about leaving.


The full code can be found on Github here, along with ‘extra code’ for analyses that supported this blog and might be useful to others. For more data science resources, check out my website!

Data Scientist, Health & Wellness Techie, People Analytics geek, PhD in Anthropological Sciences, outdoors enthusiast, dad. Not necessarily in that order.