Human Resources Analytics: Predict Employee Attrition

Analytic methods can improve Human Resources (HR) management in companies with a large number of employees. It is easy to give an example of how companies can benefit from machine learning applied to HR. Let’s assume that training a new employee costs $1000. If we can predict which employee is going to leave next month and offer him/her a bonus program worth $500 to stay for the next 6 months, we come out $500 ahead and keep an experienced, well-trained employee on board, with higher morale.

In this article, I would like to show how to predict employee attrition with machine learning. For the analysis I will use a data set created by IBM data scientists, which is available here. However, I will split it into train and test samples to better explain how machine learning methods can be applied to this problem. The split data is available on my GitHub. The train set represents historical data about employees. Each sample (row) describes an employee with parameters such as age, department, distance from home, marital status, income, and years at the company. You can check the full list of descriptors here. For each employee in the train set the attrition is known (it is a historical value). In the test data the employee descriptors are available, but the attrition is unknown and we want to predict (compute) it with our machine learning model. (To be honest, the attrition values in the test data are available, but for the sake of explanation let’s assume that they are missing.)
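The split into train and test samples can be sketched in plain Python. This is only an illustration and not the procedure used to produce the files on GitHub: the helper `split_rows`, the random seed, and the toy employee table (including its size of 1500 rows) are all hypothetical. The only figure taken from the article is the train set size of 1200.

```python
import random


def split_rows(rows, n_train, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Toy stand-in for the employee table: one dict per employee (made-up values).
employees = [
    {"EmployeeId": i, "Age": 20 + i % 40, "Attrition": "Yes" if i % 6 == 0 else "No"}
    for i in range(1500)
]

train, test = split_rows(employees, n_train=1200)

# For the tutorial, pretend the target is unknown in the test set.
test_features = [{k: v for k, v in row.items() if k != "Attrition"} for row in test]

print(len(train), len(test_features))
```

Fixing the seed makes the split reproducible, so the same employees always end up in the same set.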

There are 1200 samples (employees) in the train data. For model training we will use MLJAR, which lets us create machine learning models in the browser (no installation required!). We start with project set-up: we set the project title and the task, binary classification (we predict yes/no for attrition).

Create new awesome project :)

We need to add our data sets: train and test. For the test set, we should tick the checkbox ‘This dataset will be used only for prediction’.

Upload of training (historical) data
Upload of test data. We will use it for predictions. Please remember to mark it as ‘This dataset will be used only for predictions’

For each data set we need to select which column (attribute) will be the target column; in our case it is Attrition. Additionally, we can check the data distribution for all attributes. After selecting the target column, please click ‘Accept column usage’. OK, now we are ready to start training machine learning models.

Select target attribute: Attrition and accept attributes usage

Let’s go to the Experiments view and click Add new experiment. In the dialog we need to select:

  • Input dataset: we will use the train set
  • Algorithms: we will use Extreme Gradient Boosting, LightGBM and Random Forest. We will also create an ensemble from our models (it will build an even stronger model)
  • Metric: we will select Area Under ROC Curve (the higher the value, the better)

When everything is set, we can click Create & Start and wait till all models are trained (it can take some time).
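To make the chosen metric concrete: the Area Under the ROC Curve equals the probability that a randomly chosen employee who left gets a higher predicted score than a randomly chosen employee who stayed, with ties counted as one half. The sketch below computes it directly from that definition in plain Python. The function name and the toy labels and scores are made up for illustration, and this is not necessarily how MLJAR computes the metric internally.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    Runs in O(n_pos * n_neg), which is fine for a small illustration.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = employee left, 0 = employee stayed (made-up values).
y_true = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
print(roc_auc(y_true, y_score))
```

A value of 1.0 means every leaver is ranked above every stayer, while 0.5 is what random scores would give, which is why a higher value is better.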

Our Machine Learning experiment

After some time the models are trained. We can easily compare the different ML algorithms and use the model with the highest score; in our case it is the Ensemble model.

Before making predictions, let’s check the feature importance for the models in the Feature Analysis view. The feature importance for the best single model is presented below:

From the feature importance we get some insight into the key factors behind employee attrition. For example, in the selected model, the amount of business travel was the most important feature for deciding about attrition.

Let’s use the model for predictions!

Now it is time to use our machine learning model to compute attrition for the test data, which represents employees whose attrition is unknown (we assume it is unknown for the purposes of this tutorial ;))

Compute predictions!

OK, but we do know the true attrition for the test samples, so we can inspect how our model is doing. I made a table with two columns:

  • the first column contains the predictions from the model
  • the second column contains the true attrition (known from the test data)

I sorted the rows (employees) by descending predicted value; below are a few top and bottom rows.

A few top and bottom rows from the table, sorted by descending predicted value.
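A table like this can be assembled in a few lines of Python. The predictions and true labels below are made-up placeholders, not the values from the experiment, and the variable names are hypothetical.

```python
# Made-up predicted scores and true attrition labels for six employees.
predictions = [0.12, 0.87, 0.45, 0.03, 0.91, 0.30]
true_attrition = ["No", "Yes", "No", "No", "Yes", "Yes"]

# Pair each prediction with its true label, then sort by prediction, highest first.
table = sorted(zip(predictions, true_attrition), key=lambda row: row[0], reverse=True)

top_n = 2
print("prediction  attrition")
for pred, truth in table[:top_n]:
    print(f"{pred:10.2f}  {truth}")
print("       ...")
for pred, truth in table[-top_n:]:
    print(f"{pred:10.2f}  {truth}")
```

If the model ranks well, most "Yes" labels should gather near the top of the sorted table and most "No" labels near the bottom.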

You can see that you need to take care of employees with high predicted values, because they are likely to leave your company in the near future. A natural question is how many employees from the top of the list we should take care of … and there is no single answer, because it depends on the cost of hiring versus the cost of keeping an employee.
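One way to turn the hire-versus-keep trade-off into a cut-off is an expected-value calculation. The sketch below is a simplification, not a recommendation: it assumes the predicted value can be read as the probability of leaving, that the bonus always keeps the employee, and it reuses the $1000 and $500 figures from the introduction. The function names and the example predictions are hypothetical.

```python
def expected_gain(p_leave: float, replace_cost: float, bonus_cost: float) -> float:
    """Expected saving from offering the bonus to one employee.

    With probability `p_leave` the bonus avoids `replace_cost`;
    the bonus itself is paid in every case.
    """
    return p_leave * replace_cost - bonus_cost


def employees_to_target(predictions, replace_cost=1000.0, bonus_cost=500.0):
    """Return indices of employees whose expected gain is positive,
    ordered from the highest predicted value to the lowest."""
    ranked = sorted(range(len(predictions)), key=lambda i: predictions[i], reverse=True)
    return [i for i in ranked if expected_gain(predictions[i], replace_cost, bonus_cost) > 0]


# Made-up predicted probabilities of leaving for six employees.
predictions = [0.12, 0.87, 0.45, 0.03, 0.91, 0.30]
print(employees_to_target(predictions))
```

Under these assumptions the break-even point is a predicted probability of 0.5 (bonus cost divided by replacement cost); a cheaper bonus or a more expensive replacement lowers the cut-off and lengthens the list.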

In this article you have learned how to build a predictive model for employee attrition. This example did not cover data collection and cleaning, which are crucial for building a good machine learning model. Once the data is collected and cleaned, the machine learning part can be done easily with MLJAR :)