Predictive Employee Turnover Analytics Part I

A new data set named Human Resources Analytics appeared on Kaggle (https://www.kaggle.com/ludobenistant/hr-analytics). It is a good opportunity to explore how different machine learning tools compare at predicting a binary variable. More importantly, employee turnover is a situation that every firm will run into sooner or later, so it is worth treating this as a real business analysis case. Tableau makes it easy to take a look at the data set that is both quick and deep, which in turn helps me decide which machine learning tools to use later. The first time I worked on it, I started with machine learning and ended with Tableau. The ideal process, however, is to look at the raw data in a direct and intuitive way first, so I really should have started with Tableau and then moved on to statistics.

A quick scan of data

Okay, let’s get started. The whole data set contains 14,999 rows and 10 columns. The “left” column records whether an employee has left. The “sales” column tells us which department an employee belongs to. All other columns are directly or indirectly related to the choice made by the employee.

A first look at the data

I chose to calculate the correlation matrix as the first step. It is easy to do in R or Excel, and it gives a first impression of how the variables correlate with each other. Since I know that they are actually related to each other, it is safe to include all of them in the analysis.
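For readers who prefer code to Excel, the correlation step can be sketched in plain Python. This is only a minimal illustration: the column names `satisfaction` and `evaluation` are my assumptions, and the numbers are toy values invented for the example, not rows from the Kaggle file.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of columns."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns}
            for a in columns}

# Toy values invented for illustration; NOT rows from the Kaggle file.
# "satisfaction" and "evaluation" are assumed column names.
columns = {
    "satisfaction": [0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92],
    "evaluation":   [0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77, 0.85],
    "left":         [1, 0, 1, 0, 1, 1, 1, 0],
}

matrix = correlation_matrix(columns)

# Print the row that belongs to the response variable "left".
for name, r in matrix["left"].items():
    print(f"left vs {name:<12} {r:+.3f}")
```

With a data-frame library the same matrix is usually a one-liner, but the hand-written version makes it clear what each cell of the matrix measures.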

Correlation matrix

The row highlighted in yellow contains the correlations with the response. Even though some of the numbers are very small, as low as 0.006567, they are still statistically significant. I will skip the chi-square test here. In the next step, I will go to Tableau to see whether there is any business intuition behind the story.

Who is leaving the company?

Fraction of leaving in each department

This chart gives the audience a general view of what the firm looks like and where the turnover comes from. The dark blue part stands for employees who left, recorded numerically as 1 (and 0 otherwise). The overall turnover ratio is 0.238, and the ratio in each department is quite close to the overall one. Sales, technical, and support are the three biggest departments, and most of the turnover happened there.
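The numbers behind a chart like this are a simple group-by: headcount, number of leavers, and their ratio for each department. Here is a rough sketch, assuming each row is a (department, left) pair with left coded as 0 or 1. The rows are toy values I made up, so the ratios will not match the real data.

```python
from collections import defaultdict

def turnover_by_department(rows):
    """Map department -> (headcount, leavers, turnover ratio)."""
    headcount = defaultdict(int)
    leavers = defaultdict(int)
    for department, left in rows:
        headcount[department] += 1
        leavers[department] += left  # left is 1 for a leaver, 0 otherwise
    return {d: (headcount[d], leavers[d], leavers[d] / headcount[d])
            for d in headcount}

# Toy rows invented for illustration; NOT taken from the data set.
rows = [
    ("sales", 1), ("sales", 0), ("sales", 0), ("sales", 0),
    ("technical", 1), ("technical", 0), ("technical", 0),
    ("support", 0), ("support", 1),
    ("hr", 0),
]

by_dept = turnover_by_department(rows)
overall = sum(left for _, left in rows) / len(rows)

for dept, (n, gone, ratio) in by_dept.items():
    print(f"{dept:<10} headcount={n} leavers={gone} ratio={ratio:.2f}")
print(f"overall ratio={overall:.2f}")
```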

How subjective feelings and objective performance decide the choice

We will look at how personal satisfaction and employee evaluation affect the choice to leave.

The x-axis shows satisfaction level and the y-axis shows performance evaluation. Grey dots represent a lower turnover ratio, and the redder the dot, the higher the ratio. Notice that there are three main red clusters in this graph.

The first one is in the upper left corner. It contains high-performing but less satisfied people. Broadly speaking, people with a satisfaction level below 0.11 are highly likely to leave. Also notice that when satisfaction is lower than 0.11, very few people are doing a great job. This group is marked as “highly likely to leave” in my alert system, which HR can use to identify people who are considering not coming back tomorrow.

The second cluster sits left of center, in the lower-middle part of the graph. It falls into the region where satisfaction is between 0.36 and 0.46 and evaluation is between 0.45 and 0.57. It is hard to find a reason that explains these people's motivation, because they do not seem distinguishable from their neighbors. Another thing to note is that the dots become really dense in this area. This needs further investigation, since I have no idea why it is happening. The only clue from the data is that the dots become denser here, yet the other big dense region to its right, which stretches out to the upper right corner, does not show a similar pattern.

Within that big dense region lies the third cluster, a part with a slightly higher probability of turnover than its surroundings. Why do these people want to leave even though they are highly satisfied with their work and work hard? Although they are not “totally” satisfied like the people to their right, they are also not like the people on the left. They perform well, of course, but how does that lead them to leave? Perhaps these two metrics are not enough to explain the behavior of people who leave. There might be a bigger picture that I am missing here.

Another question to ask is why there should be clusters at all. I can understand that low satisfaction creates the drive to leave. But when a person is surrounded by a group of people who are not as satisfied and hard-working as he is, why is he more inclined to leave? This is not an easy question to answer, so I will leave it here.

Next I will look at individual departments to gain more detailed insight into the company. Because each department produces a graph similar to the overall one above, I will show only the HR department.
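The red clusters in the scatter plot can also be located numerically by cutting the satisfaction and evaluation axes into square cells and computing the turnover ratio in each cell. The sketch below assumes both scores lie between 0 and 1 and uses a cell width of 0.1, which is an arbitrary choice on my part. The rows are invented toy values that only mimic the three clusters.

```python
from collections import defaultdict

def bin_index(value, width_pct=10):
    """Index of the bin holding a 0..1 score, using bins of width_pct percent.
    Working in whole percent avoids float surprises such as 0.3 / 0.1 < 3."""
    return round(value * 100) // width_pct

def turnover_grid(rows, width_pct=10):
    """Map (satisfaction bin, evaluation bin) -> (headcount, turnover ratio)."""
    cells = defaultdict(lambda: [0, 0])  # cell -> [headcount, leavers]
    for satisfaction, evaluation, left in rows:
        cell = (bin_index(satisfaction, width_pct), bin_index(evaluation, width_pct))
        cells[cell][0] += 1
        cells[cell][1] += left
    return {cell: (n, gone / n) for cell, (n, gone) in cells.items()}

# Toy rows (satisfaction, evaluation, left), invented for illustration only.
rows = [
    (0.05, 0.85, 1), (0.09, 0.88, 1),                   # unhappy high performers
    (0.40, 0.50, 1), (0.42, 0.55, 1), (0.45, 0.52, 0),  # lower-middle group
    (0.80, 0.90, 0), (0.85, 0.95, 0), (0.82, 0.91, 1),  # happy high performers
]

grid = turnover_grid(rows)

# List the cells from the highest to the lowest turnover ratio.
for cell, (n, ratio) in sorted(grid.items(), key=lambda kv: -kv[1][1]):
    print(f"cell={cell} headcount={n} ratio={ratio:.2f}")
```

Reporting the headcount next to the ratio matters here, because a cell with very few people can show an extreme ratio by chance.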

The graph above replicates the pattern to a large extent. It is quite surprising that people in different departments share similar motivations, at least where turnover is concerned. Is it possible that the IT or technical department is more sedentary than sales and management, so that the mentality and how people look at each other could be a little different? The data do not tell us. Since this is simulated data, maybe the data itself is biased. Or possibly there are things that have long been hidden from people. After all, the goal of using data is to find underappreciated secrets.

How salary and workload impact the choice

Work and get paid: these are two sides of the same coin. It is time to look at the data from this angle.

The three long shapes composed of circles contain information for salary levels from low to high. On each shape, the greener the circle, the higher the turnover ratio, and the larger the circle, the higher the total number of leavers. Each circle stands for a group of people who work the same number of hours per month.
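Each circle in these shapes boils down to two numbers for one salary level and one monthly-hours group: the turnover ratio (the color) and the number of leavers (the size). A minimal sketch of that aggregation follows. The bucket width of 10 hours and the toy rows are my own assumptions for illustration, not values read from the data set.

```python
from collections import defaultdict

def circles_by_salary_and_hours(rows, bucket_width=10):
    """Map (salary level, hours bucket start) -> (turnover ratio, leavers)."""
    groups = defaultdict(lambda: [0, 0])  # key -> [headcount, leavers]
    for salary, monthly_hours, left in rows:
        bucket = monthly_hours // bucket_width * bucket_width
        groups[(salary, bucket)][0] += 1
        groups[(salary, bucket)][1] += left
    return {key: (gone / n, gone) for key, (n, gone) in groups.items()}

# Toy rows (salary level, monthly hours, left), invented for illustration.
rows = [
    ("low", 130, 1), ("low", 135, 1), ("low", 200, 0), ("low", 205, 0),
    ("low", 295, 1),
    ("medium", 132, 0), ("medium", 138, 1), ("medium", 300, 1),
    ("high", 200, 0), ("high", 305, 1),
]

circles = circles_by_salary_and_hours(rows)

for (salary, bucket), (ratio, gone) in sorted(circles.items()):
    print(f"{salary:<6} {bucket}-{bucket + 9} h: ratio={ratio:.2f} leavers={gone}")
```

Keeping the ratio and the count separate mirrors the chart, where a small circle can still be very green.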

First we will discuss the company level. There are two discussions for each graph: one on the likelihood of turnover, the other on the number of leavers.

Likelihood. At the low salary level, people at the two ends are very likely to leave, meaning that people who overwork or underwork and are paid poorly quit more frequently. In extreme cases, the turnover rate reaches 100% when people work extreme overtime (above 290 h/month). When salary rises to the medium level, the same groups are again more inclined to leave, but this time people at the lower end are less likely to leave than their low-salary counterparts. Note, however, that people who still work an extremely large number of hours will almost certainly leave the company. As salary goes up to the high level, fewer people quit, while the same pattern can still be seen.

Number. People earning less account for more of the turnover, in part because the salary structure is presumably shaped like a pyramid, with the most people at the low level.

Next I will go down to the department level. Accounting comes first.

This graph looks very “green” and different from the company graph. Wherever a circle contains people who leave, the likelihood for that circle is very high. People with a medium salary quit a lot, and almost with certainty. Then comes the HR department.

Another “green” and distinct graph. Then comes IT.

A more colorful picture, but still different. At the low salary level, an employee who does not work a moderate number of hours will probably quit. But as that employee earns more, quitting becomes much less likely unless he or she works extremely long or extremely short hours. Next is management.

It is easy to describe what is happening, but do notice the differences. As the next step, I would like to show three graphs together to showcase a very important and general aspect of data analytics.

These three are sales, support, and technical, in that order. They look quite different from the graphs above them but similar to the company-level graph. This explains why the earlier department graphs look different from the company-level one: in the aggregate they are overshadowed by these three. Remember that at the beginning I said these three departments are the three biggest in the company, so of course they largely determine what the company-level graph looks like. If I had not looked deeper, the department-specific patterns would never have been unearthed. A data analyst can never afford to miss the details.

The following part focuses on how promotion affects the choice to leave. The graph comes first:

Among people who stayed, the promotion rate is as high as 2.6%, visibly beating the promotion rate of less than 0.6% among people who quit. Statistics can tell whether the margin is significant, but I will not do that here. The graph above shows only a correlation, not a causal relationship. I also did this analysis at the department level, but I will not show it here; IT and marketing are very different in this respect.
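For completeness, one common way to check such a margin is a two-proportion z-test. That choice of test is my assumption, not something this article actually ran. The counts below are hypothetical, picked only to roughly match the 2.6% and sub-0.6% rates, and are not read from the data set.

```python
from math import sqrt, erfc

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions.
    Returns (z statistic, p-value), using the pooled standard error."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Hypothetical counts: promoted / total among stayers and among leavers.
promoted_stayers, stayers = 260, 10_000   # 2.6% promoted
promoted_leavers, leavers = 15, 3_000     # 0.5% promoted

z, p_value = two_proportion_z_test(promoted_stayers, stayers,
                                   promoted_leavers, leavers)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.3g}")
```

Even a significant result would only confirm the difference in rates. It would not show that promotion causes people to stay.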

I also looked at the role work accidents play, but I will not cover it in this article because it is similar to the promotion analysis.

Tableau made for a great preliminary study, uncovering hidden patterns in the turnover data. Personally, I have learned a lot about how to use Tableau, and it really differs from doing research with statistics alone in at least two respects. One, Tableau is more powerful and quicker to use when I need to break the data set down repeatedly: it is easy to toggle, add, and change things instantly. Two, Tableau connects me with the data in a visual, sensory way. Sometimes it looks pretty and lovely.

In Part II, I will use logistic regression, KNN, and a decision tree to set up a more rigorous framework for looking into the data.