Using machine learning to predict employee turnover without damaging your business in the process

Alexander Nejel
Published in Exness Tech Blog · 9 min read · Dec 13, 2022

Have you ever come across machine learning predictions for employee outflow as opposed to customer outflow?

A few years ago, this was a trending topic — and it keeps turning up in the media every once in a while. Some of you may have heard or even witnessed incidents to the tune of: “My big data team has analyzed things, and some of you are fired.”

But today, I suggest setting aside the extreme cases and instead taking a bird’s-eye view of such tasks, which are interesting in three respects:

  • The origin of the task and its business use (if any)
  • Some ethical aspects
  • Proper formulation and possible solution of the task in terms of Machine Learning

So buckle up.

Ethical groundwork

Let’s kick off by asking where the task of predicting an employee’s resignation comes from.

Before we answer, there’s some basic groundwork to establish. The right of an employee to terminate their contract with their employer is mandated by labor law, which — aside from being a basic human right — ensures natural regulation and improved working conditions in a competitive job market. It’s what keeps any organization’s recruitment wheels turning: looking for new employees to replace outgoing ones or simply to introduce new talent into the workflow. Any fairly large organization must ensure this process is seamless and persistent.

Another important point is that a company’s employee count isn’t equivalent to its commercial yield. Suppose you run an IT business with a headcount of 100. 50 of them are developers of all stripes, while the remaining 50 perform the rest of the duties. In this case, what will a cleaner quitting the company mean for you? What about a CTO bidding farewell? A garden-variety developer calling it quits? These three scenarios pose entirely different risks and entail three different response strategies. In a well-run organization, none of these events (or even all of them at once) deals a lethal blow to its operations. It is precisely these risks that we are handling when faced with the task of predicting an employee’s resignation.

This brings us to two important characteristics of the task we are dealing with: the unpredictability of resignation and the inequality of the affected positions.

Here it pays to discuss another basic thing that will help us later: the resignation event per se. Viewing resignations from a pure data analysis perspective leads to a somewhat counterintuitive conclusion: sooner or later, all of our company’s employees will step down. Technically it’s true: take a large enough sample of any organization’s employees over a long enough period, and you’ll quickly see that someone will walk out as early as tomorrow, someone will bow out in a year’s time, and for others it may take 10 more years before they call it a day. But the bottom line is that, at the latest, every employee leaves the company upon reaching retirement age.

Now instead of asking ourselves if an employee will quit, let’s switch to two more specific questions: who will leave the company, and when will they do it? Predicting some general outflow figures is not enough: after all, no ML is needed to crunch the turnover rate numbers. Instead, these tasks involve specific job titles and expected time horizons for employees who may eventually stand down. Simply put: who is going to resign, and exactly when? These are the questions we are answering.

Enough with the basics, though. Let’s move forward and assume we do have exact and exhaustive answers to these questions for each employee. Suppose we have an algorithm that, given an employee’s name, returns their expected resignation date with 100% accuracy. What can be done with it? Is this information even useful for the business?

Usefulness assessment

Perhaps we can do something to retain our key employees who, according to our algorithm, are eyeing the exit sign. And this knowledge may help us keep them aboard.

In this case, the algorithm will indeed be of use. But there are two catches.

Number one. Do we as an organization have any leverage to make an individual change their mind and mentally get back on the team? After all, people don’t quit because some random event generator says so. Rather, a resignation is shaped by internal reasons (the individual disapproves of something) or external factors (going through a rough patch, getting a better offer, etc.). There is basically no way to change the external factors. The internal factors we can change, though only to an extent: even if we are spot-on about the reasons, there is no guarantee the employee stays.

Number two. What does it take for an algorithm to work efficiently? To paraphrase, what is the price we are paying to predict the resignation date? Here are some potential questions and issues we may face.

  1. What data do we need to collect from our employees for the algorithm to work? What input data will it require? Aren’t we getting on our employees’ nerves by, say, having them wear collars and GPS trackers to collect information on their every step around the office or beyond?
  2. Won’t the very fact of being monitored frustrate the employees (the Big Brother effect)? And don’t think such tracking activities could be kept under wraps: that is impossible by any stretch.
  3. What will employees come to think of a company that tries to influence them through the data it collects (see point 1)? Won’t they, whether consciously or not, tweak their behavior to stay aligned with the system, feeding us an imitation rather than information useful to our business?

Although each of these three questions warrants a separate discussion, I’d like you to take a step back to where we assumed that our algorithm is 100% accurate. The reality is that it will NOT be that accurate. And since the subject is so sensitive, each error will only undermine confidence in both the algorithm and its developers.

But reservations aside, if such an algorithm were to exist, how could it perform and what would it assess?

Formal take

Technically, it is no big deal on the ML side of things.

We’ve got a historical list of employees (our objects) that can be used for training. This includes all our former employees, a list of features (more on this later), and their resignation dates as object-specific “responses.” We’ve also got a list of active employees, obviously with no resignation dates, but with the same features, which we can use to predict their resignations with, say, linear regression, or a tree ensemble should we need more predictive power.

List of features

These are all sorts of dated HR events: hires and transfers, whether a transfer is a promotion, part of a reorganization, or a complete career change. Some features are sensitive data, such as the compa-ratio. From my own experience, employee performance ratings show surprisingly low correlations with resignation dates.

The simplest example is a flat table: one row per former employee, the feature columns, and their tenure as the “response.”
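A toy version of such a table can be assembled from dated HR events with nothing but the standard library; every column name below is a hypothetical illustration, not a prescribed schema.

```python
# Turn raw, dated HR events into one flat row per former employee.
# Column names and values are hypothetical.
from datetime import date

def make_row(hired, left, promotions, compa_ratio):
    """Features plus the tenure target (in days) for one former employee."""
    last_boost = max(promotions) if promotions else hired
    return {
        "tenure_days": (left - hired).days,            # the "response"
        "n_promotions": len(promotions),
        "days_since_last_promotion": (left - last_boost).days,
        "compa_ratio": compa_ratio,                    # sensitive feature
    }

rows = [
    make_row(date(2018, 1, 15), date(2021, 2, 1), [date(2019, 6, 1)], 0.95),
    make_row(date(2019, 3, 1), date(2020, 9, 1), [], 0.80),
]
for row in rows:
    print(row)
```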

Here are some typical problems we will face.

  • Too little data, especially if it’s a fledgling organization and the number of resignations is quite low.
  • Inaccurate data on top of that: the information is stored sloppily and migrates between systems, while the semantics and scales often change. In the best-case scenario, you’ll be able to reproduce the scales and semantics from several years back based on the historical data. Long story short, preprocessing the features will be a tough nut to crack.
  • Many events are skewed by the decision-makers’ individual perceptions (performance ratings, resignation causes).

Applicability

Another noteworthy point has to do with the applicability of such algorithms to real business tasks. Let’s say we have collected our features — the ones offering more use than trouble — trained our model on the historical data, applied it to the active employees, and obtained a date (or resignation probability) for each one of them.

What’s next?

As our output, we now have what looks like the most useless model in history: it does predict the resignation dates of all the employees, but our estimate is NOWHERE near accurate for each specific employee. Each employee has their own unique circumstances of which we know nothing, but which can considerably affect their career behavior. For instance, they may be headed for a resignation, but are deterred by their mortgage payment plan, or they are unwilling to risk switching companies. Conversely, they may not exhibit formal signs of quitting while having internalized bad blood with their line manager that we know nothing about.

So is there any use for the collected data?

Firstly, as part of the process, we have gleaned some useful data on our employees’ career behavior and the situations that prompt them to walk out. For example, this is how yours truly found out that the median tenure of a developer at a large Russian IT company is about 1,100 days (roughly 3 years), and that there are specific situations that impact the employees’ tendency to quit.

This knowledge can be utilized to create a small and easily explainable rule-based recommendation system, useful to the HR team as a basis for career decisions. For one, they can figure out who to keep an eye on in the short term. Once tested, the results of such a recommendation system can be bundled with specific guidelines and handed over to line managers.
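Such a rule-based system can be as small as a handful of transparent checks. The thresholds and field names below are illustrative assumptions, not figures from any real HR policy.

```python
# A minimal rule-based watchlist: a few hand-written, explainable
# rules instead of an opaque model. All thresholds are assumptions.
WATCH_TENURE_MONTHS = 30         # e.g. nearing a typical tenure length
STALE_PROMOTION_MONTHS = 24      # a long time without a promotion
LOW_COMPA_RATIO = 0.85           # paid well below the market midpoint

def watchlist_reasons(employee):
    """Return human-readable reasons to keep an eye on an employee."""
    reasons = []
    if employee["tenure_months"] >= WATCH_TENURE_MONTHS:
        reasons.append("tenure nearing the typical resignation point")
    if employee["months_since_promotion"] >= STALE_PROMOTION_MONTHS:
        reasons.append("no promotion for a long time")
    if employee["compa_ratio"] < LOW_COMPA_RATIO:
        reasons.append("compensation well below midpoint")
    return reasons

team = [
    {"name": "A", "tenure_months": 34, "months_since_promotion": 26, "compa_ratio": 0.82},
    {"name": "B", "tenure_months": 10, "months_since_promotion": 10, "compa_ratio": 1.05},
]
for emp in team:
    reasons = watchlist_reasons(emp)
    if reasons:
        print(emp["name"], "->", "; ".join(reasons))
```

The point of keeping the rules explicit is that HR can read, question, and adjust each one, which an opaque model does not allow.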

Secondly, we can figure out a task that makes our model’s operational results benefit the business.

To explain the gist of it, we first need to go off on a slight tangent and break down some basic HR processes. Any organization has a number of ways to count its staff. We can simply count our active employees, and that is our personnel count. But there is a less obvious yet more important personnel count. In Russian practice, it is referred to as “staffing records,” and it implies the number of jobs rather than people. Indeed, we cannot hire more people than our office or IT infrastructure can accommodate, or more than our payroll capabilities can afford. Typically, these staffing records are tied to a present or future period, for example, the 2022 staffing records.

Intuition may suggest that properly maintained staffing records are always filled to 100%, right? Wrong. In reality, staffing records always exceed (or, rarely, equal) the personnel count. The bigger the organization, the lower the odds of the personnel count matching the staffing records.

In an organization with 1,000 positions in its staffing records, the odds of them being 100% filled are close to zero. Even for the most robust businesses, the fill rate of staffing records hovers around 90%.

Why is this the case? The answer is simple: employee termination takes far less time than hiring.

Why does it matter?

Each organization runs a process of “planning payroll fund costs,” and more often than not, they use the simplest and the most conservative models that suggest filling out the staffing records to 100%. That is to say, often we literally plan that 100% of our employees will be working 100% of the time, even though we know for a fact that it is mathematically impossible.

That is where data on the employees’ median tenure can come in handy. Aggregated at the department/division level, this data can provide us with a proxy metric of “staffing records fill rate in division X with a horizon of one year/quarter/month.” This number, typically within the range of [0.7, 0.96], serves as a multiplier for said division’s planned payroll fund costs, making them lower than customary. The resulting difference can be excluded from the planned costs and become available for other purposes.
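As a sketch of that adjustment, with hypothetical divisions, fill rates, and a safety margin chosen to keep the estimate conservative:

```python
# Apply a predicted fill-rate proxy to each division's payroll plan.
# All figures are invented; the safety margin rounds the rate UP,
# because underplanning costs is far riskier than overplanning.
SAFETY_MARGIN = 0.03

divisions = {
    # division: (planned payroll at 100% fill, predicted fill rate)
    "engineering": (12_000_000, 0.91),
    "support": (4_000_000, 0.84),
}

adjusted = {}
for name, (full_payroll, fill_rate) in divisions.items():
    conservative_rate = min(1.0, fill_rate + SAFETY_MARGIN)
    adjusted[name] = full_payroll * conservative_rate
    freed = full_payroll - adjusted[name]
    print(f"{name}: plan {adjusted[name]:,.0f} (frees up {freed:,.0f})")
```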

The key thing about this task is its asymmetry. Overestimating the remaining costs is cheap: we simply leave the extra amount in the plan or on the company’s accounts. Underestimating the planned costs, however, can open a detrimental gap in company funds, on top of all the other risks our business is susceptible to. In other words, our model must remain highly conservative.

Conclusions

The basic personnel outflow model cannot be accurate, and using it loosely can bring a company more trouble than benefit. Do not overlook the ethical factor, or your model may end up damaging the business before it can yield its first results. If your data collection resembles surveillance, expect backlash and an adverse reaction from employees. Instead, try looking for alternative application scenarios for the outflow model results in advance, like using them in an aggregated fashion. You can calculate the expected resignation date “under the hood”, but avoid taking it to your product’s forefront and using it as is. A feasible scenario for a real-life usage of such an algorithm is calculating the expected annual payroll fund costs, while factoring in the employees’ median tenure.
