Let’s Talk About Money: How DATEV Plans to Use Machine Learning to Offer Top-of-Its-Class Salary Predictions

DATEV TechBlog · 6 min read · Aug 10, 2022

By: Moritz Mayer & Frank Eichinger

While Openness is Lacking, Interest is Not.

Salary – the amount of money an employee is paid for his or her work. The concept requires little explanation. Yet the question of how high it should be is a totally different story. How much someone is paid is information that is kept as private as almost nothing else, especially in Germany.

Only 1 in 3 Germans share information about their salary with friends, and even fewer (1 in 6) do so with their co-workers. Even in relationships it remains a delicate subject: only every second German knows how much their partner earns (StepStone, 2019).

At the same time, German companies are facing a shortage of skilled labor. Losing talent hurts. Of those “lost talents”, 1 in 2 cite too low a salary as their number one reason for leaving (Randstad, 2018). It is therefore of critical importance for employers to know the market prices of their employees and their open positions. Only then can they retain and hire the talent they urgently need to keep their business on track.

Above all, online platforms like kununu (employer reviews) and StepStone (job offers) are ramping up their marketing efforts to fill the gap. Advertisements for their respective salary recommendation tools can be seen on virtually every bus stop and train station billboard across the country.

But What About the Data Quality of these Tools?

It’s commonly known that a machine learning model is only as good as the data it is trained on. And this is exactly where these tools struggle. Since such platforms do not have access to verified payroll data, they fill their salary databases by asking users to enter salary data before they can use the tool themselves. This is problematic.

Sloppy data entry is a major problem that calls the prediction quality of these platform tools into question: users type in random values just to finish the mandatory initial survey as quickly as possible, some include variable salary components (like bonuses or benefits such as company cars) while others do not, and malicious users or bots may even enter wrong data on purpose.

Tackling the Problem with Advanced Machine Learning and High-Quality Data

14 million employees get their payslips processed by DATEV software each month, and we had access to an anonymized sample of this data – a sample unmatched in quantity and quality when it comes to salary data. Unlike the platforms mentioned above, we can be sure that the salaries we feed into our machine learning models are actually the ones the employees receive.

Data Quantity and Quality Are Great – So Is Building a Great Machine Learning Model Just a Flick of a Switch?

Unfortunately, it’s not that easy. The data set can merely be seen as a solid foundation; as this is real life and not an Udemy tutorial, even a high-quality data set has dirty spots and peculiarities that need to be dealt with.

Out of the numerous supervised machine learning techniques for regression, we chose random forests as a starting point – an ensemble technique that internally combines a multitude of regression tree models.

We made our choice for several reasons:

· Several wide-ranging cross-domain studies have shown empirically that random forests are almost always among the top-performing algorithms.

· Random forests have a robust model fit (they hardly overfit) and are considered good at handling categorical attributes, missing values, and noise.

· Tree-based algorithms are a good choice for stakeholder presentation and discussion as they consist of rules understandable by humans.
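To make this concrete, here is a minimal sketch of such a random forest regression setup in Python with scikit-learn. The file name and feature columns are purely illustrative assumptions and not our actual data schema:

```python
# Minimal random forest regression sketch (scikit-learn).
# File name and column names are illustrative assumptions, not the DATEV schema.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("payroll_sample.csv")  # hypothetical anonymized payroll sample
features = ["profession", "region", "age", "experience_years", "weekly_hours"]

# One-hot encode the categorical features; the profession feature is discussed further below.
X = pd.get_dummies(df[features], columns=["profession", "region"])
y = df["salary"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

forest = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=42)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

# Tree-based models also give a quick, stakeholder-friendly view of what drives the predictions.
print(sorted(zip(forest.feature_importances_, X.columns), reverse=True)[:10])
```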

Performance Metric and Baseline Model

Two more things need to be done before machine learning experiments can start:

1. choosing a performance metric to be calculated on a real test dataset, and

2. choosing a baseline prediction model.

The “Mean Absolute Percentage Error” (MAPE) states how far off (in percent) the predicted values are from the actual values on average. Let’s look at an example:

Actual salary of employee = 40,000 €
Predicted salary by algorithm = 44,000 €
MAPE = (44,000 € - 40,000 €) / 40,000 € = 10% (the difference in relation to the actual salary)

If the prediction had been 36,000 €, the error and hence the MAPE would also be 10%. As both predictions are equally wrong, it makes sense to consider only the absolute difference, regardless of the plus/minus sign. While there are other error metrics, the MAPE is frequently used and captures our business problem very well.
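To illustrate how the metric behaves over a whole test set (the numbers below are made up and not our results), the MAPE can be computed in a few lines; scikit-learn also ships a mean_absolute_percentage_error function for exactly this:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

y_true = np.array([40_000, 40_000, 52_000])  # actual salaries (illustrative values)
y_pred = np.array([44_000, 36_000, 52_000])  # predicted salaries

# Mean of the absolute percentage errors: |prediction - actual| / actual, averaged over all rows.
mape = np.mean(np.abs(y_pred - y_true) / y_true)
print(f"{mape:.1%}")  # 6.7%
print(f"{mean_absolute_percentage_error(y_true, y_pred):.1%}")  # same result
```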

A baseline prediction model is a useful benchmark to assess whether the business problem at hand requires machine learning at all, or how far a much simpler approach would get you. Our baseline model is straightforward: it uses the median salary of all employees of a profession as the prediction and completely disregards all other features.
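A minimal sketch of this baseline, with made-up numbers and assuming a pandas DataFrame with a profession and a salary column:

```python
import pandas as pd

# Tiny illustrative data set; in our case the training data comes from the payroll sample.
train = pd.DataFrame({
    "profession": ["nurse", "nurse", "developer", "developer", "developer"],
    "salary":     [38_000,  42_000,  55_000,      60_000,      65_000],
})
test = pd.DataFrame({"profession": ["nurse", "developer", "baker"]})

# Baseline: predict the median salary of the profession, ignore all other features.
median_by_profession = train.groupby("profession")["salary"].median()
overall_median = train["salary"].median()  # fallback for professions unseen in training

test["predicted_salary"] = test["profession"].map(median_by_profession).fillna(overall_median)
print(test)  # nurse -> 40,000; developer -> 60,000; baker -> 55,000 (fallback)
```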

Dozens of Experiments

Out of the dozens of experiments we conducted, the table displays the results of the five most important ones, each run with plain regression trees and with random forests.

To anticipate one result: as expected, the comparably simple regression trees we use as an additional comparison always show higher error values than the random forests. Apart from that, we start with a vanilla random forest on our data – one model for all professions (Experiment 1). The results are somewhat better than the baseline model, but it quickly becomes clear that the tree-based approach struggles with a particularity of the salary data: a categorical feature with very high cardinality and, at the same time, high feature importance – the profession. Leaving out the profession variable decreases the prediction performance (Experiment 2) even below the baseline, but another approach yields much better results:

An Ensemble of Ensembles

To better capture the important profession of an employee, we adapt our approach: we train a random forest model for each individual profession, resulting in hundreds of random forests, each itself consisting of a multitude of regression trees – in a sense, an ensemble of ensembles. This increases the prediction quality significantly (Experiment 3). We achieve further large improvements through more comprehensive outlier cleaning (Experiment 4) and additional smaller improvements by tuning the hyperparameters of the models (Experiment 5).
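A simplified sketch of this idea (again with an illustrative schema and thresholds, not our production code): one random forest is trained per profession, and at prediction time each row is routed to the forest of its profession.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def train_per_profession(train: pd.DataFrame, feature_cols: list) -> dict:
    """Fit one random forest per profession – an ensemble of ensembles."""
    models = {}
    for profession, group in train.groupby("profession"):
        if len(group) < 50:  # illustrative threshold: skip professions with too little data
            continue
        forest = RandomForestRegressor(n_estimators=300, n_jobs=-1, random_state=42)
        forest.fit(group[feature_cols], group["salary"])
        models[profession] = forest
    return models

def predict_per_profession(models: dict, test: pd.DataFrame, feature_cols: list) -> pd.Series:
    """Route each row to the forest of its profession; rows without a model stay NaN."""
    preds = pd.Series(index=test.index, dtype=float)
    for profession, group in test.groupby("profession"):
        forest = models.get(profession)
        if forest is not None:
            preds.loc[group.index] = forest.predict(group[feature_cols])
    return preds  # NaN entries could fall back to the per-profession median baseline
```

Hyperparameter tuning (Experiment 5) would then happen per model, for example with scikit-learn’s GridSearchCV inside the training loop, and professions without their own model can fall back to the median baseline described above.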

17% Error – Good or Bad?

So, is our achieved average prediction error of 17% good or bad? Of course, it would be desirable to predict salaries with no error at all. But can this be done? The short answer is: no. And you probably wouldn’t even want that. Why? Because the data can only capture reality to a certain extent. In the real world, numerous other factors influence a person’s salary, including the intelligence quotient (IQ), grades from university, employee ratings, gender, and ethnicity. Not all of this data can be captured, and some of it just shouldn’t be used in a model for reasons of data ethics (e.g., gender, ethnicity). You certainly don’t want your model to reinforce existing injustices. So, while there is probably still room for improvement, a certain degree of error will remain, as otherwise undesirable features would have to be included in the salary prediction model.

One further reason for prediction errors is that there are remaining incorrect data points in our database which can hardly be eliminated automatically. Another reason is that there are certainly quite a few employees in the real world who are paid too little or too much. As we measure the prediction error by comparing predictions with real salaries from our test set (and not with “correct” salaries in line with the market as defined by experts), this inflates the error values, too. On the other hand, our automated predictions will help to find salaries in line with the market, eventually leading to fairer salaries – even if a 17% error on a test set may sound bad at first.

To conclude, we have demonstrated that random forest regression is well suited for salary predictions based on payslip data. The key factor for achieving high-quality results is – besides carefully chosen outlier handling – learning one random forest regression model per profession. At the moment, we are clearing the remaining hurdles on the way from proof of concept to production software by integrating the machine learning based salary prediction approach into our software for salary comparisons.
