The World of Salary: Introducing Salary Prediction Models at StepStone

Jul 11, 2022

What brought us here?

The year is 2018. StepStone is a household name in the job market. Users are mostly content with the job search experience, our client base keeps growing, and our business keeps getting more and more profitable. Once we reached this level of expertise and popularity, we could finally afford to address the elephant in every job pursuit room: salary.

The notion of accessible salary information for the general public is a topic of both controversy and intrigue in Germany, the main country we operate in. This attitude is of course largely cultural: our other markets, the UK for instance, operate very differently on this point. In these countries, salary is a well-established part of a job application, and most employers post salary information on their listings themselves.

Even so, one must not confuse accepted social norms with a lack of public interest. As shown by a user survey conducted in 2018 on a sample of candidates using StepStone (see below), salary is both perceived as the most useful piece of information during online job searches and the second most popular reason for employees to seek a change of job. This, together with the usefulness of salary information as an additional dimension for matching candidates with the most suitable employers and vice versa, was crystal-clear evidence that tackling salary information was a duty we simply could not ignore any longer.

Results of a user research study conducted in 2018, showing the overwhelming interest in salary information among candidates.

History: How did we get started?

The initial idea was to create a tool that would give salary estimates for a given set of user characteristics. From the get-go it was obvious that the best way to collect the needed user information would be through a questionnaire; given the answers to relevant questions, we would compute the predicted values for the users’ salary ranges through a machine learning model. There were two things we needed to settle to kickstart the whole project: 1. secure user salary and job-related data, 2. settle on an algorithm to use for the machine learning model.

We could tackle the first issue through acquiring data from an external source. We collected datasets with user data for Germany, Austria, Belgium, UK, and Ireland. The data contained information about the job experiences of users, their job descriptions, as well as company-related information for their current employers. The data was enough to run a few experiments and present the first results to our stakeholders, but it needed a rigorous pre-processing step before it could be useful for any further model training.

When it came to the algorithm of choice to solve this regression problem, we ran a few experiments to compare initial results, and ultimately settled on Microsoft’s LightGBM. This algorithm was at the time truly revolutionary, as it lets you feed categorical features to the model directly, while retaining high model accuracy and keeping training time short. To this day, the algorithm remains state-of-the-art for solving such regression problems at this scale.

With the initial datasets available, and an algorithm of choice, we were good to go with training the initial models. We trained models for all five markets and reached reasonable accuracy. The idea was to then improve on the initial results by collecting more user data, thereby exploring the job market situation in greater depth. This would of course come as a natural result of collecting users’ answers to our questionnaire, so eventually we would not need to rely on the external data sources to train our models. Soon enough, we had collected enough questionnaire data and were able to phase out the external data source entirely.

Products: Where are we now?

Right now, the Salary Planner is an established and popular tool, with a myriad of people checking their salaries and thinking about possible career progressions every month. This is not only true for Germany — the Salary Planner is also live in the UK, Belgium and Austria, offering tailored predictions based on the data for these specific markets.

Salary information overview resulting from the completion of the Salary Planner questionnaire

Moreover, and this is really the interesting part for us as a Data Science team, we have a constant stream of up-to-date salary information for the job market, which we use to continuously improve and refine our model.

This has even gotten an additional boost this year with the acquisition of, a company specialized in all salary-related topics for Germany. After months of discussion, alignment, and comparison of our approaches, a mapped dataset became available for use in the StepStone salary model pipelines. These models are giving our users more accurate salary estimates for their credentials than ever before.

Based on the success of the Salary Planner, and the high interest in this topic we mentioned earlier, we were also able to get buy-in from management for probably one of the biggest changes to in recent history: Salary predictions on listings, a product that was released last year to much media attention.

This enables job seekers not only to assess their own market value, but also to see directly what adequate compensation for a given listing would be. This product required a huge effort from the Data Science side, as we had very little information available from companies and other sources about actual salaries. This meant we had to extract and pull together relevant data points from listings, company sites, etc., and map those to the data we collected through the Salary Planner to be able to train a high-accuracy model for this use case.

Salary information viewed on a listing. The range is an estimation made through our salary prediction model for listings.

As a result of this effort, we now have salary predictions on over 80% of the listings on, a huge achievement which adds much more value to these listings for the job seekers. Starting from an already highly accurate first-release model, and with ongoing improvements to the product and continuous efforts by our customer service teams, we have now reached the point where the salary information is regarded as an established addition to the listing, generating more interest and in turn more applications for the clients.

Tech Stack: How do we do it?

First, everything we do in our team runs on AWS SageMaker, which allows us to quickly and conveniently perform experiments, run training and tuning jobs, and eventually provide prediction endpoints for our internal and external clients. Having everything in the cloud relieves us of a lot of infrastructure headaches and allows us to focus on delivering the best quality models.

As mentioned above, the foundation for the actual salary estimations is LightGBM. A great algorithm alone will never get you all the way, though, without the appropriate data (see above) and features specifically engineered to make the most of this data.

Here lies the secret sauce that sets StepStone apart. As a first step of feature enrichment, we use a carefully curated linguistic ontology of over 250,000 jobs, which allows us, for instance, to encode very precisely information about (sub-)disciplines and their relationships to each other. In essence, this enables the model to understand the characteristics of specific groups of jobs in the context of the whole job market. We also obtain information about seniority for a given job title from this ontology. This fantastic tool is maintained by StepStone’s Linguistic Services Team.

To complement this explicitly encoded information, we also add word embeddings which express yet another layer of similarity between different jobs. These embeddings are based on the well-known BERT model and have been fine-tuned on StepStone listings. This was developed by a different Data Science team at StepStone originally, but in the spirit of great cooperation between the teams we were quickly able to integrate it into our approach.
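To make the combination of the two feature sources more concrete, here is a minimal sketch of how a single feature row could be assembled. It is not our production code: the tiny `TOY_ONTOLOGY` lookup, the `OntologyEntry` fields, the `build_feature_row` helper and the three-dimensional embedding are all hypothetical stand-ins for the real ontology and the fine-tuned BERT embeddings.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

FeatureValue = Union[str, float]


@dataclass(frozen=True)
class OntologyEntry:
    """Categorical information looked up for a job title (hypothetical fields)."""
    discipline: str
    subdiscipline: str
    seniority: str


# Hypothetical toy stand-in for the real ontology of over 250,000 jobs.
TOY_ONTOLOGY: Dict[str, OntologyEntry] = {
    "senior data scientist": OntologyEntry("it", "data science", "senior"),
    "junior accountant": OntologyEntry("finance", "accounting", "junior"),
}

# Fallback for titles the toy ontology does not know.
UNKNOWN = OntologyEntry("unknown", "unknown", "unknown")


def build_feature_row(job_title: str, title_embedding: List[float]) -> Dict[str, FeatureValue]:
    """Combine ontology categories and embedding dimensions into one flat row."""
    entry = TOY_ONTOLOGY.get(job_title.strip().lower(), UNKNOWN)
    row: Dict[str, FeatureValue] = {
        "discipline": entry.discipline,
        "subdiscipline": entry.subdiscipline,
        "seniority": entry.seniority,
    }
    # Each embedding dimension becomes its own numeric feature.
    for i, value in enumerate(title_embedding):
        row[f"emb_{i}"] = float(value)
    return row


# Made-up embedding; a real one would come from the fine-tuned language model.
row = build_feature_row("Senior Data Scientist", [0.12, -0.40, 0.88])
unknown_row = build_feature_row("Chief Unicorn Wrangler", [0.0, 0.0, 0.0])
```

The categorical columns carry the explicitly curated knowledge, while the numeric `emb_*` columns carry the learned similarity, so a tree-based model can draw on both.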

With all these things in place, our trained model currently has a median proportional error of 13.2%, which is the lowest in the industry for Germany to the best of our knowledge.
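For readers unfamiliar with the metric, the sketch below shows one common way to compute a median proportional error. It assumes the metric is the median of the absolute prediction error divided by the actual salary; the function name `median_proportional_error` and the salary figures are made up for illustration and are not taken from our data.

```python
from statistics import median
from typing import Sequence


def median_proportional_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Median of |predicted - actual| / actual over all samples (assumed definition)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    errors = []
    for a, p in zip(actual, predicted):
        if a <= 0:
            raise ValueError("actual salaries must be positive")
        errors.append(abs(p - a) / a)
    return median(errors)


# Made-up yearly salaries, purely for illustration.
actual_salaries = [40_000, 50_000, 60_000, 80_000, 100_000]
predicted_salaries = [42_000, 45_000, 69_000, 64_000, 150_000]

mpe = median_proportional_error(actual_salaries, predicted_salaries)
print(f"{mpe:.1%}")  # prints 15.0%
```

Using the median rather than the mean keeps a single badly mispredicted salary (the last one in the example, off by 50%) from dominating the score.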

Future: What’s next?

With all the exciting things that have been achieved already, we are of course still far from finished with the topic. One continuous effort is the improvement of the prediction quality, either through acquisition of more data and improved cleaning/preprocessing, or through algorithmic improvements we are investigating every day.

Another major topic for Salary on Listings is to push the coverage towards 100%. Right now, there are cases for which we cannot make predictions for technical or other reasons, and we aim to close this gap as much as possible to add value for even more users.

An even bigger undertaking which is currently going on is the integration of our salary models into new use cases and products. For instance, we are developing a new tool in the B2B context for helping companies analyze their own internal salary structure and make more informed decisions when recruiting. This is based on the existing product Compensation Partner from our friends at, and we are excited to take it to the next level.

Finally, this will also enable us to better integrate salary information into our existing matching algorithms between job seekers and listings. If we achieve significant coverage of reliable salary information on both sides, we will be able to recommend even better-fitting jobs. As noted in the introduction, salary is one of the most important factors when looking for a job, and we are working hard to reflect this in all our products. After all, finding the right job for everyone is our mission, and salary is a big part of it.

