How Jobtome’s AI-Powered System Reveals Hidden Salaries in Job Postings

Stefano Rota
Jobtome Engineering
9 min read · Mar 13, 2023

Introduction

Our mission at Jobtome is to help people get the right jobs and employers find the right employees, reducing economic insecurity for blue-collar workers around the world.
Every day, we collect, evaluate, handle, and publish a vast number of job postings on our platform. Each job posting usually contains essential information such as the job title, the location, and the name of the hiring company. In addition, job descriptions may include other important details, such as required qualifications, benefits, and compensation, which are often unstructured and require extraction.

The matter of including salaries in job offers has become a popular topic in our industry.
Although some companies have adopted a more transparent approach and include pay rates in their job advertisements, it can still be hard for candidates to uncover this crucial information within lengthy job descriptions. Our aim is to enhance the visibility of this data and draw job seekers' attention to it.
To accomplish this objective, we have created a system that automatically extracts job compensation details using a machine learning approach.

The problem

Our primary objective is to identify specific words within a lengthy text as belonging to the salary category. Initially, this task may seem comparable to a Named Entity Recognition (NER) challenge, in which every token in the job description is labeled as either Salary or Not Salary. However, current Natural Language Processing models operate with a limited vocabulary of around 30,000 tokens, which becomes a limitation when numbers are involved: a job salary may be split into many sub-tokens, making NER less effective. The following is a typical output of the tokenisation process:

>>> tokenizer.tokenize('from 23.000$ to 24.500$')
['from', '23', '.', '000', '$', 'to', '24', '.', '500', '$']

In such cases where there are lengthy sequences of tokens, an alternative approach is required. Our problem can be reframed as the extraction of a section of text from a given context. We have chosen to use an extractive Question Answering (QA) algorithm, which is highly suitable for these circumstances.
An informative and straightforward article explaining how extractive Question Answering works and how to implement it with BERT can be found at the following link. Essentially, the model takes the question and the context as input and assigns a probability to each word in the context, indicating whether it could be the start or end of a potential answer. By selecting the most probable combination of start and end tokens, one can easily extract the text between these two points as the answer.
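For reference, a generic extractive QA model can be tried in a few lines with the 🤗 Transformers pipeline. The snippet below is only an illustration with an off-the-shelf checkpoint (the same one we later use as a benchmark), not our production model:

from transformers import pipeline

# Off-the-shelf extractive QA model, fine-tuned on SQuAD
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What is the salary?",
    context="Packing Line Factory Operative in Denby, Derbyshire. Pay rate: £9.80 per hour.",
)
print(result["answer"])  # ideally something like "£9.80 per hour"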

In the following sections, we describe some custom implementations that make our model more specific to our task and faster at inference time than standard QA implementations.

Training Dataset

Before diving into the model, we give a quick introduction to our data. The dataset used for training our model consists of 100,000 samples, each including an English job description and the corresponding extracted job salary. Below is a random instance:

Job Description:
Packing Line Factory Operative<br /><br />Denby, Derbyshire<br /><br />Can you travel to Denby for 6am start? Looking for packing work?<br /><br />Yes? Keep reading!!<br /><br />Pay Rate:<br /><br />£9.80 per hour<br /><br />Working hours:<br /><br />Rotating shifts, days and afters:<br />Monday to Thursday 6am – 2pm / Friday 6am – 12.30pm<br />Monday to Thursday 2pm – 10pm / Friday 12.30pm – 7pm<br />Must be flexible due to the business needs<br /><br />You do not need any experience for this role as all onsite training is provided for you but if you have worked on Production Lines previously this will be a distinct advantage.<br /><br />Duties will include:<br /><br />Assembling a cardboard box to put bottles in<br />Quality checking labels and date stamps<br />Using a tape machine to seal boxes<br />Stacking boxes onto pallets<br />Cleaning duties<br />Any other general production duties as required<br /><br />Benefits:<br /><br />Onsite parking<br />Subsidised canteen facilities<br />Temporary to Permanent for the right candidate<br /><br />To apply please either:<br /><br />Call Paula on 01773 513 310 between the hours of 8am – 5pm Mon-Fri<br />

Salary:
£9.80 per hour

One crucial requirement for implementing extractive Question Answering is that the answer must exactly match a portion of the context. However, in our raw data, the labeled job compensation often differed slightly from the corresponding text span within the job description. For instance, the labeled salary “from 45$ to 23$” did not exactly match the span “from 45 to $23” found in the job description. To address this issue, we use the fuzzysearch library, which finds the closest match for a given search string within the context:

>>> from fuzzysearch import find_near_matches
>>> find_near_matches('from 45$ to 23$', 'job description ... from 45 to $23 ...', max_l_dist=5)
[Match(start=25, end=40, dist=3, matched="from 45 to $23")]

Moreover, to account for more salary patterns, we augment our data by artificially generating slight modifications of the salary string, both in the label and in the job description. This includes changing the currency and substituting common patterns such as “per hour” with “p/h” and “per annum” with “per year”, among others.
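A minimal sketch of this kind of augmentation is shown below; the substitution map and the helper function are illustrative, not our exact production code:

import random

# Illustrative substitutions, applied to both the salary label and the description
SUBSTITUTIONS = {
    "per hour": ["p/h", "an hour"],
    "per annum": ["per year", "p.a."],
    "£": ["$", "€"],
}

def augment(description: str, salary: str):
    """Randomly apply the same modification to the description and its salary label."""
    for pattern, variants in SUBSTITUTIONS.items():
        if pattern in salary and random.random() < 0.5:
            variant = random.choice(variants)
            salary = salary.replace(pattern, variant)
            description = description.replace(pattern, variant)
    return description, salary

new_description, new_salary = augment(
    "Warehouse operative ... £9.80 per hour ...", "£9.80 per hour"
)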

Question Answering model

As previously mentioned, our task resembles an extractive Question Answering problem: identify the specific text segment that answers a given input question within a provided context.
One of the most effective methods for tackling this involves a three-step process:

  • first, a language encoder takes in the question-context pair and returns a contextualised embedding vector for each token
  • next, a classification head outputs the probability of each embedded token being the start or end position of the answer
  • finally, the probabilities of the start and end tokens are combined, and the best combination is selected to determine the span of text that constitutes the answer

Encoder

Our use case is relatively straightforward, as the model is tasked with answering a single hypothetical question: “What is the salary?”
As a result, we have developed a standard QA architecture (which draws inspiration from the 🤗 Hugging Face course, available here), with some modifications to reflect the simplified task. For instance, rather than inputting both the question and context to the model, we only provide the context:

Standard QA input
What is the salary? [SEP] Job description with a salary of 12£ per hour

Our model input
Job description with a salary of 12£ per hour

This reduces the complexity, since the answer is searched for only within the context.

In our approach, we utilise the pre-trained base BERT model as a feature extractor, keeping all of its 110 million internal parameters frozen. This is an unusual choice, but it is driven by the fact that our task (identifying a salary pattern within a job description) is straightforward and does not require fine-tuning such a large number of parameters. As our job descriptions typically exceed the 512-token maximum sequence length that BERT accepts, we segment each context into multiple features, each 100 tokens long and with an overlap of 50 tokens between consecutive features. This segmentation allows us to work with longer job descriptions while still leveraging BERT as a feature extractor.
Below is a simple example:

Input Context 
Job description with a salary of 12£ per hour

Context after segmentation
Feature 1: [CLS] Job description with [SEP]
Feature 2: [CLS] description with a salary of [SEP]
Feature 3: [CLS] with a salary of 12£ per hour [SEP]
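This sliding-window segmentation can be obtained directly from the 🤗 tokenizer. The sketch below assumes the bert-base-uncased checkpoint and uses the window and stride sizes described above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    "Job description with a salary of 12£ per hour",
    max_length=100,                   # each feature holds at most 100 tokens
    stride=50,                        # consecutive features overlap by 50 tokens
    truncation=True,
    return_overflowing_tokens=True,   # emit one feature per window
    return_offsets_mapping=True,      # keep character offsets to map answers back to the text
)

for feature_ids in encoded["input_ids"]:
    print(tokenizer.decode(feature_ids))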

Classification head

The 🤗 Transformers library offers the BertForQuestionAnswering class, which combines the BERT encoder with a single linear layer that classifies the start and end token positions. Since we have chosen to freeze the BERT weights, we have decided to enlarge the classification head in order to provide more flexibility. To accomplish this, we have designed a custom architecture that replaces the conventional classifier with two stacked layers: a hidden layer with 256 neurons and ReLU activation, and a final linear layer that predicts the logit values for the start and end tokens. The number of free parameters in our classification head has increased significantly, from about 1.5k in the standard BertForQuestionAnswering to almost 200k (768 × 256 for the hidden layer, plus 256 × 2 for the final layer, plus bias terms).
Because we have adopted a custom architecture, we cannot start our training from a pre-trained model checkpoint that is designed to answer general questions. Instead, our classification head must be trained from scratch.
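A minimal PyTorch sketch of this architecture is shown below; the layer sizes follow the description above, while the class name and checkpoint are assumptions made for illustration:

from torch import nn
from transformers import BertModel

class SalaryQAModel(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)
        # BERT acts as a frozen feature extractor: none of its weights are updated
        for param in self.encoder.parameters():
            param.requires_grad = False
        # Enlarged classification head: 768 -> 256 -> 2 (start / end logits)
        self.head = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 2),
        )

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        logits = self.head(hidden)                          # (batch, seq_len, 2)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)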
Finally, the model’s output is processed to combine the start and end token probabilities, resulting in the best possible answer as shown below:

Feature
[CLS] with a salary of 12£ per hour [SEP]

Model output: Start token probabilities
[0.1, 0.0, 0.0, 0.1, 0.0, 0.6, 0.2, 0.0, 0.0]

Model output: End token probabilities
[0.2, 0.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.5, 0.1]

Best 'possible' combination
Start token: "12£"
End token: "hour"

Best answer
"12£ per hour"

Since our pipeline is required to process millions of job descriptions daily, the speed of the model is a crucial factor. To make the final step faster than the standard 🤗 QA pipeline, we have developed a custom pipeline that incorporates the following optimisations:

  • we reduce the number of start-end token combinations to just a few of the best candidates from both lists
  • we avoid converting model logits to probabilities
  • we discard candidate answers whose number of tokens or characters exceeds a certain threshold
  • we exclude truncated descriptions that do not contain any digits
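A sketch of the optimised span-selection step is shown below; the top-k and length values are illustrative, not our exact thresholds:

import numpy as np

def best_span(start_logits, end_logits, top_k=5, max_answer_tokens=10):
    """Pick the best start/end pair among a few top candidates, working directly on logits."""
    start_candidates = np.argsort(start_logits)[-top_k:]    # a few best start positions
    end_candidates = np.argsort(end_logits)[-top_k:]        # a few best end positions
    best, best_score = None, -np.inf
    for s in start_candidates:
        for e in end_candidates:
            if s <= e <= s + max_answer_tokens:              # discard overly long answers
                score = start_logits[s] + end_logits[e]      # sum of logits, no softmax needed
                if score > best_score:
                    best, best_score = (s, e), score
    return best, best_score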
We also handle null answers in a specific way: in accordance with standard practice for QA models, we train the model to predict the special [CLS] token when an answer is missing. At inference time, we predict a null answer only when the difference between the logit of the best non-null answer and the null-answer prediction is below a certain threshold, as outlined in the original BERT paper (specifically, in the SQuAD v2.0 section). Continuing the above example:

Null answer, predicting [CLS] both as Start and End token: 
probability: 0.1 x 0.2 = 0.02

Best answer, Start token: "12£", End token: "hour"
probability: 0.6 x 0.5 = 0.3

if 0.3 - 0.02 > threshold, then [best answer] else [null answer]
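In code, the decision rule might look like the sketch below, building on the best_span helper above; the threshold value is an assumption:

def predict_span(start_logits, end_logits, threshold=0.0):
    # The [CLS] token sits at position 0: its logits give the null-answer score
    null_score = start_logits[0] + end_logits[0]
    span, span_score = best_span(start_logits, end_logits)
    if span is not None and span_score - null_score > threshold:
        return span        # (start, end) indices of the predicted salary
    return None            # null answer: no salary found in this feature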

Results

Our model achieved such excellent results on the validation dataset that we have chosen to forgo testing it on yet another split of the training data. Instead, we manually labeled a test set of 1,000 job descriptions found on the web and presented it to the model. To evaluate the model’s performance, we have utilised two metrics commonly employed for QA problems: exact match and f1-score. Exact match checks whether the prediction is identical to the reference text, while the f1-score measures their unigram overlap. Below is a simple example:

Reference label: "12£ per hour"

Model prediction: "salary of 12£ per hour"

Evaluation: Exact match 0/1, Recall 3/3, Precision 3/5, f1 75%
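For completeness, a minimal implementation of the two metrics on whitespace-separated tokens (a simplification of the usual SQuAD evaluation script) could look like this:

def exact_match(prediction: str, reference: str) -> int:
    return int(prediction.strip() == reference.strip())

def f1_score(prediction: str, reference: str) -> float:
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    common = set(pred_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Reproduces the example above: exact match 0, precision 3/5, recall 3/3, f1 0.75
print(exact_match("salary of 12£ per hour", "12£ per hour"),
      f1_score("salary of 12£ per hour", "12£ per hour"))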

In addition, we have compared our results to those of the distilbert-base-cased-distilled-squad model, a DistilBERT encoder fine-tuned on the SQuAD dataset. Although its results are worse than ours, we were impressed by the benchmark’s performance, considering that it was trained to answer general questions rather than specifically for our task.
It is also worth highlighting that our efforts in developing a faster custom model have paid off, resulting in a 15-fold reduction in prediction time (note that we executed both models on a machine with one NVIDIA T4 GPU, 4 vCPUs, and 15 GB of RAM). This improvement is even more remarkable given that our BERT encoder is slower than DistilBERT.

════════════════════════════════════════════════════════
Model        Exact Match (%)   f1-score (%)   time (min)
════════════════════════════════════════════════════════
our                63               82              1
benchmark          24               50             15
════════════════════════════════════════════════════════

Comparison between our custom model and the chosen benchmark on a test set of 1,000 instances

After conducting a thorough error analysis on the entire test sample, we discovered that our model’s only mistakes consist of predicting null answers where a job salary is actually present. In our particular use case we prioritise precision over recall, preferring missed predictions over wrong ones. However, if the accuracy requirements were to change, we could easily adjust the null-answer threshold to give greater weight to the recall metric.

Model predictions compared to reference job salaries. Descriptions are truncated for visualisation purposes

Conclusion

After the testing phase, the model that automatically extracts job compensation details through machine learning is now running in a real-world, or “production”, environment. This means that the model is fully functional and has the potential to enhance the Jobtome platform for job seekers.
As a result, the user experience will be improved by incorporating this new feature. The model can also be used to create new search filters and display job postings more transparently, making it easier for job seekers to find salary information.

Overall, Jobtome is dedicated to improving its platform and providing job seekers with the best possible experience. By leveraging machine learning technology to extract job compensation details, Jobtome is contributing to reducing economic insecurity for blue-collar workers worldwide.
