Using NLP at scale to better help people get the right job — Part 1: problem statement and evaluation

Stefano Rota
Jobtome Engineering

--

Our mission at Jobtome is to help people find the right jobs and employers find the right employees, reducing economic insecurity for blue-collar workers around the world. To that end, every day we acquire, analyze, process, and publish millions of job offers on our website. Classifying such a composite inventory is a challenging task that we decided to modernize by adopting a data-science approach.

The arduous task of classifying a job ad

A job ad usually features a title, a description, a reference to the workplace, and the name of the company looking for candidates; it is rare to get access to additional information such as classification elements. However, when dealing with a high volume of advertised jobs, the lack of a consistent classification scheme quickly becomes a severe limit on a company’s ability to provide efficient access to its inventory. For this reason, we decided to invest in an automated system to detect the job category, i.e., a feature that indicates the type of position the hiring company is looking for.

Our task was to develop a model that automatically classifies an advertised job position into a job category, giving our job seekers a more detailed view of our extensive and dynamic inventory.

The algorithm should be able to assign the Medical category to a Surgical Nurse position or the IT category to a Software Engineer.

Even though in some cases the job category could be derived from the company name (think of the strong correlation between Amazon and warehouse workers), we determined that the job title and description are the most relevant features for extracting the category. We then decided to focus on the title alone to speed up computation and simplify the problem: dealing with the few words of a job title is much faster and simpler than processing the hundreds of words of a job description.

This article is the first of a series of three: the following ones detail the steps to build a meaningful training dataset and, last but not least, share some insights about the model itself.

Let’s start by deep-diving into the problem and the overall requirements.

The requirements

Our context forced us to deal with strict requirements that significantly impacted the creation of the training dataset and the setup of the final model. Among them:

  • Multi-language text
    We run our business in several countries, covering a dozen languages: the input string could be in any of them. As an additional complexity, many job ads adopt English-based jargon with no counterpart in the original language. For example, one could easily read an Italian ad such as Ricerchiamo un account manager con sede a Roma (“We are looking for an account manager based in Rome”), where the expression account manager is left untranslated.
  • High speed
    Given the impressive number of jobs our systems manage every day, the model should be fast enough not to impact the overall processing time. The resulting model must therefore strike a clear trade-off between accuracy and execution speed.
  • Simple categorization
    Categorizing a job position is complex even for humans; job titles can be misleading or belong to more than one category. For instance, a Legal Secretary could be classified as an administrative job, a legal one, or both. Complex hierarchical category structures, such as ESCO and ISCO, have been developed to represent the actual ramification of job positions on the market. However, adopting one of those classification systems would result in more complex data management without clear gains in clarity and ease of use. Most of our clients take a more straightforward approach and define a flat categorization based on a few classes. We can therefore think of the category as a simplified view of the world: when the model shows low confidence, it is often a consequence of this imperfect definition of the job category structure rather than of the model’s inaccuracy. What follows is the list of job categories used in our company, which our model should predict:
Accounting, Administration, Analyst, Construction, Consulting, Customer Service, Design, Education, Engineering, Finance, Food, HR, Healthcare, Hospitality, IT, Insurance, Legal, Logistics, Management, Manufacturing, Marketing, Media, Medical, Retail, Sales, Science, Security, Services, Skilled Labor, Social Care, Sports, Transportation

Our strategy for a meaningful training dataset

We constructed an entire training dataset from scratch, composed of input job titles and output job categories. The second article in this series explains our approach to creating a high-quality training dataset, which resulted in 2 million job titles with their corresponding job categories. Half of the data is in English, while the other half consists of Italian, German, and Portuguese titles, mainly obtained by translating a subset of the English ones.

We trained a Natural Language Processing classifier on the above dataset to learn and generalize the connection between a job title and its category. The core of the model is a multilingual embedding encoder that returns close embeddings for texts with similar meanings across multiple languages. This approach helped us generalize the model to languages not included in the training data (for a more in-depth explanation of the model definition and training, take a look at the third article in this series).
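As an illustration, the overall pipeline (encode the title, then classify the encoding) can be sketched as follows. This is a minimal stand-in, not our production model: the multilingual embedding encoder is replaced here by character n-gram TF-IDF features, and the tiny training set is invented for the example.

```python
# Sketch of the title -> category pipeline. Character n-gram TF-IDF is a
# crude, language-agnostic stand-in for the multilingual embedding encoder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (the real dataset holds ~2M labeled titles).
titles = [
    "surgical nurse", "oncology nurse specialist", "registered nurse",
    "software engineer", "backend software developer", "java programmer",
]
categories = ["Medical", "Medical", "Medical", "IT", "IT", "IT"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(titles, categories)

print(clf.predict(["nurse practitioner"])[0])  # Medical
```

In production, the TF-IDF step would be replaced by the multilingual encoder, so that titles in any supported language map to the same embedding space before classification.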

The test dataset and the model performance

Despite the cleanup process, the training dataset’s noisy nature discouraged us from extracting a subsample for testing purposes. Therefore, to test the performance of the job category classifier, we set up a custom evaluation dataset of thousands of manually labeled job titles in multiple languages, including English, Italian, German, and Portuguese. To further extend the dataset and prepare an evaluation set per language (while still limiting the manual effort), we translated the English job titles into other languages such as French, Dutch, and Polish. Each dataset was class-balanced, with hundreds of job titles per category. We also combined clean job titles, such as Warehouse worker, with less specific but more common ones, such as Warehouse worker needed in Los Angeles. This approach was key to testing the model’s performance on attributes not directly related to the job category.
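The noisier variants can be generated mechanically from the clean titles; a minimal sketch of this augmentation step, with invented templates and locations:

```python
import random

# Templates that wrap a clean title in category-irrelevant context
# (locations, urgency). All template strings and places are invented examples.
TEMPLATES = [
    "{title} needed in {place}",
    "{title} - {place} - immediate start",
    "Hiring now: {title} ({place})",
]
PLACES = ["Los Angeles", "Berlin", "Milan"]

def make_noisy(title: str, rng: random.Random) -> str:
    """Return a noisier variant of a clean job title."""
    template = rng.choice(TEMPLATES)
    return template.format(title=title, place=rng.choice(PLACES))

rng = random.Random(0)
print(make_noisy("Warehouse worker", rng))
```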

It is worth mentioning that even our team’s effort to manually label the evaluation dataset led to a discrepancy of 10%: in one case out of ten, annotators did not agree on the correct category, due to the loose boundaries of the category definitions.
As a result, a more flexible approach was clearly needed when evaluating the model’s performance.
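Measuring that disagreement amounts to comparing two annotators’ labels title by title; a small sketch with toy labels:

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of titles on which two annotators chose the same category."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Toy example: two annotators disagree on 1 title out of 10.
ann_1 = ["Medical"] * 9 + ["Healthcare"]
ann_2 = ["Medical"] * 9 + ["Medical"]
print(agreement_rate(ann_1, ann_2))  # 0.9
```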

Below is the resulting performance in terms of percentage accuracy within the top 1, 3, and 5 guesses (accuracy@k is the share of titles whose true category appears among the model’s k most confident predictions):

╔════════════╦════════════╦════════════╦════════════╦════════╗
║ language   ║ accuracy@1 ║ accuracy@3 ║ accuracy@5 ║ titles ║
╠════════════╬════════════╬════════════╬════════════╬════════╣
║ English    ║         88 ║         97 ║         99 ║   1953 ║
║ Italian    ║         82 ║         95 ║         97 ║    704 ║
║ German     ║         83 ║         94 ║         97 ║   1704 ║
║ Portuguese ║         81 ║         95 ║         97 ║    688 ║
║ Dutch      ║         78 ║         91 ║         94 ║   1922 ║
║ French     ║         81 ║         93 ║         96 ║   1935 ║
║ Polish     ║         76 ║         92 ║         95 ║    687 ║
╚════════════╩════════════╩════════════╩════════════╩════════╝
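The accuracy@k figures above can be computed directly from the model’s predictions ranked by confidence; a minimal sketch with toy data:

```python
def accuracy_at_k(ranked_predictions, true_labels, k):
    """Percentage of titles whose true category is among the top-k guesses."""
    hits = sum(
        true in ranked[:k]
        for ranked, true in zip(ranked_predictions, true_labels)
    )
    return 100 * hits / len(true_labels)

# Toy example: categories ranked by model confidence for three titles.
ranked = [
    ["Medical", "Healthcare", "Science"],  # correct at rank 1
    ["Hospitality", "Food", "Retail"],     # correct at rank 2
    ["IT", "Engineering", "Analyst"],      # correct not in top 3
]
truth = ["Medical", "Food", "Design"]

print(accuracy_at_k(ranked, truth, 1))  # 33.33...
print(accuracy_at_k(ranked, truth, 3))  # 66.66...
```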

We reached the highest accuracy for English (88%), with slightly lower values for Italian, German, Portuguese, and the other languages. This was somewhat expected and reflects the choices we made: an English-based training dataset, data enrichment performed by translating texts into Italian, German, and Portuguese, and the prediction of other languages resting entirely on the shoulders of the multilingual embedding.

Even considering the 10% discrepancy experienced by human annotators, the model accuracy was very high, especially for the top four languages. Moreover, accuracy@3 is above 90% for all the tested languages, meaning the model predicted the correct job category within three guesses.

A remarkable result was the generalization obtained on languages with no data in the training set, such as Dutch, Polish, and French. This result was entirely due to the multilingual capability of the embedding encoder and gave us confidence to apply the job categorizer to other, not yet tested languages.

We didn’t see any difference in accuracy between clean and dirty titles in the evaluation dataset: the model learned to focus on the words most representative of the job position, giving less importance to terms unrelated to the job category, such as the workplace, benefits, or other stop words.

Analyzing misclassifications, we noted that most of the time, when the model was wrong, its predictions made sense:

╔══════════════════════════════╦═════════════════╦═════════════════╗
║ Title                        ║ True            ║ Predicted       ║
╠══════════════════════════════╬═════════════════╬═════════════════╣
║ Corporate Tax Assistant      ║ Accounting      ║ Administration  ║
║ Analytics Consultant         ║ Analyst         ║ Consulting      ║
║ Director Architecture        ║ Construction    ║ Management      ║
║ Lineman Aerial / CDL Class A ║ Construction    ║ Transportation  ║
║ Inspection Electrician       ║ Construction    ║ Skilled Labor   ║
║ Scaffolder                   ║ Construction    ║ Logistics       ║
║ Interior Designer/Architect  ║ Design          ║ Construction    ║
║ Child Care Teacher.          ║ Education       ║ Social Care     ║
║ Art/Photography Teacher      ║ Education       ║ Design          ║
║ WARRANTY CLAIMS ENGINEER     ║ Engineering     ║ Insurance       ║
║ Equity Release Advisor       ║ Finance         ║ Consulting      ║
║ CREDIT ADMIN SPECIALIST      ║ Finance         ║ Accounting      ║
║ cooks /servers               ║ Food            ║ Hospitality     ║
║ Oncology Nurse Specialist    ║ Healthcare      ║ Medical         ║
║ diet therapy                 ║ Healthcare      ║ Medical         ║
║ Catering Assistant (Casual)  ║ Hospitality     ║ Food            ║
║ Digital Copyright Assistant  ║ Media           ║ Marketing       ║
║ Part time Childminder        ║ Services        ║ Social Care     ║
║ CNC Pattern Makers           ║ Skilled Labor   ║ Manufacturing   ║
╚══════════════════════════════╩═════════════════╩═════════════════╝

It is indeed hard to consider predictions such as a Child Care Teacher classified as Social Care instead of Education, or an Oncology Nurse Specialist as Medical instead of Healthcare, truly wrong. This effect was also evident in the confusion matrix, where most of the off-diagonal non-zero elements belonged to similar job categories, such as Healthcare and Medical, Food and Hospitality, or Manufacturing and Skilled Labor:
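Tallying (true, predicted) pairs is all it takes to spot these clusters of similar categories; a minimal sketch with toy labels mirroring the confusions above:

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs; off-diagonal entries are confusions."""
    return Counter(zip(true_labels, predicted_labels))

# Toy example: Healthcare titles drifting into Medical, Food into Hospitality.
truth = ["Healthcare", "Healthcare", "Healthcare", "Food", "Food"]
preds = ["Healthcare", "Medical", "Medical", "Hospitality", "Food"]

for (t, p), n in sorted(confusion_counts(truth, preds).items()):
    print(f"{t:>10} -> {p:<12} {n}")
```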

[Figure: Confusion matrix for a subsample of classes]

More information about the model implementation is available in the third article of this series.

Conclusion

Given those results, we determined that to improve performance, more effort should be put into a better-structured category definition than into improving the quality of the training data or developing a more complex model.

In the end, our model reached a satisfying accuracy and is now used in production to enrich millions of job ads every day with job category information, ultimately helping people get the right job.

The contributors to this article are Federico Guglielmin, Silvio Pavanetto, Stefano Rota, and Paolo Santori.
