Using NLP at scale to better help people get the right job — Part 2: the importance of a high-quality dataset
This is the second of a series of articles on how we at Jobtome classify millions of job ads daily to enrich our inventory and help people get the right job. You can read part one here.
In our previous post, we described the need at Jobtome to automatically classify generic job postings into a well-defined set of job position categories. The model used for this should therefore associate each job title with its corresponding category, assigning, for example, a Security Officer position to the Security class or a Payroll Clerk position to the Accounting category.
If I had eight hours to chop down a tree, I’d spend six hours sharpening my axe
So goes a saying often attributed to President Abraham Lincoln, and we believe it also applies to Machine Learning projects: a model is only as good as its training data. But collecting good data is hard, especially for a classification problem, and one of the main challenges we faced was the lack of a solid training dataset. Consequently, we invested time in building our own list of job titles associated with the corresponding job categories. What follows describes the steps we took to create the dataset; if you want to learn more about the actual training and model setup phases, refer to the last article of the series.
Gather well-known data
We started by gathering English job titles from our clients and by crawling job posting websites, taking note of the categorizations they already used. Our own set of job categories is specific to us, so our first task was to map those external categories onto ours.
Our initial plan at this stage was to use the resulting dataset directly to train our English classification model. However, the dataset proved to be very noisy and strongly imbalanced. We therefore added a meticulous cleaning process to improve the overall data quality before training our model.
A correct label attribution
The collected dataset contained several wrong attributions due to two significant problems: ambiguity in mapping the original category into our own and poor labeling of the available data sources.
To solve the first issue, we clustered the titles of each category by semantic meaning. First, we computed a sentence embedding for each title by summing the embeddings of its words (using a Word2Vec embedding model trained on Google News in English). We then used a k-means algorithm to form 10 clusters per category. Finally, we extracted the top keywords from each cluster’s job titles to quickly identify the class it represented, and manually labeled each cluster. Labeling almost 350 clusters (number of categories x 10 clusters per category) was tedious, but it helped us correct approximately 15% of the category labels by fixing obviously inaccurate attributions.
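The two ends of this step, the summed sentence embedding and the per-cluster keyword extraction, can be sketched as follows. This is only a minimal illustration: the tiny `TOY_VECTORS` table is a hypothetical stand-in for the pre-trained Word2Vec model, and the helper names are ours. The clustering in between could be done with any k-means implementation.

```python
from collections import Counter

# Toy 3-dimensional word vectors standing in for the pre-trained Word2Vec
# model (hypothetical values, for illustration only).
TOY_VECTORS = {
    "security": [1.0, 0.0, 0.0],
    "officer": [0.5, 0.5, 0.0],
    "payroll": [0.0, 1.0, 0.0],
    "clerk": [0.0, 0.5, 0.5],
}

def sentence_embedding(title, vectors, dim=3):
    """Sum the vectors of the known words in a title; unknown words are skipped."""
    total = [0.0] * dim
    for word in title.lower().split():
        vec = vectors.get(word)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total

def top_keywords(titles, n=3):
    """Return the n most frequent words across the titles of one cluster."""
    counts = Counter(word for title in titles for word in title.lower().split())
    return [word for word, _ in counts.most_common(n)]

embedding = sentence_embedding("Security Officer", TOY_VECTORS)
keywords = top_keywords(
    ["Security Officer", "Security Guard", "Night Security Officer"], n=2
)
```

With keywords like these in hand, a human reviewer can label a whole cluster at a glance instead of reading every title in it.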
As an additional improvement, we applied a k-nearest neighbors algorithm to the generated sentence embeddings to spot titles whose category was inconsistent with that of their closest neighbors. In such cases, we assumed a wrong attribution and switched the category.
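A neighbor-based consistency check of this kind might look like the sketch below. The value of k, the strict-majority rule, and the choice of the neighbors' majority label as the new category are our assumptions, as is the function name; the toy 2-D points stand in for real sentence embeddings.

```python
import math
from collections import Counter

def relabel_by_neighbors(embeddings, labels, k=3):
    """Switch a label when a strict majority of the k nearest neighbors disagrees."""
    new_labels = []
    for i, emb in enumerate(embeddings):
        # Rank all other points by Euclidean distance to point i.
        others = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: math.dist(emb, embeddings[j]),
        )
        # Votes always use the original labels, not the corrected ones.
        votes = Counter(labels[j] for j in others[:k])
        if not votes:
            new_labels.append(labels[i])
            continue
        top_label, top_count = votes.most_common(1)[0]
        if top_label != labels[i] and top_count > k / 2:
            new_labels.append(top_label)
        else:
            new_labels.append(labels[i])
    return new_labels

# Toy 2-D "embeddings": one mislabeled point sits inside the Analyst group.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["Analyst", "Analyst", "Science", "Analyst",
          "Science", "Science", "Science", "Science"]
fixed = relabel_by_neighbors(points, labels, k=3)
```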
Finally, we noticed some misclassifications that could be easily fixed with the help of a simple combination of regex rules and if-then conditions: as an example, the title Data Scientist was often labeled as Science instead of our Analyst category. These last two steps improved the resulting labeling for about 10% of the training set.
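Such rule-based overrides can be expressed as a short list of patterns. The sketch below encodes only the Data Scientist example mentioned above; the `RULES` list and the `apply_rules` helper are hypothetical names introduced for illustration.

```python
import re

# Hypothetical override rules: (compiled pattern, category to assign).
RULES = [
    (re.compile(r"\bdata scientist\b", re.IGNORECASE), "Analyst"),
]

def apply_rules(title, category, rules=RULES):
    """Return the category of the first matching rule, else the original category."""
    for pattern, target in rules:
        if pattern.search(title):
            return target
    return category

corrected = apply_rules("Senior Data Scientist", "Science")
untouched = apply_rules("Laboratory Assistant", "Science")
```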
Clean job titles
As a third step, we focused on improving the quality of the input job titles. We applied standard text pre-processing techniques to normalize them: lowercasing, removal of punctuation and single characters, and cleanup of specific tokens (such as HTML tags).
We skipped the step of discarding common stop-words (conjunctions, prepositions, etc.) to avoid removing critical information from the job title. One additional concern we had about stop-words was the multilingual nature of our exercise, as stop-words might have different significance and relevance in other languages.
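A minimal version of this normalization, assuming simple regular expressions are enough for the tokens involved, could look like this. The function name and the exact patterns are ours; stop-words are deliberately left in place.

```python
import re

def clean_title(title):
    """Lowercase a title, strip HTML tags and punctuation, drop single characters."""
    text = title.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = re.sub(r"[^\w\s]|_", " ", text)    # remove punctuation and underscores
    tokens = [tok for tok in text.split() if len(tok) > 1]  # drop single characters
    return " ".join(tokens)

cleaned = clean_title("<b>Payroll Clerk</b> - DOE up to $55k+ Benefits")
```

Note that stop-words such as "up" and "to" survive the cleaning, while a lone "A" (as in "CDL A") is dropped by the single-character rule.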
Discard low-quality data
At this stage, the English training dataset was almost ready. The last step was to discard low-quality data, resulting in a smaller but more accurate dataset.
We adopted the classical approach of considering word frequency across the whole dataset. We discarded titles containing infrequent words (fewer than 5 occurrences in the entire dataset) to prevent the model from learning spurious connections between a rare word and a category.
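The frequency filter can be sketched in a few lines. We assume here that titles are already cleaned and whitespace-tokenized; the function name is hypothetical, and the default threshold mirrors the 5 occurrences quoted above.

```python
from collections import Counter

def drop_rare_word_titles(titles, min_count=5):
    """Keep only titles whose every word occurs at least min_count times overall."""
    counts = Counter(word for title in titles for word in title.split())
    return [
        title for title in titles
        if all(counts[word] >= min_count for word in title.split())
    ]

titles = ["security officer"] * 5 + ["payroll clerk"] * 5 + ["security officer zzyx"]
kept = drop_rare_word_titles(titles)
```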
An additional, more advanced step was to train a simple Random Forest Classifier on the English training dataset after converting job titles into sentence embeddings with the help of the pre-trained Word2Vec model. We then scored each title with that same model and discarded those with low prediction confidence. The whole process helped us eliminate non-relevant titles, such as Caesars rewards representative variable, Home visit crd hfa help me grow futures, Batch record reviewer.
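The filtering part of this step reduces to a threshold on the classifier's highest class probability. The sketch below assumes the probabilities have already been computed (for instance, one row per title from a classifier's probability output); the threshold of 0.5 and the example probabilities are illustrative assumptions, not values from our pipeline.

```python
def filter_by_confidence(titles, probabilities, threshold=0.5):
    """Keep titles whose highest predicted class probability reaches the threshold."""
    kept = []
    for title, probs in zip(titles, probabilities):
        if max(probs) >= threshold:
            kept.append(title)
    return kept

# Made-up probability rows for illustration: one confident, one ambiguous.
titles = ["Payroll Clerk", "Batch record reviewer"]
probabilities = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
]
confident = filter_by_confidence(titles, probabilities, threshold=0.5)
```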
Deal with multiple languages
Compared to English, rich datasets are much harder to find for other languages. It was also evident that repeating the above steps would have required considerable effort and a good knowledge of each language. Therefore, we decided to collect more unique data only for Italian and to translate a subsample of the English training dataset (10k instances per category) into three other languages relevant to us: German, Portuguese, and (again) Italian. This multilingual enrichment step was not strictly necessary, thanks to the multilingual capability of our embedding encoder (more details in the model definition). Nevertheless, it helped improve the performance of the model slightly.
Balance classes
Due to the way it was built, the dataset was highly class-imbalanced, reflecting the job market and, more specifically, our business. For example, we handled classes with millions of jobs, such as Healthcare and IT, and categories with only hundreds of jobs, such as Sports. As a result, we had both over-represented and under-represented classes.
To deal with over-represented classes, instead of randomly discarding job titles, we created many clusters per category and downsampled titles from each cluster. In this way, a category was not flooded with common job positions but composed of heterogeneous job titles. For example, the Nurse job position was predominant in the Healthcare category. By adopting the above method, we discarded many Nurse-like job offers, giving more weight to less common titles in the Healthcare class, such as Massage Therapist.
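Assuming each title of a category already carries a cluster id, the downsampling can be sketched as a cap on the number of titles kept per cluster. The cap value, the fixed seed, and the function name are assumptions made for the example.

```python
import random
from collections import defaultdict

def downsample_by_cluster(titles, cluster_ids, max_per_cluster, seed=0):
    """Keep at most max_per_cluster randomly chosen titles from each cluster."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for title, cluster_id in zip(titles, cluster_ids):
        groups[cluster_id].append(title)
    sampled = []
    for cluster_id in sorted(groups):
        members = groups[cluster_id]
        if len(members) > max_per_cluster:
            members = rng.sample(members, max_per_cluster)
        sampled.extend(members)
    return sampled

# A Healthcare-like category dominated by one cluster of Nurse titles.
titles = [f"Nurse {i}" for i in range(8)] + ["Massage Therapist"]
cluster_ids = [0] * 8 + [1]
balanced = downsample_by_cluster(titles, cluster_ids, max_per_cluster=2)
```

Small clusters pass through untouched, so rare titles keep their place while dominant ones are thinned out.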
Similar to data augmentation in computer vision, we augmented our text data for the under-represented classes by generating slight variations of the original titles. For example, we translated a job title from language A to language B and then back, which in some cases produced a new title using different synonyms. The Italian job title Addetto vendita all’ingrosso, for instance, could be translated into English as Wholesales and back to Italian as Grossista, enriching the Retail class with one more Italian title. Of course, the pre-trained embedding encoder should already account for synonyms, given its ability to return similar embeddings for similar terms, as we will explain in the linked post. However, with this method we noted an improvement in the model predictions for under-represented classes compared to the result obtained with classical oversampling techniques.
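The back-translation idea can be sketched independently of any particular translation service. In the example below the two translators are passed in as plain functions, and the lookup tables are toy stand-ins (hypothetical) that reproduce only the example above.

```python
def back_translate(title, to_pivot, from_pivot):
    """Translate a title to a pivot language and back; return it only if it changed."""
    variant = from_pivot(to_pivot(title))
    if variant.strip().lower() == title.strip().lower():
        return None  # the round trip produced nothing new
    return variant

# Toy lookup tables standing in for a real translation service (hypothetical).
IT_TO_EN = {"Addetto vendita all'ingrosso": "Wholesales"}
EN_TO_IT = {"Wholesales": "Grossista"}

augmented = back_translate(
    "Addetto vendita all'ingrosso",
    to_pivot=lambda text: IT_TO_EN.get(text, text),
    from_pivot=lambda text: EN_TO_IT.get(text, text),
)
unchanged = back_translate(
    "Infermiere",
    to_pivot=lambda text: IT_TO_EN.get(text, text),
    from_pivot=lambda text: EN_TO_IT.get(text, text),
)
```

Returning `None` for unchanged round trips avoids adding exact duplicates, which would amount to plain oversampling.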
Summary
All the above techniques, although intuitive, were tested on the evaluation dataset, and all of them improved the resulting model accuracy. For example, we tested the multilingual enrichment by training two models, one only on English data and one on English and Italian job titles. The latter outperformed the former when tested on the Italian evaluation dataset even though, as already mentioned, we didn’t expect an improvement given the multilingual nature of the embeddings.
The total size of the dataset was 2 million rows with 1 million English job titles, 400k in Italian, 300k in German, and 300k in Portuguese. Below, we report a subsample of the training dataset showing a random English job title per category:
╔══════════════════════════════════════════════╦══════════════════╗
║ Title                                        ║ Category         ║
╠══════════════════════════════════════════════╬══════════════════╣
║ Payroll Clerk - DOE up to $55k+ Benefits     ║ Accounting       ║
║ Administrative Assistant                     ║ Administration   ║
║ Vendor Invoicing Analyst                     ║ Analyst          ║
║ Electrician Foreman / Journeyman             ║ Construction     ║
║ Senior Consultant, Change Management         ║ Consulting       ║
║ Customer User Support Specialist             ║ Customer Service ║
║ Graphic Designer - University of Tennessee   ║ Design           ║
║ Maths Teaching Assistant, Bushey             ║ Education        ║
║ Industrial Engineer - Clinical Laboratories  ║ Engineering      ║
║ Group Financial Controller (Maternity cover) ║ Finance          ║
║ Chef Dining Director                         ║ Food             ║
║ Staffing Recruiter (Bilingual)               ║ HR               ║
║ Travel Nurse PACU RN Day Shift Job           ║ Healthcare       ║
║ Bartender - NC State                         ║ Hospitality      ║
║ Remote Full Stack React/Node Engineer        ║ IT               ║
║ Claims Specialist                            ║ Insurance        ║
║ Government Lawyer, Public Law                ║ Legal            ║
║ Forklift Operator Sit Down - 1st shift       ║ Logistics        ║
║ Senior Product Manager                       ║ Management       ║
║ QC Machine Operation - sewing/trimming       ║ Manufacturing    ║
║ UK Community Manager (m/f/d)                 ║ Marketing        ║
║ Senior Video Producer - Branded Content      ║ Media            ║
║ Excellent Midwest Urology Opportunity        ║ Medical          ║
║ Senior Merchandiser - Women & Youth          ║ Retail           ║
║ Senior Sales Negotiator, Marylebone          ║ Sales            ║
║ Laboratory Assistant                         ║ Science          ║
║ Security Officer- Luxury Retail Mall         ║ Security         ║
║ Nanny/Babysitter - Jamie Woods               ║ Services         ║
║ Maintenance Tech II - Equipment              ║ Skilled Labor    ║
║ Residential Childcare Worker (Grantham)      ║ Social Care      ║
║ Self - Employed Personal Trainer             ║ Sports           ║
║ Hiring CDL A Owner Operator - CDL-A Required ║ Transportation   ║
╚══════════════════════════════════════════════╩══════════════════╝
Despite some residual noise, the resulting training dataset was far more reliable than the initial raw data and could finally be used to train our model. Learn more about the final model here.
The contributors of the article are: Federico Guglielmin, Silvio Pavanetto, Stefano Rota and Paolo Santori.