Using NLP at scale to better help people get the right job — Part 3: model definition and training

Stefano Rota
Jobtome Engineering

--

This is the last of a series of articles on how we at Jobtome classify millions of job ads daily to enrich our inventory and help people get the right job. You can read part one here and part two here.

In a previous post, we described the need at Jobtome to create a model to classify job postings into job position categories, i.e., return — given a job title as input — the most probable category among tens of classes. We also discussed the steps to create a high-quality training dataset.

This article details our classification model (technically, a short-text multi-class classification task in Natural Language Processing) and some of the choices we made during the training phase.

Our business imposed strict requirements that strongly shaped the model creation phase: the need for blazing-fast predictions, the capability to handle multilingual input data, and an inherently fuzzy definition of the job categories. In addition, we had to account for two specific features of the input training data: it was small and noisy.

A small training dataset
Our training dataset covered only a tiny percentage of all the job positions available on the market. Consequently, the model could be asked to predict a job title composed of words never seen during training. Moreover, some languages were under-sampled, and others were not present at all. Imagine training a model only on Software Developer job offers and then asking it what a Code Programmer is, or (even worse) showing the model only Accounting job positions in English and expecting a good prediction on a Księgowy / Księgowa job (Accountant in Polish).

Noisy training dataset
We didn’t have a first-class labeled dataset: we built our own from scratch using multiple data sources, including some poorly reliable ones. Even though we applied a cleaning process to improve its quality, we still ended up with a noisy training dataset that included a subset of wrongly labeled data.

Model

We tested multiple text classification models, from classical Machine Learning algorithms to more advanced neural network architectures. All the models we implemented, however, shared the same three-step structure:

  • pre-processing — from input raw text to cleaned text
  • embedding — from cleaned text to vector of numbers
  • classifier — from vector to class attribution

While creating the training dataset, we developed a simple model to classify English job titles, without paying too much attention to accuracy and speed. Following the above schema, the model was composed of three modules. The first was a classical text pre-processing step (stop-word removal and lower-casing) applied to the input job titles with standard Python libraries such as spaCy and NLTK. The second converted the cleaned text into a sentence embedding by summing the word embeddings obtained from a pre-trained Word2Vec model, loaded through the Gensim library. Finally, the classification module was a Random Forest Classifier, implemented with the scikit-learn library and trained on an intermediate version of the training dataset. The purpose of this simple model was to discard unusable data from the training dataset, so we put little effort into tuning and optimization.
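For reference, here is a minimal sketch of that baseline, with the cleaning reduced to lower-casing for brevity; the Word2Vec model name and the train_titles / train_labels variables are illustrative placeholders, not our actual setup:

import numpy as np
import gensim.downloader
from sklearn.ensemble import RandomForestClassifier

# Pre-trained Word2Vec vectors (an assumed model from the Gensim downloader)
w2v = gensim.downloader.load("word2vec-google-news-300")

def embed(title):
    # Sum the Word2Vec vectors of the words the embedding model knows
    words = [w for w in title.lower().split() if w in w2v]
    return np.sum([w2v[w] for w in words], axis=0) if words else np.zeros(300)

# train_titles / train_labels: hypothetical intermediate training dataset
X = np.stack([embed(t) for t in train_titles])
clf = RandomForestClassifier().fit(X, train_labels)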

The final model, on the other hand, had to fulfill all the requirements mentioned above. We decided to implement all three steps with the Keras framework because of its great community and the vast number of pre-trained models available on TensorFlow Hub.
Here is a deeper explanation of every single step of the final model.

Pre-processing
We adopted a pre-processing step composed of standard techniques: lower-casing job titles, removing punctuation and single characters, and deleting specific patterns such as HTML tags. We also dropped some common patterns with poor informative value, such as Hiring Immediately, Experience required, Part/Full time, or tokens like the expected salary. We chose not to remove the most common stop-words (conjunctions, prepositions, etc.) because we noticed that doing so did not improve the model’s performance.
We wrapped all the cleaning steps in a single custom function and turned it into a Lambda layer of our Keras model, as sketched below.
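A minimal sketch of such a layer, assuming the cleaning is implemented with TensorFlow string ops (the real list of patterns we dropped was more extensive):

import tensorflow as tf

def preprocess(title):
    title = tf.strings.lower(title)
    title = tf.strings.regex_replace(title, r"<[^>]+>", " ")  # HTML tags
    title = tf.strings.regex_replace(title, r"hiring immediately|experience required", " ")
    title = tf.strings.regex_replace(title, r"[^\w\s]", " ")  # punctuation
    title = tf.strings.regex_replace(title, r"\b\w\b", " ")   # single characters
    return tf.strings.strip(tf.strings.regex_replace(title, r"\s+", " "))

# The cleaning function becomes a regular layer of the model
cleaning_layer = tf.keras.layers.Lambda(preprocess)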

Embedding
Most of the time, text embeddings pre-trained on a large corpus greatly improve the model’s accuracy on a text classification task:

the use of pre-trained embeddings could help our model achieve better performance even on unseen data, compensating for the reduced size of our training dataset.

On TensorFlow Hub, we selected a pre-trained encoder that could satisfy our multilingual requirement: the Multilingual Sentence Encoder. The encoder accepts a variable-length text (in our case, a job title) and outputs a 512-dimensional vector, namely a sentence embedding.
The essential feature of the encoder is multilingualism: it was trained so that texts with similar meanings have close embeddings across different languages. This capability helped our model generalize even to languages not included in the training data.
As an additional advantage, the language of the input text was not a mandatory parameter: this allowed us to skip a time-consuming language detection step.
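As a quick illustration, the encoder can be loaded and queried as follows (the TensorFlow Hub URL and version are our assumption, not necessarily the exact model we deployed):

import tensorflow_hub as hub
import tensorflow_text  # registers the custom ops the encoder needs

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

# Similar meanings in different languages map to nearby vectors
embeddings = encoder(["Accountant", "Księgowy"])
print(embeddings.shape)  # (2, 512)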

The embedding encoder is available in two architectures: Convolutional Neural Network and Transformer. We adopted the former for speed reasons: the CNN architecture is much lighter and faster than the Transformer one. Moreover, job titles are simple sentences of no more than 10–12 words, often without a verb (e.g., Full time Data Scientist). With such input, we decided to address the problem with a Bag-of-Words approach, in which the order of words carries no additional information; the ability to model word sequences, such as that of Transformers, was not an added value in our specific use case.
A more detailed description of the encoder model can be found at the link.

Classifier
The output of the previous step became the input of a classifier model. On top of the encoder, we added three hidden dense layers of 256, 128, and 64 neurons, all with a ReLU activation function. The fourth and final dense layer used a Softmax activation, with one neuron per class, to output a score for each job category: the most probable category received the highest score, and the other classes received lower ones.

Putting it all together
The three steps above represent the essence of our text classification model. The main code of the Keras Sequential model is shown below:

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops required by the multilingual encoder

model = tf.keras.Sequential([
    # Input Layer: one raw job-title string per example
    tf.keras.layers.InputLayer(input_shape=(), dtype="string"),
    # Lambda Layer: the custom cleaning function
    tf.keras.layers.Lambda(preprocess),
    # Sentence Embedding Layer: the frozen pre-trained multilingual encoder
    hub.KerasLayer(encoder_url, trainable=False),
    # Classifier Layers
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.1),
    # Softmax Output Layer (n classes = 32)
    tf.keras.layers.Dense(32, activation="softmax"),
])

It is worth mentioning that, in the training phase, we applied the pre-processing and embedding layers only once to the whole training dataset and dropped both layers from the model. This way, we avoided re-processing the data at every epoch, which dramatically lowered training time.
The three dropout layers, on the contrary, were present only during training, to reduce overfitting.
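Concretely, the trick looks like this sketch, where train_titles is a hypothetical tensor of raw job titles and encoder / preprocess are the components shown earlier:

# Clean and embed the whole training set once, before training starts
train_embeddings = encoder(preprocess(train_titles))

# The classifier head alone (the dense/dropout/softmax stack above, without
# the input, Lambda, and embedding layers) trains on the 512-dim vectors
head = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(32, activation="softmax"),
])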

Finally, we split the dataset into a train and a validation set to evaluate the accuracy metric, and stopped the training process once convergence was reached. Comparing the 70% accuracy obtained on the validation set during training with the >80% accuracy measured on the hand-curated evaluation dataset from our previous article confirmed the noise in our training data, as reported at the beginning. Hyper-parameters, such as the number of dense layers, the number of neurons in each layer, the dropout percentages, the gradient descent optimizer, and the batch size, were estimated by training multiple models and comparing their accuracy on the validation set.
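As a sketch of that training setup (the optimizer, batch size, split ratio, and patience below are illustrative assumptions, not our tuned values):

head.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
head.fit(
    train_embeddings,
    train_labels,  # hypothetical integer class labels
    validation_split=0.2,
    batch_size=256,
    epochs=100,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)],
)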

One noteworthy choice was to keep the encoder weights frozen during the training phase.

Usually, adjusting the embedding weights is advisable to make the encoder more specific to the task domain. We chose not to because of the reduced size of our training dataset and its noisy labels. Furthermore, the pre-trained encoder was precisely what allowed our model to generalize to unseen text, and fine-tuning it on our data could have compromised that ability.

A toy model showing our Keras model implementation and training on a subsample of our training dataset is available here.

The contributors to this article are Federico Guglielmin, Silvio Pavanetto, Stefano Rota and Paolo Santori.
