Domain-specific NLP models

Francisco ESPIGA
Published in RavenPack
Oct 1, 2021 · 6 min read

Using Masked Language Modeling to train BERT from scratch

Introduction

At RavenPack, our Machine Learning team always reviews the latest innovations in the field of NLP to assess if they can bring differential value to our clients.

One of our lines of research is creating domain-specific models for the Financial News universe, for a diverse set of downstream tasks such as sentiment analysis and named entity recognition (NER), and we plan to share what we have discovered along the way in a series of articles.

In this first article of the series on domain-specific NLP models, we will walk through how we created a BERT-based language model using our own corpus of data, the rationale behind it, and all the steps needed to create your own.

Language models

Language models are a key building block in modern Natural Language Processing. Mathematically, a language model is a probability distribution over words and sequences of words.
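For a sequence of words, that distribution can be written with the standard chain-rule factorization (this is the general formulation, not something specific to our setup):

```latex
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
```

BERT-style models estimate something slightly different, the probability of a token given its context on both sides, which is precisely what the masked training objective described later in this article exploits.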

In recent years, there has been extensive work in the NLP community towards building models for every language, like German, Spanish, and even Esperanto, as you can see in the tutorial that inspired this article.

Domain-specific language models

One could think of language models as a mapping between the numerical representation of words and a latent space.

In the general case (English, Spanish, …), we should expect language models to be broader, as they try to create a latent space able to capture all the different registers of a language (elevated, formal, mundane and even vulgar). We will not delve into multi-language models, although the principle would be similar to some extent.

Instead, we will focus on the opposite: creating a language model for a very specific knowledge domain, in our case, Financial News. Although there was already some previous work for this particular domain, we wanted to leverage our massive corpus of data to create a model that we could completely tailor to our needs.

Our data

Through more than 18 years serving the clients of RavenPack, we have built a rich corpus of annotated Financial News data comprising billions of stories.

This annotated data contains entities, events derived from our own proprietary taxonomy, part-of-speech tags and everything one could dream of as a Machine Learning Engineer.

Naturally, this is a powerful asset to bring state-of-the-art NLP to our products and something that we will use in the next articles of the series. But, for the particular case of language models, we can directly leverage our data in an unsupervised manner: Masked Language Modeling.

Translating the data

The first step to building a language model is to translate the words into tokens that the model can interpret and learn from.

Financial news vocabulary is very distinctive, but also very rich. To provide flexibility to this numerical translation process, or tokenization, we worked at the subword level, that is, decomposing a word into smaller chunks of letters, instead of indexing every word with a different integer number.

This results in a smaller vocabulary and fewer occurrences of unknown words: in the extreme case where each letter of the alphabet were a token, these small building blocks could represent virtually any word. Of course, going that far is not reasonable, because sequences would then be as long as the number of characters in a sentence, and predicting the next token in a sequence would become a much harder, if not impossible, task.
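As a rough illustration of the idea, here is how an off-the-shelf byte-level BPE tokenizer (the public roberta-base, not our custom one) breaks words into pieces; the specific words are just examples:

```python
# Illustrative only: the public roberta-base tokenizer, not our domain-specific one.
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# A rare, domain-specific word is decomposed into smaller subword pieces that
# already exist in the vocabulary, so it never becomes an "unknown" token...
print(tokenizer.tokenize("securitization"))

# ...whereas a frequent word is usually kept as a single token.
print(tokenizer.tokenize("bank"))
```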

Prior to tokenization, we preprocessed our corpus by removing accents, lowercasing, and splitting on whitespace.
Our vocabulary size was 50,265 distinct tokens, with 5 positions reserved for special tokens such as <mask> (essential for Masked Language Modeling) and the beginning- and end-of-sentence markers.

Thanks to the HuggingFace tokenizers library, creating our own was very easy. Starting from a collection of files (.txt with one sentence per line), the tokenizer was created automatically.
The library also provides wrappers to train from in-memory data, but since we use a fraction of our corpus, roughly 20 million sentences that could be expanded with more data points at any time, we preferred to keep the data stored in files, at the cost of slightly longer training times.
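A minimal sketch of that step, close to the HuggingFace tutorial mentioned above; the corpus path and output directory are placeholders for our internal setup, and hyperparameters like min_frequency come from that tutorial rather than from our pipeline:

```python
# Sketch of the tokenizer training step (paths are placeholders).
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# One sentence per line, already preprocessed (accents removed, lowercased).
paths = [str(p) for p in Path("./corpus").glob("*.txt")]

tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(
    files=paths,
    vocab_size=50265,  # 50,265 distinct tokens, as described above
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # the 5 reserved positions
)

# Writes vocab.json and merges.txt, which transformers can load later on.
tokenizer.save_model("./fin-news-tokenizer")
```

The saved vocab.json and merges.txt can then be loaded back with RobertaTokenizerFast.from_pretrained when assembling the model and the training pipeline.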

Creating the model

With the tokenizer now ready, it was time to build and train the language model. We again used helper functions from the transformers library and chose a RoBERTa model as the base, with some changes to the configuration to make it smaller.

Configuration

As most of our sentences have around 60 words, we decided to reduce the sequence length from 512 tokens to 128. Even accounting for subword tokenization, this covers the majority of cases, and it reduces both training time and model size.

We also changed the number of attention heads and hidden layers from the original 12 to 8 and 6, respectively.
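In transformers terms, those choices translate into a configuration along these lines; any value not discussed in the article (for example type_vocab_size) is an assumption on our part:

```python
# Sketch of the model configuration; values not mentioned in the article are assumptions.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=50265,             # matches the tokenizer vocabulary described earlier
    max_position_embeddings=130,  # 128 tokens + 2, as RoBERTa offsets positions past the padding index
    num_attention_heads=8,        # down from the original 12
    num_hidden_layers=6,          # down from the original 12
    type_vocab_size=1,
)

model = RobertaForMaskedLM(config=config)
print(f"{model.num_parameters():,} parameters")
```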

Preparing the dataset

As mentioned earlier, we used Masked Language Modelling (MLM) to train our language model. This is a clever, unsupervised way to train the model, in which a fraction of the input tokens is masked.

This forces the model to learn the inherent relationships between tokens (their context): since some of the inputs are “noise”, the model has to reconstruct the true sequence using the remaining information.

Language model training using MLM

In the original BERT paper, 15% of the tokens are selected at random. Of those, 80% are replaced by the <mask> token, 10% by another token chosen at random, and the remaining 10% are left unchanged.

This could easily be implemented by hand, but HuggingFace already brings it to the table: we used the DataCollatorForLanguageModeling class from the transformers library to cleanly and easily replace the tokens for the MLM training.

Moreover, the datasets library is an efficient way to store the data in the .arrow format and (re)load it in case training is interrupted.
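A sketch of how those pieces fit together; file names and the tokenizer directory are placeholders for our internal pipeline:

```python
# Sketch of the dataset preparation for MLM (paths are placeholders).
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./fin-news-tokenizer", model_max_length=128)

# Plain-text files with one sentence per line.
dataset = load_dataset("text", data_files={"train": "./corpus/train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized.save_to_disk("./tokenized-corpus")  # stored as .arrow; reload with load_from_disk

# Dynamically masks 15% of the tokens in each batch, following the BERT recipe.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```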

Training the model

Now that the data is ready, it is time to train our language model. As with the rest of the steps in this article, we leveraged HuggingFace's wrappers and helper functions.

Our advice, given the state of the library when we trained our model, is to train models in the PyTorch version, as most features are released first for that framework and only later for TensorFlow.
At RavenPack, we use TensorFlow for our deployments, but storing the model in PyTorch and then migrating it to ONNX or TensorFlow is easy.
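As a rough illustration of that migration path (model paths are placeholders), a PyTorch checkpoint can be loaded straight into the corresponding TensorFlow class and saved again:

```python
# Sketch: converting a PyTorch-trained checkpoint to TensorFlow (paths are placeholders).
from transformers import TFRobertaForMaskedLM

tf_model = TFRobertaForMaskedLM.from_pretrained("./fin-news-roberta", from_pt=True)
tf_model.save_pretrained("./fin-news-roberta-tf")
```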

We chose the largest batch size possible and trained the model for 4 epochs from scratch. The rest of the parameters were left at their defaults.

Once the trainer helper functions are set up, we can start the training loop. We save a checkpoint every 10,000 steps, as this is a lengthy process (around 100 hours).
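A sketch of that setup, reusing the model, data collator and tokenized dataset from the previous snippets; the batch size and output paths are placeholders, since the article only states that we used the largest batch size possible:

```python
# Sketch of the training loop; model, data_collator and tokenized come from the previous snippets.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./fin-news-roberta",
    num_train_epochs=4,              # 4 epochs from scratch
    per_device_train_batch_size=64,  # placeholder: in practice, the largest batch size possible
    save_steps=10_000,               # checkpoint every 10,000 steps
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized["train"],
)

trainer.train()
trainer.save_model("./fin-news-roberta")
```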

Evaluating the results

training logs

Inspecting the logs in TensorBoard, we can see how the training loss decreases steadily until it stabilizes.

Unfortunately, the evaluation logs were overwritten in the process, but they were inspected before building this graph and showed no discrepancies that could point to overfitting.

Recap and next steps

  • In this article, we have walked through the necessary steps to create a custom language model with your own corpus.
  • Using subword tokenization brings flexibility to the translation process, reducing the vocabulary size and the number of unknown words in the corpus.
  • Using masked language modeling enables training the model with a large, unlabelled corpus of data.
  • In our next article, we will show how, thanks to this model, we were able to create meaningful sentence embeddings that are relevant contextually and domain-wise.

Have we sparked your curiosity about what we do at the RavenPack ML team? We are hiring! Check our open positions and reach out to us if you think you can be a contributor to the team.
