How DBS extends the pre-training of Google BERT with Treasury & Markets domain-specific vocabulary

Dejie Lin · Published in DBS Tech Blog · Aug 16, 2022

This method can also be extended to other domain-specific use cases


Large pre-trained models, like Google BERT (Bidirectional Encoder Representations from Transformers), have achieved great success in many natural language processing (NLP) tasks. However, when applied to specific domains, these models suffer from domain shift and are unable to achieve satisfactory results. While research institutes and corporations have published many domain-specific BERT models, such as BioBERT, SciBERT, FinBERT, and exBERT, there isn’t a BERT that focusses on the Treasury & Markets (T&M) domain.

T&M focusses on pricing, booking, trading, payment, and settlement activities across different asset classes and products. Terms such as long, short, SGD, USD, Murex, and bond carry specific meanings in T&M contexts, and they are just some of the many T&M-specific terms that the generic BERT model can’t understand well.

In this article, I will share how to further extend the pre-training of Google BERT’s base model with T&M domain-specific vocabulary, to create a new domain-specific BERT model: T&M BERT.

Once the T&M BERT model is built, it can be used for T&M-related NLP tasks such as classification, Named Entity Recognition (NER), question answering, and text summarisation. As these tasks use the same BERT code base, this methodology can be easily extended to other domain-specific use cases. For example, it can be applied to assist the annotation introduced in this article: Using Semantic Search to Drive Smart Annotations for Chatbot Models.

Transfer Learning And Domain Adaptation

For businesses that specialise in providing text mining (or text analytics) solutions, which use NLP to transform unstructured text into structured data suitable for analysis, the introduction of transfer learning represents a major paradigm shift in how deep learning models for NLP are developed and trained.

In theory, transfer learning, and domain adaptation in particular, drastically reduces the time required to produce a new model. Instead of training a model for each specific task from scratch, the main idea is to start from a pre-trained base model and continue training it on data that is better suited to the task. In simple terms, transfer learning, an emerging technique in machine learning (ML), helps us solve new tasks using knowledge obtained from previous tasks.

The key goal of domain adaptation is to take a neural network trained on a general dataset and attain good accuracy when it is applied to a specific dataset. To get there, we need to ensure that we have the relevant vocabulary, as this is the cornerstone of many NLP applications. Studies show that inserting domain-specific vocabulary as an adaptation strategy leads to better performance of the resulting language models.

Methodology

All instances of BERT in this article refer to bert-base-uncased. Extending its vocabulary and pre-training it further can be done in three steps.

1) Check vocabulary of a BERT tokenizer: subwords and words

BERT uses a WordPiece tokenizer, which works with a subword-based tokenization algorithm. A subword is a sequence of characters that combines with one or more others to form a word; when a subword does not start a word, it is prefixed with ##. For example, the downloaded BERT model vocabulary doesn’t contain the words Murex and SGD, so the tokenizer splits them into subwords: three tokens for the word Murex, and two for SGD.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

print(tokenizer.tokenize('Murex'))
print(tokenizer.tokenize('SGD'))

['mu', '##re', '##x']
['sg', '##d']

However, it is expected that a natural language model specialised in the T&M field would have these two words in its vocabulary without needing subwords to tokenize them. Thanks to the tokenizer.add_tokens() method, it is easy to insert these two words into the existing vocabulary:

Let’s increase the vocabulary of the BERT model and the tokenizer:

new_tokens = ['murex', 'sgd']
num_added_toks = tokenizer.add_tokens(new_tokens)

print(tokenizer.tokenize('Murex'))
print(tokenizer.tokenize('SGD'))

['murex']
['sgd']
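Adding tokens only changes the tokenizer: the model’s embedding matrix must also be resized so that each new token gets an embedding vector, and those new vectors are only learnt later, during further pre-training. Here is a minimal sketch, assuming the model is loaded with a masked-language-modelling head from the same bert-base-uncased checkpoint:

from transformers import BertForMaskedLM

# Load the base model with a masked-language-modelling head
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Resize the embedding matrix so the newly added tokens get embedding rows;
# these rows are randomly initialised and are learnt during further pre-training
model.resize_token_embeddings(len(tokenizer))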

2) Obtain the list of tokens from the T&M domain, and add only words, not subwords

A common technique is to train a tokenizer of the same nature (e.g. a BERT WordPiece tokenizer) on the domain corpus, which makes it possible to obtain a vocabulary that’s specific to that corpus. The vocabulary can then be added to the existing list.
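As an illustration of that technique, here is a minimal sketch using the train_new_from_iterator method of the fast Hugging Face tokenizers; the corpus list is a hypothetical stand-in for the full T&M corpus, and the vocabulary size is illustrative:

from transformers import AutoTokenizer

# Hypothetical in-memory corpus (in practice: emails, chat messages, contracts, ...)
corpus = [
    'Book the SGD bond trade in Murex before settlement.',
    'The client went long USD and short SGD on the swap.',
]

# Train a new WordPiece tokenizer of the same nature on the domain corpus
base_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
domain_tokenizer = base_tokenizer.train_new_from_iterator(corpus, vocab_size=5000)

# Tokens in the new vocabulary but not in the original one are candidates to add
candidates = set(domain_tokenizer.get_vocab()) - set(base_tokenizer.get_vocab())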

In this instance, we need whole words, not subwords. For this step, instead of using WordPiece, which is the default for BERT, use a well-known word tokenizer, like spaCy, to find new tokens.

Here are the main steps in this process:

1) Compile a list of tokens from T&M documents using spaCy. T&M documents can be collected from sources such as emails, chat messages, contracts, and PDF files;

2) Get the IDF (Inverse Document Frequency) of each word token and arrange the tokens in descending order according to frequency of usage;

3) Add the new tokens to the BERT tokenizer vocabulary; and

4) Resize the model’s embedding matrix to the new vocabulary size.

From this study, we identified 400 T&M domain-specific tokens which were added to the BERT tokenizer vocabulary.
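Here is a minimal sketch of these four steps, reusing the tokenizer and model from the earlier snippets; tm_documents is a hypothetical list of T&M documents and en_core_web_sm is the small English spaCy model:

import math
from collections import defaultdict

import spacy

nlp = spacy.load('en_core_web_sm')

# Hypothetical corpus of T&M documents (emails, chat messages, contracts, ...)
tm_documents = ['...']

# 1) Compile word tokens per document with spaCy
doc_count = defaultdict(int)
for text in tm_documents:
    words = {t.text.lower() for t in nlp(text) if t.is_alpha}
    for word in words:
        doc_count[word] += 1

# 2) Compute the IDF of each token and rank candidates, most frequent (lowest IDF) first
n_docs = len(tm_documents)
idf = {word: math.log(n_docs / df) for word, df in doc_count.items()}
ranked = sorted(idf, key=idf.get)

# 3) Add the top tokens that are not already in the BERT vocabulary
new_tokens = [w for w in ranked if w not in tokenizer.get_vocab()][:400]
tokenizer.add_tokens(new_tokens)

# 4) Resize the embedding matrix to match the new vocabulary size
model.resize_token_embeddings(len(tokenizer))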

3) Further Pre-Train BERT with Masked Language Modelling (MLM)

MLM randomly masks some tokens by replacing them with [MASK]. The masking procedure for each sentence in the training dataset takes place in this manner:

1) 15% of the tokens are selected for masking.

2) In 80% of the cases, the masked tokens are replaced by [MASK].

3) In 10% of the cases, the masked tokens are replaced by a random token that’s different from the one they replace.

4) In the 10% remaining cases, the masked tokens are left as is.

We use a data collator, as it is responsible for taking the samples and batching them. Because we want random masking at each training epoch, we use the DataCollatorForLanguageModeling class and set the masking probability with the following code snippet:

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
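The lm_datasets object passed to the Trainer below is assumed to be the tokenized T&M corpus. A minimal sketch of how it could be prepared with the Hugging Face datasets library, using a hypothetical one-document-per-line file tm_corpus.txt:

from datasets import load_dataset

# Hypothetical file with one T&M sentence or paragraph per line
raw_datasets = load_dataset('text', data_files={'train': 'tm_corpus.txt'})
splits = raw_datasets['train'].train_test_split(test_size=0.1)

def tokenize_function(examples):
    # Truncate to BERT's maximum length; the data collator creates the MLM labels
    return tokenizer(examples['text'], truncation=True, max_length=512)

lm_datasets = splits.map(tokenize_function, batched=True, remove_columns=['text'])
lm_datasets['validation'] = lm_datasets.pop('test')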

We kick off the training by passing everything to Trainer:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)
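The training_args passed above come from TrainingArguments; in a script, define them before constructing the Trainer. The output directory and hyperparameter values below are illustrative assumptions, not the settings used in this study:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='tm-bert',            # hypothetical output directory
    num_train_epochs=3,              # illustrative value
    per_device_train_batch_size=16,  # illustrative value
    evaluation_strategy='epoch',     # evaluate on the validation split every epoch
)

# After building the Trainer as shown above, launch the further pre-training and save
trainer.train()
trainer.save_model('tm-bert')
tokenizer.save_pretrained('tm-bert')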

It is important to note that after adding the new words and resizing the embedding matrix, the embedding weights for the new tokens are initialised randomly and need to be learnt via the further pre-training step. If this is not done properly, the resulting model will not perform better than the original BERT, since the word-embedding weights are affected.
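A quick way to see this, reusing the model and tokenizer from the earlier snippets (the row for a newly added token stays random until pre-training updates it):

# The embedding matrix has one extra row per added token, e.g. 30522 + 400 rows
embeddings = model.get_input_embeddings().weight
print(embeddings.shape)

# The row for a newly added token is randomly initialised until further pre-training
print(embeddings[tokenizer.convert_tokens_to_ids('murex')][:5])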

Performance Comparison

Here’s a comparison of the BERT base model against the T&M BERT model when predicting masked words in T&M contexts.

BERT MLM: further pre-training without extending the BERT base vocabulary, showing Google BERT’s predictions on the masked tokens:

True Labels: settlement, date, fund

T&M BERT MLM: further pre-training with the extended T&M vocabulary, showing T&M BERT’s predictions on the masked tokens:

True Labels: settlement, date, fund
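A comparison like this can be reproduced with the fill-mask pipeline. A minimal sketch, using an illustrative T&M-style sentence rather than the study’s actual evaluation sentences, and the hypothetical tm-bert directory saved above:

from transformers import pipeline

# Load the base model and the further pre-trained model for comparison
base_mlm = pipeline('fill-mask', model='bert-base-uncased')
tm_mlm = pipeline('fill-mask', model='tm-bert', tokenizer='tm-bert')

sentence = 'the trade will be booked in murex before the [MASK] date.'

for name, mlm in [('BERT', base_mlm), ('T&M BERT', tm_mlm)]:
    print(name)
    for pred in mlm(sentence, top_k=3):
        print(f"  {pred['token_str']}  (score {pred['score']:.3f})")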

Contextual Understanding Performance Comparison

T&M BERT shows a significant advantage in understanding T&M-related documents and emails. It achieved 13% higher accuracy than Google BERT, and 16% higher than FinBERT.

Conclusion

Having a relevant set of vocabulary is the cornerstone of many NLP applications. BERT is a state-of-the-art generic language model released by Google in 2018. It is an extremely large neural network model that was pre-trained on a 3.3-billion-word English corpus extracted from Wikipedia and BookCorpus. It is well suited for most generic NLP use cases, but not for domain-specific ones. In this study, we developed the T&M BERT model, applying transfer learning to further pre-train Google BERT with T&M domain-specific vocabulary.

T&M BERT was trained using 200,000 emails, and it can already outperform Google BERT and FinBERT when it comes to understanding T&M trading and settlement context. I am confident that DBS’ T&M BERT will be the foundation for all future T&M NLP use cases.

Dejie Lin is a veteran solution architect, lead data scientist, and seasoned staff-plus engineer. He currently focusses on building internal expertise and driving the data science team to apply trending technologies, e.g. NLP, NER, sentiment analysis, text summarisation, topic modelling, content classification, chatbots, OCR (Optical Character Recognition), time series regression models, and traditional machine learning as well as deep learning models.
