Training A Custom NER Model to Auto-detect Trade Economics

Dejie Lin
Published in DBS Tech Blog
6 min read · Sep 15, 2022

How we built a custom named entity recognition (NER) model trained using human-annotated data

Unstructured text such as text messages, emails, and chat logs contains an abundance of information that represents a huge untapped opportunity for organisations. At DBS, a vast amount of textual data from Treasury & Market (T&M) resides in traders’ chat conversations, email communication with counterparties, and deal information exchanged between the front office and back office, among others.

If we could analyse this data with an automated system that extracts pertinent trade details, we could build a straight-through processing (STP) pipeline with minimal human interaction, saving significant effort and boosting both efficiency and productivity. One way to unlock this opportunity is named entity recognition (NER), also known as entity identification or entity extraction, which produces structured output that other tools can leverage to extract meaning.

But how can we use NER to extract trade economics from T&M correspondence, and what are the inherent challenges that we need to overcome?

Building A Custom NER Model

NER categorises entities into predefined categories such as locations, names of persons, events, times, and organisations. However, available open-source NER models support only a limited set of categories, while trade economics spans a wide variety of products such as FX, bonds, interest rates (IR), equities, derivatives, and more, each with its own economics. Crucially, these fields are not among the categories defined in existing NER models, which means we have to train a custom NER model ourselves.
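To see the gap concretely, here is a quick sketch of what a pretrained spaCy pipeline returns on a trade-like message. It assumes the off-the-shelf en_core_web_sm model, and the sample text is invented:

```python
import spacy

# Pretrained pipelines only know generic labels (PERSON, ORG, GPE, DATE, ...),
# so trade-specific fields are missed or mislabelled.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Buy 5M USD against SGD, value date 20 Sep 2022")
print([(ent.text, ent.label_) for ent in doc.ents])
# Typically yields generic DATE/MONEY-style guesses, with no notion of
# notional, currency pair, or value date as distinct trade fields.

print(nlp.get_pipe("ner").labels)  # the fixed, generic label set
```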

As our NER model scans the entire text for named entities, it ascertains sentence boundaries using language rules such as capitalisation to find and extract relevant information. However, a single T&M correspondence could contain information about multiple trades. We therefore need to first identify the number of trades in a conversation, then associate each extracted trade economic with the right trade; again, this is a feature not available in existing NER models.

Of course, training our NER model requires a large amount of ground truth data, which must first be annotated or labelled. This is manpower intensive and not a trivial task. To facilitate this, we will need to build a user-friendly labelling tool that human annotators can use to accurately label trade economics from free text.

Common Trade Economics

Below are seven trade economics entities that represent the common denominators for our various asset classes. These are the fields we want our custom T&M NER model to extract.

Below are the four key steps needed to create our custom NER model.

1) Build A Labelling Portal

The validity of our model would be questionable without accurate human-annotated data for our training and test sets (the ground truth). With this in mind, we developed an easy-to-use NER labelling portal to facilitate the annotation work and ensure that data is interpreted and labelled consistently by different human annotators.

Our labelling portal allows users to perform the following functions:

Figure 1: Uploading of all raw data

1a) Users first upload the raw data to be sorted. Emails are in .PST format, while chats are in .CSV format

Figure 2: Labelling each email and chat log

1b) The user can label emails and chat messages for ease of browsing

Figure 3: Adding of tags

1c) Relevant trade economics, added as tags beforehand, are automatically highlighted and labelled in the dataset

Figure 4: Entities are linked

1d) Unique entities related to one deal must be linked and labelled by product. For FX correspondence specifying more than one deal, the deals are labelled FX 1, FX 2, etc., and linked accordingly.
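To make the linking step concrete, below is a hypothetical record of what the portal might export for a two-deal FX chat. The field names, entity labels, and structure are illustrative assumptions, not the portal’s actual schema:

```python
# A hypothetical annotation record; field names and labels are illustrative.
record = {
    "source": "chat",
    "text": ("Buy 5M USD against SGD, value date 20 Sep 2022. "
             "Also sell 2M EUR against USD."),
    "entities": [
        # (start, end, label, deal_id): deal_id links each entity to a deal
        (4, 6, "NOTIONAL", "FX 1"),
        (7, 10, "CCY", "FX 1"),
        (19, 22, "CCY", "FX 1"),
        (35, 46, "VALUE_DATE", "FX 1"),
        (58, 60, "NOTIONAL", "FX 2"),
        (61, 64, "CCY", "FX 2"),
        (73, 76, "CCY", "FX 2"),
    ],
}
```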

2) Annotate Relevant Emails And Chat Data

Wrongly or incompletely labelled data can negatively impact the entire training and learning process, which means that accuracy is vital. With our labelling portal, the annotation job was vastly simplified, enabling faster and more accurate labelling. It took one annotator two weeks to annotate more than 4,000 emails, which we then used to train our NER model.

3) Train Custom NER Model Using Labelled Data

There are several popular NER libraries in the market, such as spaCy and NLTK. More recently, a new natural language processing (NLP) model architecture called DeBERTa (Decoding-enhanced BERT with Disentangled Attention) was proposed, which improves on the BERT and RoBERTa models using a disentangled attention mechanism.
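For reference, fine-tuning a DeBERTa-style model for NER is typically framed as token classification. Below is a minimal sketch using the Hugging Face transformers library; the BIO label set is illustrative, and this is not necessarily the exact setup we used:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative BIO tag set covering two of the trade economics fields.
labels = ["O", "B-NOTIONAL", "I-NOTIONAL", "B-CCY", "I-CCY"]

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# Training then proceeds as standard token classification: each word gets a
# BIO tag, and subword tokens are aligned back to words via the tokenizer's
# word_ids() before the loss is computed.
```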

We decided to use spaCy and DeBERTa to train two different models to predict our seven base entities, using the former for the base model and the latter as an alternative. A minimal spaCy training sketch follows the dataset details below.

· Training dataset period: 01 Jan 2022 to 28 Feb 2022

· Evaluation dataset period: 01 Mar 2022 to 31 Mar 2022
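Below is a minimal sketch of how a custom spaCy NER model can be trained programmatically. The sample message, entity labels, and character offsets are invented for illustration; in practice, training runs over the full annotated dataset (typically via spaCy’s config-driven workflow):

```python
import random
import spacy
from spacy.training import Example

# One invented labelled sample: text plus character-offset entity spans.
# Labels like NOTIONAL and CCY are illustrative, not the actual schema.
TRAIN_DATA = [
    ("Buy 5M USD against SGD, value date 20 Sep 2022",
     {"entities": [(4, 6, "NOTIONAL"), (7, 10, "CCY"), (19, 22, "CCY"),
                   (35, 46, "VALUE_DATE")]}),
]

nlp = spacy.blank("en")    # start from a blank English pipeline
ner = nlp.add_pipe("ner")  # add a fresh NER component
for _, ann in TRAIN_DATA:
    for start, end, label in ann["entities"]:
        ner.add_label(label)  # register each custom label

optimizer = nlp.initialize()
for epoch in range(30):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, ann in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], sgd=optimizer, losses=losses)

doc = nlp("Sell 2M EUR against USD")
print([(ent.text, ent.label_) for ent in doc.ents])
```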

4) Verify Performance And Accuracy Of The Trained Models

For both models, we used RockNER, a simple method for creating adversarial examples to evaluate the robustness of an NER model, to conduct entity and context attacks. The model trained with DeBERTa was then compared with the model trained using spaCy.
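To illustrate what an entity attack does, the toy sketch below swaps each gold entity for another surface form with the same label, so a robust model has to rely on context rather than memorised entity strings. The labels and replacement pools are made up, and this is a simplification of the idea behind RockNER, not its actual implementation:

```python
import random

# Invented replacement pools, keyed by entity label.
REPLACEMENTS = {
    "CCY": ["USD", "SGD", "EUR", "JPY"],
    "NOTIONAL": ["1M", "5M", "250K"],
}

def entity_attack(text, entities):
    """Rewrite text, swapping each (start, end, label) span for a
    random same-label replacement; returns new text and new spans."""
    pieces, new_entities, last = [], [], 0
    for start, end, label in sorted(entities):
        pieces.append(text[last:start])
        repl = random.choice(REPLACEMENTS.get(label, [text[start:end]]))
        new_start = sum(len(p) for p in pieces)
        pieces.append(repl)
        new_entities.append((new_start, new_start + len(repl), label))
        last = end
    pieces.append(text[last:])
    return "".join(pieces), new_entities

adv_text, adv_ents = entity_attack(
    "Buy 5M USD against SGD",
    [(4, 6, "NOTIONAL"), (7, 10, "CCY"), (19, 22, "CCY")],
)
print(adv_text, adv_ents)
```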

Below is the NER accuracy of the spaCy model on the test dataset:

Below is the NER robustness of the spaCy model on the RockNER adversarial dataset:

Below is the NER accuracy of the DeBERTa model on the test dataset:

Below is the NER robustness of the DeBERTa model on the RockNER adversarial dataset:

Comparing the NER accuracy of both models, the spaCy model comes out slightly ahead, with around 4% higher accuracy than the DeBERTa model.
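For reference, spaCy reports entity-level precision, recall, and F1 directly. A minimal sketch, assuming nlp is the trained pipeline and test_data holds (text, annotations) pairs in the same format as the training examples:

```python
from spacy.training import Example

# Build evaluation examples from held-out annotated data (assumed names).
examples = [Example.from_dict(nlp.make_doc(text), ann)
            for text, ann in test_data]

scores = nlp.evaluate(examples)
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])
print(scores["ents_per_type"])  # per-entity breakdown, one row per field
```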

Conclusion

Although the study shows that DeBERTa underperforms compared to spaCy, I suspect that DeBERTa might perform better in production, especially if the actual content in emails varies quite a bit compared to the training data.

The above example shows how to develop a generic NER model that can extract all kinds of trade economics, as long as there is sufficient training data. The approach can be extended to other products without additional model development. The next objective is thus to monitor the performance of the NER model and watch for shifts in the data.

Dejie Lin is a veteran solution architect, lead data scientist, and seasoned staff-plus engineer. He currently focuses on building internal expertise and driving the data science team to apply trending technologies, e.g., NLP, NER, sentiment analysis, text summarisation, topic modelling, content classification, chatbots, optical character recognition (OCR), time series regression models, and traditional machine learning models as well as deep learning models.
