FinBERT: Financial Sentiment Analysis with BERT

Zulkuf Genc
Prosus AI Tech Blog
9 min read · Jul 31, 2020


Doğu Tan Aracı, Zulkuf Genc

Shares of food delivery companies surged despite the catastrophic impact of coronavirus on global markets.

The statement above carries a clear positive message about food delivery companies next to a negative one about global markets. Extracting this positive message is not an easy task for computers, because it requires an understanding of what is positive and what is negative in a financial context, without being confused by the rest of the sentence.

In this blog post, we will share a simplified story of how we solved this financial sentiment classification problem with Bidirectional Encoder Representations from Transformers (BERT) and improved the state-of-the-art performance by 15 percentage points. We will also share our production settings, code, data and guidelines so that you can train and use our financial sentiment classifier yourself.

Why did we do this?

Prosus is one of the largest technology investors in the world. On a daily basis, we need to sift through an overwhelming amount of information, largely in the form of unstructured text data, about the sectors and companies of interest to us. Financial sentiment analysis is one of the essential components in directing the attention of our analysts over such a continuous flow of data.

We quickly noticed that naive approaches such as bag-of-words would simply not cut it, as they generally disregard essential context information in the text. We needed more advanced techniques to crack the financial context and understand what is positive and what is negative from the financial point of view.

The current state-of-the-art approach to natural language understanding is to use pre-trained language models and fine-tune them for specific (downstream) tasks such as question answering or sentiment analysis. We followed that recipe: we developed FinBERT, a BERT-based language model with a deeper understanding of financial language, and fine-tuned it for sentiment classification.

A New Era in NLP

If we had been trying to tackle this problem in, let’s say, 2017, we would have had to come up with a custom model and train it from scratch. In that case we would have needed very large amounts of labeled data to reach even somewhat acceptable performance. High-quality labeled data is not easy to acquire, especially in a niche domain like finance, where the required expertise definitely does not come cheap.

Luckily, NLP’s ImageNet moment happened in 2018. Starting with ULMFit, researchers cracked how to do transfer learning efficiently for NLP problems. The idea is simple: First, get textual data that is available in abundance, like Wikipedia. Then, train a language model with that data, which is basically predicting the next word in a sentence. Finally, fine-tune the language model for your task with one or several task-specific layers. The advantage of this approach is that you don’t need a huge dataset for fine-tuning, because the model learns about language during the original language model training. The heavy lifting is already done!

Many other language models followed the ULMFit path but used different training schemes, the most significant being BERT. BERT is the language model that made the whole pre-training and fine-tuning flow popular. It brought two core innovations to language modelling: (1) It borrowed the transformer (the T of BERT) architecture from machine translation, which does a better job of modelling long-term dependencies than RNN-based architectures (excellent overview here). (2) It introduced the Masked Language Modelling (MLM) task, where a random 15% of all tokens are masked and the model learns to predict them, enabling true bi-directionality (the B of BERT). Intuitive explanations of how transformers and BERT work can be found here and here.
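
To make the MLM objective concrete, here is a minimal sketch of the masking step in PyTorch. It is simplified: the actual BERT recipe also leaves some of the selected tokens unchanged and replaces others with random tokens, but the core idea of hiding ~15% of the input and training the model to recover it is the same.

    import torch

    def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
        """Hide ~mask_prob of the tokens; the model is trained to recover the originals."""
        token_ids = token_ids.clone()
        labels = token_ids.clone()
        mask = torch.rand(token_ids.shape) < mask_prob
        labels[~mask] = -100              # compute the loss only on the masked positions
        token_ids[mask] = mask_token_id   # e.g. 103, the [MASK] id of bert-base-uncased
        return token_ids, labels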

BERT achieved state-of-the-art performance in almost all of the downstream tasks it was applied to, such as text classification and question answering. But more importantly, it was a huge step towards the democratisation of NLP. Now that the compute-heavy language model training is already done (Google used 16 TPUs for 4 days to pre-train BERT), anyone with decent computing power can train a very accurate NLP model for their niche task on top of a pre-trained language model.

FinBERT

BERT was perfect for our task of financial sentiment analysis: even with a very small dataset, it was now possible to take advantage of state-of-the-art NLP models. But since our domain, finance, is very different from the general-purpose corpus BERT was trained on, we wanted to add one more step before going for sentiment analysis. Pre-trained BERT knew how to talk, but now it was time to teach it how to talk like a trader. We took the pre-trained BERT and further trained it on a purely financial corpus: Reuters TRC2, available upon request here. Our purpose was to achieve better domain adaptation by exposing the model to financial jargon before fine-tuning for the final task. We further pre-trained BERT using Hugging Face’s excellent transformers library (back then still called pytorch-pretrained-bert) and the scripts they provided.

The training steps in FinBERT
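
If you want to reproduce this domain-adaptation step, here is a hedged sketch using today’s transformers API rather than the older pytorch-pretrained-bert scripts we used at the time. The file name and output directory are placeholders; the TRC2 text itself has to be obtained from Reuters.

    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    # "trc2.txt" is a placeholder: one financial sentence or paragraph per line.
    corpus = load_dataset("text", data_files={"train": "trc2.txt"})
    tokenized = corpus.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])

    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
    args = TrainingArguments(output_dir="finbert-domain-adapted",
                             num_train_epochs=1, per_device_train_batch_size=32)
    Trainer(model=model, args=args, train_dataset=tokenized["train"],
            data_collator=collator).train()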

Once we had the pre-trained and domain-adapted language model, the next step was to fine-tune it with labeled data for financial sentiment classification. We used the Financial PhraseBank dataset. It is a very well thought-out and carefully labeled, albeit small, dataset. The researchers extracted 4,500 sentences containing financial terms from various news articles, and then had 16 experts and master’s students with finance backgrounds label them. They reported not only the labels but also the inter-annotator agreement level for each sentence, i.e. how many annotators labelled it as positive, neutral or negative.

Some examples from Financial PhraseBank
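
As a rough sketch, the dataset can also be pulled from the Hugging Face hub; the configuration name below, which selects the full-agreement subset, is our assumption about the hub version of the data (the original files are distributed by the dataset’s authors).

    from datasets import load_dataset

    # "sentences_allagree" keeps only sentences on which all annotators agreed;
    # other configurations (e.g. "sentences_50agree") relax that threshold.
    phrasebank = load_dataset("financial_phrasebank", "sentences_allagree")
    print(phrasebank["train"][0])
    # e.g. {'sentence': '...', 'label': ...} with integer labels for negative/neutral/positive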

Fine-tuning a transformer-based language model for classification is a straightforward process. A classification layer is added on top of BERT’s special [CLS] token, which is used for sequence-level tasks like sentence classification or textual entailment. The whole model is then fine-tuned with a classification loss. A visual representation of this structure is in the picture below.

The overview of the steps to train FinBERT
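
A minimal sketch of this fine-tuning step, assuming the domain-adapted checkpoint from the previous section was saved to a local directory called finbert-domain-adapted (a placeholder name):

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    # A 3-way classification head is placed on top of the [CLS] representation.
    model = AutoModelForSequenceClassification.from_pretrained(
        "finbert-domain-adapted", num_labels=3)

    batch = tokenizer(["Operating profit rose compared to the same period a year earlier."],
                      return_tensors="pt", padding=True, truncation=True)
    labels = torch.tensor([0])  # placeholder: index of the "positive" class

    outputs = model(**batch, labels=labels)  # cross-entropy loss over the three classes
    outputs.loss.backward()                  # an optimiser step would follow in a real loop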

Results

The results were surprisingly good, even for a zealous believer in pre-trained language models. We achieved 97% test-set accuracy on the full inter-annotator agreement part of Financial PhraseBank. That was six percentage points higher than the previous state of the art (FinSSLX). On the dataset that also includes sentences without full inter-annotator agreement, accuracy was 86%, 15 percentage points higher than the previous state of the art (HSC).

We also evaluated how FinBERT performs compared to other deep learning models, with or without transfer learning:

  1. A plain LSTM model with GloVe embeddings
  2. LSTM model with ELMo embeddings
  3. ULMFit

In the table below you can see the results. FinBERT is the best model, though ULMFit is also impressively competitive, considering its much smaller model size.

Experimental results on the Financial PhraseBank dataset

And here are some examples of sentences from financial news scored by FinBERT.

Some prediction examples from FinBERT
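
If you want to score sentences yourself, here is a minimal sketch using the transformers pipeline, assuming the ProsusAI/finbert checkpoint published on the Hugging Face model hub:

    from transformers import pipeline

    finbert = pipeline("text-classification", model="ProsusAI/finbert")
    print(finbert("Shares of food delivery companies surged despite the catastrophic "
                  "impact of coronavirus on global markets."))
    # e.g. [{'label': 'positive', 'score': ...}]  (exact output will vary)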

Where does FinBERT fail?

With 97% accuracy on the subset of Financial PhraseBank with 100% annotator agreement, our curiosity drove us to look into the cases where the model failed. As the PhraseBank paper indicates, most of the inter-annotator disagreements are between the positive and neutral labels (agreement for separating positive-negative, negative-neutral and positive-neutral is 98.7%, 94.2% and 75.2% respectively). This is because of the difficulty of distinguishing “commonly used company glitter and actual positive statements”, with companies trying to spin objectively neutral statements into positive ones.

As with most deep learning models, it is not very easy to intuit the failure modes of FinBERT. Still, we noticed some recurring themes in the errors and thought it would be interesting to share them.

The first example is actually the most common type of failure. The model sometimes fails to do the math on which figure is higher and, in the absence of words indicative of direction like “increased”, might fall back to predicting neutral.

A. Pre-tax loss totaled euro 0.3 million, compared to a loss of euro 2.2 million in the first quarter of 2005 .

True value: Positive
Predicted: Negative

However, there are many similar cases where it does make the correct prediction. Examples B and C are different versions of the same type of failure: the model fails to distinguish a neutral statement about a given situation from a statement that indicates polarity about the company. In the third example, information about the company’s business would probably help.

B. This implementation is very important to the operator, since it is about to launch its Fixed to Mobile convergence service in Brazil

True value: Neutral
Predicted: Positive

C. The situation of coated magazine printing paper will continue to be weak

True value: Negative
Predicted: Neutral

73% of the misclassifications made by FinBERT are between the positive and neutral labels, while the same number is only 5% for negative and positive. That is consistent with both the inter-annotator agreement numbers and common sense: it is easy to differentiate between positive and negative, but it is more challenging to decide whether a statement indicates a positive outlook or is merely an objective observation.
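
This breakdown is just the confusion matrix restricted to misclassified examples. A small sketch of the computation on placeholder predictions (not the actual evaluation output):

    from collections import Counter

    # Placeholder lists for illustration; in the real analysis these come from test-set predictions.
    y_true = ["positive", "neutral", "negative", "positive", "neutral"]
    y_pred = ["neutral",  "neutral", "positive", "positive", "positive"]

    errors = [frozenset(pair) for pair in zip(y_true, y_pred) if pair[0] != pair[1]]
    shares = {tuple(sorted(pair)): count / len(errors)
              for pair, count in Counter(errors).items()}
    print(shares)  # fraction of all misclassifications per confused label pair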

Deployment

Just having a model with good accuracy was not enough for us; we also wanted it to be used in valuable ways. So far, we have designed and implemented three applications:

  1. A text editor that processes text and highlights sentences according to their predicted sentiment. It can be used to refine the message before publication or to assess sentiment in corporate publications, financial disclosures or business blogs. We also shared this text editor as a web interface with our colleagues from other departments, for them to play with FinBERT, and the response was quite positive.
  2. A Kibana dashboard that gets tweets from financial news outlets, scores them and generates a live overview of the market in terms of financial sentiment. We thought this type of information might provide an additional perspective for our colleagues.
  3. And finally, an inference service deployed on Kubeflow that can be queried by anyone with access to the endpoint. This enables any data scientist at our portfolio companies to develop their own application with FinBERT without having to think about deploying the model (a minimal query sketch follows below).

The FinBERT text editor

The overview of our dashboard pipeline
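
For illustration, this is roughly how such a service could be queried; the URL and the request/response schema below are hypothetical placeholders, not the actual endpoint.

    import requests

    ENDPOINT = "https://kubeflow.example.com/finbert/predict"  # hypothetical URL
    payload = {"text": "Shares of food delivery companies surged despite the "
                       "catastrophic impact of coronavirus on global markets."}

    response = requests.post(ENDPOINT, json=payload, timeout=10)
    print(response.json())  # hypothetical schema, e.g. {"sentiment": "positive", "score": ...}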

Final Words

Our FinBERT journey started as a small, curiosity-driven research project after we noticed that existing methods were not performing well enough on financial sentiment classification. Throughout this journey we learned a lot and ended up with a state-of-the-art sentiment analyser deployed and creating actual value for our company. We have many ideas for future directions but are also open to your thoughts and suggestions; please feel free to share them in the comments section. Meanwhile, if you want to try it at home (still in lockdown), you can find all the material related to the data and model on our GitHub.

We’d like to thank Nishikant Dhanuka and Liesbeth Dingemans, our colleagues from the Prosus AI team, for their suggestions and help in editing. Please check the paper if you want to learn more about FinBERT. If there are any further questions or suggestions, feel free to reach out to us at datascience@prosus.com.
