Published in Geek Culture

NLP in Financial Institutions


Natural Language Processing (NLP) is one of the fastest-growing fields in the realm of Artificial Intelligence. NLP refers to the ability of a computer program to understand human language as it is spoken and written, known as natural language [12]. Any technology that significantly affects a fundamental pillar of civilization, such as finance, can be considered beneficial to humanity. In recent years, deep learning (DL) and machine learning (ML) have made enormous contributions to finance, and financial analysis has grown tremendously thanks to these new technologies. Many financial institutions store structured data, such as financial reports and economic reports. They also hold unstructured data, i.e. data with no defined model or organization, such as emails and call notes. This is where NLP comes into play: it organizes the data in a structured way and analyzes it to produce the desired result [13]. Machine learning algorithms are used for this analysis; in finance, their applications include fraud detection, automated trading, and investor advisory services [31].

Aside from finance, natural language processing also has applications in the health sciences, search engines, language translation, and more. In health science, NLP can be used to analyze X-ray and MRI reports and predict diseases from them [35]. Search engines like Google and Yahoo also use NLP for better results [34]: NLP can stem a word and return a result in seconds, and it can remove stop words from a query, which improves the accuracy of the search. Language translation is one of the applications where NLP and deep learning neural networks work hand in hand [33]. Recurrent neural networks (RNNs) are very effective at translation, and combined with NLP algorithms they can translate even complex phrases and produce the desired output.

The consulting firm EY offers a practical case study for a fuller understanding of NLP. According to the company's website, a financial institution it served received more than five thousand complaints [1]. Using natural language processing algorithms, EY identified sentiments from the recorded calls by analyzing the information and recognizing the speech of borrowers [1].

Financial institutions invest in companies to earn profit and increase shareholder value. Their investments are based on financial reports (including notes on accounts and Management Discussion & Analysis, MD&A), investor calls, investor meetings, press releases, and information provided to the stock exchange/SEBI [3]. All of these sources can be evaluated using NLP algorithms [4] and speech recognition algorithms. Using the GPT-2 algorithm, Dhaval Dangaria and his Stanford University colleagues have proposed a method for generating fluent summaries in which phrases from the text that do not appear in the summary are concisely inserted [4]. Sentiment analysis can also be performed with FinBERT from Hugging Face [5]. This article focuses on the use of NLP in financial institutions, as well as the methods of analysis, summarization, and sentiment analysis employed in NLP.

Understanding NLP

Natural language processing is a branch of artificial intelligence concerned with teaching computers to read and derive meaning from language [18]. Since language is so complex, computers have to be taken through a series of steps before they can comprehend text. The following is a quick explanation of the steps that appear in a typical NLP pipeline.

Fig 1 : NLP pipeline[18]

Sentence Segmentation: The text document is segmented into individual sentences.

Tokenization: Once the document is broken into sentences, we further split the sentences into individual words. Each word is called a token, hence the name tokenization.

Parts-of-Speech-Tagging: We input each token as well as a few words around it into a pre-trained part-of-speech classification model to receive the part-of-speech for the token as an output.

Lemmatization: Words often appear in different forms while referring to the same object/action. To prevent the computer from thinking of different forms of a word as different words, we perform lemmatization, the process of grouping together various inflections of a word to analyze them as a single item, identified by the word's lemma (how the word appears in the dictionary).

Stop Words: Extremely common words such as “and”, “the” and “a” don’t provide any value, so we identify them as stop words to exclude them from any analysis performed on the text.

Dependency Parsing: Assign a syntactic structure to sentences and make sense of how the words in the sentence relate to each other by feeding the words to a dependency parser.

Noun Phrases: Grouping the noun phrases in a sentence together can help simplify sentences for cases when we don’t care about adjectives.

Named Entity Recognition: A Named Entity Recognition model can tag objects such as people’s names, company names, and geographic locations.

Coreference Resolution: Since NLP models analyze individual sentences, they become confused by pronouns referring to nouns from other sentences. To solve this problem, we employ coreference resolution which tracks pronouns across sentences to avoid confusion.
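The first two of these stages only take a few lines to mimic in plain Python. The sketch below is a toy illustration, assuming a simple regex-based splitter; production pipelines (spaCy, NLTK) use trained models for these and the later stages.

```python
import re

def segment_sentences(text):
    """Sentence segmentation: split the document on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Tokenization: split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

doc = "The bank reported record profits. Analysts praised the results."
for sentence in segment_sentences(doc):
    print(tokenize(sentence))
```

Each printed list is the token stream that the later stages (POS tagging, lemmatization, and so on) would consume.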

Use Cases and Industrial Applications of NLP in Financial Institutions

1. Audits and Accounts

Audits have been greatly aided by Natural Language Processing: transactions and anomalies can be tracked with ease. The Big Four consultancies use natural language processing to analyze transactions, review documents, and create long-term procurement agreements. Consulting firms also use Audit Command Language (ACL) to produce more efficient audit reports [6]. By summarizing notes, meaningful information can be extracted from them. All of these analyses can be performed using specific NLP techniques. NLP applies pre-processing techniques to restructure and organize the data before analyzing it, and in recent years these techniques have been immensely helpful to organizations. Listed below are four NLP techniques that are beneficial to finance organizations.

Fig 2 : Four major applications of NLP for auditors[14]

Text Classification

As the figure above illustrates, NLP makes a powerful contribution to auditing. For Know-Your-Client ('KYC') checks, NLP's text classification ability can detect negative and positive sentiment in news reports. As an additional benefit, NLP can process textual information in any language, which reduces translation time and the cost of hiring a professional translator [14].

Information Retrieval

The capabilities of optical character recognition (OCR) for converting hard copies into machine-readable formats and Natural Language Processing (NLP) for retrieving key information from documents, such as invoices and delivery orders, make it possible to automate the vouching process and free up auditors’ time for higher-value tasks [14]. Automation of information extraction and validation, made possible by NLP, can not only improve audit efficiency but also drastically reduce human errors, increasing data entry accuracy.

Natural Language Generation (NLG)

The NLG subfield of NLP focuses on the development of computer systems that can write understandable text in human languages [14]. Audit report generation is one application of NLG. Business intelligence tools such as Tableau and Power BI are commonly used today to assess and present audit and value-added results; NLG can transform the structured data behind those charts and graphs into text for better clarity.

Natural language understanding (NLU)

As a powerful application of Natural Language Processing (NLP), NLU captures the actual meaning of a text [14]. With this capability, it is possible to extract a large amount of information, discard what is irrelevant, and surface only the content that matters.

2. Risk Assessments

The management of risks associated with investment strategies, due diligence procedures, or a company's reputation can be enhanced with NLP [15]. NLP tools can help asset managers evaluate and optimize investment strategies based on news articles, social media comments, business-internal documents, and other material. For a banking financial institution in particular, it is very important to analyze customer relationships and customers' sources of funds, investments, and expenses; managing risks in these domains manually is extremely difficult. Using NLP, it becomes easy to create automated risk assessments applied to the notes on accounts and other data in the financial reports [6].

An NLP analysis of financial data, company governance documentation, internal documents, legal texts, and contracts can also help law firms and the legal departments of companies minimize risks by flagging discrepancies and noncompliance in due diligence processes. Finally, NLP can be used to monitor public sentiment about a company in order to map out potential reputational risks. The Mexican beer brand Corona [15], for example, might have benefited from an early analysis of the reputational risk it faced after the global spread of the coronavirus COVID-19 in early 2020: while the brand has no relation to the virus, a survey in the US showed that sales dropped because consumers associated the brand with the virus.

A variety of steps can be performed using NLP, such as risk identification and risk assessment [8]. Great care must be taken when collecting critical data, such as information that identifies an organization by name; this is accomplished with a technique called Named Entity Recognition (NER). Recently, AI-based technologies have increased the accuracy of risk assessments to the point where they outperform human assessments. The problem, however, is that most financial institutions have not yet adopted these quality-improving technologies.
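To make the NER step concrete, here is a deliberately minimal, dictionary-based sketch. The entity names are invented, and real NER systems use trained statistical models rather than a lookup table, but the tagged output has the same shape.

```python
# A toy gazetteer: known entity surface forms mapped to entity types.
# Real NER models are statistical; this lookup only shows the output shape.
GAZETTEER = {
    "acme bank": "ORG",
    "john doe": "PERSON",
    "mumbai": "LOC",
}

def tag_entities(text):
    """Return (surface form, entity type) pairs found in the text."""
    lowered = text.lower()
    return [(name, label) for name, label in GAZETTEER.items() if name in lowered]

note = "John Doe transferred funds from Acme Bank's Mumbai branch."
print(tag_entities(note))
```

In a risk assessment workflow, tags like these are what allow names of organizations and counterparties to be pulled automatically out of call notes and account documents.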

3. Financial Sentiment analysis

Sentiment analysis has been heavily used by businesses for social media opinion mining, especially in the service industry, where customer feedback is critical [16]. Numerous experiments have been conducted on this topic. NLP can be used to analyze the sentiment around a particular stock, making it easy for a financial institution to gauge that sentiment from text data about the company alone. The most common use of sentiment analysis in the financial sector is the analysis of financial news, in particular to predict the behavior and possible trend of stock markets. The data may include financial reports (notes on accounts and MD&A), investor calls, investor meetings, press releases, and information provided to the stock exchange/SEBI. All of this data can be summarized and its sentiment analyzed in a few seconds.

To predict financial sentiment, FinBERT [9], also known as Financial BERT, an NLP model available through Hugging Face, is used. Like BERT, it is a pre-trained model. Other methods, such as recurrent neural networks (RNNs), were used before FinBERT was introduced. The following diagram illustrates the flow of a sentiment analysis application in the financial world.

Fig. 3 Flow Diagram of Sentiment analysis of financial text data [16]

4. Portfolio Selection and optimization

Portfolio optimization in finance is the technique of creating a portfolio of assets for which the investment has the maximum return and minimum risk. An investor's portfolio is, in essence, his or her investment spread across different kinds of assets from different companies [16]. The investor's goal is to grow that capital over time. Using collected past data, NLP can help predict the trade period and portfolio composition, and with this information investors can smartly distribute their capital among the assets currently available.

The process of selecting a portfolio may be divided into two stages [26]. The first stage starts with observation and experience and ends with beliefs about the future performances of available securities. The second stage starts with the relevant beliefs about future performances and ends with the choice of portfolio.

NLP can be utilized for semi-log-optimal portfolio optimization [6]. Semi-log-optimal portfolio selection is a computational alternative to log-optimal portfolio selection; with its help, near-maximal growth rates can be achieved when environmental factors are uncertain. Data envelopment analysis can also be applied in portfolio construction, measuring stocks' efficiency to recognize good stocks and filter out bad ones [32].
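As a concrete (if greatly simplified) illustration of what "maximum return and minimum risk" means numerically, the classic mean-variance bookkeeping for a two-asset portfolio can be written out directly; all figures below are made up.

```python
# Beliefs from stage one (illustrative numbers): expected returns,
# volatilities, and the correlation between two assets.
mu_a, mu_b = 0.08, 0.12          # expected annual returns
sd_a, sd_b = 0.10, 0.20          # annual volatilities
rho = 0.25                       # correlation between the two assets
w_a = 0.6                        # fraction of capital in asset A
w_b = 1.0 - w_a

# Stage two: score a candidate portfolio by expected return and risk.
exp_return = w_a * mu_a + w_b * mu_b
variance = (w_a * sd_a) ** 2 + (w_b * sd_b) ** 2 \
    + 2 * w_a * w_b * rho * sd_a * sd_b

print("expected return = %.4f" % exp_return)
print("risk (std dev)  = %.4f" % variance ** 0.5)
```

An optimizer then searches over the weights `w_a`, `w_b` to trade off these two numbers; the semi-log-optimal approach replaces this quadratic objective with an approximation to the logarithmic growth rate.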

5. Stock behavior predictions

Predicting time series for financial analysis is a complicated task because of the fluctuating and irregular data as well as the long-term and seasonal variations that can cause large errors in the analysis. However, deep learning combined with NLP outmatches previous methodologies working with financial time series to a great extent [6].

Within the financial domain, recurrent neural networks (RNNs) are a very effective method of predicting time series such as stock prices. RNNs have an inherent capability to capture the complex nonlinear relationships present in financial time series data and to approximate [17] any nonlinear function with a high degree of accuracy. Because of the high level of precision they offer, these methods are viable alternatives to conventional techniques of stock index prediction. NLP and deep learning techniques are useful for predicting the volatility of stock prices and trends, and they are also valuable tools for making stock trading decisions.

By analyzing financial documents such as 10-K forms, we can forecast stock price movements through Natural Language Processing (NLP). 10-K forms are annual reports filed by companies to provide a comprehensive overview of their financial performance; these reports are required by the Securities and Exchange Commission [18].

Popular Machine Learning and Deep Learning Algorithms used for Natural Language Processing

Table 1. Popular Machine Learning and Deep Learning Algorithms

The above table highlights different machine learning and deep learning algorithms that are employed in Natural Language Processing. Text datasets can be classified as labeled or unlabeled. For labeled text datasets, the features are used to predict one of the known classes; fake financial news and records of fraudulent financial transactions are examples of labeled datasets.

Likewise, there are unlabeled datasets, which call for unsupervised learning techniques. These are mainly clustering algorithms, which identify specific patterns in the data and form clusters based on those patterns. This technique is used to handle unstructured financial data.

Algorithms for Applications of NLP

Table 2. Applications of different Machine Learning and Deep Learning algorithms in Natural Language Processing

NLP employs ML/DL algorithms in a number of different ways, as shown in Table 2. These algorithms are widely used for analysis in financial organizations. We will examine the implementation of these machine learning and deep learning algorithms in Natural Language Processing in the next section.

Sentiment analysis using Doc2vec and LSTM

LSTM and Doc2vec are among the latest sentiment analysis techniques and will be used for the sentiment analysis of financial news in this article. There are three levels of analysis: document level, sentence level, and entity/aspect level [29]. At the document level, the overall positive or negative sentiment of the document is expressed; for example, if a document contains all the financial news, document-level analysis determines its overall sentiment. Sentence-level analysis, by contrast, goes through the sentences line by line and determines the sentiment of each one; it is a refined version of document-level analysis with positive, negative, and neutral sentiments. The third level is entity/aspect-level analysis, where the opinion about a specific entity is taken into consideration.

Fig 4 : Doc2vec model [27]

The Doc2vec model pictured above is the Distributed Memory version of Paragraph Vector (PV-DM) [28]. It is very similar to the word2vec model, in which each word is taken and the best match between word and context is considered. Doc2vec is an advanced version of this model: it takes the whole document as input along with the words, so that no context is missed.

Fig 5 : Long Short-Term Memory (LSTM) [30]

Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) that can learn long-term dependencies and determine polarity [30]. It does so through a special architecture developed by Hochreiter & Schmidhuber. As shown in the figure, an LSTM cell has three gates: an input gate, an output gate, and a forget gate. The gates update the cell memory c_t, taking the input x_t and the previous hidden state h_(t-1), so that the recurrent network continually revises the value of c_t.

Fig. 6 : Implementation of the traditional method of sentiment analyzing the process of financial text data

The above figure represents the traditional method for analyzing the sentiment of financial text data. The sentences are tokenized before analysis; tokenization breaks the sentences up into chunks. Once the sentences are tokenized, stop words can be removed. Common stop-word lists are shipped with NLP libraries as plain .txt files, which can be inspected and modified. After the stop words have been removed, stemming or lemmatization can be applied depending on the language; for English, lemmatization is preferred over stemming.
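A minimal sketch of the stop-word and normalization steps described above, using an inline stop-word set and toy stemming/lemmatization rules in place of the lists and models a library such as NLTK would provide:

```python
# A small inline stop-word set stands in for the .txt lists shipped with
# NLP libraries (real English lists run to a couple of hundred entries).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "was"}
print(sorted(STOP_WORDS))  # display the stop words

# Toy stemmer: crude suffix stripping (real stemmers such as Porter's are richer).
def stem(word):
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: dictionary lookup mapping inflected forms to their lemma.
LEMMAS = {"studies": "study", "went": "go", "was": "be"}

def lemmatize(word):
    return LEMMAS.get(word, word)

# The stemmer chops suffixes; the lemmatizer maps to the dictionary form.
print(stem("studies"), "vs", lemmatize("studies"))
```

The last line shows why lemmatization is preferred for English: the stemmer produces a truncated stem, while the lemmatizer recovers the actual dictionary word.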

Subsequently, clustering can be performed and the data categorized into positive and negative sentiments. Following that, Support Vector Machines, logistic regression, neural networks, etc. can be applied to predict sentiment. Further evaluation can be done through metrics like accuracy, precision, recall, and ROC.
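Those evaluation metrics reduce to simple counts over the predicted labels. A self-contained sketch with made-up binary labels (1 = positive sentiment, 0 = negative):

```python
def accuracy_precision_recall(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for binary label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

acc, prec, rec = accuracy_precision_recall([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(acc, prec, rec)
```

In practice scikit-learn's `classification_report` (used later in this article) computes the same quantities per class.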

Modern Approach for Identifying Financial Sentiment

In this approach, we will show how to train and use the FinBERT pre-trained language model for financial sentiment analysis [19]. FinBERT is a pre-trained NLP model for analyzing the sentiment of financial text. It is built by further training the BERT [20] language model on the finance domain, using a large financial corpus, and then fine-tuning it for financial sentiment classification [9].

Install the dependencies by creating the Conda environment finbert from the given environment.yml file and activating it.

conda env create -f environment.yml
conda activate finbert

The models used in this analysis are a language model trained on TRC2 [21] and a sentiment analysis model trained on Financial PhraseBank [22]. For both of these models, the workflow is as follows:

  • Create a directory for the model. For example: models/sentiment/<model directory name>
  • Download the model and put it into the directory you just created.
  • Put a copy of config.json in this same directory.
  • Call the model with .from_pretrained(<model directory name>)

Two datasets are used for FinBERT. Further training of the language model is done on a subset of the Reuters TRC2 dataset [23]. For sentiment analysis, the Financial PhraseBank from [24] is used. To train the model on the same dataset, after downloading it, three files should be created under the data/sentiment_data folder: train.csv, validation.csv, and test.csv. Follow these steps to create them:

  • Download the Financial PhraseBank[24].
  • Get the path of Sentences_50Agree.txt file in the FinancialPhraseBank-v1.0 zip.
  • Run the dataset script [25]: python scripts/ --data_path <path to Sentences_50Agree.txt>


from pathlib import Path
import shutil
import os
import logging
import sys

import numpy as np
import pandas as pd
import torch
from torch.nn import CrossEntropyLoss
from textblob import TextBlob
from pprint import pprint
from sklearn.metrics import classification_report
from transformers import AutoModelForSequenceClassification
from finbert.finbert import *
import finbert.utils as tools

%load_ext autoreload
%autoreload 2

project_dir = Path.cwd().parent
pd.set_option('max_colwidth', -1)
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.ERROR)

Prepare the model

Setting the path variables:

  1. lm_path: the path for the pre-trained language model (if vanilla BERT is used, there is no need to set this one).
  2. cl_path: the path where the classification model is saved.
  3. cl_data_path: the path of the directory that contains the data files train.csv, validation.csv, and test.csv.

lm_path = project_dir/'models'/'language_model'/'finbertTRC2'
cl_path = project_dir/'models'/'classifier_model'/'finbert-sentiment'
cl_data_path = project_dir/'data'/'sentiment_data'

Configuring training parameters


bertmodel = AutoModelForSequenceClassification.from_pretrained(lm_path, cache_dir=None, num_labels=3)

# The original snippet was truncated here; the model and output paths below
# follow the FinBERT repository, with remaining options left at their defaults.
config = Config(data_dir=cl_data_path,
                bert_model=bertmodel,
                model_dir=cl_path,
                max_seq_length=48,
                train_batch_size=32,
                learning_rate=2e-5)

FinBert is the main class that encapsulates all the functionality. The list of class labels should be given in the prepare_model method call via the label_list parameter.

finbert = FinBert(config)
finbert.base_model = 'bert-base-uncased'
finbert.prepare_model(label_list=['positive', 'negative', 'neutral'])

Fine-tune the model

train_data = finbert.get_data('train')
model = finbert.create_the_model()

# This is for fine-tuning a subset of the model: freeze the first 6 encoder layers.
freeze = 6

for param in model.bert.embeddings.parameters():
    param.requires_grad = False

for i in range(freeze):
    for param in model.bert.encoder.layer[i].parameters():
        param.requires_grad = False

trained_model = finbert.train(train_examples=train_data, model=model)

Test the model

finbert.evaluate outputs a DataFrame in which the true label and logit values for each example are given.

test_data = finbert.get_data('test')
results = finbert.evaluate(examples=test_data, model=trained_model)

Prepare the classification report

def report(df, cols=['label', 'prediction', 'logits']):
    # print('Validation loss:{0:.2f}'.format(metrics['best_validation_loss']))
    cs = CrossEntropyLoss(weight=finbert.class_weights)
    loss = cs(torch.tensor(list(df[cols[2]])), torch.tensor(list(df[cols[0]])))
    print("Accuracy:{0:.2f}".format((df[cols[0]] == df[cols[1]]).sum() / df.shape[0]))
    print("\nClassification Report:")
    print(classification_report(df[cols[0]], df[cols[1]]))

results['prediction'] = results.predictions.apply(lambda x: np.argmax(x, axis=0))
report(results, cols=['labels', 'prediction', 'predictions'])

Getting predictions

With the predict function, given a piece of text, we split it into a list of sentences and then predict sentiment for each sentence. The output is written into a dataframe. Predictions are represented in three different columns:

  1. logit: probabilities for each class
  2. prediction: predicted label
  3. sentiment_score: sentiment score, calculated as the probability of positive minus the probability of negative
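The sentiment_score in point 3 can be sketched directly from a set of logits; the numbers and the (positive, negative, neutral) label order below are illustrative assumptions, not FinBERT's fixed configuration.

```python
import math

def softmax(logits):
    """Convert raw logits to class probabilities."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits in an assumed (positive, negative, neutral) order.
p_pos, p_neg, p_neu = softmax([2.1, -0.5, 0.3])
sentiment_score = p_pos - p_neg
print(round(sentiment_score, 3))
```

A score near +1 indicates strongly positive sentiment, near -1 strongly negative, and near 0 neutral or mixed.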

Below we analyze a paragraph taken out of an article from The Economist. For comparison purposes, we also put the sentiments predicted with TextBlob.

text = "Later that day Apple said it was revising down its earnings expectations in \
the fourth quarter of 2018, largely because of lower sales and signs of economic weakness in China. \
The news rapidly infected financial markets. Apple's share price fell by around 7% in after-hours \
trading and the decline was extended to more than 10% when the market opened. The dollar fell \
by 3.7% against the yen in a matter of minutes after the announcement, before rapidly recovering \
some ground. Asian stockmarkets closed down on January 3rd and European ones opened lower. \
Yields on government bonds fell as investors fled to the traditional haven in a market storm."

cl_path = project_dir/'models'/'classifier_model'/'finbert-sentiment'
model = AutoModelForSequenceClassification.from_pretrained(cl_path, cache_dir=None, num_labels=3)

import nltk'punkt')

result = predict(text, model)

blob = TextBlob(text)
result['textblob_prediction'] = [sentence.sentiment.polarity for sentence in blob.sentences]
print('Average sentiment is %.2f.' % (result.sentiment_score.mean()))

Benefits of using NLP in Financial Institutions

NLP is the leading technique when data is available only in an unstructured format. Some analyses performed with NLP are more accurate than human analysis, and there has never been a better time than now to build NLP solutions for finance [6]. NLP is an automated technique that nearly eliminates human intervention, and reported accuracies for these algorithms often fall between 93% and 95%. Consequently, Natural Language Processing techniques have proven useful to the financial world.

Data enrichment is another benefit of NLP in financial institutions. One example was the risk management use case, where Named Entity Recognition was applied: the relevant data was easily traced using NER, which makes NLP a valuable asset in the finance sector. This also protects financial institutions and their customers from fraudulent activities, for example by identifying fake financial news.

The tasks performed by Natural Language Processing are all automated. Hence, NLP is cost- and time-efficient while giving the desired output. Together, these benefits show that NLP in financial institutions is both accurate and efficient in terms of time and cost.

Limitations of NLP in Financial Institutions

Even though NLP is capable of handling most complex financial problems, it has some limitations. A major problem is ambiguity: one sentence can have two meanings. Speech recognition can take misspelled or misheard words as input and produce incorrect results, which can cause multiple issues when processing recorded voice calls. Sarcastic and ironic sentences can distort the generated text, and text auto-generated in the backend for sentiment analysis can yield wrong sentiment predictions. Despite these limitations, the model may still report very high accuracy, which can mislead the customer or the financial institution depending on the situation. Research is ongoing to address these limitations and improve the quality of an NLP model's results and predictions.


With the advent of deep learning and machine learning techniques, Natural Language Processing has emerged as an efficient technique built on top of them. Several benefits are associated with NLP, including the reduction of manpower and the efficient processing of text data. This blog focused on use cases and implementations of NLP in financial institutions. We examined some effective pre-processing techniques used in financial NLP, discussed NLP models like BERT that were also part of previous research, and presented an approach to financial sentiment analysis using the FinBERT model from Hugging Face alongside the traditional approach.

The use of NLP in finance is the need of the hour. As discussed above, NLP can also help protect customers from financial fraud, and further research may yield concurrent risk assessment in the future. The concepts presented in this blog gave a brief overview of NLP in financial institutions and how it can be utilized effectively.





[4] Dhaval Dangaria, Riccardo Giacomelli, Wilfrido Martinez, "BigBirdFLY: Financial Long text You can read," Stanford CS224N Custom Project (mentor: Rui Yan).




[8] Biplav Srivastava, Javid Huseynov, "Managing Risks to Assets in Corporate Finance with NLP and Planning," AI Institute, University of South Carolina, Columbia, USA.



[11] Prasad Seemakurthi, Shuhao Zhang, and Yibing Qi, "Detection of Fraudulent Financial Reports with Machine Learning Techniques," University of Virginia.
