Financial Sentiment Analysis Using Spark NLP: Achieving 95% Accuracy

Abdullah Mubeen · Published in spark-nlp · May 29, 2024 · 4 min read

Sentiment Analysis is an NLP technique for classifying the emotion expressed in text, such as comments and reviews, as Positive, Negative, or Neutral. It helps surface the underlying sentiment in text data, which is crucial for applications such as customer feedback analysis, market research, and social media monitoring.

Why Analyze Financial Data?

[Image: financial data analysis (ar5iv)]

Financial sentiment analysis involves extracting and analyzing sentiment from financial texts such as news articles, social media posts, and company reports. It helps us understand market trends and investor sentiment, and supports informed financial decisions. By leveraging Spark NLP, we can efficiently process large volumes of financial data, producing insights that guide strategic decisions and enhance financial analysis.

The ‘bert-base-finance-sentiment-noisy-search’ Model

This model is a fine-tuned version of BERT (Bidirectional Encoder Representations from Transformers). Initially, “bert-base-uncased” was fine-tuned on Kaggle’s finance news sentiment analysis dataset, reaching an accuracy of about 88%.

To further enhance the model’s performance, a logistic regression classifier was employed on the same dataset. By inspecting the coefficients contributing to the “Positive” and “Negative” classes, the top 25 bi-grams for each class were identified. These bi-grams were then used as search terms to retrieve up to 50 news items per bi-gram phrase using Bing News Search. This method, termed “noisy-search,” presumes that positive bi-grams (e.g., “profit rose,” “growth net”) yield positive examples, while negative bi-grams (e.g., “loss increase,” “share loss”) yield negative examples. Despite not testing the validity of these assumptions, this approach facilitated the collection of additional training data.
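To make that term-mining step concrete, here is a minimal sketch of how top bi-grams can be read off a logistic regression’s coefficients. The toy headlines and variable names below are illustrative, not the authors’ actual setup:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy headlines standing in for the Kaggle finance news dataset
texts = [
    "profit rose sharply this quarter",
    "net growth beat expectations",
    "share loss widened in the period",
    "loss increase worries investors",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Bag of bi-grams, as in the "noisy-search" term-mining step
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(texts)

clf = LogisticRegression().fit(X, labels)

# Largest coefficients push toward "positive"; smallest toward "negative"
bigrams = np.array(vectorizer.get_feature_names_out())
order = np.argsort(clf.coef_[0])
top_negative = bigrams[order[:25]]   # candidate negative search queries
top_positive = bigrams[order[-25:]]  # candidate positive search queries
print(top_positive, top_negative)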

The model was then trained on this noisy data and evaluated on a held-out test set from the original dataset. Training with several thousand noisy “positive” and “negative” examples raised test-set accuracy to about 95%, demonstrating that automatically collecting noisy examples via search can significantly boost accuracy. For best results, feed the classifier the title plus either the first paragraph or a short news summary, ideally up to 64 tokens.
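In practice, that recommendation is just string assembly plus truncation. A hypothetical helper (make_classifier_input is my own name, not part of any library) might look like:

def make_classifier_input(title: str, summary: str, max_tokens: int = 64) -> str:
    # Rough whitespace-based truncation; the model's WordPiece tokenizer
    # will count tokens slightly differently
    words = f"{title}. {summary}".split()
    return " ".join(words[:max_tokens])

text = make_classifier_input(
    "Q3 profit rose 12%",
    "The company reported higher net sales and raised its guidance.",
)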

For more detailed results, including the accuracy comparison between Logistic Regression (LR) and BERT (base-cased), refer to the PDF attached to the model card.

Let's look at the data we’ll be using for this project

The FinancialPhraseBank dataset from Kaggle provides sentiment labels for financial news headlines from the viewpoint of a retail investor. It includes two columns: “Sentiment” (negative, neutral, or positive) and “News Headline.” The dataset was created by Malo et al. (2014) for detecting semantic orientations in economic texts.

This will be perfect for our project!

Let's load our data and get started with the sentiment analysis!

# Import necessary libraries
import os
import sparknlp

# Start (or get) a Spark session with Spark NLP
spark = sparknlp.start()

# List all files in the input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load the CSV (no header row) and name the columns
df = spark.read.csv('/kaggle/input/sentiment-analysis-for-financial-news/all-data.csv',
                    header=False, inferSchema=True).toDF("label", "text")
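Before building the pipeline, it’s worth a quick sanity check on what we loaded:

# Peek at a few rows and the label distribution
df.show(5, truncate=60)
df.groupBy("label").count().show()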

Let's create a Spark NLP pipeline with the following stages:

from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

# DocumentAssembler: converts the raw text column into Spark NLP's document type
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Tokenizer: splits each document into tokens
tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

# Pretrained BERT sequence classifier fine-tuned for financial sentiment
sequenceClassifier = BertForSequenceClassification.pretrained("bert_classifier_base_finance_sentiment_noisy_search", "en")\
    .setInputCols(["document", "token"])\
    .setOutputCol("class")

pipeline = Pipeline().setStages([document_assembler, tokenizer, sequenceClassifier])
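One optional tweak: Spark NLP’s transformer-based classifiers expose a maximum sentence length, so the 64-token guidance from the model card can be enforced at the annotator level. A minimal sketch:

# Optional: cap inputs at 64 tokens, per the model card's recommendation
sequenceClassifier.setMaxSentenceLength(64)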

Run the pipeline and get our predictions

# Fit the pipeline and transform the data to get predictions
result = pipeline.fit(df).transform(df)
result.select('text', 'label', 'class.result').show(truncate=False)
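For spot-checking individual headlines without a DataFrame round-trip, Spark NLP’s LightPipeline runs the same fitted pipeline on plain Python strings (the headline below is just an example):

from sparknlp.base import LightPipeline

# LightPipeline annotates raw strings, which is handy for quick checks
light_model = LightPipeline(pipeline.fit(df))
annotations = light_model.annotate("Operating profit rose to EUR 13.1 mn from EUR 8.7 mn.")
print(annotations["class"])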

And that’s it! Pretty easy, right?

Now let’s see how accurate this was. For the evaluation, I’ll be using scikit-learn.

# Convert the Spark DataFrame to pandas, since scikit-learn works on in-memory arrays

preds = result.select('text', 'label', 'class.result').toPandas()
preds['result'] = preds['result'].str[0]  # 'result' is a single-element array; take its first item

Now let's run the metrics. Since both the label and prediction columns hold plain string values, we can pass them straight to scikit-learn:

from sklearn.metrics import accuracy_score, classification_report

# Both columns contain string labels ('positive', 'neutral', 'negative'),
# so no binarization is needed before scoring
accuracy = accuracy_score(preds['label'], preds['result'])
report = classification_report(preds['label'], preds['result'])

print("Accuracy:", accuracy)
print("\nClassification Report:\n", report)

For further reading and resources, explore the Spark NLP documentation and the Spark NLP Models Hub.
