Use NLP to Take on Wall Street
How to Build an End-to-End Production Machine Learning Pipeline to Track the Sentiment of Financial News Headlines

What Happened With GameStop?
In January 2021, retail investors (the self-described Robinhood army) came together on Reddit’s Wall Street Bets group and other social media outlets to take on prominent hedge funds, triggering a short squeeze that pushed GameStop’s stock price up 400% in just one week¹. That level of volatility is far from normal: the Reddit group urged retail investors on to punish hedge funds that had taken an outsized short bet against GameStop. Tracking market sentiment can be a powerful tool for investors, because understanding the mood of the market and where it is heading allows one to capitalize on the change in direction. Combining market sentiment with market fundamentals results in sounder investments.
I was fascinated by the showdown between Wall Street and Reddit and inspired to understand how machine learning (ML) can be used to track market sentiment. For my capstone project in the University of California San Diego’s Machine Learning Engineering Bootcamp, I decided to build a sentiment classifier that determines the polarity of financial news headlines. This post goes through the steps required to build an end-to-end pipeline: building and selecting the model to deploy to production, using Docker to build images, and using Kubernetes to manage the deployed containers and the Flask application. In addition, it provides some background on what machine learning is and what natural language processing (NLP) is, compares three NLP approaches (VADER, Google’s BERT, and Google’s XLNet), and explains how Google Cloud Storage is used.
What is Machine Learning?
Machine learning² is a subset of artificial intelligence (AI) and computer science that uses data and algorithms to imitate the way humans learn, gradually improving its accuracy. Previously, researchers gave machines data and a set of rules in order to determine the Y output. Recently, however, the biggest breakthroughs have come from giving machines huge amounts of data together with the Y output, and letting the machine find the rules required to produce that result. It is a very important distinction to understand.
What is NLP? How do Machines Understand What Humans are Talking About?
Natural language processing³ (NLP) is another subset of artificial intelligence (AI) and computer science that seeks to give computers the ability to understand text and spoken words in much the same way humans can. NLP combines computational linguistics (rule-based modeling of language) with statistics, machine learning, and deep learning to understand the meaning of text. For my capstone project, I compared three models: VADER, a retrained version of Google’s BERT, and a retrained version of Google’s XLNet.
The Financial Phrase-Bank Dataset Was Used to Train the Models⁴
It is a dataset consisting of 4,840 sentences from English-language financial news, categorized by sentiment. The selected collection of sentences was annotated by 16 people with a background in financial markets.
VADER Model⁵ — Rule Based Model
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based model for sentiment analysis that determines both the polarity (positive/negative) and the intensity (strength) of emotion. The compound score is computed by summing the valence scores of the individual words and normalizing the result to between -1 (most extreme negative) and +1 (most extreme positive). Below is an example of what the output of the VADER model looks like.
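As a rough sketch (not the exact notebook code), VADER can be called through NLTK’s SentimentIntensityAnalyzer; the example headline below is an illustrative assumption.
# Minimal VADER sketch using NLTK; assumes nltk is installed.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon
analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("Company profits surged after a record quarter")
print(scores)  # a dict with 'neg', 'neu', 'pos', and 'compound' scores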
To assess the performance of the VADER model, I looked at the confusion matrix: VADER achieved an accuracy of 60% when tested on the Financial Phrase-Bank dataset.
When building an ML pipeline, it is important to try the simplest approach first. The rule-based model did not perform well, reaching only 60% accuracy, so the next step was to try machine learning approaches: instead of training a model from scratch, I decided to take a pre-trained general-purpose language model and fine-tune it on our specific task of determining the polarity of news headlines.
General Purpose Language Models — Transfer Learning⁶
One of the biggest challenges in natural language processing (NLP) is the lack of training data. NLP is a diversified field with many different tasks, and datasets for those specific tasks are hard to come by and may contain only a few hundred thousand human-labeled training examples. Modern deep learning models, however, need large quantities of data: they perform significantly better when trained on millions or billions of examples. To bridge this gap, researchers have started using general-purpose language representation models that are pre-trained on enormous amounts of unannotated text. These pre-trained models can then be fine-tuned on specific tasks, which results in substantial accuracy improvements compared to training on the task datasets from scratch. Thus, knowledge gained from solving one problem can be applied to a different but related problem.
Google’s BERT Model⁷
Google’s Bidirectional Encoder Representations from Transformers (BERT) model is a deep neural network technique for NLP pre-training. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and it outperformed other state-of-the-art models by incorporating both contexts into its predictions. BERT’s training objective is to recover words in a sentence that have been masked: some tokens are replaced with a generic [MASK] token, and the model is asked to recover the originals⁸.
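As a quick illustration of this masked-word objective, the sketch below uses the Hugging Face fill-mask pipeline; the model name and example sentence are assumptions for illustration and not part of the original project.
# Illustration of BERT's masked-token objective using the Hugging Face pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# BERT predicts the most likely words for the [MASK] position.
for candidate in fill_mask("The company reported record [MASK] this quarter."):
    print(candidate["token_str"], round(candidate["score"], 3))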

BERT is pre-trained on a plain text corpus (Wikipedia). What makes BERT different? Pre-trained representations can be context-free or contextual. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models, on the other hand, take the other words in the sentence into account, and what makes BERT special is that it is deeply bidirectional. Below is a picture of the architecture of BERT and other models⁷.
The BERT model achieved an accuracy score of 51% without retraining when tested on the Financial Phrase-Bank dataset.

BERT can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks⁹. I therefore retrained Google’s BERT model for sentiment analysis on the Financial Phrase-Bank dataset and achieved an accuracy score of 81%. The notebooks can be accessed on my GitHub.
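For context, the fine-tuning broadly follows the standard Hugging Face sequence-classification recipe. The sketch below is an approximation of that workflow rather than the exact notebook code; the dataset configuration, hyperparameters, and train/test split are assumptions.
# Approximate fine-tuning sketch with the Hugging Face Trainer API.
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForSequenceClassification,
                          Trainer, TrainingArguments)

# Financial PhraseBank sentences on which all annotators agreed.
dataset = load_dataset("financial_phrasebank", "sentences_allagree")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)
split = dataset["train"].train_test_split(test_size=0.2, seed=42)

# Three labels: negative, neutral, positive.
model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)

args = TrainingArguments(output_dir="bert-finphrase", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=split["train"], eval_dataset=split["test"])
trainer.train()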

Google’s XLNet Model¹⁰
The XLNet model builds upon BERT’s bidirectional contexts but does not rely on corrupting the input with masks the way BERT does, since masking neglects the dependency between the masked positions. XLNet is an autoregressive language model that (1) learns bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes BERT’s limitations thanks to its autoregressive formulation.
XLNet’s main contribution is not its architecture but its modified training objective, which learns conditional distributions for all permutations of the tokens in a sequence.

Because XLNet is autoregressive over all permutations, it can calculate the probability of a token given preceding tokens drawn from any order. This is done using permutation language modeling and its two-stream self-attention architecture. I have attached a picture of the architecture below; for more details about the inner workings of the XLNet model, please refer to the research paper¹⁰.

The XLNet model achieved an accuracy score of 43% on the Financial Phrase-Bank dataset without retraining.

After retraining Google’s XLNet on the Financial Phrase-Bank dataset, the model achieved an accuracy score of 86%, and this is the model I deployed to production.
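The inference script below reloads these fine-tuned weights from a checkpoint file. One plausible way to write that checkpoint at the end of the training notebook is sketched here; the exact saving code in the notebook may differ slightly.
# Save the fine-tuned XLNet weights so the inference script can reload them
# later with load_state_dict; the filename matches the one used in Script 2.
import torch
torch.save(model.state_dict(), "model_with_retraining.ckpt")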

Time to Deploy the ML Model in Production
The goal of the project was to develop an end-to-end pipeline and deploy it into production, so the next step was to turn the Jupyter notebooks into Python scripts that can run in production. I deployed the scripts as separate services so they would not depend on each other, which allows for easier scaling.
Script 1 — Get the Stock Prices and Headlines
- It gets the stock price information and financial news headlines for the stocks in the Dow Jones Index.
- First, it pulls the stock price information for the 30 stocks in the index using the yfinance API. That data is stored in a data frame and written to a CSV file.
- Second, it uses the Finnhub API to get news headlines relevant to the stocks in the Dow Jones Index; those headlines are stored in another data frame and written to a second CSV file.
- Third, the stock price CSV and news headline CSV files are both uploaded to Google Cloud Storage.
# Script 1 - gets news headlines and stock price data
# Import necessary packages
import yfinance as yf
import json
import datetime
import requests
import pandas as pd
import pytz

# Set the start and end date
start_date = datetime.datetime.now(pytz.timezone('US/Pacific')).strftime('%Y-%m-%d')
end_date = (datetime.datetime.now(pytz.timezone('US/Pacific')) + datetime.timedelta(days=1)).strftime('%Y-%m-%d')

# Opening the JSON config file - it has all the stock tickers
f = open("config.json")

# Returns the JSON object as a dictionary
config = json.loads(f.read())

# Function to get stock prices
def get_stock_data(stockticker, startdate, enddate):
    data = yf.download(stockticker, startdate, enddate)
    data['name'] = stockticker
    return data

# Function to get news headlines
def get_news_data(stockticker, startdate, enddate):
    url = f"https://finnhub.io/api/v1/company-news?symbol={stockticker}&from={startdate}&to={enddate}&token=c2mnsqqad3idu4aicnrg"
    r = requests.get(url)
    response = r.json()
    if not response:
        # No headlines returned for this ticker - return an empty frame
        return pd.DataFrame(columns=['datetime', 'headline', 'related', 'source'])
    r2 = pd.DataFrame(response)
    df = r2[['datetime', 'headline', 'related', 'source']]
    return df

# Get stock information about multiple stocks
stock_data_list = []
for ticker in config["stockticker"].split():
    tmp = get_stock_data(ticker, start_date, end_date)
    if not tmp.empty:
        stock_data_list.append(tmp)
stock_data = pd.concat(stock_data_list)

# Get news information about multiple stocks
news_data_list = []
for ticker in config["stockticker"].split():
    tmp = get_news_data(ticker, start_date, end_date)
    if not tmp.empty:
        news_data_list.append(tmp)
news_data = pd.concat(news_data_list)

# Upload CSV files to Google Cloud Storage
from google.cloud import storage
client = storage.Client.from_service_account_json(json_credentials_path='yourfile.json')
bucket = client.get_bucket('yourbucket1')

object_name_in_gcs_bucket = bucket.blob('stock_data.csv')
df = pd.DataFrame(data=stock_data).to_csv(encoding="UTF-8")
object_name_in_gcs_bucket.upload_from_string(data=df)

object_name_in_gcs_bucket = bucket.blob('news_data.csv')
df = pd.DataFrame(data=news_data).to_csv(encoding="UTF-8")
object_name_in_gcs_bucket.upload_from_string(data=df)
Script 2 — Determine Sentiment of Headlines & Create Processed File
- It first pulls the news headline CSV file from Google Cloud Storage.
- Then, we load Google’s XLNet model with the fine-tuned weights from the checkpoint file created when the model was fine-tuned for sentiment analysis on the Financial Phrase-Bank dataset. Some pre-processing follows, because the model expects tokenized input, and the predict_sentiment function then determines the polarity of each news headline.
- Then, the stock price CSV file is pulled from Google Cloud Storage. The two data frames are merged, and the completed, processed CSV file is uploaded to another bucket on Google Cloud Storage.
- Then, on Google Cloud Platform, the BigQuery database is linked to the Google Cloud Storage bucket so that the completed, processed file feeds data into the database (a sketch of one way to do this is shown after the script below).
# Model inference
import pandas as pd
import torch
import torch.nn.functional as F
from transformers import XLNetModel, XLNetTokenizer, XLNetForSequenceClassification
from keras.preprocessing.sequence import pad_sequences
import time
import io

# Get CSV files from Google Cloud Storage
from google.cloud import storage
client = storage.Client.from_service_account_json(json_credentials_path='yourfile.json')
bucket = client.bucket('yourbucket1')
blob = bucket.blob('news_data.csv')
blob.download_to_filename('data.csv')
df = pd.read_csv('data.csv')

# Load the XLNet model and the fine-tuned weights
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=3)
model.load_state_dict(torch.load("model_with_retraining.ckpt", map_location=torch.device('cpu')))
# keep map_location so the checkpoint also loads on a CPU-only machine
# model.cuda()  # uncomment when running on a GPU

# Prediction function to determine the sentiment of a news headline
def predict_sentiment(text):
    review_text = text
    # Tokenize the headline the way XLNet expects
    encoded_review = tokenizer.encode_plus(
        review_text,
        max_length=MAX_LEN,
        add_special_tokens=True,
        return_token_type_ids=False,
        pad_to_max_length=False,
        return_attention_mask=True,
        return_tensors='pt',
    )
    # Pad the token ids and attention mask to a fixed length
    input_ids = pad_sequences(encoded_review['input_ids'], maxlen=MAX_LEN, dtype=torch.Tensor, truncating="post", padding="post")
    input_ids = input_ids.astype(dtype='int64')
    input_ids = torch.tensor(input_ids)
    attention_mask = pad_sequences(encoded_review['attention_mask'], maxlen=MAX_LEN, dtype=torch.Tensor, truncating="post", padding="post")
    attention_mask = attention_mask.astype(dtype='int64')
    attention_mask = torch.tensor(attention_mask)
    input_ids = input_ids.reshape(1, 128).to(device)
    attention_mask = attention_mask.to(device)
    # Run the model and turn the logits into probabilities
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    outputs = outputs[0][0].cpu().detach()
    probs = F.softmax(outputs, dim=-1).cpu().detach().numpy().tolist()
    _, prediction = torch.max(outputs, dim=-1)
    target_names = ['negative', 'neutral', 'positive']
    return probs, target_names[prediction]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
MAX_LEN = 128

# Score every headline in the news data frame
probs_list = []
prediction_list = []
for sentence in df['headline']:
    probs, prediction = predict_sentiment(sentence)
    probs_list.append(probs)
    prediction_list.append(prediction)

probs_df = pd.DataFrame(probs_list)
probs_df.columns = ['negative', 'neutral', 'positive']
prediction_df = pd.DataFrame(prediction_list)
prediction_df.columns = ['Sentiment']

# Classified news headlines
final_df = pd.concat([df, probs_df, prediction_df], axis=1)
final_df["datetime"] = pd.to_datetime(final_df["datetime"], unit='s').dt.strftime('%Y-%m-%d')
final_df = final_df.rename(columns={"datetime": "Date"})
final_df = final_df.rename(columns={"related": "name"})

# Get the stock price CSV and merge stock prices and processed headlines into one CSV file
blob = bucket.blob('stock_data.csv')
blob.download_to_filename('stock1.csv')
hist = pd.read_csv('stock1.csv')
pd.set_option('display.max_columns', None)
complete_df = pd.merge(final_df, hist, how='left', on=['Date', 'name'])

# Post the processed data to Google Cloud Storage
bucket = client.get_bucket('yourbucket2')
object_name_in_gcs_bucket = bucket.blob('complete_df_' + time.strftime('%Y%m%d') + '.csv')
df = pd.DataFrame(data=complete_df).to_csv(encoding="UTF-8")
object_name_in_gcs_bucket.upload_from_string(data=df)
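BigQuery can ingest the processed CSV directly from the bucket. The sketch below shows one way to do that with a load job; the table name comes from the query in the Flask script, while the wildcard URI and job settings are assumptions about how the daily files are picked up.
# Rough sketch of loading the processed CSV from Cloud Storage into BigQuery.
from google.cloud import bigquery

bq_client = bigquery.Client.from_service_account_json('yourfile.json')
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,          # skip the CSV header row
    autodetect=True,              # infer the schema from the file
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
uri = "gs://yourbucket2/complete_df_*.csv"
load_job = bq_client.load_table_from_uri(
    uri, "sunlit-inquiry-319400.ucsdcapstonedataset.StockData", job_config=job_config)
load_job.result()  # wait for the load job to finish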
Flask Script
The Flask script builds a web application that lets the user select one of the 30 stocks in the Dow Jones Index and returns the results from Google’s BigQuery database. The returned result shows the date, stock, headline, sentiment, closing stock price, and the volume of stock traded.
# Load libraries
from google.cloud import bigquery
from flask import Flask, render_template, request, jsonify
import json

# Function to return data from the BigQuery database
def query_stackoverflow(stock):
    client = bigquery.Client.from_service_account_json(json_credentials_path='yourfile.json')
    query_job = client.query(
        """
        SELECT * FROM `sunlit-inquiry-319400.ucsdcapstonedataset.StockData` WHERE name = '""" + stock + """' LIMIT 1000
        """
    )
    results = query_job.result()
    # Build an HTML table from the returned rows
    htmlmsg = "<html><body><table border=\"1\" style=\"border-collapse:collapse;\"><tr><td>Date</td><td>Headline</td><td>Name</td><td>Sentiment</td><td>Close</td><td>Volume</td></tr>"
    for row in results:
        htmlmsg += "<tr><td>" + str(row[2]) + "</td><td>" + str(row[3]) + "</td><td>" + str(row[4]) + "</td><td>" + str(row[9]) + "</td><td>" + str(row[13]) + "</td><td>" + str(row[15]) + "</td></tr>"
    htmlmsg += "</table></body></html>"
    return htmlmsg

app = Flask(__name__)

# Flask home page
@app.route('/')
def home():
    return render_template('index.html')

# Endpoint that receives the selected stock from the form and returns the table
@app.route('/songs', methods=['POST', 'GET'])
def get_info():
    stock = request.form.get("stock")
    print(stock)
    return query_stackoverflow(stock)

if __name__ == '__main__':
    app.run(host='0.0.0.0')
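Once the app is running, the endpoint can be exercised without the HTML form. The snippet below is a quick local test; it assumes the app is reachable on Flask’s default port 5000, and the ticker is just an example.
# Quick local test of the /songs endpoint.
import requests

response = requests.post("http://localhost:5000/songs", data={"stock": "AAPL"})
print(response.text[:500])  # first part of the returned HTML table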
Docker¹¹
Docker is a tool that makes it easier to create, deploy, and run applications by using containers¹². Containers package up an application with all of its necessary components, libraries, and dependencies and ship it all out as one package. Containers make it possible to run applications reliably in another computer’s environment. The following steps are required to build a Docker container:
- Create a Dockerfile, which is the blueprint for building the image
- Build the Docker image, which is the blueprint for the container
- Run the container to see the output of the packaged product
Docker takes over the responsibility of configuring the environment and standardizing the deployment pipeline, which allows for faster deployment and scaling since the build happens only once, during CI.
I turned each of the three scripts into its own Docker image and ran the Docker containers locally to make sure everything was working as intended. Each of the Docker images was then pushed to Docker Hub.
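As an illustration, the Dockerfile for one of the scripts might look like the sketch below; the base image, file names, and requirements file are assumptions rather than the project’s exact files.
# Hypothetical Dockerfile for the headline-scoring service (Script 2).
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "script2_inference.py"]
The image is then built with docker build, tested locally with docker run, and pushed to Docker Hub with docker push.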
Kubernetes¹³
Kubernetes is a tool used to deploy and manage containerized web applications. First, a Kubernetes cluster is created, which is a set of nodes that run containerized applications. Then, under workloads, the Docker images are pulled from Docker Hub and deployed as pods. A cron job was scheduled for Script 1 and Script 2 so that the ML pipeline runs daily and builds up the database over time. The Flask application was also deployed, so a user can select one of the 30 stocks in the Dow Jones in real time and get back its stock price and sentiment.
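For the daily runs, a Kubernetes CronJob manifest along the lines of the sketch below could be used; the job name, schedule, and image name are assumptions for illustration.
# Hypothetical CronJob that runs the Script 1 image once a day.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: fetch-headlines
spec:
  schedule: "0 14 * * *"            # once a day; the time is an example
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: fetch-headlines
            image: yourdockerhubuser/script1:latest
          restartPolicy: OnFailure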


Summary
To summarize how we developed an end-to-end ML pipeline and deployed it into production: we started with the problem statement of building a sentiment classifier to determine the polarity of news headlines. We compared the performance of three models and chose Google’s XLNet model after fine-tuning it on the Financial Phrase-Bank dataset. We then converted the Jupyter notebooks into Python scripts, built Docker images, and pushed them to Docker Hub. Finally, we used Kubernetes to deploy and manage the containers.
How to use this to make better informed investment decisions & further exploration
Algorithmic trading is a difficult task because so many different variables are involved in the real world. Positive and negative sentiment from media releases has a substantial impact on a company’s stock price. Stock price prediction is complex and volatile, and it has long been an interesting and dynamic field of research. Data is becoming increasingly voluminous and crucial to all businesses, and manual analysis is no longer feasible in today’s fast-moving world. Most traders get their information from the news, which makes it an influential factor in forecasting changes in the stock market. Using an ML model like this in real time to track the sentiment of a stock and the volume of shares being traded, while taking historical stock prices into account, can lead to a Sharpe ratio greater than 2.0¹⁴.
Where the Real Magic Lies — Reinforcement Learning
If I had more time, I would have loved to build upon this work and use reinforcement learning to train a bot that trades stocks based on time series data, sentiment data, and knowledge graphs. I have included a research paper in the references that achieved a Sharpe ratio greater than 2.0¹⁴. If you are interested in building on this project, please refer to my GitHub, and I would recommend using OpenAI’s Gym toolkit¹⁵. Please reach out to me on LinkedIn if you have any questions. Cheers!
References
- https://www.cnbc.com/2021/02/01/gamestop-shares-reddit-trader-frenzy-continues-into-february.html
- https://www.ibm.com/cloud/learn/machine-learning
- https://www.ibm.com/cloud/learn/natural-language-processing
- https://huggingface.co/datasets/financial_phrasebank
- https://www.researchgate.net/publication/275828927_VADER_A_Parsimonious_Rule-based_Model_for_Sentiment_Analysis_of_Social_Media_Text
- https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
- https://arxiv.org/pdf/1810.04805.pdf
- https://www.borealisai.com/en/blog/understanding-xlnet/
- https://github.com/Shivampanwar/Bert-text-classification
- https://arxiv.org/pdf/1906.08237.pdf
- https://docs.docker.com/get-started/
- https://towardsdatascience.com/deploy-machine-learning-pipeline-on-cloud-using-docker-container-bec64458dc01
- https://kubernetes.io/
- https://arxiv.org/pdf/2001.09403.pdf
- https://gym.openai.com/
- https://imgur.com/gallery/ymJEdr2
- https://github.com/Aditya-Bahl/UCSD_Machine_Learning_Engineering/tree/main/Capstone