Stock Return Prediction using Transfer Learning on Textual Data

Rohit Beri
Institute for Applied Computational Science
18 min read · Dec 28, 2020

Authors: Eduardo Peynetti, Rohit Beri, Jessica Wijaya, Stuart Neilson

This article was produced as part of the final project for Harvard’s AC295 Fall 2020 course.

Introduction

Transfer Learning, combined with open-source state-of-the-art (SOTA) neural network models, has democratized and hastened the adoption of machine learning in fields ranging from medical science to astronomy. In this article, we explore the applicability of SOTA Natural Language Processing models for sentiment analysis in the financial domain (news, press releases, 10-Ks) to build profitable trading strategies.

We explore three different models: Loughran-McDonald, BERT, and Fin-BERT. We fine-tune Fin-BERT and look into the inner workings of each model. We construct long/short portfolios of stocks based on the sentiment signal from each of these models and examine their performance. We find that sentiment in the news is predictive of future stock returns. Finally, we discuss the learnings, challenges, and the path ahead.

Problem Overview

Analyzing textual documents is an important part of the investment analysis process. Traditional investment analysts typically study a wide-ranging class of documents, such as 10-Ks, analyst reports, earnings call transcripts, industry reports, press releases, and credit rating reports, to form an opinion about a company. This exercise involves hundreds of person-hours of research for any given target investment.

Given the very large available quantity of such financial texts, as well as their time-sensitive release, methods for extracting an automatic signal rather than relying on human readers are highly sought after. Modern Natural Language Processing can be a useful tool in the hands of investment managers, saving considerable research effort for traditional portfolio managers, and adding another source of signal for quantitative investment managers¹.

Before the deep learning era, a popular approach for automatic processing of text was the “dictionary-based” approach: labeling a vocabulary of words as either “positive” or “negative” based on their meanings, then assessing the sentiment of a document by counting the number of words from this vocabulary in the document (often adding a weighting scheme such as TF-IDF). This approach was bolstered by the availability of domain-specific dictionary resources, such as Loughran-McDonald, which is specific to financial data². However, such methods suffer from the shortcoming that they ignore context. Advancements in text processing technology over the last few years enable us to overcome this hurdle.

In this article, we explore how pre-trained deep-learning-based NLP models can help extract information relevant to investment decision making from textual datasets. The main idea behind these models is that by training a neural network on very large corpora and taking advantage of word vector representations as well as mechanisms such as attention, these models can learn how to represent semantic information. We can then use the pre-trained weights of this model and fine-tune the network on a down-stream task, such as sentiment prediction.

BERT (Bidirectional Encoder Representations from Transformers) is one such model, released by Google in 2018. It has revolutionized the NLP space and launched huge amounts of research into even more complex models. We explore the use of BERT for financial text analysis in this article.

Data Sources

For this project, we focus on the following sources of information:

  • News Articles
  • Key Developments
  • 10-K’s Summaries

Neural network models profit from large amounts of data in the training and evaluation stages. Also, since we are looking to build daily long/short portfolios over a large universe of stocks (the S&P 1500 is our trading universe), we need news for a significant number of stocks each day in order to make decisions based on their sentiment. We found the following datasets very helpful for this objective:

  • Tiingo³ — Summaries of over 27 million news articles since 1996, each tagged to one or more stocks or sectors
  • FinnHub⁴ — Summaries of over 3 million news articles since 2000 tagged to a stock
  • Quandl⁵ — Fundamental financial data and stock prices from the “Sharadar” database
  • S&P Capital IQ⁶ — Key developments including press releases from the companies on issues such as mergers and acquisitions, share buy-backs, new product launches, etc.
  • SEC Edgar⁷ — 6000+ 10K’s for companies in the S&P500 universe for the last 20 years

Data Preprocessing

The quality of the data is critical to the performance of the model and any analysis built on it. Hence, we pay close attention to cleaning and filtering our data.

News Dataset

Exploratory data analysis (EDA) reveals some of the challenges with the news datasets. Currently, both Tiingo and FinnHub publish thousands of news items every day, timestamped and tagged by stock. However, these datasets only started being actively compiled about 5 years ago, so they become sparser as we go further back in time. The problem is particularly acute before 2011. Therefore, we restrict our analysis to the period starting January 1st, 2011. Further, to ensure that the sentiment in an article is attributable to a specific stock, we filter out news articles that have either multiple stock tags or no stock tag at all. Finally, we remove duplicate articles. This reduces our dataset from over 30 million news summaries to just over 3.7 million. We concatenate the title with the summary to retain as much information as possible.
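
For illustration, the filtering could look like the following pandas sketch; the file name and column layout are assumptions, not our actual schema:

```python
import pandas as pd

# Hypothetical schema: one row per article, with a list of tagged tickers.
news = pd.read_parquet("news_summaries.parquet")  # columns: date, tickers, title, summary

# Keep articles from 2011 onwards, where the datasets are reasonably dense.
news = news[news["date"] >= "2011-01-01"]

# Keep only articles tagged to exactly one stock, so sentiment is attributable.
news = news[news["tickers"].str.len() == 1]
news["ticker"] = news["tickers"].str[0]

# Drop duplicates and concatenate title with summary.
news = news.drop_duplicates(subset=["ticker", "title", "summary"])
news["text"] = news["title"] + ". " + news["summary"]
```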

10K’s

A 10-K is a comprehensive report, filed annually by publicly traded companies, about their financial performance. It is a structured document with standardized sections. Pre-processing 10-Ks requires an additional step of parsing the text out of the filing and summarizing it. We focus specifically on item 1a (Risk Factors), item 7 (Management’s Discussion and Analysis of Financial Condition & Results of Operations), and item 7a (Quantitative and Qualitative Disclosures About Market Risk). We use regular expressions to match the item headlines/subtitles and extract the content of these sections. The contents are then saved (in the form of a dictionary) for further processing and exploration.
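
A simplified sketch of the extraction step is below; real filings are messier (items repeat in the table of contents, spacing and casing vary), so the actual patterns we use are more involved:

```python
import re

# Match headings like "Item 1A. Risk Factors"; try "7a" before "7" so the
# alternation doesn't stop early on the bare "7".
ITEM_PATTERN = re.compile(r"item\s+(1a|7a|7)\b[\.\:\s]", flags=re.IGNORECASE)

def extract_items(filing_text: str) -> dict:
    """Return {'1a': ..., '7': ..., '7a': ...} content between item headings."""
    matches = list(ITEM_PATTERN.finditer(filing_text))
    sections = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(filing_text)
        key = m.group(1).lower()
        body = filing_text[start:end]
        # Items also appear in the table of contents; keep the longest occurrence.
        if len(body) > len(sections.get(key, "")):
            sections[key] = body
    return sections
```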

Sample extracted sections: Item 1a, Item 7, Item 7a

Further, we realize that the extracted portions are long (up to 150,000 words), which creates a challenge for tokenization and modeling. Hence, we summarize each of the 3 sections’ content (item 1a, 7, and 7a) using Latent Semantic Analysis (LSA)⁸.
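
The summarization can be done, for example, with the sumy package, which implements the LSA summarizer from the referenced paper⁸; the choice of 20 sentences here is illustrative:

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

def summarize_lsa(text: str, n_sentences: int = 20) -> str:
    """Reduce a long 10-K section to its n most significant sentences via LSA."""
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LsaSummarizer()
    sentences = summarizer(parser.document, n_sentences)
    return " ".join(str(s) for s in sentences)
```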

Word Length of Items from 10-Ks

Fundamental Financial Data and Stock Prices

We obtain daily price data, as well as quarterly financials from the “Sharadar” database from Quandl. This database contains daily open/close/high/low price and volume data for over 14,000 US-based companies and 150 financial indicators.

We aim to simulate a portfolio of the 1500 largest stocks by market capitalization, a universe that varies through time. We also wish to analyze stock returns in different timeframes, such as daily/weekly/monthly/quarterly, while taking into account dividends, stock splits, and de-listings. Finally, we seek to analyze our portfolio by looking into different ways of aggregating its returns, such as by company size, stock price volatility, sector, and industry.

To achieve all these objectives, we use Zipline, which is an open-source library from Quantopian geared towards the analysis of large portfolios of stocks. We process the raw price data from Quandl into a SQL database that is highly optimized for required calculations such as sorting, moving averages, etc. Finally, we create a pipeline to load the data into Zipline and calculate any financial information that we require, such as stock returns, market capitalization, volatility, and ranking of sentiment from our different models.
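
A minimal sketch of such a pipeline is below; for brevity it screens the universe on dollar volume rather than the market-cap ranking we actually use:

```python
from zipline.pipeline import Pipeline
from zipline.pipeline.factors import AnnualizedVolatility, AverageDollarVolume, Returns

def make_pipeline():
    # Stand-in universe screen: top 1500 stocks by 20-day average dollar volume.
    universe = AverageDollarVolume(window_length=20).top(1500)
    return Pipeline(
        columns={
            "ret_1d": Returns(window_length=2),                # 1-day trailing return
            "vol_6m": AnnualizedVolatility(window_length=126), # ~6 months of trading days
        },
        screen=universe,
    )
```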

Baseline — Loughran-McDonald Sentiment (LM)

We use a word count based on the Loughran-McDonald sentiment dictionary as our baseline model. Since it is a dictionary-based model, there is no training requirement; because the dictionary distills knowledge from a separate corpus of financial documents, it can loosely be considered a transfer learning model as well. To enhance the dictionary, we add the top 50 positive and negative words provided in the paper¹ “Predicting Returns with Text Data”.

LM is a dictionary-based sentiment model. It is constructed by analyzing word frequencies in 10-Ks and 10-Qs (More information).

Several studies have shown that dictionary-based sentiment models built from financial texts show statistical significance in predicting stock returns. For a summary of different methods, please refer to “Loughran, Tim, and Bill McDonald, 2016, Textual Analysis in Accounting and Finance”².

Following the suggestions of Loughran and McDonald, we use the positive and negative vocabulary in their dictionary and calculate term frequency-inverse document frequency (TF-IDF) scores. We sum the TF-IDF scores of positive words and subtract those of negative words.
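
In code, the scoring could look like this sketch, where pos_words and neg_words stand for the positive and negative word lists from the LM resource²:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# pos_words / neg_words: lowercase word lists from the Loughran-McDonald resource.
vocab = sorted(set(pos_words) | set(neg_words))
vectorizer = TfidfVectorizer(vocabulary=vocab, lowercase=True)
tfidf = vectorizer.fit_transform(documents)  # shape: (n_docs, n_vocab)

pos_idx = [vectorizer.vocabulary_[w] for w in pos_words]
neg_idx = [vectorizer.vocabulary_[w] for w in neg_words]

# Sentiment score: sum of positive TF-IDF weights minus sum of negative ones.
scores = np.asarray(tfidf[:, pos_idx].sum(axis=1) - tfidf[:, neg_idx].sum(axis=1)).ravel()
```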

Transfer Learning Models

BERT Overview

BERT Model — Source

BERT stands for Bidirectional Encoder Representations from Transformers. This model was released in 2018 by Google and, together with other models like OpenAI’s GPT, started smashing benchmarks on a multitude of NLP tasks without requiring lengthy fine-tuning.

BERT is a language model. It is pre-trained on huge textual corpora and outputs high-quality features that summarize the semantic and contextual meaning of documents. These features can then be used for downstream tasks such as classification, entity recognition, translation, etc.

It is built on the Transformer architecture, which relies on a mechanism known as attention. Excellent introductions to these concepts can be found in Illustrated Bert, Illustrated Transformer, and BERT Fine-Tuning.

Because BERT consists of a stack of Transformer encoders, the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a broad range of NLP tasks, such as text classification, question answering, and language inference, without substantial task-specific architecture modifications⁹.

BERT for Sequence Classification — Source: Google

We use the following version of BERT:

  • BERT-base: 12 encoder layers, hidden size of 768, 12 multi-head attention heads, and 110M parameters in total

This model was pre-trained on BookCorpus¹⁰ and English Wikipedia, which together contain more than 3.5B words¹¹.

For our analysis, we apply transfer learning from two models based on the BERT-base architecture to extract financial sentiment from textual data. The first model is trained for generic sentiment, while the second is explicitly fine-tuned for financial sentiment.

BERT Sentiment Model

We use the default sentiment model from the Huggingface library, based on DistilBERT¹² (a distilled, smaller version of BERT), which was fine-tuned on the Stanford Sentiment Treebank dataset¹³ for sentiment analysis. This model provides a binary classification of positive or negative for any document.

We use this as a baseline transfer model: since it was not fine-tuned on financial data, we can compare its results against a version of the BERT-base model that was.
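
Loading this model takes a few lines with the Hugging Face pipeline API:

```python
from transformers import pipeline

# The default sentiment pipeline: DistilBERT fine-tuned on SST-2¹².
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("Shares tumbled after the company cut its full-year guidance."))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]
```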

Fin-BERT Sentiment Model

The second transfer learning model we use is Fin-BERT¹⁴. This model is built from a pre-trained BERT-base model that is further fine-tuned on financial documents, namely TRC2-financial¹⁵ and Financial PhraseBank¹⁶. It classifies text into three sentiment categories: negative, neutral, and positive.
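
A sketch of scoring a document with Fin-BERT, assuming the publicly available ProsusAI/finbert checkpoint (reference 14 links a mirror of the same model):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

inputs = tokenizer("The company beat earnings estimates and raised its dividend.",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)

# model.config.id2label maps indices to the positive/negative/neutral classes.
for i, p in enumerate(probs):
    print(model.config.id2label[i], float(p))
```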

Fin-BERT Fine-Tuned with news data

Stocks with positive returns are usually paired with positive news on the same day, and vice versa. What if we use this as a proxy for sentiment? It is easy to label every news item this way: we label a text 1 if the tagged stock had a positive return that day, and -1 if it had a negative return. To capture neutral sentiment, we assign a 0 to any news item that contains no words from a sentiment dictionary. We fine-tune the last layer of Fin-BERT using the news as input and these labels as targets.

We use data from 2011–2017 as our training set. This comprises 1.9 million documents. We use 2018 as a validation set for hyper-parameter tuning, and 2019–2020 as our test set. The model is never trained with data from after 2018.
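
A sketch of the last-layer fine-tuning is below; it assumes model is the Fin-BERT classifier loaded earlier, and that the documents have already been tokenized into train_dataset with input_ids, attention_mask, and the return-proxy labels mapped to class indices:

```python
import torch
from torch.utils.data import DataLoader

# Freeze the encoder so only the classification head (the last layer) is tuned.
for param in model.bert.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for batch in DataLoader(train_dataset, batch_size=32, shuffle=True):
    optimizer.zero_grad()
    logits = model(batch["input_ids"], attention_mask=batch["attention_mask"]).logits
    loss = loss_fn(logits, batch["labels"])
    loss.backward()
    optimizer.step()
```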

Performance Results

Financial data is very noisy. When we look at the accuracy of all models predicting today’s return with today’s sentiment, we find accuracies around 51–52%. It’s hard to make a judgment based on this information. To have a better measure of performance, we do the following: on a given day, buy stocks with positive sentiment, sell stocks with negative sentiment, and see how they perform the next day. We label news articles that are timestamped from 9 AM of the previous day until 9 AM of the current day as the same day’s news, and trade at the market open at 9:30 AM NY time.

The below chart shows the out-of-sample returns of portfolios traded according to the sentiment predictions generated by each of the four model types discussed above, as well as by an ensemble model that combines all of them. Specifically, the weights of the portfolio are z-scores of the average sentiment score given by a model for a particular stock on a particular day. The portfolios are dollar neutral.
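
In pandas, the weighting scheme might look like this sketch (the column names are assumptions):

```python
import pandas as pd

def daily_weights(scores: pd.DataFrame) -> pd.Series:
    """scores: one row per article, with columns [date, ticker, sentiment]."""
    # Average sentiment per stock per day, then cross-sectional z-score.
    avg = scores.groupby(["date", "ticker"])["sentiment"].mean()
    z = avg.groupby("date").transform(lambda s: (s - s.mean()) / s.std())
    # Z-scores sum to ~0 each day, so scaling to unit gross exposure
    # yields a dollar-neutral portfolio.
    return z.groupby("date").transform(lambda s: s / s.abs().sum())
```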

We observe that the base Fin-BERT and the fine-tuned Fin-BERT are the best performers, and their results are similar. The “ensemble” model, which averages the weights of all models, performs worse than Fin-BERT but better than either regular BERT or Loughran-McDonald.

1-Day Forward Returns

The below chart is the same as above, except that it allows trading on the same day, so it includes information from the future and could not actually be used in real-life trading. As may be expected, it shows even stronger and more consistent performance, illustrating that there is a strong overall connection between stock performance and news sentiment as measured in our data pipeline.

Same-day Returns — Sentiment

In the financial research literature, a more standard way of assessing prediction models for financial data series is the Sharpe ratio exhibited by a trading strategy that uses the model’s prediction as a trading signal. The Sharpe ratio is the average return divided by the standard deviation of returns¹⁷. This measure reflects most market participants’ need to limit volatility in their portfolios.
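
For daily returns, the annualized version we report can be computed as:

```python
import numpy as np

def annualized_sharpe(daily_returns: np.ndarray) -> float:
    # Ignoring the risk-free rate, whose effect on one-day trades is trivial¹⁷.
    return np.sqrt(252) * daily_returns.mean() / daily_returns.std()
```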

The Sharpe ratios (based on trading only after the news has been published) indicate that the base Fin-BERT model performs best, with the fine-tuned Fin-BERT model second.

Sharpe Ratio (Jan 2019 — Sept 2020)

We note that the Sharpe ratio for the dictionary-based model is in line with those obtained in the paper¹. Originally, the Sharpe ratio we observed for Fin-BERT was over 5. However, a significant portion of the news had a midnight timestamp, which made it unclear whether the news arrived before or after the market close; after removing these documents, we obtained the figures above. Also note that we do not take transaction costs into account. The turnover of this strategy is about 85%, which means that transaction costs would degrade performance significantly. At the same time, we did not devote much attention to portfolio construction, and choosing the right weighting scheme for the signal could boost these numbers.

The below chart examines trading performance for the days surrounding a news day, both before and after. Only the positively numbered days, on which trades occur after the publication of the news, are valid for a trading application. However, the very strong performance at day zero and for a few days before provides further illustration of the relationship between stock price movements and the sentiment signals we extract from the texts.

Sharpe Ratio Comparison

Fin-BERT Fine-Tuned performance

Our fine-tuned model underperforms the base Fin-BERT. We believe that using returns as a proxy for sentiment introduced a very noisy signal that hampered the fine-tuning of the neural network. We observed multiple times that the network’s performance would degrade during training, a phenomenon known as catastrophic forgetting. Even so, we obtain a model that performs strongly compared to the other benchmarks and provides predictions that differ from the original model’s.

Fin-BERT Sentiment Model Results — A Deeper Dive

The next few figures show some additional deeper dives into the returns generated by using the Fin-BERT base model.

The model shows stronger performance on companies with larger market capitalization, as seen in the figure below; the COVID-19 shock then halted all positive returns on the smaller companies but not the larger ones. This result differs from the one in the paper¹, which finds that smaller-cap companies have higher predictability. The paper’s finding is intuitive, since less news is written about small-cap stocks, so any news that does come out is likely to be meaningful. 2020 was, however, an exceptional period in markets.

Fin-BERT: Returns vs Size

The model shows stronger performance when trading companies with higher prior (6-month) volatility, but this is matched by higher volatility in the model’s performance, as seen below. This result matches the observation made by the authors of the paper¹. We note that the Sharpe ratio of both strategies is roughly the same. Low-volatility stocks benefit less from the same signal because the expected price movement is much larger for high-volatility stocks.

Fin-BERT: Returns vs Volatility

The below chart shows results dividing up the returns based on quantiles of the strength of the sentiment signal extracted from the text. As expected, trading on stronger sentiment signals generally provides higher returns, although the highest strength sentiment category does not always provide the highest returns.

Fin-BERT: Returns by Quantile

The below chart shows the overall volatility of the trading strategy based on our sentiment signals, covering the full study period, i.e. the training (2011–2017), validation (2018), and test (2019–2020) periods. The higher volatility in the earlier years is due to the smaller number of news articles available from those years in our data sources. In more recent years, the volatility is consistently low, until it increases sharply at the onset of the COVID-19 pandemic, which brought a major increase in overall market volatility that would affect most trading strategies.

Fin-BERT: 6-month Rolling Volatility

Factor Analysis

We regress the returns of the FinBERT model vs the returns of portfolios of stocks created by ranking stocks by different measures:

  • Momentum: 11-month return, starting 1 year in the past
  • Reversal: 5-day return
  • Volatility: 6-month standard deviation of returns
  • Size: Market Capitalization
  • Value: (Assets − Liabilities) / Market Capitalization

By running this regression, we can get an idea of how much of our signal’s returns these factors explain. Below are the regression coefficients, using returns from today, yesterday, and two days before, respectively.
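
The regression itself is straightforward with statsmodels; factor_rets and signal_rets below stand for the aligned daily return series:

```python
import statsmodels.api as sm

# factor_rets: DataFrame of daily factor-portfolio returns;
# signal_rets: Series of daily FinBERT strategy returns, aligned on date.
X = sm.add_constant(factor_rets[["momentum", "reversal", "volatility", "size", "value"]])
fit = sm.OLS(signal_rets, X).fit()
print(fit.params)  # the factor betas plotted below
```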

Factor Betas

We see a clear effect from Momentum, Reversal, and Volatility. The Reversal effect is the most persistent as we go back in time, while the Momentum and Volatility effects fade. One interpretation is that some news is stale, some is already priced in, and some is genuinely surprising and market-moving.

BERT Versus Fin-BERT: Attention Vectors

The reason behind the superior performance of Fin-BERT relative to BERT in generating predictions is illustrated by the below model visualizations, generated using the Captum¹⁸ library.
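
A sketch of the attribution step, following the common Captum pattern of integrated gradients over the embedding layer; input_ids, attention_mask, and predicted_class are assumed to be prepared beforehand:

```python
import torch
from captum.attr import LayerIntegratedGradients

# Attribute the predicted sentiment class back to the input tokens.
def forward_logits(input_ids, attention_mask):
    return model(input_ids, attention_mask=attention_mask).logits

lig = LayerIntegratedGradients(forward_logits, model.bert.embeddings)

# Baseline: a same-length sequence of [PAD] tokens, keeping [CLS]/[SEP] in place.
baseline_ids = torch.full_like(input_ids, tokenizer.pad_token_id)
baseline_ids[0, 0], baseline_ids[0, -1] = input_ids[0, 0], input_ids[0, -1]

attributions = lig.attribute(inputs=input_ids,
                             baselines=baseline_ids,
                             additional_forward_args=(attention_mask,),
                             target=predicted_class)
token_scores = attributions.sum(dim=-1).squeeze(0)  # one score per token
token_scores = token_scores / token_scores.norm()   # normalize for display
```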

Example 1 is a negative news story in which oil prices fell, and the word “fell” is key to identifying the story as negative. We find that the regular BERT attention heads place little emphasis on “fell”, while Fin-BERT does.

Example 1

BERT
Fin-BERT

Similarly, Example 2 and Example 3 are positive news stories in which keywords that are influential for the sentiment (“more than doubled” and “a large deal”) have been captured much more effectively in the attention vectors for Fin-BERT than for BERT.

Example 2

BERT
Fin-BERT

Example 3

BERT
Fin-BERT

By visualizing the attention given by BERT and Fin-BERT to the examples of financial news summaries/headlines above, we can see that Fin-BERT identifies relevant keywords in the financial domain much better than BERT, which helps explain the superior performance of the Fin-BERT model predictions we saw earlier.

Towards a More Comprehensive Model

Our comprehensive model is a work in progress. We have seen that Fin-BERT does a good job of capturing market sentiment. We are building a model that takes advantage of Fin-BERT while aggregating the signal from multiple sources. Our final model comprises three stages:

  • In the first stage, we extract pooled hidden states from Fin-BERT for all our text datasets (10-K’s, key developments data, and news articles data); see the sketch after this list.
  • In the second stage, we generate an aggregated hidden state, as any given stock may have more than one news item on any given day.
  • Finally, we combine the hidden states from three datasets in a feature space along with the financial and the technical features to predict the returns.
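
A sketch of the first two stages, assuming the ProsusAI/finbert checkpoint:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")  # assumed checkpoint
encoder = AutoModel.from_pretrained("ProsusAI/finbert")

texts = ["Company X announces a share buy-back.",
         "Company X misses revenue estimates."]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    pooled = encoder(**inputs).pooler_output  # stage one: (n_texts, 768)

# Stage two: aggregate multiple same-day items for one stock, e.g. by averaging.
stock_day_state = pooled.mean(dim=0)          # (768,)
```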

We share some of our work towards a comprehensive model below.

A less noisy target

Our fine-tuning suffered from relying on a sentiment proxy derived from stock returns. However, we find that combining the outputs of all these models yields a very robust methodology for automatically detecting relevant, market-moving news. We believe that using these insights to manually label news would help us fine-tune our model; FinBERT’s sentiment model was itself fine-tuned on only about 5,000 manually labeled phrases. Aggregating signals in clever ways could also reduce the effect of noise in training.

Incorporating Key Developments data

To explore the relationship between key development articles and expected returns, we feed the texts to Fin-BERT and extract the feature space (i.e. the pooled hidden states). We then add top layers (dense + dropout + output layer) to predict the binary return (i.e. positive/negative). The results look promising, although further work is needed to better fine-tune the model and to aggregate this with the other data sources.
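
The head might look like the following PyTorch sketch; the layer sizes and dropout rate are illustrative choices, not our tuned values:

```python
import torch.nn as nn

# A small head on top of the frozen 768-dim pooled Fin-BERT features.
return_head = nn.Sequential(
    nn.Linear(768, 64),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(64, 1),  # single logit: positive vs negative return
)
# Trained with nn.BCEWithLogitsLoss() against the binary return labels.
```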

Fin-BERT Fine-tuning — Key Developments

Incorporating 10K’s

We explore a similar relationship between 10-K documents and expected returns. We extract items 1a, 7, and 7a from the 10-Ks and feed them to Fin-BERT. We then concatenate the feature spaces from items 1a, 7, and 7a and add top layers (dense + dropout + output layer) to predict the binary return (i.e. positive/negative). The results again look promising, although further work is needed to better fine-tune the model and to aggregate this with the other data sources.

Fin-BERT Fine-tuning — 10-K’s

Taking Advantage of Meta-Labels

For this project, we filtered the news, keeping only articles with exactly one stock tagged to them. There is significant information to be gained from Tiingo’s labeling by sector, or by multiple stocks. Current NLP models should be able to detect entity-level sentiment in such texts (e.g. “Tesla surges while Apple stalls”).

Conclusions

  1. Natural Language Processing with Transfer Learning shows promising results in the field of stock return prediction.
  2. Domain Adaptation is important for both feature space and target space. Fin-BERT has been adequately fine-tuned on the financial text and hence can capture the sentiment of the news article very reliably. On the other hand, Fin-BERT was not trained or fine-tuned for stock return prediction and needs further work in that area.
  3. Simple fine-tuning of a “pre-trained model” is not enough when the task is to predict stock returns, because of:
  • Multiple conflicting news items, i.e. noise in the feature space
  • The inherent stochasticity of stock returns, i.e. noise in the target space
  • Many other variables that impact stock returns but are not considered during fine-tuning

References

  1. For related analysis, see Ke, Zheng, Kelly, Bryan T., and Xiu, Dacheng, Predicting Returns with Text Data (September 30, 2020). University of Chicago, Becker Friedman Institute for Economics Working Paper No. 2019-69; Yale ICF Working Paper No. 2019-10; Chicago Booth Research Paper No. 20-37. Available at SSRN: https://ssrn.com/abstract=3389884 or http://dx.doi.org/10.2139/ssrn.3389884
  2. The Loughran-McDonald word lists can be acquired from https://sraf.nd.edu/textual-analysis/resources/. The technique is discussed in Loughran, Tim, and McDonald, Bill, Textual Analysis in Accounting and Finance: A Survey (May 20, 2016). Available at SSRN: https://ssrn.com/abstract=2504147 or http://dx.doi.org/10.2139/ssrn.2504147
  3. “Tiingo” (www.tiingo.com)
  4. Finnhub (www.finnhub.io) offers a web-based API for extracting company news as well as a variety of market and fundamental data. It is recommended by a Medium guide to stock market APIs: https://mylinh19662.medium.com/a-comprehensive-guide-to-stock-market-apis-free-and-paid-a09df68a88eb
  5. “Sharadar” and Quandl (www.quandl.com)
  6. https://www.spglobal.com/marketintelligence/en/solutions/sp-capital-iq-platform
  7. SEC (www.SEC.gov)
  8. http://www.kiv.zcu.cz/~jstein/publikace/isim2004.pdf
  9. https://arxiv.org/pdf/1810.04805.pdf
  10. Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. (Jun 2015). arXiv:1506.06724 http://arxiv.org/abs/1506.06724
  11. https://github.com/google-research/bert
  12. https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
  13. https://nlp.stanford.edu/sentiment/index.html
  14. The Fin-BERT model is discussed in https://arxiv.org/abs/1908.10063 and can be acquired from https://huggingface.co/ipuneetrathore/bert-base-cased-finetuned-Fin-BERT
  15. Dataset consists of approx. 2 million news articles published in 2008–2010, the corpus can be obtained from https://trec.nist.gov/data/reuters/reuters.html
  16. https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10
  17. A complete treatment of the Sharpe ratio would deduct the available risk-free rate of return, though this would have a trivial effect on our one-day trades. It would also consider the transaction costs of executing the trading strategy. In a real-world trading application, adjustments would be made to the approach to have lower portfolio turnover.
  18. https://captum.ai/
