Sentiment Analysis for Trading with Reddit Text Data

Arjun R
Published in Analytics Vidhya
7 min read · Jul 29, 2020

In this article I explore using Reddit sentiment data to inform trading strategies. I derive market sentiment in two ways using the wallstreetbets subreddit:

  1. Collecting comments from daily discussion submissions then running the VADER sentiment model to assess overall daily positive/negative sentiment.
  2. Collecting all submission titles per day then assessing daily bullish/bearish sentiment using keyword analysis.

In the featurization phase I apply Fourier transforms to smooth the two very noisy time-series datasets. Finally, in the strategy development phase I explore two possible strategies. The first involves exploiting the spread between the SPY (SPDR S&P 500 Trust ETF) price and daily positive/negative sentiment. The second involves training an LSTM (long short-term memory) model to predict the next day’s SPY price based on bullish/bearish sentiment.

Let’s jump into the code walkthrough!

First I import all relevant libraries.

In the cell below I import TensorFlow, which will be needed for strategy development later in the article. The last line of code checks that TensorFlow can see a graphics card; you can skip it if you are not using a GPU.

Positive/Negative Sentiment Analysis Using VADER

First, I prepare the positive/negative sentiment dataset.

Data source: The wallstreetbets subreddit is a community of stock market enthusiasts with 1.3 million members. I use pushshift.io and PRAW (the Python Reddit API Wrapper) to gather relevant data.

The code below collects the daily discussion thread submission titles (thanks to Rare Loot for the article on using pushshift to extract Reddit submissions: https://medium.com/@RareLoot/using-pushshifts-api-to-extract-reddit-submissions-fb517b286563).
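As a rough sketch of what such a query might look like, the snippet below only builds the request URL and does not call the network. The endpoint and parameter names follow the pattern used in the referenced pushshift article; whether the service is still publicly reachable is an assumption, and `build_pushshift_query` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Endpoint as used in the referenced pushshift article; availability is an assumption.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_pushshift_query(title, subreddit, after, before, size=1000):
    """Return a submission-search URL for titles matching `title` (hypothetical helper)."""
    params = {
        "title": title,
        "subreddit": subreddit,
        "after": after,    # epoch seconds, start of the date range
        "before": before,  # epoch seconds, end of the date range
        "size": size,      # maximum number of results requested
    }
    return PUSHSHIFT_URL + "?" + urlencode(params)


# 2020-01-01 to 2020-07-01 (UTC), expressed as epoch seconds.
url = build_pushshift_query(
    "Daily Discussion Thread", "wallstreetbets", 1577836800, 1593561600
)
print(url)
```

The resulting URL could then be fetched with an HTTP client of your choice and the submissions read from the JSON response.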

Next, I collect comments from the daily discussion thread submissions using PRAW with the code below. See the following article on how to set up PRAW: https://towardsdatascience.com/scraping-reddit-data-1c0af3040768. I have also provided the prepared positive/negative sentiment dataset on my GitHub, since the ingest code takes a very long time to run (https://github.com/awrd2019/Reddit-Sentiment-NLP-for-Trading).

Here, I run the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analyzer on the comments from each day. VADER is a parsimonious rule-based model developed by Georgia Tech researchers for sentiment analysis of social media text. See their excellent paper for more information (http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf). The sentiment scores of all comments from a given day are then summed to create that day's sentiment score.
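The daily aggregation can be sketched as follows. The scorer is passed in as a function so the example runs without extra dependencies; `toy_score` is a made-up stand-in, and using VADER's `compound` value as the per-comment score is an assumption, since the article does not say which VADER output is summed.

```python
from collections import defaultdict


def daily_sentiment(comments, score_fn):
    """Sum per-comment scores by day.

    comments: iterable of (date_string, comment_text) pairs.
    score_fn: callable mapping a comment's text to a float score.
    """
    totals = defaultdict(float)
    for day, text in comments:
        totals[day] += score_fn(text)
    return dict(totals)


# With VADER installed (pip install vaderSentiment), a scorer could be:
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   analyzer = SentimentIntensityAnalyzer()
#   score_fn = lambda text: analyzer.polarity_scores(text)["compound"]


def toy_score(text):
    """Toy stand-in scorer (not VADER): +1 per 'good', -1 per 'bad'."""
    words = text.lower().split()
    return float(words.count("good") - words.count("bad"))


comments = [
    ("2020-07-01", "good good earnings"),
    ("2020-07-01", "bad guidance"),
    ("2020-07-02", "bad bad day"),
]
scores = daily_sentiment(comments, toy_score)
print(scores)
```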

Next, I merge the positive/negative sentiment data with the corresponding SPY prices. I also select only the relevant columns, set the date column as the index, and drop any empty rows. I then save the dataframe before going on to apply Fourier transforms.
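A minimal pandas sketch of this merge step is shown here. The column names (`date`, `sentiment_score`, `n_comments`, `close`) and all values are made up for illustration, and the inner join on date is an assumption about how the two tables are aligned.

```python
import pandas as pd

# Made-up illustrative values, not real sentiment scores or prices.
sentiment = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-01", "2020-07-02", "2020-07-04"]),
    "sentiment_score": [12.5, -3.0, 4.0],
    "n_comments": [900, 850, 40],
})
spy = pd.DataFrame({
    "date": pd.to_datetime(["2020-07-01", "2020-07-02", "2020-07-06"]),
    "close": [100.0, 101.0, 102.0],
})

merged = (
    sentiment.merge(spy, on="date", how="inner")  # keep only days present in both
    [["date", "sentiment_score", "close"]]        # select the relevant columns
    .set_index("date")                            # date becomes the index
    .dropna()                                     # drop rows with missing values
)
print(merged)
```

An inner join drops days with Reddit activity but no trading session (weekends, holidays), which keeps the two series aligned.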

Here, I plot the sentiment score against the SPY price.

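Next, I apply Fourier transforms to smooth the sentiment data. A Fourier transform represents a series as a sum of sine waves of different frequencies; keeping only a limited number of components gives a smoother approximation of the original series. I plot the sentiment data together with its Fourier transforms.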
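One common way to do this kind of smoothing is a low-pass filter: take the FFT, zero out everything except the lowest-frequency components, and invert. The sketch below uses NumPy's real FFT. Here `n_components` counts the lowest-frequency bins kept, including the constant term, which is an assumption and may differ from how the article counts its components.

```python
import numpy as np


def fourier_smooth(values, n_components):
    """Low-pass filter a series by keeping its n_components lowest-frequency bins."""
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    values = np.asarray(values, dtype=float)
    spectrum = np.fft.rfft(values)
    spectrum[n_components:] = 0  # discard all higher-frequency bins
    return np.fft.irfft(spectrum, n=len(values))


# Synthetic example: a slow wave plus a fast wave standing in for noise.
t = np.arange(64)
slow = 5.0 + np.sin(2 * np.pi * 2 * t / 64)
fast = 0.5 * np.sin(2 * np.pi * 20 * t / 64)
smoothed = fourier_smooth(slow + fast, 3)  # keeps bins 0, 1 and 2
```

Fewer components give a smoother curve; more components track the raw series more closely.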

Below, I plot the Fourier transform of the sentiment data with 20 components and the SPY price.

Next, I normalize the data. The plot depicts the normalized SPY prices, normalized sentiment scores, and normalized Fourier transforms of the sentiment data.
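The article does not say which normalization is used. As one possibility, a min-max scaling to the range [0, 1] puts prices and sentiment scores on a comparable scale for plotting. This is an assumed choice, not necessarily the author's.

```python
import numpy as np


def min_max_normalize(values):
    """Rescale a series linearly so its minimum maps to 0 and its maximum to 1."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # A constant series has no spread; return zeros to avoid dividing by zero.
        return np.zeros_like(values)
    return (values - values.min()) / span


normalized = min_max_normalize([300.0, 310.0, 320.0])
```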

I have finished preparing the positive/negative sentiment data. Let’s move on to preparing the bullish/bearish sentiment data.

Bullish/Bearish Sentiment Analysis Using Keywords

Again, I use pushshift to collect submission titles. This time I do not need PRAW, because the submission titles are all I need for the keyword analysis. I have again provided the prepared dataset on my GitHub, since the ingest code takes a long time to run (https://github.com/awrd2019/Reddit-Sentiment-NLP-for-Trading).

Here, I use a small collection of keywords to classify submission titles as bullish, bearish, or neutral. I also use regex to detect positions in the submission titles, e.g. AAPL 350c (an Apple call with a strike price of 350). Titles containing call positions are classified as bullish, while titles containing put positions are classified as bearish.
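A minimal sketch of this kind of classifier follows. The keyword lists and the regular expression are hypothetical, since the article's actual keywords and pattern are not shown, and breaking ties as neutral is likewise an assumption.

```python
import re

# Hypothetical keyword lists; the article's actual keywords are not shown.
BULLISH_WORDS = {"bull", "bullish", "calls", "moon", "long"}
BEARISH_WORDS = {"bear", "bearish", "puts", "short", "crash"}

# Matches positions such as "AAPL 350c" or "SPY 290p":
# an upper-case ticker, a strike price, then c (call) or p (put).
POSITION_RE = re.compile(r"\b[A-Z]{1,5}\s+\$?\d+(?:\.\d+)?([cCpP])\b")


def classify_title(title):
    """Label a submission title as 'bullish', 'bearish', or 'neutral'."""
    words = re.findall(r"[a-z]+", title.lower())
    bull = sum(w in BULLISH_WORDS for w in words)
    bear = sum(w in BEARISH_WORDS for w in words)
    # Each detected call counts as bullish, each detected put as bearish.
    for kind in POSITION_RE.findall(title):
        if kind.lower() == "c":
            bull += 1
        else:
            bear += 1
    if bull > bear:
        return "bullish"
    if bear > bull:
        return "bearish"
    return "neutral"


for example in ["AAPL 350c", "SPY 290p 7/31", "Daily discussion thread"]:
    print(example, "->", classify_title(example))
```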

Next, I remove any submission titles that do not have one of the following flairs: DD (due diligence), Discussion, YOLO, Fundamentals, or Stocks.

Here, I sum the bullish and bearish sentiment scores from all the submission titles for each day. Then, I divide this number by the total number of submissions for each day.
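The daily scores described here amount to the share of each day's submissions that were labeled bullish or bearish. A small sketch, assuming each title has already been given one of the labels "bullish", "bearish", or "neutral" (`daily_bull_bear_scores` is a hypothetical helper name):

```python
from collections import defaultdict


def daily_bull_bear_scores(labeled_titles):
    """Return {date: (bull_score, bear_score)} from (date, label) pairs.

    Each score is the count of that label divided by the day's total submissions.
    """
    counts = defaultdict(lambda: {"bullish": 0, "bearish": 0, "total": 0})
    for day, label in labeled_titles:
        counts[day]["total"] += 1
        if label in ("bullish", "bearish"):
            counts[day][label] += 1
    return {
        day: (c["bullish"] / c["total"], c["bearish"] / c["total"])
        for day, c in counts.items()
    }


labeled = [
    ("2020-07-01", "bullish"),
    ("2020-07-01", "bearish"),
    ("2020-07-01", "neutral"),
    ("2020-07-01", "bullish"),
    ("2020-07-02", "neutral"),
]
daily_scores = daily_bull_bear_scores(labeled)
print(daily_scores)
```

Dividing by the number of submissions keeps the scores comparable between quiet and busy days.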

Now, I merge the bullish/bearish sentiment data with SPY price data as I did earlier with the positive/negative sentiment data. I also again save the dataframe.

Here, I plot the bull scores vs SPY and the bear scores vs SPY for the past year.

Now, I apply Fourier transforms to both the bull and bear scores.

Here, I plot the bull and bear scores and their Fourier transforms. It’s evident that the scores fluctuated far more in the earlier part of the data. This is most likely because the wallstreetbets subreddit had fewer members at the time and therefore fewer submissions per day. As the subreddit grew, the scores understandably began to fluctuate less.

I again normalize each column of the dataframe.

Below, I plot the normalized SPY price and Fourier transforms of both the bullish and bearish sentiment scores.

The bullish/bearish sentiment data has been prepared. Let’s move on to the strategy development phase.

Strategy One: Exploit spread between SPY price and positive/negative sentiment

The first strategy involves exploiting the spread between SPY price and positive/negative sentiment. For example, if sentiment drops sharply (becomes more negative) while SPY continues to rise, this could be a signal to open a short position. In the earlier plots this pattern emerges a few times, most prominently right before the COVID-19 crash. To implement this strategy, I use the rolling correlation between positive/negative sentiment and SPY price. Instead of the raw sentiment data, I use the less noisy Fourier transform with 20 components. I first check the correlation between the normalized SPY prices and the normalized 20-component Fourier transform of the sentiment data.

Next, I create a dataframe with the rolling correlation using a window of 14 days. I also save the actual correlation, the mean of the rolling correlation, and the standard deviation of the rolling correlation as variables.
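A sketch of the rolling-correlation step with pandas, run on random placeholder data. The column names `spy_norm` and `sentiment_fft20_norm` are hypothetical stand-ins for the normalized SPY price and the normalized 20-component Fourier transform of the sentiment data.

```python
import numpy as np
import pandas as pd

# Random placeholder data standing in for the prepared, normalized columns.
rng = np.random.default_rng(0)
days = pd.date_range("2020-01-01", periods=60, freq="D")
df = pd.DataFrame(
    {"spy_norm": rng.random(60), "sentiment_fft20_norm": rng.random(60)},
    index=days,
)

# Correlation over the whole sample.
overall_corr = df["spy_norm"].corr(df["sentiment_fft20_norm"])

# 14-day rolling correlation; the first 13 values are NaN (window not yet full).
rolling_corr = df["spy_norm"].rolling(window=14).corr(df["sentiment_fft20_norm"])

rolling_mean = rolling_corr.mean()
rolling_std = rolling_corr.std()
lower_band = rolling_mean - rolling_std  # mean minus one standard deviation
```

Days on which `rolling_corr` falls below `lower_band` are the ones where sentiment and price have diverged most, which is the situation the strategy looks for.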

I plot the rolling correlation with a red axis line at the actual correlation value.

Lastly, I plot the rolling correlation over the normalized SPY price with red and black axis lines at the mean and mean minus standard deviation of the rolling correlation.

Strategy Two: Train LSTM to predict next day’s SPY price based on bullish/bearish sentiment

The second strategy involves training an LSTM neural network to predict the next day’s SPY price based on the bullish and bearish sentiment scores from the previous 14 days. First, I define a function that takes a 2D NumPy array as input and returns it with the first element removed from each inner array. This function is needed for preprocessing in the following cell.

Here, I select the relevant columns from the bullish/bearish sentiment dataframe and convert to a numpy array. I then create an array of 15-day windows of the data. This array has 3 dimensions: the first is the number of windows, the second is the number of days in each window (15), and the third is the number of features (7). I then divide the array into a train and test set and shuffle the test set. Next, I divide the train and test set into X train and y train, and X test and y test, to map inputs (raw bullish/bearish sentiment data and Fourier transforms) to output (price).
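The windowing can be sketched with NumPy as follows. Several details are assumptions: that price is the first of the 7 columns (so the helper that drops the first element removes the price from the inputs), that the target is the price on the 15th day of each window, and the 80/20 chronological split. Shuffling is omitted here for brevity.

```python
import numpy as np


def drop_first_column(rows):
    """Return a 2-D array with the first element of each row removed."""
    return np.asarray(rows)[:, 1:]


def make_windows(data, window):
    """Stack overlapping windows into shape (n_windows, window, n_features)."""
    data = np.asarray(data)
    return np.stack([data[i:i + window] for i in range(len(data) - window + 1)])


# Toy data: 20 days x 7 features, with price assumed to be column 0.
data = np.arange(20 * 7, dtype=float).reshape(20, 7)
windows = make_windows(data, 15)  # shape (6, 15, 7)

# Inputs: the first 14 days of each window, without the price column.
X = np.stack([drop_first_column(w[:14]) for w in windows])  # shape (6, 14, 6)
# Target: the price on the 15th (last) day of each window.
y = windows[:, -1, 0]  # shape (6,)

split = int(0.8 * len(X))  # chronological 80/20 split (an assumed ratio)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```

The resulting `X_train` has the (samples, timesteps, features) layout that Keras LSTM layers expect.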

In this cell I construct and train the LSTM neural network.

Finally, I plot the predictions of the model against the true SPY price and calculate the model’s test MAE (mean absolute error).
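For reference, the mean absolute error is the average of the absolute differences between predicted and true values. A small NumPy sketch with made-up numbers:

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Made-up values for illustration only: the errors are 1, 2 and 0.
mae = mean_absolute_error([310.0, 312.0, 315.0], [311.0, 310.0, 315.0])
```

Because the error is in the same units as the price, it can be read directly as the average number of dollars by which the prediction misses.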

Conclusion

I conduct sentiment analysis on text data from Reddit to inform trading strategies.

Thank you for reading! I hope you enjoyed the article and would welcome any comments and suggestions.

Links

  1. https://github.com/awrd2019/Reddit-Sentiment-NLP-for-Trading
  2. http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf

Disclaimer: This article does not constitute a recommendation that any investment or trading strategy be implemented. Securities and derivatives trading carry substantial risk of loss.
