Finding correlations between Reddit and the Stock Market using Statistics

Using VADER (Valence Aware Dictionary and sEntiment Reasoner), the Pushshift API, spaCy’s Named Entity Recognition, and the Spearman rank-order correlation test.

Sean Yap

Published in

Analytics Vidhya

12 min read · Jun 26, 2021

Introduction

In early 2021, Reddit’s r/wallstreetbets was at the epicenter of one of the biggest movements in the financial markets, showcasing the power of social media. While it seems like an unlikely source of market movement, it’s hardly surprising in hindsight, given that Robinhood gives retail traders easy access to the stock market with zero commission fees. That got me thinking about whether I could find a correlation between:

The number of mentions of a particular stock and its trading volume
The sentiment of posts about a particular stock and its price

In this post, I will scrape data from r/wallstreetbets using the Pushshift API and use that data to test the two hypotheses above.

How Pushshift API works

The Pushshift API is a great resource for scraping large amounts of Reddit data. One way of using the API is through the https://api.pushshift.io/ endpoints. The base URL for searching Reddit is https://api.pushshift.io/reddit/search/.

Adding parameters lets us query the r/wallstreetbets subreddit between two dates given as Unix timestamps (April 2020 and April 2021): https://api.pushshift.io/reddit/search/submission/?subreddit=wallstreetbets&after=1585670400&before=1618110903&size=100

size — increase limit of returned entries to 100
after — date to start the search
before — date to end the search
subreddit — to narrow it down to a particular subreddit
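As a minimal sketch, the query URL above can be assembled from its parameters with Python's standard library. The helper name build_search_url is my own illustration and not part of Pushshift; the parameter names are the four listed above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_search_url(subreddit, after, before, size=100):
    """Return a submission-search URL (hypothetical helper for illustration)."""
    params = {
        "subreddit": subreddit,
        "after": after,    # Unix timestamp: start of the search window
        "before": before,  # Unix timestamp: end of the search window
        "size": size,      # number of entries requested
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("wallstreetbets", 1585670400, 1618110903)
print(url)
```

Keeping the parameters in a dictionary makes it easy to change the date window between requests without editing the URL string by hand.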

The URL above will return a JSON response of the results. For the sake of explanation, I reduced size to 1 to make the JSON response easier to read. It will look something like this:

{"data": [{"all_awardings": [],"allow_live_comments": false,"author": "br0kencircuit","author_flair_css_class": null,"author_flair_richtext": [],"author_flair_text": null,"author_flair_type": "text","author_fullname": "t2_20jrdmhz","author_patreon_flair": false,"author_premium": false,"awarders": [],"can_mod_post": false,"contest_mode": false,"created_utc": 1585670411,"domain":"self.wallstreetbets","full_link": "https://www.reddit.com/r/wallstreetbets/comments/fsfhli/the_four_most_expensive_words_in_the_english/","gildings": {},"id":"fsfhli","is_crosspostable": false,"is_meta": false,"is_original_content": false,"is_reddit_media_domain": false,"is_robot_indexable": false,"is_self": true,"is_video": false,"link_flair_background_color": "#7b2e00","link_flair_css_class": "shitpost","link_flair_richtext": [{"e": "text","t": "Shitpost"}
],"link_flair_template_id": "50c5e166-b861-11e5-bc53-0e60c810ce03",
"link_flair_text":"Shitpost","link_flair_text_color":"light","link_flair_type": "richtext","locked": false,"media_only": false,"no_follow": true,"num_comments": 0,"num_crossposts": 0,"over_18": false,"parent_whitelist_status": "no_ads","permalink": "/r/wallstreetbets/comments/fsfhli/the_four_most_expensive_words_in_the_english/","pinned": false,"pwls": 0,"removed_by_category": "moderator","retrieved_on": 1585675830,"score": 1,"selftext": "[removed]","send_replies": true,"spoiler": false,"stickied": false,"subreddit": "wallstreetbets","subreddit_id": "t5_2th52","subreddit_subscribers": 1065713,"subreddit_type": "public","suggested_sort": "confidence","thumbnail": "self",
"title": "The four most expensive words in the English language are \"this time it\u2019s different.\" - Sir John Templeton",
"total_awards_received":0,"url":"https://www.reddit.com/r/wallstreetbets/comments/fsfhli/the_four_most_expensive_words_in_the_english/",
"whitelist_status": "no_ads","wls": 0}]}

In Python, a JSON object is parsed into a dictionary. Here the results sit under the dictionary key “data”, which holds a list of nested dictionaries, one per submission, with the fields you see above. The fields I needed were title, selftext, and created_utc, and they can be accessed like so:

Title: ["data"][0]["title"]
Post text: ["data"][0]["selftext"]
Post created: ["data"][0]["created_utc"]
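The access pattern above can be tried out on a trimmed copy of the sample response. This is only a sketch: the string below keeps just the three fields of interest, and the conversion of created_utc to a readable date is my addition.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the sample response above (three fields only).
raw = (
    r'{"data": [{"created_utc": 1585670411, "selftext": "[removed]", '
    r'"title": "The four most expensive words in the English language are '
    r'\"this time it\u2019s different.\" - Sir John Templeton"}]}'
)

response = json.loads(raw)   # JSON object -> Python dict
post = response["data"][0]   # first (and only) submission in the list

title = post["title"]
body = post["selftext"]
# created_utc is a Unix timestamp in seconds; convert it to an aware datetime.
created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)

print(title)
print(body)
print(created.isoformat())
```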

A more detailed explanation of how the API works can be found at https://github.com/pushshift/api

Building functions to scrape Reddit data

I first needed to create a function that sends an HTTP request to the Pushshift server and returns the JSON response for the requested content, tolerating up to 5 failed request attempts.
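A request function with a retry limit might look roughly like the sketch below. It is not the code from the project: the name fetch_json, the pause between attempts, and the injectable opener argument (which lets the function be exercised without a network connection) are all assumptions made for illustration.

```python
import json
import time
import urllib.request
from urllib.error import URLError


def fetch_json(url, max_attempts=5, pause=1.0, opener=urllib.request.urlopen):
    """Return the decoded JSON body of `url`, retrying on failure.

    Hypothetical helper: gives up after `max_attempts` failed attempts.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            with opener(url) as response:
                return json.loads(response.read().decode("utf-8"))
        except (URLError, TimeoutError, ValueError) as error:
            # URLError covers HTTP and connection errors; ValueError covers
            # a body that is not valid JSON.
            last_error = error
            if attempt < max_attempts:
                time.sleep(pause)  # wait briefly before the next attempt
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts") from last_error


# Example call (requires network access, so it is left commented out):
# payload = fetch_json("https://api.pushshift.io/reddit/search/submission/?subreddit=wallstreetbets&size=1")
```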

With that, I created another function that uses the request function to pull submissions and append their key data points to a list. A third function runs it inside a while loop, taking the last created_utc from the list and feeding it back in as the new start date, so that submissions are added continuously until the before_date parameter is reached. I also added the tqdm module and a print call for the last created_utc to track the progress of the extraction.
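The loop described above can be sketched as follows. The function and argument names are hypothetical, and the sketch assumes that each page comes back ordered oldest first; fetch_page stands in for whatever function performs the actual request, so the example below runs against fake in-memory data.

```python
def collect_submissions(fetch_page, after, before):
    """Collect submissions created between `after` and `before` (Unix seconds).

    `fetch_page(after, before)` is a hypothetical callable returning one page
    of submission dicts, oldest first (an assumption of this sketch).
    """
    collected = []
    cursor = after
    while cursor < before:
        page = fetch_page(cursor, before)
        if not page:
            break  # nothing left inside the window
        for post in page:
            collected.append(
                {
                    "title": post.get("title", ""),
                    "selftext": post.get("selftext", ""),
                    "created_utc": post["created_utc"],
                }
            )
        last_seen = collected[-1]["created_utc"]
        if last_seen <= cursor:
            break  # no forward progress; avoid looping forever
        cursor = last_seen  # the next request starts where this page ended
    return collected


# Stand-in for the API: five fake posts served two at a time.
FAKE_POSTS = [
    {"title": f"post {t}", "selftext": "text", "created_utc": t}
    for t in (10, 20, 30, 40, 50)
]


def fake_fetch_page(after, before, size=2):
    matches = [p for p in FAKE_POSTS if after < p["created_utc"] < before]
    return matches[:size]


rows = collect_submissions(fake_fetch_page, after=0, before=45)
print([row["created_utc"] for row in rows])  # [10, 20, 30, 40]
```

One caveat of paging by timestamp: if several posts share the timestamp at a page boundary, some of them can be skipped.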

I executed the functions with the required parameters and applied simple text preprocessing to eliminate any posts that had been deleted or removed by the administrators.
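A filter of this kind might look like the sketch below. It assumes that removed and deleted posts are recognizable by the placeholder body texts "[removed]" and "[deleted]" (the sample response earlier shows "[removed]"), and that posts with an empty body are dropped as well; the exact rules used in the project are not stated.

```python
# Placeholder bodies treated as "no content" (an assumption of this sketch).
REMOVED_MARKERS = {"[removed]", "[deleted]", ""}


def drop_removed(posts):
    """Keep only posts whose body text is still available."""
    return [
        post
        for post in posts
        if (post.get("selftext") or "").strip() not in REMOVED_MARKERS
    ]


sample = [
    {"title": "a", "selftext": "[removed]"},
    {"title": "b", "selftext": "Solid DD on a stock"},
    {"title": "c", "selftext": "[deleted]"},
    {"title": "d", "selftext": None},
]
kept = drop_removed(sample)
print([post["title"] for post in kept])  # ['b']
```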

Results

The whole process took a whopping 13 hours and scraped a total of 1,104,216 posts, of which 143,510 remained after preprocessing. My laptop’s Intel Core i7-6700HQ sure took a beating.

A peek at the dataset collected. Image by Author

Exploratory Data Analysis (EDA)

Before analyzing post sentiment against market data, I performed what is known as Exploratory Data Analysis to gain a better understanding of the data I had just collected.

Exploratory Data Analysis is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

1. Volume of posts across the months

The first thing I did was to plot the distribution of the posts collected. Immediately, I could see the surge in hype reflected in the sudden rise in the number of posts in January 2021.

Graph depicting the number of posts across the months. Image by Author

2. Sentiment polarity score distribution

The next thing I did was to visualize the sentiment of the posts using VADER (Valence Aware Dictionary and sEntiment Reasoner). To determine post sentiment more accurately, I updated VADER’s lexicon with a dictionary of positive and negative stock-related words and terms commonly used in the subreddit, such as YOLO, HODL, and TO THE MOON.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

def sentiment_analyzer(text):
    score = analyser.polarity_scores(text)
    return score

sentence = "In the case of Apple, I am extremely bullish on its price. So far it has been an upward trend"
print(sentiment_analyzer(sentence))
# OUTPUT:
# {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}  # Neutral

# Update lexicon for better accuracy
positive_words = 'buy bull long support undervalued underpriced cheap upward rising trend moon rocket hold hodl breakout call beat support buying holding high profit stonks yolo'
negative_words = 'sell bear bubble bearish short overvalued overbought overpriced expensive downward falling sold sell low put miss resistance squeeze cover seller loss'
# split() with no argument ignores stray spaces, so no empty-string key is added
pos = {i: 5 for i in positive_words.split()}
neg = {i: -5 for i in negative_words.split()}
stock_lexicons = {**pos, **neg}
analyser.lexicon.update(stock_lexicons)

print(sentiment_analyzer(sentence))
# OUTPUT:
# {'neg': 0.0, 'neu': 0.6, 'pos': 0.4, 'compound': 0.9325}  # Positive

Graph depicting sentiment polarity distribution of posts. Image by Author

3. Word count distribution of posts

I then visualized the word count per post using a scatter plot. The majority of posts fall below 1,000 words, with a few crossing the 5,000-word mark, reflecting some users’ extensive DD (Due Diligence), in which they analyze a stock in depth and share their findings.

Scatter plot depicting the distribution of word count of posts. Image by Author

Finding commonly discussed stocks

Before proceeding to the next step, which is to analyze post sentiment against stock price and trading volume, I first needed to identify the stocks to focus on, using the following methods:

4a. Topic modeling using Latent Dirichlet Allocation (LDA)

Graphs depicting various topics identified by LDA. Image by Author

4b. spaCy’s Named Entity Recognition (NER)

I narrowed my search by focusing spaCy’s NER on entities classified as organizations (ORG) and blacklisting common stock brokerages and media companies such as Robinhood and CNBC.
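The filtering step described above might be sketched like this. The function name and the blacklist contents are illustrative, and the pipeline is passed in as nlp so the counting logic does not depend on a particular model; doc.ents, ent.label_, and ent.text are the attributes spaCy exposes for recognized entities.

```python
from collections import Counter

# Illustrative blacklist; the full list used in the project is not given.
BLACKLIST = {"robinhood", "cnbc"}


def count_org_mentions(texts, nlp, blacklist=BLACKLIST):
    """Count ORG entities across `texts`, skipping blacklisted names.

    `nlp` is any callable returning an object with an `ents` sequence whose
    items expose `text` and `label_`, as a loaded spaCy pipeline does.
    """
    counts = Counter()
    for text in texts:
        for ent in nlp(text).ents:
            if ent.label_ == "ORG" and ent.text.lower() not in blacklist:
                counts[ent.text] += 1
    return counts


# Usage with spaCy (assumes the en_core_web_sm model is installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   print(count_org_mentions(posts, nlp).most_common(3))
```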

Topic modeling using Latent Dirichlet Allocation produced some suitable topics such as GameStop, BlackBerry, and AMC, but it was not as effective as spaCy’s NER at identifying commonly mentioned stock tickers. I decided to carry out the analysis on the top 3 stocks: GameStop, AMC, and Tesla.

Understanding correlation coefficients

Correlation measures the degree of monotonic association between two variables. A monotonic relationship between two variables is when:

As the value of one variable increases, so does the value of the other variable
As the value of one variable increases, the other variable decreases

Important note: A monotonic relationship between two variables is linear only when a change in one variable is accompanied by a proportionate change in the other variable, as depicted in Figures A & B below.

What are correlation coefficients? In the broadest sense, they measure the degree of association between two variables. Coefficient values range from -1 to 1, where -1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no linear or monotonic relationship between the two variables. The two most popular coefficients are the Pearson product-moment correlation and the Spearman rank-order correlation.

Hypothesis tests can be applied to the data to assess the statistical significance of the results, alongside the strength and direction of the relationship between the two variables.

Pearson product-moment correlation

Also known as Pearson’s r, it measures the strength and direction of linear correlation between two variables. Below is the formula to determine Pearson’s r value:
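For reference, the standard sample form of Pearson’s r for n paired observations (x_i, y_i), where x̄ and ȳ are the sample means of the two variables, is:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator measures how the two variables vary together, and the denominator rescales it so that r always falls between -1 and 1.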

The figure below showcases scatterplots of data sampled from simulated bivariate normal distributions with varying Pearson correlation coefficients (r).

Let’s take the top-left graph as an example and call it F. It has an r-value of 0.7. Because r > 0, the sampled data is positively correlated, and a value of 0.7 indicates a strong correlation between the two variables.

It’s also important to note that Pearson’s correlation test has two other outputs: R² and p-value.

R² refers to the coefficient of determination, a value between 0 and 1. For graph F’s data, R² = 0.7² = 0.49, or 49.0% when expressed as a percentage. This is the amount of variance shared by the two variables: 49.0% of the variability in the X variable is explained by the variability in the Y variable, and vice versa. The other 51.0% of the variance is attributable to factors not measured in the correlation test.
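To make the relationship between r and R² concrete, here is a small sketch that computes Pearson’s r from scratch and squares it. The function name and the three data points are made up for illustration; they are not taken from the project’s data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r for two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return covariation / sqrt(spread_x * spread_y)


x = [1, 2, 3]  # made-up data
y = [1, 3, 2]
r = pearson_r(x, y)
print(r, r ** 2)           # 0.5 0.25
print(round(0.7 ** 2, 2))  # 0.49, the R² for graph F's r of 0.7
```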

The p-value is the output of a two-tailed test of the null hypothesis against the alternate hypothesis, here with a significance threshold of 0.05. If the p-value is below 0.05, the null hypothesis is rejected in favor of the alternate hypothesis; if it is above 0.05, we fail to reject the null hypothesis.

Important note: If the hypotheses also state the direction of the correlation, a one-tailed test is performed instead.

That being said, calculation of Pearson’s r value requires certain assumptions to be fulfilled for accurate inference on the strength of association between the two variables:

The sample should be truly random and representative of one population of interest
Both variables should be measured on a continuous scale and not on an ordinal scale
Each observation must have both an X and a Y value
Variables should exhibit an approximately normal (Gaussian) distribution
Variables must exhibit linear correlation

Spearman’s rank-order correlation

Also known as Spearman’s ρ, it measures the strength and direction of monotonic association between two ranked variables. Take note of the differences in definitions.

It is similar to Pearson’s r. The fundamental difference between the two correlation coefficients is that the Pearson coefficient measures only linear relationships, while the Spearman coefficient measures monotonic relationships, whether linear or not.

It is also important to note that Spearman correlation not only works with continuous data but also ordinal data as it is based on the ranked data instead of the raw data as shown below:

Example of ranked data for Spearman correlation test. Image by Author

The Spearman correlation coefficient value is interpreted in the same way as in Pearson’s correlation coefficient.

The assumptions for Spearman correlation are fairly similar to those for Pearson correlation, with a few differences:

Variables must exhibit a monotonic association
Both variables should be measured on a continuous scale or on an ordinal scale
Data does not have to be sampled from a normal distribution
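The ranking idea can be sketched in plain Python: Spearman’s coefficient is Pearson’s r computed on the ranks of the data, with tied values sharing the average of their ranks. The function names and the sample data below are illustrative only.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1  # extend the run of tied values
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson_r(x, y):
    """Pearson's r; assumes equal lengths and non-constant inputs."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranked data."""
    return pearson_r(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [value ** 3 for value in x]  # monotonic but not linear

print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
print(spearman_rho(x, y))               # 1.0
print(round(pearson_r(x, y), 3))        # 0.943
```

The cubic data is perfectly monotonic, so Spearman’s coefficient is 1, while Pearson’s r falls short of 1 because the relationship is not linear.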

Number of posts vs the trading volume of a stock

I decided to use the Spearman correlation test as it does not assume that the data is normally distributed. The first hypothesis I will be testing is whether the number of posts related to a particular stock has any form of correlation with its trading volume.

Null Hypothesis: There is no correlation between the number of posts and the trading volume of the stock
Alternate Hypothesis: There is a correlation between the number of posts and the trading volume of the stock

I started off by preprocessing the data.

Preprocessing

Split the entire corpus into one DataFrame per stock for further preprocessing
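The project works with DataFrames; as a library-free sketch of the same idea, the snippet below groups posts by stock and counts mentions per day. The keyword lists, the function name, and the sample posts are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Illustrative keywords per stock; the matching rules used in the project
# are not given in the article.
STOCK_KEYWORDS = {
    "GME": ("gme", "gamestop"),
    "AMC": ("amc",),
    "TSLA": ("tsla", "tesla"),
}


def daily_mentions(posts, keywords=STOCK_KEYWORDS):
    """Return {ticker: Counter({date: number of posts mentioning it})}."""
    counts = defaultdict(Counter)
    for post in posts:
        text = f"{post['title']} {post['selftext']}".lower()
        words = set(re.findall(r"[a-z]+", text))
        day = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc).date()
        for ticker, names in keywords.items():
            if words.intersection(names):
                counts[ticker][day] += 1
    return counts


posts = [  # made-up posts; 1611705600 is 2021-01-27 00:00 UTC
    {"title": "GME to the moon", "selftext": "holding gamestop", "created_utc": 1611705600},
    {"title": "Tesla earnings", "selftext": "TSLA calls", "created_utc": 1611709200},
    {"title": "AMC and GME", "selftext": "", "created_utc": 1611792000},
]
for ticker, per_day in daily_mentions(posts).items():
    print(ticker, dict(per_day))
```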

Scrape stock data for trading volume and stock price each day

Visualization

I decided to plot the number of posts against the stock trading volume to better visualize the relationship between the two variables. I used a rolling 10-day average to get a clearer picture of the overall trend.
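For readers unfamiliar with rolling averages, here is a small library-free sketch of a trailing mean. In pandas the equivalent would be along the lines of Series.rolling(10).mean(); the function below is my own illustration, not the project’s code.

```python
from collections import deque


def rolling_mean(values, window=10):
    """Trailing mean over `window` values; None until the window has filled."""
    result = []
    recent = deque(maxlen=window)
    running_total = 0.0
    for value in values:
        if len(recent) == window:
            running_total -= recent[0]  # oldest value is about to drop out
        recent.append(value)
        running_total += value
        result.append(running_total / window if len(recent) == window else None)
    return result


print(rolling_mean([1, 2, 3, 4, 5], window=3))  # [None, None, 2.0, 3.0, 4.0]
```

Averaging over a window smooths out day-to-day spikes, which makes the overall trend easier to see on a plot.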

Here are the results:

Evaluation

All 3 stocks’ p-values are well below the significance threshold of 0.05, so the null hypothesis is rejected in favor of the alternate hypothesis. Their correlation coefficients indicate that both GME and AMC have a moderate correlation between the number of mentions and the trading volume of the respective stock, while Tesla has a low correlation.
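The decision rule used above can be illustrated with a permutation test, which I am swapping in here because it needs nothing beyond the standard library; in practice a library routine such as scipy.stats.spearmanr returns the coefficient and a p-value directly. The data below is made up, and all function names are illustrative.

```python
import random
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for position in range(start, end + 1):
            ranks[order[position]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as Pearson's r on ranks (assumes non-constant inputs)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread = sqrt(sum((a - mean_x) ** 2 for a in rx) * sum((b - mean_y) ** 2 for b in ry))
    return covariation / spread


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value: share of shuffles at least as correlated as the data."""
    rng = random.Random(seed)
    observed = abs(spearman_rho(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real pairing between x and y
        if abs(spearman_rho(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


daily_posts = [3, 5, 8, 13, 21, 34, 55, 89, 144, 233]  # made-up counts
daily_volume = [count ** 2 for count in daily_posts]   # made-up volumes

rho = spearman_rho(daily_posts, daily_volume)
p_value = permutation_p_value(daily_posts, daily_volume)
verdict = "reject" if p_value < 0.05 else "fail to reject"
print(f"rho={rho:.2f}, p={p_value:.4f}: {verdict} the null hypothesis")
```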

Post sentiments vs price of the stock

The next hypothesis I will be testing is whether the sentiment of posts about a particular stock is correlated with the price of the stock.

Null Hypothesis: There is no correlation between post sentiment and the price of the stock
Alternate Hypothesis: There is a correlation between post sentiment and the price of the stock

Visualization

I created a function that uses the dataset processed earlier to count the positive and negative posts, based on the scores provided by VADER, and plot everything on 2 separate subplots.
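The article does not say which cut-offs were used to call a post positive or negative. As a sketch, the snippet below applies the thresholds commonly suggested in VADER’s documentation to the compound score (0.05 and above is positive, -0.05 and below is negative); the function names and scores are illustrative.

```python
from collections import Counter


def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def count_labels(compound_scores):
    """Count how many scores fall under each sentiment label."""
    return Counter(label_sentiment(score) for score in compound_scores)


scores = [0.9325, 0.0, -0.4, 0.2]  # made-up compound scores
print(count_labels(scores))
```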

Here are the results:

Evaluation

The p-values for both TSLA and GME are well below the significance threshold of 0.05, so the null hypothesis is rejected in favor of the alternate hypothesis; their coefficients indicate a low correlation between post sentiment and the price of the respective stock. AMC’s p-value is greater than the significance threshold, so I fail to reject the null hypothesis that there is no correlation between the two variables.

Ending notes

Before I end, I would like to add a disclaimer:

The results obtained from my project should be taken with a grain of salt. I am not a financial advisor, and the analysis in this article is not meant to be used to speculate on the market.

There were quite a few things that could have been done better:

While I used the Spearman correlation test because it does not assume that the data is normally distributed, the data did not satisfy all of the test’s assumptions. Inspecting the scatter plots showed that certain sections of the data did not exhibit a monotonic relationship. I could have explored other statistical tests, such as the Wilcoxon Mann-Whitney test and Kendall’s tau.

While spaCy’s Named Entity Recognition (NER) did capture a significant number of stock tickers, a deeper analysis showed that it failed to classify a good proportion of the data, leading to a smaller dataset. I could have traded speed for accuracy by using regular expressions to identify the stock tickers in every post collected.
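A regular-expression approach of the kind mentioned above might look like this sketch. The pattern, the whitelist of tickers, and the function name are all illustrative; a real version would need a complete list of valid tickers to filter out capitalized slang such as YOLO.

```python
import re

KNOWN_TICKERS = {"GME", "AMC", "TSLA"}          # illustrative whitelist
TICKER_PATTERN = re.compile(r"\b[A-Z]{2,5}\b")  # 2 to 5 capital letters


def find_tickers(text, known=KNOWN_TICKERS):
    """Return known tickers in order of appearance (repeats included)."""
    return [match for match in TICKER_PATTERN.findall(text) if match in known]


print(find_tickers("YOLO on $GME and TSLA, AMC to the moon"))
# ['GME', 'TSLA', 'AMC']
```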

Furthermore, my analysis is based only on the 3 selected stocks, chosen for their popularity in the r/wallstreetbets subreddit, and does not represent the entire stock market. I could have delved into other popular stock-related subreddits to get a wider variety of stocks.

All the code is available on my GitHub here:

S3annnyyy/wallstreetbetsstockanalysis

A personal project where I perform statistical analysis on popular stocks talked about in subreddit r/wallstreetbets…

github.com

LinkedIn Profile: Sean Yap

Cheers!

