Finding correlations between Reddit and the Stock Market using Statistics
Using VADER (Valence Aware Dictionary and sEntiment Reasoner), the Pushshift API, spaCy's Named Entity Recognition, and the Spearman rank-order correlation test.
Introduction
In early 2021, Reddit's r/wallstreetbets was at the epicenter of one of the biggest movements in the financial market, showcasing the power of social media. While it seems like an unlikely source of market movement, it's hardly surprising in hindsight, given that Robinhood gave retail traders easy access to the stock market with zero commission fees. With that in mind, I started wondering whether I could determine a correlation between:
- The number of mentions of a particular stock and its trading volume
- The sentiment of posts about a particular stock and its price
In this post, I will scrape data from r/wallstreetbets using the Pushshift API and use that data to test the hypotheses above.
How Pushshift API works
The Pushshift API is a great resource for scraping large amounts of Reddit data. One way of using the API is through the https://api.pushshift.io/ endpoints. The base URL to access Reddit data is https://api.pushshift.io/reddit/search/.
With parameters, we can access the r/wallstreetbets subreddit between two dates in Unix timestamp format (April 2020 and April 2021): https://api.pushshift.io/reddit/search/submission/?subreddit=wallstreetbets&after=1585670400&before=1618110903&size=100
- size — increases the limit of returned entries to 100
- after — date to start the search
- before — date to end the search
- subreddit — to narrow it down to a particular subreddit
The URL above will return a JSON response of the results. For explanation's sake, I reduced size to 1 for better visualization of the JSON response. It will look something like this:
{"data": [{"all_awardings": [],"allow_live_comments": false,"author": "br0kencircuit","author_flair_css_class": null,"author_flair_richtext": [],"author_flair_text": null,"author_flair_type": "text","author_fullname": "t2_20jrdmhz","author_patreon_flair": false,"author_premium": false,"awarders": [],"can_mod_post": false,"contest_mode": false,"created_utc": 1585670411,"domain":"self.wallstreetbets","full_link": "https://www.reddit.com/r/wallstreetbets/comments/fsfhli/the_four_most_expensive_words_in_the_english/","gildings": {},"id":"fsfhli","is_crosspostable": false,"is_meta": false,"is_original_content": false,"is_reddit_media_domain": false,"is_robot_indexable": false,"is_self": true,"is_video": false,"link_flair_background_color": "#7b2e00","link_flair_css_class": "shitpost","link_flair_richtext": [{"e": "text","t": "Shitpost"}
],"link_flair_template_id": "50c5e166-b861-11e5-bc53-0e60c810ce03",
"link_flair_text":"Shitpost","link_flair_text_color":"light","link_flair_type": "richtext","locked": false,"media_only": false,"no_follow": true,"num_comments": 0,"num_crossposts": 0,"over_18": false,"parent_whitelist_status": "no_ads","permalink": "/r/wallstreetbets/comments/fsfhli/the_four_most_expensive_words_in_the_english/","pinned": false,"pwls": 0,"removed_by_category": "moderator","retrieved_on": 1585675830,"score": 1,"selftext": "[removed]","send_replies": true,"spoiler": false,"stickied": false,"subreddit": "wallstreetbets","subreddit_id": "t5_2th52","subreddit_subscribers": 1065713,"subreddit_type": "public","suggested_sort": "confidence","thumbnail": "self",
"title": "The four most expensive words in the English language are \"this time it\u2019s different.\" - Sir John Templeton",
"total_awards_received":0,"url":"https://www.reddit.com/r/wallstreetbets/comments/fsfhli/the_four_most_expensive_words_in_the_english/",
"whitelist_status": "no_ads","wls": 0}]}
In Python, JSON objects can be parsed into dictionaries; in this case, the results are held under the dictionary key "data", which maps to a list of nested dictionaries like the one above. The key fields I needed are title, selftext, and created_utc, accessed like so:
Title: ["data"][0]["title"]
Post text: ["data"][0]["selftext"]
Post created: ["data"][0]["created_utc"]
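For illustration, here is a minimal snippet using the requests library that fetches a single submission and reads out those fields:

import requests

url = "https://api.pushshift.io/reddit/search/submission/"
params = {"subreddit": "wallstreetbets", "after": 1585670400,
          "before": 1618110903, "size": 1}
data = requests.get(url, params=params).json()["data"]
print(data[0]["title"], data[0]["selftext"], data[0]["created_utc"])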
A more detailed explanation of how the API works can be found at https://github.com/pushshift/api.
Building functions to scrape Reddit data
I first needed to create a function that sends HTTP requests to Pushshift's server and gets back a JSON response of the requested content, with a tolerance of 5 failed HTTP request attempts.
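A rough sketch of what such a function could look like (the name get_pushshift_data and the exact retry and back-off details are my own assumptions, not the exact code used):

import time
import requests

def get_pushshift_data(subreddit, after, before, size=100, max_retries=5):
    """Query the Pushshift submission endpoint, retrying on failure."""
    url = "https://api.pushshift.io/reddit/search/submission/"
    params = {"subreddit": subreddit, "after": after,
              "before": before, "size": size}
    for attempt in range(max_retries):
        try:
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()
            return response.json()["data"]
        except (requests.RequestException, ValueError):
            time.sleep(1)  # back off briefly before retrying
    raise RuntimeError("Request failed after 5 attempts")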
With that, I created another function that pulls the HTTP response and appends the key data points to a list. A third function drives the other two in a while loop, taking the last created_utc from the list and feeding it back into the next request so that submissions are added continuously until the before_date parameter is reached. I also added the tqdm module and a print call for the last created_utc to check on the progress of the extraction.
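A sketch of how those pieces could fit together, building on get_pushshift_data from the sketch above (function and variable names are illustrative):

from tqdm import tqdm

def collect_posts(subreddit, after, before_date):
    """Page through Pushshift results until before_date is reached."""
    posts = []
    last_created = after
    pbar = tqdm()
    while last_created < before_date:
        batch = get_pushshift_data(subreddit, last_created, before_date)
        if not batch:
            break  # no more submissions in the window
        for sub in batch:
            posts.append({"title": sub["title"],
                          "selftext": sub.get("selftext", ""),
                          "created_utc": sub["created_utc"]})
        last_created = batch[-1]["created_utc"]  # resume from the last post seen
        print(last_created)  # progress check, as described above
        pbar.update(len(batch))
    pbar.close()
    return posts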
I executed the functions with the required parameters and did some simple text preprocessing to eliminate any posts that had been deleted by their authors or removed by moderators.
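A sketch of that filtering step, assuming the posts list from the sketch above is loaded into pandas (the exact filters are my assumption; Pushshift marks removed bodies as "[removed]", as in the JSON sample earlier):

import pandas as pd

df = pd.DataFrame(posts)
# Drop posts whose bodies were deleted by the author or removed by moderators
df = df[~df["selftext"].isin(["[removed]", "[deleted]", ""])]
df = df.dropna(subset=["selftext"]).reset_index(drop=True)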
Results
The whole process took a whopping 13 hours and scraped a total of 1,104,216 posts, leaving 143,510 posts after preprocessing. My laptop's Intel Core i7-6700HQ sure took a beating.
Exploratory Data Analysis (EDA)
Before analyzing post sentiments against market data, I performed Exploratory Data Analysis to gain a better understanding of the data I had just collected.
Exploratory Data Analysis is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
1. Volume of posts across the months
The first thing I did was visualize the distribution of the posts collected. Immediately, the surge in hype is visible in the sudden rise of posts in January 2021.
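A minimal sketch of that plot, assuming the df DataFrame from the preprocessing sketch:

import pandas as pd
import matplotlib.pyplot as plt

df["date"] = pd.to_datetime(df["created_utc"], unit="s")
monthly = df.set_index("date").resample("M").size()
monthly.plot(kind="bar", figsize=(12, 4), title="r/wallstreetbets posts per month")
plt.show()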
2. Sentiment polarity score distribution
The next thing I did was visualize the sentiment of the posts using VADER (Valence Aware Dictionary and sEntiment Reasoner). For better accuracy in determining a post's sentiment, I updated VADER's lexicon with a dictionary of positive and negative stock-related words and terms commonly used by users in the subreddit, such as YOLO, HODL, TO THE MOON, etc.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

def sentiment_analyzer(text):
    score = analyser.polarity_scores(text)
    return score

sentence = "In the case of Apple, I am extremely bullish on its price. So far it has been an upward trend"
print(sentiment_analyzer(sentence))

OUTPUT:
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0} # Neutral

# Update lexicon for better accuracy
positive_words = 'buy bull long support undervalued underpriced cheap upward rising trend moon rocket hold hodl breakout call beat support buying holding high profit stonks yolo'
negative_words = 'sell bear bubble bearish short overvalued overbought overpriced expensive downward falling sold sell low put miss resistance squeeze cover seller loss'
pos = {i: 5 for i in positive_words.split(" ")}
neg = {i: -5 for i in negative_words.split(" ")}
stock_lexicons = {**pos, **neg}
analyser.lexicon.update(stock_lexicons)

print(sentiment_analyzer(sentence))

OUTPUT:
{'neg': 0.0, 'neu': 0.6, 'pos': 0.4, 'compound': 0.9325} # Positive
3. Word count distribution of posts
I then visualized the word count per post using a scatter plot. The majority of posts lie below 1,000 words, with a few crossing the 5,000-word mark, reflective of some users' extensive DD (Due Diligence) posts, where they analyze a stock in depth and share their findings.
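A minimal sketch of that plot, again assuming the df DataFrame from earlier:

import matplotlib.pyplot as plt

df["word_count"] = df["selftext"].str.split().str.len()
plt.figure(figsize=(12, 4))
plt.scatter(df.index, df["word_count"], s=2, alpha=0.3)
plt.ylabel("Words per post")
plt.show()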
Finding commonly discussed stocks
Before proceeding to the next step, analyzing post sentiments against stock price and market volume, I first needed to identify the stocks to focus on, using the following methods:
4a. Topic modeling using Latent Dirichlet Allocation (LDA)
4b. spaCy's Named Entity Recognition (NER)
I narrowed my search by focusing spaCy's NER on entities classified as organizations (ORG) and blacklisting common stock brokerages and media companies like Robinhood and CNBC.
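A minimal sketch of how that extraction could look, assuming the en_core_web_sm model and the df DataFrame from earlier (the blacklist shown is illustrative):

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
BLACKLIST = {"Robinhood", "CNBC"}  # illustrative; brokerages and media to ignore

org_counts = Counter()
# Only the NER component is needed, so the parser is disabled for speed
for doc in nlp.pipe(df["selftext"], disable=["parser"]):
    for ent in doc.ents:
        if ent.label_ == "ORG" and ent.text not in BLACKLIST:
            org_counts[ent.text] += 1

print(org_counts.most_common(10))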
Topic modeling using Latent Dirichlet Allocation produced some suitable topics like GameStop, BlackBerry, and AMC, but it was not as effective at surfacing commonly mentioned stock tickers as spaCy's NER. I decided to carry out the analysis on the top 3 stocks: GameStop, AMC, and Tesla.
Understanding correlation coefficients
Correlation measures the degree of monotonic association between two variables. A monotonic relationship between two variables is when:
- As the value of one variable increases, so does the value of the other variable
- As the value of one variable increases, the other variable decreases
Important note: A monotonic relationship between two variables is only linear when a change in one variable is accompanied by a proportionate change in the other variable, as depicted in Figures A & B below.
What are correlation coefficients? In the broadest sense, they measure the degree of association between two variables. The coefficient values are scaled to range from -1 to 1, where -1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no linear or monotonic relationship between the two variables. The two most popular coefficients are the Pearson product-moment correlation and the Spearman rank-order correlation.
Hypothesis tests can be applied to the data to address the statistical significance of the results, as well as the strength and direction of the relationship between the two variables.
Pearson product-moment correlation
Also known as Pearson's r, it measures the strength and direction of the linear correlation between two variables. Below is the formula for determining Pearson's r:
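For paired samples (x_i, y_i), where \bar{x} and \bar{y} are the sample means:

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}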
The figure below showcases scatterplots of data sampled from simulated bivariate normal distributions with varying Pearson correlation coefficients (r).
Let's take the top-left graph as an example and call it F. It has an r-value of 0.7; since r > 0, the sampled data has a positive correlation, and a strong one at that.
It’s also important to note that Pearson’s correlation test has two other outputs: R² and p-value.
R² refers to the coefficient of determination, an absolute value between 0 and 1. Graph F's data has an R² of 0.49, which equates to 49.0% when multiplied by 100. This indicates the amount of variance shared by the two variables: 49.0% of the variability in the X variable is explained by the variability in the Y variable, and vice versa. The remaining 51.0% of the variance is attributable to factors not measured during the correlation test.
The p-value is the output of a two-tailed test of the null and alternate hypotheses with a significance threshold of 0.05. The null hypothesis is retained if the p-value > 0.05, and rejected in favor of the alternate hypothesis if the p-value < 0.05.
Important note: If the hypotheses further state the direction of correlation, a one-tailed analysis is performed instead.
That being said, calculation of Pearson’s r value requires certain assumptions to be fulfilled for accurate inference on the strength of association between the two variables:
- The sample should be truly random and representative of one population of interest
- Both variables should be measured on a continuous scale and not on an ordinal scale
- Data must come with both an X and Y variable
- Variables should exhibit an approximately normal (Gaussian) distribution
- Variables must exhibit linear correlation
Spearman’s rank-order correlation
Also known as Spearman’s ρ, it measures the strength and direction of monotonic association between two ranked variables. Take note of the differences in definitions.
While it is interpreted similarly to Pearson's r, the fundamental difference between the two correlation coefficients is that the Pearson coefficient works only with data that has a linear relationship, while the Spearman coefficient also works with monotonic relationships.
It is also important to note that the Spearman correlation works not only with continuous data but also with ordinal data, as it is based on ranked data instead of the raw data, as shown below:
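For illustration (with made-up values), ranking replaces each raw value with its position when sorted:

Raw X: 3.2, 4.1, 8.8, 10.5 → Ranks: 1, 2, 3, 4
Raw Y: 1.1, 2.5, 9.0, 20.0 → Ranks: 1, 2, 3, 4

Here the ranks agree perfectly, so ρ = 1 even though the raw values are far from linearly related. When all ranks are distinct, Spearman's ρ can be computed from the rank differences d_i of each observation pair:

\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}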
The Spearman correlation coefficient value is interpreted in the same way as in Pearson’s correlation coefficient.
The assumptions for the Spearman correlation are fairly similar to those of the Pearson correlation, with a few differences:
- Variables must exhibit a monotonic association
- Both variables should be measured on a continuous scale or on an ordinal scale
- Data does not have to be sampled from a normal distribution
Number of posts vs the trading volume of a stock
I decided to use the Spearman correlation test as it does not assume that the data is normally distributed. The first hypothesis I will be testing is whether the number of posts related to a particular stock has any form of correlation with its trading volume.
- Null Hypothesis: There is no correlation between the number of posts and the stock's trading volume
- Alternate Hypothesis: There is a correlation between the number of posts and the stock's trading volume
I started off by preprocessing the data first.
Preprocessing
- Split the entire corpus into one DataFrame per stock for further preprocessing
- Scrape each stock's daily trading volume and closing price (both steps are sketched below)
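A minimal sketch of these two steps; it assumes the yfinance package for market data and that gme_df holds the GME-related posts (names are illustrative):

import pandas as pd
import yfinance as yf

# Daily post counts for one stock's DataFrame (e.g., GME-related posts)
gme_df["date"] = pd.to_datetime(gme_df["created_utc"], unit="s").dt.normalize()
daily_posts = gme_df.groupby("date").size()

# Daily market data over the same window
market = yf.download("GME", start="2020-04-01", end="2021-04-11")
volume = market["Volume"]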
Visualization
I decided to plot the number of posts against the stock's trading volume to better visualize the relationship between the two variables, using a 10-day rolling average for a clearer picture of the overall trend.
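Continuing from the sketch above, the plot and the Spearman test might look like this (scipy.stats.spearmanr returns both the coefficient and the p-value):

import matplotlib.pyplot as plt
from scipy.stats import spearmanr

# Align post counts and trading volume on shared trading days
aligned = pd.concat([daily_posts, volume], axis=1, join="inner")
aligned.columns = ["posts", "volume"]

# 10-day rolling averages for a clearer view of the overall trend
aligned.rolling(10).mean().plot(secondary_y="volume", figsize=(12, 4))
plt.show()

rho, p_value = spearmanr(aligned["posts"], aligned["volume"])
print(f"Spearman rho = {rho:.3f}, p = {p_value:.2e}")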
Here are the results:
Evaluation
All 3 stocks' p-values are well below the significance threshold of 0.05, so the null hypothesis is rejected in favor of the alternate hypothesis. Based on their correlation coefficients, both GME and AMC show a moderate correlation between the number of mentions and the trading volume of each respective stock, while Tesla shows a low correlation.
Post sentiments vs price of the stock
The next hypothesis I tested is whether the sentiment of posts about a particular stock has any correlation with the price of the stock.
- Null Hypothesis: There is no correlation between post sentiments and the price of the stock
- Alternate Hypothesis: There is a correlation between post sentiments and the price of the stock
Visualization
I created a function that uses the dataset processed earlier to count the number of positive and negative posts, based on the values provided by VADER, and plotted everything on 2 different subplots.
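A sketch of such a function, assuming the analyser with the updated lexicon from earlier and VADER's conventional compound-score thresholds of ±0.05:

import matplotlib.pyplot as plt

def daily_sentiment_counts(stock_df):
    """Count positive and negative posts per day using VADER compound scores."""
    compound = stock_df["selftext"].apply(
        lambda t: analyser.polarity_scores(t)["compound"])
    pos = stock_df[compound > 0.05].groupby("date").size()
    neg = stock_df[compound < -0.05].groupby("date").size()
    return pos, neg

pos, neg = daily_sentiment_counts(gme_df)
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12, 6))
pos.rolling(10).mean().plot(ax=ax1, title="Positive posts (10-day average)")
neg.rolling(10).mean().plot(ax=ax2, title="Negative posts (10-day average)")
plt.show()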
Here are the results:
Evaluation
Both TSLA's and GME's p-values are well below the significance threshold of 0.05, so the null hypothesis is rejected in favor of the alternate hypothesis, indicating a low correlation between each stock's post sentiment and its price. AMC's p-value, however, is greater than the significance threshold, so the null hypothesis of no correlation between the two variables is retained.
Ending notes
Before I end, I would like to add a disclaimer:
The results obtained from my project should be taken with a grain of salt. I am not a financial advisor, and the analysis used in this article is not meant to be used to speculate in the market.
There were quite a few things that could have been done better:
While I used the Spearman correlation test because it does not assume normally distributed data, the characteristics of the data did not satisfy all of the test's assumptions. Inspection of the scatter diagrams shows that certain sections of the data did not exhibit a monotonic relationship. I could have explored other statistical tests, such as the Wilcoxon Mann-Whitney test and Kendall's tau.
While spaCy's Named Entity Recognition (NER) did capture a significant number of stock tickers, a deeper analysis showed that it failed to classify a good proportion of the data, leading to a smaller dataset. I could have traded some speed for accuracy by using regular expressions to determine the stock tickers in every post collected.
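A sketch of that regex approach (the pattern and the ticker whitelist are illustrative):

import re

TICKER_PATTERN = re.compile(r"\b\$?([A-Z]{2,5})\b")
KNOWN_TICKERS = {"GME", "AMC", "TSLA", "BB", "NOK"}  # validate against a real ticker list

def extract_tickers(text):
    """Return the set of known tickers mentioned in a post."""
    return {m for m in TICKER_PATTERN.findall(text) if m in KNOWN_TICKERS}

print(extract_tickers("YOLO'd my savings into $GME and some AMC calls"))
# {'GME', 'AMC'} (set order may vary)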
Furthermore, my analysis is based on only 3 stocks, selected for their popularity in the r/wallstreetbets subreddit; it does not represent the entire stock market. I could have delved into other popular stock-related subreddits to get a better variety of stocks.
All code is available on my GitHub here:
LinkedIn Profile: Sean Yap
Cheers!