Wallstreetbets Sentiment Analysis on Stock Prices using Natural Language Processing
The following is the first part of an ongoing project to analyze sentiment on the famous (or infamous?) subreddit, r/Wallstreetbets. The million-dollar question is: can bullish or bearish sentiment on r/Wallstreetbets, in combination with stock fundamentals, be used to build a model that predicts the direction of a particular stock's price?
Disclaimer: This is not investment advice. If you make investment decisions based on this article or anything I write, you will likely lose money. Please do your own research and make your own investment decisions.
The Process
Even though this project is still in the research phase, there is much to uncover. This project will be divided into 2 parts:
- Gathering data and setting up a database. This includes cleaning, optimizing for analytics, visualization, and preparation for modeling.
- Model development. This includes model formulation, feature engineering, hyperparameter tuning, optimizing, and measuring performance.
The data ETL process will take some time. Here’s a basic summary of the process:
- Generating the 10 most mentioned tickers for the day and assigning a sentiment type and score to each
- Pulling price and fundamental information for those stocks from the TDAmeritrade Developer API
- Running both of these scripts every day for 100 days
- Storing each day’s data in a personalized database via PostgreSQL
After generating roughly 1,000 rows of data (10 tickers per day over 100 days), I should have enough data to begin analysis on stock price predictability. Of course, I will continue to run the daily script to reach statistical significance (i.e., enough data to be confident that the observations were not generated by random chance).
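To make the first step of the summary concrete, here is a minimal sketch of ranking the most-mentioned tickers in a batch of comments. It is not the project's actual code: the function name `top_mentions`, the sample ticker set, and the sample comments are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical inputs; a real run would build the ticker set from market data.
KNOWN_TICKERS = {"AAPL", "TSLA", "GME", "AMC"}
BLACKLIST = {"YOLO", "DD", "CEO"}  # common words that look like tickers


def top_mentions(comments, known_tickers, blacklist, n=10):
    """Return the n most-mentioned tickers as (ticker, count) pairs."""
    counts = Counter()
    for comment in comments:
        # Candidate tickers: standalone runs of 1 to 5 capital letters.
        for symbol in re.findall(r"\b[A-Z]{1,5}\b", comment):
            if symbol in known_tickers and symbol not in blacklist:
                counts[symbol] += 1
    return counts.most_common(n)


comments = [
    "YOLO on GME, then more GME",
    "AAPL is an absolute buy!",
    "Holding $TSLA and GME",
]
top = top_mentions(comments, KNOWN_TICKERS, BLACKLIST)
```

Matching against a known ticker set, rather than accepting every capitalized word, is what keeps ordinary shouting from being counted as a stock.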
The Data
Any self-respecting data scientist must take time to explain the data, how it was generated, and why this particular data was chosen. The data being generated from the subreddit is purely text: words pertaining to conversations about stocks and the stock market. I use the Reddit API, which is super simple, yet robust. The price and fundamental information are generated from the TDAmeritrade API.
The theory is that the price of a stock is a function of its operational/financial performance and public sentiment. A business’s fundamentals (net profit margin, P/E ratio, leverage, market cap, etc.) have a direct effect on the price, while public opinion (technical traders and speculators on r/Wallstreetbets, for example) has an indirect effect on the price. The goal is to accurately predict the direction of a stock based on these important factors.
What is Sentiment Analysis?
Sentiment analysis is a machine learning method in natural language processing that quantifies emotion in text by analyzing the meaning of words, their context, and their frequency. For example, on a scale of 0 to 10, 0 being most negative and 10 being most positive, the following sentences would score as follows:
- “AAPL is an absolute buy!” → 10
- “I am neutral about AAPL” → around 5
- “AAPL is a total sell!” → 0
The Python library Natural Language Toolkit (NLTK) is immensely helpful in analyzing the text. It contains a lexicon of pre-determined words and their respective sentiment. The following is the logical workflow of how the algorithm processes each comment on r/Wallstreetbets:
- Comments are retrieved via the Reddit API
- All punctuation is removed
- Stop words (ex: I, in, or, etc.), which don’t add value in terms of context or meaning, are removed
- The words in each comment are looked up in the lexicon, counting the number of positives and negatives in each comment. The proportion of each category is then calculated for each comment: Positive_score = (number of positive words) / (total number of words), and likewise for Negative_score and Neutral_score
- The scores are normalized, and the stock tickers are extracted from each comment. Now we have polarity scores for each stock
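The steps above can be sketched in a few lines of plain Python. This is a simplified illustration of the counting logic as described, not NLTK's internal implementation: the tiny lexicon, the stop-word list, and the function name `score_comment` are all made up for the example.

```python
import string

# Toy lexicon and stop-word list, invented for this example only.
POSITIVE = {"buy", "love", "moon", "bullish"}
NEGATIVE = {"sell", "hate", "crash", "bearish"}
STOP_WORDS = {"i", "in", "or", "is", "an", "a", "the", "am", "about"}


def score_comment(comment):
    """Return the share of positive, negative, and neutral words in a comment."""
    # Remove punctuation and lowercase the text.
    cleaned = comment.translate(str.maketrans("", "", string.punctuation)).lower()
    # Drop stop words, which carry little sentiment.
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    total = len(words)
    if total == 0:
        return {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return {
        "positive": pos / total,
        "negative": neg / total,
        "neutral": (total - pos - neg) / total,
    }


scores = score_comment("AAPL is an absolute buy!")
```

Here three words survive the cleaning ("aapl", "absolute", "buy") and one of them is in the positive set, so the positive share is one third and the rest is neutral.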
Simple, right? Wrong!
Quantifying context and meaning can be tricky. For example, observe the sentences “I love AAPL” and “I love AAPL? Yeah, right!”. These two sentences have similar words, but completely different sentiments. The first is positive/bullish and the second is sarcastic and negative. Also, a third possible variation, “I love eating AAPLs,” has a completely different context from the previous two (trust me, weird comments like this exist). To us humans, this is extremely simple to decipher. But how does the machine learning model know the difference?
This is where the compound score comes into play. Ambiguity occurs when the Bearish, Neutral, and Bullish scores are close. For example, if all three scores are near 0.3, it is unclear whether the sentiment is bearish, bullish, or neutral. The Total_Compound score is a normalized, weighted composite score between -1 and 1, and it is the most useful metric if you need only a single dimension for measuring sentiment. Having a single metric may help remove the ambiguity that arises when the bullish, bearish, and neutral scores differ only slightly.
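For intuition, here is a minimal sketch of how a compound score can be squashed into the range -1 to 1 and turned into a label. It assumes the normalization and the ±0.05 cutoffs published in the VADER project's documentation; the function names and the Bullish/Bearish labels are this example's own, and the real library computes the raw valence sum with many additional rules.

```python
import math


def normalize(raw_score, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    return max(-1.0, min(1.0, norm))


def label(compound, threshold=0.05):
    """Map a compound score to a single sentiment label."""
    if compound >= threshold:
        return "Bullish"
    if compound <= -threshold:
        return "Bearish"
    return "Neutral"


strong_positive = normalize(4.0)   # 4 / sqrt(31), roughly 0.72
strong_negative = normalize(-4.0)
```

Because the three proportions are collapsed into one signed number, a comment always lands in exactly one bucket, which is what removes the "all three near 0.3" ambiguity.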
Financial Data
The ticker price information and fundamental data are retrieved via the TDAmeritrade API. I discussed this process in a previous post titled Forecasting Google’s Stock Price with ARIMA Modeling. I am gathering a lot of information, but here are a few examples:
- Close price
- Net price change
- Price volatility
- 52-wk high/low
- P/E Ratio
- Market Cap
- Net Profit Margin
- Quick/Current Ratios
- And many more…
There are approximately 90 columns of data per stock. The idea is that, by combining price data with fundamental data and sentiment scores, we can develop a model to predict the direction of a stock. Stay tuned…
Future Work
In a few weeks, I will have a growing database with proper infrastructure. I will be updating this project with trends and interesting findings, and I am super excited to see what I find. You can check out my GitHub repo for specific code examples, and the code snippets are also listed below.
Code
Establish Program Parameters: The program considers every stock in the US with a market cap > $100 million and a price > $3. You can change these parameters to your liking. The blacklist variable prevents common words such as ‘YOLO’ from being mistaken for tickers, and new_words helps the program create more accurate sentiment scores.
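As a rough illustration of how such parameters might be laid out, here is a minimal sketch. The constant names, the sample stocks, and every value in `new_words` are hypothetical; only the two thresholds and the names `blacklist` and `new_words` come from the description above.

```python
# Thresholds from the description above; the constant names are assumptions.
MIN_MARKET_CAP = 100_000_000  # USD
MIN_PRICE = 3.00              # USD

# Words that look like tickers but are usually ordinary slang (hypothetical list).
blacklist = {"YOLO", "DD", "ATH"}

# Extra slang mapped to a sentiment weight (hypothetical words and values).
new_words = {"moon": 4.0, "bagholder": -2.0}


def eligible_tickers(stocks):
    """Keep symbols that pass the market-cap, price, and blacklist filters."""
    return {
        s["symbol"]
        for s in stocks
        if s["market_cap"] > MIN_MARKET_CAP
        and s["price"] > MIN_PRICE
        and s["symbol"] not in blacklist
    }


# Made-up sample rows, not real market data.
sample = [
    {"symbol": "AAPL", "market_cap": 2_000_000_000_000, "price": 150.0},
    {"symbol": "TINY", "market_cap": 50_000_000, "price": 12.0},
    {"symbol": "PENY", "market_cap": 500_000_000, "price": 1.5},
    {"symbol": "YOLO", "market_cap": 900_000_000, "price": 20.0},
]
tickers = eligible_tickers(sample)
```

With NLTK's VADER analyzer, a dictionary like `new_words` is typically merged into the built-in lexicon by calling `update` on the analyzer's `lexicon` attribute, so that slang contributes to the scores.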
Main Program
For a clear step-by-step walkthrough of this code, check out the #EatTheBlocks YouTube channel.
Generating price and fundamental data
Plotting sentiment trends
Running the following script in PostgreSQL will set up the database and table schema. I set all columns to text to save time. Plus, data types can be easily converted in Python.
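To show the "all text columns, convert later" idea, here is a minimal sketch that builds a CREATE TABLE statement with every column typed as TEXT and converts a value back to a number in Python. The table name, the column names, and both helper functions are hypothetical and are not the project's real schema.

```python
def create_table_sql(table, columns):
    """Build a CREATE TABLE statement in which every column is TEXT."""
    cols = ",\n    ".join(f'"{name}" TEXT' for name in columns)
    return f'CREATE TABLE IF NOT EXISTS "{table}" (\n    {cols}\n);'


def to_float(value):
    """Convert a text value read from the database to float, or None if blank or invalid."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Hypothetical table and column names.
sql = create_table_sql("wsb_daily", ["date", "ticker", "compound", "close_price"])
print(sql)
```

Storing everything as text makes the inserts forgiving, at the cost of doing type conversion (and handling blanks) every time the data is read back for analysis.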
Feel free to check out my GitHub repo and email me at md.ghsd@gmail.com if you have any questions or comments.
Open-source acknowledgements
- Huge shout out to the #EatTheBlocks YouTube channel, whose Python code was immensely helpful in using the Reddit API. Check out his GitHub repo as well.
- Natural Language Toolkit: Vader for scoring sentiment
- Vader Sentiment GitHub repo
- TED Talk by Andy Kim on sentiment analysis
- Squarify for visualizing mentions
- Python’s open-source libraries: Pandas and Matplotlib