Stock Market Recommendation System

Divya Chandana
Published in The AI Guide
May 12, 2021

MOTIVATION

The core motivation of this analysis is that generating consistent profit from the stock market is a challenging task, especially because of the non-linear nature of stock price movement.

Users are generally unsure which stock to opt for, or when the best time is to enter or exit. WallStreetBets clearly showed what social media can do to a stock's price, at times even outpacing the stock regulators. We feel many people underestimate the intelligence of some of these self-taught or social-media traders. As most people know, stock market prices are hard to predict; business tends to be seasonal, meaning holidays, quarterly earnings reports, and fourth-quarter sales tend to affect the stock price.

This problem is very interesting because of how the subreddit r/WallStreetBets affects the stock markets and how its members have managed to raise important issues about fairness and opportunity in our financial system. Variability in user preferences and the sheer availability of choices make it even more challenging.

Our purpose is to develop a financial recommendation system that advises users whether to buy, sell, or hold a stock, based on the messages and data we have collected. This will help investors identify healthy businesses to invest in.

DATA COLLECTION

We used the Reddit API via the praw library, which packages together the calls needed to query the Reddit API in a clean, orderly fashion. We obtained nearly 3,000 Reddit posts from the subreddit r/WallStreetBets. PRAW makes it easy to target a subreddit and pull "top" posts or "hot" posts. Top posts have the most upvotes, while hot posts are gaining popularity quickly. There are also "best" posts, which have the most upvotes with the fewest downvotes.[2]

To extract the stocks from the Reddit data, we leveraged Python's built-in regex library. We searched for all strings that were exactly three characters long and fully capitalized, looping through all of the posts and extracting a list of stock candidates from each.

Regex For Stocks
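As a rough sketch (the exact pattern the project used may differ), the extraction step can look like this:

```python
import re

# Matches standalone, fully capitalized 3-letter strings (candidate tickers).
TICKER_PATTERN = re.compile(r"\b[A-Z]{3}\b")

def extract_candidates(post_text):
    """Return the unique 3-letter, all-caps strings found in a post."""
    return sorted(set(TICKER_PATTERN.findall(post_text)))

candidates = extract_candidates("Bought more GME today, also eyeing AMC and NOK!")
print(candidates)  # ['AMC', 'GME', 'NOK']
```

Note that a pattern this loose also matches ordinary all-caps words, which is exactly why the validation and stop-list steps below are needed.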

Another data source we leveraged is the Yahoo Finance API. Yahoo Finance is a media platform that provides financial news, stock quotes, press releases, and financial reports. We used this API to validate our extracted stocks: if we find a candidate like GME, for example, we call the Yahoo Finance API, and if we get a result back we assume our extraction was successful; if we get an error, we assume it was not. The Yahoo Finance API provides a lot of financial data on a company, such as stock prices, the company description, current shorts on the stock, and more.[3]
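The validation loop can be sketched as follows; here `lookup` is a stand-in for a Yahoo Finance price query (in the real system it would wrap the API call), and the symbols and prices are illustrative:

```python
def validate_tickers(candidates, lookup):
    """Keep only candidates for which the price lookup succeeds.

    `lookup` is any callable that raises or returns None for an unknown
    symbol -- in our project this would wrap the Yahoo Finance API.
    """
    valid = []
    for symbol in candidates:
        try:
            if lookup(symbol) is not None:
                valid.append(symbol)
        except Exception:
            pass  # API error: assume the extraction was unsuccessful
    return valid

# Stand-in lookup for illustration; the real one would query Yahoo Finance.
fake_prices = {"GME": 158.36, "AMC": 12.77}
print(validate_tickers(["GME", "AOC", "AMC"], fake_prices.get))  # ['GME', 'AMC']
```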

To help combat poor extractions, we created a "stop stonk" dataset of strings that were not actual stocks, such as "AOC", which actually referred to Alexandria Ocasio-Cortez. Typically, if part of a post such as the title was fully capitalized, we would not run the extraction on that post. These rules helped us filter out strings that were not in fact stocks.[5]
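Combined with the stop-stonk list, the filtering rules might look like this (the list entries shown are illustrative, not the project's actual dataset):

```python
STOP_STONKS = {"AOC", "CEO", "USA", "FBI"}  # illustrative stop-stonk entries

def filter_candidates(candidates, title):
    """Drop known non-stocks, and skip posts whose title is all caps."""
    if title.isupper():  # all-caps titles are too noisy to trust
        return []
    return [c for c in candidates if c not in STOP_STONKS]

print(filter_candidates(["GME", "AOC"], "GME to the moon"))  # ['GME']
print(filter_candidates(["GME"], "HOLD THE LINE"))           # []
```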

Result:

  • 3,000 Reddit posts crawled
  • 100 stocks retrieved
  • Historical prices added

Exploratory Data Analytics on collected Data

The analysis of the extracted data below showed us how to approach the problem statement.

We extracted data from posts on the r/wallstreetbets subreddit, tokenized each post, removed stop words, and lemmatized the remaining tokens (lemmatization finds the lemma of a word based on its meaning and context) to clean up the data. From the resulting word tokens, we counted the frequency of the stonks mentioned in the posts.
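A simplified version of this pipeline, using a toy stop-word list in place of the full stop-word set and lemmatizer we used, could look like:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "to", "and", "are", "a", "of"}  # tiny illustrative list

def count_mentions(posts, tickers):
    """Tokenize posts, drop stop words, and count ticker mentions."""
    counts = Counter()
    for post in posts:
        tokens = re.findall(r"[A-Za-z]+", post)
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
        counts.update(t for t in tokens if t in tickers)
    return counts

posts = ["GME to the moon", "AMC and GME are the play", "Holding GME"]
print(count_mentions(posts, {"GME", "AMC"}))  # GME: 3, AMC: 1
```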

The above image shows the stock frequency analysis: the list of stock names and their respective occurrence counts. As expected, GME has the highest number of occurrences at 510, since most Reddit users in the r/wallstreetbets subreddit talked about it, followed by OLD and AMC.

This is a pie-chart representation of the stock occurrences. As we can see, GME takes 61% of the pie, which means most people talked about it.

From the Reddit wiki, a post's score is the difference between its number of upvotes and its number of downvotes. We wanted to leverage this information to dig deeper into the stonk analysis, so we collected all the posts and their scores along with each post's stonk name, grouped the data by stonk name, aggregated the scores, and ranked the stonks by sorting the totals.
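The grouping-and-ranking step can be sketched with pandas (the per-post scores here are illustrative):

```python
import pandas as pd

# One row per (post, stonk) pair, with the post's score (upvotes - downvotes).
posts = pd.DataFrame({
    "stonk": ["GME", "GME", "AMC", "FOR", "GME"],
    "score": [900_000, 700_000, 150_000, 200_000, 400_000],
})

# Group by stonk, sum the scores, and rank from highest to lowest.
ranked = posts.groupby("stonk")["score"].sum().sort_values(ascending=False)
print(ranked)  # GME first with 2,000,000, then FOR, then AMC
```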

Here we can see GME tops the list with about 2 million votes, followed by FOR, DAY, AMC, and others.

The above graph illustrates the average stock price vs. the short ratio. The short ratio is calculated by dividing the number of shares sold short by the average daily trading volume. Knowing how many shares of a stock have been shorted is a good indication of how investors view that stock; that's where the short ratio comes in handy. Also known as the "days to cover" ratio, if a stock's short ratio is trending lower, it could mean that investor sentiment in the company is improving and the stock price has a good chance of going up. However, the short ratio on its own is not necessarily an accurate predictor of market direction.[1]

The short ratio tells investors approximately how many days it would take short sellers to cover their positions if the price of a given stock were to increase. The higher the short ratio, the longer it will take to buy back those borrowed shares.
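As a worked example of the formula (with made-up numbers):

```python
def short_ratio(shares_short, avg_daily_volume):
    """Days-to-cover: shares sold short / average daily trading volume."""
    return shares_short / avg_daily_volume

# e.g. 10M shares short against 4M shares traded per day on average
print(short_ratio(10_000_000, 4_000_000))  # 2.5 days to cover
```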

Here we can see that GME, for example, has one of the lowest short ratios, which suggests investors may think the company is improving, that the stock has a good chance of going up, and that there are a lot of possibilities in investing in this specific stock.

KEY IDEA

The main idea behind this entire system is to leverage our extracted stocks to create baselines for comparison. A user inputs a stock and we provide one of three types of recommendations based on similarity scores and groupings of the stock data.

We will leverage clustering to create groupings of our data as one metric to help evaluate a stock for a recommendation. We will then leverage similarity scoring, such as cosine similarity, to return the top 5 stocks with similar prices and results. Finally, we will confirm our groupings using a networkX graph to ensure we have stable, clean recommendations.

CLUSTERING

To make strong recommendations, we wanted to see how our stocks grouped. We wanted to base our recommendations using more than one metric so we would ensure strong and accurate recommendations. We have Financial Results data such as Reported Revenue as well as the 12-month average stock price for 2020. We ran clustering on the dataset in three different ways.

  1. Normalize the data with the z-score.

z-score formula

  2. Run DBSCAN clustering on the entire dataset (Financial Results & Stock Price). The issue here was that we had too many features to cluster on, and we felt the clusters would not represent the data very well.

  3. Run DBSCAN on each dataset individually, after applying the z-score to normalize that specific data. This gave us two groups of clusters, which we felt were a better representation of the data. Knowing that a company's stock price does not necessarily equate to strong financials, we felt this was a much better approach.

To pick the best epsilon value, we used an elbow curve, similar to how the optimal number of clusters is found in K-means. We found that we had 3 clusters, plus outliers in the data that we needed to treat differently.

Elbow Curve
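A minimal sketch of this pipeline on toy data, assuming scikit-learn (the epsilon shown is illustrative; in practice it is read off the elbow of the sorted k-nearest-neighbor distances):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy stand-in for the M1..M12 price matrix: two price regimes plus one outlier.
prices = np.vstack([
    rng.normal(10, 1, (20, 12)),   # low-priced stocks
    rng.normal(200, 5, (20, 12)),  # high-priced stocks
    np.full((1, 12), 5000.0),      # an extreme outlier
])

# Step 1: normalize every feature with the z-score.
z = StandardScaler().fit_transform(prices)

# Step 2: sort the k-nearest-neighbor distances; the "elbow" suggests epsilon.
dists, _ = NearestNeighbors(n_neighbors=5).fit(z).kneighbors(z)
k_dist = np.sort(dists[:, -1])  # plotting this curve gives the elbow

# Step 3: run DBSCAN; label -1 marks outliers.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(z)
print(sorted(set(labels.tolist())))  # [-1, 0, 1]: two clusters plus noise
```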

We also attempted K-means clustering on the dataset and leveraged the silhouette score to find the optimal number of clusters. The silhouette score tells us how well separated and distinguished the clusters are. We found that K-means wanted to give us 2 clusters for each dataset; for our use case, we determined that more clusters were better.
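A sketch of the silhouette-based search on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Two tight, well-separated blobs stand in for the stock feature vectors.
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])

# Score each candidate k; the highest silhouette marks the best separation.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 2 on this toy data, matching what K-means suggested for ours
```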

Clusters formed after applying DBSCAN

In the above data frame, the columns M1 to M12 represent the monthly averages of the stock prices, fetched from Yahoo Finance. Based on these price values, we clustered the stocks using DBSCAN, since it is good at detecting outliers.

In the above image, all X's are outliers; we ignore them for now.

We used a scatter plot because we wanted to see how the data clustered with the resulting labels. The red "+" symbols are nicely clustered in one place and all the "*"s in another. We can ignore the X's, since they are outliers.

Here is the 3D representation of the price labels with respect to shares short, short ratio, and profit margin.

SIMILARITY SCORING

To create the strongest recommendations possible we needed more metrics to indicate the similarity of a stock. To do this we continued to leverage our 2 datasets and attempted to add text from the Reddit post itself as well as the company description.

To get features from our text data, we leverage sklearn's TF-IDF module to return the word features that are most important.

TF-IDF Formula

We then use sklearn's pairwise cosine_similarity module to take the cosine similarity of our text features. We can then input a stock and dictate x, the number of stocks to return. In this case x == 5, so we return the top 5 similar stocks based on Reddit posts and the top 5 based on company description.
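A minimal sketch of the TF-IDF plus cosine-similarity lookup, using made-up company descriptions (the real system returns the top 5; here we return the top 2 of a 4-stock toy set):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy company descriptions keyed by ticker (illustrative text, not real filings).
descriptions = {
    "GME": "video game retailer selling games and consoles",
    "AMC": "movie theater chain showing films",
    "BBY": "electronics retailer selling consoles and appliances",
    "DIS": "media company producing films and theme parks",
}
tickers = list(descriptions)

tfidf = TfidfVectorizer().fit_transform(descriptions.values())
sim = cosine_similarity(tfidf)  # pairwise similarity matrix

def top_similar(ticker, n=2):
    """Return the n most similar tickers, excluding the ticker itself."""
    i = tickers.index(ticker)
    ranked = sorted(range(len(tickers)), key=lambda j: sim[i, j], reverse=True)
    return [tickers[j] for j in ranked if j != i][:n]

print(top_similar("GME"))  # BBY ranks first (shared retail vocabulary)
```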

When looking at the Reddit post similarity scores, we found that outside the top 5 there were still extremely similar posts. So we dug a little deeper into the features selected by TF-IDF and found many words that we felt did not provide much context. At that point, we determined this may not be the best way to improve our recommendations.

As we took a look at the company descriptions we saw better results and found that we could use this to get other stocks that were described similarly to the input stock. We debated if this information was necessarily helpful in creating recommendations and found that it was not when comparing the next set of similarity scores.

We finally went back to our numeric data from the Yahoo API: the financial results data and the stock price data. When a stock is input, we retrieve this information from the Yahoo API and then apply the z-score to our entire dataset (including the input stock's information) to normalize the data. We then use cosine similarity on each of the two datasets and return the top 5 stock recommendations.

In this case, if we input Best Buy we would get 2 sets of 5 stocks. We would then look for overlap.

BEST BUY Financial Results

PSA, NOK, AAL, MGM, SLV

BEST BUY Stock Results

DIS, CAT, AMD, SLV, PSA

We can then cross-reference the two lists to see which stock appears in both. We also then take a look at our stock recommendation baseline stocks to see if any of the returned stocks are on that list. We can then base our recommendation on that grouping.
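Using the two lists above, the cross-reference is a simple set intersection:

```python
# Top-5 lists returned for the input stock (Best Buy) from each dataset.
financial_top5 = ["PSA", "NOK", "AAL", "MGM", "SLV"]
price_top5 = ["DIS", "CAT", "AMD", "SLV", "PSA"]

# Stocks recommended by both metrics carry the most weight.
overlap = set(financial_top5) & set(price_top5)
print(sorted(overlap))  # ['PSA', 'SLV']
```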

We tested our system with Euclidean distance as well and got very similar answers to cosine similarity. Using the financial data as the largest basis for recommendations makes the most sense: as we read through posts, we noticed they focus on stock price and financial release information. Along with the DBSCAN labels, we felt this would give us the best system for making a recommendation.[4]

STONK! BASELINE

This part did require some research and understanding of the phenomena that occurred over the course of the Reddit Stock Boom.

STONK BUY!: Strong Buy

  • High similarity to GME and cluster grouping

JUST BUY IT: Weak Buy

  • High similarity to AMC and cluster grouping

SAVE YOUR MONEY!: Do Not Buy

  • Low similarity to STONK BUY or JUST BUY IT stonks
  • Cluster groups do not align to similar stonks

Putting our earlier analysis to work, we knew that GameStop had by far the largest number of mentions. Our research showed GameStop was the stock most heavily targeted by wallstreetbets, so we used it as our coveted STONK BUY! (strong buy) baseline. Next, we saw that AMC and NOK had a large number of mentions, so we used these stocks as our JUST BUY IT (weak buy) recommendation. We then used a similarity threshold as a cut-off: if a stock was not at least x similar to these stocks, we would recommend SAVE YOUR MONEY! (do not buy).

To help confirm our groups a little more we used networkX graph to show the placement of our stocks. We paired all of the stocks together and used our cosine similarity values between the paired stocks as the weight. This would allow us to calculate the betweenness centrality and see what stocks are more centralized in our similarity scores.
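A sketch of the graph construction and centrality computation with networkX. The similarity values are made up, chosen so that GME ends up peripheral (low betweenness) and AMC central, mirroring what we saw. Note that betweenness centrality treats edge weights as path lengths, so this sketch converts similarity to a distance first; this conversion is our choice here, while the project used the similarity values directly as weights:

```python
import networkx as nx

# Made-up cosine similarities between stock pairs (illustrative only).
similarity = {
    ("GME", "AMC"): 0.4, ("GME", "NOK"): 0.1, ("GME", "BBY"): 0.1,
    ("AMC", "NOK"): 0.8, ("AMC", "BBY"): 0.8, ("NOK", "BBY"): 0.5,
}

G = nx.Graph()
for (a, b), s in similarity.items():
    # Convert similarity to a distance so similar stocks sit "close".
    G.add_edge(a, b, weight=1.0 - s)

# Betweenness centrality shows which stocks sit in the middle of the network.
centrality = nx.betweenness_centrality(G, weight="weight")
print(centrality["GME"], centrality["AMC"])  # GME peripheral, AMC central
```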

Full Graph

This image is the full network graph of all of the extracted stonks! As you can see, there are two distinct clusters on the graph. If we take a look at the smaller cluster, we can see that many of these are not proper stocks; these were poorly extracted. Looking into our data further, some of them were stocks that had stopped trading or were not stocks at all, and the API returned 0 values for their metrics.

Zoom in

When we zoom into the larger graph and slide over to its right side, we can locate GME. GME has a very low betweenness centrality score, indicating it is not the most central data point in the network. This is a good thing, because we would not want our highest-rated stock to be very similar to all other stocks, which could create false positives for our "STONK BUY!" recommendation. As we slide toward the middle of the graph, we can see our weak-buy stocks, such as AMC, are more centralized.

EVALUATION

To help ensure we are making strong recommendations, we needed to backtest our system in a way that gives us anecdotal evidence that it performs properly. We used some of the stocks from our 2,000 mentioned stocks to test the system: we predict what the results should be and compare them with the actual results. We also compared each stock's 2020 price, before the Reddit investor event, with its price today to see whether the price did in fact increase.

  • Evaluated whether stocks we artificially placed in the system had gone up or down in value over the past few months; this allowed us to verify whether they were good purchases
  • 9/10 stocks increased in value
  • We looked at this in a binary fashion (value increased/decreased)
  • Check stocks in the database for performance
  • Test if other similarity metrics provide similar results such as Euclidean distance and k-means
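The binary evaluation above can be sketched as follows (the prices are illustrative, not real quotes):

```python
def backtest(prices_before, prices_after, recommended):
    """Fraction of recommended stocks whose price increased (binary check)."""
    ups = [s for s in recommended if prices_after[s] > prices_before[s]]
    return len(ups) / len(recommended)

# Illustrative prices around the Reddit investor event (not real quotes).
before = {"GME": 17.0, "AMC": 2.0, "NOK": 4.0, "XYZ": 50.0}
after = {"GME": 160.0, "AMC": 9.0, "NOK": 4.5, "XYZ": 30.0}
print(backtest(before, after, ["GME", "AMC", "NOK", "XYZ"]))  # 0.75
```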

WHAT DID WE LEARN?

While doing this project, we learned how Reddit posts influenced the stock market itself, and how stock trends change according to users' posts on the wallstreetbets subreddit.

We also learned the valuable concept of the recommender system and how it works in detail, by implementing content-based filtering for our stock recommendation project and recommending to users "Stonk Buy", "Just Buy It", or "Save Your Money".

Limitations

The major limitation of this project is that it doesn't take live data into account, as we used only 12 months of gathered data dumped into CSV files. In order to run this project in real time, we would need to make a lot of changes, such as integrating Kafka to ingest real-time data, analyzing that data, and feeding it into our working model; then we would get reliable results.

The other limitation is the user base. To make this stonk recommendation a robust engine, we would need collaborative filtering in addition to content-based filtering, but we don't have user buy/sell transactions to track user behavior and interests. The current stock recommendation engine would have been far more robust if we had integrated a hybrid recommendation engine into the system.

Conclusion

We finished the working model for the stonk recommendation engine, and it provides content-based recommendations to users who are interested in investing in stocks. Since we took the data from r/wallstreetbets, where people sometimes plan and go against the stock market system, even a layman can understand the situation, take advantage of it, and invest in stocks. Our content-based recommendation method is on par with the actual stock market rises and drops at that time. However, to make real-time recommendations, the current model has to be improved with the capability to process Reddit posts in real time. Additionally, the current recommendation system suffers in robustness, since there is no user data and it cannot make hybrid recommendations. Regardless of these flaws, we can recommend stocks to users based on the trends going on in the wallstreetbets subreddit.

Appendix

Written By: AndrewD5, Divya Chandana, Jasneek Chugh

Author Contributions

Andrew Dennis

  • Worked on multiple cluster techniques.
  • Implemented code for similarity scoring.
  • Cleaned data and finalized baseline stonks.
  • Evaluated the results.

Divya Chandana

  • Research on understanding trends going on in r/wallstreetbets.
  • Understanding the data with Exploratory Data Analysis.
  • Research on how current system can be extended.
  • Worked on which python libraries to implement.

Jasneek Singh

  • Data Collection from Reddit and Yahoo Finance
  • Data imputation and cleaning

References

[1] https://www.investopedia.com/terms/s/shortinterestratio.asp

[2]

[3]

[4]

[5]
