Automating data collection from Reddit to invest in stocks

NO INVESTMENT ADVICE. The Content is for informational purposes only.

During the COVID-19 pandemic lockdown, many new retail investors jumped into stock trading, partly because there wasn’t much else to do, but also because electronic brokers like Robinhood and TD Ameritrade began offering low or zero commissions on stock trades. These factors, along with the boredom of staying at home, have been the main drivers of these abnormally high levels of retail trading.

Among these ‘new’ traders, the group that has seen the largest growth is ‘Millennials’. Many spent their free time on Reddit, where subreddits like r/wallstreetbets lured users into ‘getting rich quick’ by trading options. The subreddit’s popularity grew enormously during the pandemic, as anonymous users posted large profits and made the process look easy. Interestingly enough, users who lost money were treated as heroes and became a source of memes.

A ‘meme’ from r/wallstreetbets

r/wallstreetbets has many daily posts on different topics related to stocks and ‘investing’. Among them are posts tagged ‘DD’, which stands for Due Diligence. DDs are posts that aim to research a specific stock, or group of stocks, before making an investment decision. People upvote the DDs they consider useful or interesting, or, in some cases, posts that are just ‘funny’. This doesn’t necessarily mean the data is of high quality: most users on the platform are amateur traders, and the community is well known for taking high-risk trades (also known as YOLOs). On the other hand, many experts are analyzing these recommendations, and observers have attributed market price movements to r/wallstreetbets activity (https://www.bloomberg.com/news/articles/2020-09-15/big-investors-are-dying-to-know-what-the-little-guys-are-doing).

This means that, in some cases, the recommendations might be useful. Our goal is to analyze these posts to see whether there is any consensus on trade activity coming from them, and whether they help make investment decisions. For this we will use the ‘PRAW’ package for Python and build the workflow to extract and analyze the data.

PRAW for Python

PRAW, an acronym for “Python Reddit API Wrapper”, is a python package that allows for simple access to reddit’s API. PRAW aims to be as easy to use as possible and is designed to follow all of reddit’s API rules.

PRAW has limitations that come from the data Reddit provides, which restrict the number of queries and the time frame from which we can pull data. For our purposes, however, we can run the code on a daily basis and expand our library of content each time the script is applied.

In order to use PRAW, we first need to open an account on Reddit and also create an app on their platform. Here we will assume you know how to open a Reddit account; if not, I recommend this guide: https://upcity.com/blog/how-to-create-an-account-and-recommend-content-on-reddit/

Creating a Reddit App

As we mentioned, the second step is to create an app. To do this, we first need to log in to our Reddit account and then access this site: https://www.reddit.com/prefs/apps

Then, click on the button <are you a developer? create an app…>
Next, fill out the form:

1- Choose a name for your app
2- Pick ‘script’ as the app type, so that you can run it from your computer
3- Add a description for your app (optional)
4- About URL: a URL where you host the documentation for your app (optional)
5- redirect uri: the location where the authorization server sends the user once the app has been successfully authorized and granted an authorization code or access token. Because you are going to use it on your own computer, use http://localhost:8080

Now, click on <create app> and you should get a personal use script (your 14-character client ID) and a secret key (24 characters).

Save those keys and make sure you don’t share them. Now you can use them in your Python code.

Using PRAW

Now we are ready to start downloading the data from Reddit. First, open a Python notebook and make sure you have PRAW installed. Let’s import PRAW and the other libraries we will be using.

# Pip install praw. Uncomment if you don't already have the package
# !pip install praw
# Imports
import praw # imports praw for reddit access
import pandas as pd # imports pandas for data manipulation
import datetime as dt # imports datetime to deal with dates

Next, let’s access our Reddit app using our usual login credentials along with our client ID (the 14-character key) and secret key (the 24-character key).

reddit = praw.Reddit(client_id='Your_14_character_client_id',
                     client_secret='Your_24_character_secret_key',
                     user_agent='Your_api_name',
                     username='Your_Reddit_user_name',
                     password='Your_Reddit_password')
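
Since these keys should never be shared or committed anywhere, a safer pattern is to keep them out of the notebook entirely. The sketch below is an assumption on my part rather than part of the original workflow: it reads the credentials from environment variables (the variable names are placeholders you would set yourself). PRAW also supports a praw.ini configuration file for the same purpose.

# Alternative: read the credentials from environment variables instead of
# hardcoding them in the notebook. The variable names below are placeholders;
# set them in your shell before launching the notebook, e.g.
# export REDDIT_CLIENT_ID='Your_14_character_client_id'
import os
import praw
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent=os.environ['REDDIT_USER_AGENT'],
                     username=os.environ['REDDIT_USERNAME'],
                     password=os.environ['REDDIT_PASSWORD'])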

The object we called ‘reddit’ is a handle that connects us to the Reddit site. Now we need to access the subreddit from which we want to pull the data, in this case r/wallstreetbets.

# Access subreddit r/wallstreetbets
subreddit = reddit.subreddit('wallstreetbets')

Finally, within the subreddit we need to filter the content we want from all the posts on the site. PRAW lets you pull posts in several ways: .hot, .new, .controversial, .gilded, .search and .top. To find out more about these access methods, we recommend referring to the documentation: https://praw.readthedocs.io/en/latest/code_overview/models/multireddit.html?highlight=.gilded#praw.models.Multireddit.gilded
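
As a quick illustration of these accessors (not part of the main workflow, and the limits below are arbitrary), the following sketch prints the titles of a few posts pulled with .hot and .top:

# Peek at a few posts using two of the accessors mentioned above
for post in subreddit.hot(limit=5):
    print('HOT:', post.title)
# .top accepts a time filter such as 'day', 'week', 'month', 'year' or 'all'
for post in subreddit.top(time_filter='week', limit=5):
    print('TOP (week):', post.title)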

Since ‘DD’ is a specific flair (the equivalent of a tag in other forum formats), we will search for that flair, sort the results by date, and pull the latest 100 posts.

# Pull latest 100 posts with flair 'DD' sorted from newest to oldest
DD_subreddit = subreddit.search('flair:"DD"', limit=100, sort='new')

If we inspect this object, we can see it is a ListingGenerator, which means the output is a set of items produced as a list. We can choose which variables to keep in our analysis by inspecting the keys of the items it generates. Below we show an example of how to do this.

# pprint is part of the Python standard library, so no installation is needed
# Import
import pprint
# Loop through the variable names in a post
for posts in DD_subreddit:
    pprint.pprint(vars(posts))

The output will look something like the following. Be careful: a ListingGenerator is exhausted once you iterate over it, so after running this the subsequent statements may stop returning results and you will need to invoke subreddit.search() again.

Sample output from the ListingGenerator

Finally we need to convert our data pull to tabular form so that we can manipulate the text. To do this, we create a dictionary where we store the posts we retrieved.

# Create a dictionary with the variables we want to save
DD_dict = {"title": [],
           "score": [],
           "id": [],
           "url": [],
           "comms_num": [],
           "date": [],
           "body": []}
# We now loop through the posts we collected and store the data
for posts in DD_subreddit:
    DD_dict["title"].append(posts.title)
    DD_dict["score"].append(posts.score)
    DD_dict["id"].append(posts.id)
    DD_dict["url"].append(posts.url)
    DD_dict["comms_num"].append(posts.num_comments)
    DD_dict["date"].append(posts.created)
    DD_dict["body"].append(posts.selftext)

We are almost done. There is one minor fix we need to make to the date variable: the ‘created’ field comes as a numeric Unix timestamp, so we convert it using the datetime library.

# First convert dictionary to DataFrame
DD_data = pd.DataFrame(DD_dict)
# Function takes a variable type numeric and converts to date
def get_date(date):
    return dt.datetime.fromtimestamp(date)
# We run this function and save the result in a new object
_date = DD_data["date"].apply(get_date)
# We replace the previous date variable with the new date variable
DD_data = DD_data.assign(date = _date)
# Let's check the output table
DD_data
Our data table with the latest 100 stock posts
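
As mentioned earlier, the idea is to run this extraction daily and expand our library of content over time. A minimal sketch of that accumulation step, assuming a local CSV file of our own choosing (dd_posts.csv is just a placeholder) and deduplicating by post id:

# Append today's pull to a local archive and drop duplicate posts by id.
# The file name 'dd_posts.csv' is a placeholder of our own choosing.
import os
archive_path = 'dd_posts.csv'
if os.path.exists(archive_path):
    previous = pd.read_csv(archive_path, parse_dates=['date'])
    combined = pd.concat([previous, DD_data], ignore_index=True)
else:
    combined = DD_data
# Keep the latest version of each post so re-runs don't double-count
combined = combined.drop_duplicates(subset='id', keep='last')
combined.to_csv(archive_path, index=False)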

Data Manipulation

And now it is time for the fun part. Given that this part is more involved in terms of the scripts used, we will omit most of the code here, but the entirety of the code can be found in the repository.

We can use this data to perform an exploratory analysis and see how these ‘traders’ operate. First, we will look at how the DDs break out as a time series: we index the data by date and then plot the series to see which day had the most posts.
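
A minimal sketch of that step, using the DD_data table built above (the plotting details are our own choice, not the original script):

# Index the posts by date and count how many DDs were published each day
import matplotlib.pyplot as plt
posts_per_day = DD_data.set_index('date').resample('D').size()
# Plot the daily counts to see which day had the most posts
posts_per_day.plot(kind='bar', title='DD posts per day')
plt.xlabel('Date')
plt.ylabel('Number of DD posts')
plt.show()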

We can also see how the market performed during the same time period to compare it with the activity in these DDs.

SP500 performance in the latest 10 days (Source: Marketwatch)
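
The chart above comes from Marketwatch; purely as an illustration, the same index data could also be pulled programmatically, for example with the yfinance package (an assumption on my part, not used in the original article):

# !pip install yfinance  # uncomment if the package is not installed
import yfinance as yf
import matplotlib.pyplot as plt
# '^GSPC' is the Yahoo! Finance ticker for the S&P 500 index;
# the date range roughly matches the period discussed in the text
sp500 = yf.download('^GSPC', start='2020-09-08', end='2020-09-22')
sp500['Close'].plot(title='S&P 500 closing price')
plt.show()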

We can guess that as the SP500 index built momentum from September 10th through September 15th, speculation about continued growth kept coming. However, as the market declined in the following days, the number of DDs declined as well. Could this be because r/wallstreetbets is mainly ‘bullish’ on the market? Let’s test this hypothesis.

Let’s create a word cloud for the DDs we have downloaded, to see which words are most predominant in these posts.
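
A minimal sketch of this step, assuming the wordcloud package (our choice of library) and the DD_data table built above:

# !pip install wordcloud  # uncomment if the package is not installed
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
# Combine titles and bodies of the collected DD posts into one text blob
text = ' '.join(DD_data['title'].fillna('')) + ' ' + ' '.join(DD_data['body'].fillna(''))
# Build and display the word cloud, dropping common English stopwords
wc = WordCloud(width=800, height=400, stopwords=STOPWORDS,
               background_color='white').generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()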

Wordcloud from the last 100 DD’s from r/wallstreetbets

Here the word cloud suggests that the most commented stocks are Nikola ($NKLA) and Tesla ($TSLA). It might be that the underlying trend is to buy, but since the key terms we are seeing relate to both kinds of options trades, calls and puts, both actions seem to be relevant in these posts. So we need to find out when these terms are being used in relation to the stocks we are seeing; for simplicity we will stick to Nikola and Tesla. And just for the reader’s understanding, the term ‘tendie’ refers to the profit or reward once a trade has been successful.

So now we can filter the posts that contain a recommendation on Nikola or Tesla and classify which recommendations are mostly buy versus sell on any given day. Since Reddit users give ‘Karma’ points to posts they consider useful, we can use that score to weight the strength of each post’s direction, and show the result by date to see how these posts progress over time. This analysis is very simplistic: it relies on words in the post, such as ‘bullish’ or ‘buy’, to estimate the trade recommendation.
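
A simplistic sketch of this keyword-based approach is shown below; the word lists and helper function are assumptions of our own, not the original script, and each post’s direction is weighted by its score:

# Very naive keyword lists; these are our own choice of terms
bull_words = ['buy', 'call', 'calls', 'bullish', 'long', 'moon']
bear_words = ['sell', 'put', 'puts', 'bearish', 'short']

def classify_post(text):
    # Return +1 for a mostly bullish post, -1 for mostly bearish, 0 when unclear
    text = text.lower()
    bull = sum(text.count(w) for w in bull_words)
    bear = sum(text.count(w) for w in bear_words)
    return 1 if bull > bear else (-1 if bear > bull else 0)

# Combine title and body so mentions in either are captured
full_text = DD_data['title'].fillna('') + ' ' + DD_data['body'].fillna('')
# Keep only posts mentioning Tesla or Nikola
mask = full_text.str.contains('TSLA|Tesla|NKLA|Nikola', case=False, regex=True)
ticker_posts = DD_data[mask].copy()
# Weight each post's direction by its score (Karma) and aggregate by day
ticker_posts['direction'] = full_text[mask].apply(classify_post)
ticker_posts['weighted'] = ticker_posts['direction'] * ticker_posts['score']
daily_signal = ticker_posts.groupby(ticker_posts['date'].dt.date)['weighted'].sum()
print(daily_signal)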

As we can see, on Sept 10th the prevailing view for these two companies was mainly to sell. However, as we approached Sept 14th, optimism grew. Following this, higher volume showed strategies both buying and selling the two stocks, and finally, after Sept 17th, no real consensus on whether to buy or sell these stocks was reached. For reference, this was the performance of both stocks during the same time period.

Price trend for TSLA (dark blue) and NKLA (light blue) (Source: Yahoo! Finance)

As we can see, there is a fairly high correlation between price movements and DD postings. Does this mean traders on r/wallstreetbets are geniuses? The answer is ‘clearly not’. This correlation simply means that when prices move in one direction, users are more likely to ‘upvote’ a DD post that is in line with the market’s trend for that day.
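
To put a rough number on that relationship rather than eyeballing the charts, one could align the daily DD counts with daily index returns, as sketched below; posts_per_day and sp500 refer to the earlier sketches and are assumptions, not outputs of the original article:

# Align daily DD counts with daily S&P 500 returns and compute their correlation
returns = sp500['Close'].squeeze().pct_change()
aligned = pd.concat([posts_per_day.rename('dd_posts'),
                     returns.rename('sp500_return')], axis=1).dropna()
print(aligned.corr())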

This, of course, is only the beginning. To do a more in-depth analysis we should consider better understanding the body of each post and its sentiment, in order to extract better insights from these posts. But these initial steps should get you going with using Reddit for your investment analyses.

Conclusion

We can see there is some consensus coming from the DD posts, and we could make decisions using mainly these posts. However, the correlation follows whatever direction the market is moving that day, which makes these recommendations a little suspicious. This is a big limitation of the process, since the posts that get the most upvotes may simply reflect the market euphoria of that day.

There are no major ethical implications in terms of personal information, given that most users on Reddit are anonymous, but it could be argued that in some cases this information could be used by corporations to profit from the trade decisions of these users. Other issues such as harassment are uncommon in this subreddit given the nature of the application, but anyone using these APIs should consider deleting any personal identification they find and deleting non-aggregated data after it has been processed for the application.

Closing Thoughts

Investing is a very complex task that takes a lot of practice and training to master, and even with a good understanding of companies and the market it can be a treacherous process. So the obvious question is: should we rely solely on the opinions of people on the internet in order to invest? You probably answered a resounding ‘No’, but taking different data points into consideration, including this one, could improve your investment model.

By the way, if your answer was actually ‘Yes’, then you surely belong on r/wallstreetbets. Go check it out!


Alvaroaguado
Social Media: Theories, Ethics, and Analytics

Business Insights and Analytics Manager at GSK and Ph.D. candidate in Business Data Science at NJIT