Data scientists often turn to social media platforms as a source of data. It’s a natural choice: there’s loads of data available, much of it free and open to the public. Reddit.com is a site built for discussion, organized into topic-based communities called ‘subreddits’, and it’s used daily by millions of users. Reddit knows that its platform is an obvious target for data scraping, since all of those millions of conversations are free to look at. Reddit offers an Application Programming Interface, or API, for accessing that data, and PRAW (the Python Reddit API Wrapper) is a Python package that lets data scientists use it easily.
Let’s look at how to use PRAW to scrape things that we might be interested in. PRAW is a Python package, so everything below is done in Python. The first thing you have to do is get API credentials, which allow you to use the API. First, if you don’t have one, create a Reddit account. Then, go to reddit.com/prefs/apps, and click on ‘create app’. Below is a picture of what should pop up.
Make sure to select ‘script’ when creating your app. Once you do, Reddit will provide you with a client ID and a secret, which together act as your key. Now it’s time to put them to use: open your Python environment of choice (I use Jupyter notebooks), and set up a Reddit object.
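The original setup code isn't reproduced here, so what follows is only a minimal sketch of one way to do it. It assumes you have stored your credentials in environment variables; the names REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET and REDDIT_USER_AGENT are made up for this example, as are the helper functions. The only PRAW call used is praw.Reddit with its client_id, client_secret and user_agent arguments.

```python
import os


def reddit_credentials(env=os.environ):
    """Collect the keyword arguments that praw.Reddit expects.

    The environment variable names are hypothetical; use whatever you like.
    """
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        # Reddit asks for a descriptive user agent; this default is a placeholder.
        "user_agent": env.get("REDDIT_USER_AGENT", "praw-tutorial-script by u/your_username"),
    }


def make_reddit(env=os.environ):
    """Build a read-only PRAW client from the credentials above."""
    import praw  # imported here so reddit_credentials works without PRAW installed

    return praw.Reddit(**reddit_credentials(env))


# In a notebook you would then run:
# r = make_reddit()
```

Keeping the ID and secret out of the notebook itself means you can share the notebook without sharing your key. With only these three arguments, PRAW runs in read-only mode, which is all that scraping public posts needs.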
Now, we can use the reddit object we created (called r here) to look at all sorts of things. We can find submissions, comments, and subreddits by calling r.submission(id), r.comment(id), and r.subreddit(subreddit). Let’s start by checking out the disneyvacation subreddit, by calling dv = r.subreddit("disneyvacation"). We can now access all the attributes of the subreddit. To view the options, call dir(dv), but for now let’s take a look at the rules of the subreddit by calling dv.rules():
This returns a sizeable dictionary, but we can iterate through it to grab all of the rules by calling [i['description'] for i in dv.rules()['rules']]. We can also look at the top posts of the subreddit by calling dv.top(), or save their IDs as a list by calling dv_top = [i.id for i in dv.top()]. Let’s take a look at one of the top posts by calling dv_ex = r.submission(dv_top[0]). You can check out all of the options on a post with dir(dv_ex).
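To make those steps concrete, here is a small sketch that wraps them in two helper functions. The helper names are my own, and the dictionary shape (a 'rules' list whose items carry a 'description') is taken from the description above; newer PRAW releases may expose rules differently, so treat that shape as an assumption.

```python
def rule_descriptions(rules_payload):
    """Return the description of every rule in a dv.rules()-style dictionary."""
    return [rule["description"] for rule in rules_payload["rules"]]


def top_post_ids(subreddit, limit=10):
    """Collect the IDs of a subreddit's top posts.

    `subreddit` is expected to be a PRAW Subreddit; `limit` caps how many
    posts are fetched (10 is an arbitrary choice for this example).
    """
    return [post.id for post in subreddit.top(limit=limit)]


# With a live PRAW object `r`, usage would look like:
# dv = r.subreddit("disneyvacation")
# print(rule_descriptions(dv.rules()))
# dv_top = top_post_ids(dv)
# dv_ex = r.submission(dv_top[0])   # fetch one post by its ID
# print(dv_ex.title, dv_ex.score)
```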
Now that we have a basic understanding of PRAW, let’s put it to work. On reddit, there are two very prevalent subreddits, ‘The_Donald’ (which is strongly pro-Trump), and ‘The_Mueller’ (which supports the investigation into Trump’s alleged collusion), and we can take a look at how people in each of these subreddits talk about the other side’s figurehead. To do this, we’ll need to look at large quantities of posts, so we’ll import and use another tool, PSAW, a Python wrapper for the Pushshift API (documentation can be found here or here). Calling ‘from psaw import PushshiftAPI’ and ‘api = PushshiftAPI(r)’ sets PSAW up to work through your PRAW object, and lets you search large numbers of posts very efficiently.
The following code (though very clunky) searches ‘The_Mueller’ for the keyword ‘Trump’, and creates a dataframe.
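The original code isn't shown here, so this is a rough sketch of that step under a few assumptions: that Pushshift access is still available to you, that PSAW's search_submissions hands back PRAW submission objects, and that the columns we care about are the ones listed in FIELDS. The helper name, the column choice and the limit of 1000 are mine, not the original author's.

```python
from datetime import datetime, timezone

# Columns to keep for each post (an assumption; add whatever you need).
FIELDS = ("id", "title", "score", "num_comments")


def submissions_to_rows(submissions, fields=FIELDS):
    """Turn submission objects into plain dictionaries, one per post."""
    rows = []
    for post in submissions:
        row = {name: getattr(post, name, None) for name in fields}
        # created_utc is a Unix timestamp; convert it to a readable datetime.
        created = getattr(post, "created_utc", None)
        row["created"] = (
            datetime.fromtimestamp(created, tz=timezone.utc) if created is not None else None
        )
        rows.append(row)
    return rows


# With PSAW and pandas installed, and `r` being your PRAW object:
# import pandas as pd
# from psaw import PushshiftAPI
# api = PushshiftAPI(r)
# found = api.search_submissions(q="Trump", subreddit="The_Mueller", limit=1000)
# tm = pd.DataFrame(submissions_to_rows(found))
```

Building a list of dictionaries first and handing it to pandas in one go tends to be tidier than appending to a dataframe row by row.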
We repeat the code with The Donald subreddit, searching for Mueller. After the dataframes are set up, we should inspect them to make sure everything looks right.
Once we have the data, the next step in comparing how each group talks about the other is to run sentiment analysis on the post titles, which are plain text. After installing the vaderSentiment library (documentation can be found here), we can import it with ‘from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer’ and start running it. As with PRAW, we create a sentiment analyzer object and pass the text to it. The code below shows how to do it.
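As a hedged sketch of that step: the helper below takes any object with a polarity_scores method (which is what SentimentIntensityAnalyzer provides) and adds its output to each row. The function name and the list-of-dictionaries layout are assumptions made for this example.

```python
def add_sentiment(rows, analyzer):
    """Return copies of `rows` with the analyzer's scores for each title added.

    VADER's polarity_scores returns a dictionary with the keys
    'neg', 'neu', 'pos' and 'compound'.
    """
    return [{**row, **analyzer.polarity_scores(row["title"])} for row in rows]


# With vaderSentiment installed:
# from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# analyzer = SentimentIntensityAnalyzer()
# scored = add_sentiment(rows, analyzer)
#
# Or directly on a pandas dataframe with a 'title' column:
# tm["compound"] = tm["title"].apply(lambda t: analyzer.polarity_scores(t)["compound"])
```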
This adds sentiment columns to our data frames, so let’s look at what they mean. The positive component measures how much of the text is positive, the negative component how much is negative, and the compound component is a single normalized score from -1 (most negative) to +1 (most positive). The compound score is the go-to, all-around measure.
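If you want labels rather than numbers, the VADER documentation suggests cutoffs of 0.05 and -0.05 on the compound score. A tiny sketch of that rule (the function name is my own):

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative' or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

The threshold is a convention rather than a law, so feel free to widen the neutral band if short titles produce too many borderline cases.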
With these columns, we can run some comparisons, using statistics.
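One possible way to run such a comparison, sketched with the standard library only: summarize the compound column for each subreddit, then set the two summaries side by side. Which statistics to compare is a judgment call; the ones below (mean compound score and the share of positive and negative titles) are my own picks, and tm and td are assumed to be the two dataframes.

```python
from statistics import mean


def summarize(compounds, threshold=0.05):
    """Summarize a collection of VADER compound scores."""
    compounds = list(compounds)
    if not compounds:
        raise ValueError("need at least one compound score")
    n = len(compounds)
    return {
        "count": n,
        "mean_compound": mean(compounds),
        # Share of titles on either side of the neutral band.
        "share_positive": sum(c >= threshold for c in compounds) / n,
        "share_negative": sum(c <= -threshold for c in compounds) / n,
    }


# With the two dataframes in hand:
# print(summarize(tm["compound"]))
# print(summarize(td["compound"]))
```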
We can see that the folks at The Mueller are about 7 times more respectful when talking about Trump than the folks at The Donald are when talking about Mueller.
We can also graph how the compound rating affects score.
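The original figures aren't reproduced here, so the following is just one way to draw such a graph: a scatter plot of compound score against Reddit score, drawn on a matplotlib Axes that you pass in. The helper name and the row layout (dictionaries with 'compound' and 'score' keys) are assumptions, as are the tm_rows and td_rows variables in the usage lines.

```python
def plot_compound_vs_score(ax, rows, label):
    """Scatter each post's compound sentiment (x) against its Reddit score (y)."""
    xs = [row["compound"] for row in rows]
    ys = [row["score"] for row in rows]
    # A low alpha keeps dense clusters readable.
    ax.scatter(xs, ys, alpha=0.3, label=label)
    ax.set_xlabel("VADER compound score (title)")
    ax.set_ylabel("Reddit score")
    ax.set_title(label)
    return xs, ys


# With matplotlib installed:
# import matplotlib.pyplot as plt
# fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))
# plot_compound_vs_score(left, tm_rows, "The_Mueller")
# plot_compound_vs_score(right, td_rows, "The_Donald")
# plt.show()
```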
These graphs don’t show us much, other than that posts at The Donald vary a lot more in their scores.
There are also many more low-scoring (that is, negative) posts on TD than on TM.
There are tons of things you can look at with the dataframes we created, including many comparisons between columns (e.g. plotting scores or compound scores over time). The main point of this post, though, was to show you how to explore Reddit’s API for your own purposes. You can pursue your own research questions from here, and hopefully this has helped you on your way!