Reddit data analytics trilogy #1 — Data scraping with PRAW
Fun scraping Reddit data
I was inspired by an article about creating a Data Table using data from Reddit, and I was convinced that I could do much more with atoti. So I started digging into the Reddit data, and it piqued my interest so much that it became a trilogy! In this trilogy, I’m going to take you through:
- Data scraping
- NLP with spaCy
- Data exploration with atoti
The article mentioned above was published quite some time ago, and Dash has probably evolved well beyond what it showed. Nonetheless, my end goal is to build the dashboard below with atoti:
UPDATE: The above GIF is based on an older version of atoti. The latest version of atoti offers much smoother and more functional dashboards and widgets; see its documentation for details.
Check out how I could drill down on different dimensions in the pivot table and interact with the different data visualizations.
Let’s start with part 1 of my trilogy — data scraping Reddit! If you haven’t heard of Reddit before, do take a look! It has tremendous amounts of information on pretty much everything, classified under communities known as subreddits.
Being community-driven, Reddit is a treasure trove of data that reveals the community’s trends and opinions on various topics. Let’s take a look at how you can get your hands on this data.
PRAW — Python Reddit API Wrapper
PRAW is a Python package that I used to access Reddit’s API and scrape the subreddits that I’m interested in. Note that you need a Reddit account in order to get the API credentials required to connect to Reddit.
Open the apps page under your Reddit account preferences and click the button to create an app.
Select “script” in order to obtain “refresh_tokens”, and use “http://localhost:8080” as the redirect URI, as mentioned in the PRAW documentation.
Click “create app” to get the API information needed to connect to Reddit.
PRAW — Authentication
You can store the sensitive information highlighted in the screenshot above separately and load it when needed. To keep things simple, I’m showing how to apply these values to connect to Reddit. Plug the 3 values into the code segment below:
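The original code segment is not reproduced here, so the following is only a minimal sketch of the connection step. It assumes the three values are the client ID, the client secret, and a user agent string, which are the keyword arguments `praw.Reddit` accepts for a read-only connection. The environment variable names and both helper functions are hypothetical, introduced only for illustration.

```python
import os

# Hypothetical environment variable names; any secret store would work.
ENV_KEYS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}


def load_credentials(env=os.environ):
    """Read the three values from a mapping and return praw.Reddit keyword arguments."""
    missing = [var for var in ENV_KEYS.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {kwarg: env[var] for kwarg, var in ENV_KEYS.items()}


def connect(client_id, client_secret, user_agent):
    """Create a Reddit instance from the three values."""
    import praw  # third-party package: pip install praw

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
```

To plug the values in directly, call `reddit = connect(client_id="...", client_secret="...", user_agent="...")` with your own strings. To keep them out of the source, call `reddit = connect(**load_credentials())` instead.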
Accessing Subreddits
This is where the fun begins! Using the authorized Reddit instance, I can obtain a subreddit instance by passing the name of the subreddit as follows:
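As a rough sketch of this step, the helper below wraps PRAW’s `reddit.subreddit(name)` call. The function name `get_subreddit` and the stripping of a leading "r/" are my own illustrative additions, not something the original code is known to do.

```python
def get_subreddit(reddit, name):
    """Return a Subreddit instance for `name` from an authorized Reddit instance.

    A leading "r/" is stripped as a convenience (an illustrative assumption).
    """
    if name.startswith("r/"):
        name = name[2:]
    return reddit.subreddit(name)
```

With an authorized instance, `get_subreddit(reddit, "wallstreetbets")` would return the subreddit used later in this article.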
You can also combine multiple subreddits as follows:
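PRAW accepts several subreddit names joined with a plus sign, for example `reddit.subreddit("wallstreetbets+stocks")`. The sketch below builds that string from a list; the helper name and the second subreddit are hypothetical choices for illustration.

```python
def combine_subreddits(names):
    """Join subreddit names with "+" so they can be passed to reddit.subreddit()."""
    names = list(names)
    if not names:
        raise ValueError("At least one subreddit name is required")
    return "+".join(names)
```

For instance, `reddit.subreddit(combine_subreddits(["wallstreetbets", "stocks"]))` would give a single instance whose listings draw from both communities.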
Have a look at the Metrics for Reddit to see which subreddits are available and which ones are popular.
Accessing Submission instance from Subreddit instance
Reddit front is a listing class that represents the front page, and a subreddit instance offers most of the same listing methods. Below is a summary of the front methods and what type of submissions they return:
- best — best items
- comments — most recent comments
- controversial — controversial submissions
- gilded — gilded items
- hot — hot items
- new — new items
- random_rising — random rising submissions
- rising — rising submissions
- top — top submissions
To access the attributes of each submission, iterate through the listing returned by the front method.
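Here is a small sketch of that iteration, assuming a subreddit instance obtained earlier. It picks a listing method by name and yields two submission attributes, `title` and `score`. The helper name and the restricted set of listing names are my own assumptions; the set is limited to methods I am confident a subreddit instance provides.

```python
# Listing methods assumed to be available on a subreddit instance.
LISTINGS = ("controversial", "hot", "new", "rising", "top")


def iter_title_and_score(subreddit, listing="hot", limit=5):
    """Yield (title, score) pairs from the chosen listing of a subreddit."""
    if listing not in LISTINGS:
        raise ValueError(f"Unsupported listing: {listing!r}")
    fetch = getattr(subreddit, listing)  # e.g. subreddit.hot
    for submission in fetch(limit=limit):
        yield submission.title, submission.score
```

Because this is a generator, nothing is requested from Reddit until you loop over it, for example `for title, score in iter_title_and_score(subreddit, "rising"): print(score, title)`.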
In my case, I appended the attributes of each submission to a list so that I can convert it into a pandas DataFrame later on:
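The original snippet is not shown here, so this is a sketch under assumptions rather than the exact code used. It calls the `new` listing with a `limit` and copies a handful of submission attributes into a list of dictionaries. The selection of attributes and the function names are illustrative choices of mine; the attributes themselves are ones PRAW exposes on a submission.

```python
from datetime import datetime, timezone


def submission_to_row(submission):
    """Copy selected submission attributes into a plain dictionary."""
    return {
        "id": submission.id,
        "title": submission.title,
        # author is None when the account has been deleted
        "author": str(submission.author) if submission.author else None,
        "score": submission.score,
        "num_comments": submission.num_comments,
        "created": datetime.fromtimestamp(submission.created_utc, tz=timezone.utc),
        "selftext": submission.selftext,
        "url": submission.url,
    }


def collect_new_submissions(subreddit, limit=100):
    """Return one row per submission from the subreddit's `new` listing."""
    return [submission_to_row(s) for s in subreddit.new(limit=limit)]


def to_dataframe(rows):
    """Convert the collected rows into a pandas DataFrame."""
    import pandas as pd  # third-party package: pip install pandas

    return pd.DataFrame(rows)
```

A typical call would be `df = to_dataframe(collect_new_submissions(reddit.subreddit("wallstreetbets")))`, which gives one row per submission and one column per attribute.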
Notice that I used the front method “new” to retrieve the latest 100 submissions (set using the limit parameter) from the subreddit wallstreetbets.
Voila! Now I have the data from Reddit and I’m ready to try out Natural Language Processing (NLP) with it! Check out part 2 of my trilogy — NLP with spaCy!