Download Data From Reddit Using Python

Reddit is a great source of information, I'll show you how to use it

Antonio Feregrino
Software y Data
5 min read · May 5, 2022


Photo by Volodymyr Hryshchenko on Unsplash — modified by author

Motivation

Ever since Russia’s “special military operation” in Ukraine started, I have been doomscrolling the comments in the r/worldnews subreddit live threads. I watched with amazement how the frequency of comments increased with each major event, but I also noticed that each day there were fewer and fewer comments, showing a sustained decrease of interest (at least when measured by Reddit comments) in the topic of the invasion.

This prompted me to find all the live threads in an attempt to figure out whether my feeling was true or not. The following two posts are a result of this curiosity; in the first one (the one you are reading now) I’ll show you how I created the dataset, whereas in the second one, you will find how to use the data.

The Reddit API

There are a couple of ways to download data from the internet: web scraping or APIs (when available). Web scraping is my favourite, but it is also the most time-consuming and the most fragile to maintain, since any change to a page's layout can break your scraper.

Luckily for us, Reddit offers an API one can use to consume data from the site.

As with most major websites' APIs, to start using this API one needs to register an application. My recommendation is to create an entirely separate Reddit account, since you will also have to use that account's password to authenticate.

When your app has been created, make a note of the following values, as we will use them later.

Image by author

PRAW to use the Reddit API

To consume the API from Python, we will use the PRAW package, installable with pip install praw.

Once we have our client ID and secret, we can create a praw.Reddit instance, passing in the information we just got from Reddit; to avoid hardcoding our password and secrets, let's set these values through environment variables:
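
Something along these lines should work as a starting point; the CLIENT_ID, CLIENT_SECRET and PASSWORD variable names match the secrets used later for automation, while REDDIT_USERNAME and the user agent string are placeholders of my own:

```python
import os

import praw

# Credentials come from environment variables so nothing sensitive is hardcoded.
# REDDIT_USERNAME and the user_agent value are assumptions for this sketch.
reddit = praw.Reddit(
    client_id=os.environ["CLIENT_ID"],
    client_secret=os.environ["CLIENT_SECRET"],
    username=os.environ["REDDIT_USERNAME"],
    password=os.environ["PASSWORD"],
    user_agent="worldnews-live-threads script",
)
```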

Hashing function

We will use a function that takes a string and scrambles it in a deterministic manner; this is to “mask” some values that I do not think should be made public, or at least not so easily.
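
A minimal sketch of such a function, hashing the string with SHA-256 and keeping only a short prefix of the digest (the exact masking used in the original may differ):

```python
import hashlib


def hash_string(value: str, length: int = 12) -> str:
    """Deterministically mask a string: the same input always yields the same short digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]
```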

Finding all threads

We need to find all the live threads related to the invasion, so I will limit my search to the window starting on the 1st of February 2022 (there were no threads before February) and ending one day before the search is run:
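
Something like this, assuming the variable names search_start and search_end:

```python
from datetime import datetime, timedelta, timezone

# Only threads created inside this window are of interest.
search_start = datetime(2022, 2, 1, tzinfo=timezone.utc)
search_end = datetime.now(timezone.utc) - timedelta(days=1)
```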

Next, I define a list of r/worldnews moderators, since they are the only ones able to create live threads. The list of mods can be obtained using the API itself:
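
With PRAW, that is a one-liner over the subreddit's moderator listing (the moderators variable name is my own):

```python
# Moderators of r/worldnews, fetched straight from the API.
moderators = [moderator.name for moderator in reddit.subreddit("worldnews").moderator()]
```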

Iterating over each user

The only way I found to collect all the threads is to comb through every submission made by the mods and then figure out which ones are the threads we care about. The following fragment of code does that, fetching up to 200 submissions per user and storing them in a list:
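
A sketch of that fragment might look like this, reusing the moderators list from the previous step (the all_submissions name is my own):

```python
all_submissions = []
for username in moderators:
    redditor = reddit.redditor(username)
    # Fetch up to 200 of this moderator's most recent submissions.
    all_submissions.extend(redditor.submissions.new(limit=200))
```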

Iterating over all submissions

Once we have all the submissions made by mods, we can iterate over them in search of the ones we want. In this case, the ones we want have titles starting with "/r/worldnews live thread", "r/worldnews live thread" or "worldnews live thread", and were made between the 1st of February and yesterday:

Lastly, to extract all the properties, I am using the getattr function in combination with a list of properties.
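
Putting both ideas together, a sketch could look like this; the PROPERTIES list is illustrative (the original likely keeps more attributes), and the title check is done case-insensitively:

```python
from datetime import datetime, timezone

PREFIXES = (
    "/r/worldnews live thread",
    "r/worldnews live thread",
    "worldnews live thread",
)

# Illustrative set of submission attributes to keep for the dataset.
PROPERTIES = ["id", "title", "author", "created_utc", "num_comments", "permalink"]

live_threads = []
for submission in all_submissions:
    created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
    if submission.title.lower().startswith(PREFIXES) and search_start <= created <= search_end:
        # getattr lets us pull every attribute from a plain list of names.
        live_threads.append([getattr(submission, prop) for prop in PROPERTIES])
```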

Converting to a DataFrame

Once we have all the submissions in a list, we should convert it to a pandas DataFrame to make it easy to work with.

Then we can:

  • Use pd.to_datetime to convert the unix timestamp to an actual date
  • Hash the author’s name with the previously declared hash_string function

After all the transformations, we can save the threads' data with a specified column order, sorted by creation date and without the index:
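
A sketch of those transformations, assuming the output file is called threads.csv and the column order shown below:

```python
import pandas as pd

threads = pd.DataFrame(live_threads, columns=PROPERTIES)

# Convert the unix timestamp into an actual datetime.
threads["created_utc"] = pd.to_datetime(threads["created_utc"], unit="s")

# Mask the author's name with the hash_string function declared earlier.
threads["author"] = threads["author"].astype(str).map(hash_string)

# Save with a fixed column order, sorted by creation date and without the index.
column_order = ["id", "title", "author", "created_utc", "num_comments", "permalink"]
threads[column_order].sort_values("created_utc").to_csv("threads.csv", index=False)
```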

Downloading ALL the comments for ALL threads

The next step is pretty straightforward. We need to iterate over the file we just created and use the PRAW package to download all the comments made to a submission.

To begin, let's create a function that takes in a comment and a submission and returns a list of the comment's properties. This function is a bit more involved, given that comments differ from one another; once again, I am using the getattr function to make our lives easier.
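
A simplified version of such a function might look like this; the COMMENT_PROPERTIES list is illustrative, and missing attributes simply become None:

```python
COMMENT_PROPERTIES = ["id", "parent_id", "created_utc", "score", "body"]


def comment_to_record(comment, submission):
    """Flatten a comment into a list of values, falling back to None for missing attributes."""
    record = [getattr(comment, prop, None) for prop in COMMENT_PROPERTIES]
    # Mask the author (deleted comments have no author) and keep a link back to the thread.
    author = getattr(comment, "author", None)
    record.append(hash_string(author.name) if author else None)
    record.append(submission.id)
    return record
```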

We are all set to iterate over the threads, downloading all those we do not have yet. There is a tutorial on the PRAW website itself that details how to download the comments of a thread; there is some customisation going on in terms of converting everything to a DataFrame, but the code is pretty much self-explanatory:
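
A sketch of that loop, assuming one CSV per thread inside a comments/ folder (the file layout is my own choice):

```python
import os

import pandas as pd

threads = pd.read_csv("threads.csv")
os.makedirs("comments", exist_ok=True)

for thread_id in threads["id"]:
    output_file = f"comments/{thread_id}.csv"
    if os.path.exists(output_file):
        continue  # already downloaded on a previous run

    submission = reddit.submission(id=thread_id)
    # Resolve every "load more comments" placeholder, as shown in the PRAW tutorial.
    submission.comments.replace_more(limit=None)

    records = [comment_to_record(comment, submission) for comment in submission.comments.list()]
    columns = COMMENT_PROPERTIES + ["author", "submission_id"]
    pd.DataFrame(records, columns=columns).to_csv(output_file, index=False)
```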

Automating through GitHub Actions (extra)

New threads are created every day, which means that if we want to keep our dataset updated, we must run this script every day as well. If you keep your code in GitHub, it sounds like the perfect candidate for automation with GitHub Actions.

First off, we will need to store our sensitive environment variables (CLIENT_ID, CLIENT_SECRET, PASSWORD) as repository secrets. To do this, go to Settings ➡ Secrets (Actions) ➡ New repository secret:

Image by author

Once all three secrets are available, create a .yml file in the .github/workflows folder with the following content:
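
A workflow along these lines should do the trick; the action versions, the script name and the commit step are placeholders to adapt to your own repository:

```yaml
name: Update Reddit dataset

on:
  schedule:
    - cron: "0 10 * * *"   # every day at 10 AM (UTC)

jobs:
  download:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"

      - name: Install dependencies
        run: |
          pip install pipenv
          pipenv install

      - name: Download threads and comments
        env:
          CLIENT_ID: ${{ secrets.CLIENT_ID }}
          CLIENT_SECRET: ${{ secrets.CLIENT_SECRET }}
          PASSWORD: ${{ secrets.PASSWORD }}
        run: pipenv run python download.py   # placeholder script name

      - name: Commit the updated CSV files
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git add *.csv comments/*.csv
          git diff --staged --quiet || git commit -m "Update dataset"
          git push
```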

In short, every day at 10 AM the workflow:

  1. Checks out the code
  2. Sets up Python 3.8
  3. Installs the dependencies (in this case I was using pipenv to handle them locally, but you could use something entirely different)
  4. Executes all the previous Python code that downloads the threads and their comments
  5. Commits all the changes to the repository, saving our CSV files

Conclusion and resources

And that is it: we have now downloaded all the relevant threads and are ready to use them.

In this post, we looked at how to create a dataset from Reddit data; in the next one, I'll show you how to use this dataset to create something interesting and discover how interest in the topic has evolved over time. I hope you learned something new, or at least that you enjoyed it.

As always, the code for this post is available here along with the full repo, and the dataset is available on Kaggle. Questions? Comments? I am open to discussion on Twitter at @feregri_no.

