Scraping Reddit using Python

How to scrape data from Reddit using the Python Reddit API Wrapper (PRAW) in a structured way

parth bhardwaj
The Startup
4 min read · Jun 21, 2020


In this post, we are going to learn how to scrape the all/top/best posts from a subreddit, along with the comments on each post (while maintaining their nested structure), using PRAW.

So, basically, by the end of the tutorial, if you wanted to scrape all the jokes from r/Jokes, you would be able to do it.

TL;DR: Here is the code to scrape data from any subreddit.

In order to understand how to scrape data from Reddit, we first need an idea of how the data is organized on Reddit. Here's a snippet:

Example of what a Reddit post looks like

Now, if you look at the post above, the following are the useful data fields that you would want to capture/scrape:

  • The post (title and body)
  • The comments, in a structured way (comments are nested on Reddit, and when analyzing the data we may need to work with that exact structure, so we have to preserve the reference from each comment to its parent comment, and so on)
  • The points (upvotes) of a post
  • The points (upvotes) of a comment
  • The timestamp of a post/comment
  • The URL of the post and the comment
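To make the "nested structure" point concrete, here is a small, self-contained sketch (independent of PRAW) of how you might flatten a comment tree into rows while keeping each comment's reference to its parent. The `Comment` class and field names here are illustrative stand-ins, not PRAW's actual classes:

```python
# A tiny stand-in for a nested comment tree (illustrative, not PRAW's classes).
class Comment:
    def __init__(self, comment_id, body, replies=None):
        self.id = comment_id
        self.body = body
        self.replies = replies or []

def flatten(comment, parent_id=None):
    """Walk a comment tree depth-first, recording each comment's parent id."""
    rows = [{"id": comment.id, "parent_id": parent_id, "body": comment.body}]
    for reply in comment.replies:
        rows.extend(flatten(reply, parent_id=comment.id))
    return rows

tree = Comment("c1", "top-level comment", [
    Comment("c2", "first reply", [Comment("c3", "nested reply")]),
    Comment("c4", "second reply"),
])
rows = flatten(tree)
```

Because every row keeps a `parent_id`, the flat table can always be re-assembled into the original tree during analysis.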

Now that we know what we have to scrape and roughly how to scrape it, let's get started.

Let’s get started

So, to get started, the first thing you need is a Reddit account. If you don't have one, you can go and make one for free.

The next step is to install PRAW. PRAW (the Python Reddit API Wrapper) is a package that lets your Python code talk to the Reddit API.

To install PRAW, all you need to do is open your command line and install the Python package praw.

The next step, after making a Reddit account and installing PRAW, is to go to this page and click "create app" or "create another app".

In the form that opens, enter a name, a description, and a redirect URI. For the redirect URI, you can use http://localhost:8080.

Now that you have created your Reddit app, you can write Python code to scrape any data you want from any subreddit.

If you want the entire script, go here.

The first step is to import the packages and create a connection to Reddit so that we can scrape data from it.

You can use the references provided in the picture above to add the client_id, client_secret, user_agent, username, and password to the code below so that you can connect to Reddit using Python.

Now, let's say you want to scrape all the posts and their comments from a list of subreddits. Here's what you do:

The next step is to create dictionaries holding the fields that will be scraped; these dictionaries will later be converted to dataframes.

Here’s the process flow for the code :

  • Create a list of queries for which you want to scrape data (for example, if I want to scrape all posts related to gaming and cooking, I would use "gaming" and "cooking" as the keywords).
  • Create dictionaries for all the data fields that need to be captured (there will be two: one for posts and one for comments).
  • For each query, search the subreddit and save the details of each matching post using the append method.
  • For each matching post, save the details of each comment using the append method.
  • Save the posts dataframe and the comments dataframe as CSV files on your machine.

So, let's say we want to scrape all posts from r/AskReddit that are related to gaming; we will have to search the subreddit using the keyword "gaming". Here's how we do it in code:

NOTE: In the following code, the limit has been set to 1. The limit parameter caps how many posts or comments you want to scrape; you can set it to None if you want to scrape all posts/comments, while setting it to 1 will scrape only one post/comment.

Conclusion

PRAW is one of the most convenient ways to scrape data from any subreddit on Reddit. And with the number of users and the amount (and quality) of content increasing, Reddit is becoming a powerhouse for any data analyst or data scientist, as they can accumulate data on almost any topic they want!

Thank you for reading this article. If you have any recommendations or suggestions for me, please share them in the comments section below.

Happy scraping!
