PRAW — a Python package to scrape Reddit post data
Introduction
PRAW, which stands for “Python Reddit API Wrapper”, is a Python package that allows simple access to Reddit’s API. It has a well-documented official website that can be referred to for code snippets. Here I will discuss the steps to install, configure, and write a Python script to scrape Reddit posts.
Installation & Upgrade
pip install praw
pip install --upgrade praw
Requirements
Pandas should already be installed; it will help manage and store the scraped data in a CSV file. If it is not installed, use the command below.
conda install pandas
Sign Up and Registration
If you already have a Reddit profile, go and log in. If not, go here and click on the “SIGN UP” button as shown below.
After signing up you will be redirected to the Reddit homepage. Click on the dropdown button beside the profile picture (shown by red arrow below).
Click on “Visit Old Reddit”.
Now go to preferences in the top right corner of the page.
Navigate to the “apps” section in the nav bar and click on the button “are you a developer? Create an app…” [shown in the image below using a red arrow]
Fill in all the details and don’t forget to select “script” [by default, “web app” is selected]. Finally, click on the ‘Create app’ button. All the fields here are mandatory; skipping any of them will not let you proceed further.
You will be redirected to a page where the personal use script ID and secret token are given [as marked in the picture]. Copy them to your clipboard, as they will be required while writing the Python script.
Writing Python Script
Here I will be using Jupyter Notebook to demonstrate the whole process. You can use any Notebook or code editor of your choice.
We will create a dataset consisting of features of Reddit posts from various subreddits. PRAW provides around 20 attributes for scraping various features of a Reddit post, such as ID, author, flair, title, upvotes, etc. Here we will store 7 features of each Reddit post to create the dataset.
Importing Libraries
import praw
import pandas as pd
Accessing the API using the secret token
reddit = praw.Reddit(client_id = "CLIENT_ID",         # personal use script
                     client_secret = "CLIENT_SECRET", # secret token
                     username = "USERNAME",           # profile username
                     password = "PASSWORD",           # profile password
                     user_agent = "USERAGENT")
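Before moving on, it is worth checking that the credentials actually work. A minimal sanity check (assuming the reddit object above was created with your real credentials) is to ask PRAW for the authenticated user:

# Should print your Reddit username; an exception here usually means
# one of the credentials above is wrong.
print(reddit.user.me())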
Initialize empty lists for each feature
author_list = []
id_list = []
link_flair_text_list = []
num_comments_list = []
score_list = []
title_list = []
upvote_ratio_list = []
Mention all subreddits that will be scraped
We will scrape hot posts from ten popular subreddits, so all these subreddit names will be stored in a list and we will iterate over each element of the list.
subreddit_list= ['india',
'worldnews',
'announcements',
'funny',
'AskReddit',
'gaming',
'pics',
'science',
'movies',
'todayilearned'
]
Subreddit and various attributes
The subreddit() method takes a single parameter, the subreddit name. Each subreddit has a listing called ‘hot’. We access the hot posts using the hot() method, which takes an argument ‘limit’ that specifies the number of posts we want to fetch from that subreddit (note that Reddit’s listings typically return at most around 1,000 posts, regardless of the limit).
subreddit = reddit.subreddit(subred)
hot_post = subreddit.hot(limit = 10000)
Now we can iterate over hot_post, and for each post we can access attributes such as author, id, score, title, etc. Each attribute can then be appended to the corresponding list initialized earlier. For example, the script for scraping a single attribute (here, the author name) from 1000 hot posts of a subreddit (here, r/india) is shown below.
subreddit = reddit.subreddit('india')
hot_post = subreddit.hot(limit = 1000)

for sub in hot_post:
    author_list.append(sub.author)
You can learn about more attributes from the official website of PRAW here.
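To get a feel for these attributes, the short snippet below prints a few fields of the first five hot posts of a subreddit (a minimal sketch; id, score, num_comments, upvote_ratio and title are all standard PRAW submission attributes):

# Print a few attributes of the first 5 hot posts from r/india.
for sub in reddit.subreddit('india').hot(limit = 5):
    print(sub.id, sub.score, sub.num_comments, sub.upvote_ratio, sub.title[:40])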
Final Script
The script to iterate through the list of 10 subreddit names, scraping & storing specific attributes of each post is given below.
for subred in subreddit_list:
    subreddit = reddit.subreddit(subred)
    hot_post = subreddit.hot(limit = 10000)
    for sub in hot_post:
        author_list.append(sub.author)
        id_list.append(sub.id)
        link_flair_text_list.append(sub.link_flair_text)
        num_comments_list.append(sub.num_comments)
        score_list.append(sub.score)
        title_list.append(sub.title)
        upvote_ratio_list.append(sub.upvote_ratio)
    print(subred, 'completed; ', end='')
print('total', len(author_list), 'posts have been scraped')
The two print statements help us track when each subreddit finishes and the total number of posts scraped so far.
Storing Dataset in a CSV file
To store all the scraped data, we will use the pandas library to convert the lists into a pandas DataFrame first and then write it to a CSV file. Each list will be treated as a column in the dataset, and each row of the dataset will describe a unique Reddit post.
df = pd.DataFrame({'ID':id_list,
'Author':author_list,
'Title':title_list,
'Count_of_Comments':num_comments_list,
'Upvote_Count':score_list,
'Upvote_Ratio':upvote_ratio_list,
'Flair':link_flair_text_list
})
df.to_csv('reddit_dataset.csv', index = False)
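Before or after saving, a quick look at the DataFrame helps confirm that the scrape worked as expected, for example:

# Sanity check: size of the dataset and a preview of the first rows.
print(df.shape)
print(df.head())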
Conclusion
Here is the complete Python script. You can also clone the script and CSV file from this Github repo.
Here is a Reddit dataset uploaded on Kaggle that you can use for your Machine Learning project. Give it an upvote if you find it helpful.
Thanks a lot for reading. Please let me know if you have any corrections or suggestions. Please do 👏 if you like the post. Thanks in advance…