Basics of Data Extraction of Reddit Threads using Python

In a previous post, I discussed my analysis of the /r/Games subreddit during E3 2018, a popular video games event. In this post, I will go through the code used to extract the data for the Reddit threads used in that analysis. This was done in Python via PRAW (Python Reddit API Wrapper).

A second post (Part 2) will discuss how I used Pandas to do the analysis.

Import modules

Firstly, we will import the modules needed to perform our extraction.

import praw
from praw.models import MoreComments
import regex
import datetime
import redcreds as creds
  • PRAW is used to access the Reddit API; you’ll need to set up your Reddit app credentials separately beforehand. More info on that here.
  • There’s a bit of text parsing later on that we need the regex module for.
  • redcreds is just my .py file where I have stored my Reddit credentials, so I don’t have to type sensitive account information directly into my code (a sketch of this file is shown below).
  • datetime is so we can convert Unix timestamps into user-friendly date formats.
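
For reference, redcreds.py is nothing more than a few plain variables that the main script imports. A minimal sketch (every value here is a placeholder; substitute your own app’s details) could look like this:

username = "your_reddit_username"
password = "your_reddit_password"
client_id = "your_app_client_id"
client_secret = "your_app_client_secret"
user_agent = "script:reddit-extraction:v1.0 (by /u/your_reddit_username)"

Keep this file out of any shared or public code so your account details stay private.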

Set up Reddit API Access

r = praw.Reddit(username=creds.username,
                password=creds.password,
                client_id=creds.client_id,
                client_secret=creds.client_secret,
                user_agent=creds.user_agent)

With all your credentials passed in and the authenticated instance assigned to r, we can now access whatever part of Reddit we need.
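
If you want to confirm the credentials were accepted before going further (a quick optional check, not part of the original script), you can print the logged-in user:

print(r.user.me())   # should print your Reddit username if authentication succeeded
print(r.read_only)   # False when the instance is authorised to act as your account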

Pick your subreddit

chosen_sub = input() #Games
subreddit = r.subreddit(chosen_sub)

We will input “Games” this time, but when you run this code you can input whichever subreddit you wish to access. The subreddit variable is now linked to /r/Games.
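
To confirm the object points at a real, reachable subreddit, you can print a couple of its attributes (again, an optional check that isn’t part of the original script):

print(subreddit.display_name)   # Games
print(subreddit.subscribers)    # current subscriber count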

subs = []          # list of submission ids returned by the search
subCount = 0       # running count of submissions found
sub_entries = {}   # dictionary mapping each submission id to its extracted data

These empty variables are needed to store our submission ids and the data linked to them.

Run your search query

sub_query = input() #E3 2018
for submission in subreddit.search(sub_query, sort='new', time_filter='week', limit=None):
    subs.append(submission.id)
    subCount += 1
  • sub_query will store your search query; in this case we will search “E3 2018”.
  • ‘time_filter’ accepts all/day/hour/month/week/year, and you can ‘sort’ results by relevance/hot/top/new/comments, much as you would when searching Reddit as a user (a variant of the call with different parameters is sketched after this list).
  • subreddit.search returns at most 100 results by default unless you set limit=None, which retrieves as many results as the Reddit API allows.
  • subCount keeps track of the number of submissions collected.
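
For illustration, the same call with different parameters (the query and values here are arbitrary examples, not part of the original analysis) might look like this:

for submission in subreddit.search("E3", sort='top', time_filter='year', limit=50):
    print(submission.title, submission.score)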

Run some checks

first_post = r.submission(id=subs[0])
last_post = r.submission(id=subs[subCount-1])
print(str(subCount) + " submissions have been added to the list")
print("1st entry is:")
print(first_post.title + " created: " + str(datetime.datetime.fromtimestamp(first_post.created)))
print("Last entry is:")
print(last_post.title + " created: " + str(datetime.datetime.fromtimestamp(last_post.created)))

Let’s use some print commands to check what we have managed to collect from the subreddit search before continuing. Checking the first and last submissions’ titles and creation dates is a good way to ensure you’re not picking up irrelevant posts. For example, if I collected a submission from 2017, I’d want to review the code.

Collect the submissions’ data

def collectSubData(submission):
    post = r.submission(id=submission) #access the post based on its submission id
    subData = list() #list to store key data of the submission
    title = post.title
    url = post.url
    flair = post.link_flair_text #post.flair is a flair-editing helper; link_flair_text holds the flair text
    author = post.author
    unique = post.id
    score = post.score
    created = datetime.datetime.fromtimestamp(post.created) #convert the Unix timestamp to a readable datetime
    upratio = post.upvote_ratio
    topcommsCnt = len(post.comments) #top-level comments only
    allcommsCnt = len(post.comments.list()) #all comments currently loaded, including replies

    subData.append((unique,title,url,author,score,created,upratio,topcommsCnt,allcommsCnt,flair))
    sub_entries[unique] = subData

This function uses the provided submission id to collect the submission’s title, url, author, Reddit score (upvotes minus downvotes), upvote ratio, number of top-level comments (comments made directly on the post rather than replies), total number of comments and flair. It then assigns the data to a dictionary entry keyed by the submission’s ID.
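
One caveat on the comment counts: len(post.comments.list()) can include unresolved MoreComments placeholders rather than every actual comment. If you need an exact count, a small sketch (not part of the original function) using PRAW’s replace_more would be:

post = r.submission(id=subs[0])
post.comments.replace_more(limit=None)  # expand every "load more comments" placeholder; can be slow on large threads
exact_count = len(post.comments.list())
print(exact_count, post.num_comments)   # num_comments is Reddit's own reported total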

for submission in subs:
    collectSubData(submission)
print("Submissions have been collected")
print(str(len(sub_entries)) + " entries have been added to the dictionary")

Taking the submission ids collected from the previous subreddit search, we iterate through them to extract the data of each one. The print statements are not needed, but I like being notified when certain parts of my code have been executed.

Upload as CSV file

def updateSubs_file():
    upload_count = 0
    import csv
    location = "\\Reddit Extractions\\" #folder where the CSV will be saved; change this to suit your own setup
    filename = input() #don't forget to add .csv
    file = location + filename
    with open(file, 'w', newline='') as file:
        #if you encounter an encoding error, pass encoding="utf-8" to open()
        a = csv.writer(file, delimiter=',')
        headers = ["Post ID","Title","Url","Author","Score","Publish Date","Upvote Ratio","Total No. of Top Comments","Total No. of Comments","Flair"]
        a.writerow(headers)
        for sub in sub_entries:
            a.writerow(sub_entries[sub][0])
            upload_count += 1

    print(str(upload_count) + " submissions have been uploaded")
updateSubs_file()

This function lets you input a filename to store your submission data. Make sure the folder location is appropriate and that you add .csv to the end of your filename. It’s best to set user-friendly headers for your CSV file too.
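
To double-check the export worked, you can read the file straight back with pandas (the path and filename below are just examples; use whatever you entered when running the function):

import pandas as pd

df = pd.read_csv("\\Reddit Extractions\\e3_2018_submissions.csv")
print(df.shape)   # (number of submissions, number of columns)
print(df.head())  # preview the first few rows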

In Part 2, we will analyse and visualise the data from this CSV file with pandas.