Using Pushshift’s API to extract Reddit Submissions

PRAW is the main Python package for accessing the Reddit API and extracting data from the site. However, it has a few limitations, including no direct way to extract submissions between specific dates. This inconvenience led me to Pushshift’s API for accessing Reddit’s data. In this article we will quickly go over how to extract data on post submissions in only a few lines of code.

Import the relevant modules

import pandas as pd
import requests
import json
import csv
import time
import datetime

Build your Pushshift URL

We can access the Pushshift API by building a URL with the relevant parameters, without even needing Reddit credentials.

For example:

Without parameters, this is the foundation of the URL you’ll use to access Reddit: https://api.pushshift.io/reddit/search/

Now with parameters, we will access the PS4 subreddit between two dates (written as unix timestamps) and search for all submissions that contain the keyword “screenshot”: https://api.pushshift.io/reddit/search/submission/?q=screenshot&after=1514764800&before=1517443200&subreddit=PS4

All of this is thoroughly explained at https://github.com/pushshift/api.
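
If you would rather not concatenate the query string by hand, the requests module can assemble the same URL from a dictionary of parameters. A minimal sketch (the values are the ones from the example above):

import requests

base = 'https://api.pushshift.io/reddit/search/submission/'
params = {
    'q': 'screenshot',      # keyword to search for
    'after': 1514764800,    # start of the window (unix timestamp)
    'before': 1517443200,   # end of the window (unix timestamp)
    'subreddit': 'PS4'      # restrict results to one subreddit
}
resp = requests.get(base, params=params)
print(resp.url)  # the fully assembled Pushshift URL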

What does the data look like?

The URL returns a JSON page of our results. Click the Pushshift URL with the parameters above to see for yourself.

You should see a page filled with JSON objects that look something like this:

"""{ “data”: [ { “author”: “[deleted]”, “author_flair_css_class”: null, “author_flair_text”: null, “brand_safe”: true, “can_mod_post”: false, “contest_mode”: false, “created_utc”: 1514778123, “domain”: “i.redd.it”, “full_link”: “https://www.reddit.com/r/PS4/comments/7nd12w/screenshot_finally_its_taken_so_long_i_cant_wait/", “id”: “7nd12w”, “is_crosspostable”: false, “is_reddit_media_domain”: true, “is_self”: false, “is_video”: false, “link_flair_css_class”: “media”, “link_flair_text”: “[Screenshot]”, “locked”: false, “num_comments”: 1, “num_crossposts”: 0, “over_18”: false, “parent_whitelist_status”: “all_ads”, “permalink”: “/r/PS4/comments/7nd12w/screenshot_finally_its_taken_so_long_i_cant_wait/”, “pinned”: false, “retrieved_on”: 1514850534, “score”: 2, “selftext”: “[deleted]”, “spoiler”: false, “stickied”: false, “subreddit”: “PS4”, “subreddit_id”: “t5_2rrlp”, “subreddit_type”: “public”, “thumbnail”: “default”, “thumbnail_height”: 78, “thumbnail_width”: 140, “title”: “[Screenshot] Finally, it\u2019s taken so long. I can\u2019t wait for the premium Bloodborne theme to show up in my Email.”, “url”: “https://i.redd.it/2k02hx4knd701.jpg", “whitelist_status”: “all_ads” },
{ “author”: “awesome2noah”, “author_flair_css_class”: null, “author_flair_text”: null, “brand_safe”: true,"""

In Python, JSON objects translate to dictionaries. Here the whole response is held under the dictionary key “data”, whose value is a list of nested dictionaries like the ones above. The keys we care about most, such as “author” and “permalink”, are accessed like so:

data["data"][0]["author"]
data["data"][0]["permalink"]
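
For example, a minimal sketch that fetches the page above and prints those two fields for every returned submission:

import json
import requests

url = ('https://api.pushshift.io/reddit/search/submission/'
       '?q=screenshot&after=1514764800&before=1517443200&subreddit=PS4')
r = requests.get(url)
data = json.loads(r.text)  # one dictionary, with the results under 'data'

for post in data['data']:
    print(post['author'], post['permalink'])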

Build a function that builds Pushshift URLs

In this example, we will be using these parameters:

  • size — increase the limit of returned entries to 1000
  • after — where to start the search
  • before — where to end the search
  • title — to search only within the submission’s title
  • subreddit — to narrow it down to a particular subreddit

The code here is an adapted version of this snippet on GitHub.

def getPushshiftData(query, after, before, sub):
    # Build the Pushshift URL from the submitted parameters
    url = 'https://api.pushshift.io/reddit/search/submission/?title='+str(query)+'&size=1000&after='+str(after)+'&before='+str(before)+'&subreddit='+str(sub)
    print(url)
    r = requests.get(url)
    data = json.loads(r.text)
    return data['data']

This function builds the URL from the submitted parameters, then prints it to the console. The requests module fetches the page, and the json module parses the response text into a structure we can manipulate through Python.

With the before and after parameters, dates must be written as unix timestamps, e.g. 1514850534.

This website is good for creating timestamps compatible with this task: https://www.unixtimestamp.com/index.php
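
Alternatively, the datetime module we imported earlier can produce these timestamps directly; a small sketch:

import datetime

# Midnight UTC on 1 January 2018 as a unix timestamp
start = int(datetime.datetime(2018, 1, 1,
                              tzinfo=datetime.timezone.utc).timestamp())
print(start)  # 1514764800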

Build a function to extract key data points

Once we get our search results, we want key data for further analysis including: Submission Title, URL, Flair, Author, Submission post ID, Score, Upload Time, No. of Comments, Permalink.

def collectSubData(subm):
    subData = list()  # list to store data points
    title = subm['title']
    url = subm['url']
    try:
        flair = subm['link_flair_text']
    except KeyError:
        flair = "NaN"
    author = subm['author']
    sub_id = subm['id']
    score = subm['score']
    created = datetime.datetime.fromtimestamp(subm['created_utc'])  # e.g. 1520561700.0
    numComms = subm['num_comments']
    permalink = subm['permalink']

    subData.append((sub_id, title, url, author, score, created, numComms, permalink, flair))
    subStats[sub_id] = subData

This function extracts the key data points from each JSON result. Note that not all posts come with a flair, hence we wrap that lookup in a try/except clause. subData is created at the start to hold the data points, which are then added as a tuple to our global subStats dictionary.
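
As a quick sanity check, here is how the two functions fit together for a single page of results (this assumes subStats has already been initialised as an empty dictionary, which we do in the next step):

subStats = {}
sample = getPushshiftData('Screenshot', '1514764800', '1538352000', 'PS4')
collectSubData(sample[0])
print(list(subStats.values())[0])  # a list holding one tuple of nine data points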

Where and what data will we be storing?

#Subreddit to query
sub = 'PS4'
#before and after dates (unix timestamps)
before = "1538352000"  # 1st October 2018
after = "1514764800"   # 1st January 2018
query = "Screenshot"
subCount = 0
subStats = {}

In this example we are looking for all submissions with “Screenshot” in their title on the PS4 subreddit between 1st Jan ’18 and 1st Oct ’18. subCount tracks the total number of submissions we collect, and subStats is the dictionary where we will store our data.

Run code and loop until all submissions are collected

We will run getPushshiftData once, then run a while loop that continues until the function returns 0 results. This loop is needed because each call returns at most the size limit of entries, so we page through the results by moving the after parameter forward.

data = getPushshiftData(query, after, before, sub)
# Will run until all posts have been gathered
# from the 'after' date up until the 'before' date
while len(data) > 0:
    for submission in data:
        collectSubData(submission)
        subCount += 1
    # Show progress: how many entries this page returned and the
    # creation date of the last submission collected
    print(len(data))
    print(str(datetime.datetime.fromtimestamp(data[-1]['created_utc'])))
    # Call getPushshiftData() again with the created date of the last submission
    after = data[-1]['created_utc']
    data = getPushshiftData(query, after, before, sub)

print(len(data))

This code above:

  • Runs the function once, returning up to the size limit of entries.
  • Takes the final entry’s creation date and updates the after parameter.
  • Reruns until the function returns nothing, i.e. until len(data) is 0.
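
One aside: the time module imported at the top hasn’t been used; if you find your calls being throttled, a short pause between requests is an easy courtesy. For example, inside the while loop:

import time

# Just before the next getPushshiftData() call:
time.sleep(1)  # wait a second between requests to go easy on the server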

Check submissions

Once we have our data we can write it to a CSV for further analysis, but first we check what we have gathered, including:

  • Total no. of submissions
  • First and last entries’ titles (to check the search criteria)
  • First and last entries’ creation times (to check the timeframe)

Here, we turn the dictionary values into a list, pick an entry by position ([0] for the first, [-1] for the last), take that entry’s stored list (always [0], since each submission holds a single tuple), then index into the tuple for the value we need ([1] for the title, [5] for the creation time, and so on up to [8]).

print(str(len(subStats)) + " submissions have been added to the list")
print("1st entry is:")
print(list(subStats.values())[0][0][1] + " created: " + str(list(subStats.values())[0][0][5]))
print("Last entry is:")
print(list(subStats.values())[-1][0][1] + " created: " + str(list(subStats.values())[-1][0][5]))

We get something like this displayed in the console:

5605 submissions have been added to the list
1st entry is:
[Screenshot] Finally, it’s taken so long. I can’t wait for the premium Bloodborne theme to show up in my Email. created: 2018-01-01 03:42:03
Last entry is:
[Spider-Man] [Screenshot] Platinum Achieved - I <3 Manhattan created: 2018-09-30 18:41:10

Upload to CSV file

def updateSubs_file():
    upload_count = 0
    location = "\\Reddit Data\\"
    print("input filename of submission file, please add .csv")
    filename = input()
    file = location + filename
    with open(file, 'w', newline='', encoding='utf-8') as outfile:
        a = csv.writer(outfile, delimiter=',')
        headers = ["Post ID", "Title", "Url", "Author", "Score", "Publish Date", "Total No. of Comments", "Permalink", "Flair"]
        a.writerow(headers)
        for sub in subStats:
            a.writerow(subStats[sub][0])
            upload_count += 1
        print(str(upload_count) + " submissions have been uploaded")

updateSubs_file()

If you haven’t noticed, I like printing updates to myself to track what my code is doing :). Here we simply build the file path from user input (note that location is a Windows-style path), then write the data to that file.
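
As a final note, the pandas import from the top of the article comes in handy for reading the CSV back in for analysis. A quick sketch (the filename here is just a stand-in for whatever you typed above):

import pandas as pd

df = pd.read_csv('\\Reddit Data\\my_submissions.csv')  # hypothetical filename
print(df['Score'].describe())             # distribution of scores
print(df['Flair'].value_counts().head())  # most common flairs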

That’s all for now. This code is easily adaptable to whatever other purposes you’d like to use the Pushshift API for.

Thanks for reading, any feedback is welcome.