How to Scrape Reddit using pushshift.io via Python

  1. Most posts on this subject explain how to scrape Reddit and set up a developer account. These are great and I suggest reading them. I will not be showing how to do this.
import math
import json
import requests
import itertools
import numpy as np
import time
from datetime import datetime, timedelta
https://api.pushshift.io/reddit/search/submission?subreddit={}&after={}&before={}&size={}
  • URI template which takes in the subreddit being investigated, two EPOCH timestamps (without milliseconds), and the maximum number of records we would like back. As of this writing, 500 is the max for the size parameter. A short sketch after this list shows how the template expands.
  1. Add in logic to request more posts if 500 posts are returned from a previous request. We will pull the last created_utc timestamp prior to the next request. Just moving the needle…
  2. Create a method for building time-period search intervals. Example: break a long time period (1/1/2010–1/1/2018) into multiple shorter periods, where the individual periods span days or weeks instead of years.
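To make the template concrete, here is a minimal sketch of how a search window expands into a request URI. The subreddit and seven-day window below are just placeholders:
import math
from datetime import datetime, timedelta

URI_TEMPLATE = r'https://api.pushshift.io/reddit/search/submission?subreddit={}&after={}&before={}&size={}'

## pushshift expects whole-second EPOCH timestamps, so round off any milliseconds
end_at = math.ceil(datetime.utcnow().timestamp())
start_at = math.floor((datetime.utcnow() - timedelta(days=7)).timestamp())

## e.g. https://api.pushshift.io/reddit/search/submission?subreddit=Siacoin&after=...&before=...&size=500
print(URI_TEMPLATE.format('Siacoin', start_at, end_at, 500))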
def make_request(uri, max_retries = 5):
    def fire_away(uri):
        response = requests.get(uri)
        assert response.status_code == 200
        return json.loads(response.content)

    ## retry with a one second pause between attempts
    current_tries = 1
    while current_tries < max_retries:
        try:
            time.sleep(1)
            response = fire_away(uri)
            return response
        except:
            time.sleep(1)
            current_tries += 1

    ## final attempt; let any exception propagate
    return fire_away(uri)
  • If, for some reason, our request fails, wait a second and retry. We will try up to 5 times before giving up.
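As a quick sanity check, the helper can be exercised on its own before it is wired into the pagination logic below. This sketch assumes the imports at the top of the post and the make_request helper above; the subreddit and window are placeholders:
## placeholder window: the last 7 days of the placeholder subreddit
end_at = math.ceil(datetime.utcnow().timestamp())
start_at = math.floor((datetime.utcnow() - timedelta(days=7)).timestamp())

uri = ('https://api.pushshift.io/reddit/search/submission'
       '?subreddit={}&after={}&before={}&size={}').format('Siacoin', start_at, end_at, 500)

response = make_request(uri)

## the submissions come back under the 'data' key
print(len(response['data']))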
def pull_posts_for(subreddit, start_at, end_at):

    def map_posts(posts):
        return list(map(lambda post: {
            'id': post['id'],
            'created_utc': post['created_utc'],
            'prefix': 't4_'
        }, posts))

    SIZE = 500
    URI_TEMPLATE = r'https://api.pushshift.io/reddit/search/submission?subreddit={}&after={}&before={}&size={}'

    post_collections = map_posts( \
        make_request( \
            URI_TEMPLATE.format( \
                subreddit, start_at, end_at, SIZE))['data'])

    n = len(post_collections)
    while n == SIZE:
        ## back up 10 seconds from the last post we saw so boundary posts
        ## are not skipped (this overlap is what can produce duplicates)
        last = post_collections[-1]
        new_start_at = last['created_utc'] - (10)

        more_posts = map_posts( \
            make_request( \
                URI_TEMPLATE.format( \
                    subreddit, new_start_at, end_at, SIZE))['data'])

        n = len(more_posts)
        post_collections.extend(more_posts)

    return post_collections
subreddit = 'Siacoin'

end_at = math.ceil(datetime.utcnow().timestamp())
start_at = math.floor((datetime.utcnow() - \
    timedelta(days=365)).timestamp())

posts = pull_posts_for(subreddit, start_at, end_at)

## ~ 4314
print(len(posts))

## ~ 4306
print(len(np.unique([ post['id'] for post in posts ])))
def give_me_intervals(start_at, number_of_days_per_interval = 3):

    end_at = math.ceil(datetime.utcnow().timestamp())

    ## 1 day = 86400 seconds
    period = (86400 * number_of_days_per_interval)

    end = start_at + period
    yield (int(start_at), int(end))

    padding = 1
    while end <= end_at:
        start_at = end + padding
        end = (start_at - padding) + period
        yield int(start_at), int(end)
## test out the solution,
start_at = math.floor(\
    (datetime.utcnow() - timedelta(days=365)).timestamp())

list(give_me_intervals(start_at, 7))

[
    (1509816061, 1510420861),
    (1510420862, 1511025661),
    (1511025662, 1511630461),
    (1511630462, 1512235261),
    ...
]
subreddit = 'Siacoin'
start_at = math.floor(\
    (datetime.utcnow() - timedelta(days=365)).timestamp())

posts = []
for interval in give_me_intervals(start_at, 7):
    pulled_posts = pull_posts_for(
        subreddit, interval[0], interval[1])

    posts.extend(pulled_posts)
    time.sleep(.500)

## ~ 4306
print(len(posts))

## ~ 4306
print(len(np.unique([ post['id'] for post in posts ])))
  • Warning: we may still get duplicates out of this, because of how we request more posts whenever an interval returns 500 of them.
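One way to handle that is to de-duplicate on the post id before doing anything else with the results. A minimal sketch:
## collapse duplicate submissions, keeping one entry per post id
unique_posts = list({ post['id']: post for post in posts }.values())

print(len(posts), len(unique_posts))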
import praw

config = {
    "username" : "*",
    "client_id" : "*",
    "client_secret" : "*",
    "user_agent" : "*"
}

reddit = praw.Reddit(client_id = config['client_id'], \
    client_secret = config['client_secret'], \
    user_agent = config['user_agent'])
  • Warning: If you are not set up correctly, you will receive a status code of 401.
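A quick way to confirm the credentials work before kicking off the full loop is to fetch a single submission. A minimal sketch, assuming the posts list from earlier is available (any pushshift id will do):
## smoke test: an authentication problem will surface here as a 401 error
test_id = posts[0]['id']
print(reddit.submission(id=test_id).title)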
## WARNING: REDDIT WILL THROTTLE YOU IF YOU ARE ANNOYING! BE KIND!
TIMEOUT_AFTER_COMMENT_IN_SECS = .350

posts_from_reddit = []
comments_from_reddit = []

for submission_id in np.unique([ post['id'] for post in posts ]):
    submission = reddit.submission(id=submission_id)
    posts_from_reddit.append(submission)

    ## expand every "load more comments" link so we get the full comment tree
    submission.comments.replace_more(limit=None)
    for comment in submission.comments.list():
        comments_from_reddit.append(comment)

    if TIMEOUT_AFTER_COMMENT_IN_SECS > 0:
        time.sleep(TIMEOUT_AFTER_COMMENT_IN_SECS)
## ~ 4306
print(len(posts_from_reddit))
## ~ 35216
print(len(comments_from_reddit))
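The loop above collects PRAW objects; for downstream analysis it can help to flatten them into plain dictionaries first. A minimal sketch, where the field selection is just one reasonable choice:
## pull a few commonly used fields off the PRAW objects into plain dicts
post_rows = [{
    'id': s.id,
    'created_utc': s.created_utc,
    'title': s.title,
    'score': s.score,
    'num_comments': s.num_comments
} for s in posts_from_reddit]

comment_rows = [{
    'id': c.id,
    'link_id': c.link_id,
    'created_utc': c.created_utc,
    'body': c.body,
    'score': c.score
} for c in comments_from_reddit]

print(len(post_rows), len(comment_rows))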
