In early 2018, Reddit made some tweaks to its API that closed off a previous method for pulling an entire subreddit. Luckily, pushshift.io exists. For my needs, I decided to use pushshift to pull all available posts for a specific time period. Once obtained, I wanted to go back to Reddit and pull each individual submission along with all of its comments. The last step of pulling from Reddit may not always be necessary; it depends on your individual needs.
Warning:
- Most posts on this subject explain how to scrape Reddit and set up a developer account. They are great and I suggest reading them. I will not be showing how to do that here.
import math
import json
import requests
import itertools
import numpy as np
import time
from datetime import datetime, timedelta
Step 1: Making a request to pushshift.
https://api.pushshift.io/reddit/search/submission?subreddit={}&after={}&before={}&size={}
- The URI template takes in the subreddit being investigated, two epoch timestamps (without milliseconds), and the maximum number of records we would like back. As of this writing, 500 is the max for the size parameter.
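For instance, filling the template with a subreddit, two placeholder epoch timestamps, and the max size produces a request URI like the one below (the timestamps here are arbitrary examples, not values from a real run):

## example only: the timestamps below are arbitrary placeholders
uri = 'https://api.pushshift.io/reddit/search/submission?subreddit={}&after={}&before={}&size={}'.format(
    'Siacoin', 1514764800, 1517443200, 500)

print(uri)
## https://api.pushshift.io/reddit/search/submission?subreddit=Siacoin&after=1514764800&before=1517443200&size=500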
How can we ensure we are getting everything for the specified time period?
- Add in logic to request more posts if 500 posts are returned from a previous request. We will pull the last created_utc timestamp from that batch and use it as the start of the next request. Just moving the needle…
- Create a method for building time period search intervals. Example: shrink a long time period (1/1/2010–1/1/2018) into multiple, shorter time periods, where each individual period spans days or weeks instead of years.
Solving problem 1: “Add in logic to request more posts…”
First, we need a method that takes in a URI and handles the HTTP request/response.
def make_request(uri, max_retries = 5):
    def fire_away(uri):
        response = requests.get(uri)
        assert response.status_code == 200
        return json.loads(response.content)

    current_tries = 1
    while current_tries < max_retries:
        try:
            time.sleep(1)
            response = fire_away(uri)
            return response
        except:
            time.sleep(1)
            current_tries += 1

    return fire_away(uri)
- If, for some reason, our request fails, wait a second and retry. We will try that 5 times before giving up.
Second, tie in our previous method and pull posts for our interval. If we get a full 500 back, recheck to see if more exist.
def pull_posts_for(subreddit, start_at, end_at):

    def map_posts(posts):
        return list(map(lambda post: {
            'id': post['id'],
            'created_utc': post['created_utc'],
            'prefix': 't4_'
        }, posts))

    SIZE = 500
    URI_TEMPLATE = r'https://api.pushshift.io/reddit/search/submission?subreddit={}&after={}&before={}&size={}'

    post_collections = map_posts( \
        make_request( \
            URI_TEMPLATE.format( \
                subreddit, start_at, end_at, SIZE))['data'])

    n = len(post_collections)
    while n == SIZE:
        last = post_collections[-1]
        new_start_at = last['created_utc'] - (10)

        more_posts = map_posts( \
            make_request( \
                URI_TEMPLATE.format( \
                    subreddit, new_start_at, end_at, SIZE))['data'])

        n = len(more_posts)
        post_collections.extend(more_posts)

    return post_collections
Warning: we will get duplicates due to the way it is written.
Lastly, putting it all together to pull posts for the past year from r/Siacoin. Keep in mind, your results will differ depending on what time of day you run this.
subreddit = 'Siacoin'

end_at = math.ceil(datetime.utcnow().timestamp())
start_at = math.floor((datetime.utcnow() - \
    timedelta(days=365)).timestamp())

posts = pull_posts_for(subreddit, start_at, end_at)

## ~ 4314
print(len(posts))

## ~ 4306
print(len(np.unique([ post['id'] for post in posts ])))
Solving problem 2: “Create a method for building time period search intervals…”
def give_me_intervals(start_at, number_of_days_per_interval = 3):
    end_at = math.ceil(datetime.utcnow().timestamp())

    ## 1 day = 86400 seconds
    period = (86400 * number_of_days_per_interval)

    end = start_at + period
    yield (int(start_at), int(end))

    padding = 1
    while end <= end_at:
        start_at = end + padding
        end = (start_at - padding) + period
        yield int(start_at), int(end)

## test out the solution
start_at = math.floor(\
    (datetime.utcnow() - timedelta(days=365)).timestamp())

list(give_me_intervals(start_at, 7))

[
    (1509816061, 1510420861),
    (1510420862, 1511025661),
    (1511025662, 1511630461),
    (1511630462, 1512235261),
    ...
]
Lastly, putting it all together and building a final solution.
subreddit = 'Siacoin'

start_at = math.floor(\
    (datetime.utcnow() - timedelta(days=365)).timestamp())

posts = []
for interval in give_me_intervals(start_at, 7):
    pulled_posts = pull_posts_for(
        subreddit, interval[0], interval[1])

    posts.extend(pulled_posts)
    time.sleep(.500)

## ~ 4306
print(len(posts))

## ~ 4306
print(len(np.unique([ post['id'] for post in posts ])))
- Warning: we may still get duplicates out of this. This is because of how we request more posts whenever an interval produces 500 results. See the sketch below for one way to deal with it.
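If the duplicates bother you at this point, one option (my own addition, not part of the original walkthrough) is to collapse the list down to unique posts by id before moving on:

## optional: drop duplicate posts by id, keeping the first occurrence
seen_ids = set()
unique_posts = []
for post in posts:
    if post['id'] not in seen_ids:
        seen_ids.add(post['id'])
        unique_posts.append(post)

## should match the np.unique count above, ~ 4306
print(len(unique_posts))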
Hopefully, that all makes sense. We will now move on to taking those submission ids and pulling each post and its comments from Reddit. At this point, set up a developer account if you have not already done so. (Google: “reddit developer account”)
import praw

config = {
    "username" : "*",
    "client_id" : "*",
    "client_secret" : "*",
    "user_agent" : "*"
}

reddit = praw.Reddit(client_id = config['client_id'], \
                     client_secret = config['client_secret'], \
                     user_agent = config['user_agent'])
- Warning: If you are not set up correctly, you will receive a status code of 401.
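A quick way to surface that 401 early (this check is my own addition, not part of the original flow) is to force PRAW to authenticate with a trivial read:

## sanity check: any read forces authentication, so bad credentials
## will show up here as a 401 before the main loop starts
try:
    print(next(reddit.subreddit('Siacoin').new(limit=1)).title)
except Exception as e:
    print('credential problem?', e)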
Finally, call Reddit and store our posts and comments. A notebook walking through all of this can be found here.
## WARNING: REDDIT WILL THROTTLE YOU IF YOU ARE ANNOYING! BE KIND!
TIMEOUT_AFTER_COMMENT_IN_SECS = .350

posts_from_reddit = []
comments_from_reddit = []

for submission_id in np.unique([ post['id'] for post in posts ]):
    submission = reddit.submission(id=submission_id)

    posts_from_reddit.append(submission)

    submission.comments.replace_more(limit=None)
    for comment in submission.comments.list():
        comments_from_reddit.append(comment)

        if TIMEOUT_AFTER_COMMENT_IN_SECS > 0:
            time.sleep(TIMEOUT_AFTER_COMMENT_IN_SECS)

## ~ 4306
print(len(posts_from_reddit))

## ~ 35216
print(len(comments_from_reddit))
Bonus: Luigi tasks to process a subreddit and split the posts and comments out into a file per day, basically the activity per day. See the GitHub scrappers and pipelines repositories.
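For reference, the core of that per-day split can be approximated without Luigi. The sketch below groups the PRAW objects we already collected by their created_utc day and writes one JSON file per day; the file naming and the fields kept are my own illustrative choices, not what the linked repositories do.

from collections import defaultdict

## group activity by UTC calendar day
activity_per_day = defaultdict(lambda: {'posts': [], 'comments': []})

for submission in posts_from_reddit:
    day = datetime.utcfromtimestamp(submission.created_utc).strftime('%Y-%m-%d')
    activity_per_day[day]['posts'].append({
        'id': submission.id,
        'title': submission.title,
        'created_utc': submission.created_utc
    })

for comment in comments_from_reddit:
    day = datetime.utcfromtimestamp(comment.created_utc).strftime('%Y-%m-%d')
    activity_per_day[day]['comments'].append({
        'id': comment.id,
        'body': comment.body,
        'created_utc': comment.created_utc
    })

## one file per day, e.g. Siacoin_2018-03-01.json
for day, activity in activity_per_day.items():
    with open('{}_{}.json'.format(subreddit, day), 'w') as f:
        json.dump(activity, f)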