Scraping Reddit in 2023

Arjhun S
5 min read · Sep 2, 2023


Reddit recently went through with a controversial decision to make its API paid. Several third-party apps and tools shut down as a result. But there are some ingenious workarounds for getting Reddit data for free without using PRAW.

I started by writing a Selenium script, and it worked pretty well. Because it drives a real browser, it behaves like a normal visitor on Reddit, and I saw no noticeable IP blacklisting.

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.safari.options import Options
from selenium.webdriver.safari.service import Service

subreddits = ['https://www.reddit.com/r/tech/top/?t=month']

class ScrapeReddit():
    def __init__(self):
        # start headless if you want later on.
        options = Options()
        self.driver = webdriver.Safari(service=Service(executable_path='/usr/bin/safaridriver'), options=options)

        self.postids = []

    def lazy_scroll(self):
        current_height = self.driver.execute_script('return document.body.scrollHeight')
        while True:
            self.driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            time.sleep(2)
            new_height = self.driver.execute_script('return document.body.scrollHeight')
            if new_height == current_height:  # this means we have reached the end of the page!
                html = self.driver.page_source
                break
            current_height = new_height
        return html

    def get_posts(self):
        for link in subreddits:
            self.driver.get(link)
            self.driver.maximize_window()
            time.sleep(5)
            html = self.lazy_scroll()
            parser = BeautifulSoup(html, 'html.parser')
            post_links = parser.find_all('a', {'slot': 'full-post-link'})
            print(len(post_links))
            count = 1

            for post_link in post_links:
                # the post id is the unique part of the post's URL
                post_id = post_link['href'].split('/')[-3]
                print(f"{count} - {post_id}")
                count += 1
                if post_id not in self.postids:
                    self.postids.append(post_id)

    def destroy(self):
        self.driver.close()

Let me explain what I’m doing here. I’m taking a subreddit (r/tech in this case) and modifying the link to retrieve the top posts from the last month. The lazy_scroll method keeps scrolling until the page height stops changing, which gets around Reddit’s lazy loading. Then I use BeautifulSoup to collect all the post links and, for each of them, extract the post id (these ids are generated by Reddit and are unique). You can scrape other content too, but that’s a simple extension to the script.
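
To run what we have so far, something along these lines will do. This is just a sketch of how the class gets called; the example id in the comment is the one from the post used later in this article.

# Rough usage of the class above (a sketch, not part of the class itself):
reddit = ScrapeReddit()
try:
    reddit.get_posts()       # scroll r/tech's top-of-the-month page and collect ids
    print(reddit.postids)    # e.g. ['15kg6u2', ...]
finally:
    reddit.destroy()         # always close the Safari session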

Now, with these post ids, you can just go to https://reddit.com/{post_id} and you get the post. You don’t really need to know the subreddit or the post title or anything. From there, you can scrape the post-specific content too, like the comments, replies and any media.

I was doing this for a project of mine and I really needed all the replies to be scraped. However, the “view more replies” button under each comment is really annoying, since you cannot predict how many of these buttons you’d have to click to get all the replies. I couldn’t find a usable workaround for this issue in the browser; any one method always fails for some weird, unexpected use case.

An interesting revelation

I was looking for ways to get the replies, which were crucial to my problem statement, and I found something really interesting.

Apparently, appending .json to any Reddit post URL gives you the entire metadata of the post. It was a weird revelation for me; this doesn’t really happen with any other social media site. Funnily enough, when I tried fetching the data for a bunch of Reddit posts, I started getting 429 errors (too many requests) at around 70–80 requests. So I tried a few generic techniques to avoid getting rate limited or blacklisted.

https://www.reddit.com/r/tech/comments/15kg6u2/us_scientists_repeat_fusion_ignition_breakthrough.json

I tried adding random sleep times between requests, and it pretty much works. The downside is the time it takes to scrape everything; you could do it faster if you have access to distributed servers or proxies.
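
If you don’t want to drive a browser for this part, the same idea also works with plain requests. The sketch below is full of my own assumptions rather than anything Reddit documents: the User-Agent string, the timeout, the 60-second back-off on a 429 and the 1–10 second sleeps are all values I picked, so tune them for your setup.

import random
import time

import requests

# Hypothetical helper: fetch the .json view of a post by its id.
def fetch_post_json(post_id, user_agent='my-reddit-scraper/0.1'):
    url = f'https://www.reddit.com/{post_id}.json'
    resp = requests.get(url, headers={'User-Agent': user_agent}, timeout=30)
    if resp.status_code == 429:
        # we hit the rate limit; back off once and retry
        time.sleep(60)
        resp = requests.get(url, headers={'User-Agent': user_agent}, timeout=30)
    resp.raise_for_status()
    return resp.json()

post_ids = ['15kg6u2']  # ids collected by get_posts()
results = []
for post_id in post_ids:
    results.append(fetch_post_json(post_id))
    time.sleep(random.randint(1, 10))  # random sleep between requests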

Now that we have the post ids ready, we can add these functions to the class we defined earlier.

    # these methods use "random", so add "import random" at the top of the file
    def get_data(self, postid):
        base_url = "https://reddit.com/"
        url = base_url + postid + ".json"
        self.driver.get(url)
        self.driver.maximize_window()
        html = self.driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
        text = soup.find('body').get_text()
        time.sleep(3)
        return text

    def get_post_details(self):
        jsons = []
        count = 1
        if not self.postids:
            print("No post ids found. Please run get_posts() first.")
            return
        for postid in self.postids:
            print(postid, count)
            text = self.get_data(postid)
            jsons.append(text)
            time.sleep(random.randint(1, 10))
            count += 1

        self.jsons = jsons
        return jsons

When you run get_post_details, it iterates through all your post ids and fetches the entire metadata for each post. It takes a while, though, because of the randomised sleep times needed to make sure our IP does not get blacklisted.

Now, the more important task is to get the data out of this JSON. It is pretty complicated and heavily nested, and it took me a while to get to a working script. I will include it here.
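
Before diving into the parsing code, it helps to know the rough shape of what the .json endpoint returns. The sketch below only shows the fields the function underneath actually reads; everything else Reddit sends back is omitted.

# Rough shape of the response from https://reddit.com/{post_id}.json
# (only the fields used by get_post_info below are shown):
#
# json_data = [
#     {  # element 0: a Listing wrapping the post itself
#         'data': {'children': [
#             {'data': {'title': ..., 'author': ..., 'created_utc': ...}}
#         ]}
#     },
#     {  # element 1: a Listing wrapping the top-level comments
#         'data': {'children': [
#             {'kind': 't1',
#              'data': {'body': ..., 'author': ..., 'created_utc': ...,
#                       'replies': ''}},  # empty string, or another nested Listing
#         ]}
#     },
# ]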


def get_post_info(json_data):
    """
    Gets the post body, all comments and their replies,
    the user IDs of the post, comments, and replies,
    and the timestamps of the post, comments, and replies
    from the JSON data.
    """

    post = json_data[0]['data']['children'][0]['data']
    post_body = post['title']  # note: using the title as the post body
    post_user = post['author']
    post_time = post['created_utc']

    comments = json_data[1]['data']['children']
    comments_list = []
    for comment in comments:
        # skip "more comments" placeholders, which carry no comment body
        if comment['kind'] != 't1':
            continue
        comment_body = comment['data']['body']
        comment_user = comment['data']['author']
        comment_time = comment['data']['created_utc']

        # append each reply to the comment it belongs to
        comment_replies = []
        if comment['data']['replies'] != '':
            replies = comment['data']['replies']['data']['children']
            for reply in replies:
                if reply['kind'] != 't1':
                    continue
                reply_body = reply['data']['body']
                reply_user = reply['data']['author']
                reply_time = reply['data']['created_utc']
                comment_replies.append({'body': reply_body,
                                        'user': reply_user,
                                        'time': reply_time})

        comments_list.append({'body': comment_body,
                              'user': comment_user,
                              'time': comment_time,
                              'replies': comment_replies})

    return {
        'post_body': post_body,
        'post_user': post_user,
        'post_time': post_time,
        'comments': comments_list,
    }

Please note that you have to loop through all the JSON documents you saved and call this function on each one. It’s better to save the results to a file.

import json
import os
from datetime import datetime

# "reddit" is the ScrapeReddit instance from earlier; get_post_details()
# stored the raw JSON strings in reddit.jsons.
data = reddit.jsons

res = []
for raw in data:
    try:
        parsed_json = json.loads(raw)
        res.append(get_post_info(parsed_json))
    except json.JSONDecodeError as e:
        print(e)
        continue

def save_to_json(data, subreddit):
    """Save the parsed posts as data/{subreddit}/{timestamp}.json"""
    timestamp = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
    filename = f'data/{subreddit}/{timestamp}.json'
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, 'w') as f:
        json.dump(data, f)

save_to_json(res, 'tech')  # the class doesn't store the subreddit name, so pass it explicitly

Works as expected.

There are other ways too. One such online tool is socialgrep (not advertising). It’s a pretty nifty tool for getting Reddit data, with filters that Reddit itself no longer supports. The catch is that it’s not entirely free: you can barely download 100 rows of data with a free account. I don’t believe in paying for data, but if you’re really desperate and have money to spend, it’s a quick way to get hold of some.

That’s all I got. Thanks for reading through.

Cheers.
