Using APIs in Python for Data Collection — Scraping Reddit (Part 2)
This is part 2 of Using APIs in Python for Data Collection. In this tutorial we’ll look at using the PRAW package to scrape Reddit, along with two use cases relevant to social science and Human-Computer Interaction research. We’ll cover the basics of setting up Python and PRAW to collect posts and comments from a specific subreddit. Then we’ll look at two use cases:
- Sentiment Analysis: determining how positive or negative the language in a Reddit post or comment is.
- Tracking Specific Topics: after collecting text data, you can use keywords to see how different communities discuss specific topics. For example, how the subreddit r/COVID19_support discusses mental health.
We chose Reddit as our exemplar social media site since there is a vast amount of posts, comments, and interactions, and all of it is quite accessible! This lets us social scientists and HCI researchers study the variety of novel communities that form online. Let’s learn how to access it together :D
Disclaimer: If you aren’t familiar with APIs and how and why they work, you can check out our previous article first.
The Basics of Data Collection with Python PRAW
In this section we’ll go over all the basic setup required to get PRAW and your Reddit scraper working.
We’ll cover:
- Setting up your development environment
- Creating a reddit app through which you’ll run your bot
- Scraping posts from a specific subreddit
- Scraping comments from a specific post
These 4 steps will give you access to thousands of posts from different communities around the world. This forms the basis of the more advanced use-cases we’ll cover after, so it’ll be good to get the basics down first!
Setting up the environment
Before we begin, ensure that you have Python installed on your computer. You can download it from python.org. Next, we need to install the PRAW package. You can do this by running the following command in your terminal or command prompt:
pip install praw
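If you want to double-check that the install worked, you can print PRAW’s version from the command line:
python -c "import praw; print(praw.__version__)"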
Creating a Reddit App
To use the PRAW package, you will need to create a Reddit App to obtain the necessary credentials. A Reddit app is any program that interacts with the Reddit API. We need to register our app with Reddit to get permission to use the Reddit API by following these steps:
- Log in to your Reddit account.
- Go to the App Preferences page.
- Scroll down to the “Developed Applications” section and click on the “Create App” or “Create Another App” button.
- Fill out the form as follows:
* “name”: Choose a name for your app.
* “App type”: Select “script.”
* “description”: Provide a brief description of your app (optional).
* “about url”: Leave this field blank.
* “redirect uri”: Enter “http://localhost:8080” (without quotes).
* “permissions”: Leave the default selection.
- Click “Create app” to complete the process.
After creating the app, you will see a “client ID” and “client secret” on the app page. Take note of these values, as we will need them in our code.
Collecting Reddit data using PRAW
Now that we have our Reddit App set up, let’s start collecting data from Reddit using the PRAW package. We will start by importing the necessary libraries and initializing the Reddit instance with our credentials.
import praw
# Replace the following placeholders with your Reddit App credentials
client_id = "your_client_id"
client_secret = "your_client_secret"
user_agent = "your_user_agent"
reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)
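One practical note: hard-coding credentials is fine for a quick local script, but if you plan to share your code, a common pattern is to read them from environment variables instead so they never end up in version control. Here’s a minimal sketch (the variable names REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, and REDDIT_USER_AGENT are our own convention, not a PRAW requirement):
import os
import praw

# Read credentials from environment variables rather than hard-coding them
# (these variable names are our own convention)
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent=os.environ["REDDIT_USER_AGENT"],
)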
With the Reddit instance initialized, we can now start gathering data. For example, let’s say we want to collect the top 10 posts from the ‘AskReddit’ subreddit.
# Collect the top 10 posts of the subreddit
# We start by choosing the subreddit we want to interact with - in this case, it's "AskReddit"
subreddit = reddit.subreddit("AskReddit")
# We then use the "top" method of the "subreddit" object to get the top posts in the subreddit, and limit it to 10 using the "limit" parameter
top_posts = subreddit.top(limit=10)
# We then use a "for" loop to iterate over each post in "top_posts"
for post in top_posts:
    # For each post, we print its title using the "title" attribute of the "post" object
    print(post.title)
The above code snippet accesses the ‘AskReddit’ subreddit, retrieves the top 10 posts, and prints their titles. You can replace “AskReddit” with any other subreddit of your choice.
Collecting Comments
Now, let’s say we want to collect the comments from each of those posts. We can add on to our previous loop.
# We choose the subreddit we want to interact with
subreddit = reddit.subreddit("AskReddit")
# We get the top 10 posts in the subreddit
top_posts = subreddit.top(limit=10)
# We iterate over each post in the top_posts list
for post in top_posts:
    # We print the title of the post
    print(post.title)
    # Replace "load more comments" placeholders so every item we touch is a real comment
    post.comments.replace_more(limit=0)
    # We iterate over the first 10 comments in the post
    for comment in post.comments[:10]:
        # We print the body of the comment
        print(comment.body)
The above snippet again walks the top 10 posts, this time printing the first 10 comments under each title. One detail worth noting: post.comments can contain “load more comments” placeholders, so we call replace_more(limit=0) first to swap those out before reading comment bodies. If you instead want the comments from one specific post, PRAW can look a submission up by its ID, as sketched below (replace “your_post_id” with the ID from the post’s URL):
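# A minimal sketch: fetch a single post by its ID
# (the ID is the string after /comments/ in the post's URL)
submission = reddit.submission(id="your_post_id")
# Flatten "load more comments" placeholders so every item is a real comment
submission.comments.replace_more(limit=0)
# Print the body of every comment in the thread
for comment in submission.comments.list():
    print(comment.body)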
Putting it All Together
Now we can put all that together. We’ll also add some code to write the comment bodies and their associated post titles to a CSV file at the end. That way you’ll still have access to the comment data after the program closes.
import praw
import csv

# Create a Reddit instance with your authentication credentials
reddit = praw.Reddit(client_id='YOUR_CLIENT_ID', client_secret='YOUR_CLIENT_SECRET', user_agent='YOUR_USER_AGENT')
# Choose the subreddit you want to interact with
subreddit = reddit.subreddit("AskReddit")
# Get the top 10 posts in the subreddit
top_posts = subreddit.top(limit=10)
# Create a new CSV file to write the data to
with open("reddit_data.csv", mode="w", newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Post Title", "Comment Body"])
    # Iterate over each post in the top_posts list
    for post in top_posts:
        # Get the title of the post
        post_title = post.title
        # Replace "load more comments" placeholders with real comments
        post.comments.replace_more(limit=0)
        # Iterate over the first 10 comments in the post
        for comment in post.comments[:10]:
            # Get the body of the comment
            comment_body = comment.body
            # Write the post title and comment body to the CSV file
            writer.writerow([post_title, comment_body])
After this, you’ll have a CSV file filled with comments that can be analyzed for a variety of research questions.
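From here you can open the file in Excel or Google Sheets, or pull it straight back into Python. A quick sketch using pandas (a separate install: pip install pandas):
import pandas as pd

# Load the comments we saved above
df = pd.read_csv("reddit_data.csv")
# Peek at the first few rows and see how many comments we collected per post
print(df.head())
print(df["Post Title"].value_counts())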
Advanced Tools
Chances are you’ve got a more interesting question in mind than “what are the top 10 posts in r/askreddit?” To answer more precise questions, we’ll need more specific data. Let’s take a look at two ways to collect more interesting data:
- Sentiment Analysis
- Keyword identification
Analyzing Sentiment in a Subreddit
As a social scientist, you might be interested in understanding the general sentiment of a subreddit. For instance, you may want to explore the sentiment of posts in the “r/COVID19_support” subreddit to see how people are coping with the pandemic.
Setup
Let’s start by installing some new packages to help us with sentiment analysis.
pip install textblob
pip install nltk
Then we’ll download the corpora TextBlob needs for its analysis. TextBlob ships with a downloader script that fetches everything at once; run it from your terminal or command prompt:
python -m textblob.download_corpora
Next we’ll set PRAW to look at the r/COVID19_support subreddit and collect the top 100 posts.
# Initial setup of the PRAW package to gather from the COVID19_support subreddit
import praw

# Create a Reddit instance with your authentication credentials
reddit = praw.Reddit(client_id='YOUR_CLIENT_ID', client_secret='YOUR_CLIENT_SECRET', user_agent='YOUR_USER_AGENT')
# Choose the subreddit you want to interact with
subreddit = reddit.subreddit("COVID19_support")
# Get the top 100 posts in the subreddit
top_posts = subreddit.top(limit=100)
Collect the textual data from the posts
Next we’ll use our comment collection process from earlier to get a corpus of comments. We’ll be using Python dictionaries to organize our data. If you’re not familiar with those, let us know in the comments and we’ll do a follow-up post on common data types in Python :D
# Create an empty list to store the data dictionaries
data = []
# Iterate over each post in the top_posts list
for post in top_posts:
    # Get the title of the post
    post_title = post.title
    # Replace "load more comments" placeholders with real comments
    post.comments.replace_more(limit=0)
    # Iterate over the first 10 comments in the post
    for comment in post.comments[:10]:
        # Get the body of the comment
        comment_body = comment.body
        # Get the number of upvotes for the comment
        upvotes = comment.score
        # Create a data dictionary with the post title, comment body, and number of upvotes
        data_dict = {
            "Post Title": post_title,
            "Comment Body": comment_body,
            "Upvotes": upvotes,
        }
        # Append the data dictionary to the list
        data.append(data_dict)
# Print the data list to verify that it worked
print(data)
Now we have code that collects the top 100 posts from the r/COVID19_support subreddit and the first 10 comments on each post. For each comment we also record the text and the number of upvotes.
Perform sentiment analysis on the collected data
Lastly, we’ll add in the sentiment analysis portion. This is a really small change to the existing code: we import TextBlob (from textblob import TextBlob) and add one line inside the comment loop:
sentiment_score = TextBlob(comment_body).sentiment.polarity
This gives us a polarity score between -1.0 and 1.0: a positive number if the language used in a comment was positive, and a negative value if the language was negative.
With sentiment analysis we can quickly get an idea of how a community views particular topics or if a person is using emotionally charged language when conversing online.
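To get a feel for the scores before running them on real data, you can try TextBlob on a couple of made-up sentences:
from textblob import TextBlob

# Polarity ranges from -1.0 (most negative) to 1.0 (most positive)
print(TextBlob("I love this community, everyone is so supportive!").sentiment.polarity)  # should be clearly positive
print(TextBlob("This week has been awful and exhausting.").sentiment.polarity)  # should be clearly negative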
Putting it All Together
Here is the code all put together. First it creates a Reddit instance with PRAW, then selects a subreddit, grabs the top 100 posts, and for each post collects the first 10 comments along with their text, number of upvotes, and sentiment score.
import praw
from textblob import TextBlob

# Create a Reddit instance with your authentication credentials
reddit = praw.Reddit(client_id='YOUR_CLIENT_ID', client_secret='YOUR_CLIENT_SECRET', user_agent='YOUR_USER_AGENT')
# Choose the subreddit you want to interact with
subreddit = reddit.subreddit("COVID19_support")
# Get the top 100 posts in the subreddit
top_posts = subreddit.top(limit=100)
# Create an empty list to store the data dictionaries
data = []
# Iterate over each post in the top_posts list
for post in top_posts:
    # Get the title of the post
    post_title = post.title
    # Replace "load more comments" placeholders with real comments
    post.comments.replace_more(limit=0)
    # Iterate over the first 10 comments in the post
    for comment in post.comments[:10]:
        # Get the body of the comment
        comment_body = comment.body
        # Get the number of upvotes for the comment
        upvotes = comment.score
        # Perform sentiment analysis on the comment using TextBlob
        sentiment_score = TextBlob(comment_body).sentiment.polarity
        # Create a data dictionary with the post title, comment body, number of upvotes, and sentiment score
        data_dict = {
            "Post Title": post_title,
            "Comment Body": comment_body,
            "Upvotes": upvotes,
            "Sentiment Score": sentiment_score
        }
        # Append the data dictionary to the list
        data.append(data_dict)
# Print the data list to verify that it worked
print(data)
Tracking Discussions about Specific Topics
You can track discussions about a specific topic by searching for keywords in comments within a given subreddit. For instance, you may be interested in the differences between discussions on mental health in conservative and liberal online communities. Let’s take a look at how we might do that.
import praw
import csv
from textblob import TextBlob

# Create a Reddit instance with your authentication credentials
reddit = praw.Reddit(client_id='YOUR_CLIENT_ID', client_secret='YOUR_CLIENT_SECRET', user_agent='YOUR_USER_AGENT')
# Create a list of subreddits to collect data from
subreddits = ["conservative", "liberal"]
# Define the keywords to search for in the comments
keywords = ["mental health", "depression", "anxiety"]
# Create an empty list to store the data dictionaries
data = []
# Iterate over each subreddit in the subreddits list
for subreddit_name in subreddits:
    # Choose the subreddit you want to interact with
    subreddit = reddit.subreddit(subreddit_name)
    # Get the top 100 posts in the subreddit
    top_posts = subreddit.top(limit=100)
    # Iterate over each post in the top_posts list
    for post in top_posts:
        # Get the title of the post
        post_title = post.title
        # Replace "load more comments" placeholders with real comments
        post.comments.replace_more(limit=0)
        # Iterate over the first 10 comments in the post
        for comment in post.comments[:10]:
            # Get the body of the comment
            comment_body = comment.body
            # Check if any of the keywords are in the comment
            if any(keyword in comment_body.lower() for keyword in keywords):
                # Get the number of upvotes for the comment
                upvotes = comment.score
                # Perform sentiment analysis on the comment using TextBlob
                sentiment_score = TextBlob(comment_body).sentiment.polarity
                # Create a data dictionary with the post title, comment body, number of upvotes, and sentiment score
                data_dict = {
                    "Subreddit": subreddit_name,
                    "Post Title": post_title,
                    "Comment Body": comment_body,
                    "Upvotes": upvotes,
                    "Sentiment Score": sentiment_score
                }
                # Append the data dictionary to the list
                data.append(data_dict)
# Save the data as a CSV file
with open('mental_health_comments.csv', mode='w', newline='', encoding='utf-8') as csv_file:
    fieldnames = ["Subreddit", "Post Title", "Comment Body", "Upvotes", "Sentiment Score"]
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for d in data:
        writer.writerow(d)
This Python code collects data from two political subreddits, r/conservative and r/liberal, by searching for comments that contain certain keywords related to mental health: “mental health,” “depression,” and “anxiety.”
For each qualifying comment found within the top 100 posts of each subreddit, the code records the post title, comment body, number of upvotes, and a sentiment analysis score using the TextBlob library.
The resulting data is stored in a list of dictionaries called “data”, with each dictionary containing information about a single qualifying comment, and then saved to a CSV file. That file can be opened in a spreadsheet for further analysis by social scientists or researchers interested in studying the relationship between political ideology and attitudes towards mental health.
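As a sketch of that next analysis step, here’s how you might compare the two communities with pandas, using the CSV file the script just wrote:
import pandas as pd

# Load the keyword-matched comments saved above
df = pd.read_csv("mental_health_comments.csv")
# Compare how many matching comments each community produced and their average sentiment
print(df.groupby("Subreddit")["Sentiment Score"].agg(["count", "mean"]))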
Conclusion
With these scripts we can build a corpus of thousands of comments from a subreddit of our choice and analyze their sentiment within minutes.
All this data has a lot of potential uses. Here are three use cases to spark your imagination:
- Studying public opinion on political issues: By analyzing the sentiment of comments on political subreddits, researchers can gain insights into public opinion on certain issues and track changes in opinion over time.
- Examining engagement on online platforms: Analyzing the relationship between sentiment and engagement (measured by the number of upvotes) can help researchers understand what factors contribute to engagement in online discussions.
- Identifying patterns in discussions of mental health: Researchers can analyze the language used in comments on subreddits related to mental health to identify common themes and patterns in how people talk about mental health issues.
These are just a few ideas. I’d love to hear what you all have in mind!
If you have any other ideas for things we could add to the Academic’s Toolkit, let us know. This post was inspired by HennyGe Wichers’ comment on our first post.
Cheers,
Nathan Laundry from the Intelligent Adaptive Interventions lab at UofT
Edited by: Hyuna Cho