Image credit: Reddit

Using Pushshift API for data analysis on Reddit

Hector Rodriguez Dominguez
MCD-UNISON
Sep 14, 2021 · 8 min read

In this entry, we will learn how to mine, clean, and analyze data from the social network Reddit using PSAW, a Python wrapper library for the Pushshift API.

Introduction

Let’s begin by defining what Reddit and Pushshift are…

Reddit: In its own words, “It’s a network of communities where people can dive into their interests, hobbies and passions. There’s a community for whatever you’re interested in”.

Pushshift: A social media data collection, analysis, and archiving platform that has collected Reddit data and made it available to researchers. Pushshift’s Reddit dataset is updated in real time and includes historical data.

Now that we have defined our tools of the trade, we can begin with the “good stuff”.

Over the last year, the Covid-19 pandemic changed our lives in every aspect. We started using more and more technology in our daily routines, and people increased their usage of social networks, creating a huge amount of data ripe for the picking.

In this publication, we will learn how to gather information from Reddit using Python libraries. We will see how to collect data on a particular topic or subreddit, and how to explore and visualize it by plotting the dataframes, laying the groundwork for further data analysis.

Setting up the environment

Reddit is organized into communities called subreddits. Each subreddit is filled with submissions posted by users. Each submission can be commented on by other users and can be upvoted or downvoted.

In order to analyze Reddit, we need access to its submissions, comments, and user information. To do this, we’ll use an API called Pushshift.
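Under the hood, PSAW (the library we are about to install) is a thin wrapper around Pushshift’s public REST endpoints. Just to illustrate what it wraps, here is a minimal sketch of querying the submission endpoint directly with the requests library; the subreddit name and parameter values are example values, and the endpoint is as documented at the time of writing:

import requests                  #library for HTTP requests

#Query Pushshift's submission search endpoint directly (no authentication required)
resp = requests.get('https://api.pushshift.io/reddit/search/submission',
                    params={'subreddit': 'learnpython', 'size': 5})
for post in resp.json()['data']: #'data' holds the list of matching submissions
    print(post['title'])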

To set up our environment, we first need to install some tools. Open up a terminal and type the following command:

pip install psaw                 #Installing pushshift library

Next, we need to import some tools. Open up a Python script and type the following commands:

from psaw import PushshiftAPI    #library Pushshift
import datetime as dt            #library for date management
import pandas as pd              #library for data manipulation
import matplotlib.pyplot as plt  #library for plotting

Once our tools are installed and imported, we need to create an instance of the Pushshift API in the Python script.

api = PushshiftAPI()              #Object of the API

Now we are all set up! We’ll explore two of the main methods of the API.

“search_submissions”: Returns a list of posts from the selected subreddit.

“search_comments”: Returns a list of comments associated with a search term.

These methods share some common parameters to filter the search results:

“after”: Lower date limit for the search

“before”: Upper date limit for the search

“filter”: Column names we want to retrieve (suggested: ‘id’, ‘author’, ‘created_utc’, ‘domain’, ‘url’, ‘title’, ‘num_comments’)

“limit”: Maximum number of rows to return

“subreddit”: The name of the subreddit to crawl (case sensitive)

“q”: Search term
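To make these parameters concrete, here is a minimal sketch of calling search_submissions directly; the subreddit name, dates, and limit are just example values, and the fields requested in filter become attributes on each result:

results = api.search_submissions(
    subreddit='learnpython',                          #Example subreddit
    after=int(dt.datetime(2021, 1, 1).timestamp()),   #Lower date limit
    before=int(dt.datetime(2021, 1, 31).timestamp()), #Upper date limit
    filter=['id', 'author', 'title'],                 #Columns we want back
    limit=5)                                          #Max number of rows
for post in results:
    print(post.author, '-', post.title)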

We have defined the parameters, so we can now wrap these calls in two helper functions by adding the following code to our script.

"""FOR POSTS"""
def data_prep_posts(subreddit, start_time, end_time, filters, limit):
if(len(filters) == 0):
filters = ['id', 'author', 'created_utc',
'domain', 'url',
'title', 'num_comments']
#We set by default some useful columns

posts = list(api.search_submissions(
subreddit=subreddit, #Subreddit we want to audit
after=start_time, #Start date
before=end_time, #End date
filter=filters, #Column names we want to retrieve
limit=limit)) ##Max number of posts

return pd.DataFrame(posts) #Return dataframe for analysis


"""FOR COMMENTS"""
def data_prep_comments(term, start_time, end_time, filters, limit):
if (len(filters) == 0):
filters = ['id', 'author', 'created_utc',
'body', 'permalink', 'subreddit']
#We set by default some usefull columns

comments = list(api.search_comments(
q=term, #Subreddit we want to audit
after=start_time, #Start date
before=end_time, #End date
filter=filters, #Column names we want to retrieve
limit=limit)) #Max number of comments
return pd.DataFrame(comments) #Return dataframe for analysis

Let’s create our main function, where we will define our parameters and do some data cleaning before we start visualizing the data in plots.

def main():
    subreddit = "darkestdungeon" #Subreddit we are auditing
    start_time = int(dt.datetime(2021, 1, 1).timestamp())
    #Starting date for our search
    end_time = int(dt.datetime(2021, 1, 31).timestamp())
    #Ending date for our search
    filters = []  #We don't want specific filters
    limit = 1000  #Number of elements we want to receive

    """Here we are going to get posts for a brief analysis"""
    #Call function for dataframe creation of posts
    df_p = data_prep_posts(subreddit, start_time,
                           end_time, filters, limit)

    #Convert the timestamp column to datetime format
    df_p['datetime'] = df_p['created_utc'].map(
        lambda t: dt.datetime.fromtimestamp(t))
    #Drop the column in timestamp format
    df_p = df_p.drop('created_utc', axis=1)
    #Sort the rows by datetime
    df_p = df_p.sort_values(by='datetime')
    #Convert to pandas datetime for data analysis
    df_p["datetime"] = pd.to_datetime(df_p["datetime"])

    """Here we are going to get comments for a brief analysis"""
    term = 'bitcoin' #Term we want to search for
    limit = 10       #Number of elements
    #Call function for dataframe creation of comments
    df_c = data_prep_comments(term, start_time,
                              end_time, filters, limit)

Now that we have created our dataframes df_p and df_c and converted the timestamps to datetime format, we can start creating functions for different purposes, add them to our main function, and run them. I’ll add some screenshots of the resulting graphs.

Posts per day on a specific subreddit

def count_posts_per_date(df_p, title, xlabel, ylabel):
    df_p.groupby([df_p.datetime.dt.date]).count().plot(y='id', rot=45, kind='bar', label='Posts')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

Mean of comments on a specific subreddit

def mean_comments_per_date(df_p, title, xlabel, ylabel):
    df_p.groupby([df_p.datetime.dt.date]).mean().plot(y='num_comments', rot=45, kind='line', label='Comments')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()
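One caveat: on recent pandas releases (2.x), calling .mean() on a grouped dataframe that still contains non-numeric columns raises a TypeError instead of silently dropping them. If you hit that, a minimal variant that selects the numeric column before averaging should behave the same:

def mean_comments_per_date(df_p, title, xlabel, ylabel):
    #Select only num_comments before averaging, for compatibility with pandas 2.x
    df_p.groupby(df_p.datetime.dt.date)['num_comments'].mean().plot(rot=45, kind='line', label='Comments')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()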

Most active users on a specific subreddit

def most_active_author(df_p, title, xlabel, ylabel, limit):
    df_p.groupby([df_p.author]).count()['id'].nlargest(limit).sort_values(ascending=True).plot(y='id', rot=45, kind='barh', label='Users')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

Origin of crosspostings

def get_posts_origin(df_p, title, xlabel, ylabel, limit, subreddit):
    #Exclude Reddit's own domains (self-posts and Reddit-hosted images) so only external origins remain
    domains = df_p[(df_p.domain != 'reddit.com') & (df_p.domain != f'self.{subreddit}') & (df_p.domain != 'i.redd.it')]
    domains.groupby(by='domain').count()['id'].nlargest(limit).sort_values(ascending=True).plot(kind='barh', rot=45, x='domain', label='# of posts', legend=True, figsize=(8,13))
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

Most active subreddit according to search term

def get_subreddits(df_p, title, xlabel, ylabel, limit):
    df_p.groupby(by='subreddit').count()['id'].nlargest(limit).sort_values(ascending=True).plot(kind='barh', x='subreddit', label='Subreddit', legend=True, figsize=(8,13))
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

Well, we are almost at the end of this brief tutorial. The only thing left to show you is the full script in case there are still some questions.

Full Script

"""
BEGIN - Script preparation
Section for importing libraries and setting up basic environment configurations
"""
from psaw import PushshiftAPI #Importing the wrapper library for Reddit (Pushshift)
import datetime as dt #Importing library for date management
import pandas as pd #Importing library for data manipulation in python
import matplotlib.pyplot as plt #Importing library for creating interactive visualizations in Python
from pprint import pprint #Importing for displaying lists in the "pretty" way (Not required)

pd.set_option("display.max_columns", None) #Configuration for pandas to show all columns on dataframe
api = PushshiftAPI() #We create an object of the API
"""
END - Script preparation
"""



"""
BEGIN - DATAFRAME GENERATION FUNCTIONS

Here we are going to make a request through the API
to the selected subreddit, and the results are going
to be placed inside a pandas dataframe
"""

"""FOR POSTS"""
def data_prep_posts(subreddit, start_time, end_time, filters, limit):
if(len(filters) == 0):
filters = ['id', 'author', 'created_utc',
'domain', 'url',
'title', 'num_comments'] #We set by default some columns that will be useful for data analysis

posts = list(api.search_submissions(
subreddit=subreddit, #We set the subreddit we want to audit
after=start_time, #Start date
before=end_time, #End date
filter=filters, #Column names we want to get from reddit
limit=limit)) #Max number of posts we wanto to recieve

return pd.DataFrame(posts) #Return dataframe for analysis


"""FOR COMMENTS"""
def data_prep_comments(term, start_time, end_time, filters, limit):
if (len(filters) == 0):
filters = ['id', 'author', 'created_utc',
'body', 'permalink', 'subreddit'] #We set by default some columns that will be useful for data analysis

comments = list(api.search_comments(
q=term, #We set the subreddit we want to audit
after=start_time, #Start date
before=end_time, #End date
filter=filters, #Column names we want to get from reddit
limit=limit)) #Max number of comments we wanto to recieve
return pd.DataFrame(comments) #Return dataframe for analysis

"""
END - DATAFRAME GENERATION FUNCTIONS
"""



"""
BEGIN - FUNCTIONS
"""
###Function to plot the number of posts per day on the specified subreddit
def count_posts_per_date(df_p, title, xlabel, ylabel):
    df_p.groupby([df_p.datetime.dt.date]).count().plot(y='id', rot=45, kind='bar', label='Posts')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

###Function to plot the mean of comments per day on the specified subreddit
def mean_comments_per_date(df_p, title, xlabel, ylabel):
    df_p.groupby([df_p.datetime.dt.date]).mean().plot(y='num_comments', rot=45, kind='line', label='Comments')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

###Function to plot the most active users on the subreddit
def most_active_author(df_p, title, xlabel, ylabel, limit):
    df_p.groupby([df_p.author]).count()['id'].nlargest(limit).sort_values(ascending=True).plot(y='id', rot=45, kind='barh', label='Users')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

###Function to get the origin of the crosspostings
def get_posts_origin(df_p, title, xlabel, ylabel, limit, subreddit):
    #Exclude Reddit's own domains so only external origins remain
    domains = df_p[(df_p.domain != 'reddit.com') & (df_p.domain != f'self.{subreddit}') & (df_p.domain != 'i.redd.it')]
    domains.groupby(by='domain').count()['id'].nlargest(limit).sort_values(ascending=True).plot(kind='barh', rot=45, x='domain', label='# of posts', legend=True, figsize=(8,13))
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

###Function to get the most active subreddits according to a search term
def get_subreddits(df_p, title, xlabel, ylabel, limit):
    df_p.groupby(by='subreddit').count()['id'].nlargest(limit).sort_values(ascending=True).plot(kind='barh', x='subreddit', label='Subreddit', legend=True, figsize=(8,13))
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()


"""
END - FUNCTIONS
"""


def main():
    subreddit = "darkestdungeon" #Name of the subreddit we are auditing
    start_time = int(dt.datetime(2021, 1, 1).timestamp()) #We define the starting date for our search
    end_time = int(dt.datetime(2021, 1, 31).timestamp())  #We define the ending date for our search
    filters = []  #We don't want specific filters
    limit = 1000  #Number of elements we want to receive

    """Here we are going to get posts for a brief analysis"""
    df_p = data_prep_posts(subreddit, start_time,
                           end_time, filters, limit) #Call function for dataframe creation of posts

    df_p['datetime'] = df_p['created_utc'].map(
        lambda t: dt.datetime.fromtimestamp(t))
    df_p = df_p.drop('created_utc', axis=1) #Drop the column in timestamp format
    df_p = df_p.sort_values(by='datetime')  #Sort the rows by datetime
    df_p["datetime"] = pd.to_datetime(df_p["datetime"]) #Convert to pandas datetime for data analysis


    df_p.to_csv(f'dataset_{subreddit}_posts.csv', sep=',', #Save the dataset to a csv file for future analysis
                header=True, index=False, columns=[
                    'id', 'author', 'datetime', 'domain',
                    'url', 'title', 'num_comments'
                ])

    count_posts_per_date(df_p, 'Posts per day', 'Days', #Plot the number of posts per day on the specified subreddit
                         'Posts')
    mean_comments_per_date(df_p, #Plot the mean of comments per day on the specified subreddit
                           'Average comments per day',
                           'Days', 'Comments')
    most_active_author(df_p, 'Most active users', #Plot the most active users on the subreddit
                       'Posts', 'Users', 10)
    get_posts_origin(df_p, 'Origin of crosspostings', #Plot the origin of the crosspostings
                     'Crossposts', 'Origins', 10,
                     subreddit)

    """Here we are going to get comments for a brief analysis"""
    term = 'bitcoin' #Term we want to search for
    limit = 10       #Number of elements we want to receive
    df_c = data_prep_comments(term, start_time, #Call function for dataframe creation of comments
                              end_time, filters, limit)

    get_subreddits(df_c, 'Most active subreddits', 'Posts', #Plot the most active subreddits for the search term
                   'Subreddits', 10)


if __name__ == "__main__":
    main()

Also, you can download it from https://github.com/Silvertongue26/reddit_api/blob/main/reddit6.py
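Assuming you saved the script under the same name as in the repository, running the whole pipeline from a terminal is a single command; matplotlib will open each figure in turn and the CSV will land in the working directory:

python reddit6.py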
