Image credit: Reddit

Using Pushshift API for data analysis on Reddit

Hector Rodriguez Dominguez
MCD-UNISON
Sep 14, 2021 · 8 min read

In this entry, we will learn how to mine, clean, and analyze data from the social network Reddit using PSAW, a Python wrapper library for the Pushshift API.

Introduction

Let’s begin by defining what Reddit and Pushshift are…

Reddit: In its own words, “It’s a network of communities where people can dive into their interests, hobbies and passions. There’s a community for whatever you’re interested in”.

Pushshift: A social media data collection, analysis, and archiving platform that has collected Reddit data and made it available to researchers. Pushshift’s Reddit dataset is updated in real time and includes historical data.

Now that we have defined our tools of the trade, we can begin with the “good stuff”.

Over the last year, the Covid-19 pandemic changed our lives in every aspect. We started using more and more technology in our daily routines, and people increased their usage of social networks, creating a huge amount of data ripe for the picking.

In this publication, we will learn how to gather information from Reddit using Python libraries. We will see how to collect data on a particular topic or subreddit, and how to explore and visualize it by plotting the dataframes, laying the groundwork for further data analysis.

Setting up the environment

Reddit is organized into communities called subreddits. Each subreddit is filled with submissions posted by users. Each submission can be commented on by other users and can be upvoted or downvoted.

In order to analyze Reddit, we need access to its submissions, comments, and user information. To do this, we’ll use an API called Pushshift.
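Under the hood, PSAW (the library we are about to install) is a thin wrapper around Pushshift’s public REST endpoints. Just to illustrate what it wraps, here is a minimal sketch of querying the submission endpoint directly with the requests library; the subreddit name and parameter values are example values, and the endpoint is as documented at the time of writing:

import requests                  #library for HTTP requests

#Query Pushshift's submission search endpoint directly (no authentication required)
resp = requests.get('https://api.pushshift.io/reddit/search/submission',
                    params={'subreddit': 'learnpython', 'size': 5})
for post in resp.json()['data']: #'data' holds the list of matching submissions
    print(post['title'])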

To set up our environment, we first need to install some tools. Open up a terminal and type the following command:

pip install psaw                 #Installing pushshift library

Next, we need to import some tools. Open up a Python script and type the following commands:

from psaw import PushshiftAPI    #library Pushshift
import datetime as dt            #library for date management
import pandas as pd              #library for data manipulation
import matplotlib.pyplot as plt  #library for plotting

Once our tools are installed and imported, we need to create an instance of the Pushshift API in the Python script.

api = PushshiftAPI()              #Object of the API

Now we are all set up! We’ll explore two of the main methods of the API.

“search_submissions”: Returns a list of posts from the selected subreddit.

“search_comments”: Returns a list of comments associated with a search term.

These methods share some common parameters to filter the search results:

“after”: Lower date limit for the search

“before”: Upper date limit for the search

“filter”: Column names we want to retrieve (suggested: ‘id’, ‘author’, ‘created_utc’, ‘domain’, ‘url’, ‘title’, ‘num_comments’)

“limit”: Maximum number of rows to return

“subreddit”: The name of the subreddit to crawl (case sensitive)

“q”: Search term
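To make these parameters concrete, here is a minimal sketch of calling search_submissions directly; the subreddit name, dates, and limit are just example values, and the fields requested in filter become attributes on each result:

results = api.search_submissions(
    subreddit='learnpython',                          #Example subreddit
    after=int(dt.datetime(2021, 1, 1).timestamp()),   #Lower date limit
    before=int(dt.datetime(2021, 1, 31).timestamp()), #Upper date limit
    filter=['id', 'author', 'title'],                 #Columns we want back
    limit=5)                                          #Max number of rows
for post in results:
    print(post.author, '-', post.title)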

We have defined the parameters, so we can now wrap these calls in two helper functions by adding the following code to our script.

"""FOR POSTS"""
def data_prep_posts(subreddit, start_time, end_time, filters, limit):
if(len(filters) == 0):
filters = ['id', 'author', 'created_utc',
'domain', 'url',
'title', 'num_comments']
#We set by default some useful columns

posts = list(api.search_submissions(
subreddit=subreddit, #Subreddit we want to audit
after=start_time, #Start date
before=end_time, #End date
filter=filters, #Column names we want to retrieve
limit=limit)) ##Max number of posts

return pd.DataFrame(posts) #Return dataframe for analysis


"""FOR COMMENTS"""
def data_prep_comments(term, start_time, end_time, filters, limit):
if (len(filters) == 0):
filters = ['id', 'author', 'created_utc',
'body', 'permalink', 'subreddit']
#We set by default some usefull columns

comments = list(api.search_comments(
q=term, #Subreddit we want to audit
after=start_time, #Start date
before=end_time, #End date
filter=filters, #Column names we want to retrieve
limit=limit)) #Max number of comments
return pd.DataFrame(comments) #Return dataframe for analysis

Let’s create our main function, where we will define our parameters and do some data cleaning before we start visualizing the data in plots.

def main():
    subreddit = "darkestdungeon" #Subreddit we are auditing
    start_time = int(dt.datetime(2021, 1, 1).timestamp())
    #Starting date for our search
    end_time = int(dt.datetime(2021, 1, 31).timestamp())
    #Ending date for our search
    filters = []  #We don't want specific filters
    limit = 1000  #Number of elements we want to receive

    """Here we are going to get posts for a brief analysis"""
    #Call function for dataframe creation of posts
    df_p = data_prep_posts(subreddit, start_time,
                           end_time, filters, limit)

    #Convert the timestamp column to datetime format
    df_p['datetime'] = df_p['created_utc'].map(
        lambda t: dt.datetime.fromtimestamp(t))
    #Drop the column in timestamp format
    df_p = df_p.drop('created_utc', axis=1)
    #Sort the rows by datetime
    df_p = df_p.sort_values(by='datetime')
    #Convert to pandas datetime for data analysis
    df_p["datetime"] = pd.to_datetime(df_p["datetime"])

    """Here we are going to get comments for a brief analysis"""
    term = 'bitcoin' #Term we want to search for
    limit = 10       #Number of elements
    #Call function for dataframe creation of comments
    df_c = data_prep_comments(term, start_time,
                              end_time, filters, limit)

Now that we have created our dataframes df_p and df_c and converted the timestamps to datetime format, we can start creating functions for different purposes, add them to our main function, and run them. I’ll add some screenshots of the resulting graphs.

Posts per day on a specific subreddit

def count_posts_per_date(df_p, title, xlabel, ylabel):
    df_p.groupby([df_p.datetime.dt.date]).count().plot(y='id', rot=45, kind='bar', label='Posts')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

Mean of comments on a specific subreddit

def mean_comments_per_date(df_p, title, xlabel, ylabel):
    df_p.groupby([df_p.datetime.dt.date]).mean().plot(y='num_comments', rot=45, kind='line', label='Comments')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()
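One caveat: on recent pandas releases (2.x), calling .mean() on a grouped dataframe that still contains non-numeric columns raises a TypeError instead of silently dropping them. If you hit that, a minimal variant that selects the numeric column before averaging should behave the same:

def mean_comments_per_date(df_p, title, xlabel, ylabel):
    #Select only num_comments before averaging, for compatibility with pandas 2.x
    df_p.groupby(df_p.datetime.dt.date)['num_comments'].mean().plot(rot=45, kind='line', label='Comments')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()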

Most active users on a specific subreddit

def most_active_author(df_p, title, xlabel, ylabel, limit):
    df_p.groupby([df_p.author]).count()['id'].nlargest(limit).sort_values(ascending=True).plot(y='id', rot=45, kind='barh', label='Users')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

Origin of crosspostings

def get_posts_origin(df_p, title, xlabel, ylabel, limit, subreddit):
    #Exclude Reddit's own domains (self-posts and Reddit-hosted images) so only external origins remain
    domains = df_p[(df_p.domain != 'reddit.com') & (df_p.domain != f'self.{subreddit}') & (df_p.domain != 'i.redd.it')]
    domains.groupby(by='domain').count()['id'].nlargest(limit).sort_values(ascending=True).plot(kind='barh', rot=45, x='domain', label='# of posts', legend=True, figsize=(8,13))
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

Most active subreddit according to search term

def get_subreddits(df_p, title, xlabel, ylabel, limit):
    df_p.groupby(by='subreddit').count()['id'].nlargest(limit).sort_values(ascending=True).plot(kind='barh', x='subreddit', label='Subreddit', legend=True, figsize=(8,13))
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

Well, we are almost at the end of this brief tutorial. The only thing left to show you is the full script in case there are still some questions.

Full Script

"""
BEGIN - Script preparation
Section for importing libraries and setting up basic environment configurations
"""
from psaw import PushshiftAPI #Importing the wrapper library for Reddit (Pushshift)
import datetime as dt #Importing library for date management
import pandas as pd #Importing library for data manipulation in python
import matplotlib.pyplot as plt #Importing library for creating interactive visualizations in Python
from pprint import pprint #Importing for displaying lists in the "pretty" way (Not required)

pd.set_option("display.max_columns", None) #Configuration for pandas to show all columns on dataframe
api = PushshiftAPI() #We create an object of the API
"""
END - Script preparation
"""



"""
BEGIN - DATAFRAME GENERATION FUNCTIONS

Here we are going to make a request through the API
to the selected subreddit, and the results are going
to be placed inside a pandas dataframe
"""

"""FOR POSTS"""
def data_prep_posts(subreddit, start_time, end_time, filters, limit):
if(len(filters) == 0):
filters = ['id', 'author', 'created_utc',
'domain', 'url',
'title', 'num_comments'] #We set by default some columns that will be useful for data analysis

posts = list(api.search_submissions(
subreddit=subreddit, #We set the subreddit we want to audit
after=start_time, #Start date
before=end_time, #End date
filter=filters, #Column names we want to get from reddit
limit=limit)) #Max number of posts we wanto to recieve

return pd.DataFrame(posts) #Return dataframe for analysis


"""FOR COMMENTS"""
def data_prep_comments(term, start_time, end_time, filters, limit):
if (len(filters) == 0):
filters = ['id', 'author', 'created_utc',
'body', 'permalink', 'subreddit'] #We set by default some columns that will be useful for data analysis

comments = list(api.search_comments(
q=term, #We set the subreddit we want to audit
after=start_time, #Start date
before=end_time, #End date
filter=filters, #Column names we want to get from reddit
limit=limit)) #Max number of comments we wanto to recieve
return pd.DataFrame(comments) #Return dataframe for analysis

"""
END - DATAFRAME GENERATION FUNCTIONS
"""



"""
BEGIN - FUNCTIONS
"""
###Function to plot the number of posts per day on the specified subreddit
def count_posts_per_date(df_p, title, xlabel, ylabel):
    df_p.groupby([df_p.datetime.dt.date]).count().plot(y='id', rot=45, kind='bar', label='Posts')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

###Function to plot the mean of comments per day on the specified subreddit
def mean_comments_per_date(df_p, title, xlabel, ylabel):
    df_p.groupby([df_p.datetime.dt.date]).mean().plot(y='num_comments', rot=45, kind='line', label='Comments')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

###Function to plot the most active users on the subreddit
def most_active_author(df_p, title, xlabel, ylabel, limit):
    df_p.groupby([df_p.author]).count()['id'].nlargest(limit).sort_values(ascending=True).plot(y='id', rot=45, kind='barh', label='Users')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

###Function to get the origin of the crosspostings
def get_posts_origin(df_p, title, xlabel, ylabel, limit, subreddit):
    #Exclude Reddit's own domains so only external origins remain
    domains = df_p[(df_p.domain != 'reddit.com') & (df_p.domain != f'self.{subreddit}') & (df_p.domain != 'i.redd.it')]
    domains.groupby(by='domain').count()['id'].nlargest(limit).sort_values(ascending=True).plot(kind='barh', rot=45, x='domain', label='# of posts', legend=True, figsize=(8,13))
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

###Function to get the most active subreddits according to a search term
def get_subreddits(df_p, title, xlabel, ylabel, limit):
    df_p.groupby(by='subreddit').count()['id'].nlargest(limit).sort_values(ascending=True).plot(kind='barh', x='subreddit', label='Subreddit', legend=True, figsize=(8,13))
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()


"""
END - FUNCTIONS
"""


def main():
    subreddit = "darkestdungeon" #Name of the subreddit we are auditing
    start_time = int(dt.datetime(2021, 1, 1).timestamp()) #We define the starting date for our search
    end_time = int(dt.datetime(2021, 1, 31).timestamp())  #We define the ending date for our search
    filters = []  #We don't want specific filters
    limit = 1000  #Number of elements we want to receive

    """Here we are going to get posts for a brief analysis"""
    df_p = data_prep_posts(subreddit, start_time,
                           end_time, filters, limit) #Call function for dataframe creation of posts

    df_p['datetime'] = df_p['created_utc'].map(
        lambda t: dt.datetime.fromtimestamp(t))
    df_p = df_p.drop('created_utc', axis=1) #Drop the column in timestamp format
    df_p = df_p.sort_values(by='datetime')  #Sort the rows by datetime
    df_p["datetime"] = pd.to_datetime(df_p["datetime"]) #Convert to pandas datetime for data analysis


    df_p.to_csv(f'dataset_{subreddit}_posts.csv', sep=',', #Save the dataset to a csv file for future analysis
                header=True, index=False, columns=[
                    'id', 'author', 'datetime', 'domain',
                    'url', 'title', 'num_comments'
                ])

    count_posts_per_date(df_p, 'Posts per day', 'Days', #Plot the number of posts per day on the specified subreddit
                         'Posts')
    mean_comments_per_date(df_p, #Plot the mean of comments per day on the specified subreddit
                           'Average comments per day',
                           'Days', 'Comments')
    most_active_author(df_p, 'Most active users', #Plot the most active users on the subreddit
                       'Posts', 'Users', 10)
    get_posts_origin(df_p, 'Origin of crosspostings', #Plot the origin of the crosspostings
                     'Crossposts', 'Origins', 10,
                     subreddit)

    """Here we are going to get comments for a brief analysis"""
    term = 'bitcoin' #Term we want to search for
    limit = 10       #Number of elements we want to receive
    df_c = data_prep_comments(term, start_time, #Call function for dataframe creation of comments
                              end_time, filters, limit)

    get_subreddits(df_c, 'Most active subreddits', 'Posts', #Plot the most active subreddits for the search term
                   'Subreddits', 10)


if __name__ == "__main__":
    main()

Also, you can download it from https://github.com/Silvertongue26/reddit_api/blob/main/reddit6.py
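Assuming you saved the script under the same name as in the repository, running the whole pipeline from a terminal is a single command; matplotlib will open each figure in turn and the CSV will land in the working directory:

python reddit6.py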
