A Network Analysis of “Impeach”-Related Reddit Posts

Published in

Web Mining [IS688, Spring 2021]

12 min read · Feb 24, 2021

The impeachment trial may be over, but what has Reddit had to say about it?

If you’re an American who’s been paying attention to political news lately, you probably heard that Trump was acquitted at his second impeachment trial. The House impeached the former president on a charge of incitement of insurrection following the Capitol riot, and Democrats in Congress attempted to convict him for his actions. The Senate voted 57 for guilty and 43 for not guilty: all Democratic and both Independent senators voted guilty, along with 7 Republican senators. A conviction would have required a 2/3 majority of the Senate (67 votes). This is the second time Trump has been acquitted in an impeachment trial. More information about Trump’s impeachment trial can be read here.

Source: https://www.flickr.com/photos/37527185@N05/50812356151/

Source: https://www.latimes.com/politics/story/2021-01-14/trump-impeachment-trial-could-begin-on-inauguration-day

Unsurprisingly, people have been talking about how this is Trump’s second impeachment trial, how he’s the first president in US history to be impeached twice (and acquitted twice), whether he should or should not be convicted, and so on. Such an event can cause Americans much distress. Some people might not care at all and choose to ignore what’s happened. But those who have paid attention to the trial have discussed online what they think it means for the country, for Trump, and for the future of America.

In this article, I will discuss how I investigated posts related to Trump’s impeachment on Reddit. Reddit is a social media platform where users can post and comment in different subreddits (communities based on topics, e.g. r/politics or r/news). Users typically go by a pseudonym and are generally free to discuss whatever they want on subreddits.

In my network analysis, I looked into the posts that have been submitted with the keyword “impeach” across all of Reddit to see which communities people have posted in and whether any particular users or posts stood out. By doing this, we can potentially gain some insight into what people have had to say about the impeachment trial and where the general public stands on this issue. Has there even been significant discussion revolving around this event? Do people care about it at all?

The Process

The following is an explanation of how I got to the results (discussed below). I used Python in a Jupyter Notebook and Gephi (more on that in step 4) for this analysis.

1. Get Reddit API Access: To obtain access to the Reddit API, a Reddit account and approved API credentials are required. More information on using the Reddit API is here. Once you have the API credentials, you can access data on what users post and comment in public communities. Discussion in public subreddits is open to everyone and is therefore accessible with the API.

1a. Import Libraries for Usage: Before starting to extract data, I imported the libraries I expected to use: praw (a Python wrapper for the Reddit API), pandas, json, matplotlib, and datetime. Of these five libraries, the most important ones were praw and pandas.

2. Pull Data from Reddit: Since I’m looking for posts with the keyword “impeach” on all of Reddit, I can search for it directly. Example code outlining what that looks like is below. Normally I would search for posts within a specific subreddit, but it’s possible to use “r/all” to search across all of Reddit. For this network analysis, I decided to use only the keyword “impeach” rather than multiple keywords in order to obtain as many results as possible. I wanted a fairly broad range of posts as well, hence the 1000 post limit.

import praw

# Access the Reddit API
redditApi = praw.Reddit(client_id='insert_client_id',
                        client_secret='insert_client_secret',
                        user_agent='insert_app_name')

# Define the "all" subreddit, which spans all of Reddit
all_subreddits = redditApi.subreddit("all")

# Search for up to 1000 posts with the keyword "impeach"
for post in all_subreddits.search("impeach", limit=1000):
    print(post.title, post.author, post.subreddit)

3. Shape Data into Pandas Dataframe and Export CSV: It helps to have this data in a dataframe for easier viewing, manipulation, and analysis. Using a for loop, I appended the results from Reddit to a list. In that loop, I also convert the time the post was created from Unix time into a more readable date (year-month-day hour:minute:second). The information I wanted to extract from each post includes the title, the author, the subreddit the post originated from, the score, the upvote ratio (% of upvotes on that post), the number of comments, the time the post was created, the number of awards the post received, and the ID of the post. The ID of a post allows for easier web access to the post. For example, if the ID of a post is ‘abcdef’, then you can find that post on the web at reddit.com/abcdef. For this dataframe, I ended up obtaining only 248 relevant posts with “impeach” in the post title. Additionally, I kept only the author and subreddit columns of the dataframe to see if there were any significant users on Reddit that discussed the impeachment trial and what subreddits they posted on. I can then also see if those users posted on multiple subreddits. Once that was done, I exported that filtered dataframe into a CSV.

import datetime
import pandas

# Search for posts
submissions = all_subreddits.search("impeach", limit=1000)

# Create empty list
posts = []

# Convert post.created time to a readable date and append results into the list
# Info includes title, author, subreddit, score, upvote ratio, num of comments, timestamp, awards received, id
for post in submissions:
    timestamp = datetime.datetime.fromtimestamp(post.created)
    real_time = timestamp.strftime('%Y-%m-%d %H:%M:%S')
    posts.append([post.title, post.author, post.subreddit,
                  post.score, post.upvote_ratio, post.num_comments,
                  real_time, post.total_awards_received, post.id])

# Take list with results and transform into a dataframe with column names
df = pandas.DataFrame(posts, columns=['title', 'author', 'subreddit', 'score', 'upvote ratio',
                                      'num comments', 'created', 'awards', 'id'])

# Filter authors and subreddits
au_and_subs = df[["author", "subreddit"]]

# Export dataframe to a CSV
au_and_subs.to_csv('name_of_file.csv')

4. Import CSV to Gephi: With the data I need in the CSV, I finally move on to the network analysis. I use the software program Gephi to shape this network and analyze it. I use Gephi not only because it is something I’m familiar with, but also because it offers several layout algorithms for networks and can compute some network statistics. Gephi also doesn’t have an undo feature (that I’m aware of at least), so I tend to run several trials of layout algorithms until I find one that I’m happy with. Different layout algorithms have different parameters and scales that affect how the network will look. Figure 1 displays what the network looks like upon importing it into Gephi. This is a directed network with 236 nodes and 198 edges.

Figure 1: What the network initially looks like before running any layout algorithms.
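
As an illustration of how repeated author/subreddit pairs can become weighted edges, here is a minimal sketch. It assumes the CSV is imported as an edge table whose columns are named Source, Target, and Weight, which is the layout Gephi's spreadsheet importer expects for edges. The helper name build_edge_list and the sample rows are hypothetical and are not taken from the notebook used for this article.

```python
import pandas as pd


def build_edge_list(pairs: pd.DataFrame) -> pd.DataFrame:
    """Collapse repeated (author, subreddit) rows into weighted edges."""
    return (
        pairs.groupby(["author", "subreddit"])
        .size()                       # number of posts per pair
        .reset_index(name="Weight")   # becomes the edge weight
        .rename(columns={"author": "Source", "subreddit": "Target"})
    )


# Hypothetical sample data standing in for the exported author/subreddit columns
sample = pd.DataFrame({
    "author": ["user_a", "user_a", "user_b", "user_a"],
    "subreddit": ["politics", "politics", "news", "news"],
})

edges = build_edge_list(sample)
print(edges)
# edges.to_csv("edges.csv", index=False)  # file name is a placeholder
```

Each row of the result is one directed edge from a user to a subreddit, and a user who posted twice in the same subreddit gets a single edge with weight 2 rather than two separate edges.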

5. Run Layout Algorithms on the Network: This next step required some trial and error. Since I don’t know in advance what kind of clusters and connections will emerge, I run multiple layout algorithms, sometimes across multiple workspaces, until the network looks right. I eventually ended up using ForceAtlas and Contraction. The result in Figure 2 looks messy and too clumped, but this will be fixed.

Figure 2: Network after running ForceAtlas and Contraction layout algorithms.

6. Run Network Statistics: To gain some further insight on the network, Gephi can run network statistics and give an overview of what it is we’re looking at. The most significant statistics I found were the following:

Average Degree = 0.839
Average Weighted Degree = 1.047
Modularity = 0.554
Connected Components = 38

The average degree refers to the average number of edges a node has. That metric is fairly low, meaning that most users posted to only one subreddit on this topic. The weighted degree metric is similar but includes the weight of each edge. In this case, an edge that’s weighted more represents a user that’s posted more to a certain subreddit. Modularity refers to how well the network can be separated into its own clusters. Higher modularity scores refer to networks that have “dense connections between nodes within [clusters] but sparse connections between nodes in different [clusters]” [1]. The modularity score of 0.554 is moderate, so the network has some cluster structure without being cleanly divided into tightly knit groups. Finally, the number of connected components refers to how many separate groups of nodes exist, where nodes within a group are linked to each other but not to any node outside it. In this case, there are 38, which means that many of the users are not connected to one another through their posts or subreddits. Figure 3 will reflect these network statistics.
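
To make these statistics concrete, here is a small sketch in plain Python on a made-up edge list. It assumes that the average degree of a directed network is the number of edges divided by the number of nodes, which is consistent with the figures above (198 / 236 ≈ 0.839), and that components are counted while ignoring edge direction. The function names and toy data are hypothetical, and Gephi's own implementation may differ in its details.

```python
def average_degree(num_nodes, edges):
    """Average degree of a directed graph, taken here as edges / nodes."""
    return len(edges) / num_nodes


def average_weighted_degree(num_nodes, edges):
    """Same idea, but each edge counts by its weight."""
    return sum(weight for _, _, weight in edges) / num_nodes


def count_components(nodes, edges):
    """Count connected components (ignoring direction) with union-find."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for source, target, _ in edges:
        parent[find(source)] = find(target)
    return len({find(node) for node in nodes})


# Toy network: (user, subreddit, number of posts)
toy_edges = [
    ("user_a", "politics", 2),
    ("user_b", "politics", 1),
    ("user_b", "news", 1),
    ("user_c", "conspiracy", 1),
]
toy_nodes = sorted({n for s, t, _ in toy_edges for n in (s, t)})

print(round(average_degree(len(toy_nodes), toy_edges), 3))
print(round(average_weighted_degree(len(toy_nodes), toy_edges), 3))
print(count_components(toy_nodes, toy_edges))
print(round(198 / 236, 3))  # node and edge counts reported for this network
```

In the toy network, user_a and user_b are joined through r/politics while user_c sits in a separate component, which mirrors how the real network splits into one large cluster and many small ones.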

7. Continue Shaping The Network Appearance: Figure 3 shows what the network looks like after working on its appearance more. This involves a few steps. For one, I add the label of each node to the graph. Additionally, in Gephi’s Appearance tab, I adjust how the nodes and edges look.

For nodes, I rank node size by in-degree (how many edges point to a node). Bigger nodes are the subreddits users posted to. I also color nodes by modularity: red to yellow to blue. The redder clusters are denser and the bluer clusters are less dense. Everything else falls somewhere in the middle.

For edges, I color by weight. Again, I use the red to yellow to blue color scheme. In this case, redder arrows indicate smaller edge weights and bluer arrows indicate larger edge weights.

In this network, I would argue that the most important cluster, containing the most important nodes, is the one made up of users posting to r/politics (the big red cluster to the right). That is where most of the activity is.
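
As a rough illustration of the in-degree ranking, the sketch below counts how many distinct users point to each subreddit in a made-up list of (user, subreddit) pairs. The names and data are hypothetical; in the actual analysis this ranking was done inside Gephi.

```python
from collections import Counter


def in_degree(edges):
    """Count distinct sources pointing at each target node."""
    unique_edges = set(edges)  # repeated posts count as one edge
    return Counter(target for _, target in unique_edges)


toy_edges = [
    ("user_a", "politics"),
    ("user_b", "politics"),
    ("user_a", "politics"),  # a second post by the same user
    ("user_b", "news"),
]

degrees = in_degree(toy_edges)
print(degrees.most_common())
```

A subreddit that many different users posted to ends up with a high in-degree and therefore a large node, regardless of how many posts any single user made there.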

Figure 3: The network in its entirety from afar.

Figure 4: Closer look at politics subreddit cluster.

Figure 5a: Connections from politics subreddit to news subreddit. | Figure 5b: Connections from politics subreddit to Illinois subreddit and Michigan subreddit.

Figure 6: Closer look at Conservative subreddit cluster.

Figure 7: Closer look at other connections in the network including the conspiracy subreddit.

8. Making Observations About the Network: As stated before, this is a directed network. What this means here is that an arrow from one node to another represents a user posting on a certain subreddit. It’s clear in Figures 3 and 4 that r/politics is the subreddit that contains the majority of the posts containing “impeach.” It’s the densest cluster and has the most connections. A few users posted with the keyword “impeach” on both r/politics and other subreddits, as shown in Figures 5a and 5b.

There is some activity on other subreddits such as r/Conservative where a handful of users posted as shown in Figure 6. Finally, there are many other connected components where up to several users post on one or two subreddits.

We can also see that the node PoliticsModeratorBot is the user that posted the most in r/politics, as shown by the blue edge pointing to the politics node. Just by the name, we can likely infer this is a bot on r/politics that generated posts. Even so, we may find that the posts by PoliticsModeratorBot attracted a lot of activity.
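
These observations can also be checked directly in pandas. The following is a minimal sketch on a hypothetical author/subreddit table with the same two columns as the exported CSV. The user names are invented, and this is not necessarily how the observations above were made, since those came from reading the Gephi graph.

```python
import pandas as pd

# Hypothetical stand-in for the author/subreddit columns
pairs = pd.DataFrame({
    "author": ["bot", "bot", "bot", "user_a", "user_a", "user_b"],
    "subreddit": ["politics", "politics", "politics",
                  "politics", "news", "Conservative"],
})

# Authors who posted in more than one subreddit
subs_per_author = pairs.groupby("author")["subreddit"].nunique()
multi_sub_authors = subs_per_author[subs_per_author > 1].index.tolist()

# Number of posts per author within a single subreddit
politics_counts = pairs.loc[pairs["subreddit"] == "politics", "author"].value_counts()
top_politics_author = politics_counts.idxmax()

print(multi_sub_authors)
print(top_politics_author, politics_counts[top_politics_author])
```

The first result corresponds to the users who bridge two subreddit clusters in the graph, and the second corresponds to the heaviest edge pointing at a subreddit node.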

9. Looking Further Into The Reddit Posts: Based on my network observations, I wanted to look specifically into what PoliticsModeratorBot posted on r/politics as well as the top posts on r/Conservative and r/conspiracy. Going back to the dataframe in my Jupyter notebook, I created a separate filtered dataframe for each of these criteria. What I thought was important was determining how much engagement these posts were getting. Thus, when filtering this data, I sorted it by number of comments, in descending order. I then briefly looked at the most commented post in each of these three sets of data.

Most Commented PoliticsModeratorBot Post

# Filter the dataframe (df), not the raw list (posts), then sort by comment count
polmodbot_posts = df.loc[df['author'] == 'PoliticsModeratorBot']
polmodbot_posts.sort_values(by='num comments', ascending=False)

Top 5 Most Commented Posts with “impeach” Submitted by PoliticsModeratorBot on r/politics.

The full title of the most commented post is “Discussion Thread: House Morning Session — Debate and Votes on Article of Impeachment of Donald J. Trump — 01/13/2021 | Live — 9:00 AM ET,” indicating that people on the subreddit were discussing the trial live. There were a total of 64,431 comments and I also gathered that there were a total of 409 top-level comments (the majority of comments are replies to other comments). Top-level comments are comments that reply to the actual post. Among those 409 top-level comments, 326 of them come from distinct users. You can access the post here: reddit.com/kwfzfa.
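
The article does not show how the top-level comment counts were gathered, so the following is only a sketch of one possible approach. The helper summarize_top_level is hypothetical. It accepts any iterable of objects with an author attribute, which is how PRAW comment objects expose the comment's author, and it is demonstrated here with stand-in objects instead of live Reddit data.

```python
from types import SimpleNamespace


def summarize_top_level(comments):
    """Return (number of top-level comments, number of distinct authors)."""
    comments = list(comments)
    # Deleted accounts show up with author set to None, so skip them
    authors = {str(c.author) for c in comments if c.author is not None}
    return len(comments), len(authors)


# With PRAW this might be called as follows (not run here, needs credentials):
#   submission = redditApi.submission(id="kwfzfa")
#   submission.comments.replace_more(limit=0)  # drop "load more" placeholders
#   total, distinct = summarize_top_level(submission.comments)

# Stand-in comments for demonstration
fake_comments = [
    SimpleNamespace(author="user_a"),
    SimpleNamespace(author="user_b"),
    SimpleNamespace(author="user_a"),
    SimpleNamespace(author=None),
]
total, distinct = summarize_top_level(fake_comments)
print(total, distinct)
```

Iterating over a submission's comment forest yields only the comments that reply directly to the post, which is why no recursion into replies is needed for this count.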

Most Commented r/Conservative Post

# Filter the dataframe (df), not the raw list (posts), then sort by comment count
conserve = df.loc[df['subreddit'] == 'Conservative']
conserve.sort_values(by='num comments', ascending=False)

Top 5 Most Commented Posts with “impeach” Submitted on r/Conservative.

The full title of the most commented post is “House impeaches Trump for second time over Capitol riots.” There were a total of 10,524 comments on this post. There were a total of 54 top-level comments. Among those 54 top-level comments, 50 of them come from distinct users. You can access the post here: reddit.com/kwqbyu.

Most Commented r/conspiracy Post

# Filter the dataframe (df), not the raw list (posts), then sort by comment count
conspire = df.loc[df['subreddit'] == 'conspiracy']
conspire.sort_values(by='num comments', ascending=False)

Top 5 Most Commented Posts with “impeach” Submitted on r/conspiracy.

The full title of the most commented post is “If they can forge documents to impeach a president. Imagine what they can do to you.” There were a total of 595 comments and a total of 53 top-level comments. All 53 top-level comments came from distinct users. It’s clear there are many people who speculate that the articles of impeachment against Donald Trump were forged as part of a conspiracy. You can access the post here: reddit.com/lkdnfo.

Limitations

Using the keyword “impeach” does account for posts that talk about the impeachment in the body text of the post but not the title as well as posts that use other forms of the word such as “impeachment” or “impeached.”
Not all posts with the word “impeach” are necessarily referring to Trump’s impeachment, but essentially all posts were created in 2021. All posts were likely about Trump’s impeachment, but it is not guaranteed.
Accessing the comments on a post and mapping them (along with second or third-level comments) onto a network for its own analysis is possible, but was not feasible for the time being. Because of the large number of comments, parsing through all of them would take a long time to run on my local machine.
Despite the fact I ran multiple layout algorithms to eventually get my network, there may be a better way for it to be shaped.
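
The first two limitations can be partly checked with a few lines of pandas. This is a sketch on invented rows, assuming a dataframe with the title and created columns described in step 3. The variable names and sample data are hypothetical.

```python
import pandas as pd

# Hypothetical rows with the same 'title' and 'created' columns as the dataframe
df_sample = pd.DataFrame({
    "title": ["House impeaches Trump", "Impeachment trial thread",
              "Weather today", "Should we IMPEACH?"],
    "created": ["2021-01-13 09:00:00", "2021-02-09 12:30:00",
                "2021-02-10 08:00:00", "2019-12-18 20:00:00"],
})

# Which titles contain "impeach" in any form, ignoring case
in_title = df_sample["title"].str.contains("impeach", case=False)

# How many posts were created in each year
posts_per_year = pd.to_datetime(df_sample["created"]).dt.year.value_counts().to_dict()

print(int(in_title.sum()), "of", len(df_sample), "titles mention the keyword")
print(posts_per_year)
```

A substring match like this also catches "impeachment" and "impeached", and the per-year counts give a quick sense of how many posts could plausibly refer to an earlier impeachment.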

Source: https://abcnews.go.com/Politics/trump-impeachment-trial-live-updates-senate-rejects-democrats/story?id=68410003

Conclusions

Overall, there was a decent amount of activity around “impeach”-related topics on Reddit, likely referring to Trump’s impeachment trial this year. Based on just a few posts across three different subreddits (r/politics, r/Conservative, and r/conspiracy), there has been a lot of discussion surrounding the impeachment trial of Trump, with thousands of comments from hundreds of users. Initially, I would have thought that this kind of discussion primarily took place in r/politics. While that’s true, it’s also become clear with this network analysis that it can take place in many other subreddits too, even if it’s just one user making that post. We can conclude there has been significant discussion surrounding the impeachment trial, historic as it is, and that one way or another, people certainly have an opinion on it.

References

[1] https://parklize.blogspot.com/2014/12/gephi-clustering-layout-by-modularity.html