Reddit network analysis

Meehir Bhalla
INST414: Data Science Techniques
4 min read · Apr 26, 2022

The non-obvious insight I wanted to extract from my Reddit data was which users in r/buildapc and r/MechanicalKeyboards have replied to one another in the comment section of a given post. I chose these two subreddits because they cover related topics and therefore attract similar posts and users. Combining them also yields more users to investigate and a larger network. Visualizing these reply relationships as a network helps describe how users interact with one another. A network can identify users with high betweenness centrality, which marks them as critical nodes, and it can be partitioned into clusters that indicate groups of users. Together, this shows who connects which clusters, which users link the most clusters together, and how those users bring otherwise separate user populations into contact.

I used NetworkX and Gephi to create my networks, with the NetworkX library in Python doing the bulk of the work. I obtained the network data from the Reddit API through the praw library in Python, after creating a Reddit application and authenticating with its client ID and client secret. I then combined the r/buildapc and r/MechanicalKeyboards subreddits to access a larger population of users with similar content. Collecting the data was simple: I iterated through the ‘hot’ posts in both subreddits to gather relevant, highly upvoted posts that contained comments. I then created nodes representing comment authors and edges connecting each comment's author to the author of its parent. Using this data, I built an undirected network by populating an nx.Graph() with my nodes and edges and drew it with nx.draw(). Finally, I used Gephi to make the network easier to read: adjusting repulsion strength to separate clusters of users, sizing nodes by betweenness centrality, color-coding communities by modularity class, and so on.
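The collection and graph-building steps described above could be sketched roughly as follows. This is not the original code: the helper names `fetch_reply_pairs` and `build_reply_graph`, the credentials, the user agent, and the post limit are placeholders I am assuming for illustration. The sketch also assumes that a top-level comment counts as a reply to the post's author and that self-replies are ignored, neither of which the write-up specifies.

```python
import networkx as nx


def fetch_reply_pairs(client_id, client_secret, user_agent, post_limit=25):
    """Yield (comment_author, parent_author) name pairs from 'hot' posts.

    Hypothetical helper: credentials and post_limit are placeholders.
    """
    import praw  # imported lazily so the graph code below runs without praw

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    # "a+b" requests a combined listing of both subreddits.
    combined = reddit.subreddit("buildapc+MechanicalKeyboards")
    for submission in combined.hot(limit=post_limit):
        submission.comments.replace_more(limit=0)  # drop "load more" stubs
        for comment in submission.comments.list():
            parent = comment.parent()  # a Comment, or the Submission itself
            # Deleted accounts have author None; skip those pairs.
            if comment.author is None or parent.author is None:
                continue
            yield comment.author.name, parent.author.name


def build_reply_graph(pairs):
    """Build an undirected graph with one edge per pair of users who replied."""
    graph = nx.Graph()
    for author, parent_author in pairs:
        if author != parent_author:  # assumption: ignore self-replies
            graph.add_edge(author, parent_author)
    return graph


# Tiny made-up sample standing in for real API output.
sample_pairs = [("alice", "bob"), ("carol", "bob"), ("bob", "alice"), ("dave", "dave")]
sample_graph = build_reply_graph(sample_pairs)
```

With real credentials, `build_reply_graph(fetch_reply_pairs(...))` would produce the graph that is then passed to nx.draw(). Because the graph is undirected, a reply from bob to alice and one from alice to bob collapse into a single edge.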

nx.draw() - r/buildapc and r/MechanicalKeyboards subreddits.

In the nx.draw() output you can already pick out high-degree nodes that link the various clusters together, but these features are easier to see in Gephi. In the Gephi network below, larger nodes represent users with high betweenness centrality. These nodes are important ‘bridges’ in the network: in this case, users who are the common connection between two groups of users. Users like Nuclear_Niijima and AKC6 are important nodes because they bridge clusters together, and their larger size reflects their high betweenness centrality.
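The article reads betweenness centrality off the Gephi visualization, but the same measure can also be computed in NetworkX before export. The sketch below uses a made-up toy graph (two triangles joined through a node named `bridge`) rather than the Reddit data. The helper `export_for_gephi` and the attribute name `betweenness` are my own illustrative choices.

```python
import networkx as nx

# Toy graph: two tight groups that only meet through one user.
toy = nx.Graph()
toy.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),   # first cluster
    ("d", "e"), ("e", "f"), ("d", "f"),   # second cluster
    ("c", "bridge"), ("bridge", "d"),     # the only path between them
])

# Normalized share of shortest paths that pass through each node.
centrality = nx.betweenness_centrality(toy)

# Rank users from most to least "bridge-like".
ranked = sorted(centrality, key=centrality.get, reverse=True)
top_user = ranked[0]


def export_for_gephi(graph, scores, path):
    """Attach the scores as a node attribute and write a GEXF file for Gephi."""
    nx.set_node_attributes(graph, scores, "betweenness")
    nx.write_gexf(graph, path)
```

Here `top_user` is the `bridge` node, since every shortest path between the two triangles runs through it. Calling `export_for_gephi(toy, centrality, "toy.gexf")` would write a file that Gephi can open, with the score available for sizing nodes.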

Gephi network.

I ran into an error while trying to draw my network graph with nx.draw(): it reported that my random_state_index was incorrect. The cause was an outdated version of the decorator package. After I installed the latest version of decorator, nx.draw() was able to display my graph. The error also improved my knowledge of the NetworkX library, since I had to read the nx.draw() documentation and became aware of more of the function's options and quirks. As for limitations of my data, my network consists of many clusters, and because each one is limited in size and depth, interpreting any single cluster is difficult. My analysis also focused on nodes of high betweenness centrality, so it investigated only one kind of important node.

The main takeaway from my analysis is that the two subreddits contain many connected individuals, who together make up a handful of clusters. When I investigated the users of high betweenness centrality that bridge clusters together, the importance of these nodes became quite apparent. Removing these users would break clusters apart and harm the structural integrity of the network, which was interesting to see. Moreover, studying networks in a social application like Reddit shows how much users depend on these relationships, and how they allow users who are not directly connected to come into contact.
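The claim that removing a bridge user fragments the network can be checked by deleting the highest-betweenness node and counting connected components before and after. The sketch below again uses a made-up toy graph, not the Reddit data, and the helper name `components_after_removing_top_bridge` is hypothetical.

```python
import networkx as nx


def components_after_removing_top_bridge(graph):
    """Return (components_before, components_after, removed_node)."""
    before = nx.number_connected_components(graph)
    centrality = nx.betweenness_centrality(graph)
    top_node = max(centrality, key=centrality.get)

    pruned = graph.copy()          # leave the original graph untouched
    pruned.remove_node(top_node)   # also drops every edge touching the node
    after = nx.number_connected_components(pruned)
    return before, after, top_node


# Toy graph: two triangles connected only through "bridge".
toy = nx.Graph([
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "bridge"), ("bridge", "d"),
])

before, after, removed = components_after_removing_top_bridge(toy)
```

On this toy graph the single connected component splits in two once `bridge` is removed. The same check on the real reply network would show how many groups each high-betweenness user holds together.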


Published in INST414: Data Science Techniques

This publication includes student-authored articles from INST414, focusing on data science techniques in a wide range of areas

Written by Meehir Bhalla

Undergrad @ UMD — College Park ~ Passionate about big data, data science, data visualization, and machine learning. I enjoy cooking, running, and music! 🌀