How Well-Connected are UMD Students Online?

mpothen
INST414: Data Science Techniques
4 min read · May 9, 2022

An Analysis of the UMD subreddit network

Students use Reddit to get a “real” inside look at the student experience at their respective colleges; however, it is difficult to tell which posts are trustworthy. I analyzed the connections that have formed on the UMD subreddit. My goal was to measure how well-connected the subreddit is and to identify the most active users in the forum. These insights could help incoming UMD students judge which users may be more trustworthy when searching through the UMD subreddit.

I started by mapping out the relationship between users and the commenters on their posts. The source of my network data is the UMD subreddit page. I used the praw library to scrape this page and create a dictionary for every post, in which the key was the author of the post and the value was the list of users who commented on it. I then created a list of (author, commenter) tuples so that the data was in a format that could represent relationships. I used NetworkX to create the graph in a Jupyter Notebook. Using this library, I added the edges to a graph instance by looping through the list of tuples. Each node was a user (either an author or a commenter). Each edge represented a commenting relationship between two users. I also used Gephi for more refined visualizations of the network.
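
The step from per-post dictionaries to an edge list might look like the sketch below. The sample `posts` data and the `build_edge_list` helper are hypothetical stand-ins for the scraped results, and the notebook's actual code may differ. Skipping authors who comment on their own posts is also an assumption, made here to avoid self-loops.

```python
# Hypothetical sample: one dict per post, mapping the post's author
# to the list of users who commented on that post.
posts = [
    {"alice": ["bob", "carol"]},
    {"bob": ["alice"]},
    {"dave": []},  # a post with no comments produces no edges
]

def build_edge_list(posts):
    """Flatten per-post dicts into (author, commenter) tuples."""
    edges = []
    for post in posts:
        for author, commenters in post.items():
            for commenter in commenters:
                if commenter != author:  # assumption: ignore self-replies
                    edges.append((author, commenter))
    return edges

list_of_tuples = build_edge_list(posts)
print(list_of_tuples)
```

Note that `("alice", "bob")` and `("bob", "alice")` both appear in this list, but an undirected NetworkX graph stores them as a single edge.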

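import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()  # undirected graph of users
names = []
for i in list_of_tuples:
    names.append(tuple(i))  # one (author, commenter) pair per edge
G.add_edges_from(names)
nx.draw(G, with_labels=True)
plt.show()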
Starting point: Original graph of UMD subreddit depicting network of users

The resulting graph was undirected, with 514 nodes and 539 edges (data extracted from 300 posts). Since my goal was to find the most well-connected/trustworthy user, I used degree centrality: I treated the nodes with the highest degree (number of edges) as the most important. To narrow down the candidates, I filtered the network to nodes with 3 or more edges. A user named User_Establishment15 had the highest degree, at 16 edges. Users sarcastro16 and terpAlumnus also had a notably high number of connections. These users could be considered the most well-connected UMD Reddit users, and are therefore more likely to provide trustworthy information. This graph, which ranks degree by color, shows that the majority of users are not well connected. This may indicate that the majority of users are not active on the UMD subreddit.
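
To make explicit what "degree" counts here, the following is a minimal sketch in plain Python, written without NetworkX so that it runs with no extra dependencies. The `degree_counts` helper and the sample edges are hypothetical, not the scraped data. With NetworkX, `dict(G.degree())` should give the same counts for a simple undirected graph.

```python
from collections import Counter

def degree_counts(edges):
    """Count distinct neighbors per user in an undirected simple graph."""
    # frozenset treats (a, b) and (b, a) as the same edge, as nx.Graph does;
    # self-loops are dropped (an assumption for this sketch).
    unique_edges = {frozenset(e) for e in edges if e[0] != e[1]}
    degrees = Counter()
    for edge in unique_edges:
        for user in edge:
            degrees[user] += 1
    return degrees

# Hypothetical edge list for illustration only
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("a", "d"), ("d", "e")]
degrees = degree_counts(edges)

# Keep only users with 3 or more edges, mirroring the filter described above
active = {user: d for user, d in degrees.items() if d >= 3}
print(degrees.most_common(1))
print(active)
```

If a normalized score is wanted, NetworkX's `degree_centrality` divides each degree by n − 1, so a degree of 16 in a 514-node graph corresponds to 16/513 ≈ 0.031. The ranking of users is the same either way.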

Next, I studied the communities in the network by running the modularity statistic. The result showed 26 communities, in which the users with the highest degrees tended to be the leaders. User_Establishment15 and sarcastro16 have the two largest communities. When I increased the modularity resolution to 5, the number of communities dropped to 17, and User_Establishment15's cluster contained the majority of users, including sarcastro16 and terpAlumnus. This suggests that the most active users tend to end up in the same communities.

UMD Subreddit Communities: (Cluster by color & Degree by size)
Modularity Color Key

An issue I encountered was the strength of the network. Initially, I scraped 100 posts; however, the graph for this dataset consisted of many clusters of only two users. This indicated that I needed to extract data from more posts to get a better idea of how these users are interconnected. I therefore increased the number of posts to 300, yet the graph structure shows that the UMD subreddit community is still not very strongly connected. There are 2–3 users who are very active, but many users have interacted with only one other user.
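
One way to put a number on "many clusters of only two users" is to count connected components and their sizes. The sketch below does this with a breadth-first search in plain Python; the `connected_components` helper and the sample edges are hypothetical and were not part of the original analysis. NetworkX provides `nx.connected_components(G)` for the same purpose.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return a list of node sets, one per connected component."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

# Hypothetical edge list: two isolated pairs and one chain of three users
edges = [("a", "b"), ("c", "d"), ("e", "f"), ("f", "g")]
components = connected_components(edges)
sizes = sorted(len(c) for c in components)
pairs = sum(1 for s in sizes if s == 2)
print(sizes, pairs)
```

Comparing the number of two-user components at 100 posts and at 300 posts would show whether scraping more posts actually merges the isolated pairs into larger groups.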

My data was limited in that I do not have any information on users' ages, majors, or genders, which would be useful for understanding what types of users are most active. My analysis is also limited because it does not take the content of user posts into account. To take this project to the next level, I would like to do a text analysis on the content from the various clusters in the network to identify the popular topics within each group.

Conclusion

The main takeaways are that the UMD subreddit community is not very well connected: the majority of users have a low degree, or a low level of interaction with other users; the accounts User_Establishment15, sarcastro16, and terpAlumnus are the most active users in the network; and though there are many communities within the network, each community holds only a few users with a high level of activity. Incoming UMD students may not receive the most up-to-date information from this weak network. At most, students can rely on the few users identified as the most active to gather information on the UMD experience.

See Jupyter Notebook for data collection: UMDRedditAnalysis
