Analyzing the Structure of the chelseafc Subreddit Network

Casey Tabatabai
INST414: Data Science Techniques
3 min read · May 10, 2022

Reddit is a social media platform made up of millions of communities called subreddits, where users discuss a specific topic with each other. The subreddit that I will be focusing on in this assignment is the chelseafc subreddit. Chelsea Football Club is one of the best soccer teams in the world and is based in London, United Kingdom. I have been a Chelsea fan for as long as I can remember, and I have also loosely followed the chelseafc subreddit. In this assignment I want to learn more about the network formed by the chelseafc subreddit and find out which users I should follow if I want to see what the most engaged users are currently discussing.

The source of my network data is the chelseafc subreddit, collected through Reddit’s API. To access it, I had to create a Reddit application on Reddit’s app page in order to receive the unique identifiers needed to request the data. To create my network, I used the NetworkX Python package and iterated through the posts and comments of the chelseafc subreddit, adding nodes and edges as I went. Each node in my network represents a user who made one of the top twenty comments on one of the top ten posts of all time in the chelseafc subreddit, or who made one of the top ten replies to one of those twenty comments. Each edge represents an interaction between two users through a reply to a comment.
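The construction described above can be sketched roughly as follows. This is a minimal illustration and not the exact code used for the assignment: the `posts` structure, the usernames, and the `build_reply_graph` helper are hypothetical stand-ins for whatever the Reddit API returns, and the graph is assumed to be undirected.

```python
import networkx as nx

# Hypothetical stand-in for data pulled from Reddit's API: each post holds
# top-level comments, and each comment holds its replies. Usernames are invented.
posts = [
    {
        "comments": [
            {"author": "user_a", "replies": [{"author": "user_b"}, {"author": "user_c"}]},
            {"author": "user_b", "replies": [{"author": "user_a"}]},
        ]
    },
]


def build_reply_graph(posts, max_posts=10, max_comments=20, max_replies=10):
    """Add one node per commenter and one edge per reply interaction."""
    graph = nx.Graph()  # assumption: interactions are treated as undirected
    for post in posts[:max_posts]:
        for comment in post["comments"][:max_comments]:
            graph.add_node(comment["author"])
            for reply in comment["replies"][:max_replies]:
                # add_edge creates the reply author's node if it is not there yet
                graph.add_edge(reply["author"], comment["author"])
    return graph


graph = build_reply_graph(posts)
print(graph.number_of_nodes(), graph.number_of_edges())
```

Because the graph is undirected, two users who reply to each other in different threads still share a single edge.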

The importance of a node in a graph can be determined with a few different measures, depending on what you want to look at. The three measures I will use in this assignment to test node importance are degree centrality, closeness centrality, and betweenness centrality. Degree centrality measures importance by the number of edges a node has. Closeness centrality measures importance by the average shortest distance from a node to every other node, so nodes that are closer to everyone else score higher. Betweenness centrality measures importance by how frequently a node lies on the shortest path between other nodes. In my results, the user named “Zarly88” is clearly the most important node in the chelseafc subreddit network, with the highest degree centrality, closeness centrality, and betweenness centrality. Additionally, the users “MoDollazz”, “Vicar13”, and “Deuce_GM” are all important nodes, as they rank in the top ten for both closeness centrality and betweenness centrality.
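As a rough sketch of how these three measures can be computed with NetworkX, the example below uses a small toy graph with invented usernames rather than the real subreddit data. The `top_nodes` helper is a hypothetical convenience function for ranking, not part of NetworkX.

```python
import networkx as nx

# Toy graph with invented usernames; "hub" interacts with most other users.
graph = nx.Graph()
graph.add_edges_from([("hub", "a"), ("hub", "b"), ("hub", "c"), ("c", "d")])


def top_nodes(scores, n=10):
    """Return the n highest-scoring (node, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


# Each function returns a dict mapping node -> centrality score.
degree = nx.degree_centrality(graph)
closeness = nx.closeness_centrality(graph)
betweenness = nx.betweenness_centrality(graph)

for name, scores in [("degree", degree), ("closeness", closeness), ("betweenness", betweenness)]:
    print(name, top_nodes(scores, n=3))
```

In this toy graph "hub" ranks first on all three measures, which mirrors the way a single user can dominate every ranking in a small reply network.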

One issue that I struggled with throughout this assignment was a “ValueError: None cannot be a node” error that I kept receiving when trying to recursively iterate through each post and comment. The error specifically occurred when a new node was being added for a comment’s author, which confused me because I believed that each comment should have a username attached to it. However, after doing a bit more research on Reddit, I realized that users can delete their accounts without their comments disappearing, which likely explains why this happened. I was able to solve this issue with an if statement that checks whether a comment’s author is None and, if so, uses continue to skip to the next iteration of the loop.
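A minimal sketch of that guard is shown below, assuming the interactions have already been reduced to pairs of usernames. The `interactions` list and its usernames are hypothetical, with None standing in for a comment whose author is missing.

```python
import networkx as nx

# Hypothetical (comment_author, reply_author) pairs; None stands in for a
# comment whose author is missing, for example after an account deletion.
interactions = [
    ("user_a", "user_b"),
    (None, "user_c"),
    ("user_a", None),
    ("user_b", "user_c"),
]

graph = nx.Graph()
for comment_author, reply_author in interactions:
    # NetworkX refuses None as a node, so skip any pair with a missing author.
    if comment_author is None or reply_author is None:
        continue
    graph.add_edge(comment_author, reply_author)

print(sorted(graph.nodes()))
```

Skipping these pairs drops the interaction entirely, so replies to or from deleted accounts are not counted in any centrality score.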

One limitation of my data and analysis is the number of posts and comments I used to create my graph. Overall, I had 176 nodes and 241 edges in my network, but the chelseafc subreddit has far more content that I could have used for the graph. It would have been interesting to get a broader look at which user is the most important node in the network overall, rather than just looking at the top ten posts of all time in the subreddit.

My main takeaway from this assignment is the knowledge I gained about different centrality measures. I had a brief introduction to the various centrality measures in the past, but this assignment allowed me to understand them in more depth.
