Discovering Reddit communities with common moderators

AS
Social Media: Theories, Ethics, and Analytics
4 min read · Oct 14, 2020

Discovering new subreddits is beneficial to any research on a given topic. To find communities on a specific topic, we could scrape all of Reddit and look for keywords in descriptions or top comments. As of 2018, Reddit had 138,000 active communities, a number large enough to make the text-mining approach impractical at best. Perhaps a better way is to explore communities related to the ones we already know are of interest. This idea motivates the following questions.

Question 1: How can we discover communities related to the topic of interest?

Question 2: How can we identify significant communities on a given topic?

The rest of this post describes an attempt to answer these questions by exploring related communities.

Data Collection

Attempt 1

To accomplish my research goal, I decided to use the “related” subreddits that each community recommends. Related subreddits are stored under a widget object, so iterating through those was a natural place to start. This approach fell apart fairly quickly.

Not all subreddits use the “related” objects in their sidebar. Related subreddits are sometimes stored in a text box under other widgets. My first thought was to account for that in the scraping script, but since I could not foresee all the places where the data of interest might be stored, I would have no way to verify that my data set was complete. So it was back to the drawing board.
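The difficulty can be illustrated with a small sketch. It assumes widget objects shaped like the ones PRAW exposes for a subreddit sidebar (a `kind` attribute, a `data` list of subreddits on community lists, a `text` field on text areas). The function name, the regular expression, and the stand-in widgets are hypothetical and are not the script used for this post.

```python
import re
from types import SimpleNamespace

# Matches "r/name" or "/r/name" mentions in free text (assumed pattern).
SUBREDDIT_MENTION = re.compile(r"\br/([A-Za-z0-9_]+)")


def related_subreddits(widgets):
    """Collect related subreddit names from sidebar-like widget objects."""
    found = set()
    for widget in widgets:
        kind = getattr(widget, "kind", None)
        if kind == "community-list":
            # Structured case: the widget lists subreddits directly.
            found.update(sub.display_name for sub in widget.data)
        elif kind == "textarea":
            # Unstructured case: best-effort scan of free text.
            found.update(SUBREDDIT_MENTION.findall(widget.text))
        # Any other widget kind is ignored.
    return found


# Stand-in widgets for illustration; real ones would come from the Reddit API.
widgets = [
    SimpleNamespace(kind="community-list",
                    data=[SimpleNamespace(display_name="gamedesign")]),
    SimpleNamespace(kind="textarea", text="See also r/indiedev and /r/unity3d."),
    SimpleNamespace(kind="image"),
]
print(sorted(related_subreddits(widgets)))
```

Even with the free-text fallback, nothing guarantees that a community mentions its related subreddits in a widget at all, which is why completeness could not be verified.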

Attempt 2

This time I decided to use subreddit moderators to explore the network. I picked r/gamedev as the starting point for the search. I wrote a recursive function that scrapes the moderators of a subreddit and then adds every subreddit each of those users moderates. Initially, I decided to go five layers deep (not counting the initial node). After an hour of scraping, the Jupyter notebook crashed. A quick look suggested that the script had not even finished scraping the third layer. With that in mind, I stopped at the second layer. I worked from the assumption that a smaller network would not hinder the validity of the approach.
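A depth-limited version of the crawl could look roughly like the sketch below. It swaps the recursive function for an iterative breadth-first search, which makes the layer limit explicit, and it takes the two lookups as parameters so the logic can be tried without touching the Reddit API. The names `crawl`, `get_moderators` and `get_moderated` are hypothetical; with PRAW they might wrap `subreddit.moderator()` and `redditor.moderated()`, but that wiring is an assumption and not the original script.

```python
from collections import deque


def crawl(root, get_moderators, get_moderated, max_depth=2):
    """Breadth-first crawl of subreddits linked by shared moderators.

    Returns (depth, edges): depth maps each subreddit to its layer,
    edges is a set of unordered subreddit pairs.
    """
    depth = {root: 0}
    edges = set()
    queue = deque([root])
    while queue:
        sub = queue.popleft()
        # Nodes at the depth limit are kept as endpoints but not expanded.
        if depth[sub] >= max_depth:
            continue
        for mod in get_moderators(sub):
            for other in get_moderated(mod):
                if other == sub:
                    continue
                # frozenset makes the edge undirected and de-duplicated.
                edges.add(frozenset((sub, other)))
                if other not in depth:
                    depth[other] = depth[sub] + 1
                    queue.append(other)
    return depth, edges


# Made-up lookup tables standing in for API calls.
MODS = {"gamedev": ["alice"], "indiedev": ["alice", "bob"], "cooking": ["bob"]}
MODERATED = {"alice": ["gamedev", "indiedev"], "bob": ["indiedev", "cooking"]}

depth, edges = crawl(
    "gamedev",
    lambda sub: MODS.get(sub, []),
    lambda user: MODERATED.get(user, []),
    max_depth=2,
)
print(depth)
```

Because each subreddit is expanded at most once, the number of lookups is bounded by the number of nodes above the depth limit, which is the part that grows quickly with every extra layer.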

Cleaning the Data

Initially, I did not expect to have to clean the data. I made sure that only unique nodes were added to the results, in part to minimize the number of calls to the Reddit API. I constructed the graph and exported it as GraphML. When I imported the graph into Gephi, the result looked immediately wrong.

I used Gephi's degree analysis to better understand what exactly the problem was. The top results had an abnormally high degree. The “shittyaskscience” subreddit showed a degree of 12,392, which is evidently incorrect. I could not find the cause of this abnormality and decided to remove it from the data set as an outlier. I restarted the scraping algorithm, excluding the “QuantumInformation” and “shittyaskscience” subreddits. These nodes still appear in the graph, but only as endpoints. The final set includes 139 nodes connected by 617 edges, organized into an undirected graph.
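A degree ranking like the one Gephi produces can also be computed directly from the edge list, which makes it easier to spot suspicious nodes before exporting. The sketch below is an assumption about how such a check might look, not part of the original workflow; `degree_table`, `impossible_degrees` and the sample edges are hypothetical. It counts every listed edge, so repeated edges inflate the degree, and it flags nodes whose degree exceeds what a simple graph allows (at most n - 1 for n nodes).

```python
from collections import Counter


def degree_table(edge_list):
    """Count how many listed edges touch each node (duplicates included)."""
    deg = Counter()
    for a, b in edge_list:
        deg[a] += 1
        deg[b] += 1
    return deg


def impossible_degrees(deg):
    """Nodes whose degree exceeds n - 1, the maximum in a simple graph."""
    limit = len(deg) - 1
    return sorted(node for node, d in deg.items() if d > limit)


# Made-up edge list in which one edge was recorded three times.
sample_edges = [
    ("gamedev", "indiedev"),
    ("gamedev", "unity3d"),
    ("indiedev", "unity3d"),
    ("gamedev", "indiedev"),
    ("gamedev", "indiedev"),
]
deg = degree_table(sample_edges)
print(deg.most_common(3))
print(impossible_degrees(deg))
```

A flag from this check only says that some edges were counted more than once; it does not by itself explain the anomaly described above.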

Results

The results provided some interesting insights. The overall density of the graph was lower than expected. The diameter is unsurprising for a search limited to two layers from the root. The high number of bridge nodes confirmed that this approach can be used to discover communities and clusters of communities that are not immediately obvious.
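For reference, the density of a simple undirected graph is 2m / (n(n - 1)), where n is the number of nodes and m the number of edges. The sketch below applies that standard formula to the node and edge counts reported above. It assumes the final graph has no self-loops or parallel edges, and the value is plain arithmetic on those counts, not a figure read from Gephi.

```python
def density(nodes, edges):
    """Share of possible node pairs that are connected in a simple undirected graph."""
    if nodes < 2:
        return 0.0
    return 2 * edges / (nodes * (nodes - 1))


# 139 nodes and 617 edges, as reported for the final data set.
print(round(density(139, 617), 4))  # 0.0643
```

Under those assumptions, roughly 6% of all possible pairs of subreddits are directly connected.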

However, the range of topics that individual redditors moderate is wider than expected. As a result, relatively few nodes are related to the original topic. Some clusters contain only completely unrelated nodes.

The node degree and centrality analysis suggests that the r/gamedev community was not the most significant despite being the root node. However, of the ten nodes with the highest degree, at least six are related to the topic of interest. Similarly, of the ten nodes with the highest closeness centrality, seven are connected to game development.
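To make the closeness ranking concrete, here is a minimal sketch of the textbook definition: the number of other reachable nodes divided by the sum of shortest-path distances to them, computed with a breadth-first search. The function names and the three-node example are hypothetical, and Gephi's own normalization may differ from this one, so the numbers are not expected to match its output exactly.

```python
from collections import deque


def adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def closeness(adj, source):
    """Closeness of source over the nodes it can reach (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    # Isolated nodes reach nothing, so their closeness is defined as 0 here.
    return (len(dist) - 1) / total if total else 0.0


# Made-up chain: gamedev - indiedev - unity3d
adj = adjacency([("gamedev", "indiedev"), ("indiedev", "unity3d")])
ranking = sorted(adj, key=lambda node: closeness(adj, node), reverse=True)
print(ranking[0])
```

In this toy chain the middle node ranks first, which mirrors how a root node can end up less central than a community that sits between several others.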

The results are less clear when looking at the betweenness centrality. The overall highest result is a bridge to a cluster that contains no communities related to game development.

Final Thoughts
The approach of using common moderators to discover new subreddits has proven itself successful. Determining how closely the new subreddits relate to the topic of interest is out of scope for this exploratory approach. The qualitative analysis of results is possible in a smaller network but would be impractical in any large dataset.
