Similarities Between Subreddits

Ameerah Dasti
INST414: Data Science Techniques
4 min read · Dec 2, 2023

Insight

Reddit is a social media platform used by millions of people to ask questions, share opinions, start discussions, post content, sell products, and much more. The site is organized into forums called subreddits, each dedicated to a specific topic or category, and anyone can join, view, post, and comment in them. Using data from Reddit, I wanted to measure how similar subreddits are to one another based on the common words found across a selection of them. Finding similarities between subreddits and building a web of subreddits can help researchers analyze the similarities and differences between communities, such as the language used, common opinions, and demographics. Furthermore, this insight and analysis can be used to recommend different subreddits and topics to Reddit users.

Data Collection

The network data for this analysis came from Reddit’s API. I installed Reddit’s Python library, the Python Reddit API Wrapper (PRAW). After installing PRAW and importing my credentials, I used reddit.subreddits.popular to fetch and print the names of 15 popular subreddits. Next, a for loop iterated over those 15 subreddits and used subreddit.new(limit=20) to fetch the 20 most recent posts from each one, along with their comments. The post titles and comments from each subreddit were then combined into a single string and stored in the subreddit_texts dictionary, keyed by the subreddit’s name.
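A minimal sketch of this collection step might look like the following. The credential values are placeholders, and the replace_more call is my own assumption about how the comments were flattened; you would swap in your own Reddit app credentials before running it:

```python
import praw

# Placeholder credentials -- replace with your own Reddit app values.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="subreddit-similarity-demo",
)

# Fetch and print the names of 15 popular subreddits.
popular = list(reddit.subreddits.popular(limit=15))
for sub in popular:
    print(sub.display_name)

# Combine titles and comments from the 20 newest posts of each subreddit.
subreddit_texts = {}
for sub in popular:
    pieces = []
    for post in sub.new(limit=20):
        pieces.append(post.title)
        post.comments.replace_more(limit=0)  # skip "load more comments" stubs
        pieces.extend(comment.body for comment in post.comments.list())
    subreddit_texts[sub.display_name] = " ".join(pieces)
```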

15 popular subreddits used as nodes

Nodes, Edges, and Importance

Each node in the plot represents one of the 15 popular subreddits fetched from the Reddit API. Each edge between two subreddit nodes represents the similarity between those subreddits based on common words: to be considered similar, two subreddits had to share at least six common words. A for loop iterated over the subreddits two at a time, building a set of words for each subreddit and computing a variable named common_count, the number of common words shared by the pair. If that number was greater than five, an edge was added between the two subreddits. Importance in this graph corresponds to the number of shared common words between subreddits. After sorting nodes by their degrees, “Home,” “AskReddit,” and “BaldursGate3” appeared to be the three most important nodes.
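A rough sketch of this pairwise comparison is below. It assumes the subreddit_texts dictionary from the collection step, and word_set is a hypothetical helper applying the cleaning rules described later (lowercasing and a minimum word length):

```python
import itertools
import networkx as nx

# Hypothetical helper: lowercase the combined text and keep words of at
# least five letters (the cleaning rule described in the next sections).
def word_set(text):
    return {word for word in text.lower().split() if len(word) >= 5}

# Build the graph from the subreddit_texts dictionary created above.
G = nx.Graph()
G.add_nodes_from(subreddit_texts)

# Compare every pair of subreddits; connect them if they share more than
# five common words, storing the count as the edge weight.
for a, b in itertools.combinations(subreddit_texts, 2):
    common_count = len(word_set(subreddit_texts[a]) & word_set(subreddit_texts[b]))
    if common_count > 5:
        G.add_edge(a, b, weight=common_count)

# Rank nodes by degree to find the most connected subreddits.
print(sorted(G.degree, key=lambda pair: pair[1], reverse=True)[:3])
```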

Software Used

To facilitate my network analysis, I used NetworkX, a Python library used to study and analyze graphs and networks. I used the NetworkX library to add selected subreddits as nodes and to add edges between subreddits based on shared common words. Additionally, I used NetworkX to draw a plot representing nodes and the edges between them.
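The drawing step might look roughly like this; the spring layout and styling are my own illustrative choices, not necessarily the exact settings used for the plot shown later:

```python
import matplotlib.pyplot as plt
import networkx as nx

# Draw the subreddit graph G built earlier, with subreddit names as labels.
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_size=800, node_color="lightblue", font_size=8)
plt.show()
```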

Data Cleaning and Bugs

Since post titles and comments were combined into a single string, some data cleaning had to be done. I used the split() method to break each combined string into a list of words, and the lower() method to convert all letters to lowercase, which prevented capitalization from getting in the way of identifying and counting common words. I also required common words to be at least five letters long in an attempt to filter out some prepositions and articles.

One possible bug others may encounter is exceeding the rate limit. After testing and running my code a few times, I got an error stating that I had exceeded the rate limit. The normal solution is to wait the appropriate amount of time before sending requests again; however, since I had to keep building and modifying my code, I imported credentials from a different account and sent requests again. Another issue someone could run into with this code is that it filters out important common words that fall under the length cutoff. To fix this, one would have to modify the code to exclude prepositions and articles explicitly, rather than filtering words by length.
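As a sketch of that fix, one could filter against an explicit stop-word list instead of a length threshold; the list below is a small, made-up sample and would need to be expanded for real use:

```python
# Hedged alternative to the length filter: drop an explicit set of stop
# words so that short but meaningful words are kept as candidates.
STOPWORDS = {
    "the", "a", "an", "and", "or", "but", "of", "in", "on", "at", "to",
    "for", "with", "from", "about", "between", "into", "this", "that",
}

def clean_words(text):
    # Lowercase, split on whitespace, and exclude stop words instead of
    # excluding every word under a length cutoff.
    return {word for word in text.lower().split() if word not in STOPWORDS}
```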

Plot displaying subreddit nodes and edges

Limitations and Biases

One limitation of my analysis is that it does not take into account important common words that fall under the length cutoff. Moreover, some articles and prepositions longer than the cutoff may still end up among the stored common words. Another limitation is that not every common word found is necessarily an important keyword: many of the words shared across subreddits are likely generic words used in everyday sentences rather than words specific to a particular category or subreddit. This could bias the analysis and keep it from picking up keywords that would actually be useful.

You can find the code for this analysis here: https://github.com/adasti/INST414
