Data Science Tutorial

Building a Reddit Recommendation System

With Collaborative Filtering, Implicit Data Collection, K Nearest Neighbors, and Cosine Similarity

Doug Rizio
Nerd For Tech


Reddit — AKA, “The Front Page of the Internet.” Just over 15 years ago, Reddit began as a small and little-known website made by college students, featuring anonymous forums or “subreddits” for topics mostly related to science and programming. Now, Reddit is one of the largest social networking platforms in the world, with an estimated 430,000,000 active users and 100,000 active subreddits currently online as of early 2021.

Users on Reddit submit everything from pictures, animated GIFs, and videos to text-based opinion posts and links to news stories. While some of the content on Reddit is original, much of it is shared or “reposted” from other users. Users can comment on posts in the subreddits that they follow, and they can also “upvote” or “downvote” those comments and posts as a democratic way of raising or lowering the content’s popularity, and thus its visibility to other users.

Each subreddit is dedicated to a single particular subject, and although some of the most popular subreddits are relatively broad in scope such as “r/news” or “r/politics”, there are scores of others devoted to an unimaginable range of niche topics.

However, with a hundred thousand unique options to choose from, how might a Reddit user know which subreddit to subscribe to?

Fortunately, the Reddit platform already utilizes a recommendation system in its search function, and a whole world of subreddits opens up to you for even the simplest of searches. But what if it didn’t? What if Reddit was somehow developed a decade ahead of schedule in an alternate timeline, before the popularization of machine learning algorithms or even search engines? What if the only way for a user to find a subreddit that might interest them was to type the URL into the web browser directly, or to click on a link to the subreddit from another page that hosted it?

Sure, a user in this alternate universe who is interested in something like art might know that they should simply navigate to “reddit.com/r/Art.” Perhaps they would get creative, and get the idea to enter something like “reddit.com/r/ConceptArt” or “reddit.com/r/Drawing.” But would they think to type in “r/DrawForMe?” What about “r/ICanDrawThat?” Or maybe “r/IDrewAPicture?” “r/RandomActsOfDrawing?”

The more we explore Reddit, the more obscure and peculiarly-titled subreddits we find. Yet without a precise way of finding similar items to the ones we are interested in, we have no chance of visiting the vast range of communities that call Reddit home.

So, let’s solve this hypothetical problem, and learn how to build a Reddit recommendation system. That way, whatever universe our users are in, we can help them find their next favorite subreddit, based on the ones that they already know.

INTRODUCING THE ALGORITHMS

In order to develop a program that can recommend subreddits to a user, we have to understand the basics of how such a similarity search might work, and what algorithms we have to use.

Similarity Matrix

In my earlier article about searching for similar content, I talked about Collaborative filtering and how we can implement it using the Cosine Similarity metric. This article also relies on Cosine Similarity and Collaborative filtering — but in this case, the filtering is based on items rather than users and uses a Similarity Matrix of user ratings. To put it simply, if one user gives similar ratings to the same items that another user has also rated, then it stands to reason that the two users themselves are similar to each other, at least in terms of preferences, and would have similar preferences for other items even if one of the users has not already rated some of them. Likewise, it stands to reason that a set of items rated similarly by the same users are also similar to each other. We also intend to use the K-Nearest Neighbors algorithm along with Cosine Similarity to identify items that are in close proximity to each other. And, while the users in a dataset don’t always give explicit ratings to the items in question, we can estimate a set of implicit ratings based on certain user behaviors, such as how many times they have interacted with one item compared to another item.
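To make the metric concrete, here is a minimal sketch of Cosine Similarity between item vectors. The ratings are made-up numbers for three hypothetical subreddits and four hypothetical users, and the function name is only illustrative.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two rating vectors (1.0 = same direction)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical implicit ratings: one row per subreddit, one column per user.
ratings = np.array([
    [1.0, 0.5, 0.0, 0.0],   # subreddit X
    [0.8, 0.4, 0.0, 0.1],   # subreddit Y: rated much like X by the same users
    [0.0, 0.0, 1.0, 0.9],   # subreddit Z: a different audience
])

sim_xy = cosine_similarity(ratings[0], ratings[1])
sim_xz = cosine_similarity(ratings[0], ratings[2])
print(f"X vs Y: {sim_xy:.3f}, X vs Z: {sim_xz:.3f}")
```

Subreddits X and Y are rated by the same users in similar proportions, so their similarity is close to 1, while X and Z share no users at all and score 0.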

SETTING UP AND COLLECTING DATA

While I wasn’t able to register for the Reddit API, I was able to find a large Reddit dataset on Kaggle collected over the course of 5 days in 2017. The dataset features 14,000,000 rows, each pairing one of over 20,000 unique usernames with a subreddit that the user has commented on. Despite its size, however, this dataset is severely limited: it doesn’t include user posts, it doesn’t show the contents of the comments themselves, and none of the data is ready for use in any kind of recommendation system as it is. I wasn’t even sure that it had enough information to work with at first, but fortunately I was able to extract implicit ratings from the users’ comment counts and restructure the data into a usable form.

As is the case with every data science project, our first step is to import the libraries we plan to use throughout the project. Only after we load these into Python can we start to work with the dataset that we have downloaded, and convert the CSV file into a readable dataframe.

Importing Libraries and Reading Dataframe
  • Pandas for dataframes
  • Numpy for calculations
  • Scipy for creating matrices
  • Scikit-Learn for machine learning algorithms
  • Matplotlib and Seaborn for visualizations.
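A setup along these lines might look like the sketch below. The column names (username, subreddit, utc) and the file name are assumptions, and a tiny in-memory CSV stands in for the real download so that the snippet runs on its own.

```python
import io

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
# The plots additionally need:
# import matplotlib.pyplot as plt
# import seaborn as sns

# Stand-in for the downloaded CSV; with the real file this would be
# something like pd.read_csv("reddit_data.csv") (file name is hypothetical).
sample_csv = io.StringIO(
    "username,subreddit,utc\n"
    "user_a,AskReddit,1484000000\n"
    "user_a,AskReddit,1484000060\n"
    "user_b,pics,1484000120\n"
)
df = pd.read_csv(sample_csv)
print(df.head())
```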

RESTRUCTURING THE DATA

Dataframe Information

The first thing we notice is that this dataset contains a datetime column in addition to username and subreddit. We have no reason to use this feature, so we drop it — and then print the number of unique usernames and subreddits in the dataset.

Dropping datetime column, finding number of usernames and subreddits
Number of Usernames: 22610
Number of Subreddits: 34967
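As a rough illustration of this step, assuming the datetime column is called utc (a guess about the file, not confirmed here), the drop and the two counts could be written as follows on a toy dataframe.

```python
import pandas as pd

# Toy stand-in for the full dataset; column names are assumptions.
df = pd.DataFrame({
    "username": ["user_a", "user_a", "user_b", "user_b"],
    "subreddit": ["AskReddit", "AskReddit", "pics", "AskReddit"],
    "utc": [1484000000, 1484000060, 1484000120, 1484000180],
})

# The timestamp plays no part in the ratings, so it is dropped.
df = df.drop(columns="utc")

n_users = df["username"].nunique()
n_subreddits = df["subreddit"].nunique()
print("Number of Usernames:", n_users)
print("Number of Subreddits:", n_subreddits)
```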

Note: usernames are blacked out to preserve anonymity.

Dataframe Head and Tail

Here we have the head and tail of the dataframe, with the same username listed repeatedly on the left and a representation of a single comment made by that user in a certain subreddit listed on the right. When the same subreddit and username repeat in succession, it means that the user has made two comments in that subreddit. But here comes the tricky part: even though we have millions of these rows in our dataset, we don’t have a list of explicit ratings for anyone, and so we don’t have any way of measuring a user’s preferences for the subreddits that they’ve commented on.

Or do we? By calculating the total number of comments that each user has made in each subreddit, then calculating the maximum number of comments that each user has made in any subreddit, and then dividing the total by the maximum, we can generate an implicit rating that represents a user’s interest in one subreddit that they have commented on compared to all of the other subreddits that they have commented on.

Code for Creating a Rating System
Dataframe with Total Comments (left), Dataframe with Maximum Comments (center), Dataframe with Both (right)

For example: if User A has commented on Subreddit B 10 times, Subreddit C 50 times, and Subreddit D 100 times, and these are the only three subreddits that User A has commented on, then Subreddit B gets a rating of 10/100 (or 0.1), Subreddit C gets a rating of 50/100 (or 0.5), and Subreddit D gets a rating of 100/100 (or 1.0). As a result, we can measure a user’s ratio of participation from one subreddit to another, regardless of their actual comment count. It’s a makeshift solution, but it might just work.
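The User A example can be reproduced with a short pandas sketch. The column names total_comments, max_comments, and rating are illustrative choices, not necessarily the ones used in the original notebook.

```python
import pandas as pd

# One row per comment, mirroring the worked example: user_a commented
# 10 times in B, 50 times in C and 100 times in D.
df = pd.DataFrame({
    "username": ["user_a"] * 160,
    "subreddit": ["B"] * 10 + ["C"] * 50 + ["D"] * 100,
})

# Total comments per (user, subreddit) pair.
totals = (
    df.groupby(["username", "subreddit"])
      .size()
      .reset_index(name="total_comments")
)

# Largest per-subreddit total for each user, repeated on every row of that user.
totals["max_comments"] = totals.groupby("username")["total_comments"].transform("max")

# Implicit rating: share of the user's most-commented subreddit.
totals["rating"] = totals["total_comments"] / totals["max_comments"]
print(totals)
```

Because every rating is divided by the user's own maximum, each user's favorite subreddit always gets 1.0, which keeps heavy and light commenters on the same scale.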

Dataframe with Ratings

https://gist.github.com/dougrizio/a2b6c2aaa775aa7c9598efa35c80beb8.js

However, having a rating for each user’s comments per subreddit isn’t enough — in order to create a similarity matrix, every field must be given a numerical id. These next long lines of code show how we can make a set of separate dataframes with only the dataset’s unique usernames and subreddits, assign a fixed numerical id to each based on its index number, and then add those ids back into the dataset into convenient positions.

Code for Creating User and Subreddit IDs
Dataframe with User ID (left), Dataframe with Subreddit ID (center), Dataframe with Both (right)
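One way to sketch the id assignment described above is with two small lookup tables that are merged back into the ratings. The column names user_id and subreddit_id are assumptions chosen for illustration.

```python
import pandas as pd

ratings = pd.DataFrame({
    "username": ["user_a", "user_a", "user_b"],
    "subreddit": ["AskReddit", "pics", "AskReddit"],
    "rating": [1.0, 0.5, 1.0],
})

# Lookup tables of unique values; the id is simply the row's index number.
users = ratings[["username"]].drop_duplicates().reset_index(drop=True)
users["user_id"] = users.index
subs = ratings[["subreddit"]].drop_duplicates().reset_index(drop=True)
subs["subreddit_id"] = subs.index

# Merge the ids back in and arrange the columns conveniently.
ratings = ratings.merge(users, on="username").merge(subs, on="subreddit")
ratings = ratings[["user_id", "username", "subreddit_id", "subreddit", "rating"]]
print(ratings)
```

A shorter alternative is pd.factorize, which returns integer codes directly, though the lookup tables are handy later for translating ids back into names.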

After a few tedious steps, we’ve turned a bare-bones, two-column dataset into a whole grid of information that we can extract interesting insights from. Why don’t we visualize the data to get more familiar with it?

VISUALIZING THE DATA

Plotting the Top 10 Subreddits with the Most Users

Here we can see the top 10 most popular subreddits in the dataset, based on the total number of users who have commented in each one. Anyone who is a Redditor themselves would recognize some of these top-tier subs, with r/AskReddit, r/pics, r/funny, r/TodayILearned, and r/worldnews taking the five highest spots. And at 14,000 users strong, our number 1 subreddit r/AskReddit boasts more than 60% of the users in the entire dataset.

Top 10 Subreddits with the Most Users

Plotting the Top 10 Subreddits with the Most Comments

Below, we can also see the top 10 most popular subreddits based on the total number of comments that the users have made in them. While the differences in user counts were relatively small from one top-tier subreddit to another, r/AskReddit eclipses the competition in terms of total comment count, numbering at just over 1,000,000 comments, more than twice the total of the subreddit with the next highest number of comments, r/politics.
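Both rankings come down to two different counts per subreddit: distinct commenters and raw comment rows. The sketch below shows the two aggregations on toy data with assumed column names and leaves the actual bar charts out.

```python
import pandas as pd

# Toy comment log: one row per comment.
df = pd.DataFrame({
    "username": ["u1", "u1", "u2", "u2", "u3", "u3", "u3", "u2"],
    "subreddit": ["AskReddit"] * 4 + ["politics"] * 3 + ["pics"],
})

# Distinct commenters per subreddit.
top_by_users = df.groupby("subreddit")["username"].nunique().nlargest(10)

# Raw comment rows per subreddit.
top_by_comments = df["subreddit"].value_counts().head(10)

print(top_by_users)
print(top_by_comments)
# Either Series can be drawn as a bar chart, e.g. top_by_users.plot.barh(),
# which requires Matplotlib.
```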

Top 10 Subreddits with the Most Comments

Plotting the Top 10 Users Following the Most Subreddits

Another way we can visualize this data is to plot the top 10 users who follow the largest number of subreddits. And while my earlier disclaimer mentioned that we wouldn’t show anyone’s usernames, upon further inspection, every single username in this graph belongs to a bot (AKA, an automated Reddit account designed for a specific purpose, like posting certain content or replying to other users’ comments when they use certain words and phrases). Who else would comment on 700 different subreddits?

Top 10 Users Following the Most Subreddits

Plotting the Top 10 Users with the Most Comments

Here we decide to plot the top 10 users with the highest number of total comments in all subreddits (and most of them appear to be human!). However, we reach a surprising conclusion: no user has more than 1,000 comments listed in the dataset. This might pose a problem for our impromptu ratings system, which relies on an accurate ratio of comments per subreddit. If User A actually commented on Subreddit B 1,000 times, Subreddit C 5,000 times, and Subreddit D 10,000 times, but all of the comments were cut off at 1,000, then all three of these subreddits would be rated as equal to each other, even when the user is clearly more interested in one than in the others.

Top 10 Users with the Most Comments

Plotting the Top 10 Favorite Subreddits of the User Who Follows the Most Subreddits

Our last graph before we get into the recommendation system shows the subreddits with the most comments from the user following the most subreddits, who in this case is one of the aforementioned bots. What is somewhat surprising is that, while the bot has commented on over 700 different subreddits, it hasn’t commented more than 25 times in any one of them.

Top 10 Favorite Subreddits of the User Who Follows the Most Subreddits

SIMILARITY MATRIX AND DATA REDUCTION

Dataframe with Only Numerical Values

Now that we’ve familiarized ourselves with the data, it’s time to set up our matrix of ratings. By eliminating non-numerical values, pivoting the dataset into a grid that compares all users to all subreddits in the dataset, and replacing the values between the users and subreddits with no existing connection from null to zero, we have created a vast matrix of relationships — although it is mostly empty. This is known as the problem of sparsity, which is that most users have not commented on the majority of subreddits, and most subreddits do not have comments from the majority of users.
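The pivot and the null-to-zero replacement can be sketched as follows. Placing subreddits on the rows and users on the columns is an assumption inferred from the matrix dimensions quoted later in the article, and the column names are illustrative.

```python
import pandas as pd

# Numeric-only ratings table: ids and the implicit rating.
ratings = pd.DataFrame({
    "user_id": [0, 0, 1, 2],
    "subreddit_id": [0, 1, 0, 2],
    "rating": [1.0, 0.5, 1.0, 1.0],
})

# Rows are subreddits, columns are users; unseen pairs start out as NaN.
matrix = ratings.pivot(index="subreddit_id", columns="user_id", values="rating")
matrix = matrix.fillna(0)

# Fraction of cells with no user-subreddit connection.
sparsity = (matrix == 0).to_numpy().mean()
print(matrix)
print(f"Share of empty cells: {sparsity:.0%}")
```

Even in this 3 x 3 toy example more than half of the cells are empty; in the full dataset the share is far higher.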

Creating Similarity Matrix
Dataframe with Null Values (left), Dataframe with Null Values Replaced by Zeroes (right)

Aggregating Users and Subreddits

In the next step, we aggregate the number of users who commented on different subreddits, and the number of subreddits that were commented on by different users, and project those numbers onto a scatter plot to see all of the dataset represented as points.
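Assuming a ratings matrix with subreddits as rows and users as columns, the two aggregates are simply counts of nonzero cells along each axis. The names below are illustrative.

```python
import pandas as pd

matrix = pd.DataFrame(
    [[1.0, 0.5, 0.0],
     [0.2, 0.0, 0.0],
     [0.0, 1.0, 1.0]],
    index=pd.Index(["AskReddit", "pics", "politics"], name="subreddit"),
    columns=pd.Index(["u1", "u2", "u3"], name="username"),
)

# A nonzero rating means the user commented in the subreddit at least once.
users_per_subreddit = (matrix > 0).sum(axis=1)
subreddits_per_user = (matrix > 0).sum(axis=0)

print(users_per_subreddit)
print(subreddits_per_user)
```

These two Series are what the scatter plots display, one point per subreddit and one point per user respectively.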

Plotting Number of Users Commenting Per Subreddit

In the plot below, we have placed a line at y=100 to represent the minimum threshold of users commenting on each subreddit that we want to base our recommender system on. While there are many subreddits in the dataset, many of them have a very low number of users commenting on them, which means that the subreddits either have a small user base or little user activity. There isn’t much of a point in recommending subreddits that no one else is interested in, so we will eventually remove any subreddit with fewer than 100 users commenting.

Number of Users Commenting Per Subreddit

Plotting Number of Subreddits Followed Per User

In this second scatter plot, we can see the number of subreddits commented on, or followed, per user. Unlike the previous plot, where most subreddits have a small user base, most users here follow and comment on a surprisingly large number of subreddits (with the points only losing density around y=100). However, basing a recommender system on only the users following at least 100 subreddits seems like a good way to get inaccurate results. As we realized in our earlier bar charts, many of the users commenting on the most subreddits are actually bots. In reality, most human users probably follow far fewer subreddits, and only comment on the handful of subreddits that actually interest them. Because of this reasoning, our threshold for minimum subreddits per commenting user will be set to 10. Although this number is very low compared to the previous threshold, we want to make sure that most users’ interests are represented. We also want to make sure that we are including a high ratio of human users to bots, which we have no way of ensuring other than casting a wide net of users in general.

Number of Subreddits Followed Per User

Reducing the Dataframe

Now that we have decided on our minimum thresholds, we can reduce our dataset. We can also use a sparse matrix format called Compressed Sparse Row (CSR) to make the matrix easier to work with. Even after reduction, a sparse matrix such as ours is rather sizable, and because its zeros contain no useful information, storing and processing them only adds time and memory to the matrix operations. The solution to this problem is to use an alternate data structure to represent the sparse data, one that stores only the nonzero values together with their positions and skips the zeros entirely.
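The reduction and the CSR conversion can be sketched as below. The thresholds are scaled down to fit a toy matrix (the article uses 100 users per subreddit and 10 subreddits per user), both counts are taken on the unreduced matrix, and all names other than csr_data are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings matrix: subreddits as rows, users as columns.
matrix = pd.DataFrame(
    [[1.0, 0.5, 0.2, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.3, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]],
    index=["AskReddit", "tinysub", "pics", "nichesub"],
    columns=["u1", "u2", "u3", "u4"],
)

MIN_USERS = 2        # article: 100 users per subreddit
MIN_SUBREDDITS = 2   # article: 10 subreddits per user

users_per_subreddit = (matrix > 0).sum(axis=1)
subreddits_per_user = (matrix > 0).sum(axis=0)

# Keep only the rows and columns that meet their thresholds.
reduced = matrix.loc[users_per_subreddit >= MIN_USERS,
                     subreddits_per_user >= MIN_SUBREDDITS]

# CSR stores just the nonzero values and their positions.
csr_data = csr_matrix(reduced.values)
print(reduced.shape, "stored values:", csr_data.nnz)
```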

Reduced Dataframe

As a result of limiting the dataset to the two user and subreddit thresholds that we discussed earlier, we have greatly reduced the dataset size from around 35,000 x 22,000 to about 1,200 x 17,000. And, after using CSR, we have manipulated the matrix into a more usable form as an object named “csr_data.” It’s time to design our subreddit recommender.

SUBREDDIT RECOMMENDER

Code for Subreddit Recommender

First we fit the CSR data into the KNN algorithm with the number of nearest neighbors set to 20 and the metric set to cosine distance in order to compute similarity. Then we define the function subreddit recommender, set the number of recommended subreddits to 10, instruct it to search for our inputted subreddit in the database, find similar subreddits, sort them based on similarity, and output those top 10 most similar subreddits from the list. Let’s try it out.
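A scaled-down sketch of such a recommender, using scikit-learn's NearestNeighbors, might look like this. The ratings, the subreddit list, and the function body are illustrative assumptions rather than the original code, and the neighbor counts are reduced from 20 and 10 to fit five toy subreddits. Note that kneighbors returns cosine distances (1 minus cosine similarity), so smaller values mean closer neighbors.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical ratings: one row per subreddit, one column per user.
subreddit_names = ["leagueoflegends", "Overwatch", "wow", "politics", "news"]
ratings = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.7, 0.6, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.8, 1.0],
])
csr_data = csr_matrix(ratings)

# Brute-force search is required for the cosine metric on sparse input.
knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=3)
knn.fit(csr_data)

def subreddit_recommender(name: str, n_recommend: int = 2):
    """Return the n_recommend nearest subreddits as (name, cosine distance)."""
    if name not in subreddit_names:
        return []
    idx = subreddit_names.index(name)
    # Ask for one extra neighbor because the query is its own nearest match.
    distances, indices = knn.kneighbors(csr_data[idx], n_neighbors=n_recommend + 1)
    pairs = zip(indices.ravel(), distances.ravel())
    # Skip the query itself; results are already ordered closest first.
    return [(subreddit_names[i], float(d)) for i, d in pairs if i != idx][:n_recommend]

print(subreddit_recommender("leagueoflegends"))
```

In this toy data the two other game subreddits come back as the nearest neighbors, while the news-oriented ones share almost no users with the query and are left out.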

Subreddit — r/AskReddit

Our first attempt will be on r/AskReddit, the most popular subreddit in the whole dataset.

Searching for similar subreddits to: ‘AskReddit’

All of the results are other popular subreddits, and their similarities are fairly low, with the most similar subreddit being r/AdviceAnimals, at a distance of only 0.673. Maybe our input was too generic, since most users seem to follow it, as well as the rest of these subreddits on the list. Perhaps we should pick a subreddit that is a little more niche. That will be the real test of this subreddit recommender’s success.

Subreddit — r/leagueoflegends

Our next subreddit is r/leagueoflegends, a less popular subreddit that is still within the top 5 subreddits with the most user comments. Before we run this search we should ask ourselves, what are our expectations for it? With any luck, a subreddit based on a specific subject, such as videogames, should be fairly similar to other subreddits based on videogames.

Searching for similar subreddits to: ‘leagueoflegends’

After the popular subreddit r/pics, the next most similar subreddits from the search are r/diablo3, r/Overwatch, and r/wow, with respective distances of 0.942, 0.939, and 0.934. Since all three of these subreddits are based on videogames as well, I would call this attempt a success! Other similar subreddits are r/anime (an interest which often overlaps with videogames), r/gaming, r/hearthstone (another videogame), and r/summonerschool (which is another League of Legends subreddit). In fact, the only subreddits that are not immediately relevant are r/pics, r/videos, and r/AskReddit, but these three subreddits are so commonly followed amongst users that it might be hard for the search to not see them as similar to other subreddits.

Subreddit — r/Cyberpunk

Our final attempt will be r/Cyberpunk, a subreddit devoted to the tantalizing aesthetics of dark, technological, and dystopian societies. We would expect to see other subreddits of a similar genre come out of the search.

Searching for similar subreddits to: ‘Cyberpunk’

While some people might struggle to see the similarity between these subreddits and our original input, I see a few common themes: r/Perfectfit and r/DesignPorn are about the aesthetics of functional objects, while r/Futurology, r/privacy, r/EverythingScience, and r/tech are about science and technology, and r/ANormalDayInRussia is rather dystopian in a way that’s hard to describe. r/batman also represents a superhero that could be considered the epitome of cyberpunk depending on the context. r/bodyweightfitness and r/Assistance are the only subreddits I have difficulty relating to the topic, but most of the other similar subreddits in the range are easily explainable and also have a similarity score between 0.79 and 0.89, which to me is an additional sign of success.

LIMITATIONS AND CONCLUSION

This project has several limitations, mostly as a result of the dataset.

First, the data covers only 5 days and was scraped 4 years ago. The set also does not feature all Reddit users, all of the comments made by those Reddit users, or all of the subreddits available at the time of scraping. It only includes a record of user comments, as well, not posts, which might be even more important to know if we are trying to determine levels of user engagement in different subreddits. Additionally, with one hundred thousand subreddits and hundreds of millions of users actually using Reddit, making an accurate subreddit recommender based on the entire site would be a much greater undertaking than this project was.

Another clear limitation was that we had to develop our own user rating system from scratch. Rather than relying on the explicit subreddit preferences that a user might give to us directly in the form of actual ratings, we had to estimate the users’ ratings of different subreddits based on the implicit preferences revealed to us by their comment history.

On top of these more obvious limitations, we also discovered that an indeterminable number of users are bots whose comment counts and followed subreddits don’t reflect the behaviors or interests of actual human beings. The similarity search results were also skewed by the most followed subreddits containing the largest number of comments — and if our system recommends a subreddit that everyone already knows about, then why bother using the search?

However, despite these limitations, I would still rate this project as a moderate success! Most of the recommended subreddits were relatively similar to our original input, and while not everything was a perfect match, there were few results that were radically different. Although it could use a bit more data and a few modifications to its rating system, it seems that this subreddit recommender is nearly ready for use — in any universe!
