Exploring Reddit with praw
Subreddit recommendations without building a recommendation system
Reddit is the Wild West of the internet. Unlike many modern social platforms, it’s segmented into communities that each have their own purpose and standards. Instead of adding people to your personal network, you can explore and join these communities to get a taste of what they’re about.
Thought Experiment: How would you go about finding new communities of people IRL, without Reddit? Or the internet? Like if you actually had to get out of the house and meet people. Imagine you’re dropped into a new city in 1985 and you don’t know a single soul.
The first thing you might do is find the community you’re most familiar with. Somewhere you already know the customs and know how to maneuver. For me, that community is probably /r/bjj, but we can use /r/datascience (I know you nerds love data science). Let's use praw, the Python wrapper for Reddit’s API, to get the titles of the 10 hottest posts in that community.
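Here’s a minimal sketch of what that looks like. The credential strings are placeholders you’d swap for your own, and the user_agent can be any descriptive string.

```python
import praw

# Placeholder credentials: generate your own client_id/client_secret by
# registering an application on Reddit (more on that below).
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="subreddit-explorer by /u/your_username",
)

# Print the titles of the 10 hottest posts in /r/datascience.
for post in reddit.subreddit("datascience").hot(limit=10):
    print(post.title)
```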
I created a new Reddit object and passed in some login credentials. Check out this page to see how to get client_id and client_secret keys. You can generate the pair here by creating a new application.
The code above should give you a basic idea of how to use the API. From the Reddit object, we can access basic Reddit entities like subreddits, redditors, and comments. Here, we access /r/datascience and use the hot() method to get the 10 hottest posts. Note that hot() returns a generator.
The API’s structure is actually really nice to work with. To get a post’s title, we can just use the post.title property. The author is accessible through the post.author property. Very intuitive and ✨pythonic✨. We iterate through the hottest posts, extract the relevant info, then dump it all into a dataframe so it's easy to deal with. I pulled out some extra fields to give some hints at other data points that might be helpful in future analyses.
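Roughly, that step could look like the sketch below. The extra fields here (score, comment count, creation time) are my picks for potentially useful columns, so swap in whatever you care about; the snippet assumes the reddit instance from above.

```python
import pandas as pd

# Collect the hottest posts into a list of dicts, then build a dataframe.
posts = []
for post in reddit.subreddit("datascience").hot(limit=10):
    posts.append({
        "title": post.title,
        "author": str(post.author),       # Redditor objects stringify to the username
        "score": post.score,              # extra fields for later analysis
        "num_comments": post.num_comments,
        "created_utc": post.created_utc,
    })

df = pd.DataFrame(posts)
print(df.head())
```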
Ok, so we’ve found our main community. We have a home base from which we can branch out and find new communities and hobbies to take part in. Let’s look around and meet some people. On Reddit, communities interact primarily through posting and commenting. Let’s aggregate a list of potentially interesting friends. Maybe they’ll be able to point us to other cool communities.
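Here’s one way to sketch that, reusing the reddit instance from earlier: walk the hot posts again and keep the usernames of everyone who posted or left a top-level comment. The limits are arbitrary, so feel free to raise them.

```python
# Build a set of usernames from the authors of hot posts and their top-level comments.
friends = set()

for post in reddit.subreddit("datascience").hot(limit=10):
    if post.author is not None:            # deleted accounts show up as None
        friends.add(post.author.name)
    post.comments.replace_more(limit=0)    # drop "load more comments" placeholders
    for comment in post.comments:          # top-level comments only
        if comment.author is not None:
            friends.add(comment.author.name)

print(f"Collected {len(friends)} potential friends")
```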
Boom. Now we have a set of people who can potentially point us to new communities. We got this list by collecting a set of people who either posted or commented.
Now that we have a list of friends, let’s examine a list of other subreddits one of them has interacted with.
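A rough sketch of that, building on the friends set from the last snippet; grabbing a redditor with next(iter(friends)) is just a convenient way to pick one of them.

```python
from collections import Counter

# Pick one of our new friends and count the subreddits their recent activity lands in.
username = next(iter(friends))
redditor = reddit.redditor(username)

subreddit_counts = Counter()
for comment in redditor.comments.new(limit=100):
    subreddit_counts[comment.subreddit.display_name] += 1
for submission in redditor.submissions.new(limit=100):
    subreddit_counts[submission.subreddit.display_name] += 1

print(subreddit_counts.most_common(10))
```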
Oooooo this redditor is into some cool stuff. While their hangouts are interesting, they seem to be biased towards the entrepreneurship/marketing side of data science. Instead of looking at this one person’s interests, I want to leverage the wisdom of the crowd here. It would be interesting to ask all of my new friends for new hangout recommendations and see what the most common answers are. Let’s do that with code and see which other communities are mentioned the most.
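Here’s a sketch of that aggregation, again building on the friends set. Suspended or deleted accounts can throw errors when you read their history, so those simply get skipped.

```python
from collections import Counter

# Aggregate subreddit counts across every friend we collected.
community_counts = Counter()

for name in friends:
    try:
        redditor = reddit.redditor(name)
        for comment in redditor.comments.new(limit=50):
            community_counts[comment.subreddit.display_name] += 1
        for submission in redditor.submissions.new(limit=50):
            community_counts[submission.subreddit.display_name] += 1
    except Exception:
        # Suspended/deleted accounts (or rate-limit hiccups) get skipped.
        continue

print(community_counts.most_common(25))
```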
Ok, now we’re talking! We’re starting to flesh out the picture of other potentially cool communities. Some of them, like r/AskReddit or r/Showerthoughts, are generally popular; they probably aren’t anything new to us. There are ways we could potentially filter those out (something like tf-idf), but that might be more trouble than it’s worth. No need to be elegant here, we can simply look at the list and filter them out.
Further down the list we start to see some really interesting communities like r/CryptoCurrency and r/datasets. To make this a bit more robust, we can increase the number of redditors and subreddits we look through for suggestions.
There we have it: a quick set of recommendations without numpy, sklearn, tensorflow, or pytorch. Note that we did essentially no math, and we worked around the fact that we don’t even have access to that much data, given the constraints of the API. In cases where you just need an answer that’s grounded in data, thinking through the problem and approaching it intuitively can be a much better choice than defaulting to hardcore ML.
Note that taking a traditional machine learning approach and building a full-blown recommendation system here would be dead in the water. We don’t have perfect information, since we can’t just run a “SELECT * FROM users” query. Instead, we’re taking advantage of the limited information we have and leveraging it to achieve the stated goal to a “good enough” standard. When you begin thinking this way, you’ll start to see opportunities pop up in this liminal space between basic data analysis and machine learning at scale. Take advantage of it.
Thanks for the read! Leave me some claps and follow me on Instagram @kydro_digital