Machine Learning with Reddit, and the Impact of Sorting Algorithms on Data Collection and Models
As part of my immersive data science course at General Assembly, I designed a classification model in Python using natural language processing and basic machine learning techniques. The model determines the origin of a Reddit post, that is, whether it came from the /r/Futurology or /r/worldnews subreddit, though it can be generalized to compare other subreddits. The model succeeded, typically determining which post belonged where about 83 to 91% of the time. What caught my interest after I ran and reran the model was the range of variance in its success, specifically how choosing different sorting algorithms on Reddit significantly affected performance.
For my model, I selected these two subreddits because they tend to share some similar content, yet are distinct enough that someone familiar with both could often tell which post belongs where. One measure of success for a machine learning model is whether its performance is comparable to, or exceeds, skilled human performance. Why else are we interested in machine learning, after all, if not to sort information more effectively than people can?
I took advantage of an existing library, PRAW (the Python Reddit API Wrapper), along with my Reddit API keys, to scrape my data. To scrape any subreddit, you have to specify which sorting algorithm orders the posts you receive; for the sake of consistency, it is important to compare data gathered with the same sorting algorithm (more on this soon). Once I had saved my data of interest, subreddit origin and post titles, in a pandas DataFrame, I assigned the post titles to X, my independent variable, and the subreddit each title came from to y, my dependent variable, the relationship I am looking to determine. I used scikit-learn's train_test_split to randomly divide the data: the training portion, with posts attached to their actual subreddit of origin, trains the machine learning model, which is then tested on the remaining titles whose subreddit labels are withheld. Prior to running the models, the post title words were CountVectorized, transformed into machine-readable tokens: individuated variables that ignore punctuation and group similar words. Two models, a Multinomial Naive Bayes model and a simpler logistic regression model, classified each test title into the subreddit it most likely belongs to, based on what the model learned from the training data. The main hyperparameter I tweaked to optimize my model was the number of variables in the CountVectorizer: I usually limited the vocabulary to the most predictive 1,500–2,000 words, preventing the training data from being too specific to generalize to the new testing data (overfitting).
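The pipeline above can be sketched roughly as follows. This is a minimal illustration, not my exact notebook: the PRAW scraping step is shown only as comments (it needs real API credentials), and the handful of toy titles stand in for the thousands of scraped posts.

```python
# Sketch of the pipeline described above. The PRAW scraping step is
# shown as comments since it requires real Reddit API credentials:
#
# import praw
# reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
# titles = [post.title for post in reddit.subreddit("Futurology").hot(limit=500)]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in data: 0 = /r/Futurology, 1 = /r/worldnews
titles = [
    "Scientists unveil fusion reactor breakthrough",
    "New AI model predicts protein folding",
    "Lab-grown organs could end transplant waitlists",
    "Brain implants restore speech in trial",
    "Leaders meet for emergency summit on conflict",
    "Elections held amid protests in capital",
    "Sanctions announced after border incident",
    "Treaty signed to curb emissions worldwide",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Random split: training titles keep their labels, test labels are held out.
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.25, stratify=labels, random_state=42
)

# CountVectorizer tokenizes titles into word counts; max_features caps
# the vocabulary (around 1,500-2,000 words on the real data).
vec = CountVectorizer(max_features=2000)
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

# The two classifiers compared in the article.
nb = MultinomialNB().fit(X_train_vec, y_train)
lr = LogisticRegression(max_iter=1000).fit(X_train_vec, y_train)
print("Naive Bayes accuracy:", nb.score(X_test_vec, y_test))
print("Logistic regression accuracy:", lr.score(X_test_vec, y_test))
```

On the real scraped data, accuracy landed in the 83–91% range discussed below; the toy data here is only meant to show the shape of the workflow.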
What I Found
As I said, this model usually succeeded in classifying the testing data into the right subreddit 83 to 91% of the time. That felt like a sensible range given the overlap in post types between these two subreddits. But why was there this variance? It didn't occur when randomizing and splitting the same data into different train/test splits; it occurred when the same subreddits were scraped with different sorting algorithms.
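One way to see the first half of that claim is to score several random splits of one fixed dataset and watch how little the accuracy moves. The sketch below uses synthetic "titles" (real scraped data isn't reproducible here) and scikit-learn's cross_val_score, which repeats the split-train-score cycle across folds:

```python
# Illustration: on a fixed dataset, accuracy is stable across random
# train/test splits. The "titles" are synthetic stand-ins, generated
# from two overlapping topic vocabularies to mimic the two subreddits.
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

rng = random.Random(0)
futurology_words = ["ai", "fusion", "robot", "space", "gene", "solar"]
worldnews_words = ["election", "treaty", "border", "summit", "sanctions", "protest"]
shared_words = ["new", "report", "global", "future", "policy"]

def fake_title(vocab):
    # Mix topic-specific and shared words, mimicking subreddit overlap.
    return " ".join(rng.choices(vocab + shared_words, k=6))

titles = [fake_title(futurology_words) for _ in range(100)] + \
         [fake_title(worldnews_words) for _ in range(100)]
labels = [0] * 100 + [1] * 100

model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(model, titles, labels, cv=5)
print("fold accuracies:", scores.round(2))
print("spread:", round(scores.max() - scores.min(), 2))
```

The fold-to-fold spread stays small, so the larger swings I observed had to come from the data itself, i.e. from which posts each sorting algorithm handed back.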
Let’s explore some of the implications of this!
There are currently six sorting algorithms one can organize a subreddit by: Best, Hot, New, Top, Controversial, and Rising. I found another Medium article, written in 2015, laying out the Hot algorithm from when Reddit's code was open source. As of 2016 this is no longer the case; these once publicly transparent sorting algorithms are now proprietary components of Reddit's business model. (Relatedly, the code is obscured to prevent bad-faith posters from artificially pushing bad or irrelevant content to the top of a subreddit, or from farming karma, for whatever pathological reasons someone wants fake internet points.) The author rewrote the code from Pyrex, which is used to write C extensions for Python, into Python for readability, which can be viewed here:
# Rewritten code from /r2/r2/lib/db/_sorts.pyx
from datetime import datetime, timedelta
from math import log

epoch = datetime(1970, 1, 1)

def epoch_seconds(date):
    td = date - epoch
    return td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000)

def score(ups, downs):
    return ups - downs

def hot(ups, downs, date):
    s = score(ups, downs)
    order = log(max(abs(s), 1), 10)
    sign = 1 if s > 0 else -1 if s < 0 else 0
    seconds = epoch_seconds(date) - 1134028003
    return round(sign * order + seconds / 45000, 7)
What one can interpret from this is that the Hot algorithm returns posts with more upvotes than downvotes, prioritizing posts with high upvote counts that were posted relatively recently compared to other posts within the subreddit. What we generally know these days about the others: Controversial posts have a higher proportion of downvotes, Top posts are historically popular, Best posts have higher ratios of upvotes after a certain period of time, New posts are simply ordered by time posted, and Rising posts are new posts getting lots of attention and votes. In this case, the difference in the posts returned by my scrapes using the Hot and Top algorithms at the same time demonstrates that the nature of posts within a subreddit changes over time, meaning our machine learning models are not useful if they are not receiving new training data. Of course, this applies universally to machine learning applications with online data, and even more broadly to human learning: the internet is an ever-evolving digital organism whose full essence a static archive will never capture. As such, we must always be prepared to adapt, learn, and be receptive to new information in order to understand what we are witnessing.
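The time decay in the Hot formula can be made concrete with a quick demo (the functions are restated here so the snippet runs on its own). Since every 45,000 seconds of recency is worth one order of magnitude of net score, a post with 100 times fewer net upvotes still outranks an older one if it is two days newer:

```python
# Self-contained demo of Reddit's old "hot" ranking (restated from the
# snippet above), showing how recency trades off against vote score.
from datetime import datetime
from math import log

epoch = datetime(1970, 1, 1)

def epoch_seconds(date):
    td = date - epoch
    return td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000)

def hot(ups, downs, date):
    s = ups - downs
    order = log(max(abs(s), 1), 10)
    sign = 1 if s > 0 else -1 if s < 0 else 0
    seconds = epoch_seconds(date) - 1134028003
    return round(sign * order + seconds / 45000, 7)

old_popular = hot(1000, 0, datetime(2024, 1, 1))  # 1000 net upvotes, older
new_modest = hot(10, 0, datetime(2024, 1, 3))     # 10 net upvotes, 2 days newer
print(new_modest > old_popular)  # prints True
```

Here the vote gap is worth log10(1000) − log10(10) = 2 ranking points, while two days of recency is worth 172,800 / 45,000 ≈ 3.84, so the newer post wins, which is exactly why a Hot scrape keeps surfacing different posts over time while Top barely moves.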