Gaze Into My Reddit Crystal Ball
Using Watson Machine Learning to predict a post’s potential
Editor’s note: This article is part of an occasional series by the 2017 summer interns on the Watson Data Platform developer advocacy team, depicting projects they developed using Bluemix data services, Watson APIs, the IBM Data Science Experience, and more.
Reddit is a social news-aggregation and discussion forum that receives millions of new posts every day. Some of these posts are links or images, but some contain only text and usually serve to request or provide information, or to spark some kind of discussion. Users on the site can “upvote” or “downvote” these posts, moving the post’s score up or down by one. The end result of this system is a ranked list of posts for users to scroll through, divided into “subreddits” (subjects), with the highest-scoring posts situated at the top.
In this article, I’ll describe an app I built to help with my Reddit game, and what I learned about machine learning in the process. I’ll also share the code so you can try it yourself.
Introducing the Reddit Crystal Ball
The Reddit Crystal Ball is an app that predicts how high a score your post will receive on Reddit if you submit it at the current time. If the app thinks there’s a time later in the day at which you could get a better score, it will tell you that, too.
The app uses Watson’s machine learning service to make its prediction, which is based on a few different factors:
- Current Time of Day
- Average Word Size
- Watson Social Tone Analysis
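As a rough sketch of the feature extraction involved — assuming the post is a plain title string, and leaving out the Watson Tone Analyzer call since it requires service credentials — the first two features might be computed like this (the function name and return shape are illustrative, not the notebook’s actual code):

```python
from datetime import datetime

def extract_features(title, posted_at=None):
    """Compute simple numeric features for a Reddit post.

    Hypothetical helper for illustration; the real notebook builds its
    feature vectors with Spark ML pipelines.
    """
    posted_at = posted_at or datetime.utcnow()
    words = title.split()
    avg_word_size = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "hour_of_day": posted_at.hour,             # current time of day
        "avg_word_size": round(avg_word_size, 2),  # average word size
        # Tone features (joy, anger, etc.) would come from a call to the
        # Watson Tone Analyzer service, omitted here.
    }
```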
I used these features to build a machine learning model with Spark ML, which I then deployed on Bluemix using the Watson ML service. This creates a “scoring endpoint,” which allows us to interact with and query our model through a REST API that can be accessed from any platform, using any programming language.
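To give a feel for what querying a scoring endpoint looks like, here is a minimal sketch using only the Python standard library. The endpoint URL, field names, and the `fields`/`values` payload shape are assumptions for illustration — your deployment’s actual URL, schema, and auth token will differ:

```python
import json
import urllib.request

# Hypothetical scoring endpoint URL -- substitute your own deployment's URL.
SCORING_URL = "https://ibm-watson-ml.mybluemix.net/v3/wml_instances/<instance>/deployments/<deployment>/online"

def build_payload(hour_of_day, avg_word_size, joy, anger):
    """Assemble a JSON body for a scoring request (assumed fields/values shape)."""
    return {
        "fields": ["hour_of_day", "avg_word_size", "joy", "anger"],
        "values": [[hour_of_day, avg_word_size, joy, anger]],
    }

def score(payload, token):
    """POST the payload to the scoring endpoint and return the parsed response."""
    req = urllib.request.Request(
        SCORING_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Because the endpoint is a plain REST API, the same request could just as easily be made with `curl` or from any other language.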
To make predictions, the machine learning model uses an algorithm called K-Means Clustering to group similar posts into clusters. The posts in each cluster are then analyzed to determine the cluster’s average score, and the clusters are binned into four groups: Low, Medium, High, and Great.
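The cluster-to-group step can be sketched in plain Python. The function below is illustrative only — the notebook does this with Spark ML, and the way clusters are spread across the four labels here (evenly, by rank of average score) is an assumed scheme, not the project’s exact thresholds:

```python
def label_clusters(cluster_scores):
    """Map each cluster to one of four quality groups.

    cluster_scores: dict of {cluster_id: average post score}.
    Clusters are ranked by average score, then spread evenly across
    the Low / Medium / High / Great labels (illustrative scheme).
    """
    labels = ["Low", "Medium", "High", "Great"]
    ranked = sorted(cluster_scores, key=cluster_scores.get)
    out = {}
    for rank, cluster_id in enumerate(ranked):
        # Integer division spreads the ranked clusters across the labels.
        out[cluster_id] = labels[rank * len(labels) // len(ranked)]
    return out
```

A new post is then assigned to its nearest cluster, and the cluster’s label becomes the prediction.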
This method wasn’t my first choice. I initially attempted using decision-tree- and probabilistic-based algorithms like Random Forest and Naive Bayes to predict a specific score, but quickly learned that predicting an exact score was not going to work well given the constraints of this data set.
Because I wanted to document the process of gathering data, processing it, and creating a machine learning model, I chose to build this project in a Jupyter Notebook. A notebook is an environment that allows documentation and executable code to live together, side-by-side, so it was perfect for this project.
Implementation in a data science notebook
After stepping through my notebook, you’ll not only understand how the data was processed and used to train a model, but you’ll also be able to interact with that model and use it to make predictions on your own posts.
These interactive elements are called PixieApps. A PixieApp is an app written in Python with the PixieDust helper library that runs inside the notebook itself. Using the Jinja2 templating language, it becomes relatively easy to create a nice UI that helps the data come alive.
What I [machine] learned
After playing with the data and interacting with the model in the PixieApp, some interesting trends emerged. While all the features influenced the prediction, the most important were the choice of subreddit and the time a post was made. This makes sense, since different sections of the site are likely to be most active at different times, and it follows that posts would score higher during these periods of activity. At the same time, a post containing a link — which can drastically increase its average word size — or a post that skews far in a certain direction in the tone analysis can be predicted to score higher or lower than subreddit and time alone would suggest.
At the start of this project, machine learning was a completely foreign concept to me. Even the process of gathering, cleaning, and analyzing data was something I had little experience with. The great thing about notebooks on the IBM Data Science Experience is that you get the Pandas Python Data Analysis Library and the Spark engine out-of-the-box to get you started with small and large data science projects alike.
Working with these tools, I was able to analyze the Reddit post data set and experiment with different features in depth. Now, I feel I have a much better grasp on what machine learning does, how it works, and the tools needed to work with large data sets.
So check out the notebook and PixieApp I created, and let me know what you think here in the comments. You’ll be able to see the entire process of building and deploying the model, and you’ll have the opportunity to make predictions on your own posts. You might even find that it helps you create the perfect, high-scoring Reddit post of your dreams.
To create your own crystal ball, load the notebook, complete the setup steps, and follow the instructions in the notebook cells. May your comments be plentiful, and your future filled with upvotes!