The Digg Video Recommender
New Algorithm, Same Hot Content
Here’s the thing about the Internet: there’s a lot of it and not everything is gold. The job of a news aggregator is to sort through each day’s daily dose of Internet and choose the most interesting and relevant stories and videos for you people. Different sites have taken different approaches to this problem; Google News uses its algorithms to deliver a personalized homepage, Reddit uses upvotes and downvotes to deliver the freshest stream of dog photos and entertaining AskReddit topics. Here at Digg, we use humans because our goal is not to recommend lots of good things, but to surface a small amount of the very best things. And while algorithms are good at lots of things, we think that humans have the advantage at recognizing this type of content.
However, algorithms are useful when we need to make lots of decisions about what to show you, and this problem arises on our video pages. Every time someone watches a video, we recommend 6 videos that we think they might want to watch next. We post about 20–30 new videos per day, so that means our editors would need to generate well over 100 recommendations per day (or so our data scientists have assured us). That level of demand would take away from their Internet surfing responsibilities, and we cannot have that.
So we’re adopting a hybrid approach (some have called this the cyborg approach to curation). Our editors make recommendations where there is a clear connection between videos. For example, we believe that if you liked this adorable video of a kid’s rocket logic then you’ll also be interested in this kid’s reaction to getting steak sauce for Christmas. When we cannot think of videos with a clear connection, we’ll use our new video recommendation algorithm to serve up its own recommendations. In fact, most video recommendations will be generated by this algorithm.
We like to think that we are augmenting cold robotic logic with a human touch.
How does this new video recommendation algorithm work? It takes an approach known as “explore and exploit”: It initially serves up a wide variety of videos in an effort to give each video a fair shot at earning your attention. As time goes on it learns what videos are getting clicked on, what videos are getting ignored, and allocates attention towards videos with higher click-through-rates. The process resets periodically to make sure that the recommended videos are fresh, rather than videos that were performing well a few days ago. The algorithm also segments viewers by traffic source, so people that are directed to a video from Facebook will receive different recommendations than those who arrive via our homepage. We think this approach gives the right balance of showing new videos along with the best videos we have posted over the last month.
So that’s it. If you want to know more about some of the technical aspects of the new system, read on. Otherwise we recommend you go to Digg and click on some recommended videos. The robots would appreciate it.
The Technical Details
When we first decided to build a new video recommender, we threw around a lot of ideas. What form of collaborative filtering should we use? Would a simple k-means cluster work well or do we need to do something cool and crazy like collaborative topic modeling? Should we integrate social signals from Facebook and Twitter? Do we use deep learning, and if so, how do we teach our machines to dream about dogs?
In all this thinking about fancy algorithms, there was a temptation to lose sight of the real goal of the system: to choose the video that maximizes the probability that a user clicks on it. Mathematically, we could state the goal like this:
P( click | recommended_vid, focal_vid, user characteristics )
Any recommendation algorithm, regardless of the details, is just trying to solve this maximization problem. Unfortunately, it is generally infeasible to compute the click probability for all combinations of videos and profiles of user characteristics, so we often have to approximate it with some other, easier function. The variety in recommendation algorithms is primarily driven by different heuristics for estimating this click probability function.
There are some settings where it’s feasible to directly estimate click probabilities, and we feel Digg is one of them for a couple of reasons. The set of active videos on Digg is tiny compared to other sites. YouTube gets thousands of uploads per hour while Digg only gets 20–30 per day. Moreover, a random video on Digg is much better, on average, than random one from YouTube (because of all those humans hard at work), so there’s relatively little risk in adopting a policy that might not always show the cream of the crop.
Our high-level strategy is to measure the click probability for each video and then recommend the videos with the maximum probabilities. This sounds like a good plan but how do we actually estimate these probabilities, particularly when we have no historical click data for newly published videos?
One option is to use an A/B test. If we only had two videos, we could randomly split our users into two groups and show video A to the first group and video B to the second group. We let this run for a small sample of users and then show the video with more clicks to the greater population. However there are two issues with this solution:
1. We have way more than 2 videos. Even though our policy is to only recommend videos that have been posted in the last month, that still leaves us with a set of ~300 videos to potentially recommend. We might end up devoting all our traffic to just figuring out which is the best video, and have no users left to show it to.
2. With this simple A/B test, we would only estimate P( click |recommended_vid), not P(click | recommended_vid, focal_vid, user characteristics). In other words, we estimate the click probability of a video, averaged over the distribution of user characteristics and features of the focal video (the video the user is currently watching). This approach ignores potentially valuable information about user characteristics.
The first problem is common in computer science and is nicely captured by the concept of the explore-exploit tradeoff. We’ll discuss this more below. The latter problem just requires us to be more careful about how we actually compute click probabilities but doesn’t fundamentally change the idea. The algorithm we have running on Digg conditions its probability estimates on user characteristics but we’ll defer the details of that process to another post. For the purposes of this blog post, we’ll satisfy ourselves with measuring average click probability.
The explore-exploit tradeoff
The explore-exploit tradeoff is best explained by comparing it to a more familiar concept: A/B testing. Imagine that we only had two videos in our candidate set of videos, conveniently named video A and video B, and we decided to use an A/B test to figure out which video to recommend. The goal of A/B testing is to be statistically certain which video has a higher click-through-rate by the end of the test. We accomplish this by deciding on the number of users we should include in our test, and then recommending video A to half of those users and video B to the other half. When the test is over we see which video received more clicks and recommend that video for the rest of time.
How many users do we place into this test? If we really wanted to be certain that we chose the video with the higher CTR, we would place 100% of users into the test. However our goal is not to be certain that A is better than B, our real goal is to maximize the number of people using the video recommendations, and this may be at odds with A/B testing. For example, we might notice that after the first 10,000 users, video A has a CTR of 10% while video B has a CTR of 1%. It would be silly to continue to show video B because it seems that A would maximize the number of clicks that the recommender gets. You might suggest that “well you should end the A/B test” if things look like this and you would be right in the context of this example. However there is a better way than just relying on your gut instinct for when to end a test (I should also note that if the differences weren’t so large, it would be a mistake to end the test early).
The explore-exploit approach gives a principled way of modeling this trade-off. We don’t need absolute statistical certainty about the CTR of our videos, we just need to be certain enough that we are making the right decision. Multi-armed bandit algorithms are a class of algorithms are that are explicitly designed to optimally manage the tradeoff between exploring options for the purposes of estimating the quality of that option and exploiting the best known option.
There are a few different multi-armed bandit algorithms but most of them perform similarly in practice. The new Digg video recommender is a variant of a multi-armed bandit algorithm called Thompson Sampling. We’re still in the process of tweaking parameters and getting it right, so we’ll defer the details our of system to a future blog post.
Live by the data, die by the data
The process of building this recommender taught us a bunch of things. If I had to choose one big takeaway, it would be:
When deploying a new algorithm, think hard about how to measure and evaluate its performance.
This lesson is certainly not unique to Digg, as many data scientists will tell you that proper monitoring and evaluation is at least as important as the right algorithm. Since this was the first time we were seriously measuring the performance of the video recommender, we didn’t quite have a well calibrated baseline. This turned out to be a problem.
We officially turned on the new video recommender system on February 2nd and started to monitor the recommender CTR, which is defined as the fraction of users who clicked on a recommended video from a Digg video page. Mathematically, it would be:
rec_CTR = # of clicks on recommended vids / # of digg video views
The stats looked good for the first few days after we released the new system but later in the week things started to take a dive…
(We normalized the values in the plot to show the relative change in CTR, rather than the absolute numbers. We’re keeping those numbers secret.)
We expected to see peaks and valleys in recommender CTR due to time of day effects but even the average CTR was steadily declining. We had just deployed the system and there were no other changes to Digg as far as we could tell, so the natural culprit was our new system. It doesn’t take a data scientist to tell you that.
However it does take a data scientist to make some half-baked statistical argument about variance in an effort to keep his job. So let’s see how I did that.
We need to understand what the data looks like before we can find the best strategy to manipulate it. Every time a user visits a digg.com video page, we generate a row of data that looks like this:
referrer, focal_vid_url, recommended_vids, clicked_recommendation
- referrer: Where was the user before they arrived at this video page. Did they come from the homepage, from Facebook, etc.,?
- focal_vid_url: The video the user is currently watching.
- recommended_vids: The videos recommended by the algorithm to watch next.
- clicked_recommendation: A binary variable indicating whether the user clicked on any of the recommendations.
My first instinct was to look at the distribution of focal_vid_url’s to get a sense of what videos are being watched. The plot below shows a histogram of visits to Digg videos on Friday, February 2nd (other days in that period show a similar pattern). Each bar represents a different video and the height of that bar is the number of views it received on February 2nd. I’m just showing the top 15 videos in this plot.
There’s one video that received about 15 times as many views as any other video, and more views than every other video combined. This is unusual for us; the majority of views are typically concentrated in a small set of videos (around 5 or so) but we rarely see it concentrated on a single video. So what is this video?
This is the first ad addressing domestic violence and sexual assault during the Super Bowl. And since most people watch…digg.com
It’s a Super Bowl ad against domestic violence (recall that February 2nd was a few days before the Super Bowl) but the ad is from last year’s Super Bowl. This video was getting tons of traffic from Facebook but almost 0% of those users were clicking on any recommended video. Given that these users accounted for so much of our total traffic, this effectively chopped the recommender CTR in half.
Here’s the weird thing about this video: we hadn’t posted it to our homepage or Facebook page within the last year (it was posted in February 2015), so the recent surge in popularity wasn’t a result of our efforts. This video was being watched by hundreds of thousands of people (or bots) but they were discovering it in a very different way than the usual methods for reaching our audience. Whoever these viewers were, they are probably not similar to Digg’s typical demographic. This sounds like a good foundation for the “it wasn’t my fault” argument.
A vaguely principled approach to juking the stats
We were able to find a reasonable explanation for the dip in recommender CTR but it would be a lot of work to have to repeat this analysis every time something odd happens. The traffic pattern from this Super Bowl ad is not common but we get viral events like this with some regularity. Our articles and videos have a habit of popping up in parts of the Internet that we didn’t intend, like the time that Google placed us as the number one search result for Hotline Bling (Drake SEO is not at the top of our priority list).
We need to develop a way to measure recommender CTR that is robust to these crazy traffic patterns. If we thought about what factors influence the usage of the recommender system on a given video page, we might come up with a model like this:
recommender CTR = recommended_videos + user_characteristics + time_effects
In words, recommender CTR is affected by the videos chosen by the algorithm, the characteristics of the users watching the original video, and time effects (weekends see less interested traffic, people click on more videos at 1pm than at 1am, etc.,). The user_characteristics term reflects the fact that certain people are just more likely to click on recommended videos than others. Loyal users of Digg tend to watch more videos on average than people coming from Facebook or Google.
We really want to measure the quality of the videos recommended by the algorithm, but that is hard to directly observe, so we instead measured recommender CTR as a proxy. Our hope was that the nuisance terms, user_characteristics and time_effects, would cancel out given enough time and page views. This Super Bowl ad example shows this is not always a good assumption, especially when we monitor recommender CTR on a short time scale.
So if recommender CTR is not the right thing to measure (or not the only thing we should be measuring), what should we do? Unfortunately there isn’t a single “right” answer to this problem because accounting for variance in web experiments is hard. One approach we could take is to to measure lots of user characteristics and use historical data to build a model for how those characteristics affect CTR. That model would allow us to separate out the effects of nuisance parameters, like time and user characteristics, and actually measure the algorithm’s performance. This will probably give better estimates but comes at the significant expense of building another system.
Luckily there is a much easier method for our setting, which is to first calculate the average CTR by focal video, and then calculate the average of those focal_vid_CTR’s. In code, this looks like:
focal_vid_ctr_list = click_data.groupby(‘focal_vid_url’)[‘clicked_recommendation’].mean()
recommender_CTR_new = numpy.mean(focal_vid_ctr_list)
Whereas the first method was:
recommender_CTR = click_data[‘clicked_recommendation’].mean()
The intuition behind this new method is that it reduces the influence of any one video on the recommender CTR. In fact, all videos affect recommender_performance equally, even if one has 100,000 views and the other has 1,000 (we filter out videos with very few views before computing this metric). This reduces the effect of large variations in the user_characteristics term because all that crazy activity can only change the estimate for a single video. In turn, that video only has a small influence on recommender CTR. We should emphasize that this heuristic is tailored to Digg. When we get large swings in traffic it’s because a single video has gone viral some place, and thus any associated statistical weirdness is isolated to a single video. We would need to adopt a different approach if these large bursts of traffic affected all videos at once.
These two methods give very similar estimates most days but the new method gives better estimates when there’s a viral video. When we recompute recommender CTR for the days before and after the Super Bowl (2/2- 2/11) with this new method, things look a lot better and I get to keep my job.
We were somewhat unlucky with timing; we just released this new video recommender into the wild and had some uncertainty about how it would perform. If this drop had happened after a month of stable performance, we probably wouldn’t have worried about it unless the drop persisted for a week or so. This episode gave us a good moment to think about proper evaluation of our system and change a number of things. These numbers will ultimately give us a much better picture on the recommender’s performance.
That’s about it for now. We’re continuing to experiment with this new algorithm and we’ll be sure to post more results as we find them. We’re also exploring how we might apply this algorithm to other parts of Digg.
Greg Stoddard is a computer science PhD student and a data scientist in residence at Digg. In the case of a robot uprising, he wouldn’t side with the robots but he wouldn’t fight against them either. He occasionally makes dumb jokes on Twitter at https://twitter.com/gregstod