Applying Deep Learning to Collaborative Filtering: How Hulu builds its industry leading recommendation engine

Author: Yin Zheng

Modern online services rely heavily on recommendation systems to help users find items that they might be interested in and provide personalized experiences. Over the past decade, collaborative filtering (CF) has become the de facto standard technique of building a recommender system. Collaborative filtering could be an umbrella term for various techniques embodying the basic idea, that is to predict user preferences by analyzing past user behaviors and establishing relevance between items and also between users. At Hulu, product features like Personalized Masthead, Watchlist and Top Picks are in part powered by collaborative filtering.

Figure 1. Part of Hulu’s Recommendation Products.

Many previous works are based on Matrix Factorization, such as SVD++[1], PMF[2], Local Low-Rank Matrix Approximation (LLORMA) [3], etc. Recently, with the impressive successes of deep learning in computer vision and NLP, it is tempting to apply neural networks to collaborative filtering. In that direction, RBM-CF [4] is one of the most famous works, which performs well in practice and in some challenging online contests. However, there is still space for improvement for broader application in the industry, in terms of both efficiency and accuracy.

This year, researchers at Hulu invented a novel neural network based collaborative filtering approach, called neural autoregressive distribution estimator for collaborative filtering (CF-NADE) [5], which is state-of-the-art on several challenging public benchmarks. This work was accepted and presented at this year’s edition of International Conference for Machine Learning (ICML 2016).

How does CF-NADE work?

CF-NADE models the distribution of the vector of user ratings, optimized by maximizing the joint probability of all the vectors.

Figure 2. A demonstrative example of CF-NADE.

Let’s take an example. Suppose a user rated 4 movies, “Transformers”, “SpongeBob”, “Teenage Mutant Ninja Turtles” and “Interstellar”, with scores 4,2,3 and 5, respectively, on a 5-star scale. In CF-NADE, the joint probability of vector (4,2,3,5) is factorized as a product of conditionals by chain rule, which are:

  1. The probability that the user gives “Transformers” 4-star conditioned on nothing;
  2. The probability that the user gives “SpongeBob” 2-star conditioned on giving “Transformers” 4-star;
  3. The probability that the user gives “Teenage Mutant Ninja Turtles” a 3-star conditioned on giving 4-star and 2-star to “Transformers” and “SpongeBob”, respectively;
  4. The probability that the user gives “Interstellar” a 5-star conditioned on giving 4-star, 2-star and 3-star to “Transformers”, “SpongeBob” and “Teenage Mutant Ninja Turtles”, respectively;

Each conditional is modeled by a feed-forward neural network, and parameters of these neural networks are shared among all the networks. Training CF-NADE is by minimizing the negative log-likelihood of the probability of the vector among all users. With a trained model, given a user’s rating history, we can predict his/her preference over an unrated movie.

Ordinal Nature of Rating Data

There are several tricks to improve the performance of CF-NADE, such as sharing parameters between ratings, etc. We refer the readers to [5] for more details. Notably, one of the most interesting tricks is to make use of the ordinal nature of ratings.

Figure 3. Ordinal Nature of Rating Data

1- to 5-star ratings could simplistically be viewed as 5 different labels, as in previous works, and then predictions are 5-class classification. But there is more structural information to be exploited in ratings. That is, if a user gives a 4-star rating to an item, his/her preference to give 4-star, 3-star, 2-star and 1-star for that item should decrease monotonically, and similarly, the preference on 4-star and 5-star should also decrease monotonically. We propose to optimize CF-NADE for an ordinal cost: instead of maximizing the probability of the true ratings, maximize the probability of the ranking of the ratings, as depicted in Figure 3. For example, if the true rating is 4-star for an item, we train the model to optimize for maximum probability of the rankings [4>3>2>1] and [4>5], where [a>b] means that a has a higher predicted probability than b.

Visualization of CF-NADE

As stated in [5], CF-NADE is state-of-the-art on several public benchmarks. A by-product of the predictive model is that hidden representations of movies/TV shows can be learnt, which we visualize as follows.

Figure 4. tSNE of the representations of movies for MovieLens 1M dataset.

Figure 4 is the tSNE of the representations of movies in MovieLens 1M dataset, where each dot corresponds to a different movie and is colored according to its genre. As movies in MovieLens are usually labeled with more than one genres, plotting all of them would render the figure incomprehensible. Here we are only showing two very different genres, “Documentary” and “Children”. One can see from the figure that the hidden representations of movies corresponding to the two genres are mostly separateable.

Figure 5. Top 5 most similar movies of the input movies based on the learned movie embedding.

Hidden representations can also be interpreted as embeddings in a low-dimensional vector space, where we could identify movies similar to a movie by, for example, finding the nearest neighbors. Figure 5 depicts the top 5 nearest neighbors of the input movies (left-most column) using cosine similarity. We could see that the top 5 most similar movies to “Star Trek VI” are other Star Trek movies; top 5 most similar movies to “Sleepless in Seattle” are all romance movies; and top 5 most similar movies to “Lion King” are all animations.

With what is shown above, we could comfortably use representations learnt in CF-NADE as features in other machine learning-related tasks.

Conclusion and Future Work:

CF-NADE is a successful application of deep learning in recommendation systems, it is the state-of-the-art on several public benchmarks with explicit feedback, where users give each item an explicit rating. While in practice, explicit feedback is rare, but implicit feedback like watch/browse/search/purchase behaviors are abundant. Adapting CF-NADE to implicit feedback will be our focus in near future.


  1. Koren, Yehuda. “Factorization meets the neighborhood: a multifaceted collaborative filtering model.” Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2008.
  2. Mnih, Andriy and Salakhutdinov, Ruslan. Probabilistic matrix factorization. In Advances in neural information processing systems, pp. 1257–1264, 2007.
  3. Lee, Joonseok, Kim, Seungyeon, Lebanon, Guy, and Singer, Yoram. Local low-rank matrix approximation. In Proceedings of The 30th International Conference on Machine Learning, pp. 82–90, 2013.
  4. Salakhutdinov, Ruslan, Mnih, Andriy, and Hinton, Geoffrey. Restricted boltzmann machines for collaborative filtering. In Proceedings of the 24th international conference on Machine learning, pp. 791–798. ACM, 2007.
  5. Yin Zheng, Bangsheng Tang, Wenkui Ding, Hanning Zhou. A neural autoregressive approach to collaborative filtering, In Proceedings of the 33th international conference on Machine Learning, NYC, 2016.