YouTube recommendation architecture
When users watch videos on YouTube, a ranked list of recommended videos that they might like is displayed at the side. As adoption grew, the more time people spent on YouTube, the more ads they were served, which in turn meant more revenue for YouTube. However, this has to be balanced against providing users with useful content that they actually want to watch.
The team at YouTube wanted to optimize for user engagement (time spent) without compromising on user satisfaction (likes). In addition, the position of the recommendations introduces an implicit selection bias: videos recommended right next to the video player get played more often, even when videos placed lower might offer better engagement and satisfaction.
YouTube uses deep learning to achieve this. The team combines two components:
- A foundational wide and deep model.
- A tuning layer to achieve the best performance.
Wide and Deep Model
The model described in the paper focuses on these two main objectives. It uses a Wide & Deep architecture, which combines the power of a wide linear model (memorization) with a deep neural network (generalization). The model generates a prediction for each of the defined objectives, both engagement and satisfaction. The objectives are grouped into binary classification problems (e.g. whether a video is liked or not) and regression problems (e.g. the rating of a video).
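To make the two kinds of objectives concrete, here is a minimal sketch of a combined training loss. It assumes binary cross-entropy for the classification objective and squared error for the regression objective, summed with equal weights; the article does not specify these choices, and the field names `liked` and `rating` are made up for illustration.

```python
import math

def binary_cross_entropy(label, prob, eps=1e-12):
    """Loss for a binary objective such as 'liked or not' (label is 0 or 1)."""
    prob = min(max(prob, eps), 1.0 - eps)  # keep log() away from 0
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def squared_error(target, prediction):
    """Loss for a regression objective such as a predicted rating."""
    return (target - prediction) ** 2

def multitask_loss(example, predictions):
    """Sum one loss term per objective (equal task weights are an assumption)."""
    return (binary_cross_entropy(example["liked"], predictions["liked"])
            + squared_error(example["rating"], predictions["rating"]))

# Hypothetical example: the user liked the video and rated it 4.0.
example = {"liked": 1, "rating": 4.0}
good = multitask_loss(example, {"liked": 0.9, "rating": 3.8})
bad = multitask_loss(example, {"liked": 0.2, "rating": 2.0})
```

Predictions that are closer to the observed labels on both objectives give a lower total loss, so `good` is smaller than `bad`.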
In the deep part of the Wide & Deep model, a Multi-gate Mixture of Experts (MMoE) model is adopted. Features of the current video (content, topic, upload time, title, etc.) and of the user who is watching (time watched, user profile, etc.) are used as input. The foundational concept behind MMoE is to share weights efficiently across different objectives. On top of the shared bottom layer sit multiple experts, all of which are used for predicting the different objectives. Every objective has its own gate function. The gate is a softmax function that takes the output of the shared layer as input and produces one weight per expert; these weights determine how much each expert's output contributes to that objective. One reason to have several experts is that different experts turn out to matter more for different objectives, which is one of the findings that became evident with this model. Compared to models with a shared-bottom architecture, training an MMoE model is less affected when the different objectives have a low correlation.
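The gating idea can be sketched in a few lines of plain Python. This is an illustrative forward pass only, not YouTube's implementation: the layer sizes, the random weights, the ReLU activation and all function names are assumptions chosen to keep the example small.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    """Multiply a (rows x len(x)) weight matrix by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(vec):
    return [max(0.0, v) for v in vec]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def mmoe_forward(shared, expert_weights, gate_weights):
    """Return one mixed representation per objective, plus the gate weights.

    shared:         output of the shared bottom layer (list of floats)
    expert_weights: one weight matrix per expert
    gate_weights:   one (num_experts x len(shared)) matrix per objective
    """
    # Every expert sees the same shared input.
    expert_outputs = [relu(linear(w, shared)) for w in expert_weights]
    mixed, gates = [], []
    for g in gate_weights:
        # The gate for this objective is a softmax over the experts.
        gate = softmax(linear(g, shared))
        # Weighted sum of the expert outputs, one dimension at a time.
        combo = [sum(p * out[d] for p, out in zip(gate, expert_outputs))
                 for d in range(len(expert_outputs[0]))]
        mixed.append(combo)
        gates.append(gate)
    return mixed, gates

# Toy sizes (assumed): 3 experts, 2 objectives (e.g. engagement, satisfaction).
rng = random.Random(0)
shared_dim, expert_dim, num_experts, num_objectives = 8, 4, 3, 2
shared = [rng.uniform(-1, 1) for _ in range(shared_dim)]
experts = [random_matrix(expert_dim, shared_dim, rng) for _ in range(num_experts)]
gate_mats = [random_matrix(num_experts, shared_dim, rng) for _ in range(num_objectives)]
mixed, gates = mmoe_forward(shared, experts, gate_mats)
```

Each entry of `gates` sums to 1 and tells you how strongly that objective relies on each expert. In a full model, each vector in `mixed` would feed a small objective-specific head that produces the final prediction.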
The wide part of the model focuses on reducing the selection bias introduced by the position of the recommended videos. This wide part is referred to as a “shallow tower”, which can be a simple linear model using simple features such as the position at which a video was clicked and the device used to watch it.
The output of the shallow tower is combined with the output of the MMoE model. In this way, the model takes the position of the video into account explicitly. During training, a dropout rate of 10% is applied to prevent the position feature from becoming too important in the model. The position feature had to be built into the architecture: if you did not use the Wide & Deep architecture and simply added the position as one more input feature, the model might not focus on that feature at all. That would result in misalignment between the models and a poor outcome. This is why the combination of the wide and deep parts is a foundational aspect of this model.
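A rough sketch of how a shallow tower could be combined with the main model is shown below. The article only says the two outputs are "combined"; adding the two logits before a sigmoid, treating a dropped position as a missing feature, and all weights and names here are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def shallow_tower_logit(position, device, position_weights, device_weights):
    """Linear bias term; position=None means the feature is missing."""
    pos_term = 0.0 if position is None else position_weights[position]
    return pos_term + device_weights[device]

def predict_click(main_logit, position, device, position_weights, device_weights,
                  training=False, drop_rate=0.1, rng=random):
    """Add the shallow-tower logit to the main-model logit, then squash."""
    if training and rng.random() < drop_rate:
        position = None  # drop the position feature for this training example
    bias = shallow_tower_logit(position, device, position_weights, device_weights)
    return sigmoid(main_logit + bias)

# Hypothetical learned weights: slots nearer the player get a larger bias.
position_weights = [0.8, 0.4, 0.1, -0.2]
device_weights = {"phone": 0.05, "tv": -0.05}

# Same main-model logit, different display positions.
top_slot = predict_click(0.3, 0, "phone", position_weights, device_weights)
low_slot = predict_click(0.3, 3, "phone", position_weights, device_weights)
no_position = predict_click(0.3, None, "phone", position_weights, device_weights)
```

With the same main-model logit, the top slot gets a higher predicted click probability than the lower slot, which is exactly the position effect the shallow tower is meant to absorb. Scoring with the position missing removes that effect, so the remaining score reflects the video and device rather than where the video was shown.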
A separate ranking model is added on top of this model. It is simply a weighted combination of the output vector, whose entries are the predictions for the different objectives. The weights are manually tuned to achieve the best performance across the different objectives.
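The weighted combination can be illustrated as follows. The objective names, the weights and the per-video predictions are invented for the example; in practice the weights would be the manually tuned values mentioned above.

```python
def combined_score(predictions, weights):
    """Weighted sum of the per-objective predictions for one video."""
    return sum(weights[name] * value for name, value in predictions.items())

def rank(candidates, weights):
    """Return video ids sorted from highest to lowest combined score."""
    return sorted(candidates,
                  key=lambda vid: combined_score(candidates[vid], weights),
                  reverse=True)

# Hypothetical, hand-tuned weights: satisfaction (like) counts the most.
weights = {"click": 1.0, "watch_time": 0.5, "like": 2.0}

# Hypothetical per-objective predictions for three candidate videos.
candidates = {
    "video_a": {"click": 0.60, "watch_time": 0.20, "like": 0.05},
    "video_b": {"click": 0.40, "watch_time": 0.50, "like": 0.30},
    "video_c": {"click": 0.50, "watch_time": 0.10, "like": 0.10},
}

ranking = rank(candidates, weights)
```

Here `video_b` ranks first even though it has the lowest click prediction, because the weights favor watch time and likes. Changing the weights changes the ordering without retraining the underlying model, which is why this layer is easy to tune by hand.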
The result of this recommendation architecture is that replacing shared-bottom layers with MMoE improves the performance of the model on both engagement (time spent watching recommended videos) and satisfaction (survey responses).
In addition, the engagement metric improves because the shallow tower reduces the selection bias. This is a significant improvement compared with just adding the input features to the MMoE model.
MMoE models can be efficient when you need a model with multiple objectives. However, even with a powerful and complex model architecture, humans still manually adjust the weights in the last layer, which determines the actual ranking from the different objective predictions. The complex Wide & Deep model helps you design a network that builds in some features that are important to you.