Recommendation Systems using Reinforcement Learning

Aishwarya Srinivasan
IBM Data Science in Practice
4 min read · Dec 31, 2019

In industries like e-commerce, retail, news, and music streaming, recommendation systems are among the most important drivers of customer retention. Presenting users with what is most likely to interest them is crucial. Identifying the most attractive content and keeping customers engaged with it can also generate significant revenue for the company.

Models are built on various data entities, including user details, interests, and trending content, to recommend the most relevant content to customers. Companies like Spotify, Netflix, and HBO use sophisticated recommendation systems for video and song recommendations. Targeted marketing is one segment of recommendation systems.

A few years ago, Netflix organized a challenge, the “Netflix Prize”, inviting people to build a recommender system better than its own in return for prize money.

Traditional Recommendation Systems

Traditional recommender systems follow two paradigms: collaborative filtering and content-based systems.

In collaborative filtering-based methods, the recommendation is built over the “user-item interaction matrix”, which records users’ past interactions with the items. The underlying idea is to detect similar users, based on their proximity in this matrix, and to infer each user’s interests from those of similar users. A collaborative filtering algorithm can be either memory-based or model-based. In the memory-based method, the user most similar to a new user is identified, and that user’s most-liked content is recommended. The memory-based method has no concept of variance or bias, as the error cannot be quantified. In the model-based method, a generative model is built on top of the user-item interaction matrix and is then used to make predictions for new users. In this variant, model bias and variance are observed.
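To make the memory-based method concrete, here is a minimal sketch in plain Python. The ratings, the user names, and the choice of cosine similarity as the proximity measure are all assumptions made for illustration; other similarity measures would work the same way.

```python
from math import sqrt

# Toy user-item interaction matrix (hypothetical data): one row per user,
# one column per item, values are ratings, 0 means "no interaction yet".
ratings = {
    "alice": [5, 4, 0, 0],
    "bob":   [5, 5, 4, 1],
    "carol": [1, 0, 5, 4],
}

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend_from_neighbour(user, ratings):
    """Return the index of the unseen item that the most similar user
    rated highest, or None if `user` has already seen every item."""
    target = ratings[user]
    # Find the nearest neighbour by cosine similarity.
    neighbour = max(
        (name for name in ratings if name != user),
        key=lambda name: cosine(target, ratings[name]),
    )
    unseen = [i for i, r in enumerate(target) if r == 0]
    if not unseen:
        return None
    return max(unseen, key=lambda i: ratings[neighbour][i])

print(recommend_from_neighbour("alice", ratings))  # prints 2
```

Here "alice" is closest to "bob", so she is recommended the item bob rated highest among those she has not seen.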

In content-based recommendation systems, user information and preferences are taken into account in addition to the user-item interactions, along with content details such as popularity, description, or purchase history. The user features and content features are fed into a model, which works like a traditional machine learning model with error optimization. As this model contains more descriptive information about the content, it tends to have high bias but the lowest variance compared to the other modeling methods.
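As a rough sketch of "user features and content features fed into a model with error optimization", the snippet below fits a small linear model by stochastic gradient descent on squared error. The feature layout (genre preferences and genre tags), the ratings, and the choice of a linear model are all hypothetical; a real system could use any supervised learner.

```python
def features(user_prefs, item_tags):
    """Combine user and content features into one input vector.
    Assumption: element-wise products, so a feature is active when the
    user's preference matches the item's tag."""
    return [u * i for u, i in zip(user_prefs, item_tags)]

def predict(weights, bias, x):
    """Linear prediction of the rating."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def fit(samples, lr=0.1, epochs=500):
    """Fit weights and bias by stochastic gradient descent on squared error."""
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for x, y in samples:
            err = predict(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

# Hypothetical training data. User features: [likes_action, likes_drama];
# content features: [is_action, is_drama]; target: observed rating.
samples = [
    (features([1.0, 0.0], [1.0, 0.0]), 5.0),  # action fan, action title
    (features([1.0, 0.0], [0.0, 1.0]), 1.0),  # action fan, drama title
    (features([0.0, 1.0], [1.0, 0.0]), 1.0),  # drama fan, action title
    (features([0.0, 1.0], [0.0, 1.0]), 5.0),  # drama fan, drama title
]

weights, bias = fit(samples)
# Predicted rating for a drama fan watching a drama title.
print(round(predict(weights, bias, features([0.0, 1.0], [0.0, 1.0])), 2))
```

Items are then ranked for a user by their predicted rating, and the top-scoring ones are recommended.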

Dive into Reinforcement Learning

By now, you might have understood that the goal is to recommend the right content to the right user in order to get better content reviews. Recommendation systems fit naturally into reinforcement learning models built on the Markov property. The problem can be formulated with the content as the state, the next best content to recommend as the action, and the user satisfaction, conversion, or review as the reward. Each piece of content used to train the model can be converted into a vector embedding, which means that the action space is continuous rather than discrete. How the embeddings are prepared from the content varies with the requirements and depends on subject-matter expertise. Natural Language Understanding is in itself a huge research domain, with multiple techniques for extracting information from content. This concept has also been referred to as “maximization of preserved information”.
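One way to work with a continuous action space (an assumption for illustration, not something the formulation above prescribes) is to let the agent output a point in the embedding space and then recommend the real item whose embedding lies closest to it. The embeddings and item names below are made up.

```python
from math import sqrt

# Hypothetical 3-dimensional content embeddings. In practice these would
# come from whatever embedding method suits the content.
item_embeddings = {
    "thriller_movie": [0.9, 0.1, 0.0],
    "romance_movie":  [0.1, 0.9, 0.0],
    "cooking_show":   [0.0, 0.2, 0.9],
}

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_item(action_vector, embeddings):
    """Map a continuous action (a point in embedding space) to the
    closest real item."""
    return min(
        embeddings,
        key=lambda name: euclidean(action_vector, embeddings[name]),
    )

# A continuous action as an agent might produce it (made-up values).
print(nearest_item([0.8, 0.3, 0.1], item_embeddings))  # prints thriller_movie
```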

In accordance with the Markov property, the recommendation problem can be interpreted as each new recommendation to a user depending only on the current state, independent of the previous recommendations. In this context, a reinforcement learning model has another advantage: it balances exploration and exploitation. Not only does the algorithm suggest the content users are likely to find most useful, it also suggests some random content, which can spark new interests. Reinforcement learning models also learn continuously, so as the user’s interests change, the recommended content changes with them, which makes the model robust.
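A simple way to picture the exploration/exploitation balance is an epsilon-greedy rule: most of the time recommend the item with the highest estimated score, and occasionally recommend a random one. This is a minimal sketch; the scores, item names, and the value of epsilon are hypothetical, and epsilon-greedy is only one of several exploration strategies.

```python
import random

def epsilon_greedy(scores, epsilon, rng):
    """With probability epsilon explore (uniformly random item),
    otherwise exploit (item with the highest estimated score)."""
    if rng.random() < epsilon:
        return rng.choice(sorted(scores))
    return max(scores, key=scores.get)

# Hypothetical estimated scores for one user.
scores = {"jazz_playlist": 0.7, "rock_playlist": 0.4, "podcast": 0.1}

rng = random.Random(0)  # seeded so the run is repeatable
picks = [epsilon_greedy(scores, epsilon=0.2, rng=rng) for _ in range(1000)]

# Share of recommendations that went to the current best item.
print(picks.count("jazz_playlist") / len(picks))
```

A larger epsilon surfaces more unfamiliar content at the cost of showing the current best guess less often.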

The data is prepared so that the content viewed by the user is taken as the state, the next content seen by the user is the action, and the reward is formulated from the review, the user’s interaction time, or whether the user purchased the item (in the case of e-commerce).
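The data preparation described above can be sketched as follows. The session log, the content IDs, and the binary feedback values are hypothetical; the feedback could equally be a review score or an interaction time.

```python
from typing import NamedTuple

class Transition(NamedTuple):
    state: str     # content the user just viewed
    action: str    # next content the user saw
    reward: float  # feedback on that next content

# Hypothetical session log for one user: (content_id, feedback) in viewing
# order. Here feedback is 1.0 if the user purchased the item, else 0.0.
session = [("item_a", 0.0), ("item_b", 1.0), ("item_c", 0.0), ("item_d", 1.0)]

def to_transitions(session):
    """Turn one ordered session into (state, action, reward) tuples by
    pairing each viewed item with the item viewed right after it."""
    return [
        Transition(state=current, action=following, reward=feedback)
        for (current, _), (following, feedback) in zip(session, session[1:])
    ]

for transition in to_transitions(session):
    print(transition)
```

A session of n viewed items yields n - 1 training tuples.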

The model can be built and validated with various reinforcement learning algorithms, such as actor-critic, Q-learning, DDPG, or temporal-difference learning. The model can be evaluated against a metric if the ground-truth reward for each prediction is available, or it can be deployed in a beta-testing app with some dummy users and evaluated on their reactions to the recommendations.
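As an illustration of one of these algorithms, here is a minimal tabular Q-learning sketch trained on logged (state, action, reward) tuples. It treats the catalog as a small discrete set, which is a simplification, and it assumes that the next state is the recommended content itself. The transitions and hyperparameters are made up.

```python
from collections import defaultdict

# Hypothetical logged transitions: (state, action, reward).
# Assumption: after a recommendation, the next state is the recommended item.
transitions = [
    ("item_a", "item_b", 1.0),
    ("item_a", "item_c", 0.0),
    ("item_b", "item_c", 0.0),
    ("item_b", "item_d", 1.0),
]
items = ["item_a", "item_b", "item_c", "item_d"]

def q_learning(transitions, items, alpha=0.5, gamma=0.9, epochs=50):
    """Learn Q(state, action) by replaying the logged transitions."""
    q = defaultdict(float)  # (state, action) -> estimated long-term reward
    for _ in range(epochs):
        for state, action, reward in transitions:
            next_state = action
            best_next = max(q[(next_state, a)] for a in items)
            target = reward + gamma * best_next
            # Standard Q-learning update toward the bootstrapped target.
            q[(state, action)] += alpha * (target - q[(state, action)])
    return q

def recommend_next(q, state, items):
    """Recommend the item with the highest learned Q-value for this state."""
    candidates = [a for a in items if a != state]
    return max(candidates, key=lambda a: q[(state, a)])

q = q_learning(transitions, items)
print(recommend_next(q, "item_a", items))  # prints item_b
```

When logged rewards are available, the same table can be checked offline by comparing `recommend_next` against the actions that actually earned a reward in held-out sessions.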

You can find the datasets for recommendation systems here:

Thank you so much for reading! You can start building your own recommendation system using reinforcement learning.



Aishwarya Srinivasan

LinkedIn Top Voice 2020- Data Science || MS Data Science - Columbia University || IBM- Data Science Elite || Unicorn in Data Science || Scikit-Learn Contributor