Contextual multi-armed bandit (Intuition behind Netflix Artwork Recommendation)

Saket Garodia
May 7, 2020 · 5 min read

Understanding the multi-armed bandit algorithm through Netflix’s artwork example

Various artworks for Stranger Things

Let me start with an interesting case. Suppose Netflix is about to release Stranger Things. How will it decide on the artwork (the thumbnail) of the series in order to attract more viewers? You have probably heard of A/B testing, which companies use to test all kinds of UX and recommendation changes. But do you think Netflix can use A/B testing in a case like this? First let me explain what A/B testing is, and then I will come back to this question.

A/B testing is a statistical experiment in which some customers are shown the website with a newly added feature (experiment set A, e.g. a changed color of the sign-up button) and the other customers are shown the website without the feature (control set B). Statistical tests are then used to decide whether the new feature should be rolled out. Most online retailers and big technology companies run A/B tests regularly. For example, to check whether changing the color and size of the sign-up button increases the click-through rate, statisticians run an A/B test and check whether the observed difference is statistically significant before the change is implemented. If you want to learn more about A/B testing, I suggest going through this course by Google’s top statisticians: https://www.udacity.com/course/ab-testing--ud257.
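To make the "statistically significant" part concrete, here is a minimal sketch of one common test for comparing two click-through rates, a two-proportion z-test. The function name and the click counts are made up for illustration, and a real experiment would also need a sample size chosen in advance.

```python
import math

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that both groups click equally often.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical numbers: set A saw the new button, set B is the control.
z, p = two_proportion_z_test(clicks_a=260, views_a=2000, clicks_b=200, views_b=2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value (commonly below 0.05) suggests the difference between the two sets is unlikely to be due to chance alone.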

Now let’s come back to our question: can Netflix afford to conduct A/B testing for the artwork of a new series like Stranger Things? It might take a week or two to collect enough data to tell which artwork works better. But what if the artwork used in the test fails to interest customers? If a bad artwork keeps being shown for a week or two, Netflix could lose many viewers, and with them revenue. In cases where companies can’t afford A/B testing because of time constraints, they turn to a reinforcement learning concept called Multi-Armed Bandits.

Let me explain the intuition behind the Multi-Armed Bandit algorithm. Imagine you go to a casino where there are 3 machines. All 3 machines cost the same amount to play but pay out different rewards, and you do not know the payouts in advance. Your goal is to maximize your rewards. You can either start with a win on one machine and keep playing it without checking the rewards of the other machines, or you can explore the other machines as well and then settle on the machine with the best reward. The problem becomes complex when there are many machines, which correspond to the artworks in our Netflix example. We cannot afford to lose customers because a bad artwork was shown for a long period of time.

So basically, Multi-Armed Bandit problems boil down to the trade-off between exploration and exploitation. Exploration means trying new options to learn their rewards, and exploitation means sticking with the current best option instead of exploring further.
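The exploration and exploitation trade-off can be sketched with epsilon-greedy, one of the simplest bandit strategies: with a small probability epsilon we explore a random machine, otherwise we exploit the machine with the best win rate seen so far. This is only an illustration for the casino example. The win probabilities, the function name and the value of epsilon are assumptions, not something taken from a real system.

```python
import random

def epsilon_greedy(true_rates, rounds=10_000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy play; return (pulls, wins) per machine."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    wins = [0] * n_arms

    def observed_rate(arm):
        # Machines that were never played count as 0 until explored.
        return wins[arm] / pulls[arm] if pulls[arm] else 0.0

    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)               # explore
        else:
            arm = max(range(n_arms), key=observed_rate)  # exploit
        reward = 1 if rng.random() < true_rates[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
    return pulls, wins

# Three machines with made-up win probabilities, unknown to the player.
pulls, wins = epsilon_greedy(true_rates=[0.2, 0.5, 0.8])
for arm, (n, w) in enumerate(zip(pulls, wins)):
    print(f"machine {arm}: played {n} times, won {w}")
```

Setting epsilon to 0 gives pure exploitation and setting it to 1 gives pure exploration. Values in between let the player keep checking the other machines while mostly playing the best one found so far.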

So, in our problem, companies like Netflix can start with a set of artworks for Stranger Things shortlisted through market research and surveys. Let’s say they have around 20–30 artworks. They can begin by randomly allocating artworks to different customers and then adapt the allocation in real time, based on the clicks observed, using the Multi-Armed Bandit algorithm. The tradeoff compared with A/B testing is that you get weaker statistical evidence on how much better or worse each artwork is than the others, since the poorly performing ones are shown less often. But for decisions like this that come with a time constraint, getting more views in less time is more important, and therefore this reinforcement learning approach is used. This advantage of the Multi-Armed Bandit algorithm over A/B testing is described as minimizing regret: it reduces the time for which a bad artwork is shown, and so the number of customers lost.
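To get a feel for what minimizing regret means, the sketch below compares two ways of allocating artworks in a simulation: a uniform random split, as an A/B-style test would use, and Thompson sampling, a popular bandit strategy. Regret here is the expected number of clicks lost by not always showing the best artwork. The click-through rates are invented, only five artworks are used to keep it short, and nothing here is claimed to be what Netflix runs in production.

```python
import random

def simulate(true_ctrs, rounds, policy, seed=0):
    """Return the expected regret (lost clicks) of a policy over `rounds` views."""
    rng = random.Random(seed)
    n = len(true_ctrs)
    clicks = [0] * n   # times each artwork was clicked
    misses = [0] * n   # times each artwork was shown but not clicked
    best = max(true_ctrs)
    regret = 0.0
    for _ in range(rounds):
        if policy == "uniform":
            arm = rng.randrange(n)  # A/B-style: every artwork equally often
        else:
            # Thompson sampling: draw a plausible click rate for each artwork
            # from its Beta posterior and show the one with the highest draw.
            samples = [rng.betavariate(clicks[a] + 1, misses[a] + 1) for a in range(n)]
            arm = max(range(n), key=lambda a: samples[a])
        if rng.random() < true_ctrs[arm]:
            clicks[arm] += 1
        else:
            misses[arm] += 1
        regret += best - true_ctrs[arm]
    return regret

# Hypothetical click-through rates for five candidate artworks.
ctrs = [0.02, 0.04, 0.05, 0.08, 0.15]
print("uniform split regret:", round(simulate(ctrs, 20_000, "uniform"), 1))
print("Thompson sampling regret:", round(simulate(ctrs, 20_000, "thompson"), 1))
```

In a run like this the bandit should lose far fewer clicks, because it quickly stops showing the weak artworks, while the uniform split keeps showing them for the whole test.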

Now that we have an intuition behind the Multi-Armed Bandit algorithm, let’s understand what the Contextual Multi-Armed bandit algorithm is. If you are a data-nerd you must have noticed that the artwork for movies and shows on Netflix is different for different people. This is where ‘Contextual’ comes into play. Netflix shows different artworks for the same content to different types of people.

Consider personalizing the image used to depict the movie Good Will Hunting. Here Netflix might personalize this decision based on how much a member prefers different genres and themes. Someone who has watched many romantic movies may be interested in Good Will Hunting if Netflix shows the artwork containing Matt Damon and Minnie Driver, whereas a member who has watched many comedies might be drawn to the movie if Netflix uses the artwork containing Robin Williams, a well-known comedian.
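The Good Will Hunting example can be sketched as the simplest possible contextual bandit: a separate epsilon-greedy bandit for each type of viewer, so that the "best" artwork is learned per context instead of once for everybody. The viewer segments, artwork names and click probabilities below are all hypothetical, and real systems use much richer context than a single genre label.

```python
import random
from collections import defaultdict

ARTWORKS = ["damon_and_driver", "robin_williams"]
CONTEXTS = ["romance_fan", "comedy_fan"]

# Hypothetical click probabilities, invented for this simulation only.
TRUE_CTR = {
    ("romance_fan", "damon_and_driver"): 0.30,
    ("romance_fan", "robin_williams"): 0.10,
    ("comedy_fan", "damon_and_driver"): 0.10,
    ("comedy_fan", "robin_williams"): 0.30,
}

def run(rounds=20_000, epsilon=0.1, seed=0):
    """Learn which artwork to show for each viewer context."""
    rng = random.Random(seed)
    shows = defaultdict(int)   # (context, artwork) -> times shown
    clicks = defaultdict(int)  # (context, artwork) -> times clicked

    def observed_ctr(context, art):
        shown = shows[context, art]
        return clicks[context, art] / shown if shown else 0.0

    for _ in range(rounds):
        context = rng.choice(CONTEXTS)   # a viewer arrives
        if rng.random() < epsilon:
            art = rng.choice(ARTWORKS)   # explore
        else:
            # exploit the best artwork seen so far for this context only
            art = max(ARTWORKS, key=lambda a: observed_ctr(context, a))
        shows[context, art] += 1
        if rng.random() < TRUE_CTR[context, art]:
            clicks[context, art] += 1
    return shows, clicks

shows, clicks = run()
for context in CONTEXTS:
    favorite = max(ARTWORKS, key=lambda a: shows[context, a])
    print(f"{context}: mostly shown {favorite}")
```

The only change from the plain bandit is that the statistics are keyed by (context, artwork) instead of artwork alone, which is what lets different members see different artwork for the same title.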

Well, I found this concept very interesting. I hope you understood the intuition.

To summarize, teams working on conversion and experience optimization run several experiments and tests on a daily basis. Which method to use depends on the purpose of the test. If the purpose is to minimize regret, as in our example, the Multi-Armed Bandit algorithm can be used instead of A/B testing. When the statistics and the statistical significance of the difference matter more than the regret, or when the chance of losing customers because of the experiment is low, A/B testing is used.

Thanks. Kindly clap if you like the blog. Happy learning data science.

References:

https://vwo.com/blog/multi-armed-bandit-algorithm/