The Explore-Exploit dilemma

Saikat Kumar Dey
Data Science for Everyone
5 min readDec 10, 2018


You want to learn about “Machine Learning”. You make a Google search. There are 10–15 results linking to the various sources of information. You open the links one by one. You take a one-shot look at them before deciding which ones among the lot you would invest time in reading. This way, you would get to know of some credible sources for reading about ML, which you will use the next time. We call the act of going through the links one by one, Exploration. We call the act of using the sources next time when you want to read about ML, Exploitation. Exploration gave you information about the sources (websites, blogs, youtube channels). Based on it, you create a mental note (or bookmarked) of which ones to use the next time (Exploitation).

Exploration is a time-consuming process. It is necessary because you have no information about the source at hand. Exploitation is when you make use of the hard work done in exploration. But how long do you explore? When do you know you should exploit the information?

There’s a very interesting problem called Multi-armed bandit related to Explore-Exploit trade-off. So, imagine that you are in a casino playing a 3-armed slot machine. The slot machine has 3 levers. You pull one of them and you get some money in exchange. In the beginning, you have no knowledge about the machine. You do not know which lever will give you the most reward (or money in this case). Each one of the levers has a certain probability of reward associated with it. What would you do to maximize the reward for N trials (say 100)? Let’s see, we could take the following approaches :

  1. Always Exploiting: You could always pull one of the levers (say, lever 1). The problem is that you wouldn’t know if you could earn more by pulling other levers.
  2. Always Exploring: You could pull a lever at random every time. The problem is that you wouldn’t learn anything from your previous actions. (Always exploring)
  3. Epsilon Greedy: You pull the best lever 70% (say) of the time (exploitation). You pull a lever 30% of the time at random (exploration). Let’s say that we have a coin with a probability that a head comes up is 0.3. So, we’ll toss a coin, if the probability is 0.3 or less, we’ll pull a lever at random. If the probability is more than 0.3, we’ll select the best lever so far. The value of epsilon is 0.3, which is also called exploration probability.
  4. Epsilon First strategy: You pull all levers for a fixed number of times (say, 20 times each). Observe which one gave you the most reward (say, lever 3). Then, keep pulling lever 3 every time.
  5. Epsilon Decreasing strategy: This is the same as the epsilon-greedy approach. But, the epsilon decreases with the number of trials in this case. This means, in the beginning, we’ll explore more (and exploit less). After good enough trials, we’ll exploit more (and explore less).

These are not an exhaustive set of approaches that we could take. There are other approaches like Adaptive epsilon-greedy strategy based on value differences (VDBE) and Contextual-Epsilon-greedy strategy. Which approach is the best? We shouldn’t use approach 1 or 2 because of the obvious problems mentioned. Epsilon-Greedy is the simplest to implement and we could increase the sophistication of the algorithm.

Where do we use the explore-exploit trade-off in the industry?

This is the real question that you might have been wondering while reading this article. If you’ve come this far (exploitation phase for you), here’s your reward. We’ll discuss two approaches in this post.

  1. An alternative to A/B testing: What is A/B testing? You are launching a new website and you have 2 designs for it with a call to action (click on “buy now” for example). You are not sure which one to use, so you show version A to 1000 users and version B to another 1000 users. The design which has the greatest conversion rate wins. What is the issue here? User tastes are volatile. Here, we are running a one-shot experiment and selecting the design. What if the taste of our user-base changes after a certain point? Should we run the experiment again? We could reduce this problem to a multi-armed bandit problem. We could use the strategies (epsilon-*) mentioned above. Let’s say that we use the epsilon-greedy approach. We choose an epsilon of 0.5. This means, we toss a coin and if it is HEADS, we select one of the versions at random. If it is TAILS, we select the best version over last N trials, (say A). We could either choose the epsilon probability manually or we could use other strategies like epsilon-decreasing or adaptive-epsilon strategy (where epsilon gets updated over time). This approach is robust since you always have an element of exploration and can reward the lesser-chosen version if the user taste changes.
  2. Ranking Algorithms: Let’s say you are building a “People Also Search for” feature for Google. [ The following approach is, of course, an oversimplified version. Google might not be using this. ;) ]

What do you want to optimize using this feature? You want to make sure that the user keeps on searching relevant items (so you could show ads and make money). People search a lot of stuff before or after searching titanic. You want to predict what items people might search for next and show it to them. This means that you are figuring out P(x/Titanic) = P(#clicks/search on movie x/# search for Titanic). This means the probability that user clicks on movie x given that the user has searched for Titanic. How do you solve this if you don’t have a lot of data? Sure, you could always use content-based recommendation algorithms. Show the movies which are like Titanic (genre, director, actors, synopsis, etc). But, in what order do we show them? Do we always show the most similar items at the top? Or do we change them depending on some signals?

Let’s formalize the problem as a contextual multi-armed bandit problem. You could rank the items based on P(x/Titanic) and the reward.

1. Let’s say you have a set of movies [x1, x2, x3, ……, xN ] that are like Titanic.

2. You have to fill N slots.

3. For each slot, you perform steps 4–6 until all k slots are full.

4. Remove already selected items from the set.

5. You generate a random number between [0,1].

6. If it is less than epsilon, explore — you choose an item from the set at random and put it in the next available position.

If it is more than epsilon, exploit — you choose the item having the highest reward. This could be an expected value P(x/Titanic) * reward, depends on how you formulate the equation)

7. When the user clicks on item xi, update reward(Titanic, xi) by +r (say +5).

Choose epsilon based on one of the strategies mentioned earlier.

The items will be in random order at first. The system learns the ranking of the items from the user-interactions.

The best part? This system is always going to learn. Let’s say Titanic III comes up next. And it excites a lot of people. So, the clicking/searching behavior should reflect on the ranking.

You could use this approach to bubble up new users in the system (Users/Channels/Content-Creators/Artists).

So, who learned something new? :)