Exploration vs. Exploitation

Algorithms we use in our lifes.


In Machine Learning, the “exploration vs. exploitation tradeoff” applies systems that want to acquire new knowledge and maximize their reward at the same time. Think of a row of slot machines in a casino. Each machine has its own probability of winning. As a player you want to make as much money as possible. Here’s the dilemma: How do you figure out which of the machine has the best odds, while at the same time maximizing your profit? If you constantly played one of the machines you would never learn anything about the odds of the other machines. If you always picked a machine at random you would learn a lot about the odds of each machine but probably wouldn’t make as much money as you could have always playing the “best” machine. One major application of band algorithms is A/B testing in website optimization, where website variations corresponds to bandits that yield different returns.

It’s an interesting thought exercise to apply these principles to the choices we make in our lives. Do we try drastically different things to explore what makes us happy, or do we exploit our current situation and knowledge to make the best out of it?

It seems like today’s society is loosely based on what is called an epsilon-decreasing strategy. With this strategy weexplore with probability epsilon, and exploit with probability 1 — epsilon. Epsilon decreases over time. In other words, we try to figure out what we want to do at a young age, and then stick with it throughout our lives. Throughout high school and college we explore a variety of subjects and are open to new experiences. The older we get the more likely we are to settle on a path, and major life or career changes become less likely.

Is this strategy optimal? Or is it something that society has imposed on us? What would happen if we tried a different strategy?

Before looking at other strategies let’s take a step and think about our goals. In the slot machine example we wanted to make as much money as possible. This was the metric we tried to optimize. But what is the corresponding metric for our lives? How much money we make? How much free time we have? How safe our family is? Clay Cristensen, the author of the Innovator’s Dilemma, has a written a book about the topic and has an interesting metric for himself: “How much you help individual people to become happier”.

Let’s take a look at a couple of strategies and think about how they could play out in our lives:

Epsilon-Decreasing with Softmax

This is the strategy briefly described above. We exploit the current situation with probability 1 — epsilon and explore a new option with probability epsilon, with epsilon decreasing over time. In the case of exploring a new option, we don’t just pick an option at random, but instead we estimate the outcome of each option, and then pick based on that (this is the softmax part). I think this strategy is pretty close to how most of us live our lifes. In a sense, epsilon here models our risk aversion. As we become older we become more risk-averse and less likely to explore new options, like a major career change, even if they could yield high returns.

Upper-confidence bound strategy

This stratey loosely corresponds to living a very optimistic life. In addition to estimating the outcome of each option, we also calculate the confidence of our estimation. This gives us an interval of possible outcomes. For example, joining a startup could yield $0 of profits or $1B of profits. Since most startups fail the expected outcome would probably be much closer to $0 than $1B, but the chance of making $1B still exists.

Now, here’s our strategy: We always pick the option with the highest possible outcome, even if it that outcome very unlikely. The intuition behind this strategy is that options with high uncertainty usually lead to a lot of new knowledge. We don’t know enough about the option to accurately estimate the return and by pursuing that option we are bound to learn more and improve our future estimations. In simulated settings this algorithm does well when we have many options with very different variances. It seems like this strategy could work well in part sof the world that are very forgiving of failures, for example Silicion Valley.

Contextual-Epsilon-greedy strategy

This strategy is similar to epsilon-greedy, but we choose the value of epsilon based on how critical our situation is.When we are in a critical situation (large debt, need to provide for a sick family) we will always exploit instead of explore — We do what we know works well. If we are in a situation that is not critical we are more likely to explore new things. This strategy makes intuitive sense, but I believe that it is not commonly followed. Even in non-critcal situation we often choose to keep doing what we have always done due to our risk-averse nature.

Do you think that the strategies in our lives make sense? Are there others you can think of?

Note: I originally posted this at http://dennybritz.com/2014-04/exploration-vs-exploitation.html