The Multi-Armed Bandit — to explore or exploit?

Tom Connor

Published in

10x Curiosity

6 min read · Jun 12, 2018

When faced with a decision to go with what you know or strike out in a new direction, which do you choose?

Life is a balance: blow all your money now on a good time and you will have nothing for retirement. Defer your good times to work hard now and you might find an unexpected twist upends those dream plans. Striking this balance is a continual juggle, and one I’ve been thinking about recently in relation to projects and idea generation.

Spend too much time fighting current fires, with none left for experimenting, broadening your horizons or learning new skills, and you will eventually have no new ideas to draw on. Ignore those fires at your peril, however: without some short-term stability or success, you may not have a future to apply those new ideas to.

A classic conundrum in business is the explore vs exploit trade-off. Given limited resources, you can do one, the other, or a bit of each. All of your future cash flows and business breakthroughs will come from exploring new ground and experimenting, but this work often has a low short-term payoff. Given the short-term focus of shareholders and the ever-present need to pay the bills, you will often exploit your current cash cows to their full extent. That is good while it lasts, but what have you got in development behind them? What are you doing to avoid your business being disrupted and becoming the next Blockbuster or Kodak? Too much exploring will limit your current cash flow; too little will limit your future cash flows.

Mathematically this problem has received much attention, most famously as the “multi-armed bandit” problem, which asks for the best strategy for playing a row of slot machines (pokies, or “one-armed bandits”) whose payout rates are unknown.

Naturally, you’re interested in maximizing your total winnings… it’s clear that this is going to involve some combination of pulling the arms on different machines to test them out (exploring), and favoring the most promising machines you’ve found (exploiting). (Algorithms to live by)

More recently this trade-off has become important in reinforcement learning, a branch of machine learning in which an algorithm explores the problem space looking for solutions that maximise value. The algorithm might settle on a local maximum when a better solution exists, so these algorithms are programmed to occasionally pick a random direction, which gives them a chance of stumbling on better solutions. (Pedro Domingos, The Master Algorithm)
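To make this concrete, here is a minimal Python sketch of one common way to “occasionally select a random direction”, usually called epsilon-greedy. The function name `epsilon_greedy`, the payout probabilities and the 10% exploration rate are illustrative assumptions, not taken from either book.

```python
import random


def epsilon_greedy(payout_probs, pulls=10_000, epsilon=0.1, seed=0):
    """Play a row of slot machines, picking at random a fraction epsilon of the time.

    payout_probs are hypothetical win probabilities, hidden from the player.
    Returns (total_winnings, pulls_per_machine).
    """
    rng = random.Random(seed)
    n = len(payout_probs)
    counts = [0] * n  # how often each machine has been pulled
    wins = [0] * n    # how often each machine has paid out
    total = 0
    for _ in range(pulls):
        if rng.random() < epsilon:
            # Explore: pick any machine at random.
            arm = rng.randrange(n)
        else:
            # Exploit: pick the machine with the best observed win rate so far
            # (untried machines count as 0.0, ties go to the first machine).
            estimates = [wins[i] / counts[i] if counts[i] else 0.0 for i in range(n)]
            arm = max(range(n), key=lambda i: estimates[i])
        reward = 1 if rng.random() < payout_probs[arm] else 0
        counts[arm] += 1
        wins[arm] += reward
        total += reward
    return total, counts


if __name__ == "__main__":
    machines = [0.2, 0.5, 0.7]  # made-up payout rates
    print("never explores:", epsilon_greedy(machines, epsilon=0.0))
    print("explores 10%:  ", epsilon_greedy(machines, epsilon=0.1))
```

With `epsilon=0.0` the player in this sketch never leaves the first machine it tried, which is the local-maximum trap described above. A little randomness gives it the chance to discover the better machines.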

An important consideration in deciding your strategy is the time frame over which you are investing:

Early on, when there’s much to learn, it makes sense to explore a lot. Once you know the territory, it’s best to concentrate on exploiting it. That’s what humans do over their lifetimes: children explore, and adults exploit (The Master Algorithm)
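The simplest version of “explore early, exploit later” can be sketched in a few lines of Python. This is a deliberately crude strategy, often called explore-then-commit, and it is a simplification of my own rather than anything spelled out in the book. The function name, the payout rates and the 600-pull exploration budget are all assumptions chosen for illustration.

```python
import random


def explore_then_commit(payout_probs, pulls=10_000, explore_pulls=600, seed=0):
    """Try every machine in turn for explore_pulls pulls, then stick with the best.

    payout_probs are hypothetical win probabilities, hidden from the player.
    Returns (total_winnings, index_of_machine_committed_to).
    """
    rng = random.Random(seed)
    n = len(payout_probs)
    counts = [0] * n
    wins = [0] * n
    total = 0
    committed = None  # chosen once the exploration phase is over
    for t in range(pulls):
        if t < explore_pulls:
            # "Childhood": cycle through the machines to learn about each one.
            arm = t % n
        else:
            # "Adulthood": commit to the best observed win rate and exploit it.
            if committed is None:
                committed = max(
                    range(n),
                    key=lambda i: wins[i] / counts[i] if counts[i] else 0.0,
                )
            arm = committed
        reward = 1 if rng.random() < payout_probs[arm] else 0
        counts[arm] += 1
        wins[arm] += reward
        total += reward
    return total, committed


if __name__ == "__main__":
    print(explore_then_commit([0.2, 0.5, 0.7]))
```

The design choice that matters is the size of the exploration budget relative to the total number of pulls: the longer the time frame, the more early exploration is worth paying for.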

Jeff Bezos has an interesting way to frame this decision with his “regret minimisation framework”:

So I wanted to project myself forward to age 80 and say, “Okay, now I’m looking back on my life. I want to have minimized the number of regrets I have.” I knew that when I was 80 I was not going to regret having tried this. I was not going to regret trying to participate in this thing called the Internet that I thought was going to be a really big deal. I knew that if I failed I wouldn’t regret that. (Algorithms to live by)

A fascinating result from the multi-armed bandit problem is that its solution (the Gittins Index) shows that, when in doubt, you should bias your decision towards exploring:

something you have no experience with whatsoever is more attractive than a machine that you know pays out seven times out of ten! (Algorithms to live by)
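For the curious, the claim can be checked numerically. The Python sketch below approximates the Gittins Index by asking: what guaranteed payout rate would make you indifferent between a known machine and one more pull of the uncertain one? The assumptions are mine, for illustration only: a uniform prior over the unknown payout rate, each payout worth 1, a future pull worth 90% of the current one, and a look-ahead cut off after 100 pulls.

```python
def gittins_index(wins, losses, discount=0.9, horizon=100, tol=1e-4):
    """Approximate the Gittins index of a slot machine with a given win/loss record.

    Assumptions (for illustration): uniform Beta(1, 1) prior on the payout
    rate, discount factor `discount` per pull, look-ahead limited to `horizon`.
    """
    a0, b0 = wins + 1, losses + 1  # prior pseudo-counts plus the observed record

    def value_of_trying(known_rate):
        """Expected discounted winnings from pulling the uncertain machine once
        more, keeping the option to switch for good to the known machine later."""
        retire = known_rate / (1 - discount)  # value of the known machine forever
        # At the look-ahead limit, settle for good on whichever looks better.
        values = [
            max(retire, (a0 + k) / (a0 + b0 + horizon) / (1 - discount))
            for k in range(horizon + 1)
        ]
        # Work backwards; k counts the extra wins seen after `depth` more pulls.
        for depth in range(horizon - 1, -1, -1):
            step = []
            for k in range(depth + 1):
                p = (a0 + k) / (a0 + b0 + depth)  # current estimate of a win
                keep_playing = (
                    p * (1 + discount * values[k + 1])
                    + (1 - p) * discount * values[k]
                )
                # At the root we force one more pull; deeper down we may switch.
                step.append(keep_playing if depth == 0 else max(retire, keep_playing))
            values = step
        return values[0]

    # Binary search for the known payout rate at which we are indifferent.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if value_of_trying(mid) > mid / (1 - discount):
            lo = mid  # the uncertain machine is still worth more
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    print(round(gittins_index(0, 0), 3))  # a machine nobody has tried yet
```

Under these assumptions the index for an untried machine should come out a little above 0.7, which is the sense in which the unknown beats a machine known to pay out seven times in ten. Change the discount factor and the answer moves: the less you care about the future, the less exploring is worth.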

Jeff Patton explores how to apply this concept at work very neatly in his article on Dual Track development. He describes how important it is for teams to work on “Discovery” and “Development” workflows in parallel: the Discovery stream focuses on maximising learning velocity, and the Development stream on maximising release velocity to get your ideas shipped and into the world. Astro Teller and the crew from Basecamp also look at similar concepts.

On a more personal level, Cal Newport in his books and Todd Henry with his FRESH framework both outline ways to set up systems that keep exploration going in your daily activities, providing a systematic counter to the firefighting world of exploiting.

The Background Ops Finale provides an excellent summary of how to think about the explore / exploit conundrum:

The world is constantly full of “explore vs exploit” tradeoffs — whether you should take known gains, or take the exploration cost to see if better options are available.
Oftentimes, people don’t realize that they’re making explore/exploit tradeoffs. Once you know this mental model, you start noticing these tradeoffs everywhere, and can make the choices more explicitly.
Boredom and mental fatigue are very possibly evolutionary-evolved prompts to switch from exploit mode to explore mode.
In an era of addictiveness and more options than ever before, we’ll benefit very heavily from explicitly designing and operationalizing how you spend your exploration time — instead of just spending it on Farmville or internet surfing.
Seriously think through what would maximally benefit you from your exploration time
Keep in mind maximum sustainable pace, reactance, and the usefulness of hard rules when designing your exploration time.
When large opportunities hit, switch to exploit mode maximally — go hard and take advantage of the gains. Minimize leisure within reason during those periods, and fill up the remaining leisure with maximal recharging and motivation.

When we talk about decision-making, we usually focus just on the immediate payoff of a single decision — and if you treat every decision as if it were your last, then indeed only exploitation makes sense. But over a lifetime, you’re going to make a lot of decisions. And it’s actually rational to emphasize exploration — the new rather than the best, the exciting rather than the safe, the random rather than the considered — for many of those choices, particularly earlier in life. (Algorithms to live by)

Let me know what you think. I’d love your feedback. If you haven’t already, sign up for a weekly dose just like this.

More like this from 10x Curiosity

Luck and reversion to the mean — Are you fully acknowledging the role luck is playing in the outcomes? Does a good outcome reflect your good skill, or did you just get lucky?
System of Profound Knowledge — Deming developed his system of profound knowledge to describe the work of organisations.
Normalisation of Deviance — Normalisation of deviance is a trap in human psychology that has led to many disasters over the years
Antifragile — Becoming stronger with failure — Antifragile. A thought provoking concept developed by Nassim Taleb in a book by the same name.
Helping people make better choices — Nudge Theory and Choice architecture — Can you make the default option one that nudges people towards better outcomes?

Written by Tom Connor