Using machine learning to win at every casino

Applying machine learning techniques to create valuable innovation in a commercial environment

Dave Hulbert
May 9, 2016

This post is the second in a series about integrating R&D into a commercial environment. You might want to read the first one, “Making innovative R&D work commercially viable”, before this.

Multi-armed bandits

I remember hearing about the “multi-armed bandit problem” from doing statistics at school. All I remember of it was difficult maths equations with weird symbols that would never apply to real life. Turns out I was wrong.

The multi-armed bandit problem is a nice way to visualize the decision making and risk taking we do every day. The problem is this:


Slot machines in Vegas

You’re at a casino in front of a row of slot machines. The machines are random, but you know that one machine pays out slightly more often than the others. How do you pick which machine to play on, to maximize your winnings?


As each payout is random, trying a machine just once won’t give you a clear picture of whether it pays out well in the long run. Constantly going from machine to machine means most of your time is spent on machines with lower payouts. Sticking to one machine that seems OK could mean you’re losing out on higher winnings from another.

There are loads of complex academic papers about different solutions to the problem, which makes it seem unapproachable unless you have a degree in statistics. The beauty of this problem, however, is that there’s also an incredibly simple strategy, called the “epsilon-greedy strategy”, that is often only slightly less effective than the complex solutions. The simple strategy goes like this:


  • 90% of the time, pick the machine that’s paid out the best so far
  • 10% of the time, pick a machine at random
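To make the two rules concrete, here is a minimal Python sketch of them running against simulated slot machines. It is an illustration only: the function name `epsilon_greedy`, the win probabilities and the fixed `seed` are all made up for the example, and each machine is assumed to pay out one unit with a fixed probability. The `epsilon` parameter is the 10% "random" share.

```python
import random


def epsilon_greedy(payout_probs, pulls=10_000, epsilon=0.1, seed=0):
    """Play `pulls` rounds on simulated machines using epsilon-greedy.

    payout_probs: hypothetical win probability of each machine
                  (the player never looks at these directly).
    Returns (total_wins, plays_per_machine).
    """
    rng = random.Random(seed)
    n = len(payout_probs)
    plays = [0] * n  # how many times each machine has been played
    wins = [0] * n   # how many times each machine has paid out

    def observed_rate(i):
        # Payout rate seen so far; machines not yet tried count as 0.
        return wins[i] / plays[i] if plays[i] else 0.0

    for _ in range(pulls):
        if rng.random() < epsilon:
            machine = rng.randrange(n)                  # explore: random machine
        else:
            machine = max(range(n), key=observed_rate)  # exploit: best so far
        plays[machine] += 1
        if rng.random() < payout_probs[machine]:        # simulated payout
            wins[machine] += 1

    return sum(wins), plays


# Example run with made-up odds: the third machine is slightly better.
total, plays = epsilon_greedy([0.10, 0.10, 0.12])
print(total, plays)
```

One thing the sketch shows: with `epsilon=0` the player never leaves the first machine, because nothing ever prompts them to try another. The small random share is what lets a better machine be discovered at all.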

Note: the 90%/10% split can be tweaked, based on the context (the 10% is the “epsilon” in the name). This strategy is so simple that it can be remembered and applied to real-life situations. When you’re predicting a return from an investment but the result is unknown, you can use ideas from the epsilon-greedy strategy to help decide what to do.

This works well in the context of R&D and innovation: there’s a cost to invest in R&D and the return is unknown. Like the multi-armed bandit problem, we learn by carrying out work and applying that knowledge the next time.

A simple (somewhat naïve) strategy for investing in R&D work is therefore:


  • 90% of the time, do the work that you know has the highest value so far
  • 10% of the time, pick at random some R&D work you think might be more valuable

Again, the 90%/10% split can be tweaked, based on the context.

You’re probably thinking this is a bit crazy, but bear with me…

Valuing value

Most people, teams and companies already do the 90% bit: they try to do what is valuable as much as possible. The difficult bit here is defining what is valuable. Getting this right is key before going any further.

Value can be either qualitative or quantitative. Qualitative value has to be compared subjectively against the wider company vision and strategy. This isn’t too hard to do, but it needs to be communicated clearly for teams to understand. Quantitative value has already had the high-level strategy work done, but it needs regular measurement to be worthwhile. I won’t go into measuring value here, as there are already loads of books on the subject.

Increasing value

The next part is doing something different, something risky, some of the time. There’s a big challenge with this: choosing work at random seems illogical! Doing this doesn’t mean casting off knowledge and judgement that’s been built up over the years. It means you don’t always pick the top idea you have, but sometimes pick from a set of good ideas. Doing this means you’re more likely to avoid “local maxima”, where your field of view is so small that you don’t see big opportunities slightly further away.
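As a rough sketch of what “pick from a set of good ideas” could look like, the snippet below usually returns the highest-valued idea but occasionally returns one of the runners-up instead. Everything here is an assumption made for illustration: the function name `pick_next_work`, the `shortlist_size` parameter, and the idea that each piece of work already has an estimated value attached.

```python
import random


def pick_next_work(estimated_value, epsilon=0.1, shortlist_size=3, rng=random):
    """Choose the next piece of work from a dict of {idea: estimated value}.

    Most of the time this returns the top-valued idea. With probability
    `epsilon` it returns a random idea from the rest of the shortlist,
    so the randomness is limited to ideas already judged to be good.
    Assumes the dict contains at least one idea.
    """
    ranked = sorted(estimated_value, key=estimated_value.get, reverse=True)
    best = ranked[0]
    runners_up = ranked[1:shortlist_size]  # good ideas that are not the top one
    if runners_up and rng.random() < epsilon:
        return rng.choice(runners_up)      # the risky 10%
    return best                            # the safe 90%


# Hypothetical backlog with made-up value estimates.
backlog = {"checkout redesign": 8, "new pricing page": 6,
           "chatbot prototype": 5, "office plants": 1}
print(pick_next_work(backlog))
```

Restricting the random pick to a shortlist is one way to keep built-up judgement in the loop: the low-value ideas at the bottom of the list never get chosen, but the second and third best ideas get an occasional chance to prove themselves.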

To be innovative, we must do new things that are risky.

Ideally, companies need to have predictable sources of revenue. In many cases this means there’s no innovation at all, and whilst a company may be doing well, it could be performing many times better.

Too busy to improve?

On the flip side, for companies to stay profitable they (usually, at least) can’t just do innovative R&D work. To disrupt, companies have to create valuable innovation whilst staying profitable at the same time. This means the two sides have to be balanced and managed together.

If implemented well, the split in the epsilon-greedy strategy lets us balance the two with great results. Google is well known for its “20% time”, which is widely credited with helping to create products like Gmail and AdSense. Other companies have done similar things (3M, for example, did so before Google). Google now has more structure around innovation, but the idea is still the same.


So in a world where there are new approaches, developer tools, platforms and SaaS products coming out every day, how can we learn and integrate them with reduced risk and maximum value? These are the kinds of experiments that, with the right implementation, have the potential to make companies perform 10x better.

In the next post in this series I’ll talk about what’s needed to be able to apply this practically, promoting experimentation and reducing the negative impact of failure. As always, feedback is welcomed.

Thanks to Sam Westlake

Written by Dave Hulbert

Engineering director at @wearebase / @passengerteam. @phpdorset cofounder. @dave1010 at gmail.com
