Jack, the Gambler Pirate (or What is Reinforcement Learning? What is ε-Greedy?)

Christian Zambra
productmanagerslife
6 min read · Aug 29, 2021

With this funny tale, I want to share a glimpse of how a Reinforcement Learning algorithm works. But remember, it is just a glimpse: it is about what this tool can do and how you can use it, not about how the complex math magic happens.

Jack is a Pirate.

Oh Lord, yes, he is a pirate.

And like any good pirate, he loves to sail unknown seas, drink a lot of rum, and gamble.

Jack is a Gambler Pirate.

But Jack is not a common Pirate.

He comes from the Pacific; he lived in Silicon Valley. And he has a Robot Parrot called AI.

Jack is a smart pirate. While he drinks his rum, the parrot gambles. And AI always wins!

Drinking with other pirates, Jack discovered that the Casino has a problem. The One-Armed Bandits are too old, and because of that they are unbalanced. Gamble on some of them and you win more… gamble on others and you lose more…

While drunk, Jack visualized the options for the One-Armed Bandits…

Well, he dreamed about 3 bandits…

If he wins, the bandit pays back double the money he put in.

In one, he wins with a 70% chance: this is the Green one.

In another, it is 50% to him and 50% to the Casino: the Yellow one.

In the last, the Casino wins, with only a 10% chance for him: the Red one.

The first dream was about putting it all in one bandit… Let’s go aggressive, pedal to the metal, all or nothing…

Remembering a little bit of math…

Jack will put $1000 on one bandit.

If red, he expects to get back $100 × 2 = $200 (a 10% chance of doubling $1000)

If yellow, he expects to get back $500 × 2 = $1000

If green, he expects to get back $700 × 2 = $1400

Jack knows nothing about the 3 Bandits.

He will spin the bottle and gamble on whichever machine it picks. Choosing 1 in 3, he has a 1/3 chance of red, 1/3 of yellow, and 1/3 of green.

The expected return is ($200 + $1000 + $1400) / 3 = $866.67, less than the $1000 invested. It was a nightmare: Jack loses, and all those numbers gave him a terrible headache.

(For more info about Expected Return on Risk: https://www.investopedia.com/terms/e/expectedreturn.asp)
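For readers who prefer code to headaches, here is a minimal Python sketch of the arithmetic in Jack's first dream. The names `WIN_PROB`, `expected_payback`, and `expected_payback_random_pick` are made up for this illustration; the win chances and the "double your money" payout come from the tale.

```python
# Win chances from the tale; a win pays back double the stake.
WIN_PROB = {"red": 0.10, "yellow": 0.50, "green": 0.70}
PAYOUT_MULTIPLIER = 2


def expected_payback(stake, win_prob):
    """Average amount paid back when the whole stake goes on one machine."""
    return stake * win_prob * PAYOUT_MULTIPLIER


def expected_payback_random_pick(stake):
    """All the money on one machine, each machine equally likely to be picked."""
    paybacks = [expected_payback(stake, p) for p in WIN_PROB.values()]
    return sum(paybacks) / len(paybacks)


for name, p in WIN_PROB.items():
    print(name, round(expected_payback(1000, p), 2))

print(round(expected_payback_random_pick(1000), 2))  # 866.67
```

Only the green machine pays back more than the $1000 on average, but Jack does not know which one is green, so the blind pick averages out below his stake.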

With that terrible headache, he has another dream:

Just give up on the math and put 1/3 of the money in each machine.

Another nightmare… $866 is less than $1000; Jack loses again.
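The second dream can be checked the same way. This is a small sketch under the tale's assumptions (win chances of 10%, 50%, and 70%, and a win paying back double); `expected_payback_split` is a hypothetical helper name.

```python
# Win chances from the tale; a win pays back double the stake.
WIN_PROB = {"red": 0.10, "yellow": 0.50, "green": 0.70}


def expected_payback_split(total_stake):
    """Average amount paid back when the money is split evenly across machines."""
    share = total_stake / len(WIN_PROB)
    return sum(share * p * 2 for p in WIN_PROB.values())


print(round(expected_payback_split(1000), 2))  # 866.67
```

Splitting the money evenly has the same expected payback as putting it all on a machine chosen at random, because in both cases each machine gets, on average, one third of the money. What changes is the risk, not the average.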

Jack wakes up and screams:

AI PARROT! TAKE MY MONEY AND WIN!!!

The Parrot said:

Reinforcement Learning, Reinforcement Learning!

Jack said: Reinforcement what? Maybe you are talking about Conditioning? Skinner? What the F* does Skinner have to do with gambling?

At this point Jack sadly remembered his Pirate’s training…

…Learning how to fight with a sword…

The objective was to do the perfect move and hit the adversary’s head.

Every time he did it right, he received a reward (chocolate with rum).

Every time he did it wrong, he received a punishment (10 One-Arm Push Ups).

He remembered that his teacher said something about a Psychologist and some rats that learn in the same way… another nightmare, but this time while awake. Maybe it has something to do with the rum…

(For more info about Skinner: https://en.wikipedia.org/wiki/Operant_conditioning_chamber)

But when Jack was close to breaking down in tears at this memory, the AI PARROT SAID:

REINFORCEMENT LEARNING! AI! MACHINE LEARNING!

The Parrot explained that it looks like the nightmare, but is a little bit different.

You have an objective, like learning to fight with a sword or winning at the One-Armed Bandits.

Let us think about the Bandits.

The Parrot with the money is the Agent. He takes an action, like choosing a One-Armed Bandit at random and gambling on it.

The environment is the group of One-Armed Bandits, and the result of each action is evaluated against the objective. If the action produces a good result (in this example, a win), the system/interpreter/reward function gives a reward, reinforcing that the action was good and should be taken again. Otherwise, the agent will try another action, one that has given more rewards.
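To make the Agent, the environment, and the reward a bit more concrete, here is a minimal Python sketch of that loop. The class names `BanditCasino` and `RandomParrot` are invented for this illustration, and the Parrot here is still the "knows nothing" version that only picks at random and writes down what each machine paid back.

```python
import random


class BanditCasino:
    """Environment: three one-armed bandits with the tale's win chances."""

    WIN_PROB = {"red": 0.10, "yellow": 0.50, "green": 0.70}

    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def pull(self, machine, bet):
        """Return the money paid back: double the bet on a win, else 0."""
        return 2 * bet if self.rng.random() < self.WIN_PROB[machine] else 0


class RandomParrot:
    """Agent that knows nothing yet: it picks a machine at random."""

    def __init__(self, machines, seed=None):
        self.machines = list(machines)
        self.rng = random.Random(seed)
        self.total_reward = {m: 0 for m in self.machines}

    def act(self):
        return self.rng.choice(self.machines)

    def learn(self, machine, reward):
        # The reward is the feedback the agent keeps for later decisions.
        self.total_reward[machine] += reward


casino = BanditCasino(seed=0)
parrot = RandomParrot(BanditCasino.WIN_PROB, seed=1)

for _ in range(30):
    machine = parrot.act()                 # the agent takes an action
    reward = casino.pull(machine, bet=1)   # the environment responds
    parrot.learn(machine, reward)          # the reward is recorded

print(parrot.total_reward)
```

A smarter Parrot would use `total_reward` inside `act` to prefer the machines that paid back more, which is where exploration and exploitation come in.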

The Parrot kept screaming:

REINFORCEMENT LEARNING! MULTI ARMED BANDIT!

LEARN TO WIN! EXPLORE THEN EXPLOIT!

Jack did not understand very well, but said:

Go and make your magic AI Parrot.

I will learn by watching.

(For more info about Multi Armed Bandit: https://en.wikipedia.org/wiki/Multi-armed_bandit)

And the AI Parrot made it!

In the first “session”, he bet $60, putting $20 in each machine, and won $52.

This kind of random betting, done in order to understand which machine is the best, is called exploration, so here he put all his bets into exploration.
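Here is a minimal sketch of what an exploration-only session could look like in Python. It assumes $1 bets and the same number of bets on every machine (20 each, echoing the $20 per machine of the tale); `explore` is a hypothetical helper, and `WIN_PROB` is hidden from the Parrot, who only sees the wins.

```python
import random

# The true win chances; the parrot never reads these directly.
WIN_PROB = {"red": 0.10, "yellow": 0.50, "green": 0.70}


def explore(bets_per_machine, rng):
    """Bet $1 on every machine the same number of times; return observed win rates."""
    wins = {m: 0 for m in WIN_PROB}
    for machine, p in WIN_PROB.items():
        for _ in range(bets_per_machine):
            if rng.random() < p:
                wins[machine] += 1
    return {m: wins[m] / bets_per_machine for m in WIN_PROB}


rng = random.Random(42)
estimates = explore(20, rng)
best_guess = max(estimates, key=estimates.get)
print(estimates, best_guess)
```

With only 20 bets per machine the estimates are noisy, so the best guess can be wrong. That is the reason to keep some exploration in the later sessions instead of trusting the first results blindly.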

Now that the AI Parrot has the results of the first session, in which the green machine gave the best results, he adopts another strategy: put part of the money in at random, to check whether the green one is still the best (again, exploration), and part in the machine he now believes is the best according to the previous results (this is called exploitation). So he puts 50% of the money in exploration and 50% in exploitation.

Here, he bet $300 and won $270. With those results, he confirmed that the green one is the best, and so in the last round he is going to do more exploitation (85%) and less exploration (15%).

(This strategy of splitting the bets between exploration, a share ε, and exploitation, the rest, is called ε-Greedy; lowering ε as the results come in, as the Parrot does, is often called a decaying ε-Greedy. More info can be found here: https://www.youtube.com/watch?v=uAB9WOQCSxw)
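The three sessions can be sketched as a small ε-Greedy loop in Python. This is an illustration under stated assumptions, not the Parrot's actual brain: bets are $1 each, ε is applied as a per-bet coin flip (the usual textbook form) rather than as a fixed share of the money, and the names `pull`, `epsilon_greedy_session`, and `schedule` are made up. Because the machines are random, the amounts won will not match the ones in the tale.

```python
import random

# The true win chances; the parrot only learns them through its bets.
WIN_PROB = {"red": 0.10, "yellow": 0.50, "green": 0.70}
MACHINES = list(WIN_PROB)


def pull(machine, rng):
    """$1 bet: pays back $2 on a win, $0 otherwise."""
    return 2 if rng.random() < WIN_PROB[machine] else 0


def epsilon_greedy_session(budget, epsilon, plays, paid_back, rng):
    """Spend `budget` in $1 bets, exploring with probability `epsilon`."""
    session_payback = 0
    for _ in range(budget):
        untried = [m for m in MACHINES if plays[m] == 0]
        if untried or rng.random() < epsilon:
            # Explore: a random machine (untried ones first).
            machine = rng.choice(untried or MACHINES)
        else:
            # Exploit: the machine with the best average payback so far.
            machine = max(MACHINES, key=lambda m: paid_back[m] / plays[m])
        reward = pull(machine, rng)
        plays[machine] += 1
        paid_back[machine] += reward
        session_payback += reward
    return session_payback


rng = random.Random(7)
plays = {m: 0 for m in MACHINES}
paid_back = {m: 0 for m in MACHINES}

# (budget, epsilon) per session, following the tale: 100%, 50%, 15% exploration.
schedule = [(60, 1.0), (300, 0.5), (640, 0.15)]
results = [epsilon_greedy_session(b, e, plays, paid_back, rng) for b, e in schedule]

print(results, sum(results), plays)
```

The `plays` and `paid_back` dictionaries carry over from one session to the next, so each session starts from everything the Parrot has already learned.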

In this last attempt the AI Parrot bet $640 and won $806.

In total, the AI Parrot bet $1000 and won $1,166.

So, the end of this tale was:

Do not gamble while you are drunk; it would be better to let a Parrot do it for you!!!

If you have an AI Parrot, even better: you should win!!!

And… Reinforcement Learning works! Like other machine learning algorithms, it was inspired by how humans work: how we learn, especially in uncertain scenarios.

If you have a scenario where the results are unknown, like those One-Armed Bandits where we don’t know which is the best, let the algorithm learn for you!

Thanks!

Other interesting examples:

Contextual Multi Armed Bandit Application on Recommendation by Netflix:

https://pt.slideshare.net/JayaKawale/a-multiarmed-bandit-framework-for-recommendations-at-netflix

Passionate to learn; believes that new products are made to change people’s life for better; Fuzzy AND Techie :) B. Engineering & Advertising. Alma Mater: USP