The Epsilon-Greedy Algorithm for Reinforcement Learning

Avery Parkinson
Analytics Vidhya
Published in
4 min readDec 2, 2019

Let’s say that you and your friends are trying to decide where to eat. In the past, you’ve always gone to a Mexican restaurant around the corner, and you’ve all really enjoyed it. However, this time, one of your friends mentions that a new Lebanese place has opened up down the street, and it’s supposed to be really good. None of you guys can come to a consensus — should you go to the Mexican restaurant which you know to be really good, or should you try the Lebanese place which has the potential to be better or worse?

Decisions?!

Quick Aside:

Reinforcement learning is a subtype of artificial intelligence which is based on the idea that a computer learn as humans do — through trial and error. It aims for computers to learn and improve from experience rather than being explicitly instructed.

But, how does the computer know to do this?

Well, by using a learning algorithm.

Learning algorithms are mathematical tools implemented by the programmer which allow the agent to effectively conduct trial and error when performing a task. Learning algorithms interpret the rewards and punishments returned to the agent from the environment and use the feedback to improve the agent’s choices for the future.

In reinforcement learning, our restaurant choosing dilemma is known as the exploration-exploitation tradeoff. At what point should you exploit options which you think to be the best rather than exploring options which have the potential to be better or worse (or vice-versa)?

This tradeoff plays into something known as the multi-armed bandit problem.

Let’s say that your mom gives you a bag of quarters to use at a series of four vending machines. However, these vending machines are special (of course), because you can’t see what’s in them. Each vending machine has different percentages of different types of chocolate bars, and each one costs a quarter. You know that you prefer kit kats to oh henry and oh henry to coffee crisp and coffee crisp to mars bars. How should you spend your quarters across the four vending machines in such a way as to maximize your overall satisfaction with the chocolate bars that you get?

This is the multi-armed bandit problem — how should one dedicate a fixed amount of resources to several different options when you can never be certain what will come of pulling each?

How??

Well, luckily, we have the Epsilon-Greedy Algorithm!

The Epsilon-Greedy Algorithm makes use of the exploration-exploitation tradeoff by

  1. instructing the computer to explore (i.e. choose a random option with probability epsilon)
  2. and exploit (i.e. choose the option which so far seems to be the best) the remainder of the time.

Usually, epsilon is set to be around 10%.

In this way, as time goes on, and the computer is choosing different options, it will get a sense of which choices are returning it with the highest reward. However, from time to time it will choose a random action just to make sure that it’s not missing anything. Using this learning algorithm, the computer can converge to the optimal strategy for whatever situation it’s trying to learn.

Alright, let’s put our vending machine conundrum into python!

Let’s say that we have

  • A list called “vending_machines” which contains all the different vending machines we have in front of us.
  • A list called “values” which contains the average value we have received from each vending machine thus far.

So, here’s how we can represent the epsilon-greedy portion as code:

  • We define the “explore” function which instructs us to choose randomly from the list of vending machines
  • We define the “exploit” function which instructs us to choose the vending machine that correlates to the highest number in the values list (our “greedy” action)
  • We define the “choose_vending_machine” function which generates a random number between 0 and 1. If it’s greater than epsilon, it directs us to exploit function. Otherwise, it directs us to the explore function.

And there we have it, the perfect algorithm to maximize our kit kat consumption (take that, painted over vending machines)!

So, the next time your mom gives you a bag of quarters and you go to a bunch of vending machines which you can’t see into, make sure to bring a pencil and a pad of paper, and do some epsilon-greedy — or, you know, just go find a different vending machine :).

--

--

Avery Parkinson
Analytics Vidhya

Activator at The Knowledge Society | A Sandwich or Two Founder