Reinforcement Learning (RL for the intimates)

If you’re a geek of our league, I probably don’t need to talk about the recent success of RL, but I will anyway.

Let’s start start with DQN?
The first algorithm to really hit the media, learning to play Atari games from scratch using only pixels of the screen as input!


And you can consider that rather “old”, 2013 news.

Major RL Breakthroughs

It simple doesn’t end!!!

All of this might seem a little bit magical and almost impossible, but it’s not, for now just take my word that it is really simple. Sure, some fancy tricks and cool math are used here and there, but it all comes down to one simple thing:

Maximize the sum of future rewards.
That seems really complicated

Wow! Calm down! , I know. Hopefully by the end of this post everything will start to make sense. So let’s start to get our hands dirty.

The Problem

I present you to the “Hello World” of RL, the Grid World!

The rules are very simple:

  • You can move only one square per turn
  • Moving in diagonals is not allowed.
  • You don’t want to die so if you step in a blue square, “The Cliff” kills you.
  • You start at the “You’re here” spot.
  • You’re in a rush, so you want to get to your goal as fast as possible.

First step, represent our problem as something the computer can interact with, we can represent the grid as a 2D array:

  • 0 represents the empty squares
  • -1 represent the cliff
  • 1 the goal and 2 the player.

The numbers in bold are the indexes. The choice of what the numbers represent doesn’t really matter, you just need some way to differentiate the squares.


Next we need to code the rules I just described above. We act in turns, we perform one action and the world gives us the consequence of that action. Generally speaking the world returns the next state, the reward and a flag signalizing if game is over.

So stepping into “The Cliff” means death and we can represent that by giving the agent a negative reward, let’s say -100, and signalizing the game is over. As I said, we’re in a hurry, so every time step we’ll receive a reward of -1 except if we reached the goal, then the reward will simply be zero and the world will signal the game is over.

Now, remember what I said:

“The goal of all reinforcement learning is to maximize the sum of future rewards”.

We’re now ready to understand exactly what this means. Our agent receives a reward of -1 every time it makes a move, stepping into “The Cliff” gives us a reward of -100, reaching the goal gives us a 0 reward and both actions end the game. So pause and think a little bit, what it means to maximize the sum of future rewards here? Since all of our rewards are negative we want to finish the game with a sum of rewards closest to zero as possible, and how we do that?? Finding the quickest route to the goal!!

So we defined our problem in a way that maximizing the sum of futures rewards (I have to say it a gazillion times, since it is indeed important) causes the agent to “solve” the world.

So by now you might be thinking

“Cool… But how on earth we actually do this?”.

I present to you the Q-learning algorithm! Your (possibly) first reinforcement learning algorithm.

Q-learning pseudo-code, stolen from Sutton’s book

Spoiler Alert: 
We’ll get there in the Next Post, Stay Tuned.

Huge thanks to Sanyam Bhutani for collaborating with the series! Check out his website.