Introduction to Reinforcement Learning (Coding Q-Learning) — Part 3

Adesh Gautam
Published in The Startup
5 min read · Jul 9, 2018


Talk is cheap. Show me the code — Linus Torvalds

In the previous part, we saw what an MDP is and what Q-learning is. In this part, we’ll see how to solve a finite MDP using Q-learning and code it.

OpenAI gym

As stated on the official website of OpenAI gym:

Gym is a toolkit for developing and comparing reinforcement learning algorithms.

We’ll use this toolkit to solve the FrozenLake environment. A wide variety of environments is available, such as the Atari 2600 games and text-based games. Check out all of them here.


Install gym using the steps provided here. First install the gym library, then the OS-specific packages.

Now, let’s see how to use the gym toolkit.

Import gym using:

import gym

Then specify the game you want to use. We’ll use the FrozenLake game.

env = gym.make('FrozenLake-v0')

The environment of the game can be reset to the default/initial state using:

env.reset()

And, to see the game rendered, use:

env.render()

The official documentation can be found here. It covers the detailed usage of the gym toolkit.

FrozenLake Game

Now, let’s see what the FrozenLake game is.

Imagine you are standing on a frozen lake. The lake is not frozen all the way through: in some parts the ice is very thin. Your goal is to go from place S to G without falling into the holes.


Here, S is the starting point, G is the goal, F is solid ice where the agent can stand, and H is a hole that the agent falls into if it steps there.

The agent has 4 possible moves, which are represented in the environment as 0, 1, 2, 3 for left, down, right, up respectively.

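For every step onto an F tile the agent gets 0 reward, and upon reaching the goal G it gets +1 reward. Stepping into a hole H ends the episode because the agent dies there; note that the stock FrozenLake-v0 environment returns 0 reward in that case rather than a -1 penalty.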

The game upon rendering in the terminal looks like this:


Every tile of the grid (S, F, H or G) is a state. That is, there are 4x4=16 states and 4 actions.
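
To make the numbering concrete, here is a small sketch of how the 16 states and 4 actions map onto the grid. It assumes the default 4x4 layout and row-major state numbering (state = row * 4 + column). GRID, ACTIONS, state_to_cell and tile_at are names made up for this illustration; they are not part of gym.

```python
# Assumed default 4x4 layout (S start, F frozen, H hole, G goal).
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
N_COLS = len(GRID[0])

# Action numbers as used by the environment.
ACTIONS = {0: "left", 1: "down", 2: "right", 3: "up"}

def state_to_cell(state):
    """Convert a state index (0-15) into (row, column), assuming row-major numbering."""
    return divmod(state, N_COLS)

def tile_at(state):
    """Return the letter of the tile that a state index refers to."""
    row, col = state_to_cell(state)
    return GRID[row][col]

n_states = len(GRID) * N_COLS
holes = [s for s in range(n_states) if tile_at(s) == "H"]

print(n_states, len(ACTIONS))    # number of states and actions
print(tile_at(0), tile_at(15))   # start tile and goal tile
print(holes)                     # state indices that are holes
```

Under these assumptions, state 0 is the start S in the top-left corner and state 15 is the goal G in the bottom-right corner.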

To solve this game using Q-learning we’ll make use of the theory we saw in the previous part.

This is the code for solving the “FrozenLake-v0” environment using Q-learning. It is pretty straightforward and you’ll feel comfortable with it by the end of the post.

Let’s dissect it.

Lines 1–3 import the libraries we’ll use: NumPy for storing the Q-table and pickle for saving our Q-table as a “pkl” file.

Line 5 initializes our FrozenLake environment.

Lines 7–12 initialize our variables. epsilon is used for the epsilon-greedy approach, gamma is the discount factor, max_episodes is the maximum number of times we’ll run the game, max_steps is the maximum number of steps we’ll run in every episode and lr_rate is the learning rate.

Line 14 initializes our Q-table as a 16x4 matrix filled with zeros. env.observation_space.n gives the total number of states in the game and env.action_space.n gives the total number of actions.

On line 30, we start running the episodes.

On line 31, the variable state stores the initial state using env.reset().

On line 32, t is used for storing the number of time steps.

On line 35, the environment is rendered.

On line 37 the appropriate action is chosen using the epsilon-greedy approach. See lines 16–22: we generate a random number between 0 and 1 and check whether it is smaller than epsilon. If it is, a random action is chosen using env.action_space.sample(); otherwise we choose the action with the maximum value in the Q-table for the current state, state. For example:

State 10 with q values

Suppose the actions 0–3 in state 10 have the Q-values 0.33, 0.34, 0.79 and 0.23. The maximum Q-value is 0.79, for action 2, so action 2 is chosen for state 10.

On line 39, the chosen action is taken in the environment, which returns the next state and the reward for the action. done is true if the episode has terminated, and info stores extra information that is useful for debugging.
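
The epsilon-greedy choice and the state 10 example can be sketched as follows. This is a minimal stand-alone version, not the original listing: it takes one row of Q-values as a plain list, and it uses the standard library’s random module in place of env.action_space.sample() and NumPy. The function signature is an assumption made for this illustration.

```python
import random

def choose_action(q_row, epsilon):
    """Pick an action index for one state, given that state's row of Q-values."""
    if random.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return random.randrange(len(q_row))
    # Exploit: pick the action with the largest Q-value.
    return max(range(len(q_row)), key=lambda a: q_row[a])

# Q-values for actions 0-3 in state 10, taken from the example above.
q_state_10 = [0.33, 0.34, 0.79, 0.23]

# With epsilon set to 0 the choice is always greedy.
print(choose_action(q_state_10, epsilon=0.0))
```

With epsilon=0.0 the random branch is never taken, so the call returns action 2, the one with Q-value 0.79. With a larger epsilon, the same call sometimes returns one of the other actions.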

At this point we have:

1. The previous state, state.

2. The next state, state2.

3. The action taken in state, and the reward received on reaching state2.

On line 41, we use the above information to update our Q-table using the function learn(state, state2, reward, action) by the following equation:

Q-value updation equation
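
Assuming the standard Q-learning update, the equation can be written as below. Here α corresponds to lr_rate, γ to gamma, s and s' to state and state2, a to action and r to reward.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```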

After updating the Q-table, we assign the next state, state2, to state on line 43, so that the next iteration starts from the new state.

The time step counter is incremented on line 45. Lines 47–48 check whether done is true, that is, whether the episode has finished.
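
As a worked example of the Q-table update, the sketch below applies it twice by hand. The learn function is adapted from the one named above: here the Q-table is passed in explicitly and plain lists stand in for the NumPy array. The gamma and lr_rate values are assumptions picked to keep the arithmetic easy to follow.

```python
gamma = 0.96    # discount factor (assumed value)
lr_rate = 0.81  # learning rate (assumed value)

def learn(Q, state, state2, reward, action):
    """Apply one Q-learning update for the transition state -> state2."""
    predict = Q[state][action]
    target = reward + gamma * max(Q[state2])
    Q[state][action] = predict + lr_rate * (target - predict)

# 16 states x 4 actions, all zeros to begin with.
Q = [[0.0] * 4 for _ in range(16)]

# Moving right (action 2) from state 14 reaches the goal, state 15, with reward 1.
learn(Q, state=14, state2=15, reward=1.0, action=2)
print(Q[14][2])   # 0 + 0.81 * (1 + 0.96 * 0 - 0) = 0.81

# Moving right from state 13 reaches state 14 with reward 0,
# but state 14 now has a non-zero value to bootstrap from.
learn(Q, state=13, state2=14, reward=0.0, action=2)
print(Q[13][2])   # 0 + 0.81 * (0 + 0.96 * 0.81 - 0), roughly 0.63
```

The second update shows how the reward at the goal leaks backwards through the table, one state at a time.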
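
Putting the walkthrough together, the loop looks roughly like the sketch below. This is not the original listing. To keep it runnable without gym, a hypothetical, deterministic stand-in replaces the real environment: reset and step are illustrative functions, and the real FrozenLake-v0 is slippery. Plain lists replace the NumPy Q-table, rendering is left out, and the hyperparameter values are assumptions.

```python
import random

# Hypothetical stand-in for FrozenLake: the assumed 4x4 layout, flattened
# row by row, with deterministic moves. Not the real gym environment.
GRID = "SFFFFHFHFFFHHFFG"
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up

def reset():
    return 0  # the start tile S

def step(state, action):
    row, col = divmod(state, 4)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), 3)  # bumping into the edge keeps the agent in place
    col = min(max(col + d_col, 0), 3)
    state2 = row * 4 + col
    tile = GRID[state2]
    reward = 1.0 if tile == "G" else 0.0
    done = tile in "GH"                # the goal and the holes end the episode
    return state2, reward, done, {}

# Assumed hyperparameters for this sketch.
epsilon, gamma, lr_rate = 0.9, 0.96, 0.81
max_episodes, max_steps = 10000, 100

random.seed(0)
Q = [[0.0] * 4 for _ in range(16)]

def choose_action(state):
    if random.random() < epsilon:
        return random.randrange(4)                    # explore
    return max(range(4), key=lambda a: Q[state][a])   # exploit

def learn(state, state2, reward, action):
    target = reward + gamma * max(Q[state2])
    Q[state][action] += lr_rate * (target - Q[state][action])

for episode in range(max_episodes):
    state = reset()
    t = 0
    while t < max_steps:
        action = choose_action(state)
        state2, reward, done, info = step(state, action)
        learn(state, state2, reward, action)
        state = state2
        t += 1
        if done:
            break

print(max(Q[0]))  # value estimate for the start state after training
```

Because the stand-in is deterministic, it learns much faster than the slippery original would, so treat it only as a way to watch the loop work.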

On lines 54 and 55 we save our Q-table to the file “frozenlake_qTable.pkl”.
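
Saving and reloading the table with pickle can be sketched like this. The snippet writes into a temporary directory so it doesn’t touch your working folder, and it uses a small list-based table with one made-up value in place of the trained NumPy Q-table. Both choices are assumptions of this example.

```python
import os
import pickle
import tempfile

# A stand-in Q-table: 16 states x 4 actions, with one made-up entry.
Q = [[0.0] * 4 for _ in range(16)]
Q[14][2] = 0.81

# Write to a temporary directory instead of the current one.
path = os.path.join(tempfile.mkdtemp(), "frozenlake_qTable.pkl")

# Save the table in binary mode.
with open(path, "wb") as f:
    pickle.dump(Q, f)

# Load it back, for example in a separate script that only plays the game.
with open(path, "rb") as f:
    Q_loaded = pickle.load(f)

print(Q_loaded[14][2])
```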

And that’s it. Feel free to play around with the code.

FrozenLake in action

Running the code above, you’ll see the game in action. Please be patient, though, as it’ll take some time for 10000 episodes to finish.

Agent in action

You can load the Q-table afterwards and play the game using the code below. It’s fairly simple: only the training part is removed.
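
Once training is removed, choosing an action reduces to looking up the best entry in the table. The sketch below illustrates this on a tiny, made-up 4-state Q-table. The values, the greedy_action helper and the ARROWS string are all hypothetical and not taken from the original code.

```python
# Assumed action order: 0 left, 1 down, 2 right, 3 up.
ARROWS = "<v>^"

def greedy_action(Q, state):
    """Return the action with the highest Q-value for the state (no exploration)."""
    row = Q[state]
    return max(range(len(row)), key=lambda a: row[a])

# Made-up Q-values for a hypothetical 4-state problem.
Q = [
    [0.10, 0.20, 0.90, 0.00],
    [0.00, 0.95, 0.10, 0.00],
    [0.30, 0.00, 1.00, 0.20],
    [0.00, 0.00, 0.00, 0.00],  # untrained row: ties resolve to action 0
]

policy = [greedy_action(Q, s) for s in range(len(Q))]
print(policy)
print("".join(ARROWS[a] for a in policy))
```

An all-zero row, like the last one here, tells you the agent never learned anything for that state, and the lookup simply falls back to the first action.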

Stay tuned for more fun with Reinforcement Learning. Happy exploring 😄.
