Level up — Understanding Q learning

NancyJemimah
Apr 24, 2020


In the first article of this series, I covered almost all of the basic terminology, so you should have a clear picture of how Reinforcement Learning works. We also saw different approaches to Reinforcement Learning: value-based, policy-based, and model-based. Assuming you have read that article, I'll dive directly into the important concepts of RL.

The hero of this article: Q-learning

So let's start by understanding the implementation of Q-learning. First and foremost, we need to understand the environment. Here we have a 5x5 grid environment in which our taxi drives around to pick up a passenger from a selected place and drop them off at the desired location. To sum up, we have 25 possible taxi locations, four (4) destinations, and five (4 + 1) passenger locations (the four destinations plus "inside the taxi").

So, our taxi environment has 5×5×5×4=500 total possible states.

The agent encounters one of these 500 states and takes an action. There are six possible actions (the short sketch after this list checks both counts against the Gym environment):

  1. south
  2. north
  3. east
  4. west
  5. pickup
  6. dropoff
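
As a quick sanity check, the state and action counts can be read off the environment itself. This is a minimal sketch assuming the `Taxi-v3` environment from the classic `gym` package (older tutorials use `Taxi-v2`); the printed numbers should match the 500 states and 6 actions described above.

```python
import gym

# Create the Taxi environment (assumes the classic `gym` package and the Taxi-v3 id).
env = gym.make("Taxi-v3")

print("Number of states:", env.observation_space.n)   # 500
print("Number of actions:", env.action_space.n)       # 6
```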

Now that we have a clear picture of all the states and the actions, we will define our rewards.

  • There are 4 locations (labeled by different letters), and your job is to pick up the passenger at one location and drop them off at another.
  • You receive +20 points for a successful dropoff and lose 1 point for every timestep it takes. This -1 per step encourages the agent to take the shortest route to the destination.
  • There is also a -10 point penalty for illegal pickup and dropoff actions (the sketch after this list shows how these rewards appear in the environment's transition table).
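
To see these rewards in the environment itself, you can inspect the transition table that Gym's toy-text environments expose as `env.P` (depending on the version it may live on `env.unwrapped.P`). This is a rough sketch, not code from the original article:

```python
import gym

env = gym.make("Taxi-v3")

# Older gym returns the state directly; newer versions return a (state, info) tuple.
reset_result = env.reset()
state = reset_result[0] if isinstance(reset_result, tuple) else reset_result

# P maps state -> action -> list of (probability, next_state, reward, done) tuples.
P = getattr(env, "P", None) or env.unwrapped.P

for action, transitions in P[state].items():
    prob, next_state, reward, done = transitions[0]
    print(f"action {action}: reward {reward}, done {done}")
```

Movement actions show a reward of -1, illegal pickup/dropoff show -10, and a successful dropoff shows +20.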

From this information, we now understand the environment's states, actions, and rewards.

Exploration and Exploitation trade-off

  • Exploration is finding more information about the environment.
  • Exploitation is exploiting known information to maximize the reward.

Initially, the agent knows nothing about the environment, so we introduce a hyperparameter called epsilon, which ranges from 0 to 1. We set it to 1 at the start so that our agent explores the environment. At each time step, the exploration rate is reduced exponentially using a decay rate that we choose; finding a good decay rate is a matter of hyperparameter tuning. As epsilon decays, our agent starts to exploit the environment, greedily taking the actions it believes yield the most reward.

  1. In our algorithm, we draw a random number between 0 and 1. When epsilon is greater than the random number, the agent explores the environment, i.e. (random number < epsilon).
  2. If epsilon has decayed and falls below the random number, the agent exploits the environment (greedy action). A short sketch of this rule follows the list.
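
Here is a minimal sketch of epsilon-greedy action selection, assuming a NumPy Q-table `q_table` and a Gym environment `env` (both of which are built later in this article):

```python
import random
import numpy as np

def choose_action(env, q_table, state, epsilon):
    """Epsilon-greedy action selection."""
    if random.uniform(0, 1) < epsilon:
        # Explore: pick a random action.
        return env.action_space.sample()
    # Exploit: pick the action with the highest learned Q-value for this state.
    return int(np.argmax(q_table[state]))
```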

Remember, the goal of our RL agent is to maximize the expected cumulative reward.

This is what we call the exploration/exploitation trade-off.

Mathematics behind Q-learning

The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a).

Bellman equation for the Q-value (Source: Hackernoon)

With the help of this equation, we can compute the Q-value for each of our state-action pairs. This is an iterative process, and it falls under the value-based learning approach.
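
Written out, the update that this iterative process applies is the standard Q-learning form of the Bellman update (the same weighted-sum rule described later in the article), with learning rate α, discount factor γ, immediate reward r, and next state s′:

```latex
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \bigl[ r + \gamma \max_{a'} Q(s', a') \bigr]
```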

Train the agent — Q table

A Q-table is just a name for a simple lookup table where we store the maximum expected future reward for each action at each state. The 'Q' stands for quality.

The Q-table is a matrix with a row for every state (500) and a column for every action (6). The size of our Q-table is therefore 500 × 6 = 3000 entries.

  • It is first initialized to 0, and the values are updated during training.
  • Start exploring actions: for the current state (S), select any one of the possible actions.
  • Travel to the next state (S’) as a result of that action (a).
  • Among all possible actions from the next state (S’), find the one with the highest Q-value.
  • Update the Q-table values using the update equation.
  • Set the next state as the current state.
  • If the goal state is reached, end the episode and repeat the process for the next episode.

(Steps adapted from Learndatasci.)
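
A minimal sketch of the initialization step, assuming the Gym Taxi environment from earlier:

```python
import gym
import numpy as np

env = gym.make("Taxi-v3")

# One row per state (500), one column per action (6), all values start at 0.
q_table = np.zeros([env.observation_space.n, env.action_space.n])
print(q_table.shape)  # (500, 6) -> 3000 entries in total
```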

Hyperparameters and optimizations

The values of `alpha`, `gamma`, and `epsilon` were mostly chosen based on intuition and some trial and error, but there are better ways to come up with good values.

Ideally, all three should decrease over time because, as the agent continues to learn, it builds up more resilient priors.

  • α: the learning rate should decrease as you continue to gain a larger and larger knowledge base. It should be in the range 0 to 1. The higher the learning rate, the more quickly a new Q-value replaces the old one, so we need to tune it so that the agent still learns from its previous Q-values. In effect, the learning rate controls how much of our previous experience we keep for each state-action pair.
  • γ: as you get closer and closer to the deadline, your preference for near-term reward should increase, as you won’t be around long enough to get the long-term reward, which means your gamma should decrease.
  • ϵ: as we develop our strategy, we have less need for exploration and more need for exploitation to get more utility from our policy, so as trials increase, epsilon should decrease (a decay-schedule sketch follows this list).
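
One common way to implement such a schedule is an exponential decay of epsilon between a maximum and a minimum value (the same idea can be applied to alpha). The maximum of 1.0 and minimum of 0.1 match the values used in this article; the decay rate of 0.01 is an illustrative assumption you would tune for your own runs:

```python
import numpy as np

max_epsilon = 1.0    # start fully exploring
min_epsilon = 0.1    # never stop exploring entirely
decay_rate = 0.01    # illustrative value; tune per problem

def epsilon_for_episode(episode):
    """Exponentially decay epsilon from max_epsilon toward min_epsilon."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

print(epsilon_for_episode(0))    # 1.0
print(epsilon_for_episode(100))  # ~0.43
```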

Implementation of Q-learning algorithm

The Q-learning algorithm is implemented with the help of the OpenAI Gym environment, a platform that lets us play around with game environments to understand the concepts we are learning. I encourage everyone to experiment with it to get a closer look at how our agents are trained and tested.

Install the OpenAI Gym package with the command below:

!pip install gym

Then we need to create the Taxi environment. The sketch below shows one way to create and render it:
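
This is a minimal sketch assuming the classic `gym` API; newer gym/gymnasium versions expect a `render_mode` argument to `gym.make` instead of a bare `render()` call:

```python
import gym

env = gym.make("Taxi-v3")
state = env.reset()   # initial state (or a (state, info) tuple in newer versions)
env.render()          # prints an ASCII picture of the 5x5 grid, taxi, passenger, and destination
```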

Note: as mentioned earlier, we set our epsilon and decay rate. In addition, we set minimum and maximum epsilon values so that epsilon never reaches 0. In the training loop later in this article, the minimum value is 0.1, so epsilon decays down to that value and the agent then mostly exploits the environment.

  1. Initialize all Q-values in the Q-table to 0.
  2. For each time-step in each episode:
  • Choose an action (considering the exploration-exploitation trade-off).
  • Observe the reward and the next state.
  • Update the Q-value function (using the formula described below).

Our new Q-value is equal to a weighted sum of our old value and the learned value, where the weight is alpha, the learning rate. The old value in our case is 0, since this is the first time the agent is experiencing this particular state-action pair, and we multiply this old value by (1−α).

Our learned value is the reward the agent receives for moving from the current state, plus the discounted estimate of the optimal future Q-value for the next state-action pair (s′, a′) at time t+1. This entire learned value is then multiplied by our learning rate, α.
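
For concreteness, plugging illustrative numbers into the update (these values are assumptions for the example, not taken from a real training run): with α = 0.1, γ = 0.6, an old Q-value of 0, a step reward of −1, and a best next-state Q-value of 0, the new entry becomes

```latex
Q(s, a) \leftarrow (1 - 0.1) \cdot 0 + 0.1 \bigl[ -1 + 0.6 \cdot 0 \bigr] = -0.1
```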

The pseudo-code above is implemented in Python below.
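
Here is one way the pseudo-code could be written out, as a sketch assuming the Gym `Taxi-v3` environment and the epsilon bounds mentioned above; the remaining constants (such as `alpha = 0.1`, `gamma = 0.6`, `decay_rate = 0.01`, and `num_episodes = 10000`) are illustrative choices, not values from a specific tuned run:

```python
import random
import gym
import numpy as np

env = gym.make("Taxi-v3")
q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters (illustrative values; tune for your own runs)
alpha = 0.1          # learning rate
gamma = 0.6          # discount factor
epsilon = 1.0        # start by exploring everything
max_epsilon = 1.0
min_epsilon = 0.1    # epsilon never drops below this (as noted earlier)
decay_rate = 0.01
num_episodes = 10000

for episode in range(num_episodes):
    # Handle both old (state) and new ((state, info)) reset signatures.
    reset_result = env.reset()
    state = reset_result[0] if isinstance(reset_result, tuple) else reset_result
    done = False

    while not done:
        # Choose an action: explore with probability epsilon, otherwise exploit.
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        # Take the action and observe the reward and the next state.
        step_result = env.step(action)
        if len(step_result) == 5:   # newer API: (obs, reward, terminated, truncated, info)
            next_state, reward, terminated, truncated, _ = step_result
            done = terminated or truncated
        else:                       # older API: (obs, reward, done, info)
            next_state, reward, done, _ = step_result

        # Q-learning update: weighted sum of the old value and the learned value.
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

        # The next state becomes the current state.
        state = next_state

    # Decay epsilon exponentially toward its minimum after each episode.
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
```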

Findings after playing around with the environment

  1. Q-learning is a value-based, iterative method that updates the Q-value estimates at each time step.
  2. The tabular algorithm described above works for discrete state and action spaces.
  3. The important hyperparameters are alpha (the learning rate), gamma (the discount factor), and the epsilon value.
  4. We use the Bellman equation to update the Q-values.

Check your knowledge with the questions below

  1. When the epsilon value is 1, what does the agent do?

(explore, exploit)

  2. What is the range of the learning rate alpha parameter?

(0–1, 1, 0.5)

  3. How do we calculate the size of our Q-table? ________________

(number of states × number of actions)

Next Article

References used for the Python implementation

The links below are some of the best tutorials I have come across; I used them as a guide to understand these concepts in more depth, and they inspired me to write my own articles with the knowledge I gained. Thanks to these authors.

Thank you for reading. Please email me at nancyjemimah@gmail.com for further suggestions and questions.

Happy learning!

