Q-Learning: The first and foremost algorithm in Reinforcement Learning

Anagha Mittal
Published in Clique Community
Jun 24, 2020 · 4 min read

Introduction

When studying Reinforcement Learning, the most basic and easiest algorithm one needs to know is Q-Learning. There are a few terms related to RL that everyone should understand before getting started. Q-Learning is not only an easy algorithm but also a very interesting one that can be applied to a wide range of real-world problems.

First of all, we should know that ‘Q’ in Q-Learning stands for “quality”. It measures how good an action is in a given state, i.e., how much reward/return the agent can expect from taking it. The main aim in RL is to maximize the rewards attained by the agent, as this is what allows the model to achieve its goals. Q-Learning is a value-based RL algorithm designed to maximize these rewards.

How does it work?

Q-Learning determines what actions an agent must take to get from the initial state to the final state. The state of the agent is a description of its circumstances while interacting with the world; it can be represented by a real-valued vector, a matrix, or a higher-order tensor. The Q-Learning algorithm finds the best action an agent can take in its current state, i.e., the one that leads to the next state with the highest expected reward.

The action taken by an agent can be either exploitation or exploration, based on the epsilon (ε) value provided by us. The epsilon value, known as the exploration rate, is initialized to a value between 0 and 1 (often a small value such as 0.1).

Exploration and exploitation are the two kinds of action the agent can take, and which one it chooses depends on the value of ε.

Exploration means taking an action at random, i.e., the agent explores and may discover new states that it would never visit through exploitation alone.

Exploitation, on the other hand, means that the agent uses the information it already has from previous observations.

After each action is taken, the corresponding Q-value is updated in a table known as the Q-table (or Q-matrix).

The Algorithm

The steps involved in making a Q-table are as follows:

  • Initialize the table.
  • Choose the action to be taken by the agent.
  • The agent performs the action, and we measure the reward/penalty for that action.
  • Update the table using the Bellman Equation.

We’ll now look at the implementation, along with step-wise NumPy code snippets.

1. Initialize the table:

Let us initialize our Q-table with zeros.
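A minimal NumPy sketch of this step; the numbers of states and actions below are placeholders chosen purely for illustration:

import numpy as np

# Placeholder sizes for illustration: 6 states and 4 possible actions.
n_states = 6
n_actions = 4

# One row per state, one column per action, all zeros to start.
Q = np.zeros((n_states, n_actions))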

2. Choose the action to be taken by the agent:

Actions in Q-learning are taken based on the value of the exploration factor, denoted by epsilon (ε). The epsilon value determines whether the action taken is exploration or exploitation. We set the ε value according to how much we want our agent to explore.

We generate a random number between 0 and 1 using the random library and compare it with ε. If the value is less than ε, we choose exploration, otherwise exploitation.
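Here is a minimal sketch of this epsilon-greedy selection, reusing the Q, n_actions, and np names from the previous snippet; the epsilon value is illustrative only:

import random

epsilon = 0.1  # exploration rate (illustrative value)

def choose_action(state):
    # Exploration: with probability epsilon, pick a random action.
    if random.uniform(0, 1) < epsilon:
        return random.randrange(n_actions)
    # Exploitation: otherwise pick the action with the highest Q-value so far.
    return int(np.argmax(Q[state]))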

3. The agent performs the action and we measure the reward/penalty for that action:

Let us say the initial state of the agent is s1; on performing an action (chosen as described above), it reaches a state s2.

Now, the action performed by our agent to reach s2 should be such that it maximizes our reward. Hence, we want to choose the maximum reward, and for that, we need to look at the next step.
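How the reward and the next state are obtained depends entirely on the environment. The toy sketch below uses a hypothetical reward matrix R and a made-up transition rule, purely to show the two values we need to record for each action; a real environment (a grid world, an OpenAI Gym task, etc.) defines its own dynamics:

# Hypothetical reward matrix: R[state, action] is the immediate reward.
R = np.full((n_states, n_actions), -1.0)   # -1 per step by default
R[4, 2] = 10.0                             # e.g. action 2 from state 4 reaches the goal

def step(state, action):
    # Toy transition rule for illustration only.
    next_state = action % n_states
    reward = R[state, action]
    return next_state, reward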

4. Update the table using the Bellman Equation:

The first equation that I learned in RL is the Bellman Equation. It is what we use to compute the values in the Q-table, so it is very important to understand and know all the terms in it.

Here, the two new terms introduced are:

  • alpha (α): the learning rate of the agent. In other words, it is a factor that determines how much weight you give the new value versus the old value.
  • gamma (γ): the discount factor, usually ranging between 0.8 and 0.99, used to balance immediate and future rewards.
Bellman Equation with terms:
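In its standard Q-learning form, the equation reads:

Q(s, a) ← Q(s, a) + α [R(s, a) + γ max Q(s', a') - Q(s, a)]

A minimal NumPy sketch of this update for a single transition, reusing the names from the earlier snippets; the alpha and gamma values are illustrative only:

alpha = 0.1   # learning rate (illustrative value)
gamma = 0.9   # discount factor (illustrative value)

state = 0
action = choose_action(state)
next_state, reward = step(state, action)

# Bellman update: move the old estimate toward
# reward + gamma * (best Q-value achievable from the next state).
Q[state, action] = Q[state, action] + alpha * (
    reward + gamma * np.max(Q[next_state]) - Q[state, action]
)

Repeating this update over many steps and episodes gradually fills the Q-table with values the agent can exploit.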

Conclusion

So, here we are now with a basic understanding of Q-learning from a beginner’s point of view. It is one of the most basic building blocks of RL, and everyone should know it before diving deeper into the field. The structure given above is the basic form of the algorithm and can be applied to any agent to observe its states and actions. But, as with any theory, there are loopholes, and the Q-learning algorithm is no exception: there are certain cases where it fails. If our environment or action space is large, there are too many possible state-action pairs, and maintaining the Q-table becomes very difficult. Further research into Deep Q-Learning (DQN) has improved on this algorithm, and it has since advanced to Double DQN, Prioritized Experience Replay, Dueling DQN, and Deep Recurrent Q-Learning.

References:

For the complete code, visit:

https://www.geeksforgeeks.org/q-learning-in-python/
