(Deep) Q-learning, Part 1: basic introduction and implementation

This article is a beginner's guide for novices.

Amber
Apr 9, 2019

The source code used in this article is available on my GitHub.

In this article, we discuss the basic concept of Q-Learning and its implementation. Rather than digging into mathematical details such as TD-learning and the Bellman equation, we aim to give readers intuition. To demonstrate, we implement basic Q-Learning in a simple and intuitive way.

We will cover the following sections:

  • What is Reinforcement Learning (RL)
  • What is the Q-Learning algorithm (QL)
  • Understanding RL with Q-Learning — give readers intuition by visualizing the steps of the Q-learning algorithm on a Reinforcement Learning problem.
  • Simple Q-Learning Implementation with Python3 — implement QL with the tabular method, a step-by-step tutorial.
  • Deep Q-Learning Implementation with Python3 — implement QL with a deep neural network (Deep Q Network, DQN), a step-by-step tutorial.

What is Reinforcement Learning (RL)

Reinforcement Learning (RL), a branch of Machine Learning, is applied to problems where we know WHAT we want but are not really sure HOW to get there. For example, how do we train an Agent that can recognize whether there is a tree in an image? Not even humans really know how to identify a tree; we just know one when we see it, right? Hence, for a machine, this task is complex and impractical to solve with traditional methods.

Let’s pause and think about how to teach a puppy to sit. When given the SIT instruction, the puppy doesn’t understand at first and takes a random action. Fortunately, it may accidentally sit at some point. In those moments, we give it a reward. As time goes by, the puppy sits down correctly for the expected reward when it receives the SIT instruction. In other words, it has learned when and how to sit.

This is where the magic happens in Reinforcement Learning: rewards.

Inspired by the rewarding process of training a puppy, we reward the Agent for desired actions such as successful recognition and/or punish undesired ones. Eventually, the Agent tends to take the specific actions that yield more expected reward. In general, Reinforcement Learning (RL) is nothing but a rewarding process. Our goal is to train a rational Agent that can learn by itself from rewards and behave properly in the environment.

By the way, Agent is a term in AI that means an entity which can perceive and act in an environment.

Reinforcement Learning Procedure

  1. The Agent perceives the environment and gets the current state.
  2. According to the current state, the Agent takes an action based on its strategy/policy.
  3. The Agent gets a reward from the environment and updates its strategy/policy.
  4. After the action is taken, the environment updates and reaches the next state.
  5. Repeat 1–4.

Overall, when solving an RL problem we first need to design/define two parts; a minimal sketch of this interaction loop is shown right after the list.

  • Environment — accepts the action, returns the reward, and updates to the next state based on its internal rules.
  • Agent — perceives the current state, takes an action based on its policy, and receives the reward for updating (learning) the policy.
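To make the loop concrete, here is a minimal sketch of the agent/environment interaction, assuming the classic Gym API (reset/step returning a 4-tuple). The RandomAgent class is an illustrative placeholder, not code from this article's repository.

import gym

class RandomAgent:
    """Placeholder agent: acts randomly and learns nothing (illustrative only)."""
    def act(self, state, action_space):
        return action_space.sample()
    def learn(self, state, action, reward, next_state):
        pass  # a real agent (e.g. Q-learning) would update its policy here

env = gym.make("FrozenLake-v0")
agent = RandomAgent()

for episode in range(10):
    state = env.reset()                 # 1. perceive the current state
    done = False
    while not done:
        action = agent.act(state, env.action_space)         # 2. act by the policy
        next_state, reward, done, info = env.step(action)   # 4. env reaches the next state
        agent.learn(state, action, reward, next_state)      # 3. get the reward, update the policy
        state = next_state
    # 5. repeat with the next episode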

What is the Q-Learning algorithm

Q-Learning is an algorithm in RL for the purpose of policy learning.

The strategy/policy is the core of the Agent. It controls how the Agent interacts with the environment. If an Agent learns the policy perfectly, it is able to determine the most proper action in any given state. For instance, the Agent can correctly identify whether there is a tree in an image if it has learned the best policy.

Mathematically, the policy in Q-Learning is modeled as an action-value function Q(s, a), which represents how willing the Agent is to take action a in a given state s. The value returned by Q(s, a) is called the Q-value. The larger the Q-value, the more likely the action a will be taken, since a large Q-value means the Agent expects to get more reward from it.

Analogy to Supervised Learning

In Supervised Learning, each data point has a true label (i.e. the right answer). Our goal is to find the best hypothesis function h(x) to predict the label. The closer the predicted label is to the true label, the better.

In Reinforcement Learning, the hypothesis function we model with the Q-learning algorithm is the action-value function Q(s, a). We use Q(s, a) as the policy of the Agent. The closer the learned policy is to the best policy, the better. If the best Q(s, a) is found, the Agent has learned the policy perfectly. Therefore, our goal here is to find the best Q(s, a), which in RL terminology is called the TD-target.

In Supervised Learning, the Cost Function (or Loss Function) measures the quality of h(x), and it can be calculated precisely since the label (i.e. the right answer) is known.

Reinforcement Learning also has a loss function, known as the TD-error. Unfortunately, the TD-target is always unknown. Then how do we estimate the quality of the policy, i.e. the Q(s, a), that the Agent has learned?

The way we estimate the best Q(s, a) in the Q-learning algorithm is the Bellman Equation. The procedure is as follows.

Step 1 — At time t, the Agent takes an action a_t in the current state s_t. The Agent then gets a reward, denoted R_t+1, when it arrives at the next state s_t+1.

Step 2 — According to Q(s, a), we can pre-compute the maximum Q-value in the next state s_t+1 by considering all possible actions the Agent can take there. The discount factor γ weights the importance of this maximum next-state Q-value.

Step 3 — Combining steps 1 and 2, the estimate of the best Q(s, a), the TD-target, is complete.
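Putting the three steps together in the article's notation, the estimated best Q-value (the TD-target) and the corresponding TD-error are:

TD-target = R_t+1 + γ · max_a Q(s_t+1, a)
TD-error  = TD-target − Q(s_t, a_t)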

Minimize Loss Function (TD-error)

There are many ways to minimize the loss function, such as Gradient Descent, which we use in the Deep Q-learning implementation. In general, the core procedure of updating Q(s, a) is based on the current reward and the maximum future reward (expected reward). The full update rule is written out right after the list below.

  • Learning Rate — a hyper-parameter controlling how fast the updating procedure converges.
  • Discount Factor — a hyper-parameter weighting the importance of the estimated future reward. Its value is between 0 and 1; the closer it is to 1, the more important the future reward is.
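Combining the TD-target with the learning rate α and the discount factor γ gives the standard tabular Q-learning update rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α · [ R_t+1 + γ · max_a Q(s_t+1, a) − Q(s_t, a_t) ]

The term in brackets is exactly the TD-error; when it shrinks to zero, Q(s_t, a_t) matches the TD-target.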

To understand the learning procedure, let’s take an example and see what happens.

Understanding RL with Q-Learning

Problem Definition

Take the FrozenLake-v0 problem as an example. The environment is a 4x4 map with 16 states. Each state contains one of four tiles — S, F, H or G. In any given state, the Agent has four actions (left, down, right, up) to move to the next state. The Agent starts at S and aims to reach G.

Note that, unlike in the original problem, we assume the frozen surface F is not slippery here. Thus, movement in the environment is deterministic.

# Environment
SFFF (S: starting point, safe)
FHFH (F: frozen surface, safe)
FFFH (H: hole, fall to your doom)
HFFG (G: goal, where the frisbee is located)
# Action
0:left, 1:down, 2:right, 3:up

Training Procedure

At the beginning, the Agent has an empty action-value function Q(s, a) since it knows nothing about the environment.

Simply put, we can view Q(s, a) as a matrix with states as rows and actions as columns, called the Q-matrix. Each element of the Q-matrix is the Q-value of Q(s, a) for the given state s and action a.
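As a quick illustration (a sketch, not the article's exact code), the Q-matrix for this 16-state, 4-action problem can be represented as a NumPy array initialized to zeros:

import numpy as np

n_states, n_actions = 16, 4                   # FrozenLake 4x4: 16 states, 4 actions
q_matrix = np.zeros((n_states, n_actions))    # all Q-values start at zero

print(q_matrix.shape)   # (16, 4)
print(q_matrix[0])      # Q-values of state 0 for [left, down, right, up]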

Now, the Agent needs to explore the environment.

Suppose the Agent takes a random action — right.

  • The Agent arrives at state 1 and gets the current reward.
  • From its current Q(s, a), the Agent reads the maximum expected future reward of the new state.
  • The Agent updates Q(s, a) based on the current reward and the expected future reward.

It keeps going, takes down, and arrives at state 5. The Agent again gets a reward and updates Q(s, a). However, this training episode terminates because the Agent falls into a hole. Therefore, we reset the environment for the next training episode.

After resetting, the Agent starts at state 0 again, and thanks to past experience it has already learned something about the policy Q(s, a). The Agent keeps learning in this training episode and chooses actions by consulting the current Q(s, a).

At first, suppose the Agent takes right again. Next, it takes right instead of down after considering the current Q(s, a). At each step, it again gets a reward and updates Q(s, a).

As time goes by, the Agent learns Q(s, a) well. Now we reset the environment again for testing. Surprise! The Agent can reach the goal easily with a series of proper actions, which means it has learned the best policy.

Simple Q-Learning Implementation with Python3

Step 0 — Overview

Step 1 — Environment Construction

We use the Gym library to build the environment for the FrozenLake problem. However, instead of using the original FrozenLake environment, we register a new, non-slippery one so that movement in the environment is deterministic. We name this environment FrozenLakeNoSlippery-v0.
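One common way to do this with the classic Gym API is to register a variant of FrozenLakeEnv with is_slippery=False. This is a sketch: the id follows the article's naming, but the exact registration in the repository may differ.

import gym
from gym.envs.registration import register

# Register a deterministic (non-slippery) FrozenLake variant.
register(
    id='FrozenLakeNoSlippery-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '4x4', 'is_slippery': False},
    max_episode_steps=100,
)

env = gym.make('FrozenLakeNoSlippery-v0')
print(env.observation_space.n, env.action_space.n)   # 16 states, 4 actions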

Step 2 — hyper-parameters and Q-table initialization

The discount factor weights the importance of future rewards. Its value is between 0 and 1; the closer it is to 1, the more important the future reward is.

The parameters exploration_rate and exploration_decay are used during training and decide which kind of action, random or optimal, the Agent takes.

We create the simplest form of Q(s, a): a matrix with states as rows and actions as columns, called the Q-table/Q-matrix. Since the Agent knows nothing at the beginning, all Q-values are initialized to zero. A possible initialization is sketched below.
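Here is a sketch of how these hyper-parameters and the Q-table could be initialized; the values and variable names are illustrative rather than the exact ones from the original gist, and env is the environment created in Step 1.

import numpy as np

# Hyper-parameters (illustrative values)
learning_rate = 0.1        # step size of the Q-value update
discount_factor = 0.99     # gamma, between 0 and 1
exploration_rate = 1.0     # start with 100% exploration
exploration_decay = 0.001  # how fast exploration fades per episode
num_episodes = 2000

# Q-table: states as rows, actions as columns, all zeros at the start
n_states = env.observation_space.n    # 16
n_actions = env.action_space.n        # 4
q_table = np.zeros((n_states, n_actions))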

Step 3 — Setting the Policy that the Agent follows

In the training stage, the Agent can take either a random or the optimal action. The former is Exploration, used to discover better future rewards; the latter is Exploitation, which always chooses the best known action in a given state. A sketch of such a policy follows the list below.

In the testing stage, the Agent always takes the optimal action.

  • Optimal Policy (Exploitation) — choose the most valuable action in any given state.
  • Random Policy (Exploration) — take a random action in the given state.
  • Control the Exploration Rate — start with 100% exploration and move slowly towards roughly 0%, since we want less and less exploration and more focus on the optimal policy.
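A minimal sketch of such an ε-greedy policy, assuming the q_table and exploration_rate from Step 2 (the function name is illustrative):

import numpy as np

def choose_action(state, q_table, exploration_rate, env):
    """Epsilon-greedy policy: explore with probability exploration_rate,
    otherwise exploit the best known action."""
    if np.random.rand() < exploration_rate:
        return env.action_space.sample()       # Exploration: random action
    return int(np.argmax(q_table[state]))      # Exploitation: optimal action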

Why Exploration and Exploitation

Without Exploration, the Agent would only care about past performance and guaranteed results. However, past performance is no guarantee or indication of future results. With Exploration, the Agent is willing to try riskier options in order to get larger rewards in the future.

However, we don’t want the Agent to always take a random move without considering the optimal policy. Therefore, we gradually decay the exploration rate so that the Agent focuses more and more on the best policy to reach the goal, a.k.a. Exploitation.

Step 4 — Training with Q-learning Algorithm
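Here is a sketch of what the training loop could look like, assuming the env, q_table, hyper-parameters, and choose_action from the previous steps; the original gist may differ in details such as the decay schedule.

import numpy as np

for episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        # Pick an action with the epsilon-greedy policy from Step 3.
        action = choose_action(state, q_table, exploration_rate, env)
        next_state, reward, done, info = env.step(action)

        # Q-learning update: move Q(s, a) toward the TD-target.
        td_target = reward + discount_factor * np.max(q_table[next_state])
        td_error = td_target - q_table[state, action]
        q_table[state, action] += learning_rate * td_error

        state = next_state

    # Explore less and less as training goes on.
    exploration_rate = max(0.01, exploration_rate - exploration_decay)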

Step 5 — Testing

In the testing procedure, the Agent always takes the optimal action in any given state, as sketched below.
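A minimal testing loop under the same assumptions (greedy actions only, no exploration):

import numpy as np

state = env.reset()
done = False
while not done:
    env.render()
    action = int(np.argmax(q_table[state]))    # always the optimal (greedy) action
    state, reward, done, info = env.step(action)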

Step 6 — How to use the code

The complete code above is available here; if you are interested, you can run and expand it by yourself.

Deep Q-Learning Implementation with Python3

Why go deep

With the matrix method, the space we need is n states times m actions. When these numbers are large, say 1000 states with 1000 actions per state, we would need a matrix of one million elements. Thus, the matrix method doesn’t scale to most interesting problems; chess, for example, has a game tree estimated at roughly 10¹²⁰ possibilities. Instead of recording every Q-value precisely in a matrix, approximating Q(s, a), for example with a Neural Network (NN), is more feasible.

See references 1 and 2 for a review of NNs.

Briefly, Deep Q-Learning is the method of approximating Q(s, a) with a deep neural network, called a Deep Q Network (DQN).

Step 0 — Overview

Step 1 — Environment Construction

We use the same FrozenLakeNoSlippery-v0 environment, a non-slippery version of the original environment in the Gym library.


Step 2 — Neural Network Model Definition (NN reference 1 and 2)

We build the NN model using TensorFlow, a machine learning framework. Since the problem is relatively simple, our NN model is also simple and contains only one hidden layer. A sketch of such a model follows the list below.

  • Input Layer — 16 units, since there are 16 states in the environment.
  • Output Layer — 4 units, since there are 4 actions the Agent can take.
  • Hidden Layer — 10 units, for demonstration.
  • Loss Function — Mean Square Error.
  • Optimization method for minimizing the loss function — Gradient Descent.
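A sketch of such a model using tf.keras (layer sizes follow the list above; this assumes TensorFlow 2.x, and the article's original gist may use a different TensorFlow API):

import tensorflow as tf

# 16 one-hot state inputs -> 10 hidden units -> 4 Q-values (one per action)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(16,)),  # hidden layer
    tf.keras.layers.Dense(4)                                          # Q-value outputs
])

# Mean Square Error loss, minimized with (stochastic) Gradient Descent
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss='mse')
model.summary()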

Step 3 — Initialize hyper-parameters and Neural Network Model

Step 4 — Training
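Here is a sketch of a possible DQN training loop. It assumes the environment from Step 1, the Keras model sketched above, and the same exploration hyper-parameters as the tabular version; states are one-hot encoded before being fed to the network, and all names are illustrative.

import numpy as np

def one_hot(state, n_states=16):
    """Encode a discrete state as a 1 x n_states one-hot vector."""
    vec = np.zeros((1, n_states), dtype=np.float32)
    vec[0, state] = 1.0
    return vec

for episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        q_values = model.predict(one_hot(state), verbose=0)
        if np.random.rand() < exploration_rate:
            action = env.action_space.sample()        # Exploration
        else:
            action = int(np.argmax(q_values))         # Exploitation

        next_state, reward, done, info = env.step(action)

        # Build the TD-target for the taken action only.
        target = q_values.copy()
        next_q = model.predict(one_hot(next_state), verbose=0)
        target[0, action] = reward if done else reward + discount_factor * np.max(next_q)

        # One gradient step toward the TD-target (minimizing the MSE / TD-error).
        model.fit(one_hot(state), target, epochs=1, verbose=0)
        state = next_state

    exploration_rate = max(0.01, exploration_rate - exploration_decay)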

Step 5 — Testing
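Testing mirrors the tabular case, except that the greedy action comes from the network's predicted Q-values (using the same illustrative one_hot helper as above):

import numpy as np

state = env.reset()
done = False
while not done:
    env.render()
    q_values = model.predict(one_hot(state), verbose=0)
    action = int(np.argmax(q_values))     # always the greedy action at test time
    state, reward, done, info = env.step(action)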

Step 6 — How to use the code

The complete code above is available here; if you are interested, you can run and expand it by yourself.

Summary

In this article, we gave an overview of Reinforcement Learning and one popular algorithm, Q-learning. There are many details in Q-learning, such as TD-learning, that we skipped; if you are really interested in them, check [2], Chapter 6, Temporal-Difference Learning.

Next, we are going to discuss one of the Deep Q-Learning variants, Double Deep Q-Learning, also called Double Deep Q Network (Double DQN).

Reference

If you like it, please give me a 👏. Any feedback, thoughts, comments, suggestions, or questions are welcome!
