# (Deep) Q-learning, Part 1: basic introduction and implementation

*The source code used here is available in my GitHub.*

In this article, we are going to discuss the basic concept of *Q-Learning* and its implementation. Rather than digging into mathematical details such as *TD-learning* and the *Bellman equation*, we aim to give readers intuition. To demonstrate, we implement basic *Q-Learning* in a simple and intuitive way.

## What we are going to cover

- **What is Reinforcement Learning (RL)**
- **What is the Q-Learning algorithm (QL)**
- **Understanding RL with Q-Learning** — give readers insight by visualizing the procedure of the *Q-learning* algorithm on a *Reinforcement Learning* problem.
- **Simple Q-Learning Implementation with Python3** — implement QL with the tabular method, a step-by-step tutorial.
- **Deep Q-Learning Implementation with Python3** — implement QL with a Deep Neural Network (Deep Q Network, DQN), a step-by-step tutorial.

# What is Reinforcement Learning (RL)

**Reinforcement Learning (RL), a branch of Machine Learning, is applied to problems where we know WHAT we want, but are not really sure HOW to get there.** For example, how do we train an Agent to recognize whether there is a tree in an image? Not even humans really know how to define a tree; we just know one when we see it, right? Hence, for a machine, this task is complex and impractical with traditional methods.

Let’s pause and think about how to teach a puppy to sit. When given the SIT instruction, a puppy doesn’t understand it at first and takes random actions. Fortunately, it may accidentally sit at some point. In those moments, we give it a reward. As time goes by, the puppy sits down correctly for the expected reward when it receives the SIT instruction. In other words, it has learned when/how to sit.

This is where the magic happens in Reinforcement Learning: *Rewards*.

Inspired by the rewarding process of training a puppy, we reward the Agent for desired actions, such as successful recognition, and/or punish undesired ones. Eventually, the Agent tends to take the actions that earn more expected reward. In general, Reinforcement Learning (RL) is nothing but a rewarding process. Our goal is to train a rational Agent that can learn by itself from rewards and behave properly in the environment.

By the way, Agent is the term in AI for an entity that can perceive and act in an environment.

## Reinforcement Learning Procedure

1. An Agent perceives the environment and gets the *current state*.
2. According to the *current state*, the Agent takes an *action* following its *strategy/policy* in the environment.
3. The Agent gets a *reward* from the environment and updates its *strategy/policy*.
4. After the *action* is taken, the environment updates and reaches the *next state*.
5. Repeat 1–4.

Overall, we need to design/define two parts when solving an RL problem.

- **Environment** — accepts the *action*, returns the *reward*, and updates to the *next state* based on its internal rules.
- **Agent** — perceives the *current state*, takes an *action* based on its *policy*, and receives the *reward* for updating (learning) the *policy*.

# What is the Q-Learning algorithm

Q-Learning is an algorithm in RL for policy learning.

**The strategy/policy is the core of the Agent. It controls how the Agent interacts with the environment.** If an Agent learns the *policy* perfectly, then it is able to determine the most proper *action* in any given *state*. For instance, the Agent can correctly identify whether there is a tree in an image if it has learned *the best policy*.

**Mathematically, the policy in Q-Learning is modeled as an action-value function Q(s, a),** which represents how willing the Agent is to take action *a* in a given state *s*. The value returned by *Q(s, a)* is called the *Q-value*. The larger the *Q-value*, the more likely the action *a* will be taken, since the Agent expects to get more reward when the *Q-value* is large.

## Analogy to Supervised Learning

In Supervised Learning, each data point has a *true label (i.e. the right answer)*. Our goal is to find the *best hypothesis function h(x)* to predict the *label*. The closer the *predicted label* is to the *true label*, the better.

In Reinforcement Learning, the *hypothesis function h(x)* we model with the *Q-learning* algorithm is the *action-value function Q(s, a)*, which we use as the policy of the Agent. The closer the *learned policy* is to the *best policy*, the better. If the *best Q(s, a)* is found, then the Agent has learned the *policy* perfectly. Therefore, our goal here is to find the *best Q(s, a)*, also called the *TD-target*.

The Cost Function (or Loss Function) measures the quality of *h(x)* in Supervised Learning, and it can be calculated precisely since the *label (i.e. the right answer)* is known.

**There is also a Loss Function, a.k.a. the TD-error, in Reinforcement Learning. Unfortunately, the TD-target is always unknown. Then how do we estimate the quality of the policy, i.e. Q(s, a), that the Agent has learned?**

The way we estimate the *best Q(s, a)* in the *Q-learning* algorithm is the Bellman Equation. The procedure is described as follows.

Step 1 — At time t, the Agent takes an *action a_t* in the given current state *s_t*. Then the Agent gets a *reward*, denoted *R_t+1*, when it arrives at the next state *s_t+1*.

Step 2 — According to *Q(s, a)*, we can pre-calculate the maximum *Q-value* in the given state *s_t+1* by considering all possible actions the Agent can take there. The *discount factor γ* weights the importance of the maximum next-state *Q-value*.

Step 3 — Combining step 1 and step 2, the estimated *best Q(s, a)*, or *TD-target*, is complete.
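Written out, the estimate assembled in steps 1 and 2 is the standard Q-learning TD-target:

```latex
\text{TD-target} = R_{t+1} + \gamma \, \max_{a} Q(s_{t+1}, a)
```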

**Minimize the Loss Function (*TD-error*)**

There are many ways to minimize the Loss Function, such as *Gradient Descent*, which we use in the Deep Q-learning implementation. In general, the core procedure of updating *Q(s, a)* is based on the *current reward* and the *maximum future reward (expected reward)*.

- **Learning Rate** — a hyper-parameter controlling the convergence speed of the update procedure.
- **Discount Factor** — a hyper-parameter weighting the importance of the estimated future reward. Its value is between 0 and 1. The closer to 1, the more important the *future reward* is.
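Putting the pieces together, a single update can be traced with concrete numbers. The values below are purely illustrative:

```python
learning_rate = 0.5      # alpha, controls convergence speed
discount_factor = 0.9    # gamma, weights the future reward

q_current = 0.2          # Q(s_t, a_t) before the update (illustrative)
reward = 1.0             # R_{t+1} observed from the environment (illustrative)
max_next_q = 0.6         # max over a of Q(s_{t+1}, a) (illustrative)

td_target = reward + discount_factor * max_next_q   # estimated best Q(s, a): 1.54
td_error = td_target - q_current                    # the loss to minimize: 1.34
q_updated = q_current + learning_rate * td_error    # move Q toward the target: 0.87
```

The update nudges *Q(s, a)* part of the way toward the TD-target; the learning rate decides how big that step is.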

To understand the learning procedure, let’s take an example and see what happens.

# Understanding RL with Q-Learning

## Problem Definition

Take the problem *FrozenLake-v0* as an example. The environment is a 4x4 map with 16 *states*. Each *state* contains one of four tiles: *S, F, H* or *G*. In any given *state*, the Agent has four *actions* (*left, down, right, up*) to reach the *next state*. The Agent starts at *S* and aims to arrive at *G*.

Note that we assume *F* is not slippery here, which differs from the original problem. Thus, movement in the environment is deterministic.

**Environment**

SFFF (S: starting point, safe)

FHFH (F: frozen surface, safe)

FFFH (H: hole, fall to your doom)

HFFG (G: goal, where the frisbee is located)

**Actions** — 0: left, 1: down, 2: right, 3: up

## Training Procedure

At the beginning, the Agent has an empty action-value function *Q(s, a)* since it doesn’t know anything about the environment.

Simply, we can view *Q(s, a)* as a matrix with *states* as rows and *actions* as columns, called the *Q-matrix*, as the following right figure shows. Each element in the *Q-matrix* is the *Q-value* of *Q(s, a)* for the given *state s* and *action a*.

Now, the Agent needs to explore the environment.

Suppose the Agent takes a random *action*: *right*.

1. The Agent arrives at *state 1* and gets the *current reward*.
2. The environment returns the maximum *expected reward*.
3. The Agent updates *Q(s, a)* based on the *current* and *expected rewards*.

It keeps going: it takes *down* and arrives at *state 5*. The Agent again gets a *reward* and updates *Q(s, a)*. However, this training episode terminates because the Agent falls into a hole. Therefore, we reset the environment for the next training episode.

After resetting, the Agent starts at state 0 again, and from past experience it has a partially learned policy *Q(s, a)*. The Agent keeps learning in this training episode and takes actions by considering the *current Q(s, a)*.

At first, suppose the Agent takes *right* again. Next, it takes *right* instead of *down* after considering the *current Q(s, a)*. At each step, it also gets a *reward* and updates *Q(s, a)*.

As time goes by, the Agent learns *Q(s, a)* well. Now we reset the environment again for testing. Surprise! The Agent reaches the goal easily with a series of proper actions, which means it has learned the *best policy*.

# Simple Q-Learning Implementation with Python3

## Step 0 — Overview

## Step 1 — Environment Construction

We use the Gym library to build the environment and solve the FrozenLake problem. However, instead of using the original FrozenLake environment, we create a new one with no slipperiness to make sure movement in the environment is deterministic. We name this environment *FrozenLakeNoSlippery-v0*.
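The registration gist itself is not reproduced here. For readers without Gym installed, the same deterministic 4x4 environment can be mimicked with a framework-free sketch; the class name and API below are illustrative stand-ins, not Gym’s:

```python
class FrozenLakeNoSlippery:
    """Minimal deterministic 4x4 FrozenLake stand-in (illustrative, not Gym)."""

    MAP = "SFFF" "FHFH" "FFFH" "HFFG"  # 16 tiles, read row by row

    def reset(self):
        self.state = 0                 # start at S, the top-left corner
        return self.state

    def step(self, action):            # 0: left, 1: down, 2: right, 3: up
        row, col = divmod(self.state, 4)
        if action == 0:
            col = max(col - 1, 0)
        elif action == 1:
            row = min(row + 1, 3)
        elif action == 2:
            col = min(col + 1, 3)
        else:
            row = max(row - 1, 0)
        self.state = row * 4 + col
        tile = self.MAP[self.state]
        reward = 1.0 if tile == "G" else 0.0
        done = tile in "GH"            # episode ends at the goal or in a hole
        return self.state, reward, done


env = FrozenLakeNoSlippery()
s = env.reset()
s, r, done = env.step(2)               # move right: state 0 -> state 1
```

Moving off the edge of the map leaves the Agent in place, matching the deterministic behavior described above.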

## Step 2 — hyper-parameters and Q-table initialization

In *line 7*, the *discount factor* weights the importance of the *future reward*. Its value is between 0 and 1; the closer to 1, the more important the *future reward* is.

In *lines 8–9*, the parameters *exploration_rate* and *exploration_decay* are used in the training stage and affect which kind of action, a *random action* or the *optimal action*, the Agent takes.

In *lines 15–17*, we create the simplest form of *Q(s, a)*: a matrix with *states* as rows and *actions* as columns, called the *Q-table/Q-matrix*. Since the Agent knows nothing at the beginning, all *Q-values* are initialized to zero.
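The gist those line numbers refer to is not shown here; an initialization along these lines might look like the following (the hyper-parameter values are illustrative, not the article’s exact ones):

```python
import numpy as np

n_states, n_actions = 16, 4

learning_rate = 0.5        # convergence speed of the update procedure
discount_factor = 0.9      # gamma: weight of the future reward, between 0 and 1
exploration_rate = 1.0     # start fully random
exploration_decay = 0.99   # shrink exploration after each episode

# Q-table: states as rows, actions as columns, all zeros at the start
Q = np.zeros((n_states, n_actions))
```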

## Step 3 — Setting the Policy that the Agent follows

In the training stage, the Agent can take a random or the optimal action. The former is for *Exploration*, in order to discover better future rewards; the latter is for *Exploitation*, always choosing the best-known *action* in a given *state*.

In testing stage, the Agent always takes the optimal action.

- Optimal Policy (Exploitation) — choose the most valuable *action* in any *state*.
- Random Policy (Exploration) — take a random *action* in a given *state*.
- Control the Exploration Rate — start with 100% exploration and move slowly toward roughly 0%, since we want less and less exploration and more focus on the optimal policy.

**Why Exploration and Exploitation**

Without *Exploration*, the Agent only cares about past performance and guaranteed results. However, past performance is no guarantee of future results. With *Exploration*, the Agent is more eager to try riskier options that may yield larger rewards in the future.

However, we don’t want the Agent to always take a random move without considering the optimal policy. Therefore, we gradually decay the *exploration rate* so the Agent focuses on the *best policy* for reaching the goal, a.k.a. *Exploitation*.

## Step 4 — Training with Q-learning Algorithm
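The training gist is not embedded here. The loop below is a self-contained sketch of tabular Q-learning on the deterministic 4x4 map; the hyper-parameter values, episode counts, and helper names are illustrative choices, not the article’s exact code:

```python
import numpy as np

MAP = "SFFF" "FHFH" "FFFH" "HFFG"   # 16 tiles, read row by row

def step(state, action):
    """Deterministic move; 0: left, 1: down, 2: right, 3: up."""
    row, col = divmod(state, 4)
    if action == 0:
        col = max(col - 1, 0)
    elif action == 1:
        row = min(row + 1, 3)
    elif action == 2:
        col = min(col + 1, 3)
    else:
        row = max(row - 1, 0)
    nxt = row * 4 + col
    reward = 1.0 if MAP[nxt] == "G" else 0.0
    done = MAP[nxt] in "GH"          # goal or hole ends the episode
    return nxt, reward, done

rng = np.random.default_rng(0)
Q = np.zeros((16, 4))
learning_rate, discount_factor = 0.5, 0.9
exploration_rate, exploration_decay = 1.0, 0.999

for episode in range(5000):
    state = 0                                   # reset: start back at S
    for _ in range(100):                        # cap the episode length
        if rng.random() < exploration_rate:     # Exploration
            action = int(rng.integers(4))
        else:                                   # Exploitation
            action = int(np.argmax(Q[state]))
        nxt, reward, done = step(state, action)
        # TD-target: current reward plus discounted maximum future reward
        td_target = reward + (0.0 if done else discount_factor * Q[nxt].max())
        Q[state, action] += learning_rate * (td_target - Q[state, action])
        state = nxt
        if done:
            break
    exploration_rate = max(0.1, exploration_rate * exploration_decay)

# Greedy rollout after training: follow argmax(Q) from the start state
state, trail = 0, [0]
for _ in range(20):
    state, reward, done = step(state, int(np.argmax(Q[state])))
    trail.append(state)
    if done:
        break
```

After enough episodes, the greedy rollout walks from state 0 to the goal at state 15 without falling into a hole.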

## Step 5 — Testing

In the testing procedure, the Agent always follows the *optimal policy* in any given *state*; see *line 8*.

## Step 6 — How to use the code

The complete code above is available here; if you are interested, you can run and extend it yourself.

# Deep Q-Learning Implementation with Python3

## Why going deep

With the matrix method, the space we need is *n* states by *m* actions. When these numbers are large, say 1000 states and 1000 actions per state, we need a matrix with 1 million elements. Thus, the matrix method doesn’t work for most interesting problems; chess, for instance, needs a state space of roughly 10¹²⁰. Instead of recording *Q-values* exactly in a matrix, an approximate approach such as a Neural Network (NN) is more feasible.

See references 1 and 2 for a review of NN.

Briefly, Deep Q-Learning approximates *Q(s, a)* with a Deep Neural Network, called a Deep Q Network (DQN).

## Step 0 — Overview

## Step 1 — Environment Construction

We use the same *FrozenLakeNoSlippery-v0* environment, a no-slip version of the original in the *Gym library*.

## Step 2 — Neural Network Model Definition (NN reference 1 and 2)

We build the NN model using *TensorFlow*, a machine learning framework. Since the problem is relatively simple, our NN model is also simple and contains only one hidden layer.

- **Input Layer** — 16 units, since there are 16 states in the environment.
- **Output Layer** — 4 units, since there are 4 actions the Agent can take.
- **Hidden Layer** — 10 units, for demonstration.

- **Loss Function** — Mean Squared Error
- **Optimization Method** for minimizing the loss function — Gradient Descent
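The model gist is not embedded here; the article builds it with TensorFlow. As a framework-free sketch of the same 16-10-4 architecture, with randomly initialized placeholder weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_hidden, n_actions = 16, 10, 4

# Random initial weights; training would tune these via Gradient Descent
W1 = rng.normal(0.0, 0.1, size=(n_states, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.1, size=(n_hidden, n_actions))
b2 = np.zeros(n_actions)

def q_values(state):
    """One-hot encode the state, then run the 16-10-4 forward pass."""
    x = np.eye(n_states)[state]              # input layer: one-hot state
    hidden = np.maximum(0.0, x @ W1 + b1)    # hidden layer, 10 units, ReLU
    return hidden @ W2 + b2                  # output layer: one Q-value per action

q = q_values(0)                              # 4 Q-values for state 0
```

A training step would compute the Mean Squared Error between the predicted `q_values(s)[a]` and the TD-target, then apply Gradient Descent to the weights.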

## Step 3 — Initialize hyper-parameters and Neural Network Model

## Step 4 — Training

## Step 5 — Testing

## Step 6 — How to use

The complete code above is available here; if you are interested, you can run and extend it yourself.

# Summary

In this article, we gave an overview of *Reinforcement Learning* and one popular algorithm, the *Q-learning* algorithm. There are many more details to *Q-learning*, such as *TD-learning*; if you are really interested in them, check [2], Chapter 6, Temporal-Difference Learning.

Next, we are going to discuss one of the Deep Q-Learning methods: Double Deep Q-Learning, also called Double Deep Q Network (Double DQN).

# Reference

If you like it, please give me a 👏. Any feedback, thoughts, comments, suggestions, or questions are welcome!