Q-learning: a value-based reinforcement learning algorithm

Dhanoop Karunakaran
Intro to Artificial Intelligence
5 min read · Sep 17, 2020

Please follow this link to understand the basics of Reinforcement Learning.

Let’s go through a few key components before diving into Q-learning.

Policy-based vs value-based RL

In policy-based RL, we start with a random policy, compute the value function of that policy in the evaluation step, and then derive a new policy from the computed value function in the improvement step. The process repeats until it finds the optimal policy. In this type of RL, the policy is updated directly.

In a value-based approach, we start with a random value function and repeatedly compute a new value function until it converges to the optimal one. The intuition here is that the policy that follows the optimal value function is an optimal policy. Here, the policy is updated implicitly through the value function. Q-learning updates the value function (the Q-value) to find the optimal policy.

Three basic approaches of RL algorithms

These approaches are the basis for the various RL algorithms used to solve an MDP. Source: [1]

Temporal-Difference (TD) learning is a combination of the Monte-Carlo and Dynamic Programming (DP) methods. Like the Monte-Carlo method, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (this is called bootstrapping). Q-learning uses a one-step TD approach: we update the Q value after taking a single action, rather than waiting until the end of the episode to update the value function. This will become clearer when we introduce the update equation later in the article.
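To make the bootstrapping idea concrete, here is a minimal sketch of a one-step TD(0) update for a state-value estimate (the function and variable names are illustrative, not from the article):

# Minimal sketch of a one-step TD(0) update for a state-value estimate.
# V is a dictionary mapping states to value estimates.
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    # Bootstrapping: the target uses the current estimate of the next state
    # instead of waiting for the full return at the end of the episode.
    td_target = reward + gamma * V[next_state]
    td_error = td_target - V[state]
    V[state] = V[state] + alpha * td_error
    return V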

Value functions

The value function measures the goodness of a state (state-value) or how good it is to perform an action from a given state (action-value) [1][2].

Backup diagram of state-value and action-value functions: a) state-value b) action-value. Source: [2]

State-value function

The state-value Vπ(s) is the expected total reward when starting from state s and then acting according to policy π.

If the agent uses a given policy π to select actions, the corresponding value function is given by:
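Written out in standard notation (γ is the discount factor and r_t+1 is the reward received at time step t+1; these are the usual symbols, not taken verbatim from the article’s figure):

V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\Big|\, s_{0} = s \right]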

Optimal state-value function: it is the highest possible value function compared to every other value function, for all states.
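In equation form, this is a maximum over all policies:

V^{*}(s) = \max_{\pi} V^{\pi}(s) \quad \text{for all } s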

In value-based RL, if we know the optimal value function, then the policy that corresponds to it is the optimal policy π*.

Action-value function

The action-value Qπ(s, a) is the expected return for an agent that starts from state s, takes action a, and forever after acts according to policy π. A state can have multiple actions, so there can be multiple Q values in a state.
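In the same notation as the state-value function above, this can be written as:

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\Big|\, s_{0} = s,\, a_{0} = a \right]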

The optimal Q-function Q*(s, a) is the highest possible Q value for an agent starting from state s and choosing action a. Therefore, Q*(s, a) is an indication of how good it is for an agent to pick action a while being in state s.

Since V*(s) is the maximum expected total reward when starting from state s, it is the maximum of Q*(s, a) over all possible actions in state s. Therefore, the relationship between Q*(s, a) and V*(s) is easily obtained as:
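V^{*}(s) = \max_{a} Q^{*}(s, a)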

If we know the optimal Q-function Q*(s, a), the optimal policy can be easily extracted by choosing, in each state s, the action a that gives the maximum Q*(s, a).
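In other words, the optimal policy is the greedy policy with respect to Q*:

\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)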

Q-learning

Q-learning is a value-based, off-policy, temporal-difference (TD) reinforcement learning algorithm. Off-policy means the agent follows a behaviour policy to choose the action that takes it from state s_t to the next state s_t+1. From s_t+1, it evaluates a policy π that is different from the behaviour policy. In Q-learning, this policy π is the absolute greedy action taken from the next state s_t+1.

Computation of Q-value in Q-learning. Source[4]

As discussed for the action-value function, the above equation shows how the Q value is computed for an action a taken from state s in Q-learning. It is the sum of the immediate reward received after taking an action chosen by the behaviour policy (ϵ-soft, ϵ-greedy, or softmax) and the discounted value of the absolute greedy action from state s_t+1 (the action that has the maximum Q value among the available actions).
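Written out with a discount factor γ, this target value is the standard Q-learning target:

r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)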

Basic update rule in Q-learning

It is important to mention the update rule in Q-learning: the new Q value is the old Q value plus the TD error (scaled by the learning rate α).

Expanding the TD error in Q-learning

The TD error is computed by subtracting the old Q value from the newly estimated Q value (the TD target).

Updating rule in Q-learning. Source[3]

The above equation shows the expanded view of the update rule.
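For reference, the standard form of this update rule, with learning rate α and discount factor γ, is:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]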

Q-table

We are going to discuss Q-learning using a Q-table. When a neural network is used to approximate the Q function instead, the approach is called DQN, which we can discuss in another article.

Q-table. Source:[3]

The Q-table contains a Q value for every state-action pair. During the learning process, the Q values in the table get updated.
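As a concrete sketch, the Q-table can be represented as a simple 2-D array indexed by state and action (the sizes below are illustrative, not from the article):

import numpy as np

n_states, n_actions = 16, 4                  # illustrative sizes, e.g. a small grid world
q_table = np.zeros((n_states, n_actions))    # every state-action pair starts at zero

q_value = q_table[5, 2]                      # Q value of action 2 in state 5
greedy_action = np.argmax(q_table[5])        # greedy action in state 5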

Pseudo-code for Q-learning in an episodic task using a Q-table

  1. Initialize the Q-table with zeros.
  2. The episode begins.
  3. Perform action a_t from state s_t and observe the next state s_t+1 and the reward r.
  4. Compute the new Q value using the equation below and update the Q-table.
Source[4]

  5. Make s_t+1 the new current state s_t and repeat steps 3 to 4 until s_t+1 reaches the terminal state.
  6. The episode ends.
  7. Repeat steps 2 to 6 until the Q values converge to their optimal values.
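Putting the steps together, here is a minimal, self-contained sketch of tabular Q-learning on a toy 5-state corridor environment. The environment, hyper-parameters, and reward values are illustrative assumptions, not taken from the article:

# Tabular Q-learning on a toy 5-state corridor.
# The environment and hyper-parameters below are illustrative assumptions.
import numpy as np

n_states, n_actions = 5, 2           # states 0..4, actions: 0 = left, 1 = right
terminal_state = n_states - 1        # reaching the right end ends the episode
alpha, gamma, epsilon = 0.1, 0.99, 0.1
q_table = np.zeros((n_states, n_actions))   # step 1: initialise with zeros

def step(state, action):
    # Toy dynamics: move left/right, reward 1 only on reaching the terminal state.
    next_state = max(state - 1, 0) if action == 0 else min(state + 1, n_states - 1)
    reward = 1.0 if next_state == terminal_state else 0.0
    done = next_state == terminal_state
    return next_state, reward, done

rng = np.random.default_rng(0)
for episode in range(500):                      # step 2: episode begins
    state = 0
    done = False
    while not done:
        # Behaviour policy: epsilon-greedy over the current Q-table
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, done = step(state, action)           # step 3
        # Step 4: one-step TD update, bootstrapping with the greedy (max) Q value
        td_target = reward + gamma * np.max(q_table[next_state]) * (not done)
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state = next_state                                        # step 5

print(q_table)   # the greedy action should point right in every state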

In the next article, we can discuss Deep Q-learning (DQN).

If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.
