Q-learning: a value-based reinforcement learning algorithm

Dhanoop Karunakaran
Intro to Artificial Intelligence
5 min read · Sep 17, 2020

Please follow this link to understand the basics of Reinforcement Learning.

Let’s go through a few key components before diving into Q-learning.

Policy-based vs value-based RL

In policy-based RL, we start with a random policy, compute the value function of that policy in the evaluation step, and then derive a new policy from the computed value function in the improvement step. The process repeats until it finds the optimal policy. In this type of RL, the policy is updated directly.

In a value-based approach, we start with a random value function and repeatedly compute a new value function until it converges to the optimal one. The intuition here is that the policy that follows the optimal value function is an optimal policy. Here, the policy is updated implicitly through the value function. Q-learning updates the value function (the Q-value) to find the optimal policy.

Three basic approaches of RL algorithms

These approaches are the basis for the various RL algorithms used to solve an MDP. Source: [1]

Temporal-Difference (TD) learning is a combination of the Monte-Carlo and Dynamic Programming (DP) methods. Like the Monte-Carlo method, TD methods can learn directly from raw experience without a model of the environment’s dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (this is called bootstrapping). Q-learning uses a one-step TD approach: we update the Q value after taking a single action, rather than waiting until the end of the episode to update the value function. This will become clearer when we introduce the update equation later in the article.
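To make the bootstrapping idea concrete, here is a minimal sketch of a one-step TD(0) update for a state-value estimate (the function and variable names are illustrative, not from the article):

# Minimal sketch of a one-step TD(0) update for a state-value estimate.
# V is a dictionary mapping states to value estimates.
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    # Bootstrapping: the target uses the current estimate of the next state
    # instead of waiting for the full return at the end of the episode.
    td_target = reward + gamma * V[next_state]
    td_error = td_target - V[state]
    V[state] = V[state] + alpha * td_error
    return V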

Value functions

The value function measures the goodness of a state (state-value) or how good it is to perform an action from a given state (action-value) [1][2].

Backup diagram of state-value and action-value functions: a) state-value b) action-value. Source: [2]

State-value function

The state-value Vπ(s) is the expected total reward when starting from state s and then acting according to policy π.

If the agent uses a given policy π to select actions, the corresponding value function is given by:
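Written out in standard notation (γ is the discount factor and r_t+1 is the reward received at time step t+1; these are the usual symbols, not taken verbatim from the article’s figure):

V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\Big|\, s_{0} = s \right]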

Optimal state-value function: it is the highest possible value function compared to every other value function, for all states.
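In equation form, this is a maximum over all policies:

V^{*}(s) = \max_{\pi} V^{\pi}(s) \quad \text{for all } s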

In value-based RL, if we know the optimal value function, then the policy that corresponds to it is the optimal policy π*.

Action-value function

The action-value Qπ(s, a) is the expected return for an agent that starts from state s, takes action a, and forever after acts according to policy π. A state can have multiple actions, so there can be multiple Q values in a state.
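In the same notation as the state-value function above, this can be written as:

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\Big|\, s_{0} = s,\, a_{0} = a \right]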

The optimal Q-function Q*(s, a) is the highest possible Q value for an agent starting from state s and choosing action a. Therefore, Q*(s, a) is an indication of how good it is for an agent to pick action a while being in state s.

Since V*(s) is the maximum expected total reward when starting from state s, it is the maximum of Q*(s, a) over all possible actions in state s. Therefore, the relationship between Q*(s, a) and V*(s) is easily obtained as:
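V^{*}(s) = \max_{a} Q^{*}(s, a)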

If we know the optimal Q-function Q*(s, a), the optimal policy can be easily extracted by choosing, in each state s, the action a that gives the maximum Q*(s, a).
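In other words, the optimal policy is the greedy policy with respect to Q*:

\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)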

Q-learning

Q-learning is a value-based, off-policy, temporal-difference (TD) reinforcement learning algorithm. Off-policy means the agent follows a behaviour policy to choose the action that takes it from state s_t to the next state s_t+1. From s_t+1, it evaluates a policy π that is different from the behaviour policy. In Q-learning, this policy π is the absolute greedy action taken from the next state s_t+1.

Computation of Q-value in Q-learning. Source[4]

As discussed for the action-value function, the above equation shows how the Q value is computed for an action a taken from state s in Q-learning. It is the sum of the immediate reward received after taking an action chosen by the behaviour policy (ϵ-soft, ϵ-greedy, or softmax) and the discounted value of the absolute greedy action from state s_t+1 (the action that has the maximum Q value among the available actions).
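Written out with a discount factor γ, this target value is the standard Q-learning target:

r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)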

Basic update rule in Q-learning

It is important to mention the update rule in Q-learning: the new Q value is the old Q value plus the TD error (scaled by the learning rate α).

Expanding the TD error in Q-learning

The TD error is computed by subtracting the old Q value from the newly estimated Q value (the TD target).

Updating rule in Q-learning. Source[3]

The above equation shows the expanded view of the update rule.
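For reference, the standard form of this update rule, with learning rate α and discount factor γ, is:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]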

Q-table

We are going to discuss Q-learning using a Q-table. When a neural network is used to approximate the Q function instead, the approach is called DQN, which we can discuss in another article.

Q-table. Source:[3]

The Q-table contains a Q value for every state-action pair. During the learning process, the Q values in the table get updated.
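As a concrete sketch, the Q-table can be represented as a simple 2-D array indexed by state and action (the sizes below are illustrative, not from the article):

import numpy as np

n_states, n_actions = 16, 4                  # illustrative sizes, e.g. a small grid world
q_table = np.zeros((n_states, n_actions))    # every state-action pair starts at zero

q_value = q_table[5, 2]                      # Q value of action 2 in state 5
greedy_action = np.argmax(q_table[5])        # greedy action in state 5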

Pseudo-code for Q-learning in an episodic task using a Q-table

  1. Initialize the Q-table with zeros.
  2. The episode begins.
  3. Perform action a_t from state s_t and observe the next state s_t+1 and the reward r.
  4. Compute the new Q value using the equation below and update the Q-table.
Source[4]

  5. Make s_t+1 the new current state s_t and repeat steps 3 to 4 until s_t+1 reaches the terminal state.
  6. The episode ends.
  7. Repeat steps 2 to 6 until the Q values converge to their optimal values.
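Putting the steps together, here is a minimal, self-contained sketch of tabular Q-learning on a toy 5-state corridor environment. The environment, hyper-parameters, and reward values are illustrative assumptions, not taken from the article:

# Tabular Q-learning on a toy 5-state corridor.
# The environment and hyper-parameters below are illustrative assumptions.
import numpy as np

n_states, n_actions = 5, 2           # states 0..4, actions: 0 = left, 1 = right
terminal_state = n_states - 1        # reaching the right end ends the episode
alpha, gamma, epsilon = 0.1, 0.99, 0.1
q_table = np.zeros((n_states, n_actions))   # step 1: initialise with zeros

def step(state, action):
    # Toy dynamics: move left/right, reward 1 only on reaching the terminal state.
    next_state = max(state - 1, 0) if action == 0 else min(state + 1, n_states - 1)
    reward = 1.0 if next_state == terminal_state else 0.0
    done = next_state == terminal_state
    return next_state, reward, done

rng = np.random.default_rng(0)
for episode in range(500):                      # step 2: episode begins
    state = 0
    done = False
    while not done:
        # Behaviour policy: epsilon-greedy over the current Q-table
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, done = step(state, action)           # step 3
        # Step 4: one-step TD update, bootstrapping with the greedy (max) Q value
        td_target = reward + gamma * np.max(q_table[next_state]) * (not done)
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state = next_state                                        # step 5

print(q_table)   # the greedy action should point right in every state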

In the next article, we can discuss Deep Q-learning (DQN).

If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.
