A brief explanation of the state-action value function (Q) in RL

Dhanoop Karunakaran
Intro to Artificial Intelligence
5 min read · Aug 22, 2023

A reinforcement learning (RL) algorithm finds an optimal policy that maximizes the return by interacting with an environment modelled as a Markov decision process (MDP).

There are a few important components to discuss before we get to the state-action value function.

Reward and Return

The reward indicates how good an action is for moving from state s to the next state. It is a crucial component of RL, as it drives the agent's learning.

The return, on the other hand, is the cumulative sum of discounted rewards from the current state to the terminal state. The discount factor gamma ensures that immediate rewards are weighted more heavily than rewards received later.
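As a quick sketch (my own code, not from the article), the return can be computed from a sequence of rewards and a discount factor gamma as follows:

# Discounted return: G = r0 + gamma*r1 + gamma^2*r2 + ...
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: four zero-reward steps followed by a terminal reward of 100, with gamma = 0.5
print(discounted_return([0, 0, 0, 0, 100], gamma=0.5))  # 6.25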

States

The state s is a complete description of the state of the world; here, we assume the states are fully observable.

Action

The agent performs an action in the environment to move from the current state to the next state. For instance, in a navigation task, turning left or turning right are examples of actions.

Policy

The policy is a mapping from states to actions. In other words, the policy determines the agent's behaviour in a specific state.
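As a tiny illustration (my own example, with made-up state and action names), a deterministic policy can be written as a simple lookup table:

# hypothetical states and actions, purely for illustration
policy = {"at_junction": "turn_left", "at_corridor": "go_straight"}
action = policy["at_junction"]  # the behaviour the policy prescribes in state "at_junction"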

Value function

The value function returns the value of a state or a state-action pair. There are two value functions: the state value function and the state-action value function. The state value function, V(s), gives the expected return if we start from state s and act according to the policy. The state-action value function, Q(s, a), gives the expected return if we start from state s, take an arbitrary action a, and then act according to the policy.
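Written slightly more formally (standard notation, not shown in the original post), with G denoting the return defined above:

V(s) = E[ G | start in state s, then follow the policy ]
Q(s, a) = E[ G | start in state s, take action a, then follow the policy ]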

State-action value function

Let’s explain more about the state-action value function using a rover example.

Rover example to explain state-action value function

We have a rover that can move left or right along a row of six states. There are two terminal states: state 1 on the far left, with a reward of 100, and state 6 on the far right, with a reward of 40. The rest of the states have zero reward, and gamma is set to 0.5.

Consider the example of a rover starting at state 5

Now imagine the rover's policy is to always go left, and the rover starts from state 5. The total return of state 5 is then calculated as below:

Calculating the return at state 5 with the policy of going left
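Written out with the layout above (states 4, 3 and 2 give zero reward, and the terminal state 1 gives 100):

Return(5) = 0 + (0.5)(0) + (0.5)^2(0) + (0.5)^3(0) + (0.5)^4(100) = 6.25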

Similarly, we can compute the return of the rover starting at state 4 with the policy of going left.

Calculating the return at state 4 with the policy of going left
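Here the rover passes through states 3 and 2 (zero reward) before reaching the terminal state 1:

Return(4) = 0 + (0.5)(0) + (0.5)^2(0) + (0.5)^3(100) = 12.5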

Finally, we can compute the return of every state under the policy of always going left: 50, 25, 12.5 and 6.25 for states 2 to 5, with the terminal states keeping their rewards of 100 and 40.

Now consider a new policy of always going right. We can compute the return at each state using the same return formula as above. This gives different returns at each state, since the policy is now different: 2.5, 5, 10 and 20 for states 2 to 5.

Return of each state with a policy of going right always

So, we can plot the rover's return at each state under each of the two policies.

Returns based on different policies (going left always and going right always)

As we have the returns from both policies, we can easily define our best policy for the rover based on the highest return at each state as shown below.

The best policy is based on the highest return from each state.

We can safely say that going left is the best action in every state except state 5, where going left only gives a return of 6.25 while going right gives 20.
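To make this concrete, here is a small sketch (my own code, assuming the six-state layout described above: terminal states 1 and 6 with rewards 100 and 40, gamma = 0.5) that rolls out each fixed policy from every non-terminal state, computes the return, and picks the better action:

GAMMA = 0.5
REWARD = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}  # assumed layout: terminals at states 1 and 6
TERMINALS = {1, 6}

def policy_return(state, policy, gamma=GAMMA):
    # Discounted return G = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... under a fixed policy
    g, discount = 0.0, 1.0
    while True:
        g += discount * REWARD[state]
        if state in TERMINALS:
            return g
        state = state - 1 if policy[state] == "left" else state + 1
        discount *= gamma

always_left = {s: "left" for s in range(2, 6)}
always_right = {s: "right" for s in range(2, 6)}

for s in range(2, 6):
    g_left, g_right = policy_return(s, always_left), policy_return(s, always_right)
    best = "left" if g_left >= g_right else "right"
    print(f"state {s}: left={g_left}, right={g_right}, best={best}")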

The state-action value, or Q value, is the total return obtained by starting from state s, taking an arbitrary action a, and then acting according to the policy thereafter.

Let's compute the Q value of starting from state 2 and taking the arbitrary action of going right. The rover is now in state 3 and, according to the definition of Q, it follows the policy thereafter; the policy here is to go left.

Taking an arbitrary action in the first step and then following the policy thereafter.

As shown in the figure above, the rover moves from state 3 to state 2, and then from state 2 to state 1, reaching the terminal state. So we can compute the Q value as below.
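Collecting the rewards along the path 2 → 3 → 2 → 1 (all zero except the final terminal reward of 100):

Q(2, right) = 0 + (0.5)(0) + (0.5)^2(0) + (0.5)^3(100) = 12.5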

Similarly, if the arbitrary action in the first step was going left, the rover reaches the terminal state 1 directly, and we can compute the Q value as follows.
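Here the rover steps straight into the terminal state 1:

Q(2, left) = 0 + (0.5)(100) = 50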

In this way, we can compute the Q values of all possible actions from all possible states. This gives us the figure below with all the Q values.

Computing all possible Q values of all possible actions from possible states
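These numbers can be reproduced with a short sketch (my own code, same assumed six-state layout as before), computing Q(s, a) directly from the definition: take action a once, then follow the best policy found earlier (left everywhere except state 5):

GAMMA = 0.5
REWARD = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}  # assumed layout: terminals at states 1 and 6
TERMINALS = {1, 6}
BEST_POLICY = {2: "left", 3: "left", 4: "left", 5: "right"}  # from the previous section

def v(s):
    # Return of following the best policy from state s (a terminal state just yields its reward)
    if s in TERMINALS:
        return REWARD[s]
    return q(s, BEST_POLICY[s])

def q(s, a):
    # Q(s, a): reward of the current state plus the discounted return of following the policy afterwards
    s_next = s - 1 if a == "left" else s + 1
    return REWARD[s] + GAMMA * v(s_next)

for s in range(2, 6):
    print(s, {a: q(s, a) for a in ("left", "right")})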

Bellman equation

Now, based on this definition, we can write the formal equation for the state-action value, Q.

Bellman equation for computing the Q value:

Q(s, a) = R(s) + gamma * max Q(s', a')

where R(s) is the reward of the current state, s' is the next state reached by taking action a, and the max is taken over the actions a' available in s'. We can read max Q(s', a') as the best possible return from state s'. For instance, at state 2, max Q(s', a') is 50, as the left action gives the highest Q value.

The intuition behind the equation is that the Q value of action a from state s is the sum of the reward of the current state and the discounted best possible Q value from the next state, s'. Consider the example of computing the Q value starting from state 4 and taking the arbitrary action of going left.

After taking this arbitrary action, the rover reaches state 3. The two possible actions from state 3 are going left and going right. Upon computation, we can see that going left gives the highest Q value from state 3, so max Q(3, a') is the Q value of going left, which is 25. Plugging this into the Bellman equation gives Q(4, left) = R(4) + gamma * max Q(3, a') = 0 + 0.5 * 25 = 12.5, which matches the value we computed earlier from the definition.
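Finally, here is a sketch (my own code, same assumed layout) that uses the Bellman equation as an iterative update rule, starting from all-zero Q values and sweeping until they settle to the numbers above:

GAMMA = 0.5
REWARD = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}  # assumed layout: terminals at states 1 and 6
TERMINALS = {1, 6}
ACTIONS = {"left": -1, "right": +1}

Q = {(s, a): 0.0 for s in range(2, 6) for a in ACTIONS}

def best_from(s):
    # Best possible return from state s: max over actions, or the reward itself at a terminal state
    if s in TERMINALS:
        return REWARD[s]
    return max(Q[(s, a)] for a in ACTIONS)

for _ in range(20):  # a handful of sweeps is enough for this tiny problem
    for s in range(2, 6):
        for a, step in ACTIONS.items():
            # Bellman update: Q(s, a) = R(s) + gamma * max_a' Q(s', a')
            Q[(s, a)] = REWARD[s] + GAMMA * best_from(s + step)

for s in range(2, 6):
    print(s, {a: Q[(s, a)] for a in ACTIONS})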

Credit for the overall content goes to Andrew Ng's Machine Learning Specialization course on Coursera.

If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.

