Policy Gradients in Reinforcement Learning, Homework 2 — CS294

These are our notes on Homework 2 of the course CS294-112 (Deep Reinforcement Learning, UC Berkeley). This document is organised as follows:

  1. Lecture Review: a review of the theoretical concepts, with plain-English comments on the formulas
  2. Problem 1: we try to prove that a state-dependent value function is an unbiased baseline for the policy gradient.

1. Lecture Review

1.1 Formula (1): objective function

Formula (1) is the objective function; `tau` is a trajectory, i.e. a sequence of (observation, action) pairs.
`tau ~ pi_theta` means that trajectories are distributed according to the probability distribution induced by the current policy with parameters `theta`. Comparing it to rolling dice: `tau` is the outcome of a single roll, and `pi_theta` is the distribution that tells you how the dice behave.
`pi_theta(tau)` is the probability that the given trajectory `tau` occurs.
The reward of a trajectory, `r(tau)`, is simply the sum of the rewards collected at each step of the trajectory.
More formally, the policy `πθ(action | state)` is a probability distribution over the action space, conditioned on the state.
In the agent-environment loop, the agent samples an action `a_t` from `πθ(· | s_t)` and the environment responds with a reward `r(s_t, a_t)`.
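
For reference, the objective described above can be written in standard notation as follows (my own sketch of the notation; `T` is the trajectory length and `p` denotes the environment dynamics):

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \big[ r(\tau) \big],
\qquad
r(\tau) = \sum_{t=1}^{T} r(s_t, a_t),
\qquad
\pi_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)
```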

1.2 Formula (2): gradient definition

We take the gradient of the objective function and apply gradient ascent (we want to maximise the expected reward). This is just multivariable calculus.
How do we compute it? Using a batch of sampled trajectories.
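
The standard way to make this gradient computable is the likelihood-ratio (log-derivative) identity, which turns the gradient into an expectation we can estimate by sampling; presumably this is what the lecture's derivation does:

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta \int \pi_\theta(\tau)\, r(\tau)\, d\tau
  = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau
  = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \big[ \nabla_\theta \log \pi_\theta(\tau)\, r(\tau) \big]
```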

1.3 Formula (4): gradient computation

Computed over a batch, the gradient looks just like the gradient in a classical supervised learning problem, except that each term is weighted by the reward.
It is not yet clear how to compute it, though, since it still contains `pi_theta(tau)`, where `tau` is a whole trajectory.
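
Over a batch of `N` sampled trajectories, the expectation above is replaced by its Monte Carlo estimate (a standard form, presumably what (4) expresses):

```latex
\nabla_\theta J(\theta) \;\approx\;
\frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(\tau_i)\, r(\tau_i)
```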

1.4 Formula (5): gradient computation with exploded trajectories

This is the key section of the whole policy gradient lecture.

We are going through batches of trajectories `tau`. Here we expand each trajectory into its individual steps, so that we can better understand what `pi_theta(tau)` means.
This formula is important for the implementation, so it is worth spending a bit more time on it: let's write it out and then examine its parts one by one.
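
Written out per time step, the standard expanded form of this estimator (presumably what (5) shows) is:

```latex
\nabla_\theta J(\theta) \;\approx\;
\frac{1}{N} \sum_{i=1}^{N}
\left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right)
\left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right)
```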

1. `∇θ log πθ(action | state)`: here we take the gradient of the log-probability that the policy assigns to this action in this state (the logarithm comes from the likelihood-ratio identity used to derive the gradient; `πθ` is a probability distribution, not a plain number). In plain English, this expression answers: how should I change my parameters `theta` in order to increase the probability that, the next time I am in this state, I will take this action?

2. `r(state, action)`: here we sum all the rewards along the trajectory and obtain the score (total return) of the given trajectory.

Key insight on the policy gradient: the gradient term and the reward term are multiplied because I do not know, a priori, whether the next time I am in this state I actually want to take this action again. Whether I really want to is told to me by the reward term: if the total reward is negative, I will decrease the probability of taking this action again; if it is positive, I will increase it.

What we are doing here is summing the gradients over all the steps of the trajectory and then multiplying by the total reward of the trajectory.
What does this mean? In supervised learning (when you compute the gradient of the log loss), the gradient tells you in which direction to move to better fit the truth label, because the log loss is computed against the truth labels. Here the policy `pi` has no truth label, so the gradient of `log pi` does not tell us a priori that its direction is the improvement direction. To know whether it is, we multiply by the total trajectory reward: that is how we identify the improvement direction.
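
To make this concrete, here is a minimal NumPy sketch of how the per-step weights of this "naive" estimator can be built: every step of a trajectory is weighted by that trajectory's total reward. The function name and data layout are mine, not the homework's starter code.

```python
import numpy as np

def total_reward_weights(rewards_per_traj):
    """Naive estimator: each step of trajectory i is weighted by the
    *total* reward of that trajectory, r(tau_i)."""
    weights = []
    for rews in rewards_per_traj:
        total = float(np.sum(rews))                 # r(tau): sum over all steps
        weights.append(np.full(len(rews), total))   # same weight at every step t
    return weights

# Two toy trajectories:
w = total_reward_weights([np.array([1.0, 0.0, 2.0]), np.array([-1.0, 1.0])])
# w[0] -> [3., 3., 3.]    w[1] -> [0., 0.]
```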

This is one of the key concepts of reinforcement learning. One problem, though, is that the reward term has a lot of variance: it can change a lot from trajectory to trajectory, and for this reason we lose signal about how good our gradient direction is.

1.5 Formula (6): reward-to-go

Now we move the sum of rewards inside the sum over time steps: the gradient at step `t` is multiplied only by the rewards collected from step `t` onwards, the reward-to-go. (The intuition is that an action taken at time `t` cannot influence rewards that were already collected before `t`.)
This reduces the variance: you can think of it as assigning your 'truth labels' in a less generic and more accurate way than before.
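
In standard notation, the reward-to-go estimator (presumably formula (6)) reads:

```latex
\nabla_\theta J(\theta) \;\approx\;
\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T}
\nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})
\left( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \right)
```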

1.6 Formula (7) and (8): discount factor

We multiply the reward at every step `t'` by a discount factor `gamma ** (t' - t)`. This reduces the importance of rewards far in the future, since they are less influenced by the action taken at the current step. This explains (8), whereas it is not as clear why it is useful to apply the discount factor from the beginning of the whole trajectory as in (7).
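
Here is a minimal sketch of how the discounted reward-to-go of a single trajectory can be computed with a backward pass (my own toy version, not the homework code):

```python
import numpy as np

def discounted_reward_to_go(rewards, gamma=0.99):
    """out[t] = sum_{t' >= t} gamma**(t' - t) * rewards[t']."""
    out = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards: each step adds its own reward to the discounted tail.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# Example: discounted_reward_to_go([1.0, 1.0, 1.0], gamma=0.5) -> [1.75, 1.5, 1.0]
```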

1.7 Formula (9): baseline subtraction

We can subtract a baseline that is constant with respect to `tau` (it is the same for every trajectory `tau`, although it is not necessarily constant in the simulation time `t`).
This also reduces the variance, as explained in the lecture. Why? This is an interesting point. We are multiplying the gradient by the return, but what happens if the reward is always positive? The model will think that whatever it is doing is always correct: at every step the gradient gets multiplied by a positive return, so it always indicates that the gradient direction is the improvement direction.
In this case we are giving the model a misleading signal, and a good workaround is to subtract a baseline. The baseline is chosen to be the average return over the sampled trajectories, so that the individual returns are correctly calibrated to be positive or negative (since we subtract the average).
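
Written out with the average return as the baseline, the estimator (presumably formula (9)) becomes:

```latex
\nabla_\theta J(\theta) \;\approx\;
\frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(\tau_i)\, \big( r(\tau_i) - b \big),
\qquad
b = \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)
```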

1.8 Formula (10): value function as a baseline

In the homework we use a state-dependent baseline: it is a value function, since it is a function of the state (it expresses the value of a particular state). It is defined as the expected sum of future rewards starting from a particular state, given the current policy.

Why do we use the value function? The value function is the expectation of how good the state you are currently in is, i.e. how much reward you will collect from there onwards. So intuitively it makes sense: you are in a state, and you subtract (as a baseline) what you already know about that state (how good it is).
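
In standard notation, the value function described here is:

```latex
V^{\pi}(s_t) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \;\middle|\; s_t \right]
```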

An interesting point is that, at this stage, we do not actually know whether this baseline leaves the gradient estimate unbiased: in the lecture it was proven that a baseline that is trajectory-independent is unbiased. The homework (Problem 1) goes through the state-dependent case later.

1.9 Formula (11): final policy gradient expression

This just puts everything together: we take the structure of (6), the sum of per-step gradients each multiplied by its reward-to-go, add the discount factor of (8), and apply the baseline subtraction of (9), with the value function of (10) as the baseline.
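
In standard notation, the final estimator presumably has the following shape: discounted reward-to-go minus the value-function baseline, weighting each per-step log-probability gradient:

```latex
\nabla_\theta J(\theta) \;\approx\;
\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T}
\nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})
\left( \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r(s_{i,t'}, a_{i,t'}) \;-\; V^{\pi}(s_{i,t}) \right)
```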
