Policy Gradients and Log Derivative Trick

Amina Mollaysa
5 min read · Sep 16, 2018

This article gives a high-level picture for those who want to use reinforcement learning tricks in their ML models but don’t want to dig too deep into the RL field. It mainly follows Sergey Levine’s slides for the policy gradient method and Shakir Mohamed’s blog for the log derivative trick.

Policy gradient

The goal of reinforcement learning is to learn a policy that gives us the maximum expected reward:
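In symbols (a reconstruction following Levine’s notation, where τ denotes a trajectory and p_θ(τ) the distribution over trajectories induced by the policy):

\theta^{*} = \arg\max_{\theta} \; \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \Big[ \sum_{t} r(s_t, a_t) \Big]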

The expectation is taken over trajectories, and theta is the parameter of the policy. The policy is defined as a distribution over actions given a state. The sum is taken over time steps: at each time step, given the current state, the policy tells us which action to take; we receive a reward, and the action transitions us to the next state. We then sample a new action from the policy at the new state, receive another reward, and so on. The final reward is the sum of the rewards collected at each time step. For simplicity, we do not use a discounted reward here.
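To make this rollout loop concrete, here is a minimal sketch (not taken from the slides): a toy chain environment and a tabular softmax policy, with the return computed as the plain, undiscounted sum of per-step rewards. The environment, the policy parameterization, and all names below are illustrative assumptions.

```python
import numpy as np

class ChainEnv:
    """Toy environment: states 0..n-1, actions 0 (left) / 1 (right).
    Reward +1 when the agent reaches the rightmost state."""
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 1:
            self.state = min(self.state + 1, self.n_states - 1)
        else:
            self.state = max(self.state - 1, 0)
        reward = 1.0 if self.state == self.n_states - 1 else 0.0
        done = self.state == self.n_states - 1
        return self.state, reward, done

def policy(theta, state, rng):
    """pi_theta(a | s): softmax over the row of theta for this state."""
    logits = theta[state]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

def rollout(env, theta, rng, max_steps=20):
    """Sample one trajectory and return its total (undiscounted) reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(theta, state, rng)       # sample a_t ~ pi_theta(. | s_t)
        state, reward, done = env.step(action)   # environment transition
        total_reward += reward                   # sum rewards over time steps
        if done:
            break
    return total_reward

rng = np.random.default_rng(0)
theta = np.zeros((5, 2))                          # policy parameters, one row per state
returns = [rollout(ChainEnv(), theta, rng) for _ in range(100)]
print("Monte Carlo estimate of the expected reward:", np.mean(returns))
```

Averaging the returns of many sampled trajectories, as in the last two lines, is exactly how the expectation in the objective is estimated in practice.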

To solve such an optimization problem, we have to take the derivative of the expectation. This is not as easy as in the familiar case where we optimize an empirical loss instead of the true expected loss (which we do because we do not know the true data distribution): there, the samples do not depend on the parameter we want to optimize. However, in…
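To make the contrast concrete, here is a sketch in the notation above (R(τ) is shorthand for the total reward of a trajectory, a name introduced here for convenience). For an empirical loss, the samples are fixed, so the gradient simply moves inside the sum:

\nabla_{\theta} \frac{1}{N} \sum_{i=1}^{N} L(f_{\theta}(x_i), y_i) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} L(f_{\theta}(x_i), y_i)

For the expected reward, the sampling distribution itself depends on θ:

\nabla_{\theta} \, \mathbb{E}_{\tau \sim p_{\theta}(\tau)} [R(\tau)] = \nabla_{\theta} \int p_{\theta}(\tau) R(\tau) \, d\tau = \int \nabla_{\theta} p_{\theta}(\tau) \, R(\tau) \, d\tau

which is no longer an expectation under p_θ(τ), so it cannot be estimated by simply sampling trajectories; rewriting ∇_θ p_θ(τ) as p_θ(τ) ∇_θ log p_θ(τ), which is the log derivative trick of the title, is what restores that form.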
