[ Archived Post ] Deep RL Bootcamp Lecture 5: Natural Policy Gradients, TRPO, PPO
Please note that this post is for my own educational purpose.
Today we are going to talk about advanced optimization methods built on the policy gradient. (It seems like PG is a great method.)
What we covered last time was the vanilla PG method, and we can upgrade it. (Here we are going to change the optimization algorithm.)
Things we need to improve on → 1) It is pretty hard to choose a good step size for the optimization. (In RL the data the NN receives is non-stationary, meaning that depending on how we optimize, the input data changes → the statistics of the input change over time, so setting one good step size is really hard → and a bad step size is much more damaging, since the damaged policy then collects the next batch of data.)
2) Sample efficiency → we collect data, compute a single gradient from it, and then throw the data away, which is a waste. (There is more juice to extract.) (Also, the gradients are not scaled properly.)
Modern ML → numerical optimization. Learning is about making predictions on data we have not seen before. (PG → great, since we are directly optimizing what we want, the performance of the policy, but we are not using all of the data.) → this lecture is about getting more information out of the data.
What loss should we use? → instead of writing the surrogate loss with a log probability, we can write it as a probability ratio against the old policy, and if we differentiate that we get the same gradient (at the current parameters).
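As a rough sketch (my own PyTorch toy example, not the lecture's code), at θ = θ_old the log-probability surrogate and the ratio surrogate give the same gradient:

```python
import torch

# Hypothetical per-sample data: log-probs under the current policy,
# detached log-probs under the data-collecting (old) policy, and advantages.
logp = torch.tensor([-1.2, -0.7, -2.1], requires_grad=True)
logp_old = logp.detach().clone()          # evaluated at theta = theta_old
advantages = torch.tensor([0.5, -1.0, 2.0])

# Vanilla PG surrogate: E[log pi(a|s) * A]
loss_logprob = -(logp * advantages).mean()

# Ratio (importance-sampling) surrogate: E[pi(a|s)/pi_old(a|s) * A]
ratio = torch.exp(logp - logp_old)
loss_ratio = -(ratio * advantages).mean()

# At theta = theta_old both surrogates have the same gradient.
g1, = torch.autograd.grad(loss_logprob, logp, retain_graph=True)
g2, = torch.autograd.grad(loss_ratio, logp)
print(torch.allclose(g1, g2))  # True
```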
That form also has an interesting meaning related to importance sampling. (Importance sampling → what would the expected value be under one distribution, even though I collected samples from another distribution?)
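A toy importance-sampling example (my own, with made-up distributions) to illustrate the idea: estimate an expectation under one Gaussian while only sampling from another, by reweighting each sample with the density ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_p[x^2] for p = N(1, 1) using samples drawn from q = N(0, 1).
f = lambda x: x ** 2
x = rng.normal(0.0, 1.0, size=100_000)        # samples from q, not p

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

weights = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 1.0)   # p(x) / q(x)
estimate = np.mean(weights * f(x))            # importance-weighted estimate
print(estimate)                               # ~2.0, since E_p[x^2] = mu^2 + sigma^2
```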
To prevent too big of an update → I have a function that I want to optimize, and I have a local approximation of it that is accurate only within a limited region; outside that region the values get inaccurate. So it is better to stay within the trust region. (This is a very good way to optimize safely.) (So this seems like some sort of regularization method.)
But we are going to add the KL divergence between the old and new policy distributions. (The beta parameter can be used as the penalty coefficient → there are also attempts to use the Wasserstein distance and more.)
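A minimal sketch of such a KL-penalized surrogate (the function name and the crude sample-based KL estimate are my own choices, not the lecture's):

```python
import torch

def kl_penalized_surrogate(logp, logp_old, advantages, beta):
    """Ratio surrogate with a KL penalty (sketch; to be maximized).

    logp / logp_old: log-probabilities of the sampled actions under the new
    and the data-collecting policy; advantages: estimated A(s, a).
    beta: penalty coefficient on the KL term.
    """
    ratio = torch.exp(logp - logp_old)
    surrogate = (ratio * advantages).mean()
    # Crude sample-based estimate of KL[pi_old || pi_theta] on the visited data.
    kl_estimate = (logp_old - logp).mean()
    return surrogate - beta * kl_estimate
```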
There is much more theory behind this regularization method → if we use the max KL instead of the mean KL, we get a lower bound on the true policy performance. (So maximizing this surrogate objective is guaranteed to improve the overall policy.)
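For reference, the bound from the TRPO paper (Schulman et al.) has roughly this form, where η is the true policy performance, L_π is the local surrogate, π is the current policy and π̃ the proposed one:

```latex
\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) - C \, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
\qquad C = \frac{4 \epsilon \gamma}{(1-\gamma)^{2}}, \quad
\epsilon = \max_{s,a} \lvert A_{\pi}(s,a) \rvert
```

So any update that increases the right-hand side cannot decrease the true performance.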
Putting this all together → gathering the samples and acting on them is the same as before, but now there is a KL constraint in the optimization problem. (It is solved using conjugate gradient.)
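Written out, the per-iteration problem is approximately the following (the standard TRPO formulation, not copied from the lecture slides):

```latex
\max_{\theta}\;
\mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}
\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A(s,a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\mathrm{old}}}}
\!\left[ D_{\mathrm{KL}}\!\bigl(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\bigr) \right] \le \delta
```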
Since we are optimizing a non-linear function → rather than solving it directly, we make a local approximation and solve it iteratively. (Natural policy gradient.) (Fisher information matrix → tells us how sensitive the probability distribution is to different directions in parameter space → if I move in this direction, how will the probability distribution change?) (Conjugate gradient → a method for solving linear systems of equations.)
Quite a complicated method overall → since it relies on conjugate gradient as well as the Fisher information matrix.
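A rough sketch of the conjugate-gradient piece (my own minimal NumPy version, with a toy 2×2 matrix standing in for the Fisher matrix): given only Fisher-vector products, it finds the natural-gradient direction F⁻¹g without ever forming or inverting F.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Solve A x = b given only the matrix-vector product x -> A x.

    In the natural-gradient setting, matvec would return F x, the product of
    the Fisher information matrix with a vector, so F never has to be built.
    """
    x = np.zeros_like(b)
    r = b.copy()                 # residual b - A x (x starts at zero)
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy check (not the lecture's code): a small symmetric positive-definite "Fisher".
F = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])                        # stand-in for the policy gradient
step_direction = conjugate_gradient(lambda v: F @ v, g)
print(np.allclose(F @ step_direction, g))       # True: F^{-1} g found without inverting F
```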
The optimization method also seems to come in a different form, a KL-penalty version (the KL term sits in the objective instead of being a hard constraint).
The lecture also covers how TRPO connects to other approaches to the RL optimization problem. But TRPO has its limits → it is hard to use with an architecture that has multiple outputs. (Deep CNNs and RNNs are not really compatible with it for some reason.)
Something similar to TRPO → KFAC, but the details are skipped.
There seem to be a couple of variations on the original KL-penalty optimization method.
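One well-known variation is PPO's clipped surrogate (PPO is in the lecture title; the function below is my own sketch, not the lecture's code), which drops the explicit KL term and instead clips the probability ratio:

```python
import torch

def ppo_clipped_loss(logp, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate loss (to be minimized).

    The probability ratio is clipped to [1 - clip_eps, 1 + clip_eps] so that
    a single update cannot move the policy too far from the data-collecting one.
    """
    ratio = torch.exp(logp - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```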
Reference
- Deep RL Bootcamp Lecture 5: Natural Policy Gradients, TRPO, PPO. (2018). YouTube. Retrieved 30 December 2018, from https://www.youtube.com/watch?v=xvRrgxcpaHY&t=5s