Summary: Conservative Policy Iteration

Zac Wellmer
Published in Arxiv Bytes · 1 min read · May 7, 2019

Conservative Policy Iteration has three goals: (1) provide an iterative procedure that is guaranteed to improve a performance metric, (2) terminate in a “small” number of steps, and (3) find an “approximately” optimal policy. Meeting these goals relies on two assumptions: policies are updated as mixture policies (soft updates), and advantage estimates are accurate.

π_new(a|s) = (1 - α) π_old(a|s) + α π’(a|s)

Mixture policy, where α ∈ [0, 1] is the mixing weight

In the above, π’ denotes an updated version of π_old. Furthermore, the lower bound shown below, which describes monotonic policy improvement, assumes access to an accurate advantage estimator that can be evaluated over the entire state space.
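As a minimal sketch of the soft update, assuming tabular policies stored as nested lists of action probabilities, the mixture blends each state's action distribution with a weight α. The function name mixture_policy and the example numbers are illustrative and do not come from the paper.

```python
def mixture_policy(pi_old, pi_prime, alpha):
    """Return (1 - alpha) * pi_old + alpha * pi_prime for tabular policies.

    pi_old, pi_prime: lists of rows, one row of action probabilities per state.
    alpha: mixing weight in [0, 1]; small values keep the update conservative.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [(1.0 - alpha) * p_old + alpha * p_prime
         for p_old, p_prime in zip(row_old, row_prime)]
        for row_old, row_prime in zip(pi_old, pi_prime)
    ]


# Hypothetical 2-state, 2-action example.
pi_old = [[0.5, 0.5], [0.9, 0.1]]
pi_prime = [[1.0, 0.0], [0.0, 1.0]]  # made-up candidate policy
pi_new = mixture_policy(pi_old, pi_prime, alpha=0.2)
print(pi_new)
```

Because both inputs are valid distributions and α lies in [0, 1], every row of the result still sums to 1, and a small α keeps the new policy close to π_old.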

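η(π_new) ≥ L_π_old(π_new) - (2εγ / (1 - γ)^2) α^2, where ε = max_s |E_{a∼π’(a|s)}[A_π_old(s, a)]|

Simplified lower bound for Conservative Policy Iteration, as restated in the TRPO paper. Here η is the expected discounted return, L_π_old is the local approximation of η around π_old, γ is the discount factor, and α is the mixture weight.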
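To make the bound concrete, here is a small numerical check on a made-up tabular MDP, where the advantage can be computed exactly over every state. It is a sketch under stated assumptions, not the paper's algorithm: the MDP, the candidate policy pi_prime, and all function names are hypothetical, and the penalty term follows the simplified form quoted in the TRPO paper.

```python
GAMMA = 0.9  # discount factor

# Hypothetical 2-state, 2-action MDP (all numbers are made up for illustration).
# P[s][a][s2] is the transition probability, R[s][a] the reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 0.0],
     [0.0, 1.0]]
MU0 = [1.0, 0.0]  # start-state distribution
N_S, N_A = 2, 2


def evaluate(pi, iters=2000):
    """Iterative policy evaluation: returns state values V and action values Q."""
    v = [0.0] * N_S
    for _ in range(iters):
        q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_S))
              for a in range(N_A)] for s in range(N_S)]
        v = [sum(pi[s][a] * q[s][a] for a in range(N_A)) for s in range(N_S)]
    return v, q


def expected_return(pi):
    """eta(pi): expected discounted return from the start distribution."""
    v, _ = evaluate(pi)
    return sum(MU0[s] * v[s] for s in range(N_S))


def visitation(pi, iters=2000):
    """Unnormalized discounted visitation rho(s) = sum_t gamma^t Pr(s_t = s)."""
    dist, rho, discount = MU0[:], [0.0] * N_S, 1.0
    for _ in range(iters):
        rho = [rho[s] + discount * dist[s] for s in range(N_S)]
        dist = [sum(dist[s] * pi[s][a] * P[s][a][s2]
                    for s in range(N_S) for a in range(N_A))
                for s2 in range(N_S)]
        discount *= GAMMA
    return rho


def cpi_lower_bound(pi_old, pi_prime, alpha):
    """Return (eta(pi_new), lower bound) for the mixture with weight alpha."""
    v, q = evaluate(pi_old)
    adv = [[q[s][a] - v[s] for a in range(N_A)] for s in range(N_S)]
    rho = visitation(pi_old)
    pi_new = [[(1 - alpha) * pi_old[s][a] + alpha * pi_prime[s][a]
               for a in range(N_A)] for s in range(N_S)]
    # Surrogate L(pi_new) = eta(pi_old) + sum_s rho(s) sum_a pi_new(a|s) A(s, a)
    surrogate = expected_return(pi_old) + sum(
        rho[s] * pi_new[s][a] * adv[s][a]
        for s in range(N_S) for a in range(N_A))
    # epsilon = max_s |E_{a ~ pi'}[A(s, a)]|, an exact max over all states
    eps = max(abs(sum(pi_prime[s][a] * adv[s][a] for a in range(N_A)))
              for s in range(N_S))
    penalty = 2 * eps * GAMMA / (1 - GAMMA) ** 2 * alpha ** 2
    return expected_return(pi_new), surrogate - penalty


pi_old = [[0.5, 0.5], [0.5, 0.5]]
pi_prime = [[0.0, 1.0], [0.0, 1.0]]  # hypothetical candidate: always action 1
results = {alpha: cpi_lower_bound(pi_old, pi_prime, alpha)
           for alpha in (0.0, 0.05, 0.1, 0.5, 1.0)}
for alpha, (true_return, bound) in results.items():
    print(f"alpha={alpha:.2f}  eta(pi_new)={true_return:.4f}  bound={bound:.4f}")
```

The bound coincides with the true return at α = 0, and the α² penalty grows from there, which is why the update has to stay conservative. Note that eps is an exact maximum over just two states here; that is the step that does not carry over to large or continuous state spaces.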

This work lays a key foundation for TRPO. However, it has two shortcomings: computing ε, which takes a maximum over all states, would be very difficult in large or continuous state spaces, and the guarantee is restricted to mixture policies.
