Trust Region Policy Optimization Explained

Henry Wu
8 min read · Mar 9, 2024


This post explains Trust Region Policy Optimization (TRPO) and shows how it addresses common issues in vanilla policy gradient methods.

Before exploring TRPO, make sure you’re comfortable with reinforcement learning basics. This post will mainly follow the original paper [1].


Policy Gradient

Policy gradient methods directly optimize a parametrized policy with respect to the expected return: the policy parameters are learned by gradient ascent along the policy gradient.

Most policy gradient methods follow a similar loop: sample trajectories with the current policy, estimate the returns (or advantages), compute the policy gradient, and take a gradient ascent step on the policy parameters.
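
As a rough sketch (not taken from the paper), one iteration of this loop might look as follows in PyTorch. The names `Policy`, `policy_gradient_step`, and the `trajectories` list of (states, actions, rewards) tuples are illustrative, assuming a discrete action space:

```python
# Minimal REINFORCE-style policy gradient sketch (illustration only).
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        # Categorical distribution over discrete actions.
        return torch.distributions.Categorical(logits=self.net(obs))

def policy_gradient_step(policy, optimizer, trajectories, gamma=0.99):
    """One vanilla policy-gradient update from sampled trajectories."""
    loss = 0.0
    for states, actions, rewards in trajectories:
        # Discounted return-to-go for every timestep of the trajectory.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.tensor(returns, dtype=torch.float32)
        dist = policy(torch.as_tensor(states, dtype=torch.float32))
        logp = dist.log_prob(torch.as_tensor(actions))
        # Gradient ascent on E[log pi(a|s) * R]  ==  descent on its negative.
        loss = loss - (logp * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```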

Problems with Vanilla Policy Gradient Methods

In policy gradient methods, the basic principle is to use gradient ascent to step in the direction of maximum expected return. However, a bad step can lead to suboptimal actions, which land the agent in bad states; those states yield poor sample data, which in turn yield poor policy updates. The cycle continues and can collapse the policy's performance, ruining the training process. This differs from standard supervised learning because the training data are non-stationary: they are generated by the currently running policy.

Trust Region Policy Optimization (TRPO) addresses this issue by taking the largest step that improves policy performance while satisfying a constraint on how far the new policy is allowed to move from the old one. For the constraint, the paper uses the Kullback-Leibler (KL) divergence, which measures how one probability distribution, p, differs from a second, reference probability distribution, q:

KL Divergence
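
For discrete distributions, it is defined as:

$$D_{KL}(p \,\|\, q) = \sum_{x} p(x)\,\log \frac{p(x)}{q(x)}$$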

What are we trying to optimize?

We want to optimize πœ‚(πœ‹), the expected discounted return of policy πœ‹:
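
Following [1], where ρ_0 is the initial state distribution, P the transition dynamics, r the reward function, and γ the discount factor:

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t)\Big], \qquad s_0 \sim \rho_0,\; a_t \sim \pi(a_t \mid s_t),\; s_{t+1} \sim P(s_{t+1} \mid s_t, a_t).$$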

The following useful identity expresses the expected return of another policy Ο€Λœ in terms of the advantage over Ο€, accumulated over timesteps. It allows us to relate the new policy Ο€Λœ with the old policy Ο€:
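
Written out, following [1] (with A_π(s, a) = Q_π(s, a) − V_π(s) the advantage function):

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, A_{\pi}(s_t, a_t)\Big]$$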

See [1], Appendix A for the proof.

The paper then rewrites the above identity as a sum over states, where p_πœ‹ denotes the discounted state visitation frequencies:
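
Written out (this is Equation (2), referenced below), together with the definition of p_πœ‹:

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_{s} p_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a)$$

$$p_{\pi}(s) = P(s_0 = s) + \gamma\, P(s_1 = s) + \gamma^{2} P(s_2 = s) + \cdots$$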

Equation (2) implies that any policy update πœ‹ β†’ πœ‹~ with a nonnegative expected advantage at every state s, i.e., Ξ£_a πœ‹~(a|s) A_πœ‹(s, a) β‰₯ 0, is guaranteed to increase the policy performance πœ‚, or to leave it constant if the expected advantage is zero everywhere.

The paper then introduces the following local approximation to πœ‚, L_Ο€:
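
Written out, following [1]:

$$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} p_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a)$$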

The difference between L_Ο€ and πœ‚ is the use of p_πœ‹ instead of p_πœ‹~. The state visitation frequency under the new policy, p_πœ‹~, depends on πœ‹~ in a complex way that is difficult to optimize over directly. The local approximation L_Ο€ therefore uses p_πœ‹ instead, ignoring the change in state visitation density caused by the change in policy.

We will next show how the local approximation of πœ‚(πœ‹~) is accurate within a trust region and gives a monotonic improvement guarantee.

Monotonic Improvement Guarantee for General Stochastic Policies

In 2002, Kakade & Langford [3] proposed a policy updating scheme called conservative policy iteration, for which they could provide explicit lower bounds on the improvement of Ξ·. The bound is restrictive in practice and only applies to mixture policies.

The paper modifies the proposed bound, making it slightly weaker but simpler, so that it applies to general stochastic policies. The modified lower bound is stated below.
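
Following Theorem 1 in [1], with C = 4Ργ/(1βˆ’Ξ³)Β² and Ξ΅ = max_{s,a} |A_πœ‹(s, a)|:

$$\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) \;-\; C\, D_{KL}^{\max}(\pi, \tilde{\pi})$$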

See [1], Appendix A for the proof.

where D_KL^max is the maximum, over states, of the KL divergence between the old and new policies:
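
Concretely:

$$D_{KL}^{\max}(\pi, \tilde{\pi}) = \max_{s}\; D_{KL}\big(\pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s)\big)$$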

We will next show how an approximate policy iteration scheme based on the lower bound is guaranteed to generate a monotonically improving sequence of policies:

Approximate Policy Iteration Scheme

Algorithm 1 is an approximate policy iteration scheme based on the lower bound:
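
Roughly, the scheme iterates two steps until convergence (see Algorithm 1 in [1] for the exact statement): compute all advantage values A_{πœ‹_i}(s, a) for the current policy πœ‹_i, and then update the policy by solving the penalized problem

$$\pi_{i+1} = \arg\max_{\pi}\; \big[\, L_{\pi_i}(\pi) - C\, D_{KL}^{\max}(\pi_i, \pi) \,\big].$$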

The bound implies that a policy update that improves the right-hand side is guaranteed to improve the true performance Ξ·.

Proof:
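
With M_i(πœ‹) = L_{πœ‹_i}(πœ‹) βˆ’ C Β· D_KL^max(πœ‹_i, πœ‹), the lower bound gives πœ‚(πœ‹_{i+1}) β‰₯ M_i(πœ‹_{i+1}), and πœ‚(πœ‹_i) = M_i(πœ‹_i) since the KL term vanishes at πœ‹ = πœ‹_i. Subtracting the two:

$$\eta(\pi_{i+1}) - \eta(\pi_i) \;\ge\; M_i(\pi_{i+1}) - M_i(\pi_i).$$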

By maximizing M_i at each iteration, we guarantee that the true objective Ξ· is non-decreasing.

What we have discussed so far is the theoretical foundation. Trust Region Policy Optimization (TRPO) is a practical algorithm that approximates Algorithm 1 and works with a finite number of samples and arbitrary policy parameterizations.

For the remaining sections, we switch notation from πœ‹ to ΞΈ, since we are now working with parameterized policies πœ‹_ΞΈ. This changes the maximization to:
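
That is, at each iteration we would solve the penalized problem:

$$\underset{\theta}{\text{maximize}}\;\; \big[\, L_{\theta_{\text{old}}}(\theta) - C\, D_{KL}^{\max}(\theta_{\text{old}}, \theta) \,\big]$$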

Trust Region Policy Optimization (TRPO)

TRPO approximates Algorithm 1 by using a constraint on the KL divergence rather than a penalty, which allows larger updates.

In practice, the penalty coefficient C recommended by the theory would result in very small step sizes. One way to take larger steps is to use a constraint on the KL divergence between the new policy and the old policy (a trust region constraint), where 𝛿 is a hyperparameter:
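
The resulting trust region problem is:

$$\underset{\theta}{\text{maximize}}\;\; L_{\theta_{\text{old}}}(\theta) \quad \text{subject to} \quad D_{KL}^{\max}(\theta_{\text{old}}, \theta) \le \delta$$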

This optimization problem constrains the KL divergence at every point in the state space, which is impractical to solve due to the enormous number of constraints. Instead, we can use a heuristic approximation that considers the average KL divergence, which gives us the following average KL divergence constraint optimization problem:
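
With the average KL divergence defined as an expectation over states visited under the old policy, the problem becomes:

$$\overline{D}_{KL}^{\,p}(\theta_1, \theta_2) := \mathbb{E}_{s \sim p}\Big[ D_{KL}\big(\pi_{\theta_1}(\cdot \mid s)\,\|\,\pi_{\theta_2}(\cdot \mid s)\big) \Big]$$

$$\underset{\theta}{\text{maximize}}\;\; L_{\theta_{\text{old}}}(\theta) \quad \text{subject to} \quad \overline{D}_{KL}^{\,p_{\theta_{\text{old}}}}(\theta_{\text{old}}, \theta) \le \delta$$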

The paper's experiments show that the average KL divergence constraint update has empirical performance similar to that of the maximum KL divergence constraint.

We will next show how the average KL divergence constraint optimization problem can be expressed in terms of expectation.

We get the following constraint optimization problem after expanding L:
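
Expanding L_ΞΈ_old and dropping the constant πœ‚(ΞΈ_old), which does not depend on ΞΈ, this is:

$$\underset{\theta}{\text{maximize}}\;\; \sum_{s} p_{\theta_{\text{old}}}(s) \sum_{a} \pi_{\theta}(a \mid s)\, A_{\theta_{\text{old}}}(s, a) \quad \text{subject to} \quad \overline{D}_{KL}^{\,p_{\theta_{\text{old}}}}(\theta_{\text{old}}, \theta) \le \delta$$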

  • We replace the sum over states in the objective with an expectation over s ∼ p_ΞΈ_old.
  • We replace the advantage values A_ΞΈ_old with the Q-values Q_ΞΈ_old, which only changes the objective by a constant.
  • We replace the sum over actions with an importance sampling estimator. Using q to denote the sampling distribution, the contribution of a single sampled state s_n to the objective becomes an expectation over a ∼ q.

We finally arrive at the constraint optimization problem in terms of expectation, which is what the paper proposed to generate a policy update:
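
Up to notation, this matches the final sample-based problem in [1]:

$$\underset{\theta}{\text{maximize}}\;\; \mathbb{E}_{s \sim p_{\theta_{\text{old}}},\; a \sim q}\Big[ \frac{\pi_{\theta}(a \mid s)}{q(a \mid s)}\, Q_{\theta_{\text{old}}}(s, a) \Big] \quad \text{subject to} \quad \mathbb{E}_{s \sim p_{\theta_{\text{old}}}}\Big[ D_{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\,\pi_{\theta}(\cdot \mid s)\big) \Big] \le \delta$$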

We will next show how we can approximately solve this constraint optimization problem.

Approximately Solve the Constraint Optimization Problem

We want to approximately solve the constrained optimization problem derived above: maximize the surrogate objective subject to the average KL divergence constraint.

The objective function L is approximated by a first-order Taylor (linear) expansion around the current parameters, with gradient g.

The KL constraint is approximated by a second-order Taylor (quadratic) expansion, where H is the Hessian of the KL divergence at ΞΈ_old.

We now have the following approximate constraint optimization problem:
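
Writing ΞΈ_old for the current parameters, the approximate problem reads:

$$\underset{\theta}{\text{maximize}}\;\; g^{\top}(\theta - \theta_{\text{old}}) \quad \text{subject to} \quad \tfrac{1}{2}\,(\theta - \theta_{\text{old}})^{\top} H\, (\theta - \theta_{\text{old}}) \le \delta$$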

To approximately solve this constrained optimization problem, we first compute a search direction, and then rescale it to satisfy the KL divergence constraint (details in [1] Appendix C).

Search direction: The search direction s is computed by approximately solving Hs = g, where g is the gradient of the objective. Forming the full matrix H is costly, so we use the conjugate gradient algorithm to solve the system approximately, using only a Hessian-vector product function and never forming H explicitly.

Rescale: Compute the maximum step length 𝛽 such that πœƒ + 𝛽s satisfies the KL divergence constraint.

Parameter update: Update the policy parameters with a backtracking line search. Starting from the maximal step, the step length is iteratively scaled down by a factor of 𝛼 (the backtracking coefficient), so the candidate update is πœƒ + 𝛼^j 𝛽 s, where j is the smallest nonnegative integer for which the step improves the surrogate objective while satisfying the KL divergence constraint.
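
Putting the three steps together, the update is:

$$s \approx H^{-1} g \;\;(\text{via conjugate gradient}), \qquad \beta = \sqrt{\frac{2\delta}{s^{\top} H s}}, \qquad \theta_{\text{new}} = \theta_{\text{old}} + \alpha^{j}\, \beta\, s.$$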

TRPO Algorithm
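
Below is a simplified NumPy sketch of the update described above. It is an illustration, not the paper's reference implementation: `hvp(v)` (Hessian-vector product), `surrogate(theta)` (sample estimate of the objective L), and `mean_kl(theta)` are assumed to be callables supplied by the user, e.g., built with an automatic differentiation library.

```python
# Hypothetical TRPO parameter update: conjugate gradient + backtracking line search.
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products."""
    x = np.zeros_like(g)
    r = g.copy()   # residual g - Hx (x starts at 0)
    p = g.copy()   # current search direction
    rs_old = r.dot(r)
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p.dot(Hp) + 1e-12)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trpo_step(theta_old, g, hvp, surrogate, mean_kl,
              delta=0.01, backtrack_coeff=0.8, max_backtracks=10):
    # 1) Search direction: s ~= H^{-1} g via conjugate gradient.
    s = conjugate_gradient(hvp, g)
    # 2) Maximal step length satisfying the quadratic KL constraint.
    beta = np.sqrt(2.0 * delta / (s.dot(hvp(s)) + 1e-12))
    # 3) Backtracking line search: shrink the step until the surrogate
    #    objective improves and the exact KL constraint is satisfied.
    old_obj = surrogate(theta_old)
    for j in range(max_backtracks):
        theta_new = theta_old + (backtrack_coeff ** j) * beta * s
        if surrogate(theta_new) > old_obj and mean_kl(theta_new) <= delta:
            return theta_new
    return theta_old  # no acceptable step found; keep the old parameters
```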

Now that we understand how TRPO works, we will briefly talk about two sampling schemes the paper uses to approximate the objective and constraint functions using Monte Carlo simulation.

Sample-Based Estimation of the Objective and Constraint

The paper presents two sampling schemes:

Figure 1: The single path (left) and vine (right) sampling schemes.
  • Single Path: It collects a sequence of states by sampling individual trajectories.
  • Vine: It first generates a number of trajectories, and then chooses a subset to form the rollout set. For each state in this rollout set, multiple actions are then explored, creating new trajectories from each of these points.

Given the same number of samples in the surrogate objective, the vine scheme has much lower variance and gives much better estimates of the advantage values. However, it requires far more calls to the simulator, since multiple rollouts must be generated from each state in the rollout set, which can be resource-intensive.

References

[1] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust Region Policy Optimization. ICML, 2015.

[3] Sham Kakade and John Langford. Approximately Optimal Approximate Reinforcement Learning. ICML, 2002.