Anindya Sarkar · Feb 13, 2019

Continuous Control Task (“Reacher Environment”) Using Proximal Policy Optimisation (PPO) - Udacity Deep RL Nanodegree Project


Problem Definition:

In this problem, I need to control four continuous-valued parameters of a double-jointed arm so that it moves to a target location accurately.

The input state is 33-dimensional and includes physical parameters of the Reacher environment such as the position, rotation, velocity, and angular velocities of the arm. I needed to design an algorithm that, given this 33-dimensional state at each time step, predicts the 4 control parameters of the arm so that the arm reaches the target location as quickly as possible.
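For concreteness, the interaction loop with the environment looks roughly like the sketch below. It assumes the `unityagents` wrapper and the local Reacher build provided with the Udacity course; the file name and the random-action placeholder are just for illustration.

```python
# Minimal interaction sketch, assuming the Udacity-provided `unityagents`
# wrapper and a local Reacher build (the file name is a placeholder).
from unityagents import UnityEnvironment
import numpy as np

env = UnityEnvironment(file_name="Reacher.app")
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

env_info = env.reset(train_mode=True)[brain_name]
state = env_info.vector_observations[0]          # 33-dimensional observation
action_size = brain.vector_action_space_size     # 4 continuous controls

score = 0.0
while True:
    # Placeholder for the agent: PPO will replace this with actions
    # sampled from the learned policy.
    action = np.clip(np.random.randn(action_size), -1, 1)
    env_info = env.step(action)[brain_name]
    state = env_info.vector_observations[0]
    score += env_info.rewards[0]
    if env_info.local_done[0]:
        break
```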

Solution:

To solve this problem, I used the PPO algorithm.

First, let's review the REINFORCE algorithm and figure out what issues it faces on continuous control tasks.

REINFORCE works as follows. First, we initialize a random policy π_θ(a|s) (here, θ denotes the parameters of the policy), and using this policy we collect a trajectory, i.e. a list of (state, action, reward) tuples at each time step:

s_1, a_1, r_1, s_2, a_2, r_2, …

Second, we compute the total reward of the trajectory, R = r_1 + r_2 + r_3 + …, and compute an estimate of the gradient of the expected reward, g:

g = R ∑_t ∇_θ log π_θ(a_t | s_t)

Third, we update our policy using gradient ascent with learning rate α:

θ ← θ + α g

The process then repeats.
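As a reference point, one REINFORCE update can be sketched in PyTorch as below. The `policy_optimizer`, `log_probs`, and `rewards` names are assumptions for the sketch, not the exact code from my implementation; the point is that one full trajectory produces exactly one gradient step.

```python
import torch

def reinforce_update(policy_optimizer, log_probs, rewards):
    """One REINFORCE step: g = R * sum_t grad_theta log pi_theta(a_t | s_t).

    `log_probs` is the list of log pi_theta(a_t|s_t) tensors recorded while
    running one trajectory with the current policy; `rewards` is the list of r_t.
    """
    R = sum(rewards)                          # total trajectory reward, R = r_1 + r_2 + ...
    loss = -R * torch.stack(log_probs).sum()  # negate: optimizers minimize, we want ascent
    policy_optimizer.zero_grad()
    loss.backward()                           # gradients flow into the policy parameters
    policy_optimizer.step()                   # theta <- theta + alpha * g
```

After this single update, the trajectory is thrown away, which is exactly the inefficiency described in issue 1 below.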

But the main issues with REINFORCE are as follows:

1. The update process is very inefficient! We run the policy once, update once, and then throw away the trajectory.

2. Here, we are updating the parameters of the policy. But the policy itself is a very complicated function of its parameters. As a result, it may happen that when we update θ in parameter space by a small amount using the gradient update rule, the change in policy space is large. The following figure illustrates this scenario:

We also have no control over the size of the policy update in policy space. This can lead to a very bad situation where the policy gets stuck in a region of policy space where the expected reward is very low and the gradient with respect to θ is zero. The following figure illustrates this behaviour.

To address these issues, we introduce PPO. Now, let's see how PPO solves them:

This is where importance sampling comes in. Consider the trajectories we generated using the policy π_θ. Each trajectory τ had a probability P(τ;θ) of being sampled.

Now, just by chance, the same trajectory can be sampled under the new policy, with a different probability P(τ;θ′).

Imagine we want to compute the average of some quantity, say f(τ). We could simply generate trajectories from the new policy, compute f(τ) and average them.

Mathematically, this is equivalent to adding up all the f(τ), weighted by the probability of sampling each trajectory under the new policy: ∑_τ P(τ;θ′) f(τ).

Now we can modify this expression by multiplying and dividing by the same number, P(τ;θ), and rearranging the terms: ∑_τ P(τ;θ) · (P(τ;θ′)/P(τ;θ)) · f(τ).

It doesn't look like we've done much. But written this way, we can reinterpret the first factor as the probability of sampling under the old policy, with an extra re-weighting factor P(τ;θ′)/P(τ;θ) applied on top of the plain averaging.

Intuitively, this tells us we can use old trajectories for computing averages for new policy, as long as we add this extra re-weighting factor, that takes into account how under or over–represented each trajectory is under the new policy compared to the old one.
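As a small aside, here is a self-contained illustration of this re-weighting idea outside of RL. The three "trajectories", their probabilities, and the f values are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Old and new sampling distributions over three possible "trajectories".
p_old = np.array([0.5, 0.3, 0.2])
p_new = np.array([0.2, 0.3, 0.5])
f = np.array([1.0, 2.0, 3.0])           # some quantity f(tau) per trajectory

# Sample trajectories under the OLD distribution only.
samples = rng.choice(3, size=100_000, p=p_old)

# Re-weight each sample by p_new / p_old to estimate the expectation
# under the NEW distribution.
weights = p_new[samples] / p_old[samples]
estimate = np.mean(weights * f[samples])

print(estimate)                          # close to (p_new * f).sum() = 2.3
```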

Now let's take a closer look at the re-weighting factor.

Because each trajectory contains many steps, the trajectory probability contains a chain of products of the policy probabilities at each time step.
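Written out (this is the standard derivation; the environment dynamics terms p(s_{t+1}|s_t, a_t) appear in both the numerator and denominator and cancel), the re-weighting factor is:

```latex
\frac{P(\tau;\theta')}{P(\tau;\theta)}
  = \frac{\prod_t \pi_{\theta'}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}
         {\prod_t \pi_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}
  = \prod_t \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}
```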

This formula is a bit complicated, but there is a bigger problem. When some of the policy probabilities get close to zero, the re-weighting factor can become close to zero, or worse, close to 1/0, which diverges to infinity.

When this happens, the re-weighting trick becomes unreliable. So, in practice, we want to make sure the re-weighting factor is not too far from 1 when we use importance sampling.

Suppose we are trying to update our current policy, π_θ′. To do that, we need to estimate a gradient, g. But we only have trajectories generated by an older policy π_θ. How do we compute the gradient then?

Mathematically, we can use importance sampling. The answer is just what a normal policy gradient would be, times a re-weighting factor P(τ;θ′)/P(τ;θ):

We can rearrange these equations: the re-weighting factor is just the product of the policy probabilities across all steps (I've picked out the terms at time step t here). We can cancel some terms, but we're still left with a product of the policies at the other time steps, denoted by "…".

Can we simplify this expression further? This is where the "proximal" in proximal policy optimisation comes in. If the old and current policies are close enough to each other, all the factors inside the "…" will be pretty close to 1, and we can ignore them.

Then the equation simplifies:
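In standard notation, with R_t^future denoting the reward collected from time step t onward (my own rendering of the usual derivation):

```latex
g \approx \sum_t
    \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\,
    \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)\, R_t^{\text{future}}
  = \sum_t
    \nabla_{\theta'} \left( \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)} \right) R_t^{\text{future}}
```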

It looks very similar to the old policy gradient. In fact, if the current policy and the old policy are the same, we recover exactly the vanilla policy gradient. But remember, this expression is different, because we are comparing two different policies.

The Surrogate Function

Now that we have the approximate form of the gradient, we can think of it as the gradient of a new object, called the surrogate function:
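In the same notation as above (again, my rendering rather than the original figure):

```latex
L_{\text{sur}}(\theta', \theta) =
  \sum_t \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\, R_t^{\text{future}},
\qquad
g \approx \nabla_{\theta'} L_{\text{sur}}(\theta', \theta)
```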

So, using this new gradient, we can perform gradient ascent to update our policy, which can be thought of as directly maximizing the surrogate function.
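In code, the (un-clipped) surrogate can be sketched as below. The tensor names `new_log_probs`, `old_log_probs`, and `future_rewards` are assumptions for the sketch; they would be collected from trajectories generated by the old policy:

```python
import torch

def surrogate_loss(new_log_probs, old_log_probs, future_rewards):
    """Un-clipped surrogate: L_sur = sum_t ratio_t * R_t^future (negated for a minimizer)."""
    # ratio = pi_theta'(a_t|s_t) / pi_theta(a_t|s_t), computed in log space for stability;
    # the old policy's log-probabilities are treated as constants.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    return -(ratio * future_rewards).mean()
```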

But there is still one important issue we haven’t addressed yet. If we keep reusing old trajectories and updating our policy, at some point the new policy might become different enough from the old one, so that all the approximations we made could become invalid.

The Policy/Reward Cliff

What is the problem with updating our policy while ignoring the fact that the approximations are not valid anymore? One problem is that it could lead to a really bad policy that is very hard to recover from. Let's see how:

Say we have some policy π_θ′ parameterized by θ′ (shown on the left plot in black), with an average reward function (shown on the right plot in black).

The current policy is labelled by the red text, and the goal is to update the current policy to the optimal one (green star). To update the policy, we can compute a surrogate function L_sur (dotted red curve on the right plot). L_sur approximates the reward pretty well around the current policy. But far away from the current policy, it diverges from the actual reward.

If we continually update the policy by performing gradient ascent, we might get something like the red dots. The big problem is that at some point we hit a cliff, where the policy changes by a large amount. From the perspective of the surrogate function, the average reward is really great. But the actual average reward is really bad!

What's worse, the policy is now stuck in a deep and flat bottom, so future updates won't be able to bring the policy back up! We are now stuck with a really bad policy.

How do we fix this? Wouldn't it be great if we could somehow stop the gradient ascent so that our policy doesn't fall off the cliff?

Clipped Surrogate Function

Here's an idea: what if we just flatten the surrogate function (blue curve)? What would the policy update look like then?

So starting with the current policy (blue dot), we apply gradient ascent. The updates remain the same, until we hit the flat plateau. Now because the reward function is flat, the gradient is zero, and the policy update will stop!

Now, keep in mind that we are only showing a 2D figure with one θ′ direction. In most cases, there are thousands of parameters in a policy, and there may be hundreds/thousands of high-dimensional cliffs in many different directions. We need to apply this clipping mathematically so that it will automatically take care of all the cliffs.

Clipped Surrogate Function

Here’s the formula that will automatically flatten our surrogate function to avoid all the cliffs:
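This is the standard PPO clipped objective, in the same notation as before:

```latex
L^{\text{clip}}_{\text{sur}}(\theta', \theta) =
  \sum_t \min\!\left(
    \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}\, R_t^{\text{future}},\;
    \operatorname{clip}\!\left(\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)},\, 1-\epsilon,\, 1+\epsilon\right) R_t^{\text{future}}
  \right)
```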

Now let’s dissect the formula by looking at one specific term in the sum, and set the future reward to 1 to make things easier.

We start with the original surrogate function (red), which involves the ratio π_θ′(a_t | s_t) / π_θ(a_t | s_t). The black dot shows the location where the current policy is the same as the old policy (θ′ = θ).

We want to make sure the two policies are similar, or that the ratio is close to 1. So we choose a small ϵ (typically 0.1 or 0.2), and apply the clip function to force the ratio to be within the interval [1−ϵ, 1+ϵ] (shown in purple).

Now the ratio is clipped in two places. But we only want to clip the top part and not the bottom part. To do that, we compare this clipped ratio to the original one and take the minimum (shown in blue). This ensures the clipped surrogate function is always less than or equal to the original surrogate function, L_sur^clip ≤ L_sur, so the clipped surrogate function gives a more conservative “reward”.
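Putting it together, a minimal clipped-surrogate loss in PyTorch might look like the sketch below. It uses the same naming assumptions as before and is only a sketch, not the exact code from the implementation linked below:

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, future_rewards, epsilon=0.2):
    """PPO clipped surrogate: min(ratio * R, clip(ratio, 1-eps, 1+eps) * R), negated."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    unclipped = ratio * future_rewards
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * future_rewards
    # The element-wise minimum keeps L_sur^clip <= L_sur, i.e. a more
    # conservative "reward" that flattens out once the ratio leaves
    # the [1 - eps, 1 + eps] interval.
    return -torch.min(unclipped, clipped).mean()
```

Minimizing this loss for several epochs on the same batch of old trajectories is what makes PPO more sample-efficient than REINFORCE, while the clipping keeps the new policy from drifting too far from the old one.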

Implementation Link:

Demo of the task by a trained agent: