Improvements

This article suggests some improvements to the previously explained approach: Deep Deterministic Policy Gradient (DDPG).

1. A3C (Asynchronous Advantage Actor-Critic)

[paper | code]

A3C aims to increase the accuracy of DDPG and also supports parallelism. Multiple actor instances work simultaneously on copies of the same environment. The critic learns the value function while multiple actors are trained in parallel and are synced with the global parameters from time to time; hence, A3C naturally supports parallel training. The gradients of each actor instance are accumulated for a fixed number of batches, after which the parameters of the global network are updated using the stored gradients. In this way, the global network's parameters get corrected by a small amount in the direction of each training thread independently.

A3C : Architecture (https://medium.com/emergent-future/)
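As a rough illustration, here is a minimal PyTorch-style sketch of one worker's asynchronous update, assuming a network that returns (action logits, state value) for a batch of states and a shared optimizer over the global network; the function and argument names are illustrative, not taken from the A3C paper.

```python
import torch
import torch.nn.functional as F

def a3c_worker_step(global_net, global_opt, local_net, states, actions, returns):
    # 1. Sync the worker's local copy with the latest global parameters.
    local_net.load_state_dict(global_net.state_dict())

    # 2. Compute the actor-critic loss on this worker's batch.
    logits, values = local_net(states)             # assumed: net returns (action logits, state values)
    advantages = returns - values.squeeze(-1)      # advantage estimate A(s, a)
    log_probs = F.log_softmax(logits, dim=-1)
    chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(chosen_log_probs * advantages.detach()).mean()
    value_loss = advantages.pow(2).mean()
    loss = policy_loss + 0.5 * value_loss

    # 3. Accumulate gradients on the local copy, then push them to the global
    #    network and step; each worker does this independently (asynchronously).
    local_net.zero_grad()
    loss.backward()
    for g_param, l_param in zip(global_net.parameters(), local_net.parameters()):
        g_param.grad = l_param.grad
    global_opt.step()
```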

2. A2C (Advantage Actor-Critic)

[paper | code]

A2C is a synchronous, deterministic version of A3C. In A3C, each agent updates the global parameters independently, so it is possible that the thread-specific agents are playing with different versions of the policy, and the aggregated update would therefore not be optimal. To resolve this inconsistency, A2C introduces a coordinator that manages the update of the global parameters. The coordinator waits for all the parallel actors to finish their work before updating the global parameters, so in the next iteration all parallel actors start from the same policy. The synchronized gradient update keeps the training more cohesive and potentially makes convergence faster.

In addition, it has been observed that A2C utilizes GPUs more efficiently and works better with large batch sizes, while achieving the same or better performance than A3C. The following figure shows the architectural difference between A2C and A3C: A2C has a coordinator that controls the update of each of the actors.

Difference Between A2C and A3C
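A minimal sketch of the coordinator's role, assuming a hypothetical `loss_fn` that computes the actor-critic loss for one worker's batch (not from the A2C paper); the point is that no parameter update happens until every actor's batch has arrived, so all actors resume from the same policy.

```python
import torch

def a2c_coordinator_step(global_net, global_opt, worker_batches, loss_fn):
    # The coordinator waits until every parallel actor has delivered its batch,
    # then performs one synchronized update of the global parameters.
    global_opt.zero_grad()
    losses = [loss_fn(global_net, batch) for batch in worker_batches]
    torch.stack(losses).mean().backward()
    global_opt.step()
    # All actors now continue from the same, freshly updated policy.
```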

3. PER (Prioritized Experience Replay)

[paper]

PER was introduced to overcome the problems faced with the normal approach of experience replay. Prioritized Experience Replay prioritizes the tuples stored in the replay memory so that the agent is trained mostly on the experience that is most useful for improving the policy. PER shows a significant improvement in accuracy as well as loss.
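A toy sketch of proportional prioritization, where the sampling probability is derived from the TD error; real implementations use a sum-tree for efficiency, and `alpha` controls how strongly sampling favours high-error transitions.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Toy proportional-priority replay buffer (sketch only)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.transitions, self.priorities = [], []

    def add(self, transition, td_error):
        # "Surprising" transitions (large TD error) get higher priority
        # and are therefore replayed more often.
        priority = (abs(td_error) + 1e-6) ** self.alpha
        if len(self.transitions) >= self.capacity:
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        probs = np.array(self.priorities) / sum(self.priorities)
        idx = np.random.choice(len(self.transitions), batch_size, p=probs)
        return [self.transitions[i] for i in idx], idx
```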

4. D4PG (Distributed Distributional DDPG)

[paper]

D4PG tries to improve the accuracy of DDPG with the help of a distributional approach. A softmax function is used to prioritize the experiences and provide them to the actor. The following are some of the improvements that D4PG makes over DDPG.

→ Distributional Critic: the critic estimates the return as a distribution rather than a single expected value.

→ N-Step Returns: when calculating the TD error, D4PG computes an N-step TD target rather than a one-step one, so that rewards from more future steps are incorporated (a minimal sketch of this target follows the list below).

→ Multiple Distributed Parallel Actors: D4PG uses K independent actors, collecting experience in parallel environments and feeding the data into a common replay buffer.

→ Prioritized Experience Replay (PER)
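A minimal sketch of the N-step TD target mentioned above; `bootstrap_value` stands in for the critic's estimate at step t+N (in D4PG this would come from the target critic evaluated at the target actor's action) and is an assumption of this sketch.

```python
def n_step_td_target(rewards, bootstrap_value, gamma=0.99):
    """N-step TD target: the next N discounted rewards plus a discounted
    bootstrap from the critic at step t+N."""
    target = 0.0
    for i, reward in enumerate(rewards):   # rewards r_t, ..., r_{t+N-1}
        target += (gamma ** i) * reward
    return target + (gamma ** len(rewards)) * bootstrap_value

# Example: a 3-step target with rewards [1, 0, 2] and bootstrap value 5 gives
# y = 1 + 0.99 * 0 + 0.99**2 * 2 + 0.99**3 * 5
```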

5. PPO (Proximal Policy Optimization)

[paper]

Given that TRPO is relatively complex, the aim of Proximal Policy Optimization (PPO) is to impose a similar constraint while simplifying the method by using a clipped surrogate objective, retaining similar performance.

First, let’s denote the probability ratio between the new and old policies as r(θ) = πθ(a|s) / πθ_old(a|s).

PPO
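A minimal sketch of the clipped surrogate objective from the PPO paper, written as a loss to minimize; the inputs are the log-probabilities of the taken actions under the new and old policies and the advantage estimates Â.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # L^CLIP = E[ min( r(θ) * Â, clip(r(θ), 1 - ε, 1 + ε) * Â ) ]
    ratio = torch.exp(new_log_probs - old_log_probs)   # r(θ) = πθ(a|s) / πθ_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negate because the objective is maximized but optimizers minimize.
    return -torch.min(unclipped, clipped).mean()
```

Clipping the ratio to [1 - ε, 1 + ε] removes the incentive to move the new policy too far from the old one in a single update.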

6. SAC (Soft Actor-Critic)

[paper]

Soft Actor-Critic (SAC) incorporates an entropy measure of the policy into the reward to increase exploration. We expect the model to learn a policy that acts as randomly as possible while still being able to succeed at the given task. It follows the maximum entropy reinforcement learning framework.

SAC aims to learn the following three functions:

  1. Policy (πθ) : parameterized by θ
  2. Soft-Q function (Qw) : parameterized by w
  3. Soft state value function (Vψ) : parameterized by ψ

SAC Algorithm
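A minimal sketch of the targets used to train these functions, assuming the common formulation with two Q-networks (whose minimum is taken) and a fixed temperature α; the names and default values are illustrative, not taken verbatim from the SAC paper.

```python
import torch

def soft_value_target(q1, q2, log_prob, alpha=0.2):
    # Target for the soft state value function Vψ:
    # V(s) ≈ min(Q1, Q2) - α * log πθ(a|s), with a sampled from the current policy.
    return torch.min(q1, q2) - alpha * log_prob

def soft_q_target(reward, done, next_value, gamma=0.99):
    # Target for the soft Q function Qw: r + γ * V(s') for non-terminal transitions.
    return reward + gamma * (1.0 - done) * next_value
```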

References

  1. https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html
  2. https://medium.com/@jonathan_hui/rl-introduction-to-deep-reinforcement-learning-35c25e04c199
