SOFT ACTOR-CRITIC ALGORITHMS IN DEEP REINFORCEMENT LEARNING

Astarag Mohapatra · Published in Analytics Vidhya · Jul 19, 2021


In the previous articles in this series, we discussed Policy Gradient methods, DDPG, and trust-region methods, along with the shortcomings of each.

  • Policy-gradient (PG) based methods are sample inefficient because they throw away samples after a single gradient update. In complex tasks, discarding samples leads to slow learning and often convergence to a sub-optimal policy.
  • DDPG addresses this with a replay buffer, a data structure that stores transition tuples. We sample a batch of transitions from the replay buffer to compute the critic loss, which is used to optimize the parameters of the critic network. Visiting a state enough times gives a better estimate of its value (much like pulling a bandit's arm repeatedly gives a better estimate of its reward distribution), so the replay buffer improves sample efficiency (see the sketch after this list). But DDPG has brittle convergence: it is sensitive to seed values (the initialization of the networks' weights and biases) and to hyperparameters. This hurts the generalization of our RL agent across tasks, since we have to manually find the seed and hyperparameter values for which the agent achieves maximum reward on each specific task.
  • TRPO and PPO are on-policy algorithms: they use the old policy as a proxy for a replay buffer and update the weights only within a trust region. These techniques stabilize training, but forgoing a replay buffer makes them sample inefficient. Trust-region on-policy…
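To make the replay-buffer idea above concrete, here is a minimal sketch of such a data structure, assuming a simple uniform-sampling buffer. The class and method names are my own for illustration and are not tied to any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that stores transition tuples (s, a, r, s', done)."""

    def __init__(self, capacity=100_000):
        # deque with maxlen evicts the oldest transitions once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # Save one transition observed while interacting with the environment
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniformly sample a mini-batch of past transitions for the critic update
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

Because transitions are reused across many gradient updates instead of being discarded, off-policy methods such as DDPG and SAC can learn from far fewer environment interactions than pure policy-gradient methods.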
