Soft Actor Critic

Jeremiah Masséus

Apr 28, 2022

This post is for the class EEL 6812 Introduction to Neural Networks at the University of South Florida, and is based on the research paper ‘Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor’ by Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine.

Presentation Youtube Link: https://www.youtube.com/watch?v=DcC17SHWVeE

Background & Introduction to Reinforcement Learning

Reinforcement learning is a machine learning paradigm that, in its most basic form, consists of an agent acting in an environment. The agent exists wholly within this environment and can interact with it by taking actions that change the environment. A complete description of the current world and environment is called a state, and a partial description of the current world is called an observation.

Based on the environment, the agent can take different types of actions. The set of actions available to an agent in an environment is called its action space, and action spaces can be discrete or continuous. A discrete action space is essentially a finite number of actions that an agent can take. For example, if we were using reinforcement learning to play the game Pac-Man, the action space would be discrete, since Pac-Man can only move in a few directions. Conversely, if we are using reinforcement learning to determine how to move a robotic arm, the action space is continuous, since the arm could move to any point in 3D space within its range. The agent decides which action within the action space to take based on the current policy, which is essentially a description of how the agent decides which action to take. Policies can be deterministic, meaning a single action is taken in a given state with a 100% probability of occurrence, or stochastic, meaning there is a distribution of possible actions that could be taken, each with its own probability of occurrence.

The agent makes an observation of the environment and takes an action from the action space based on the current policy; the environment changes, and the agent receives a reward, which is an indication of how good or bad a state is.

https://www.freecodecamp.org/news/a-brief-introduction-to-reinforcement-learning-7799af5840db/

The model uses information from the reward to adjust the policy. Over time and additional iterations, the goal is to approach a policy that the agent can follow within its environment to maximize the expected return, which can be thought of as the cumulative reward over a sequence of state-action pairs.
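To make this loop concrete, below is a minimal sketch of the agent-environment cycle, assuming the Gymnasium API and using a random policy as a stand-in for a learned one:

```python
# Minimal agent-environment interaction loop (sketch, Gymnasium API assumed).
import gymnasium as gym

env = gym.make("CartPole-v1")  # simple environment with a discrete action space
obs, info = env.reset()

total_return = 0.0
done = False
while not done:
    # A real agent would sample from its current policy; here we act randomly.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    total_return += reward  # accumulate the reward into the return
    done = terminated or truncated

print("episode return:", total_return)
```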

Model-Free Reinforcement Learning

There are two main forms of Reinforcement Learning: Model-free and model based learning.

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

We will focus on model-free learning, as it is the more popular method and the more relevant one to the Soft Actor-Critic algorithm. Model-free learning essentially has two main approaches when it comes to training agents: policy optimization and Q-learning.

Policy Optimization

Policy optimization focuses on maximizing an objective, usually the expected return, with respect to the parameters of a neural network via gradient ascent. This is an on-policy method, which means that each update only uses data collected from the most recent version of the policy. Through gradient ascent, the probabilities of actions yielding a higher return are increased, while the probabilities of actions yielding a lower return are decreased, until the optimal policy is reached. As the number of iterations increases, the randomness in the action space decreases, since the policy exploits the rewards it has already found while updating. Taking an exploitative approach as opposed to an explorative approach has the advantage of building upon previous success, but the tradeoff is that the algorithm is less incentivized to find an even better solution. Quantitatively, this can cause the policy to become trapped in local optima. Another limitation of policy optimization is its relative fragility: a small change in the policy can lead to drastic changes in performance, so new and old policies are usually kept very close to each other in parameter space, which can lead to inefficient optimization due to small iterative changes.
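As a rough illustration (a generic sketch, not the specific method evaluated in the paper), a vanilla policy-gradient update in PyTorch might look like the following, where `policy`, `states`, `actions`, and `returns` are assumed placeholders:

```python
# Sketch of a vanilla policy-gradient (REINFORCE-style) update in PyTorch.
# `policy` is assumed to be an nn.Module mapping states to action logits.
import torch

def policy_gradient_step(policy, optimizer, states, actions, returns):
    logits = policy(states)                           # shape: (batch, num_actions)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Gradient ascent on expected return = gradient descent on its negative.
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```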

A more advanced version of policy gradient optimization, called Proximal Policy Optimization (PPO), attempts to address this issue of small iterative improvements by determining the largest possible “safe” step that a policy update can take to improve efficiency without degrading performance. PPO is based on a previous algorithm called Trust Region Policy Optimization (TRPO), but has the advantage of a simpler implementation. PPO does not take a direct approach to maximizing performance; instead it maximizes a surrogate function that conservatively estimates how much a policy update will change the expected return. The objective is determined by taking the minimum of the two terms in the equation below:
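L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}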

The first term in the argument is the ratio of the new policy distribution to the old policy distribution, multiplied by the advantage, i.e., how much better a given state-action pair is compared to acting according to the current policy. The second term is a clipped version of the policy ratio, bounded between 1 − epsilon and 1 + epsilon.

Proximal Policy Optimization Algorithms, Schulman et al.

By taking the minimum of the clipped and non-clipped terms, we are able to keep the new policy from straying too far from the old policy. When the advantage is positive or negative, the corresponding action becomes more or less likely to occur, but the clipping puts a limit on how much the objective can increase or decrease, which can be seen in the graph above.

Another policy-optimization-based algorithm is the Asynchronous Advantage Actor-Critic (A3C). Actor-critic networks have a critic element that provides feedback to the actor on the actions taken, but does not directly interact with the environment itself. The critic takes in the reward from the environment and provides the actor with a value based on a value function. In an on-policy setting, this value is the expected return of taking an action a in a given state s and then acting according to the policy. The role of the critic here is to determine how much better (or worse) a given action is compared to the expected reward. The critic optimizes the value function, while the actor optimizes the policy.
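A minimal sketch of a single advantage actor-critic update might look like the following (PyTorch assumed; `actor`, `critic`, and the optimizer are hypothetical placeholders, with `done` passed as 0.0 or 1.0):

```python
# Sketch of a one-step advantage actor-critic update in PyTorch.
# `actor` maps a state to action logits; `critic` maps a state to a scalar value V(s).
import torch

def actor_critic_step(actor, critic, optimizer, state, action, reward,
                      next_state, done, gamma=0.99):
    value = critic(state).squeeze()
    next_value = critic(next_state).squeeze().detach()
    # TD target and advantage: how much better the outcome was than expected.
    target = reward + gamma * next_value * (1.0 - done)
    advantage = (target - value).detach()

    log_prob = torch.log_softmax(actor(state), dim=-1)[action]
    actor_loss = -log_prob * advantage       # make advantageous actions more likely
    critic_loss = (target - value).pow(2)    # regress V(s) toward the TD target

    loss = actor_loss + critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```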

https://medium.com/@shagunm1210/implementing-the-a3c-algorithm-to-train-an-agent-to-play-breakout-c0b5ce3b3405

In the structure seen above, multiple agents interact with multiple environments in parallel. This parallelization is the source of the asynchronous portion of the A3C algorithm; each agent works independently, exploring different parts of the environment. This reduces training time and allows on-policy learning methods to train in a stable manner.

Q Learning

One of the biggest limitations of training with on-policy methods is the sample inefficiency that one encounters. When training in an off-policy manner, optimization is done by learning an approximator for the optimal action-value function. This action-value function can be thought of as a giant lookup table of expected values given an action in a particular state. This is done using the Bellman equation, which allows one to decompose the value function into two parts: the reward for the current state, plus the (discounted) value of the next state. In this way, the Bellman equation allows the value function to be determined in a recursive manner, which is very useful for RL algorithms.
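Written out, the Bellman optimality equation for the action-value function is:

Q^*(s, a) = \mathbb{E}_{s'}\left[\, r(s, a) + \gamma \max_{a'} Q^*(s', a') \,\right]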

While training, the action selected follows the epsilon-greedy strategy, where the action with the largest value is selected with probability 1 − epsilon, and a random action is selected otherwise. This allows the agent to keep exploring the environment. The optimized policy essentially chooses the action with the maximum value for the current state at each step, so optimizing the action-value function indirectly optimizes the policy.
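A tiny sketch of epsilon-greedy selection over an array of Q-values (generic Python, not tied to any particular library):

```python
# Sketch of epsilon-greedy action selection over a list/array of Q-values.
import random

def epsilon_greedy(q_values, epsilon):
    """Pick the greedy action with probability 1 - epsilon, otherwise explore."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # random exploratory action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # greedy action
```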

Deep Q-learning (DQN) uses a neural network to model and optimize the action-value function. The agent's experiences are stored in a dataset collected over many episodes. The Q-learning updates are applied to samples of this experience dataset which are chosen at random. This process is called experience replay, and following this, the actor then chooses an action from the policy. In addition, the weights of a separate target network are not updated after each step, but rather periodically, which keeps the update targets stable even though the current policy differs from the policy that generated the stored experience.
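A minimal sketch of the experience-replay idea (generic Python; the class and method names are illustrative, not DQN's exact implementation):

```python
# Sketch of experience replay: store transitions, then sample random minibatches.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions are discarded when full

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)
```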

While Deep Q-learning is much more sample efficient than on-policy algorithms, it does have several limitations: it can be much less stable than its on-policy counterparts, and it can only be applied to discrete action spaces.

Deep Deterministic Policy Gradient (DDPG) is an algorithm that also uses Q-learning, but can be applied to continuous action spaces. It learns a Q-function and a policy concurrently, taking elements from both of the model-free learning approaches that we have discussed above. In addition to an approximator for the optimal action-value function, DDPG learns an approximator for the optimal action. The two are related as follows:
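a^*(s) = \arg\max_{a} Q^*(s, a)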

which states that the optimal action is the one that maximizes the optimal action-value function for a given state. Since DDPG is specific to continuous action spaces, Q*(s, a) is presumed to be differentiable with respect to the action. Given this fact, we can use gradients to build a better approximator for max over a of Q*(s, a), to the point that it can be approximated by evaluating the Q-function at the action produced by the policy in the given state. The objective of DDPG is to learn a policy that outputs the action maximizing Q*(s, a). This policy is found off-policy, using gradient ascent with respect to the policy parameters, while the Q-function is optimized via Q-learning.
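A sketch of the DDPG actor update under these assumptions (PyTorch-style; `actor` and `critic` are hypothetical modules, with the critic taking a state-action pair):

```python
# Sketch of the DDPG actor update: gradient ascent on Q(s, mu(s)) w.r.t. policy params.
import torch

def ddpg_actor_step(actor, critic, actor_optimizer, states):
    actions = actor(states)                       # deterministic actions mu_theta(s)
    actor_loss = -critic(states, actions).mean()  # maximize Q by minimizing its negative
    actor_optimizer.zero_grad()
    actor_loss.backward()                         # gradients flow through the critic into mu
    actor_optimizer.step()
```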

During training, noise is incorporated to ensure that the actions taken are varied enough to facilitate exploration and mitigate the potential for becoming trapped in local optima.
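For illustration, adding exploration noise to a deterministic action might look like this (a simple Gaussian-noise sketch; the original DDPG paper used an Ornstein-Uhlenbeck process instead):

```python
# Sketch: exploration noise added to a deterministic action, then clipped to valid bounds.
import numpy as np

def noisy_action(mu_action, noise_scale, low, high):
    noise = noise_scale * np.random.randn(*np.shape(mu_action))
    return np.clip(mu_action + noise, low, high)
```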

Soft Actor-Critic

Soft Actor-Critic (SAC) shares many similarities with DDPG in that it uses Q-learning and policy optimization, but the main differentiator is that SAC is a form of entropy-regularized reinforcement learning. The main benefit of entropy-regularized reinforcement learning is that it encourages exploration throughout training by incorporating an entropy term into the policy objective and the Q-function.

By adding a bonus proportional to the entropy of the policy at each time step, exploration is encouraged during learning. The balance between exploration and exploitation can be controlled via the alpha coefficient in front of the entropy term. The optimized policy chooses actions to maximize the expected future return plus the expected future entropy.
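Formally, the maximum-entropy objective from the paper can be written as:

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\right]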

In terms of Q-learning, Soft Actor-Critic actually learns two Q-functions (as opposed to one, like DDPG), which are trained by minimizing the mean squared Bellman error, a measure of how closely the Q-functions approximate the Bellman target. The smaller of the two Q-values is then used, which mitigates the risk of overestimation in the Q-function.
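As a sketch of how the two critics can be combined when forming the Bellman target (PyTorch-style; the names and the entropy-augmented target follow one common formulation rather than the paper's exact pseudocode):

```python
# Sketch of the clipped double-Q target used in SAC-style updates.
# q1_target, q2_target are target critic networks; next_actions and next_log_probs
# are sampled from the current (stochastic) policy; dones is 0.0 or 1.0.
import torch

def sac_q_target(q1_target, q2_target, rewards, next_states, next_actions,
                 next_log_probs, dones, gamma=0.99, alpha=0.2):
    q1_next = q1_target(next_states, next_actions)
    q2_next = q2_target(next_states, next_actions)
    # Taking the minimum of the two estimates mitigates Q-value overestimation.
    min_q_next = torch.min(q1_next, q2_next)
    # Subtracting alpha * log pi adds the entropy bonus to the target.
    soft_value = min_q_next - alpha * next_log_probs
    return rewards + gamma * (1.0 - dones) * soft_value
```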

Summary of Reinforcement Learning Algorithms

Test Case

The researchers compared the effectiveness of the Soft Actor-Critic algorithm to four other algorithms: DDPG, SQL (soft Q-learning), PPO, and TD3 (Twin Delayed DDPG). To do this, the algorithms were run in six OpenAI environments with continuous action spaces, a description of which can be seen below:

OpenAI test Environments

The average return with respect to the number of steps for each algorithm was then plotted, which is shown below:

T. Haarnoja, A. Zhou, P. Abbeel and S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”, 2018

In the Hopper-v1 environment, SAC finished with the highest average return, with PPO yielding similar but slightly inferior results. In Walker2d-v1, SAC was outperformed by TD3 and once again performed similarly to PPO. In HalfCheetah-v1, SAC outperformed the other algorithms by a significant margin while maintaining relatively stable results (shown by the shading). In addition, SAC achieved the fastest growth in average return over the course of training. SAC also yielded the highest average return in Ant-v1, with the lowest variance. In Humanoid-v1, SAC exhibited an extremely fast increase in average return compared to the other algorithms (outside of SQL) but saturated and finished with a result similar to PPO and SQL. In Humanoid (rllab), SAC outperformed all other algorithms by a wide margin, both in how quickly the average return grew and in the final average return.

Based on these results, SAC is shown not only to be more efficient on average, in terms of the steps and iterations needed, than the other algorithms, but also to obtain high average returns with good stability. The researchers also explored different implementations and hyper-parameters to determine which values worked best.

T. Haarnoja, A. Zhou, P. Abbeel and S. Levine, “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”, 2018

In the Humanoid (rllab) environment, a stochastic policy with the Soft Actor-Critic algorithm yielded slightly better average returns, with drastic improvements in stability.

Conclusion

The Soft Actor-Critic algorithm is able to take advantage of the sample efficiency that comes with off-policy reinforcement learning, while combating the instability often found in off-policy learning by incorporating entropy maximization. The entropy maximization also encourages exploration, which counteracts the propensity of some RL algorithms to get trapped in local optima, curtailing performance. In this way, Soft Actor-Critic is able to obtain the benefits of both on-policy and off-policy learning in one algorithm.
