6 Established RL Models — 1 Minute Each

DQN | PPO | A2C | DDPG | TD3 | SAC — Intuition For Practitioners/Enthusiasts

James Koh, PhD
MITB For All
9 min read · Oct 21, 2023


Image created by DALL·E 3 based on the prompt “Create a realistic looking square image of a minute hourglass on top of an open book, with words on the book. The background should have trees.”

Explaining something in great detail takes substantial understanding. The same is true of summarizing it.

Not everyone wants to sit and listen for an hour. What if you had to sum up the key ideas in a minute?

Perhaps it is in an interview. Perhaps you have to convey the idea to your CEO or client. Your explanation should cater to your target audience.

Objective

Today, I will attempt to do this. You, the readers, will be my CEO or client. At the end of each one-minute summary, you should have a high-level understanding of why that particular algorithm is used.

Of course, this knowledge is not sufficient for you to code it out by yourself, nor derive the mathematics step by step. (I explained these 6 models to my class in around 90 minutes). That is not the point here.

Why do I choose these 6 in particular? Because all of them are implemented in Stable Baselines3.

DQN

Deep Q-Network.

This is a value-based method, and we do not learn any policy directly.

A universal function approximator is used to learn an action-value function Q. Q is the expected total (discounted) reward from taking a particular action a in a particular state s.

Once we get a reasonably good estimate of this function, we can predict which actions would bring us more rewards in the long term.

Unlike tabular Q-learning, we can deal with continuous state-spaces using DQN. The use of neural networks also allows generalization: if we have visited the neighbors of state X, that says something about X, even if X itself has not been visited.

Neural net architecture for DQN. Image by author.

The input is the current state s, which in general is a dₛ-dimensional vector. The output is a value for each possible action. In the above, the action space has aₘ distinct actions that could be taken in state s.

During rollouts (i.e., playing sample episodes), tuples of the state, the action taken, the resulting reward and the next state are collected in a buffer. This buffer is used for experience replay, which allows less-frequently encountered transitions to be reused as training samples while training the neural network.
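
For concreteness, here is a minimal sketch of such a buffer in Python; the class and method names are my own for illustration, not those of any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience-replay buffer: store transitions, sample random minibatches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)  # uniform random sampling
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```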

How is the neural network in DQN trained?

Its parameters are updated to minimize the MSE (mean squared error) between the predicted and target Q-values. Predicted values come from the neural network. Target values are the sampled reward plus the discounted value of the next state under the ‘best’ action.
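
As a rough sketch in PyTorch (assuming hypothetical q_net and target_net modules, and batch tensors drawn from the replay buffer above), the update looks something like this:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Predicted Q(s, a) for the actions actually taken (actions is a LongTensor of indices)
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target: r + gamma * max_a' Q_target(s', a'), with no bootstrapping at terminal states
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * q_next

    return F.mse_loss(q_pred, q_target)
```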

PPO

Proximal Policy Optimization.

This is a policy-based method, where the policy is learnt directly.

The intuition is this. In real life, we choose our actions on a relative, rather than absolute, basis.

I know that if I am on the bus, punching another passenger is a very bad idea. I don’t know what the expected cumulative rewards is, and neither does it matter. It is just bad, and not to be taken.

The PPO objective function is:

PPO objective function. Image by author.

Your CEO or client won’t care. Still, just flash it to show that years of education is required. It’s a real world out there.

Here, we seek to maximize the probability of taking an advantageous action, and avoid taking disadvantageous actions.

What happens if all actions are advantageous? That won’t happen; it goes against the very definition of advantage. How could everything be above average?

And what happens if multiple actions are advantageous? The ones with higher advantage will have their probabilities increased more.

So then, what is ‘advantage’? It is how much better (in terms of the returns G) an action is in a given state, relative to the returns expected when in that state. The baseline we subtract can simply be a moving average of G in that state (if discrete), or be given by a function approximator trained on history.

A positive advantage means that the action taken surpassed our expectations; naturally, we want to increase the likelihood of such an action being selected.

The clipping is there to disincentivize large changes to the policy. Unlike TRPO (PPO’s predecessor), which enforces this through a separate constraint, placing it within the objective function both simplifies and speeds up the implementation.
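
A sketch of the clipped surrogate in PyTorch, assuming log_probs_new and log_probs_old are the log-probabilities of the sampled actions under the current and old policies, and advantages have been estimated beforehand (all names are mine, for illustration):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(log_probs_new - log_probs_old)

    # Clipped surrogate: take the more pessimistic of the two terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Negate because optimizers minimize; we want to maximize the objective
    return -torch.mean(torch.min(unclipped, clipped))
```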

How is the neural network in PPO trained?

Notice that the objective function is parameterized by θ. The other values will simply be treated as constants when computing the partial derivatives. We perform gradient descent (add a negative sign; we want to maximize the objective — do not get confused by the ‘min’ operator).

θ will be updated so that advantageous actions are more likely to be taken.

A2C

Advantage Actor Critic.

This is an actor-critic method, where we learn both the policy and value function.

Where A2C stands in the spectrum. Image by author.

The actor learns a policy, while the critic learns a value function, both using neural networks as universal function approximators.

Similar to what was discussed in PPO, advantage is how much better an action is in a given state, relative to a baseline (the value function learned by the critic).

The actor takes the state as input, and outputs a probability distribution over possible actions. If the action space is discrete, this can be achieved simply with a softmax activation. If the action space is continuous, the output can be the parameters of a probability distribution (e.g., mean and standard deviation for a Gaussian).
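
As a sketch (dimensions and names are mine, for illustration only), both kinds of actor head could look like this in PyTorch:

```python
import torch
import torch.nn as nn

class DiscreteActor(nn.Module):
    """Outputs a categorical distribution over discrete actions."""

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, state):
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)  # softmax applied implicitly


class GaussianActor(nn.Module):
    """Outputs the mean and standard deviation of a Gaussian over continuous actions."""

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent log-std

    def forward(self, state):
        mean = self.net(state)
        return torch.distributions.Normal(mean, self.log_std.exp())
```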

The actor is trained to increase the probability of taking advantageous actions.

Let the actor be parameterized by θ. Through the mathematical derivation described in REINFORCE, we end up with log-probability terms, and each step of the gradient update can be performed as follows.

Updating the parameters of the Actor; \hat{A} is the advantage, and is treated as a constant when taking the partial derivative with respect to θ. Image by author.

For a critic to be accurate, its prediction of each current state value (from the neural network) should be equal to the actual sampled reward (from reality) plus the discounted next state value (from the neural network). It is trained as such — to minimize the MSE between these.
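
Putting the two updates together in a short sketch (log_probs are those of the sampled actions, values and next_values come from the critic; the names are mine, for illustration):

```python
import torch
import torch.nn.functional as F

def a2c_losses(log_probs, values, rewards, next_values, dones, gamma=0.99):
    # TD target and advantage, treated as constants (no gradient) for the actor update
    with torch.no_grad():
        td_target = rewards + gamma * (1.0 - dones) * next_values
        advantage = td_target - values

    actor_loss = -(log_probs * advantage).mean()  # raise the probability of advantageous actions
    critic_loss = F.mse_loss(values, td_target)   # move V(s) towards r + gamma * V(s')
    return actor_loss, critic_loss
```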

DDPG

Deep Deterministic Policy Gradient.

This is also an actor-critic method.

DDPG is meant for problems with a continuous action space. Rather than outputting a probability distribution over actions, it deterministically gives a specific action (with probability 1).

Unlike A2C and PPO, in which the update of the policy network is coupled with the advantage and requires on-policy training, DDPG is off-policy. This means we can use replay buffers, which aid in training the neural network.

On-policy vs off-policy. Image by author.

Unlike the actor in A2C, which seeks to maximize the advantage, or in PPO, which maximizes a clipped surrogate built on the advantage, the actor in DDPG seeks to maximize the predicted action value as evaluated by its critic, parameterized by 𝜙.

Let θ parameterize the deterministic policy μ, and 𝜙 parameterize the value function Q.

Objective which μ seeks to maximize. Image by author.

Therefore, its effectiveness is conditional on the critic being accurate. The actor is trained to produce actions which have high action-values (as predicted by the critic).
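
In code form, the actor’s objective could be sketched as follows, with mu_net as the deterministic policy and q_net as the critic (hypothetical names, for illustration):

```python
import torch

def ddpg_actor_loss(mu_net, q_net, states):
    # The actor proposes actions; the critic scores them.
    actions = mu_net(states)
    # Maximize Q(s, mu(s)), i.e. minimize its negative. Gradients flow through the
    # critic into the actor's parameters; the critic itself is updated separately.
    return -q_net(states, actions).mean()
```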

We can take refuge in Math, however. The critic will be trained to minimize the MSE between the predicted action values (of the current state) and the target action values (obtained from the reward and next state). As the MSE decreases, it indicates that the critic is becoming a more reliable predictor of the action value Q(s,a).

Target networks are used for stability, and soft updates are applied to their parameters (i.e., polyak averaging, which is really just an exponential moving average).
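
A soft update is just an exponential moving average over parameters; a minimal sketch, assuming two PyTorch modules named net and target_net:

```python
import torch

@torch.no_grad()
def soft_update(net, target_net, tau=0.005):
    # target <- tau * online + (1 - tau) * target, applied parameter by parameter
    for p, p_target in zip(net.parameters(), target_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p)
```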

TD3

Twin Delayed DDPG.

This is another actor-critic method, and is obviously an extension of DDPG. The ‘D3’ refers to ‘delayed’, ‘deep’, and ‘deterministic’.

The ‘twin’ refers to the use of two (independent) Q-networks. The target, on which both Q-networks are trained, is the actual reward obtained plus the discounted value of the next state-action pair, taking the lower of the two networks’ estimates.

Target to train Q-networks. Equation by author.
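
In code form, the target takes the minimum of the two target critics; a sketch with my own names (the noise added to next_actions is described a little further below):

```python
import torch

@torch.no_grad()
def td3_target(q1_target, q2_target, rewards, next_states, next_actions, dones, gamma=0.99):
    # Two independent critics evaluate the same next state-action pair;
    # take the lower of the two estimates.
    q_next = torch.min(q1_target(next_states, next_actions),
                       q2_target(next_states, next_actions))
    return rewards + gamma * (1.0 - dones) * q_next
```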

Why do we take the lower?

Suppose you want to make an investment in a particular stock, and you consult two professional analysts for their opinion (let’s assume they are honest and helpful).

If both of them say it is a good investment, won’t you think there is a decent chance that it is going to turn out well? Even the pessimistic outcome is good, and if the optimistic outcome happens, all the better. Taking the lower of the two estimates works in the same spirit: it guards against the overestimation bias that arises when noisy Q-estimates are maximized over.

Next comes the delayed in TD3. Here, the policy is updated less frequently than Q, with the intention of dampening volatility. We want to make sure the targets are stable, before switching our actions towards those targets.

The parameters of the policy network in TD3 are, as in DDPG, updated in the direction of producing actions with high action-values (as predicted by the critic).

In addition, clipped noise is added to the deterministic action produced by the target policy when forming the Q-learning target (target policy smoothing). This serves as a form of regularization. We want to train a model that allows for some margin of error. Suppose your doctor asks you to take 5ml of medicine; you should be fine taking 4.8ml or 5.2ml. It would be horrific if a slight deviation from 5ml could cause you to die of poisoning.

The noise is clipped, of course. If you were to take 20ml of medicine due to a large standard deviation, the doctor takes no responsibility if you die of overdose.
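
A sketch of that clipping, with the noise scale and clip range as assumed values; the perturbed action here would be the next_actions fed into the target computation sketched earlier:

```python
import torch

def smoothed_target_action(mu_target, next_states, noise_std=0.2, noise_clip=0.5,
                           action_low=-1.0, action_high=1.0):
    # Perturb the deterministic target action with clipped Gaussian noise,
    # then keep the result inside the valid action range.
    actions = mu_target(next_states)
    noise = (torch.randn_like(actions) * noise_std).clamp(-noise_clip, noise_clip)
    return (actions + noise).clamp(action_low, action_high)
```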

SAC

Soft Actor Critic.

Again, an actor-critic method like A2C, DDPG and TD3.

Here, the entropy of the policy is added to the reward at each step. The intuition is somewhat in line with the idea of adding noise in TD3 (to ensure that 4.8ml or 5.2ml of medicine will be fine), but done in a different way.

Having a high entropy means that the action probabilities are well distributed. This means that in any particular given state, the policy should not narrow down on just one particular action, and instead allow for other options with non-negligible probabilities.

The policy network in SAC is trained to maximize the expected discounted sum of rewards and entropy, rather than rewards alone. Notice that if α = 0, we simply recover the usual return G.

Equation for SAC. α is the trade-off coefficient (not the learning rate). Image by author.

As you may have already noticed, I believe in using analogies. Think of SAC here as your parent who tells you that you can choose to be a scientist/engineer/doctor/lawyer/teacher and succeed in life. Isn’t this much better than saying you must be a lawyer and there is no other choice?

The critics in SAC learn value functions in which entropy is added to the rewards. As in the cases above, gradient updates are done to minimize the MSE. The target uses the lower of the two predictions (following the same principle as in TD3).
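
A sketch of this entropy-augmented critic target, assuming a stochastic actor whose sample method returns an action together with its log-probability (names are mine, for illustration):

```python
import torch

@torch.no_grad()
def sac_target(actor, q1_target, q2_target, rewards, next_states, dones,
               gamma=0.99, alpha=0.2):
    # Sample a next action from the current policy and get its log-probability.
    next_actions, log_probs = actor.sample(next_states)

    # Lower of the two target critics (as in TD3), plus the entropy bonus -alpha * log pi.
    q_next = torch.min(q1_target(next_states, next_actions),
                       q2_target(next_states, next_actions))
    soft_q_next = q_next - alpha * log_probs

    return rewards + gamma * (1.0 - dones) * soft_q_next
```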

The actor hence learns to choose actions which have high expected future cumulative reward and entropy. This means we want a policy that gives good rewards over the long run while keeping our options open at the same time.

Parting Words

In a nutshell, everything above, and indeed RL in general, can be summed up as follows, regardless of how fanciful it may seem.

It comes down to the underlying principles of exploration vs exploitation.

We train the agent such that it learns to take actions that allow us to exploit our current knowledge, and get the best expected return (not immediate reward).

We explore to visit different states and take different actions, so that we have a more accurate approximation of what is good in the long run.

Your Vote Counts

Would you like to see more of:

  1. Deep dives with the math explained?
  2. Deep dives with code and results?
  3. High-level intuition like the above?

If you can, please spare 5 seconds to highlight your choice. This will allow me to cater to your preferences.

Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
