Policy Based Methods

Sandeep Anand
Jun 9, 2018 · 5 min read


This is the second part of my Deep Q-Learning overview; you can find the first write-up here. Please check it out, it is comprehensive. This post is an overview of policy based methods and of how policy gradients are calculated. All the equations are written in LaTeX and Word.

Actor-critic methods are a relatively new and interesting approach to solving very difficult problems, as in AlphaGo: the actor is modeled as the policy, which dictates actions, and the critic is modeled as the Q-function, which scores the agent's decisions and thereby improves the policy over the iterations.

  • Taken from this amazing link, where they play the game “StarCraft II” with the help of RL.
  • I was strongly influenced to write about and study more deep learning and RL by this UC Berkeley PhD student, who has some amazing guides: Link
[Figure: Actor-critic methods]
[Figure: All-around deep reinforcement learning path]
[Figure: Pseudocode for actor-critic methods]
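The pseudocode itself lives in the image above and does not survive the text export, so here is a minimal one-step actor-critic sketch in PyTorch. The network sizes, learning rates, and the `actor_critic_step` helper are my own illustrative assumptions, not the exact algorithm from the figure.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Tiny actor (policy) and critic (state-value) networks for a discrete-action task.
# obs_dim and n_actions are placeholders; plug in your environment's sizes.
obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, 1))
actor_opt = optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_step(state, action, reward, next_state, done):
    """One online update: the critic scores the transition with a TD error,
    and the actor is nudged toward actions the critic scored well."""
    s = torch.as_tensor(state, dtype=torch.float32)
    s2 = torch.as_tensor(next_state, dtype=torch.float32)

    # Critic: TD error = r + gamma * V(s') - V(s); this is the "score".
    v_s = critic(s).squeeze()
    with torch.no_grad():
        v_s2 = 0.0 if done else critic(s2).squeeze().item()
    td_error = reward + gamma * v_s2 - v_s

    critic_loss = td_error.pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: raise the log-probability of the taken action, weighted by the score.
    log_probs = torch.log_softmax(actor(s), dim=-1)
    actor_loss = -log_probs[action] * td_error.detach()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

In practice the critic is often trained on batches of experience and the actor weighted by an advantage estimate, but the division of labor between the two networks is the same.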

Why do we use policy based methods if value based methods work so well?

  • One of the main reasons is that policy based methods give more direct control over the policy, and they can be used in two different ways, as shown below.
[Figure: Two ways policy based methods are used]
  • They can learn true stochastic policies. This is like drawing from a set of numbers where the probability of picking each number depends on some state variables and can change with them.
  • A stochastic policy helps when we have aliased states: two or more states that are perceived to be identical but are actually different. With a stochastic policy it is easier to choose a good action in such states. With a value function it is difficult, because the aliased states map to the same value, so the agent may simply oscillate between actions and never escape to the right one. An epsilon-greedy strategy can sometimes break the oscillation, but it is not efficient, and with a high value of epsilon it can take a very long time (a short sketch after this list illustrates the contrast).
  • A value based approach tends to learn a deterministic or near-deterministic policy, whereas a policy based approach can learn the desired stochastic policy.
  • Policy based methods are well suited for continuous action spaces. With a value based method we would have to search for the action that maximizes the value function over a continuous range, which is an optimization problem in its own right; a policy based method outputs the action directly.
  • For high-dimensional action spaces it is useful to map a given state to an action directly. Even if the resulting policy is a bit complex, it is efficient and the agent can act faster, which is something policy based methods are good at.
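A tiny illustration of the aliased-state point, with made-up numbers (not from the original figures): a greedy, value based agent always picks the same action in two aliased states, while a stochastic policy can split its probability between the reasonable actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two aliased states look identical to the agent, so they share one row of
# Q-values and one row of policy probabilities (illustrative numbers only).
q_aliased = np.array([1.0, 1.0, 0.2])    # "left" and "right" look equally good
pi_aliased = np.array([0.5, 0.5, 0.0])   # a stochastic policy hedges between them

greedy_action = int(np.argmax(q_aliased))                         # always action 0 -> can get stuck
sampled_action = int(rng.choice(len(pi_aliased), p=pi_aliased))   # sometimes 0, sometimes 1
print(greedy_action, sampled_action)
```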

Policy Function Approximation

  • Below I describe the policy function approximation process for discrete and continuous action spaces.
  • For a discrete action space we can use a softmax function to turn the outputs into action probabilities, whereas for a continuous action space we use a Gaussian (normal) distribution; in both cases the purpose is to be able to pick an action stochastically (a short sketch follows this list).
[Figure: Policy function approximation]
  • Finally, after choosing a function, we calculate an objective (or loss) function to determine how good the policy is.
  • The expected value can be estimated over multiple trajectories, for both episodic and continuing environments; some examples are summarized below.
[Figure: Objective functions summary]
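As promised above, here is a minimal sketch of the two parameterizations, in plain NumPy with made-up feature sizes and weights; it only shows how an action would be sampled stochastically, not a full training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - np.max(x)              # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

state = rng.normal(size=4)                 # some state features (illustrative)

# Discrete actions: linear preferences per action -> softmax -> sample.
theta_discrete = rng.normal(size=(4, 3))   # one column of weights per action
probs = softmax(state @ theta_discrete)
discrete_action = rng.choice(3, p=probs)

# Continuous actions: predict a mean (std fixed here) and sample from a Gaussian.
theta_mu = rng.normal(size=4)
mu, sigma = state @ theta_mu, 0.5
continuous_action = rng.normal(mu, sigma)
```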

Stochastic Policy Search

  • After choosing an objective function as above, we need to maximize it.
  • We use an iterative approach to find the best policy, similar in spirit to gradient ascent. Remember that the objective is a very complicated function, and finding parameters that maximize it takes a lot of effort.
  • For gradient ascent or hill climbing, the policy we want is the one whose objective value sits at the top of the hill; to get there, we start from a candidate policy and iterate from it.
  • Steepest-ascent hill climbing: choose the best of all the candidate steps and move in that direction. It is greedy, but better than naive hill climbing.
  • Simulated annealing: another process for finding the optimal policy. We start with a large noise parameter and gradually reduce the noise as we approach the optimal solution, similar to the way hot metal is cooled slowly so that it hardens into a good structure. It is a good optimization algorithm.
  • Adaptive noise scaling: shrink the search radius whenever we find a policy that improves the objective, and widen it again when we do not (see the sketch below).
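A minimal sketch of steepest-ascent hill climbing with adaptive noise scaling. The `evaluate(weights)` function is a hypothetical placeholder for running one episode with a policy parameterized by `weights` and returning the total reward; here it is a toy objective so the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(weights):
    """Placeholder: run one episode with these policy weights and return the
    total reward. The toy objective below peaks when every weight equals 1."""
    return -np.sum((weights - 1.0) ** 2)

best_weights = rng.normal(size=8)          # candidate policy parameters
best_return = evaluate(best_weights)
noise_scale = 1.0

for _ in range(200):
    # Steepest ascent: sample several noisy candidates around the current best
    # and keep the one with the highest return.
    candidates = [best_weights + noise_scale * rng.normal(size=8) for _ in range(10)]
    returns = [evaluate(c) for c in candidates]
    i = int(np.argmax(returns))

    if returns[i] > best_return:
        best_weights, best_return = candidates[i], returns[i]
        noise_scale = max(noise_scale / 2, 1e-3)   # improvement: shrink the search radius
    else:
        noise_scale = min(noise_scale * 2, 2.0)    # no improvement: widen it again

print(best_return)
```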

Policy Gradient calculations

[Figure: Policy gradient calculations]

[Figure: Monte Carlo policy gradients algorithm]
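The actual derivation is in the figures above. For reference, a standard way to write the objective and its Monte Carlo (REINFORCE) gradient estimate over m sampled trajectories is:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right],
\qquad
\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H}
    \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)} \mid s_t^{(i)}\right) R\!\left(\tau^{(i)}\right),
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
```

The log-probability (likelihood-ratio) trick is what lets the gradient move inside the expectation even though the trajectory distribution itself depends on theta.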

Finding the difference between two policy distributions: when comparing two policies, measuring this difference helps prevent the objective function from changing too much when we switch to a new policy.

Using the KL divergence

[Figure: Calculating the difference in policies]
[Figure: KL divergence]
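For reference, the KL divergence between the old and the new policy at a state s, for discrete actions, is:

```latex
D_{\mathrm{KL}}\!\big(\pi_{\text{old}}(\cdot \mid s) \,\|\, \pi_{\text{new}}(\cdot \mid s)\big)
    = \sum_{a} \pi_{\text{old}}(a \mid s) \, \log \frac{\pi_{\text{old}}(a \mid s)}{\pi_{\text{new}}(a \mid s)}
```

Keeping this quantity small between updates is exactly what stops the new policy from drifting too far from the old one.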

Conclusions

  • A policy is a probability distribution over a range of actions, and it is difficult to determine the optimal policy over a continuous action space.
  • A lot of improvements are being made to the calculation of the policy gradient and to the algorithms for optimizing the objective function.
  • The best approach is to pick up a paper and try to implement it.

References and further Study

