Policy-Based Methods
This is the second part of my Deep Q-Learning overview; you can find the first write-up here. Please check it out, it is comprehensive. This part is an overview of policy-based methods and the calculation of policy gradients. All the equations are written in LaTeX and Word.
Actor-Critic methods are a relatively new and interesting approach to very difficult problems such as AlphaGo: the actor is modeled as the policy, which dictates actions, while the critic is modeled as the Q-function, which scores the agent's decisions and thereby improves the policy over iterations.
Why do we use policy-based methods if value-based methods work so well?
- One of the main reasons is that a policy-based method gives more direct control over the policy itself, in the ways described below.
- They can learn true stochastic policies: a probability distribution over actions, conditioned on the state, from which actions are sampled, like drawing from a set of numbers where the probability of picking each number depends on state variables that can change.
- A stochastic policy helps when we have aliased states: two or more states that are perceived as identical but are actually different. With a stochastic policy it is still easy to choose a sensible action, whereas a value function maps both states to the same value, so it is difficult to choose between actions. In such cases an agent driven purely by a value function can oscillate between the aliased states and never reach the right action. An epsilon-greedy policy can sometimes break the agent out of these oscillations, but that is inefficient, and with a high value of epsilon it can take quite a long time.
- A value-based approach tends to learn a deterministic or near-deterministic policy, whereas a policy-based approach can learn the desired stochastic policy.
- Policy-based methods are well suited for continuous action spaces. A value-based method would have to find the global maximum of the value function over a continuous range of actions at every step, which is itself a hard optimization problem; a policy-based method simply outputs the action.
- For high-dimensional action spaces it is useful to map a given state to an action directly. Even if the resulting policy is somewhat complex, this is efficient and the agent acts faster, and that is something policy-based methods are good at.
Policy Function Approximation
- Below I describe the policy function approximation process for discrete and continuous action spaces
- For discrete action spaces we can use a softmax function to represent the policy, whereas for continuous action spaces we use a Gaussian (normal) distribution; the purpose in both cases is to be able to pick an action stochastically.
- Finally, after choosing a function, we calculate an objective (or loss) function to determine how good the policy is.
- The objective is an expected value, which can be calculated over multiple trajectories, for both episodic and continuing environments.
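As a concrete sketch of the two policy parameterizations above (all function names here are illustrative, not from the original): a discrete policy turns per-action scores into a softmax distribution and samples from it, while a continuous policy samples from a Gaussian whose mean and standard deviation would normally come from a function of the state.

```python
import numpy as np

def softmax_policy(scores):
    """Turn per-action scores into a probability distribution (discrete actions)."""
    z = scores - np.max(scores)              # shift by the max for numerical stability
    probs = np.exp(z) / np.sum(np.exp(z))
    return probs

def sample_discrete(scores, rng):
    """Sample an action index stochastically from the softmax distribution."""
    probs = softmax_policy(np.asarray(scores, dtype=float))
    return rng.choice(len(probs), p=probs)

def sample_continuous(mean, std, rng):
    """Sample a continuous action from a Gaussian policy."""
    return rng.normal(mean, std)

rng = np.random.default_rng(0)
probs = softmax_policy(np.array([2.0, 1.0, 0.1]))   # distribution over 3 actions
action = sample_discrete([2.0, 1.0, 0.1], rng)      # stochastic discrete action
a_cont = sample_continuous(mean=0.5, std=0.1, rng=rng)  # stochastic continuous action
```

In a real agent, the scores (or the Gaussian's mean and std) would be the output of a neural network applied to the state, so the distribution changes as the state changes.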
Stochastic Policy Search
- After choosing an objective function as above, we need to maximize it.
- Use an iterative approach to find the best policy, similar to gradient ascent; remember the objective function is a very complicated function, and it takes a lot of effort to find parameters that maximize it.
- For gradient ascent or hill climbing, we treat the policy that maximizes the objective function as the top of the hill; to reach it, we start with a candidate policy and iterate from there.
- Steepest-ascent hill climbing — choose the best of all candidate steps and move in that direction; it is greedy, but better than naive hill climbing.
- Simulated annealing — another process for finding the optimal policy. We start with a large noise parameter and gradually reduce the noise as we approach the optimal solution, analogous to annealing metal by heating it and then letting it cool slowly. It is a good optimization algorithm.
- Adaptive noise scaling — reduce the search radius for the optimal policy once we find a policy that improves the objective function, and widen it again when we fail to improve.
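The steps above can be sketched as a steepest-ascent hill climber with adaptive noise scaling. Here a toy quadratic stands in for the true policy objective J(θ), and all names are illustrative:

```python
import numpy as np

def objective(theta):
    # Toy stand-in for the policy objective J(theta); its peak is at theta = 3.
    return -(theta - 3.0) ** 2

def hill_climb(n_iters=200, n_candidates=10, seed=0):
    rng = np.random.default_rng(seed)
    best_theta = 0.0
    best_score = objective(best_theta)
    noise = 2.0                                  # current search radius
    for _ in range(n_iters):
        # Steepest ascent: evaluate several perturbations, keep the best one.
        candidates = best_theta + noise * rng.standard_normal(n_candidates)
        scores = objective(candidates)
        top = candidates[np.argmax(scores)]
        if np.max(scores) > best_score:
            best_theta, best_score = top, np.max(scores)
            noise = max(noise / 2.0, 1e-3)       # improvement: shrink the radius
        else:
            noise = min(noise * 2.0, 2.0)        # no improvement: widen it again
    return best_theta

theta = hill_climb()   # converges near the peak at 3.0
```

In a real policy search, `theta` would be the policy's parameter vector and `objective` would be an estimate of the expected return obtained by running episodes with that policy.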
Policy Gradient calculations
Monte-Carlo Policy Gradients Algorithm
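A minimal Monte-Carlo policy gradient (REINFORCE) sketch, shown here on a toy two-armed bandit where arm 0 always pays reward 1 and arm 1 pays 0 (the bandit setup and all names are illustrative assumptions, not from the original). For a softmax policy, the score function ∇θ log π(a|θ) has the closed form onehot(a) − probs, which the update uses directly:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    return np.exp(z) / np.sum(np.exp(z))

def reinforce_bandit(n_episodes=2000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)                          # one score per action
    for _ in range(n_episodes):
        probs = softmax(theta)
        a = rng.choice(2, p=probs)               # sample an action from the policy
        reward = 1.0 if a == 0 else 0.0          # arm 0 is the good arm
        # Score function for a softmax policy: grad log pi(a) = onehot(a) - probs
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta += lr * reward * grad_log_pi       # REINFORCE: ascend E[reward]
    return softmax(theta)

final_probs = reinforce_bandit()   # probability mass shifts onto the rewarding arm
```

In the full episodic algorithm, `reward` is replaced by the return of the whole sampled trajectory, and the gradients of log π are accumulated over every step of that trajectory.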
Finding the difference between two policy distributions, when comparing two policies, helps prevent the objective function from changing too much under a new policy.
Using the KL divergence
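For two discrete action distributions p (old policy) and q (new policy), the KL divergence D_KL(p ‖ q) = Σ p(a) log(p(a)/q(a)) measures how far the new policy has moved from the old one. A sketch (the example distributions are made up):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for discrete distributions; eps guards against log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

old_policy = np.array([0.7, 0.2, 0.1])
new_policy = np.array([0.6, 0.3, 0.1])

kl_same = kl_divergence(old_policy, old_policy)  # identical policies: KL is 0
kl_diff = kl_divergence(old_policy, new_policy)  # different policies: KL is > 0
```

Constraining or penalizing this quantity during updates is what keeps the new policy close to the old one, so a single policy step cannot change the objective too drastically.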
Conclusions
- A policy is a probability distribution over a range of actions, and it is difficult to determine the optimal policy over a continuous action space.
- A lot of work is being done on improving the calculation of the policy gradient and the algorithms for optimizing the objective function.
- The best approach is to pick a paper and try to implement it.
References and further Study
- Parameter Space Noise for Exploration
- Udacity Deep Learning Course
- Constrained Policy Gradients
- DDPG: https://arxiv.org/abs/1509.02971
- https://blog.openai.com/better-exploration-with-parameter-noise/
- Amazing videos and work here: https://github.com/telecombcn-dl