REINFORCE — a policy-gradient based reinforcement learning algorithm
The goal of any reinforcement learning (RL) algorithm is to determine the optimal policy, the one that maximises the reward. Policy gradient methods iteratively improve the policy by modelling and optimising it directly.
It is important to understand a few concepts in RL before we get into the policy gradient. Please have a look at this Medium post for an explanation of a few key concepts in RL.
Policy Gradient algorithm
The policy gradient algorithm is an approach in which the policy is directly adjusted to reach the optimal policy that maximises the expected return. Algorithms of this type are model-free reinforcement learning (RL) methods: they require no prior knowledge of a model of the environment. In other words, we do not know the environment dynamics, or transition probability, which is indicated as below:
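P(st+1 | st, at)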
It can be read as the probability of reaching the next state st+1 by taking the action at from the current state st. The transition probability is sometimes confused with the policy. The policy π is a distribution over actions given states; in other words, the policy defines the behaviour of the agent.
The transition probability, in contrast, describes the dynamics of the environment, which are not readily available in many practical applications.
Return and reward
We can define the return as the sum of rewards from the current state to the goal state, i.e. the sum of rewards in a trajectory (we are considering a finite, undiscounted horizon):
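R(τ) = R(s0, a0) + R(s1, a1) + … + R(sT−1, aT−1)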
Where τ = (s0,a0,…,sT−1,aT−1).
For more information, follow this link.
Objective function
In policy gradient methods, the policy is usually modelled with a parameterized function with respect to θ, πθ(a|s). From a mathematical perspective, an objective function is something to minimise or maximise. We consider a stochastic, parameterized policy πθ and aim to maximise the expected return using the objective function J(πθ)[7]:
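J(πθ) = E[R(τ)], where the expectation is over trajectories τ sampled by following the policy πθ.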
Here R(st, at) is the reward obtained at timestep t by performing the action at from the state st, and the sum of these rewards over a trajectory can be represented as R(τ).
We maximise the objective function J, and hence the return, by adjusting the policy parameters θ to get the best policy. The best policy will always maximise the return. Gradient ascent is the optimisation algorithm that iteratively searches for the parameters that maximise the objective function.
Suppose we can find the gradient ∇ of the objective function J, as shown below:
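∇θ J(πθ)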
Then we can update the policy parameters θ (for simplicity, we are going to use θ instead of πθ) using the gradient ascent rule. This way, we update the parameters θ in the direction of the gradient (remember that the gradient gives the direction of maximum change, and its magnitude indicates the maximum rate of change). The gradient update rule is shown below:
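θk+1 = θk + α ∇θ J(θk)
where α is the learning rate (step size).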
Let’s derive the policy gradient expression
The expectation of a discrete random variable X can be defined as:
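E[X] = Σx x · P(x)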
where x is the value of random variable X and P(x) is the probability function of x.
Now we can rewrite our gradient as below:
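∇θ J(θ) = E[∇θ log P(τ|θ) R(τ)], with τ ∼ πθ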
We can derive this equation as follows[6][7][9]:
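∇θ J(θ) = ∇θ E[R(τ)]
= ∇θ ∫ P(τ|θ) R(τ) dτ (write the expectation as an integral over trajectories)
= ∫ ∇θ P(τ|θ) R(τ) dτ (bring the gradient inside the integral)
= ∫ P(τ|θ) ∇θ log P(τ|θ) R(τ) dτ (log-derivative trick: ∇θ P = P ∇θ log P)
= E[∇θ log P(τ|θ) R(τ)]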
The probability of a trajectory with respect to the parameters θ, P(τ|θ), can be expanded as follows[6][7]:
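P(τ|θ) = p(s0) ∏t=0…T−1 πθ(at|st) P(st+1|st, at)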
Where p(s0) is the probability distribution of the starting state and P(st+1|st, at) is the transition probability of reaching the new state st+1 by performing the action at from the state st.
Taking the log-probability of the trajectory, it can be expanded as below[7]:
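log P(τ|θ) = log p(s0) + Σt=0…T−1 [log πθ(at|st) + log P(st+1|st, at)]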
Taking the gradient of the log-probability of a trajectory then gives[6][7]:
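∇θ log P(τ|θ) = ∇θ log p(s0) + Σt=0…T−1 [∇θ log πθ(at|st) + ∇θ log P(st+1|st, at)]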
We can simplify this expression: neither the starting-state distribution p(s0) nor the transition probability P(st+1∣st, at) depends on θ, so their gradients are zero and they disappear. This is also why the policy gradient algorithm is model-free: the transition probability model is not needed.
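∇θ log P(τ|θ) = Σt=0…T−1 ∇θ log πθ(at|st)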
We can now go back to the expectation in our expression and replace the gradient of the log-probability of a trajectory with the equation derived above.
Now the policy gradient expression is derived as:
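∇θ J(θ) = E[Σt=0…T−1 ∇θ log πθ(at|st) R(τ)], with τ ∼ πθ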
On the left-hand side of the equation, ∇θ J(πθ) can simply be written as ∇θ J(θ), as noted earlier.
REINFORCE
REINFORCE is the Monte-Carlo sampling variant of policy gradient methods. That means the RL agent samples complete trajectories, from the starting state to the goal state, directly from the environment, rather than bootstrapping as in other methods such as Temporal Difference learning and Dynamic Programming.
We can rewrite our policy gradient expression in the context of Monte-Carlo sampling:
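∇θ J(θ) ≈ (1/N) Σi=1…N Σt=0…T−1 ∇θ log πθ(ai,t | si,t) R(τi)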
Where N is the number of trajectories used for one gradient update[6].
Pseudocode for the REINFORCE algorithm[11]:
1. Sample N trajectories by following the policy πθ.
2. Evaluate the gradient using the Monte-Carlo estimate of the policy gradient.
3. Update the policy parameters with the gradient ascent rule.
4. Repeat 1 to 3 until we find the optimal policy πθ.
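To make the pseudocode concrete, here is a minimal NumPy sketch of REINFORCE on a hypothetical four-state corridor environment. The environment, the softmax policy parameterisation, the learning rate, and the trajectory count are all illustrative assumptions, not from the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corridor MDP (hypothetical): states 0..3, start at state 0, goal at state 3.
# Action 0 moves left, action 1 moves right; reward 1 on reaching the goal.
N_STATES, N_ACTIONS, GOAL, HORIZON = 4, 2, 3, 10

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sample_trajectory(theta):
    """Sample one trajectory tau = (s0, a0, s1, a1, ...) by following pi_theta."""
    s, traj, ret = 0, [], 0.0
    for _ in range(HORIZON):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
        ret += 1.0 if s_next == GOAL else 0.0   # undiscounted return R(tau)
        traj.append((s, a))
        s = s_next
        if s == GOAL:
            break
    return traj, ret

def reinforce(theta, n_traj=20, lr=0.5, iters=200):
    for _ in range(iters):
        grad = np.zeros_like(theta)
        # 1. Sample N trajectories by following pi_theta.
        for _ in range(n_traj):
            traj, ret = sample_trajectory(theta)
            # 2. Accumulate sum_t grad log pi(a_t|s_t) * R(tau); for a softmax
            #    policy, grad log pi(a|s) = onehot(a) - pi(.|s).
            for s, a in traj:
                g = -softmax(theta[s])
                g[a] += 1.0
                grad[s] += g * ret
        # 3. Gradient ascent step on the Monte-Carlo gradient estimate.
        theta += lr * grad / n_traj
    return theta

theta = reinforce(np.zeros((N_STATES, N_ACTIONS)))
for s in range(GOAL):
    print(f"state {s}: P(right) = {softmax(theta[s])[1]:.3f}")
```

After training, the learned policy should strongly prefer moving right in every non-goal state. Note that this is the plain estimator derived above; subtracting a baseline (e.g. the average return) would reduce its variance without biasing it.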
If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.
References
1. https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a-html/node20.html
2. http://www.inf.ed.ac.uk/teaching/courses/rl/slides15/rl08.pdf
3. https://mc.ai/deriving-policy-gradients-and-implementing-reinforce/
4. http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_4_policy_gradient.pdf
5. https://towardsdatascience.com/the-almighty-policy-gradient-in-reinforcement-learning-6790bee8db6
6. https://www.janisklaise.com/post/rl-policy-gradients/
7. https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#deriving-the-simplest-policy-gradient
8. https://www.rapidtables.com/math/probability/Expectation.html
9. https://karpathy.github.io/2016/05/31/rl/
10. https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html
11. http://machinelearningmechanic.com/deep_learning/reinforcement_learning/2019/12/06/a_mathematical_introduction_to_policy_gradient.html
12. https://www.wordstream.com/blog/ws/2017/07/28/machine-learning-applications