REINFORCE: a policy-gradient-based reinforcement learning algorithm

Dhanoop Karunakaran
Intro to Artificial Intelligence
5 min read · Jun 4, 2020
Source: [12]

The goal of any reinforcement learning (RL) algorithm is to determine the optimal policy, the one that maximises the expected reward. Policy gradient methods are policy-iteration-style methods: they model and optimise the policy directly.

It is important to understand a few concepts in RL before we get into the policy gradient. Please have a look at this Medium post for an explanation of a few key concepts in RL.

Policy Gradient algorithm

The policy gradient algorithm is a policy iteration approach in which the policy is directly adjusted to reach the optimal policy, the one that maximises the expected return. Algorithms of this type are model-free reinforcement learning (RL) methods: there is no prior knowledge of a model of the environment. In other words, we do not know the environment dynamics or transition probability. The environment dynamics, or transition probability, is written as:

P(st+1|st, at)

It is read as the probability of reaching the next state st+1 by taking the action at from the current state st. The transition probability is sometimes confused with the policy. The policy 𝜋 is a distribution over actions given states; in other words, it defines the behaviour of the agent. The transition probability, in contrast, describes the dynamics of the environment, which are not readily available in many practical applications.
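To make the distinction concrete, here is a tiny Python sketch (the two-state, two-action numbers are made up purely for illustration): the policy is something the agent owns and adjusts, while the transition model belongs to the environment and is typically unknown to a model-free agent.

```python
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions, used only to illustrate the two objects.
# The policy pi(a|s): one probability distribution over actions per state (owned by the agent).
policy = np.array([
    [0.9, 0.1],   # in state 0, mostly pick action 0
    [0.2, 0.8],   # in state 1, mostly pick action 1
])

# The transition model P(s'|s, a): owned by the environment and usually unknown
# to a model-free agent. Indexed as transition[s, a, s'].
transition = np.array([
    [[0.7, 0.3], [0.1, 0.9]],   # from state 0, one row per action
    [[0.4, 0.6], [0.5, 0.5]],   # from state 1, one row per action
])

rng = np.random.default_rng(0)
state = 0
action = rng.choice(2, p=policy[state])                   # the agent's decision
next_state = rng.choice(2, p=transition[state, action])   # the environment's response
print(action, next_state)
```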

Return and reward

We can define our return as the sum of rewards from the current state to the goal state, i.e. the sum of rewards along a trajectory (here we consider only the finite, undiscounted horizon):

R(τ) = Σ_{t=0}^{T−1} R(st, at)

where τ = (s0, a0, …, sT−1, aT−1).

For more information, follow this link.
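As a minimal sketch of the definition above (with made-up reward values), the finite-horizon undiscounted return is simply the sum of the per-step rewards collected along one trajectory:

```python
# R(tau) = sum_{t=0}^{T-1} R(s_t, a_t) for a finite, undiscounted horizon.
rewards = [1.0, 0.0, -0.5, 2.0]   # hypothetical per-step rewards R(s_t, a_t)

def trajectory_return(rewards):
    """Undiscounted return of a single trajectory."""
    return sum(rewards)

print(trajectory_return(rewards))  # 2.5
```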

Objective function

In policy gradient methods, the policy is usually modelled as a parameterized function of θ, πθ(a|s). From a mathematical perspective, an objective function is something to minimise or maximise. We consider a stochastic, parameterized policy πθ and aim to maximise the expected return using the objective function J(πθ) [7].

J(πθ) = E_{τ∼πθ}[R(τ)] = E_{τ∼πθ}[ Σ_{t=0}^{T−1} R(st, at) ]

Objective function of policy gradient: Source: [4], [5], [6], and [7]

Here R(st, at) is the reward obtained at timestep t by performing the action at from the state st, and the sum of these rewards along a trajectory is exactly the return R(τ) defined earlier.

We can maximise the objective function J, and therefore the return, by adjusting the policy parameters θ to obtain the best policy; the best policy is the one that maximises the return. Gradient ascent is the optimisation algorithm that iteratively searches for the parameters that maximise the objective function.

Suppose we can compute the gradient ∇θ of the objective function J, shown below:

∇θ J(πθ) = ∇θ E_{τ∼πθ}[R(τ)]

The gradient ∇ of the objective function J: Source: [6]

Then we can update the policy parameters θ (for simplicity, we write θ instead of πθ) using the gradient ascent rule. This moves the parameters θ in the direction of the gradient (remember that the gradient gives the direction of maximum change, and its magnitude gives the maximum rate of change). The gradient update rule is shown below:

θ ← θ + α ∇θ J(πθ), where α is the learning rate

Gradient ascent update rule: Source: [6] and [7]
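As a quick illustration of the update rule (a toy example, not specific to RL), here is gradient ascent on the concave objective J(θ) = −(θ − 3)², whose maximum is at θ = 3; α plays the role of the learning rate:

```python
# Gradient ascent: theta <- theta + alpha * dJ/dtheta, stepping uphill on J.
def grad_J(theta):
    # dJ/dtheta for J(theta) = -(theta - 3)^2
    return -2.0 * (theta - 3.0)

theta, alpha = 0.0, 0.1
for _ in range(100):
    theta = theta + alpha * grad_J(theta)   # move in the direction of the gradient

print(theta)  # approaches 3.0, the maximiser of J
```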

Let’s derive the policy gradient expression

The expectation of a discrete random variable X can be defined as:

E[X] = Σ_x x P(x)

Expectation general equation: Source: [8]

where x is a value of the random variable X and P(x) is the probability of that value.
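A quick numeric check of this definition, using a fair six-sided die as the discrete random variable:

```python
# E[X] = sum_x x * P(x) for a fair die: (1 + 2 + ... + 6) / 6 = 3.5
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
expectation = sum(x * p for x, p in zip(values, probs))
print(expectation)  # 3.5
```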

Now we can rewrite our gradient as below:

∇θ J(πθ) = E_{τ∼πθ}[ ∇θ log P(τ|θ) R(τ) ]

Source: [6], [7], and [9]

We can derive this equation as follows [6][7][9]:

∇θ J(πθ) = ∇θ E_{τ∼πθ}[R(τ)]
= ∇θ Σ_τ P(τ|θ) R(τ)   (expand the expectation as a sum over trajectories)
= Σ_τ ∇θ P(τ|θ) R(τ)   (move the gradient inside the sum)
= Σ_τ P(τ|θ) ∇θ log P(τ|θ) R(τ)   (use ∇θ P(τ|θ) = P(τ|θ) ∇θ log P(τ|θ))
= E_{τ∼πθ}[ ∇θ log P(τ|θ) R(τ) ]   (rewrite the sum as an expectation)

Derivation of the expectation: Source: [6], [7], and [9]
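The key step in the derivation is the log-derivative (likelihood-ratio) trick, ∇θ P(τ|θ) = P(τ|θ) ∇θ log P(τ|θ). Here is a small numerical sanity check of the resulting estimator on a Bernoulli(θ) distribution with a hand-picked f, chosen so that the true gradient is exactly 3:

```python
import numpy as np

# grad_theta E_{x ~ Bern(theta)}[f(x)] estimated as E[f(x) * grad_theta log p_theta(x)].
# With f(0) = 2 and f(1) = 5, the expectation is 2(1 - theta) + 5*theta = 2 + 3*theta,
# so its gradient with respect to theta is 3 for every theta.
rng = np.random.default_rng(0)
theta = 0.4
x = rng.binomial(1, theta, size=200_000)                       # samples from p_theta
f = np.where(x == 1, 5.0, 2.0)                                 # f(x)
score = np.where(x == 1, 1.0 / theta, -1.0 / (1.0 - theta))    # grad_theta log p_theta(x)
print((f * score).mean())                                      # ~3.0, matching the analytic value
```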

The probability of a trajectory given the parameters θ, P(τ|θ), can be expanded as follows [6][7]:

P(τ|θ) = p(s0) Π_{t=0}^{T−1} πθ(at|st) P(st+1|st, at)

Probability of trajectory: Source: [6] and [7]

where p(s0) is the probability distribution over the starting state and P(st+1|st, at) is the transition probability of reaching the new state st+1 by performing the action at from the state st.

Taking the log of the trajectory probability turns the product into a sum [7]:

log P(τ|θ) = log p(s0) + Σ_{t=0}^{T−1} [ log πθ(at|st) + log P(st+1|st, at) ]

Log-probability of the trajectory: Source: [7]

Taking the gradient of the log-probability of a trajectory with respect to θ gives [6][7]:

∇θ log P(τ|θ) = ∇θ log p(s0) + Σ_{t=0}^{T−1} [ ∇θ log πθ(at|st) + ∇θ log P(st+1|st, at) ]

The gradient of the log-probability of a trajectory: Source: [6] and [7]

We can simplify this expression further: the terms ∇θ log p(s0) and ∇θ log P(st+1|st, at) vanish because neither the starting-state distribution nor the transition model depends on θ. This is what makes the policy gradient a model-free algorithm: the transition probability model is not needed.

∇θ log P(τ|θ) = Σ_{t=0}^{T−1} ∇θ log πθ(at|st)

Final derivation of the gradient of the log-probability of a trajectory: Source: [6] and [7]
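A minimal sketch of what this simplification buys us in practice: with a toy softmax policy over a table of logits (all names and numbers below are made up), only the Σt log πθ(at|st) part of the trajectory log-probability ever needs to be written down and differentiated; the initial-state and transition terms never appear.

```python
import numpy as np

theta = np.zeros((2, 2))   # logits for a toy policy over 2 states x 2 actions

def log_pi(theta, s, a):
    """log pi_theta(a|s) for a softmax policy over the logits theta[s]."""
    logits = theta[s]
    return logits[a] - np.log(np.exp(logits).sum())

trajectory = [(0, 1), (1, 0), (1, 1)]   # hypothetical (s_t, a_t) pairs from one rollout
policy_log_prob = sum(log_pi(theta, s, a) for s, a in trajectory)
print(policy_log_prob)  # the only part of log P(tau|theta) whose gradient we need
```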

We can now go back to the expectation form of our gradient and replace the gradient of the log-probability of a trajectory with the expression derived above.

The gradient of the objective function

Now the policy gradient expression is derived as

∇θ E_{τ∼πθ}[R(τ)] = E_{τ∼πθ}[ Σ_{t=0}^{T−1} ∇θ log πθ(at|st) R(τ) ]

Policy gradient expression: Source: [6] and [7]

Since J(πθ) = E_{τ∼πθ}[R(τ)], the left-hand side of the equation can be written as ∇θ J(πθ):

∇θ J(πθ) = E_{τ∼πθ}[ Σ_{t=0}^{T−1} ∇θ log πθ(at|st) R(τ) ]

Policy gradient expression: Source: [6] and [7]

REINFORCE

REINFORCE is the Monte-Carlo variant of the policy gradient methods. The RL agent samples complete trajectories, from the starting state to the goal state, directly from the environment, rather than bootstrapping as in other methods such as Temporal-Difference learning and Dynamic Programming.

Difference between Monte-Carlo, Temporal-Difference Learning, and Dynamic programming: Source: [10]

We can rewrite our policy gradient expression in the context of Monte-Carlo sampling.

∇θ J(πθ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T−1} ∇θ log πθ(a_{i,t}|s_{i,t}) R(τi)

REINFORCE expression: Source: [6] and [7]

where N is the number of trajectories used for one gradient update [6].
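In an automatic-differentiation framework, this estimate is usually implemented by building the scalar surrogate −(1/N) Σi (Σt log πθ(a_{i,t}|s_{i,t})) R(τi) and letting backpropagation produce the gradient. A minimal sketch, assuming PyTorch, with placeholder log-probabilities and returns standing in for N = 3 sampled trajectories:

```python
import torch

# Placeholder per-step log pi_theta(a_t|s_t) for three hypothetical trajectories of length 10;
# in a real agent these tensors come from the policy network, so gradients flow back into theta.
log_probs = [torch.randn(10, requires_grad=True) for _ in range(3)]
returns = [2.0, -1.0, 0.5]   # hypothetical R(tau_i)

# Minimising this negated surrogate with gradient descent performs gradient ascent on J.
loss = -torch.stack([lp.sum() * R for lp, R in zip(log_probs, returns)]).mean()
loss.backward()
print(log_probs[0].grad)     # filled with -R(tau_0) / N at every step of trajectory 0
```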

Pseudocode for the REINFORCE algorithm [11] (a minimal code sketch follows the steps):

1. Sample N trajectories by following the policy πθ.

2. Evaluate the gradient estimate using the expression below:

∇θ J(πθ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T−1} ∇θ log πθ(a_{i,t}|s_{i,t}) R(τi)

REINFORCE expression: Source: [6] and [7]

3. Update the policy parameters with the gradient ascent rule: θ ← θ + α ∇θ J(πθ).

4. Repeat steps 1 to 3 until we find the optimal policy πθ.
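To make the pseudocode concrete, here is a minimal REINFORCE sketch in Python. It assumes PyTorch and Gymnasium (CartPole-v1) are available, uses the undiscounted return R(τ) as the weight exactly as in the expression above, and is one illustrative way to implement the algorithm rather than a reference implementation:

```python
import gymnasium as gym
import torch
from torch import nn, optim
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# A small softmax policy pi_theta(a|s): the network outputs logits over actions.
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = optim.Adam(policy.parameters(), lr=1e-2)

N = 8  # trajectories sampled per gradient update

for update in range(200):
    surrogate_terms, episode_returns = [], []
    for _ in range(N):
        # 1. Sample one trajectory by following pi_theta.
        obs, _ = env.reset()
        log_probs, rewards, done = [], [], False
        while not done:
            dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
        R_tau = sum(rewards)                                   # undiscounted return R(tau_i)
        episode_returns.append(R_tau)
        surrogate_terms.append(torch.stack(log_probs).sum() * R_tau)

    # 2.-3. Evaluate the gradient estimate and take a gradient ascent step
    # (implemented as gradient descent on the negated surrogate).
    loss = -torch.stack(surrogate_terms).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if update % 20 == 0:
        print(update, sum(episode_returns) / N)                # average return per update
```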

If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.
