The Actor-Critic Reinforcement Learning algorithm

Dhanoop Karunakaran
Intro to Artificial Intelligence
5 min read · Sep 30, 2020
Actor-Critic architecture. Source: [1]

Policy-based and value-based RL algorithms

Please refer here for the policy gradient algorithm basics and refer here for the value-based RL algorithm basics.

In policy-based RL, the optimal policy is computed by manipulating the policy directly, while value-based RL finds the optimal policy implicitly by finding the optimal value function. Policy-based RL is effective in high-dimensional and continuous action spaces and at learning stochastic policies, whereas value-based RL excels in sample efficiency and stability.

The main challenge of policy gradient RL is the high variance of the gradient estimate. The standard approach to reducing this variance is to subtract a baseline function b(s_t) [4]. A common concern is that subtracting a baseline introduces bias into the gradient estimate, but it can be proven that the baseline adds no bias.

Proof that the baseline is unbiased

The policy gradient expression of the REINFORCE algorithm is shown below:

Expectation form of policy gradient expression of REINFORCE
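In standard notation, writing J(θ) for the expected return and T for the episode length (both introduced here for concreteness), this can be written as:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big]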

We can write the reward of a trajectory, R(τ), as below:
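Assuming an episodic setting with per-step reward r(s_t, a_t) (notation introduced here), one common form is:

R(\tau) = \sum_{t=0}^{T} r(s_t, a_t)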

Adding the baseline function then modifies the policy gradient expression as below:

Inserting the baseline function
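With the same notation, the baselined gradient takes the form:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(R(\tau) - b(s_t)\big)\Big]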

We can call the term combining the return and the baseline the advantage function. It can be denoted as below:

Advantage function
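In the notation above:

A(s_t, a_t) = R(\tau) - b(s_t)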

An important point to note in the above equation is that the baseline b is a function of the state s_t only and does not depend on the action a_t [4].

We can rearrange the expression as below:

Source: [4]

The above equation has the form E(X−Y). Due to the linearity of expectation, we can rewrite E(X−Y) as E(X)−E(Y) [3]. So the above equation becomes:

Source: [3][4]
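Writing out the term for a single timestep t for readability, the split reads:

\mathbb{E}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(R(\tau) - b(s_t)\big)\Big] = \mathbb{E}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big] - \mathbb{E}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\Big]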

If the second term, which contains the baseline, is zero, then adding the baseline function b introduces no bias into the gradient estimate. That means:

Source: [3]
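Concretely, the claim is that the baseline term vanishes in expectation:

\mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big] = 0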

We can generalize the expectation as below:
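One way to state the generalization is via the definition of expectation for a generic parameterized density p_θ(x) and function f(x) (symbols introduced here):

\mathbb{E}_{x \sim p_\theta(x)}\big[f(x)\big] = \int p_\theta(x)\, f(x)\, dx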

The proof that the second term is zero is shown below:

Source: [3]
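Conditioning on s_t and applying this definition together with the identity ∇_θ π_θ = π_θ ∇_θ log π_θ, the second term works out to zero:

\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big]
= b(s_t) \int \pi_\theta(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, da_t
= b(s_t) \int \nabla_\theta \pi_\theta(a_t \mid s_t)\, da_t
= b(s_t)\, \nabla_\theta \int \pi_\theta(a_t \mid s_t)\, da_t
= b(s_t)\, \nabla_\theta 1
= 0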

The derivation above proves that adding the baseline function introduces no bias into the gradient estimate.

Actor-critic

In simple terms, Actor-Critic is a Temporal Difference (TD) version of the policy gradient method [3]. It has two networks: the actor and the critic. The actor decides which action should be taken, and the critic informs the actor how good the action was and how it should adjust. The actor learns via the policy gradient approach, while the critic evaluates the action produced by the actor by computing the value function.
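For concreteness, a minimal sketch of these two networks in PyTorch might look as follows (assuming a discrete action space; the class names, layer sizes, and hidden width are illustrative choices, not taken from the references):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network pi_theta(a | s): outputs a probability per action."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions), nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)   # action probabilities

class Critic(nn.Module):
    """Value network V(s): outputs a single scalar state value."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)   # estimated V(s)
```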

This architecture is similar to a Generative Adversarial Network (GAN), where the discriminator and the generator participate in a game [2]. The generator produces fake images, and the discriminator evaluates how good a generated image is against its representation of real images [2]. Over time, the generator learns to create fake images that the discriminator cannot distinguish from real ones [2]. Similarly, the actor and the critic participate in a game, but unlike in a GAN, both of them improve over time [2].

Actor-Critic is similar to the policy gradient algorithm REINFORCE with baseline. REINFORCE is a Monte Carlo method: the total return is sampled from the full trajectory. In Actor-Critic, however, we use bootstrapping, so the main change is in the advantage function.

The total return in the original policy gradient advantage function is replaced by a bootstrapped estimate. Source: [3]

Finally, b(s_t) is replaced by the value function of the current state. It can be denoted as below:
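In symbols:

b(s_t) = V(s_t)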

We can now write the modified advantage function for Actor-Critic:

Advantage function of Actor-Critic algorithm
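Writing r_{t+1} for the reward received after taking a_t in s_t and γ for the discount factor (both introduced here), the bootstrapped advantage is:

A(s_t, a_t) = r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) = \delta_t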

In the Actor-Critic framework, the advantage function is also called the TD error. As mentioned above, the actor learns via the policy gradient. The policy gradient expression of the actor is shown below:

Policy-gradient expression of the actor
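In the notation above, this can be written as:

\nabla_\theta J(\theta) \approx \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \delta_t\big]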

Pseudocode of the Actor-Critic algorithm [6]

1. Sample {s_t, a_t} using the policy π_θ from the actor network.
2. Evaluate the advantage function A_t. It can also be called the TD error δ_t. In the Actor-Critic algorithm, the advantage function is produced by the critic network.
3. Evaluate the gradient using the policy gradient expression of the actor shown above.
4. Update the policy parameters θ.
5. Update the weights of the critic using value-based RL (e.g., Q-learning); δ_t is equivalent to the advantage function.
6. Repeat steps 1 to 5 until the optimal policy π_θ is found (one possible implementation of a single iteration is sketched after this list).
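Below is a minimal sketch of one such update step (steps 1–5), assuming the Actor/Critic modules sketched earlier, a classic Gym-style environment whose step() returns (next_state, reward, done, info), and illustrative hyperparameters; the function name and arguments are hypothetical:

```python
import torch

GAMMA = 0.99  # illustrative discount factor

def actor_critic_update(env, state, actor, critic, actor_opt, critic_opt):
    """One Actor-Critic update; `state` is a 1-D float tensor."""
    # 1. Sample a_t from pi_theta(. | s_t) using the actor network.
    probs = actor(state)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    next_state, reward, done, _ = env.step(action.item())
    next_state = torch.as_tensor(next_state, dtype=torch.float32)

    # 2. Advantage / TD error: delta_t = r + gamma * V(s_{t+1}) - V(s_t).
    with torch.no_grad():
        v_next = torch.zeros(1) if done else critic(next_state)
    delta = reward + GAMMA * v_next - critic(state)

    # 3-4. Policy gradient step for the actor:
    #      minimize -log pi_theta(a_t | s_t) * delta_t.
    actor_loss = -dist.log_prob(action) * delta.detach()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # 5. Value-based (TD) update of the critic, using delta_t as the error.
    critic_loss = delta.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    return next_state, done
```

Note that δ_t is treated as a constant in the actor loss (hence the detach and no_grad), so the actor update does not backpropagate into the critic, and the critic is trained with a semi-gradient TD(0) update.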

If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.
