A Deep Dive Into Vanilla Policy Gradients

Ayan Nair
Published in Analytics Vidhya
Feb 8, 2021 · 7 min read

Computers have conquered humans — in games, that is.

It’s hard to believe that a computer can play games the way humans do, reasoning through actions and interacting with objects in a game. But the prowess of computers in games has already been proven: AI systems have mastered games like Go and Dota 2, beating top-ranked professionals in both.

Watching these systems work raises a lot of questions. How do they actually function? How did they learn to play video games without any human assistance? To answer such questions, we need to plunge into the field of reinforcement learning.

Reinforcement learning is what allowed this AI to play Atari Breakout so well! Source

What is Reinforcement Learning?

Reinforcement learning allows computers to autonomously learn to perform tasks by reinforcing positive behaviors.

The general reinforcement learning framework. Source

Reinforcement learning revolves around two components: an agent and an environment. Continuing with the running example of playing video games, the agent is the part of the model playing the game, while the environment is the game itself.

The agent interacts with an environment through actions. Moving a character forward or firing a weapon in a game are both possible actions that an agent can take.

To decide what action to take, the agent is given the state of the environment. Our video game agent takes in states in the form of individual frames from the game, analyzing them to determine the best action to take in response. If an action leads to a positive outcome (such as an increase in score), the agent receives a reward, encouraging it to take similar actions in the future. Penalties are often given for actions that lead to negative outcomes.

In reinforcement learning models like the ones discussed here, the agent’s decision-making is handled by a neural network known as the policy network. A state is fed into the policy network, which outputs the action to take given that state, typically as a probability distribution over the possible actions.

Note: If you’re not familiar with neural networks, just think of them as mathematical functions that can be trained to perform certain tasks. Read more about them here.
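
To make this more concrete, here is a minimal sketch of what a policy network might look like in code. The article doesn’t include any code, so the framework (PyTorch), the input size, and the number of actions below are purely illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy policy network: maps a flattened game frame to scores over actions.
# The 84*84 input size and the 4 actions are made-up placeholders.
policy_net = nn.Sequential(
    nn.Linear(84 * 84, 128),
    nn.ReLU(),
    nn.Linear(128, 4),  # one output per possible action
)

state = torch.rand(84 * 84)  # stand-in for a preprocessed game frame
dist = torch.distributions.Categorical(logits=policy_net(state))
action = dist.sample()       # the action the agent sends to the game
```

Sampling from the output distribution, rather than always picking the highest-scoring action, is what lets an untrained network try many different actions.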

Training a Policy Network

Like other neural networks, the policy network has to be trained before it can decide on proper actions.

The network starts off randomly initialized, meaning its parameters are random numbers and the model has not yet been trained. Because it is untrained, the network generates essentially random actions. For example, an untrained video game agent might move its character aimlessly and score no points.

However, after many games, the agent eventually stumbles upon a sequence of actions that leads to a positive outcome. These actions are positively reinforced, pushing the network to choose actions that produce similar positive outcomes. Conversely, actions leading to negative reward are negatively reinforced, deterring the algorithm from taking them again.

This reinforcement framework leads the network to execute more informed actions, allowing the policy to reap more reward. After continually repeating this process, the policy gains an understanding of what sequences of actions lead to high rewards.

This process is known as the vanilla policy gradient algorithm. The policy is trained to take actions that cultivate higher reward and avoid actions that reduce reward.

Inside the Vanilla Policy Gradient Algorithm

The idea behind the vanilla policy gradient algorithm is rather simple, but it requires many mathematical components to work properly. Delving into these components can help with understanding the algorithm.

Training the policy always starts with the agent running in the environment: it receives a series of states and decides on an appropriate action for each one. This series of states and actions is stored in a list called a trajectory.
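
As a rough sketch of what collecting a trajectory could look like, assuming a Gym-style environment with the classic reset()/step() interface and the toy policy_net from the earlier snippet (none of these details come from the article):

```python
import torch

def collect_trajectory(env, policy_net, max_steps=1000):
    # Roll the policy out in the environment and record what happened.
    states, actions, rewards = [], [], []
    state = env.reset()
    for _ in range(max_steps):
        state_t = torch.as_tensor(state, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy_net(state_t))
        action = dist.sample()
        next_state, reward, done, _ = env.step(action.item())
        states.append(state_t)   # the trajectory: states and actions...
        actions.append(action)
        rewards.append(reward)   # ...plus the rewards earned along the way
        state = next_state
        if done:
            break
    return states, actions, rewards
```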

The information in the trajectory is fundamental to calculating the rest of the components that the policy needs to learn. One of these components is the reward function.

Reward Functions

The reward function computes the total reward, or return, gained from the actions in a trajectory. Because the agent’s performance relies entirely on the reward gained from its actions, it is essential to compute this return in a sensible way.

A few choices exist for possible reward functions.

The finite-horizon undiscounted return simply sums up the rewards from a set of actions. Being a plain summation, it is easy to understand and implement, but alternative return functions can be more effective.

Finite-horizon undiscounted return function formula. Source
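
Written out in the notation used by OpenAI’s Spinning Up (which this article draws on), the finite-horizon undiscounted return of a trajectory τ is R(τ) = r0 + r1 + … + rT, the sum of the rewards collected over a fixed window of T steps.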

Infinite-horizon discounted return is quite similar to the finite-horizon undiscounted return, but introduces a new coefficient called the discount factor, a number between zero and one. Each reward is weighted by the discount factor raised to a power corresponding to how far in the future that reward is collected, so rewards far in the future count for less than immediate ones.

Infinite-horizon discounted return function formula. Source
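
In the same notation, the infinite-horizon discounted return is R(τ) = r0 + γ·r1 + γ²·r2 + …, where γ is the discount factor. The further in the future a reward arrives, the higher the power of γ it is multiplied by, and therefore the less it contributes to the total.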

Despite their differences, both return functions calculate the same thing: the overall reward gained from a set of actions.
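
As a tiny illustrative helper (not code from the article), either return can be computed from a trajectory’s list of rewards; setting gamma to 1.0 recovers the plain undiscounted sum.

```python
def compute_return(rewards, gamma=0.99):
    # Sum up a trajectory's rewards, discounting later rewards by gamma.
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total
```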

Value Functions

The value function is used by the algorithm to predict the reward the agent can expect to collect from a given state onward.

The vanilla policy gradient algorithm uses an on-policy value function, which essentially means that the value estimates are based on experience collected by the latest version of the policy as it interacts with the environment.

The primary value function utilized in the vanilla policy gradient algorithm. Source

The ‘E’ denotes an expectation, and the ‘s0’ term inside the brackets corresponds to the starting state. Putting this together, the value function finds the expected reward (E) if the agent starts at a particular state (s0) and follows a given policy.
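
Spelled out, the on-policy value function reads Vπ(s) = E[R(τ) | s0 = s], where the expectation is taken over trajectories τ produced by running the policy π: the expected return when the agent starts in state s and acts according to π from then on.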

Vanilla policy gradients also require a second function to calculate advantage (we’ll talk about this soon). This function is called the on-policy action-value function. Similar to the on-policy value function, it calculates an expected reward, but also makes use of actions in its equation.

The second Q-function utilized by the vanilla policy gradient algorithm. Source

Once again, the ‘E’ corresponds to the expected reward and the ‘s0’ corresponds to the starting state. The ‘a0’ term, however, corresponds to an arbitrary starting action, which need not come from the policy, thus enabling the function to evaluate specific actions.
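
In symbols, the on-policy action-value function reads Qπ(s, a) = E[R(τ) | s0 = s, a0 = a]: the expected return when the agent starts in state s, takes action a first, and follows the policy π afterwards.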

These two functions comprise the necessary value equations for the vanilla policy gradient calculation.

Advantage Functions

When you go out to buy a car, you often do not make a decision on a whim. You compare the different cars, attempting to find the advantages of buying one car over another.

Reinforcement learning utilizes a very similar concept, except instead of cars, the algorithm compares different actions. And to do this, the algorithm employs an advantage function.

The algorithm cannot compare and contrast ideas like humans; it has to find a way to quantify advantage. The advantage function does this using a rather simple formula.

Advantage function formula. Source

The advantage is calculated by subtracting the on-policy value function from the on-policy action-value function. In other words, it measures how much better a specific action is than what the policy would do on average in that state.
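
In symbols, Aπ(s, a) = Qπ(s, a) − Vπ(s). For example, if taking a particular action from some state is expected to earn a return of 10 (its action-value Q), while the policy’s typical behavior from that state earns 7 (the value V), then that action has an advantage of 10 − 7 = 3 and should be reinforced.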

Advantage, value, and reward functions are the main puzzle pieces necessary to understand the internal mechanisms of the vanilla policy gradient algorithm. Let’s put these pieces together.

Putting it Together

Before the previously-discussed functions can be employed, some foundational steps are necessary.

The policy and value functions need parameters to operate, and these parameters start off as random numbers. As the policy and value functions are trained, however, these parameters are gradually tweaked to allow both networks to perform better.

After parameters are defined, the algorithm begins training.

The policy runs in the environment and collects a series of states and actions, which it stores in a trajectory. The reward function then uses this trajectory to compute a return, and the advantage function estimates how much better or worse each action was using the value and action-value functions.

The next step computes a policy gradient, which is then used in the subsequent step to tune the policy network’s parameters. Just think of these two steps as the ‘learning steps’ of the algorithm in which the policy network learns to make better decisions.
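
Here is a rough sketch of what those two ‘learning steps’ might look like in code, continuing the earlier illustrative PyTorch snippets. The log-probabilities of the chosen actions and the advantage estimates are assumed to have already been computed from a collected trajectory; none of this code comes from the article.

```python
import torch

optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def update_policy(log_probs, advantages):
    # Weight each action's log-probability by its advantage; pushing this
    # sum upward (by minimizing its negative) is the policy gradient step.
    loss = -(torch.stack(log_probs) * torch.as_tensor(advantages)).sum()
    optimizer.zero_grad()
    loss.backward()   # computes the policy gradient
    optimizer.step()  # tunes the policy network's parameters
```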

The last step involves updating the parameters of the value function, allowing it to produce better estimates of expected reward. After this step, the model returns to the initial step of collecting trajectories, and the process repeats until the policy network is able to make good decisions in its environment.
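
And a similarly rough sketch of the value-function update, fitting a separate value network (again with made-up sizes) to the returns actually observed in the collected trajectories:

```python
import torch
import torch.nn as nn

# Value network: maps a state to a single predicted return.
value_net = nn.Sequential(nn.Linear(84 * 84, 128), nn.ReLU(), nn.Linear(128, 1))
value_optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def update_value_function(states, returns):
    predictions = value_net(torch.stack(states)).squeeze(-1)
    loss = ((predictions - torch.as_tensor(returns)) ** 2).mean()  # MSE regression
    value_optimizer.zero_grad()
    loss.backward()
    value_optimizer.step()
```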

The vanilla policy gradient algorithm is only one algorithm in the wide scope of reinforcement learning. Others, like Q-Learning and Deep Q-Networks, have achieved remarkable feats as well! Regardless of the specifics of each algorithm, however, one thing is certain: reinforcement learning is here to stay and will undoubtedly further transform the field of AI as it improves.

Check out OpenAI’s Spinning Up website for more detailed explanations of mathematical functions and their derivations. Much of this information came from their webpage; it offers detailed explanations of reinforcement learning concepts and I highly recommend visiting their site if you want to delve into reinforcement learning!

Thank you for reading to the end! Don’t forget to leave a 👏 as well!

A bit about me — I’m a 17-year-old who’s really into disruptive technologies, primarily artificial intelligence. If you liked this article or want to talk about interesting deep learning/machine learning projects, research papers, or ideas, message me through my LinkedIn, Instagram, or e-mail (ayan.aji.nair@gmail.com)! You can also get updates on cool projects of mine and new articles I write through my monthly newsletter — subscribe here!
