Exploring the Foundations of Modern Reinforcement Learning Algorithms

Seyed Saeid Masoumzadeh, PhD
8 min read · Jul 10, 2023


Introduction

When I sat down to write this blog post, I was fascinated by how ChatGPT utilizes reinforcement learning (RL) in its training process. Initially, my plan was to explain RL in the context of ChatGPT. However, I’ve decided to take a different approach. Instead, I will begin by offering a beginner-friendly explanation of the key ideas behind modern RL approaches. This will help readers become acquainted with the basic principles of RL before delving into a separate blog post dedicated solely to ChatGPT.

This blog post explores the integration of two fundamental categories of reinforcement learning approaches: Temporal Difference (TD) learning and Policy Gradient Methods. Specifically, it focuses on two widely used modern algorithms, Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO), that combine TD learning and policy gradients with the power of deep learning. The post dives into the inner workings of DQN and PPO, highlighting their unique characteristics and advantages. It discusses how DQN leverages TD learning to estimate state-action values using deep neural networks, while PPO directly optimizes policies through gradient ascent.

Reinforcement Learning (RL) Key Components:

RL has five key components, which I will explain using the example of a robot in a maze:

  1. Agent: Imagine a robot in a maze. The robot is the RL agent that explores the maze, receives information through its sensors (like cameras or distance sensors), and decides which direction to move in the maze to find the exit door.
  2. Environment: The maze itself represents the environment. It is the external system in which the robot operates. The maze provides feedback to the robot based on its actions. For example, if the robot moves towards the correct path or finds the exit door, it receives positive rewards. If it hits a wall or takes a wrong turn, it might receive negative rewards or penalties.
  3. State: The state in this case is the robot’s current position in the maze. It helps the robot understand where it is relative to the walls, pathways, or potential hazards. The state assists the robot in deciding which direction to take next to navigate through the maze successfully.
  4. Action: The actions available to the robot are the possible movements it can make in the maze, such as moving up, down, left, or right. The robot selects its action based on its current state and goal of reaching the exit while avoiding obstacles.
  5. Reward: The rewards are the feedback signals the robot receives based on its actions in the maze. Positive rewards are given when the robot moves closer to the exit or finds the correct path. Negative rewards or penalties may be assigned if the robot hits a wall, gets stuck, or takes a wrong turn.

The robot as an RL agent interacts with the maze environment, observes its state, takes actions to navigate, and receives rewards or penalties based on its performance. Through trial and error, the robot learns to make better decisions and optimize its path-finding strategy to maximize the cumulative rewards by reaching the exit efficiently.
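
To make this interaction loop concrete, here is a minimal sketch in Python. The one-dimensional corridor standing in for the maze, the reward values, and the random action choice are illustrative assumptions, not part of the example above.

```python
import random

# Toy 1-D "maze": the agent starts in cell 0 and the exit is cell 4.
# Assumed reward scheme: -1 per step, +10 for reaching the exit.
N_CELLS, EXIT = 5, 4

def step(state, action):
    """Environment dynamics: action 0 moves left, action 1 moves right."""
    next_state = max(0, min(N_CELLS - 1, state + (1 if action == 1 else -1)))
    reward = 10 if next_state == EXIT else -1
    return next_state, reward, next_state == EXIT

state, total_reward, done = 0, 0, False
while not done:
    action = random.choice([0, 1])            # an untrained agent acting at random
    state, reward, done = step(state, action)
    total_reward += reward                    # the quantity the agent wants to maximize

print("episode return:", total_reward)
```

An RL algorithm replaces the random choice with a decision rule learned from the rewards, which is exactly what the rest of this post is about.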

Temporal Difference (TD) Learning

TD learning is one of the learning strategies behind reinforcement learning algorithms; it combines elements from dynamic programming and Monte Carlo methods. To explain how TD learning works, I will do so in the context of Q-learning. Q-Learning is one of the classical and most popular RL algorithms. It is also considered the foundation of Deep Q-Networks (DQN), which combine Q-learning with deep neural networks to handle high-dimensional state spaces and achieve better performance in complex environments.

Q-Learning

Q-Learning tries to optimize the Q function. The Q function, denoted as Q(s, a), tells us the expected cumulative reward the agent can get by taking a particular action (a) in a specific state (s). It helps the agent make decisions by evaluating the value of different actions in different states. In simple problems like the “robot in a maze” example, the Q function can be represented as a lookup table because the state and action spaces are small and discrete. Refer back to the key components of RL above if you need a refresher on states and actions.

Q Function in Q Learning
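
For a problem this small, the Q function really is just a table: one row per state and one column per action. A minimal sketch, assuming a 5x5 maze (25 discrete states) and four movement actions:

```python
import numpy as np

n_states, n_actions = 25, 4              # e.g. a 5x5 maze with up/down/left/right moves
Q = np.zeros((n_states, n_actions))      # Q[s, a] = estimated cumulative reward

# Evaluating an action in a state is a simple table lookup:
s, a = 7, 2
print("Q(s=7, a=2) =", Q[s, a])
```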

Now, let’s explore Temporal Difference (TD) learning, which is a strategy behind Q-learning to optimize the Q-table/function. TD learning works iteratively, updating the Q-values based on the immediate rewards observed and estimates of future rewards. The special thing about TD learning is that it learns incrementally from each time step. This allows for continuous learning and improvement throughout the learning process. Below is how the Q-values are updated by TD learning in each time step.

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[\, r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \,\right]

Q-value update equation: α is the learning rate, γ is the discount factor, and the bracketed term (the TD target minus the current estimate) is the TD error.

Please note that TD learning leverages the Bellman equation to update Q-values iteratively. The Bellman equation is the basis of the update rule which generates the TD target.

The next step is to find the optimal policy π. The optimal policy involves starting from a given state, selecting the action with the highest value in that state from the Q-table, and continuing this process as the agent moves through the environment to maximize the expected cumulative rewards.
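
Putting the update rule and the greedy policy together, here is a minimal tabular Q-learning sketch. The table size, learning rate, and discount factor are assumed values for illustration.

```python
import numpy as np

n_states, n_actions = 25, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9                  # learning rate and discount factor (assumed)

def td_update(s, a, r, s_next):
    """One TD / Q-learning update for a single observed transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])    # Bellman-based TD target
    td_error = td_target - Q[s, a]               # how far the current estimate is off
    Q[s, a] += alpha * td_error                  # move the estimate toward the target

def greedy_action(s):
    """The learned policy: pick the highest-valued action in state s."""
    return int(np.argmax(Q[s]))
```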

Deep Q Network (DQN)

Q Function in DQN

DQN uses the Q-network to estimate the Q-values for each action in a state, and the action with the highest Q-value is chosen. The network aims to minimize the mean squared error (MSE) loss between the predicted Q-values and the target Q-values.

The target Q-values are computed based on the Bellman equation, using the observed rewards and the maximum Q-value of the next state, which is fundamentally the same idea I explained in the Q-learning concept. However, the way the Bellman equation is applied in DQN is different. The target Q-value is estimated by a separate network, called the target network, which is a copy of the main Q-network with frozen parameters. These frozen parameters are periodically copied over from the main Q-network, and the update interval is a hyperparameter.

Similar to Q-learning, the optimal policy π involves starting from a given state, selecting the action with the highest value amongst the ones estimated by the Q-network, and continuing this process as the agent moves through the environment to maximize the expected cumulative rewards.
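
Below is a minimal PyTorch-style sketch of how these pieces fit together. The network sizes, hyperparameters, and tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

n_state_features, n_actions, gamma = 8, 4, 0.99   # assumed problem dimensions

q_net = nn.Sequential(nn.Linear(n_state_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_state_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())    # frozen copy of the main Q-network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(states, actions, rewards, next_states, dones):
    """One gradient step on a batch of transitions.

    states/next_states: float tensors (batch, n_state_features); actions: long tensor
    (batch,); rewards/dones: float tensors (batch,), with dones being 0.0 or 1.0.
    """
    # Predicted Q-values for the actions that were actually taken
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # TD target from the frozen target network (Bellman equation)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * q_next
    loss = nn.functional.mse_loss(q_pred, q_target)   # MSE between prediction and target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every N updates (a hyperparameter), refresh the frozen copy:
#     target_net.load_state_dict(q_net.state_dict())
# Acting greedily in a state: q_net(state).argmax(dim=-1)
```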

Policy Gradient Methods

Unlike TD learning approaches, which estimate the value of different state-action pairs as an intermediate step toward finding the optimal policy, policy gradient methods bypass this intermediate step and directly estimate the policy that maximizes the expected cumulative reward.

REINFORCE

REINFORCE is one of the classical policy gradient algorithms. In REINFORCE, the policy is typically parameterized by a function approximator that estimates the probability of an action given a state. The algorithm iteratively updates the policy parameters by computing gradient estimates and performing gradient ascent to maximize the expected cumulative reward.

A REINFORCE agent interacts with the environment by following the current policy (the initial policy is usually random, because the parameters of the policy function are initialized randomly). After applying a sequence of actions, it receives a sequence of rewards and then calculates the return, typically by summing the discounted rewards from the current time step until the end of the episode, as below:

G_t = r(t) + \gamma \, r(t+1) + \gamma^2 \, r(t+2) + \dots + \gamma^{T-t} \, r(T)

Here, r(t) represents the reward at time step t, γ is the discount factor for future rewards, and T is the end of the episode.

The next step is to compute the policy gradient, which is the gradient of the expected cumulative reward with respect to the policy parameters θ. In REINFORCE it is estimated as:

\nabla_\theta J(\theta) \approx \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t

And then the policy parameters are updated by gradient ascent using a learning rate α:

\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)

It is commonly practiced to use a neural network as a function approximator in the context of the REINFORCE algorithm. By using a neural network as the function approximator, you can benefit from its ability to handle high-dimensional state spaces, capture non-linear relationships, and generalize well across different states.

Neural network as a function approximator
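
Here is a minimal PyTorch-style sketch of one REINFORCE update, assuming a small policy network over a discrete action space; the layer sizes and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # assumed sizes
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    """One policy-gradient step from a single completed episode.

    states: list of 1-D state tensors; actions: list of ints; rewards: list of floats.
    """
    # Discounted return G_t for every time step, computed backwards from the episode end
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Log-probability of each action the agent actually took under the current policy
    logits = policy(torch.stack(states))
    log_probs = torch.log_softmax(logits, dim=-1)
    taken = log_probs.gather(1, torch.tensor(actions).unsqueeze(1)).squeeze(1)

    # Gradient ascent on sum_t G_t * log pi(a_t | s_t) == descent on its negative
    loss = -(taken * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```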

Proximal Policy Optimization (PPO)

In 2017, OpenAI published a paper called Proximal Policy Optimization Algorithms, introducing a new family of policy gradient methods. In the abstract, the authors highlighted the following three points:

  • PPO alternates between sampling data through interaction with the environment and optimizing a surrogate objective function using stochastic gradient ascent.
  • Unlike standard policy gradient methods that perform one gradient update per data sample, PPO introduces a novel objective function that allows for multiple epochs of minibatch updates.
  • PPO offers some of the benefits of Trust Region Policy Optimization (TRPO), but it is simpler to implement, more general, and has better sample complexity empirically.

To better understand PPO, it is essential to explore the motivation behind TRPO, which serves as the foundation for PPO. Before TRPO, policy optimization methods had proved effective in certain scenarios, but they frequently suffered from instability, high variance in gradient estimates, and difficulty finding optimal policies in high-dimensional continuous action spaces.

A big update can lead to a policy collapse — Source: https://www.boyuai.com/

TRPO addresses these concerns by introducing a trust region constraint on policy updates. This constraint limits how much the policy can change in each iteration, guaranteeing that the updated policy remains close to the previous one as measured by the Kullback–Leibler (KL) divergence. Consequently, TRPO ensures stability and prevents drastic policy changes that could result in poor performance or divergence. For a detailed explanation of TRPO, see the original paper [3].
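
For readers who want to see it written out, the trust-region step can be stated as maximizing the expected probability ratio between the new and old policies, weighted by an advantage estimate \hat{A}_t, subject to a bound δ on the average KL divergence:

\max_\theta \; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \, \hat{A}_t \right] \quad \text{subject to} \quad \mathbb{E}_t\!\left[ \mathrm{KL}\big( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \big) \right] \le \delta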

The key differences between TRPO and PPO are as follows:

  • PPO controls policy divergence by applying a clipping mechanism to the surrogate objective function (a minimal sketch of this clipped objective follows this list), while TRPO uses the Kullback-Leibler (KL) divergence to restrict policy updates to a trust region.
  • By using a clipping mechanism, PPO offers faster computation and allows for more iterations within a given timeframe.
  • PPO tends to be more sample-efficient than TRPO because it allows for larger policy updates within the clipping range.
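
As a concrete illustration of that clipping mechanism, here is a minimal PyTorch-style sketch of the clipped surrogate loss. The function name and arguments are assumptions for illustration, and 0.2 is a commonly used value for the clip range; log-probabilities and advantage estimates are taken as given.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective, negated so it can be minimized with SGD/Adam."""
    # Probability ratio r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (element-wise minimum) of the two surrogates and average
    return -torch.min(unclipped, clipped).mean()
```

Because the clipped objective itself discourages large policy changes, the same batch of sampled data can safely be reused for several epochs of minibatch updates, which is the second point from the abstract quoted above.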

References

[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.

[2] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.

[3] Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. arXiv:1502.05477.
