Exploring Reinforcement Learning: A Hands-on Example of Teaching OpenAI’s Lunar Lander to Land Using Actor-Critic Method with Proximal Policy Optimization (PPO) in PyTorch

Hasan Hüseyin Uluay
13 min read · Oct 30, 2023


Although this article is recommended for anyone with an interest in artificial intelligence, it will suit you best if you already know something about computer vision, natural language processing, or other application areas of neural networks. We will not dive into all the details of neural networks, for example what the loss function is, how to perform backpropagation with gradient descent, what hyperparameters are, or what labels are, and we will only touch briefly on classic control methods.

Introduction

With the rapid development of technology in recent centuries, probably the last invention of human intelligence, "Artificial Intelligence", has started to become a cornerstone in almost all industries. In today's world, when we say AI we mostly mean data-driven approaches such as Machine Learning (ML). Within ML, you'll often hear about deep learning (DL), a subset that uses neural networks to solve even more complex problems. The main difference between neural networks and "traditional" ML algorithms is that neural networks can capture both linear and non-linear relationships in the data. Because of this tremendous ability, neural networks are also called "universal function approximators". Their weak point is that they need a lot of data and processing power. "A lot" does not always mean a petabyte of data, because what matters is the variety of the data together with its volume. In reality, it is not always possible to create a large data set that has good variety, meaningful sequences (not important in vision, but crucial when training robots), true values from the real world, and, of course, a reasonable volume. In such cases we have a slightly older friend than DL: simulations!

Why are we using simulations?

Good question. We do not strictly have to, but reinforcement learning is inherently connected with simulations. First of all, there is no truly cost-effective alternative: you cannot risk billion-dollar machines to train a model. Also, as we said, most deep learning methods require data with high volume and good variety, and this variety with good sequencing can be provided by a well-designed simulation environment very effectively and efficiently, even in parallel! So reinforcement learning models can be trained in the same simulations that industry already uses every day.

Although simulations are not perfect descriptions of reality, they are ENOUGH for RL models to capture the essence and learn how to fly a plane or even land a rocket vertically! It is important to note, however, that making simulations reflect reality better is still an open discussion, especially in extreme conditions like turbulence.

But how do we use simulations?

Okay, let's slowly dive into the technical part. The core idea behind simulations is the time step (also known as delta time, dt). Simulations live in dt loops: we have a description of the environment and random initial conditions for every aspect, such as the positions of every water particle, and every time dt passes we update these values according to the behavior defined in the environment model. The environment model is generally built from our knowledge of the laws of nature, but there are also "purely" data-driven approaches.

In the RL literature we call the simulation the environment and our robot (say) the agent. Both classic control theory and reinforcement learning share this environment-agent duality: the agent observes the environment as states, performs actions, and tries to reach a target. We can think of the target as a desired state (or states) we want to achieve. Sometimes those states are the final state we want to end up in, but they do not have to be. Marking potentially "promising" states based on our human perception is sometimes good practice and helps the agent converge faster, and it also lets us grade the agent's behavior. But do not expect your perception to always be meaningful to an RL agent; we will explain what we mean in the next parts.

The main idea behind reinforcement learning with deep neural networks is that the agent itself does the internal evaluation and tunes its own parameters, just as in regular deep learning! This separates it from traditional control approaches, where we have to describe the entire behavior of the agent using advanced mathematical techniques (transfer functions, basically) and tune its parameters to perform actions that reach the target. Besides, it is usually not possible for our human brains to directly describe the agent's behavior in complex environments.
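To make this concrete, here is a minimal sketch of that agent-environment dt loop using Gym's LunarLander-v2 environment; the random agent is just a placeholder for the networks we will build later.

import gym

# A minimal sketch of the agent-environment loop: each pass of the while
# loop is one time step (dt). The "agent" here is random sampling,
# standing in for the actor network we will train later.
env = gym.make('LunarLander-v2')
state, _ = env.reset()          # random initial conditions
done = False
total_reward = 0

while not done:
    action = env.action_space.sample()                          # placeholder decision
    state, reward, terminated, truncated, _ = env.step(action)  # advance one dt
    done = terminated or truncated
    total_reward += reward

print(f'Episode finished with total reward {total_reward:.1f}')
env.close()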

So what frees us from modeling the agent by hand and gets us past the limitations of both mathematics and the human brain? An optimization technique called "gradient descent" combined with "backpropagation"! But to keep this article shorter we will skip the details of how neural networks work and treat them as a black box: a function that takes the states as input and returns a result "tuple" whose length equals our action space (action space: all the actions our agent can perform).
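As a tiny sketch of that black box, here is a network that maps the lunar lander's 8 state values to 4 outputs, one per action; the hidden layer size here is my own arbitrary choice, not the architecture used later in this article.

import torch
import torch.nn as nn

# The "black box": states in, one value per action out.
# The 64-unit hidden layer is illustrative only.
black_box = nn.Sequential(
    nn.Linear(8, 64),    # 8 state values for the lunar lander
    nn.ReLU(),
    nn.Linear(64, 4),    # 4 outputs, one per action
)

state = torch.randn(1, 8)    # a fake observation, just to check shapes
output = black_box(state)
print(output.shape)          # torch.Size([1, 4])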

Rewards

In our environment modeling we also have to plan our reward architecture, which will serve as the "loss". We set rewards and penalties (penalties are just negative-valued rewards). For a solid example from the lunar lander, for every delta time the reward is calculated roughly as follows: 10 points for each leg with ground contact, plus 100 minus the distance from the landing location, minus 0.3 points for every main engine firing. (A little detail: the lunar lander runs at 50 Hz (0.02 seconds per step), which means we could fire the main engine up to 50 times per second, but in practice most agents learn to fire it much less frequently.) So the final reward function looks like this:

reward = (100 - distance_to_landing_pad) + (10 * leg_ground_contact) - (0.3 * main_engine_thrust)

But you can always add your own custom rewards based on your perception. For example, you can give a reward for stable velocity or a penalty for sudden increases in velocity. This part is up to you and your expertise in the specific domain.
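One way to sketch such a custom reward is a Gym wrapper that adds a penalty on top of the environment's built-in reward; the velocity penalty and its 0.1 coefficient here are my own illustrative choices, not part of the original Lunar Lander reward.

import gym

# A hypothetical reward-shaping wrapper: punish high speeds on top of the
# environment's built-in reward. The 0.1 coefficient is an arbitrary choice.
class VelocityPenaltyWrapper(gym.Wrapper):
    def step(self, action):
        state, reward, terminated, truncated, info = self.env.step(action)
        vx, vy = state[2], state[3]              # horizontal and vertical velocity
        reward -= 0.1 * (abs(vx) + abs(vy))      # custom penalty for moving fast
        return state, reward, terminated, truncated, info

env = VelocityPenaltyWrapper(gym.make('LunarLander-v2'))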

Designing rewards can dramatically influence an agent’s perception and behavior.

DQN (Deep Q-Networks)

There are different approaches to what this neural network's output "tuple" represents, but we will explain just two of the main ones: DQN and Actor-Critic. In DQN our network tries to guess what reward we would get for each action. For a solid example, in the lunar lander our action space has size 4: fire left engine, fire right engine, fire main engine, and do nothing. Our DQN tries to estimate the expected reward value of each of these, and at every step we choose the action with the best value. Epoch by epoch our agent learns to guess what reward it will earn, and if our perception of reality is correct (if our rewards are reasonable) the agent will try to collect more reward and eventually learn to land safely on the moon! This method is also referred to as Deep Q-learning because it was originally derived by moving from Q-tables (a classic RL method from before neural networks) to neural networks. Before neural networks, it was very hard to work with Q-tables in anything but heavily discretized spaces.
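A minimal sketch of the DQN idea, assuming the lunar lander's 8-dimensional state and 4 actions (the layer sizes are illustrative, and this is not the architecture used later in this article):

import torch
import torch.nn as nn

# The network outputs one estimated reward (Q-value) per action,
# and acting greedily means taking the argmax over those values.
q_network = nn.Sequential(
    nn.Linear(8, 128),
    nn.ReLU(),
    nn.Linear(128, 4),                     # one Q-value per action
)

state = torch.randn(1, 8)                  # a placeholder observation
q_values = q_network(state)                # e.g. tensor([[0.3, -0.1, 1.2, 0.4]])
best_action = q_values.argmax(dim=-1)      # pick the most promising action
print(best_action.item())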

Actor-Critic

In the actor-critic method the architecture is not totally different; we use two neural networks for different purposes. One is the actor, which gives the probability of each action to perform. For example, in the lunar lander the actor outputs probabilities like [0.2, 0.4, 0.1, 0.3] and we take a sample from this distribution. The other network, the critic, takes the state (there are also variants where the critic takes both state and action) and tries to guess what reward we will get. Again, epoch by epoch the actor learns to distribute probabilities better and more precisely, while the critic learns to guess future rewards. In other words, the critic's loss is calculated from how well it guessed the reward, while the actor's loss is calculated from the critic's estimate of future (or possible) rewards. Put differently, the actor is responsible to the critic and the critic is responsible to the environment; there is a kind of hierarchical architecture.
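To make this hierarchy concrete, here is a sketch of how the two losses relate, using a plain (non-PPO) actor-critic update and made-up numbers; the actual losses we use later are defined in the PPO code.

import torch
import torch.nn.functional as F

# The critic is judged against the environment (the observed return),
# while the actor is judged through the critic (the advantage).
log_prob_action = torch.tensor(-1.2, requires_grad=True)  # log-prob of the sampled action
value_estimate  = torch.tensor(0.8, requires_grad=True)   # critic's guess of the return
observed_return = torch.tensor(1.5)                       # what the environment actually paid out

critic_loss = F.mse_loss(value_estimate, observed_return)
advantage = (observed_return - value_estimate).detach()   # how much better than expected
actor_loss = -log_prob_action * advantage                 # raise probability if advantage > 0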

Exploration-Exploitation

There is one important concept left to mention before going further: the exploration-exploitation dilemma. In the first episodes of training our agent cannot collect many rewards, but as the episodes pass it starts to earn rewards, and the classic RL dilemma arises: should the agent stick with its previous techniques or keep trying new things? There is no guarantee that the agent will collect more reward by trying new things, and if it skips the guaranteed reward it may end up worse off. This is the exploration-exploitation dilemma, and we have to balance whether the agent takes the guaranteed rewards or gambles for better future rewards.

So balancing exploration and exploitation is crucial for effective learning. In DQN one popular technique to manage this is the epsilon-greedy strategy with epsilon decay. Initially, the agent starts with a high epsilon value like 0.9. When we ask the agent (the DQN) for an action, we also draw a random number, say between 0 and 1: if it is higher than our epsilon hyperparameter we take the DQN's action, and if it is lower we take a completely random action. As the agent gains experience, epsilon "decays" over time, for example by multiplying it by 0.99 at every episode. If we start with 0.9 and decay by 0.99, after about 220 episodes epsilon falls under 0.1 (you can calculate it as 0.9 * 0.99^episode) and the agent mostly stops trying new things. So it gradually shifts the focus from exploration to exploitation. This way, the agent can explore different strategies in the early stages and later refine its policy for maximum reward, aligning well with DQN's objective of estimating the reward of each action.
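Here is a minimal sketch of epsilon-greedy with decay; q_values_for() is a hypothetical stand-in for the DQN forward pass, not a function from this article's code.

import random

epsilon = 0.9                 # start by exploring most of the time
EPSILON_DECAY = 0.99
MIN_EPSILON = 0.05

def q_values_for(state):
    # hypothetical stand-in for a DQN forward pass; returns dummy Q-values
    return [0.1, 0.4, 0.2, 0.3]

def choose_action(state, n_actions=4):
    if random.random() < epsilon:                     # explore: completely random action
        return random.randrange(n_actions)
    q = q_values_for(state)                           # exploit: the DQN's best guess
    return max(range(n_actions), key=lambda a: q[a])

# at the end of every episode, shift from exploration to exploitation:
# epsilon = max(MIN_EPSILON, epsilon * EPSILON_DECAY)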

On the other hand, the Actor-Critic method inherently supports the exploration-exploitation duality through its dual-network architecture. The actor, generating a probability distribution over actions, naturally incorporates exploration. The “probability distribution” allows the agent to explore different actions, not just the one with the highest estimated reward. This stochastic nature of the actor ensures a “built-in exploration” mechanism, and the critic’s feedback guides the actor to adjust these probabilities in favor of actions that maximize expected “future rewards”. This dynamic between the actor and critic not only streamlines the exploration-exploitation trade-off but also makes the system robust to changes in the environment, leading to more adaptive learning.

Both methods have their merits in handling the exploration-exploitation dilemma. While in DQN we have to include external mechanisms like epsilon decay, Actor-Critic encapsulates it within its architecture, allowing for a more detailed and responsive adaptation to the environment. Each has its own advantages and drawbacks when it comes to balancing exploration and exploitation.

Now, how does PPO (Proximal Policy Optimization) fit in?

Basically, PPO makes sure that the updates to the actor's policy aren't too drastic. As the agent learns, it might get too "creative" and try weird things; PPO helps it refine its skills without going off the rails. Instead of making drastic changes each time, PPO encourages small adjustments based on what worked before, using a "trust region" to ensure the agent's new techniques aren't too risky or outlandish. Imagine the agent has learned to fire the main engine at the right time to slow down its descent. PPO ensures that in the next training iteration the agent doesn't suddenly decide to stop using the main engine altogether, as that would be a drastic change likely to fail. Over multiple epochs, the agent collects "trajectories", which are sequences of states, actions, and rewards. It uses these trajectories to calculate an "advantage", essentially assessing how much better a specific action (like firing the main engine) is compared to the average action in that situation (like randomly choosing an engine to fire).
In simpler terms, PPO makes sure the agent fine-tunes its strategy (“policy”) smoothly, optimizing its landing skills without suddenly forgetting how to land.
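Here is a compact sketch of that clipped "trust region" update on made-up numbers; the same computation appears inside the training loop later in this article.

import torch

EPSILON = 0.25                                     # clipping range

new_log_prob = torch.tensor([-0.9, -1.4])          # log-probs under the updated policy
old_log_prob = torch.tensor([-1.2, -1.0])          # log-probs when the trajectory was collected
advantages   = torch.tensor([ 1.5, -0.5])          # how much better than average each action was

ratio = (new_log_prob - old_log_prob).exp()        # how much the policy has changed
unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1.0 - EPSILON, 1.0 + EPSILON) * advantages

# Taking the minimum keeps the update inside the trust region:
policy_loss = -torch.min(unclipped, clipped).mean()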

This behavior separates PPO from other techniques like REINFORCE. By analogy, REINFORCE is like a greedy learner, always going for the immediate reward without considering the bigger picture, while PPO, as its name "Proximal Policy Optimization" suggests, is like a strategic learner that understands consistency and adaptability are key to long-term success. In essence (a short REINFORCE sketch follows the list below for contrast):

  • REINFORCE can be fast and simple, but it might lead to unstable policies prone to overfitting.
  • PPO offers more stability and adaptability, but it can be computationally more expensive; that said, compared to other deep learning methods, computation is rarely the main concern in reinforcement learning.
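For contrast, here is a minimal sketch of the plain REINFORCE update on made-up numbers: each log-probability is weighted by the whole discounted return, with no critic baseline and no clipping, which is what makes it simpler but less stable.

import torch

log_probs = torch.tensor([-1.1, -0.7, -1.3], requires_grad=True)  # log-probs of the taken actions
returns   = torch.tensor([ 2.0,  1.5,  0.5])                      # discounted returns (made up)

reinforce_loss = -(log_probs * returns).sum()      # no advantage baseline, no trust region
reinforce_loss.backward()                          # one gradient step on the policy alone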

Anyway, there is always more to say, but let's dive into the PyTorch code, get something solid, and finish this article.

Code

import matplotlib.pyplot as plt
import numpy as np
import gym

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.distributions as distributions

If you encounter problems while installing Box2D and gym, this worked for me:

py -m pip install --upgrade pip setuptools wheel

Here are our actor and critic class definitions, inheriting from PyTorch's nn.Module; we are using PReLU and Dropout.

# Define the Actor neural network class
class Actor(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(input_dim, 128),   # Linear layer
            nn.Dropout(p=0.15),          # Dropout layer is the trick in this architecture
            nn.PReLU(),                  # PReLU is just slightly better than ReLU
            nn.Linear(128, output_dim),  # Linear layer
            nn.Softmax(dim=-1)           # Softmax activation to get probabilities
        )

    def forward(self, x):
        return self.net(x)


# Define the Critic neural network class
class Critic(nn.Module):
    def __init__(self, input_dim):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(input_dim, 128),   # Linear layer
            nn.Dropout(p=0.15),          # Dropout layer is the trick in this architecture
            nn.PReLU(),                  # PReLU is just slightly better than ReLU
            nn.Linear(128, 1)            # Output layer
        )

    def forward(self, x):
        return self.net(x)

You can try to find better ones, but our basic hyperparameters are set as follows:

# Initialize the gym environment
train_env = gym.make('LunarLander-v2')

# Define dimensions for network, input is 8 and output is 4
INPUT_DIM = train_env.observation_space.shape[0]
OUTPUT_DIM = train_env.action_space.n

# Create actor and critic networks
actor = Actor(INPUT_DIM, OUTPUT_DIM)
critic = Critic(INPUT_DIM)

# Initialize optimizers for the actor and critic (the exact learning rate is not critical)
optimizer_actor = optim.Adam(actor.parameters(), lr=0.001)
optimizer_critic = optim.Adam(critic.parameters(), lr=0.001)

# Define hyperparameters
EPISODES = 3000
GAMMA = 0.99
PPO_STEPS = 7
EPSILON = 0.25

Here is our train loop:

# Initialize a list to store rewards for each episode for plotting and breaking the loop
all_rewards, loss_history_policy, loss_history_value, mean_rewards = [], [], [], []


# Main training loop
for episode in range(1, EPISODES + 1):
    # Initialize empty arrays for this episode
    states, actions, log_prob_actions, values, rewards = [], [], [], [], []
    done = False
    episode_reward = 0
    state, _ = train_env.reset()

    # Main loop: here we interact with the environment, most things are done here
    while not done:
        # Prepare state for network and store
        state = torch.FloatTensor(state).unsqueeze(0)
        states.append(state)

        # Get action and value predictions
        action_pred = actor(state)
        value_pred = critic(state)

        # Sample action from the distribution
        dist = distributions.Categorical(action_pred)
        action = dist.sample()
        log_prob_action = dist.log_prob(action)

        # Take a step in the environment (one delta time)
        state, reward, terminated, trunked, _ = train_env.step(action.item())
        # in previous gym versions there was only one "done" flag
        done = terminated or trunked

        # Store experience
        actions.append(action)
        log_prob_actions.append(log_prob_action)
        values.append(value_pred)
        rewards.append(reward)

        # Accumulate rewards for this episode
        episode_reward += reward

    # Calculate returns and advantages
    returns, R = [], 0
    for r in reversed(rewards):
        # Calculate discounted return
        R = r + R * GAMMA
        returns.insert(0, R)

    # Normalize returns (you could add a small term to the divisor for the division-by-zero case, but it never happened to me)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / returns.std()

    # Calculate and normalize advantages
    values = torch.cat(values).squeeze(-1)
    advantages = returns - values
    advantages = (advantages - advantages.mean()) / advantages.std()

    # Prepare for PPO update
    states = torch.cat(states)
    actions = torch.cat(actions)
    log_prob_actions = torch.cat(log_prob_actions).detach()
    advantages = advantages.detach()

    # PPO update step
    for _ in range(PPO_STEPS):
        # Get new action and value predictions
        action_pred = actor(states)
        value_pred = critic(states).squeeze(-1)

        # Calculate the ratio term for PPO
        dist = distributions.Categorical(action_pred)
        new_log_prob_actions = dist.log_prob(actions)
        policy_ratio = (new_log_prob_actions - log_prob_actions).exp()

        # Calculate both clipped and unclipped objectives
        policy_loss_1 = policy_ratio * advantages
        policy_loss_2 = torch.clamp(policy_ratio, min=1.0 - EPSILON, max=1.0 + EPSILON) * advantages

        # Calculate policy and value losses
        policy_loss = -torch.min(policy_loss_1, policy_loss_2).sum()
        value_loss = F.smooth_l1_loss(returns, value_pred).sum()

        # Zero the gradients
        optimizer_actor.zero_grad()
        optimizer_critic.zero_grad()

        # Perform backpropagation
        policy_loss.backward()
        value_loss.backward()

        # Update the network weights
        optimizer_actor.step()
        optimizer_critic.step()

    # Store episode rewards and losses
    all_rewards.append(episode_reward)
    loss_history_policy.append(policy_loss.item())  # Store policy loss
    loss_history_value.append(value_loss.item())    # Store value loss

    # Break if we achieve our goal: a mean reward of 200 over the last 100 episodes
    if len(all_rewards) >= 100:
        mean_last_100 = sum(all_rewards[-100:]) / 100
        mean_rewards.append(mean_last_100)
        if episode % 10 == 0:
            print(f'Epoch: {episode:3}, Reward: {episode_reward}, Mean of last 100: {mean_last_100}')

        if mean_last_100 > 200:
            print(f"Mean of last 100 episode rewards exceeds 200 ({mean_last_100}). Stopping training.")
            break

The Lunar Lander problem is considered solved when an average of 200 points is achieved over 100 consecutive runs.

Here is a sample test run after training; the total reward of each episode is recorded below:

Test Episode 1, Total Reward: 282.5017644661861
Test Episode 2, Total Reward: 231.99312556442905
Test Episode 3, Total Reward: 252.15062230552587
Test Episode 4, Total Reward: 253.09629731834215
Test Episode 5, Total Reward: 246.45468687511757
Test Episode 6, Total Reward: 252.46901225964368
Test Episode 7, Total Reward: 225.7263933484187
Test Episode 8, Total Reward: 220.83741008132807
Test Episode 9, Total Reward: 275.95709765920367
Test Episode 10, Total Reward: 260.54337724720875

And this code is for rendering the agent visually:

# DO NOT FORGET TO SWITCH TO EVAL MODE, OTHERWISE THE DROPOUT LAYERS WILL KEEP DROPPING ACTIVATIONS AT TEST TIME!
actor.eval()

# Initialize the test environment
test_env = gym.make('LunarLander-v2',render_mode="human")

# Number of test episodes
NUM_TEST_EPISODES = 10

# Run the agent on the test environment
for episode in range(1, NUM_TEST_EPISODES + 1):
    state, _ = test_env.reset()
    done = False
    episode_reward = 0
    while not done:
        # Render the environment in human-readable format
        test_env.render()

        state = torch.FloatTensor(state).unsqueeze(0)

        with torch.no_grad():
            action_prob = actor(state)
            dist = distributions.Categorical(action_prob)
            action = dist.sample()

        state, reward, terminated, trunked, _ = test_env.step(action.item())
        done = terminated or trunked

        episode_reward += reward

    print(f'Test Episode {episode}, Total Reward: {episode_reward}')

# Close the environment
test_env.close()

Reward and loss history over epochs:
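The plot itself is not reproduced here, but a sketch of how it can be generated from the history lists collected in the training loop (using the matplotlib import from the top of the code) looks like this:

import matplotlib.pyplot as plt   # already imported at the top

# Plot episode rewards, the running mean of the last 100 episodes,
# and the policy/value loss histories collected during training.
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.plot(all_rewards, label='Episode reward')
plt.plot(range(99, 99 + len(mean_rewards)), mean_rewards, label='Mean of last 100')
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(loss_history_policy, label='Policy loss')
plt.plot(loss_history_value, label='Value loss')
plt.xlabel('Episode')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.show()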

In conclusion, this article is a humble effort to shed light on the intricacies of reinforcement learning, bridging theory and practice. We tried to provide well-annotated code snippets and an insightful discussion of artificial intelligence, followed by the Lunar Lander problem, to encourage a deeper comprehension of reinforcement learning methodologies, simulations, and their real-world applications. There is, of course, a LOT more to say about control theory, artificial intelligence, and their intersection, reinforcement learning. I hope that in the next articles I can talk more about them.

Thank you for your time, focus, and patience.

These additional resources might be helpful, and I highly recommend checking them out:

OpenAI's own documentation for Lunar Lander is very clear and helpful:

https://www.gymlibrary.dev/environments/box2d/lunar_lander/

Sebastian Lague's "Coding Adventure" series; I have included the episodes on simulations and neural networks:

https://youtu.be/hfMk-kjRv4c?si=UzB_aozIbjT8-xy8
https://youtu.be/rSKMYc1CQHE?si=y9XA7a7aTAtLXSdF

Here is Brian Douglas’s reinforcement learning and control series.

https://youtu.be/pc-H4vyg2L4?si=tocRQZsyva5JJTd5
https://youtu.be/RJleGwXorUk?si=eip2XLmg6kBU_GlI
