Deep Deterministic Policy Gradient (DDPG) explained with code in reinforcement learning

Training an OpenAI Gym environment with a continuous action space

Mehul Gupta
Data Science in your pocket
8 min read · May 31, 2023



So far so good: we have covered a bunch of exciting things in reinforcement learning, ranging from the basics to Multi-Armed Bandits (MAB), Temporal Difference learning, and plenty of Deep Reinforcement Learning algorithms, namely REINFORCE, A2C, DQN, etc.

My debut book “LangChain in your Pocket” is out now

You can check them out below in the Reinforcement Learning section

The algorithms we discussed in my last few posts mostly deal with a continuous state space but discrete actions, i.e. actions like Left, Right, Up, Down, accelerate, decelerate, etc.

But not,

Apply 0.5 brake

Move 32.3° to the right

Steer 35° and accelerate by 10km/hr

The difference is easy to spot: for the second type of action, we also need to estimate a continuous quantity alongside the action itself. Steer the wheel? Ok, but by how much? 35° or 40°?
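To make this concrete, here is a minimal sketch of how the two kinds of action spaces are declared with Gym's space classes (the bounds here are just illustrative):

from gym import spaces

# Discrete action space: the agent picks one of 4 labelled actions
# (e.g. Left, Right, Up, Down)
discrete_actions = spaces.Discrete(4)

# Continuous action space: the agent outputs a real number in [-2, 2]
# (e.g. how much torque or steering to apply)
continuous_actions = spaces.Box(low=-2.0, high=2.0, shape=(1,))

print(discrete_actions.sample())    # e.g. 2
print(continuous_actions.sample())  # e.g. [0.73]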

If you are into gaming, you know this is a major requirement. If not, play GTA once.

So this time we will deep dive into DDPG, which can be used to train agents in environments with a continuous action space.

DDPG also belongs to the family of Actor-Critic methods (like the A2C we discussed earlier), where an Actor (policy network) and a Critic (value network) learn together and, in the end, the Actor is used to determine actions.

New to Actor-Critic? The video below will help.

A few things that we will be changing in DDPG compared to A2C are:

  • Use of Target networks for both Actor & Critic for stabilized training.
  • Use of Experience Replay (that we used in DQNs).
  • An updated loss function for both Actor & Critic networks.

What is Experience Replay?

It was already covered in my previous blogs; below is a video in case you missed it.
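In a nutshell: we store past transitions in a buffer and train on random mini-batches drawn from it, which breaks the correlation between consecutive samples. A minimal sketch (the buffer size, batch size and dummy transitions below are purely illustrative):

import random
from collections import deque

replay = deque(maxlen=1000)  # oldest transitions get dropped automatically

# store dummy transitions of the form (state, action, reward, next_state, done)
for t in range(500):
    replay.append((t, 0.0, -1.0, t + 1, False))

# once enough transitions are stored, train on a random (uncorrelated) mini-batch
mini_batch = random.sample(replay, 128)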

Loss functions

Critic loss = MSE(Target q-value, Estimated q-value)

Where Target q-value = Reward + gamma x T_Q(s', T_A(s'))

And Actor loss = -Q(s, A(s))

Where

A=Actor Network

Q=Critic Network

T_A = Target Actor network

T_Q = Target Critic Network

s = current state

s’= next state

What are the Target networks used in DDPG?

They are simply copies of the actual Actor & Critic networks we are training, updated from their actual counterparts periodically (say every few epochs) using a soft update rather than a straight copy. Maintaining these copies of the Actor & Critic networks helps stabilize agent training.

Now that we have discussed the changes compared to A2C, let's take a brief look at the environment I tried this time.

The environment

The environment we are training today is Pendulum-v1 from OpenAI Gym, which takes a continuous value between -2 and 2 as its action, i.e. the torque applied. The state is a tuple of 3 values:

  • Cosine and sine of the pendulum angle (each ranging from -1 to 1)
  • Angular velocity (ranging from -8 to 8)

The reward function is a continuous value designed to encourage the agent to swing the pendulum up and balance it in the upright position while minimizing the energy (torque) used. It is calculated using the formula below:

reward = -(theta² + 0.1 * theta_dot² + 0.001 * action²)

where:

theta represents the angle of the pendulum from the vertical position.

theta_dot represents the angular velocity of the pendulum.

action represents the torque applied to the joint of the pendulum.
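You can verify these ranges directly from the environment object; a quick check (the outputs shown in the comments are indicative):

import gym

env = gym.make('Pendulum-v1')

print(env.observation_space)   # Box([-1. -1. -8.], [1. 1. 8.], (3,), float32)
print(env.action_space)        # Box(-2.0, 2.0, (1,), float32)
print(env.action_space.high)   # [2.]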

Time for some code

  1. Import required libraries
import random
import numpy as np
from PIL import Image
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.optimizers import Adam
import tensorflow as tf
import gym
import math
import pygame, sys
from tensorflow import keras
from collections import deque

# enable eager execution of tf.functions (easier to debug, slower to run)
tf.config.run_functions_eagerly(True)

2. Initiate the gym environment

env = gym.make('Pendulum-v1')

input_shape = (3,)
num_actions = 1

3. Declare Actor and target Actor network

def actor_network(input_shape=(3,)):
    model = Sequential()
    model.add(Dense(128, activation=tf.keras.layers.LeakyReLU(alpha=0.2), input_shape=input_shape))
    model.add(Dense(64, activation=tf.keras.layers.LeakyReLU(alpha=0.2)))
    model.add(Dense(16, activation=tf.keras.layers.LeakyReLU(alpha=0.2)))
    model.add(Dense(num_actions, activation='tanh'))
    return model

actor, target_actor = actor_network(),actor_network()
optimizer_actor = Adam(learning_rate=0.001)

Things to note

  • Both Actor and Target Actor have the same architecture
  • They are shallow Feed-Forward Networks with the current state as input
  • The last activation used is ‘tanh’ as the action is in the range -2,2

Note: Tanh outputs values between -1,1 which we need to scale to -2,2
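So whenever we query the actor, we multiply its tanh output by the action bound (2 here). A one-line sketch, assuming the actor model above and a state obtained from env.reset():

# tanh output in [-1, 1] scaled to the torque range [-2, 2]
action = 2 * actor.predict(np.array(state).reshape(-1, 3), verbose=0)[0]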

4. Declare Critic and Target Critic Network

def critic_network(state_dim, action_dim):
    # Define the input layers
    state_input = Input(shape=(state_dim,), dtype=tf.float64)
    action_input = Input(shape=(action_dim,), dtype=tf.float64)

    # Separate branches for state and action
    state_h1 = Dense(128, activation=tf.keras.layers.LeakyReLU(alpha=0.2))(state_input)
    state_h2 = Dense(64, activation=tf.keras.layers.LeakyReLU(alpha=0.2))(state_h1)

    action_h1 = Dense(128, activation=tf.keras.layers.LeakyReLU(alpha=0.2))(action_input)
    action_h2 = Dense(64, activation=tf.keras.layers.LeakyReLU(alpha=0.2))(action_h1)
    concat = Concatenate()([state_h2, action_h2])

    # Define the output layer
    dense1 = Dense(64, activation=tf.keras.layers.LeakyReLU(alpha=0.2))(concat)
    dense2 = Dense(32, activation=tf.keras.layers.LeakyReLU(alpha=0.2))(dense1)
    output = Dense(1, activation='linear')(dense2)

    model = Model(inputs=[state_input, action_input], outputs=output)
    return model

critic, target_critic = critic_network(3,1),critic_network(3,1)
optimizer_critic = Adam(learning_rate=0.001)

Things to note

  • Critic and Target Critic have the same architecture
  • The Critic takes both state and action as input and outputs the expected q-value
  • The concat step in the middle of the architecture merges the state and action branches. Could this be done as the first step instead? Yes, though I didn't try it (see the sketch after this list)
  • The last activation is linear as q-values are continuous values

5. Updating the target networks

def update_target_networks(actor_model, critic_model, target_actor_model, target_critic_model):
    tau = 0.05
    # Update the target actor model
    actor_weights = actor_model.get_weights()
    target_actor_weights = target_actor_model.get_weights()
    for i in range(len(actor_weights)):
        target_actor_weights[i] = tau * actor_weights[i] + (1 - tau) * target_actor_weights[i]
    target_actor_model.set_weights(target_actor_weights)

    # Update the target critic model
    critic_weights = critic_model.get_weights()
    target_critic_weights = target_critic_model.get_weights()
    for i in range(len(critic_weights)):
        target_critic_weights[i] = tau * critic_weights[i] + (1 - tau) * target_critic_weights[i]
    target_critic_model.set_weights(target_critic_weights)
    return target_critic_model, target_actor_model

This function updates the target networks from time to time. The update is not a straight copy-paste but a soft update: a mix of the target's old weights and the current network's weights, controlled by the trade-off variable 'tau'.
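To see what tau = 0.05 actually does, here is a tiny worked example on a single weight value (the numbers are illustrative):

tau = 0.05
online_weight = 1.0   # weight of the network currently being trained
target_weight = 0.0   # corresponding weight of the target network

# after one soft update the target moves only 5% of the way towards the online weight
target_weight = tau * online_weight + (1 - tau) * target_weight
print(target_weight)  # 0.05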

6. Noise function, which we will use to perturb the action predicted by the Actor so the agent keeps exploring the continuous action space. I copy-pasted this (Ornstein-Uhlenbeck) noise code myself and don't feel it requires a deep dive for now.

class OrnsteinUhlenbeckActionNoise:
    def __init__(self, mu, sigma, theta, dt, size):
        self.mu = mu
        self.sigma = sigma
        self.theta = theta
        self.dt = dt
        self.size = size
        self.reset()

    def reset(self):
        self.state = np.ones(self.size) * self.mu

    def __call__(self):
        x = self.state
        dx = self.theta * (self.mu - x) * self.dt + self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.size)
        self.state = x + dx
        return self.state


def ddpg_add_exploration_noise(exploration_noise, action, noise_scale):
    noise = noise_scale * exploration_noise()
    action = np.clip(action + noise, -2.0, 2.0)
    return action


# Usage example:
action_dim = 1  # Dimensionality of the action space
noise_mu = 0.0
noise_sigma = 0.2
noise_theta = 0.15
noise_dt = 0.01

# Create an instance of the OrnsteinUhlenbeckActionNoise class
exploration_noise = OrnsteinUhlenbeckActionNoise(noise_mu, noise_sigma, noise_theta, noise_dt, size=action_dim)
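Putting it together, adding noise to a predicted action looks like this, using the function and noise instance defined just above (the action value is a placeholder):

# pretend the actor predicted a torque of 0.5
action = np.array([0.5])

noisy_action = ddpg_add_exploration_noise(exploration_noise, action, noise_scale=0.1)
print(noisy_action)  # e.g. [0.48] -- still clipped to the valid range [-2, 2]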

7. Some constants to be used

gamma = tf.cast(tf.constant(0.95), tf.float64)  # discount factor
num_episodes = 1000
maxlen = 1000   # replay buffer size
batch = 128     # mini-batch size
replay = deque(maxlen=maxlen)
epoch = 0       # number of gradient-update rounds done so far
count = 0       # steps collected since the last update

8. Training loop

for episode in range(num_episodes):
    ep_len = 0
    state = env.reset()

    # Run the episode
    while True:
        count += 1
        ep_len += 1

        # actor outputs in [-1, 1]; scale to the torque range [-2, 2] and add exploration noise
        action = 2 * actor.predict(np.array(state).reshape(-1, 3), verbose=0)[0]
        action = ddpg_add_exploration_noise(exploration_noise, action, noise_scale=0.1)

        next_state, reward, done, _ = env.step(action)
        done = 1 if done else 0

        print('reward and state', reward, state)
        state = state.reshape(3)

        replay.append((np.array(state), action, reward, np.array(next_state), done))
        state = next_state

        if done:
            break

        # once enough new transitions are collected, run one round of updates on a random mini-batch
        if count > batch:
            count = 0
            batch_ = random.sample(replay, batch)
            current_state = tf.convert_to_tensor([x[0] for x in batch_])
            next_state_b = tf.convert_to_tensor([x[3] for x in batch_])
            reward_b = tf.reshape(tf.cast(tf.convert_to_tensor([x[2] for x in batch_]), tf.float64), (-1, 1))
            done_b = tf.reshape(tf.cast(tf.convert_to_tensor([x[4] for x in batch_]), tf.float64), (-1, 1))
            actions = tf.convert_to_tensor([x[1] for x in batch_])

            # target q-value = r + gamma * (1 - done) * T_Q(s', T_A(s')), computed with the target networks
            q_actions = target_actor(next_state_b)
            target_q = reward_b + (tf.cast(tf.constant(1.0), tf.float64) - done_b) * gamma * tf.cast(target_critic([next_state_b, q_actions]), tf.float64)

            # Critic update: MSE between target and estimated q-values
            with tf.GradientTape() as tape:
                current_q_value = critic([current_state, actions])
                critic_loss = tf.reduce_mean(tf.math.pow(target_q - tf.cast(current_q_value, tf.float64), 2))

            grads_critic = tape.gradient(critic_loss, critic.trainable_variables)
            optimizer_critic.apply_gradients(zip(grads_critic, critic.trainable_variables))

            # Actor update: maximize Q(s, A(s)), i.e. minimize its negative
            with tf.GradientTape() as tape:
                actions = actor(current_state, training=True)
                current_q_value = critic([current_state, actions], training=True)
                actor_loss = -tf.reduce_mean(current_q_value)

            grads_actor = tape.gradient(actor_loss, actor.trainable_variables)
            optimizer_actor.apply_gradients(zip(grads_actor, actor.trainable_variables))

            print('Epoch {} done with loss actor={} , critic={} !!!!!!'.format(epoch, actor_loss, critic_loss))
            if epoch % 10 == 0:
                actor.save('pendulum/actor/')
                critic.save('pendulum/critic/')

            if epoch % 5 == 0:
                target_critic, target_actor = update_target_networks(actor, critic, target_actor, target_critic)
            epoch += 1

This requires some explanation, so let's get started.

  • For every episode

Reset the environment.

Take an action as the prediction from the Actor network and scale the output to -2,2 (remember tanh).

Add noise to this action.

Get the next state, reward, and done status (whether the episode ended) by feeding the action to the environment.

Store current_state, action, reward, next_state, and done status in the Experience Replay buffer.

Once the required number of samples is in the buffer, calculate target_q_value = reward + gamma x q-value of the next state.

How did we get the q-value for the next state?

Calculate the action for the next state using the Target Actor network.

Calculate the q-value for this next state and action using the Target Critic network.

This is the only calculation where the Target networks are used; they are used nowhere else! (See the short shape-check sketch after this list.)

The next step is to calculate gradients to update our Actor and Critic networks (not the Target networks but the actual ones) using tf.GradientTape.

For the Critic, the loss is the mean squared error, i.e. mean((Target_q_value - predicted_q_value)²).

For the Actor, it is -1 x mean(predicted_q_value).

Apply the gradients to the Actor and Critic networks respectively.

Update the Target networks after every nth epoch/batch.
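As mentioned in the list above, here is a quick shape check on the target q-value computation (a self-contained sketch with dummy tensors; reshaping the reward and done tensors to (batch, 1) is what keeps them from broadcasting against the (batch, 1) critic output into a (batch, batch) matrix):

import tensorflow as tf

batch = 4
reward = tf.reshape(tf.constant([1.0, 2.0, 3.0, 4.0], dtype=tf.float64), (-1, 1))  # (4, 1)
done = tf.zeros((batch, 1), dtype=tf.float64)                                       # (4, 1)
next_q = tf.ones((batch, 1), dtype=tf.float64)   # stand-in for T_Q(s', T_A(s'))
gamma = tf.constant(0.95, dtype=tf.float64)

target_q = reward + (1.0 - done) * gamma * next_q
print(target_q.shape)  # (4, 1) -- one target per transition, as expected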

9. To visualize the results, use the code snippet below, which I have already explained in my previous blogs and vlogs.

import tensorflow as tf
import numpy as np
import gym
import math
from PIL import Image
import pygame, sys
from pygame.locals import *
from tensorflow import keras

# pygame essentials
pygame.init()
DISPLAYSURF = pygame.display.set_mode((500, 500), 0, 32)
clock = pygame.time.Clock()
pygame.display.flip()

# openai gym env
env = gym.make('Pendulum-v1')
state = env.reset()

done = False
count = 0
steps = 0
# loading trained model
model = tf.keras.models.load_model('pendulum/actor/')
total_wins = 0
episodes = 0


def print_summary(text, cood, size):
    font = pygame.font.Font(pygame.font.get_default_font(), size)
    text_surface = font.render(text, True, (0, 0, 0))
    DISPLAYSURF.blit(text_surface, cood)


while episodes < 1000:
    for event in pygame.event.get():
        if event.type == QUIT:
            pygame.quit()
            raise Exception('rendering ended')

    # Get the action from the trained actor network
    # (scale the tanh output to the torque range [-2, 2], as done during training)
    action = 2 * model.predict(np.array(state).reshape(-1, 3), verbose=0)[0]

    next_state, reward, done, info = env.step(action)  # take a step in the environment
    print('reward and done?', reward, done)
    image = env.render(mode='rgb_array')  # render the environment to the screen

    # convert image to pygame surface object
    image = Image.fromarray(image, 'RGB')
    mode, size, data = image.mode, image.size, image.tobytes()
    image = pygame.image.fromstring(data, size, mode)

    DISPLAYSURF.blit(image, (0, 0))
    pygame.display.update()
    clock.tick(100)
    if done:
        state = env.reset()
        pygame.display.update()
        pygame.time.delay(100)
        episodes += 1

    pygame.time.delay(100)
    state = next_state

pygame.quit()

End note

A major problem I faced while training the pendulum environment (and the reason I got poor results) is that DDPG is considered one of the hardest of the well-known algorithms to train: it is very susceptible to hyperparameters, so hyperparameter tuning is a must. Also, as exploration happens in a continuous action space, this is genuinely a tough problem to crack.

Do try this code snippet, play around with the hyperparameters, and train other OpenAI Gym environments like CarRacing, which is comparatively more complex.

Until next time
