Deep Reinforcement Learning in Mobile Robot Navigation Tutorial — Part 2: Network

Reinis Cimurs
8 min read · Sep 6, 2022


In Part 1 of this tutorial, we cloned the repository, installed it, and set up the proper sources to run the training for the first time. This should allow us to run the full training of our TD3 architecture-based neural network in ROS simulation. In this part, we will look in detail at the TD3 network code itself.

The neural network code is located in the Python file:

train_velodyne_td3.py

which can be found in:

~<PATH_TO_FOLDER>/DRL-robot-navigation/TD3

The naming convention here stands for:

  • train — Python file used for training
  • velodyne — the sensor used to perceive the surrounding environment
  • td3 — the network architecture

TD3 Network Implementation

TD3 is an actor-critic type of network, similar to DDPG. That means there is an “actor” network that calculates an action to perform, and a “critic” network that estimates how good this action is. In simple terms, the TD3 architecture is an extension of the DDPG architecture that solves the problem of overestimating the Q-value. It does so by introducing a second critic network into the loop and selecting the output of the one that produces the lower Q-value estimate. (Once again, a mathematical and algorithmic background overview can be obtained here.) Therefore, we need to create an actor network that takes the environmental state as input and outputs an action for the robot to take. We also need to create two critic networks that take the environmental state as well as the action from the actor network as inputs and output the estimated value of this state-action pair.

TD3 Neural Network Scheme

Actor Network

To create an actor model, we simply need to create a decoder that will map the state input to action as follows:
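
A minimal sketch of such an actor network in PyTorch, using the layer sizes described below (the exact code in train_velodyne_td3.py may differ slightly):

import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Actor, self).__init__()
        # Three fully connected layers: state -> 800 -> 600 -> action
        self.layer_1 = nn.Linear(state_dim, 800)
        self.layer_2 = nn.Linear(800, 600)
        self.layer_3 = nn.Linear(600, action_dim)
        self.tanh = nn.Tanh()

    def forward(self, s):
        s = F.relu(self.layer_1(s))
        s = F.relu(self.layer_2(s))
        # Tanh caps the output to the [-1, 1] range
        a = self.tanh(self.layer_3(s))
        return a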

We create an actor class that consists of 3 linear, fully connected layers. In the class initialization, we define the layers that we will use in the forward pass. The input to the first layer is a single-dimension vector representing the environmental state, with size state_dim. We embed this state with 800 parameters. Since the layers are called sequentially, the output size of one layer must match the input size of the next. Therefore, the next layer takes these 800 parameters as input and maps them to 600 parameters. In the final layer, we map the 600 parameters directly to the number of controllable actions our robot has, action_dim. In our case, we control a ground mobile robot with controllable linear and angular velocities. Since our robot has limited maximum and minimum possible velocities, the output of the neural network needs to be limited to a min-max range. We achieve this with a Tanh activation function, which caps the output to the [-1, 1] range.

In the forward pass, we define the order in which the layers are called and how the values propagate through them. First, we call the first defined layer and apply a ReLU activation to it, then do the same with the second layer. After that, we call the third linear layer and apply the Tanh function to force the output into the [-1, 1] range. The output of this simple actor network is the basis for the linear and angular velocities of our robot.

Critic Network

Similarly, we will create a critic network. As mentioned, we require 2 critics in order to mitigate the Q-value overestimation. Fortunately, both of these critics always work at the same time, so they can be defined in a single critic class.
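
A sketch of such a critic class, following the layer layout and naming described below (again, a PyTorch approximation rather than a verbatim copy of the repository code, and reusing the imports from the actor sketch above):

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        # First critic: layer 1, layer 2 (separate state and action branches), layer 3
        self.layer_1 = nn.Linear(state_dim, 800)
        self.layer_2_s = nn.Linear(800, 600)
        self.layer_2_a = nn.Linear(action_dim, 600)
        self.layer_3 = nn.Linear(600, 1)

        # Second critic: layer 4, layer 5 (separate state and action branches), layer 6
        self.layer_4 = nn.Linear(state_dim, 800)
        self.layer_5_s = nn.Linear(800, 600)
        self.layer_5_a = nn.Linear(action_dim, 600)
        self.layer_6 = nn.Linear(600, 1)

    def forward(self, s, a):
        # First critic head
        s1 = F.relu(self.layer_1(s))
        # Multiply the state embedding and the action by the weights of their
        # respective second layers, sum them and add the action layer bias
        s11 = torch.mm(s1, self.layer_2_s.weight.data.t())
        s12 = torch.mm(a, self.layer_2_a.weight.data.t())
        s1 = F.relu(s11 + s12 + self.layer_2_a.bias.data)
        q1 = self.layer_3(s1)

        # Second critic head
        s2 = F.relu(self.layer_4(s))
        s21 = torch.mm(s2, self.layer_5_s.weight.data.t())
        s22 = torch.mm(a, self.layer_5_a.weight.data.t())
        s2 = F.relu(s21 + s22 + self.layer_5_a.bias.data)
        q2 = self.layer_6(s2)
        return q1, q2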

Here, the decoder for the Q-value is very similar to the one we saw in the actor class. However, since we need to evaluate not the state itself, but rather the actor's response to it, our critic network needs not only the state information but also the action from the actor network as input. There are multiple ways to define the state-action pair as an input. The simplest form is to concatenate (or just append) the action values at the end of the one-dimensional vector representing the state. Then we could use the same form of decoder we saw in the actor network. But we will use a slightly different approach inspired by the method described here. Essentially, we treat the state and action as separate inputs to the critic network: we embed the state information with a single linear layer, pass the state embedding and the action values separately through the second layer, combine them, and obtain our value estimation in the final layer. Since we create both critic networks within a single class, we need to define these layers twice in the initialization. Layers 1, 2, 3 belong to the first critic and layers 4, 5, 6 to the second critic. Notice that layers 2_a and 5_a use action_dim as their input size, as this is the first layer where the actions are embedded.

In the forward pass, we define how the propagation takes place. First, only the state input is passed through the first layer and a ReLU activation is applied to it. Then, it is passed through the second layer. In parallel, the action values are also passed through a single layer with the same output size. The state embedding and the action are multiplied by the weights of their respective second layers. Each multiplication outputs a tensor of the same size with 600 parameters. These tensors can then be summed, and the bias of the second action layer added to them. (This method of infusing the action inputs into the neural network has worked well in practice.) Afterward, we use a final layer to map this combined state-action embedding to the Q-value estimation. Notice that there is no activation function here, as the Q-value does not have a min or max value, so no capping is needed. As output, we get the Q-values of both critic networks.

TD3 Network

Finally, we can implement the full TD3 network that will be the combination of our actor and critic networks. We can define a TD3 class and create our actors and critics in it.

In the TD3 network, we implement something called the soft update. Every time we optimize the network parameters, we only select a subset of the data, called a batch. That means that the optimized parameters of each network will be highly tuned to these samples. This, however, does not mean that we have found the best parameters for all possible situations. In fact, if we used these optimized parameters directly, our learning would be highly unstable, as each time the network parameters would be optimized for a different set of samples. That is why we keep a control or target network for each of our networks. On each training cycle, we optimize our base network and obtain new parameters. Then we blend a small fraction of these optimized base network parameters into the target network. This way, on each update, the target network is slightly nudged in the (hopefully) general direction of the optimal policy while keeping the training process stable. This is the reason why the initialization of the TD3 class not only needs to instantiate our actor and critic classes but also needs to create actor_target and critic_target networks. Afterward, we load any parameters that we might have already saved for each of the networks and set an Adam optimizer for each of them. Next, we set the maximum possible action value, create a writer that will record our data for Tensorboard visualization, and set a training iteration counter in iter_count.
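
A sketch of such a TD3 class initialization, assuming the Actor and Critic classes above, PyTorch's Adam optimizer, and a Tensorboard SummaryWriter:

from torch.utils.tensorboard import SummaryWriter

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class TD3(object):
    def __init__(self, state_dim, action_dim, max_action):
        # Base and target actor networks, started with the same parameters
        self.actor = Actor(state_dim, action_dim).to(device)
        self.actor_target = Actor(state_dim, action_dim).to(device)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters())

        # Base and target critic networks, started with the same parameters
        self.critic = Critic(state_dim, action_dim).to(device)
        self.critic_target = Critic(state_dim, action_dim).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters())

        self.max_action = max_action
        self.writer = SummaryWriter()  # Tensorboard logging
        self.iter_count = 0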

After initialization, we can define some additional helper functions in our TD3 class.

The get_action function obtains a single action, given an environmental state. For processing in the neural network, the state tensor is put on a device (cpu or cuda), but for use in the ROS simulator later, we require it to be placed specifically on the cpu, turned into a NumPy array, and returned as a one-dimensional array.
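
A possible implementation of this helper, assuming the device variable defined above and a NumPy array as the state input:

    def get_action(self, state):
        # Reshape the state into a batch of size 1 and move it to the device
        state = torch.Tensor(state.reshape(1, -1)).to(device)
        # Run the actor, move the result back to the cpu and flatten it
        return self.actor(state).cpu().data.numpy().flatten()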

Calling save and load functions simply saves or loads the actor and critic network parameters.
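
These can be as simple as the following sketch (the file naming here is illustrative):

    def save(self, filename, directory):
        # Store the current actor and critic parameters on disk
        torch.save(self.actor.state_dict(), "%s/%s_actor.pth" % (directory, filename))
        torch.save(self.critic.state_dict(), "%s/%s_critic.pth" % (directory, filename))

    def load(self, filename, directory):
        # Restore previously saved actor and critic parameters
        self.actor.load_state_dict(torch.load("%s/%s_actor.pth" % (directory, filename)))
        self.critic.load_state_dict(torch.load("%s/%s_critic.pth" % (directory, filename)))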

Probably the most important part we have to define in our class is the train function. This will be the crux of our TD3 implementation and will specify how the training of our robot motion is performed. First, let’s define the function and name its input variables.
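
The function signature might look as follows; the default values shown here are common TD3 choices and may differ from the repository:

    def train(
        self,
        replay_buffer,
        iterations,
        batch_size=100,
        discount=0.99,
        tau=0.005,
        policy_noise=0.2,
        noise_clip=0.5,
        policy_freq=2,
    ):
        # Running statistics for Tensorboard logging
        av_Q = 0
        max_Q = -float("inf")
        av_loss = 0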

replay_buffer defines where to get the training samples from, iterations defines how many training iterations to run in this function call, batch_size defines how many samples to draw in each iteration, discount is the discount factor in the Bellman equation (from theory), and tau is how much of the base network parameters we blend into the target network in the soft update. policy_noise is the standard deviation of the noise added to the actions in order to enable exploration, noise_clip is the min and max value to clip the noise with, and policy_freq is how often to update the actor network parameters. Additionally, we define variables to store the average and max Q-values as well as the average loss for displaying in Tensorboard.

Next, we start the training process by running the network optimization for iterations number of iterations.

We sample a batch of recorded robot motion experiences from the replay buffer. Then we calculate the next possible action for each next state in the batch and add some noise to it.
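
As a sketch, assuming the replay buffer exposes a sample_batch(batch_size) method that returns states, actions, rewards, done flags, and next states:

        for it in range(iterations):
            # Sample a batch of experiences from the replay buffer
            (
                batch_states,
                batch_actions,
                batch_rewards,
                batch_dones,
                batch_next_states,
            ) = replay_buffer.sample_batch(batch_size)
            state = torch.Tensor(batch_states).to(device)
            next_state = torch.Tensor(batch_next_states).to(device)
            action = torch.Tensor(batch_actions).to(device)
            reward = torch.Tensor(batch_rewards).to(device)
            done = torch.Tensor(batch_dones).to(device)

            # Obtain the next action from the target actor and perturb it with clipped noise
            next_action = self.actor_target(next_state)
            noise = torch.Tensor(batch_actions).data.normal_(0, policy_noise).to(device)
            noise = noise.clamp(-noise_clip, noise_clip)
            next_action = (next_action + noise).clamp(-self.max_action, self.max_action)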

Then we estimate the Q-values of this next state-action pair with both critic target networks and select the minimum of the two outputs for each sample in the batch. From the minimum values, we obtain the target Q-value using the Bellman equation. Subsequently, we obtain the Q-values of the base critic networks. By calculating the mean squared error between the base critic values and the target Q-value, we can estimate how far off our base network is. The trick here is that with the base network we are estimating the current state reward plus the discounted sum of all future rewards. The target Q-value, on the other hand, uses the already known current state reward and only estimates the future rewards. Therefore, the target value should be closer to the true value, and calculating the loss against it should nudge the network in the right optimization direction. Then we simply optimize the critic networks from the gradients of this loss.

Meanwhile, we also collect the average and max Q values in av_Q and max_Q.
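
Continuing the sketch inside the iteration loop:

            # Q values of the next state-action pair from both target critics
            target_Q1, target_Q2 = self.critic_target(next_state, next_action)
            # Take the minimum of the two estimates to reduce overestimation
            target_Q = torch.min(target_Q1, target_Q2)
            av_Q += torch.mean(target_Q).item()
            max_Q = max(max_Q, torch.max(target_Q).item())

            # Bellman equation: known reward + discounted future estimate
            target_Q = reward + ((1 - done) * discount * target_Q).detach()

            # Q values of the base critics for the recorded state-action pair
            current_Q1, current_Q2 = self.critic(state, action)

            # Mean squared error between the base critic values and the target Q value
            loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)

            # Optimize the critic networks
            self.critic_optimizer.zero_grad()
            loss.backward()
            self.critic_optimizer.step()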

The Delayed part of Twin Delayed DDPG (TD3) comes into play next. Essentially, we delay the updates of the actor network compared to the critic networks, as well as the soft update of the network parameters, by executing this step only every policy_freq iterations. This should increase the network stability as well as give better actor parameter updates (as the critic will be more precise by then). During the actor update, we calculate the Q-value for the state-action pair and optimize the actor parameters in the direction that gives us the highest possible Q-value. It might seem strange that we first use fixed actor parameters to obtain the critic loss, but then use the same critic to obtain the Q-value for the actor update. While this method might be unstable, the constant “pulling” of each network's policy back and forth eventually leads in the right direction, and we obtain (hopefully) optimal actor as well as critic networks.
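
A sketch of the delayed actor update, where the actor loss is the negated Q-value of the first critic for the actor's own action:

            if it % policy_freq == 0:
                # Maximize the Q value of the first critic for the actor's action
                actor_Q, _ = self.critic(state, self.actor(state))
                actor_loss = -actor_Q.mean()
                self.actor_optimizer.zero_grad()
                actor_loss.backward()
                self.actor_optimizer.step()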

Then we slightly nudge both target networks' parameters with the soft update.
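
Still inside the policy_freq branch, the soft update blends a fraction tau of the base parameters into the targets:

                # Soft update of the target networks
                for param, target_param in zip(
                    self.actor.parameters(), self.actor_target.parameters()
                ):
                    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
                for param, target_param in zip(
                    self.critic.parameters(), self.critic_target.parameters()
                ):
                    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)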

Finally, we simply record our Q-value and loss variables for display in Tensorboard.
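
For example, using the SummaryWriter created in the initialization (the scalar tag names are illustrative):

            # Accumulate the loss inside the iteration loop
            av_loss += loss.item()

        # After all iterations, log the averaged metrics to Tensorboard
        self.iter_count += 1
        self.writer.add_scalar("loss", av_loss / iterations, self.iter_count)
        self.writer.add_scalar("Av. Q", av_Q / iterations, self.iter_count)
        self.writer.add_scalar("Max. Q", max_Q, self.iter_count)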

This should create a TD3 neural network that can be used for learning robot motion. In the next part (Part 3), we will look at how to actually call this network in a mobile robot setting.

