Deep Reinforcement Learning in Mobile Robot Navigation Tutorial — Part 3: Training

Reinis Cimurs
11 min read · Oct 2, 2022


In Part 1 of this tutorial, we cloned and installed the GitHub repository and successfully launched our robot navigation training for the first time. In Part 2 we looked, in detail, at what the training neural network actually consists of. Now, my fellow researcher, it is time to see how the training of the neural network is actually carried out in a mobile robot motion setting. Once again, we will be looking at the code from the Python file:

train_velodyne_td3.py

Located in:

~<PATH_TO_FOLDER>/DRL-robot-navigation/TD3

The Robot and The World

Before we start the actual motion policy training of a mobile robot, we must first understand the task that we are trying to solve. To put it in human language, we are trying to “find the optimal sequence of actions that lead the robot to a given goal”. But how do we turn this problem into something that a computer can understand and execute? There are two things to consider — the action and the environment that the action acts upon. In a mobile robot setting, it is quite easy to express the action in a mathematical form. It is the force applied to each actuator of each controllable degree of freedom. To put it simply, it is how much we want to move in any controllable direction.

In this tutorial, we will be using a Pioneer P3DX simulated mobile robot platform. It is a non-holonomic, differential-drive mobile robot base that is capable of either moving back and forth or rotating around its axis. This means that we have two directions we can control — linear motion (moving forward/backward) and angular motion (rotating left/right). Both of these controls can be numerically expressed as a linear velocity in meters per second and an angular velocity in radians per second, which can then be easily processed by a neural network.

To react to the state of the environment, the robot needs to observe it in a logical and numerical way. One of the simplest ways of sensing the environment (in robotics) is a LiDAR or laser sensor. The logic here is very simple — the closer an object is to the laser sensor, the sooner the emitted light returns to the photoreceptor. This time of flight can then be used directly to calculate the distance to the object. Logically, we can speculate that a robot will make decisions based on how far it is from obstacles, and that a motion policy can be derived from that. This gives a simple numerical representation of the surroundings of the laser sensor (and, subsequently, the robot).

One last thing to consider is how the robot will know where it needs to go. We have a specific goal point in mind that the robot needs to reach, therefore this information needs to be communicated to it. To learn an ego policy, this information should be given relative to the robot's position. We will express it in polar coordinates: the distance to the goal position and the difference between the robot's heading and the heading towards the goal.

Now that we have formed a numeric expression of the environment and the agent, we can describe our state and action. As discussed, action is a tuple:

a = (v, ω)

where v is the linear velocity, and ω is the angular velocity.

For the state, however, things get a bit more complicated. We first need to represent the robot's surroundings in the environment. This we will do with the laser readings. Then the goal will be given in polar coordinates. But we might also want to give the neural network the current internal state of the robot, so that it knows how fast it is already moving. Therefore, we also include the action taken at the previous time-step in the state. Our final state representation will become a 1D vector combining:

s = (laser_state + polar_goal_coordinates + previous_action)

We will be using 20 laser readings (how we obtain them will be explained in detail in Part 4). That means our final state representation will be a 1D vector of 20 + 2 + 2 = 24 features.
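As a quick illustration (with made-up numbers, not values from the repository), the state vector could be assembled like this:

```python
import numpy as np

laser_state = np.ones(20)                # 20 laser distance readings
polar_goal = np.array([3.2, 0.5])        # distance to goal (m), heading difference (rad)
previous_action = np.array([0.3, -0.1])  # previous linear and angular velocity

# Final 1D state vector with 20 + 2 + 2 = 24 features
state = np.concatenate([laser_state, polar_goal, previous_action])
assert state.shape == (24,)
```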

Setting up the Training Network

So now that we have installed our repository, created a TD3 neural network, and understood how to represent a state, let us get to the meat and potatoes of actually training the mobile robot motion policy.

First, let's set up a bunch of parameters for the training. Hopefully, they are self-explanatory, or the comments explain their purpose.
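A sketch of what such a parameter block might look like (the names follow the variables mentioned throughout this post, but the exact default values in the repository may differ):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # training device
seed = 0                     # random seed for reproducibility
eval_freq = 5e3              # how many steps between evaluations (one epoch)
max_ep = 500                 # maximum number of steps per episode
eval_episodes = 10           # number of episodes used for each evaluation
max_timesteps = 5e6          # total number of training steps
expl_noise = 1               # initial exploration noise
expl_decay_steps = 500000    # steps over which the exploration noise decays
expl_min = 0.1               # minimum exploration noise after the decay
batch_size = 40              # mini-batch size for the network updates
discount = 0.99999           # discount factor for future rewards
tau = 0.005                  # soft target update rate
policy_noise = 0.2           # noise added to the target policy during critic updates
noise_clip = 0.5             # clipping range for the target policy noise
policy_freq = 2              # delayed actor update frequency
buffer_size = 1e6            # maximum replay buffer size
file_name = "TD3_velodyne"   # file name used for saving the policy
save_model = True            # whether to save the trained model
load_model = False           # whether to load a previously stored model
random_near_obstacle = True  # whether to force random actions near obstacles
```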

Afterward, we will set up some more parameters and directories.

We will create a folder to store the network if it does not already exist. Then we specify how many laser readings we will require from the simulation in the environment_dim variable. The robot_dim variable specifies the number of “internal robot parameters”, i.e. the polar coordinate features (2) + previous action features (2). We then call the creation of the simulation environment. The simulation environment creation parameters are stored in the file “multi_robot_scenario.launch”. To create the environment, we also need to pass in the number of laser readings we want it to return for each step. When creating the environment, some time might be required to start all the background processes in the Gazebo simulation. Therefore, we give it a little time by putting our program to sleep for 5 seconds. (This should be enough on most systems, but a longer sleep time might be required if issues arise.) Then we set the seeds for reproducibility, set the state_dim as discussed in the previous section, specify the number of actions our network needs to calculate in action_dim, and set the maximum possible action value that the neural network can return.
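Putting this description into code, this part of the script could look roughly as follows (it continues from the parameter sketch above; GazeboEnv is assumed to be the environment class from the repository's velodyne_env.py):

```python
import os
import time

import numpy as np
import torch

from velodyne_env import GazeboEnv

# Create folders for storing evaluation results and network parameters
if not os.path.exists("./results"):
    os.makedirs("./results")
if save_model and not os.path.exists("./pytorch_models"):
    os.makedirs("./pytorch_models")

# Number of laser readings and "internal robot parameters" (polar goal + previous action)
environment_dim = 20
robot_dim = 4

# Launch the Gazebo simulation described in multi_robot_scenario.launch
env = GazeboEnv("multi_robot_scenario.launch", environment_dim)
time.sleep(5)  # give Gazebo time to start its background processes

# Seeds for reproducibility
torch.manual_seed(seed)
np.random.seed(seed)

# Dimensions of the state and action, and the maximum network output value
state_dim = environment_dim + robot_dim
action_dim = 2
max_action = 1
```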

Next, we finally create and initialize the TD3 model.

Additionally, we create a replay buffer that stores the experiences the robot encounters in the simulation and serves as a dynamic dataset on which the neural network is optimized. If the load_model parameter is set to True, previously trained model parameters will be loaded into our model (if they exist). If not, the model will be initialized with random parameter values.
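A sketch of the model and replay buffer creation (TD3 is the network class from Part 2, and ReplayBuffer is assumed to come from the repository's replay_buffer.py; the constructor arguments may differ in your version):

```python
from replay_buffer import ReplayBuffer

# Create the TD3 network described in Part 2
network = TD3(state_dim, action_dim, max_action)

# Replay buffer acting as a dynamic dataset of collected experiences
replay_buffer = ReplayBuffer(buffer_size, seed)

# Optionally load previously trained parameters, otherwise keep the random initialization
if load_model:
    try:
        network.load(file_name, "./pytorch_models")
    except Exception:
        print("Could not load the stored model parameters, "
              "initializing training with random parameters")
```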

Before starting the training loop, some last parameters need to be set up.

We first create a list to store the results from evaluations and create variables for counting the elapsed time, steps, and epochs. Finally, we create variables for later use, when we will need to perform random actions near obstacles.
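These bookkeeping variables might be set up like this:

```python
# Storage for the evaluation results
evaluations = []

# Counters for the elapsed steps, episodes, and epochs
timestep = 0              # total number of steps taken so far
timesteps_since_eval = 0  # steps since the last evaluation
episode_num = 0           # number of completed episodes
epoch = 1                 # number of completed evaluation cycles
done = True               # force an environment reset on the first loop iteration

# Variables for forcing random actions near obstacles
count_rand_actions = 0
random_action = []
```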

Training Loop

Now that everything has been set up, we can start our training loop. We will base the amount of training on the maximum training steps, but other methods (such as the maximum number of epochs, elapsed time, etc.) can also be easily set up.

All the following code (unless stated otherwise) will be located within this loop.
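The outer loop itself is nothing more than a while loop over the step counter; everything described below sits inside it:

```python
# Run until the maximum number of training steps is reached
while timestep < max_timesteps:
    ...  # episode handling, action selection, environment step (described below)
```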

Note:
We will use the notions of episode and epoch throughout the text.

Episode — a sequence of consecutive steps until one of the termination conditions is reached (reaching the goal, colliding with an obstacle, or reaching the maximum number of steps set in the max_ep parameter).

Epoch — the number of episodes or timesteps between evaluations (set in the eval_freq parameter).

Let us set up the code that will execute at the conclusion of every episode. We place it at the beginning of the loop, as we want to perform the first reset of the environment right away to obtain the first observation of it (this is why we set the done parameter to True earlier).

On the first execution of the code, the timestep value will still be 0, so the network training will not take place (as we do not yet have anything to train on). However, at the completion of every subsequent episode, the neural network training will run. That means that we train the neural network after every episode. The number of training iterations is determined by the episode_timesteps value we pass in. The reasoning is that if there were more timesteps in the episode, we have collected more new samples and have more new data to optimize on. If we do not have many new samples, fewer training iterations are carried out. After this, we check if it is time to perform an evaluation of the current model. If so, we perform the evaluation, record the results, and save our model. Once these checks are complete, we reset our environment. This places the robot in a new random starting position and orientation, gives it a new randomly placed goal position, and starts a new episode. From the reset state, we obtain an observation of the new environment. Then we simply set our counting variables back to 0 and start executing the new episode.
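A sketch of this episode-termination block (the network.train, network.save, and env.reset calls follow the interfaces discussed in Parts 2 and 4; evaluate is the evaluation function shown at the end of this post, and details may differ in your version of the repository):

```python
# On the termination of an episode
if done:
    # Train the network on the collected samples (skipped on the very first iteration)
    if timestep != 0:
        network.train(
            replay_buffer,
            episode_timesteps,
            batch_size,
            discount,
            tau,
            policy_noise,
            noise_clip,
            policy_freq,
        )

    # Periodically evaluate and save the current model
    if timesteps_since_eval >= eval_freq:
        timesteps_since_eval %= eval_freq
        evaluations.append(evaluate(network, epoch, eval_episodes))
        network.save(file_name, directory="./pytorch_models")
        np.save("./results/%s" % file_name, evaluations)
        epoch += 1

    # Reset the environment and start a new episode
    state = env.reset()
    done = False
    episode_reward = 0
    episode_timesteps = 0
    episode_num += 1
```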

In DRL, it is common to follow a greedy strategy. That means that the neural network is tasked with selecting the best-calculated outcome for any input. Unfortunately, during training this may cause a vicious cycle of first calculating a good outcome, then following it and reinforcing the good result, without ever looking for possibly better alternatives. This is often referred to as an exploitation strategy. To find the best possible outcome, we must force the neural network to also explore the alternatives (exploration strategy). A common way to do this is to add random noise to the neural network output and observe the result in the environment.

We calculate the amount of noise we want to add to the action calculated by the neural network and save it as expl_noise. The noise level decreases over expl_decay_steps steps until it reaches the minimum expl_min value. We calculate the action with the neural network and update this value by adding the noise to it. However, robots have physical limits on their maximum and minimum velocities, so we clip the action value to this range to obtain the final action to test out in the environment.
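In code, the noise decay, the noisy action, and the clipping could look like this:

```python
# Decay the exploration noise over expl_decay_steps until it reaches expl_min
if expl_noise > expl_min:
    expl_noise = expl_noise - ((1 - expl_min) / expl_decay_steps)

# Calculate the action with the network and add Gaussian exploration noise
action = network.get_action(np.array(state))
action = (action + np.random.normal(0, expl_noise, size=action_dim)).clip(
    -max_action, max_action
)
```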

If you notice the way we obtained the random value to add to the action, you will see that we used the numpy.random.normal function. This gives us a random value from a Gaussian distribution. That means that if we averaged the noise over time, its mean would be very close to 0, as negative values are just as likely as positive ones. However, if the robot has already found itself in a critical position, it needs decisive action to get out of it, and random Gaussian noise at each step is not sufficient. Let us imagine that the robot is directly facing an obstacle. If it simply continues going forward, it will crash. Adding Gaussian noise will not help, as it evens itself out over sequential steps. Therefore, we implement an idea to force the robot to take a single decisive action for a random number of sequential steps. This forces the robot to randomly experience a sequence of actions that can take it out of the critical position.

If we choose to use this method (it can be turned off by setting random_near_obstacle to False), we might not want to use it at every interaction near an obstacle. So we query a random value and proceed with this method only if the random value is larger than 0.85. We also check if the obstacle in front of the robot is closer than 0.6 meters. Remember, our state is represented by 20 + 2 + 2 values, where the first 20 are the laser sensor readings. To check if an obstacle is in front of the robot (and not at its sides), we look at indices 4 through 15 of our state representation (state[4:-8]). We also check that we are not already locked into a sequential execution of this method. If all of this passes, we select a random number between 8 and 15 for how many steps we will execute the random action, and then sample the random action. On each of those steps we overwrite the action value and count down the random-action counter, as sketched below.
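A sketch of this forced-exploration trick, following the description above:

```python
# Occasionally force a consistent random action when an obstacle is directly ahead
if random_near_obstacle:
    if (
        np.random.uniform(0, 1) > 0.85  # use the trick only some of the time
        and min(state[4:-8]) < 0.6      # an obstacle in front is closer than 0.6 m
        and count_rand_actions < 1      # not already executing a random sequence
    ):
        count_rand_actions = np.random.randint(8, 15)  # length of the random sequence
        random_action = np.random.uniform(-1, 1, 2)    # sample the random action

    if count_rand_actions > 0:
        count_rand_actions -= 1
        action = random_action
        action[0] = -1  # force only a random rotation (see the note below)
```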

Note:
Here we set action[0] = -1 to force the robot to take only a random rotation. To set a fully random action, this line should be commented out or deleted.

Next, we need to ‘fix’ our action value. Our TD3 neural network model outputs 2 action values in the range [-1, 1]. But our laser sensor covers only a 180-degree field of view in front of the robot. That means it would be irresponsible of us to allow the robot to move backward, as there is no way for it to know what is behind it. So we must ensure that our robot has a minimum linear velocity of 0 meters per second. We do this in the following code.
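A sketch of the action rescaling, the environment step, and the bookkeeping that follows (env.step is assumed to return the new state, the reward, the done flag, and the target flag, as described below):

```python
# Rescale the linear velocity from [-1, 1] to [0, 1]; the angular velocity stays in [-1, 1]
a_in = [(action[0] + 1) / 2, action[1]]
next_state, reward, done, target = env.step(a_in)

# If the episode timed out, end it (done = 1) but mark the state as non-terminal
# for the Bellman update (done_bool = 0)
done_bool = 0 if episode_timesteps + 1 == max_ep else int(done)
done = 1 if episode_timesteps + 1 == max_ep else int(done)
episode_reward += reward

# Store the experience tuple in the replay buffer
replay_buffer.add(state, action, reward, done_bool, next_state)

# Update the running state and the counters
state = next_state
episode_timesteps += 1
timestep += 1
timesteps_since_eval += 1
```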

The first value in action represents the linear velocity, and we squeeze it into the range [0, 1]. This value we can pass to our Gazebo simulator environment, where it is executed. A new state is returned along with the state-action pair reward and two Boolean values: whether the episode concluded (in done) and whether the target was reached (in target). Next, we check if the maximum number of steps has been reached and update the done and done_bool values accordingly (in Python, int and bool values can be used interchangeably, but here we use int because the numeric value of the state's terminality is used in the Bellman equation). Afterward, we save the experience in the replay buffer, update the current running state (for which we will calculate the action in the next iteration of the training loop), and update the training counters.

This loop will run, calculate actions for each state, gather experiences, and train the neural network model until it concludes… or until we explicitly stop it by pressing CTRL+C in the terminal (which is my favored approach).

Once the loop concludes, we save the evaluation data and the final network parameters.
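The final save could look like this:

```python
# Run one final evaluation, then store the evaluation data and the trained model
evaluations.append(evaluate(network, epoch, eval_episodes))
if save_model:
    network.save(file_name, directory="./pytorch_models")
np.save("./results/%s" % file_name, evaluations)
```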

Evaluation

One final thing that we have not yet looked at is how to perform an evaluation. For this, we create a separate evaluation function.

Recall that this is called after the completion of every epoch. Let us set up some variables that will collect the data we might want to look at, such as the average reward and the collision rate. Then we start the evaluation loop, which will execute eval_episodes random evaluation runs. Here, the steps are the same as during training, except that we do not apply random noise to the calculated actions, as we are interested in how well the current model itself performs. Additionally, we record the reward and whether or not the robot crashed during the run. The hardcoded reward for a crash is -100, so we can use the reward to detect whether a collision occurred (alternatively, we could check whether done is True while target is False). After executing all of the evaluation runs, we calculate the average reward and collision rate and print them out in the terminal. Ideally, we should see a high positive average reward and a collision rate close to 0. We return the average reward, to be added to the evaluation results.
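A sketch of such an evaluation function (it uses the env created earlier; the collision check relies on the hardcoded -100 crash reward mentioned above):

```python
def evaluate(network, epoch, eval_episodes=10):
    avg_reward = 0.0
    col = 0  # collision counter
    for _ in range(eval_episodes):
        state = env.reset()
        done = False
        count = 0
        while not done and count < 501:
            # No exploration noise during evaluation
            action = network.get_action(np.array(state))
            a_in = [(action[0] + 1) / 2, action[1]]
            state, reward, done, _ = env.step(a_in)
            avg_reward += reward
            count += 1
            if reward < -90:  # the hardcoded collision reward is -100
                col += 1
    avg_reward /= eval_episodes
    avg_col = col / eval_episodes
    print("Average reward over %i evaluation episodes, epoch %i: %f, collision rate: %f"
          % (eval_episodes, epoch, avg_reward, avg_col))
    return avg_reward
```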

This concludes the training of the neural network model that learns a mobile robot motion policy from laser, goal, and robot motion inputs. Theoretically, any simulation environment could be used with this code. It does not need to be our Gazebo simulator, nor does it need to be a laser sensor that records the environment. The env variable simply needs a different simulation environment assigned to it, and setting a different environment_dim value would allow using a sensor with a different number of inputs. But we already have a working environment available with the Gazebo and ROS implementation, and that is exactly what we will be looking at in Part 4.
