Deep Reinforcement Learning in Mobile Robot Navigation Tutorial — Part 4: Environment

Reinis Cimurs
12 min read · Oct 19, 2022


In the previous parts of this tutorial (Part 1, Part 2, and Part 3), we installed and trained our neural network to move a mobile robot platform. We observe the environment and then take an action that will eventually lead the robot to our goal. But until now, the environment has been something rather abstract. We specified that actions are taken in the ROS Gazebo simulator, but how does this interaction actually happen? And how can we control it? To find the answer, we will look into the environment code in:

velodyne_env.py

Class Initialization

As mentioned, we will be using the ROS Gazebo simulator as the backbone to execute the actions with the robot and collect the environment observations. For this, we will create a Python class to facilitate the exchange between the neural network code and the simulator.

First, we will initialize our class. This will set up and automatically launch our simulator with the described settings.

As input parameters, the initialization function takes the name of the “launchfile”. A launchfile is a ROS file that describes the launch parameters of the simulation and is located in:

TD3/assets

In our example in Part 3, we passed in the “multi_robot_scenario.launch” file. The other parameter we pass in is the environment_dim value that we specified in our training file. Recall that this is the number of laser readings that we want to describe the surrounding environment with (in our case, 20 readings). During the initialization, we set up the initial class variables. We set up the starting odometry information in X and Y coordinates, as well as the placeholders for the goal X and Y coordinates. The self.upper and self.lower variables designate the maximum and minimum distances at which a random goal location can be placed from the robot (we will touch upon this later). Afterward, we create the self.velodyne_data variable that will store our state observations. Initially, we fill it with values of 10, since that is the maximum distance in meters that we specify for the laser readings. We also create a placeholder for the last received odometry data, as well as placeholders for specifying the set location of the robot (we will use these values when resetting the robot at specific locations).
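Roughly, this part of the initialization might look like the following sketch (the variable names follow the description above, but the exact default values may differ from the repository code):

```python
import numpy as np


class GazeboEnv:
    """Interface between the neural network code and the Gazebo simulator."""

    def __init__(self, launchfile, environment_dim):
        self.environment_dim = environment_dim

        # Starting odometry information
        self.odom_x = 0.0
        self.odom_y = 0.0

        # Placeholders for the goal coordinates
        self.goal_x = 1.0
        self.goal_y = 0.0

        # Limits for how far a random goal can be placed from the robot
        self.upper = 5.0
        self.lower = -5.0

        # Laser state observations, initialized to the 10 m maximum range
        self.velodyne_data = np.ones(environment_dim) * 10

        # Last received odometry message and the pose used when resetting the robot
        self.last_odom = None
        self.set_self_state = None  # later filled with a ModelState message
```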

In the next part, we create the gaps for binning the laser data.

Consider a laser reading with a field of view of 180 degrees (as is our case). Each laser value is recorded at a specific angle, and we could have many such recordings. Let us assume that we record 180 values in a 180-degree field of view. But we specified that we want to represent the environment with only 20 values (because a network trains faster with fewer input parameters). That means we need to somehow reduce 180 values down to 20. We do this by splitting the 180-degree field of view into 20 gaps and recording the minimum laser value in each gap. So during the initialization, we calculate the starting and ending angle of each gap and save them for later use.
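A sketch of how these gap boundaries could be computed (assuming a field of view from -90 to +90 degrees, split evenly into environment_dim bins):

```python
import numpy as np

# Split the 180-degree (pi radian) field of view into environment_dim gaps,
# each stored as [start_angle, end_angle] in radians, from -pi/2 to +pi/2.
self.gaps = [[-np.pi / 2, -np.pi / 2 + np.pi / self.environment_dim]]
for m in range(self.environment_dim - 1):
    self.gaps.append(
        [self.gaps[m][1], self.gaps[m][1] + np.pi / self.environment_dim]
    )
```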

Now that we have set up our placeholders, we can start initializing and launching our simulator.

We launch roscore at port 11311 by starting this process from within our initialization function.

Then launch the simulator with the input launchfile.
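In code, launching the ROS master and the simulation could look roughly like this (a sketch; the repository may handle paths and error cases differently):

```python
import os
import subprocess
import rospy

port = "11311"
# Start the ROS master on the chosen port
subprocess.Popen(["roscore", "-p", port])
rospy.init_node("gym", anonymous=True)

# Resolve the launchfile path inside TD3/assets
fullpath = os.path.join(os.path.dirname(__file__), "assets", launchfile)
if not os.path.exists(fullpath):
    raise IOError("File " + fullpath + " does not exist")

# Launch the Gazebo simulation described by the launchfile
subprocess.Popen(["roslaunch", "-p", port, fullpath])
```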

Now we can set up our ROS publishers and subscribers, which will allow us to send and receive the information from the simulator.

The self.vel_pub publisher will send the action information to the robot in the simulator in the form of linear and angular velocities via a Twist() message (Twist is a ROS message type). self.set_state will publish and place the robot at a specified location within the simulator. We will use this to randomly place the robot at the start of each new episode. The self.unpause and self.pause service proxies will unpause and pause the simulator, and self.reset_proxy resets the whole simulation. The publishers self.publisher / self.publisher2 / self.publisher3 will publish visual information about the location of the goal and the current linear and angular velocities in the RViz visualization of the simulation. We will receive the laser data from the Velodyne Puck LiDAR by using the self.velodyne subscriber (notice that it receives point cloud data instead of laser distances). Finally, we obtain the robot location within the simulator with the self.odom subscriber.
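Inside the class initialization, this setup might be sketched as follows (the topic names, e.g. /r1/cmd_vel, /velodyne_points, and /r1/odom, are assumptions based on the described robot configuration):

```python
import rospy
from geometry_msgs.msg import Twist
from gazebo_msgs.msg import ModelState
from nav_msgs.msg import Odometry
from sensor_msgs.msg import PointCloud2
from std_srvs.srv import Empty
from visualization_msgs.msg import MarkerArray

# Velocity commands for the robot
self.vel_pub = rospy.Publisher("/r1/cmd_vel", Twist, queue_size=1)
# Publisher for placing the robot at a chosen pose in Gazebo
self.set_state = rospy.Publisher("gazebo/set_model_state", ModelState, queue_size=10)
# Service proxies for unpausing, pausing, and resetting the simulation
self.unpause = rospy.ServiceProxy("/gazebo/unpause_physics", Empty)
self.pause = rospy.ServiceProxy("/gazebo/pause_physics", Empty)
self.reset_proxy = rospy.ServiceProxy("/gazebo/reset_world", Empty)
# RViz marker publishers for the goal and the current velocities
self.publisher = rospy.Publisher("goal_point", MarkerArray, queue_size=3)
self.publisher2 = rospy.Publisher("linear_velocity", MarkerArray, queue_size=1)
self.publisher3 = rospy.Publisher("angular_velocity", MarkerArray, queue_size=1)
# Sensor subscribers with their callbacks
self.velodyne = rospy.Subscriber("/velodyne_points", PointCloud2, self.velodyne_callback, queue_size=1)
self.odom = rospy.Subscriber("/r1/odom", Odometry, self.odom_callback, queue_size=1)
```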

A thing of note here is that the self.velodyne and self.odom subscribers invoke callback functions — self.velodyne_callback and self.odom_callback, respectively. Simply put, a callback is a function that will be called every time new data is available from this subscriber. Usually, this is done to process and prepare the data in some way.

The odom_callback simply stores the last received odometry data in the last_odom variable.
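In its simplest form:

```python
def odom_callback(self, od_data):
    # Keep only the most recent odometry message
    self.last_odom = od_data
```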

The velodyne_callback is a bit more involved.

We pointed out that the velodyne subscriber returns data as a point cloud. That means that each recorded laser reading is returned as an X, Y, Z coordinate. However, we want the laser readings to be represented just like a 2D LiDAR, where each laser value returns the distance to an obstacle. This means we need to process and then “flatten” the velodyne data before we can use it. First, we read the X, Y, and Z values from the incoming data. Then we create a placeholder where we will store the processed velodyne data. Initially, every value is set to 10, representing the maximum measurement distance. (Note: If a laser is pointed at empty space, no measurement is returned and the data point is set to NaN. However, the neural network cannot process NaNs, so in such cases we need to replace these values with the maximum measuring range.) Once we have read our data, we can start filtering each incoming data point. First, we filter out the ground by checking if the Z coordinate of the data point is larger than -0.2 (the point of origin for the data is the velodyne sensor). Then we calculate the distance to the data point. Once we have the distance, we calculate which gap this measurement belongs to. If the distance to the data point is smaller than the existing distance for this gap, we save it as the representative measure for this gap. This allows us to use a 3D Velodyne Puck LiDAR sensor to record obstacles at multiple heights, but still get the laser distance representation as a 1D vector. Simply put, we keep the minimum value for each gap and thus “flatten” the 3D laser data.
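A sketch of this callback is shown below. The angle calculation here uses atan2 rather than a dot-product formulation, but the idea is the same: bin each point by its horizontal angle and keep the minimum distance per gap:

```python
import math
import numpy as np
from sensor_msgs import point_cloud2 as pc2


def velodyne_callback(self, v):
    # Read the X, Y, Z coordinates of every point in the incoming cloud
    data = list(pc2.read_points(v, skip_nans=False, field_names=("x", "y", "z")))
    # Start every gap at the 10 m maximum range (this also covers NaN readings)
    self.velodyne_data = np.ones(self.environment_dim) * 10
    for x, y, z in data:
        # Filter out the ground (the sensor is the point of origin)
        if z > -0.2:
            # Horizontal angle of the point relative to the robot's forward axis
            beta = math.atan2(y, x)
            # Distance to the point
            dist = math.sqrt(x ** 2 + y ** 2 + z ** 2)
            # Keep the minimum distance recorded in the corresponding gap
            for j in range(len(self.gaps)):
                if self.gaps[j][0] <= beta < self.gaps[j][1]:
                    self.velodyne_data[j] = min(self.velodyne_data[j], dist)
                    break
```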

Environment representation by a 16-channel 3D Velodyne LiDAR

Single Step

Once we have initialized our class, we can specify the function that will carry out a single motion step in the simulator.

For a single step, we pass in an action to take and publish this action as a Twist() message to our robot. We also visualize the action and the goal position by publishing markers. Since we do not want the simulation to be running while we perform calculations in the background, we pause it whenever we are not carrying out an action. This means that to propagate the state with the current action, we must unpause our simulation, propagate it for TIME_DELTA amount of time (0.1 seconds in this implementation), and pause it again. This sequence performs the action for 0.1 seconds and allows us to read the new state afterward.
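The beginning of the step function might look roughly like this sketch (TIME_DELTA as described above; the pause/unpause service names follow standard Gazebo conventions):

```python
import time
import rospy
from geometry_msgs.msg import Twist

TIME_DELTA = 0.1  # seconds for which the action is applied


def step(self, action):
    # Publish the action as linear and angular velocity commands
    vel_cmd = Twist()
    vel_cmd.linear.x = action[0]
    vel_cmd.angular.z = action[1]
    self.vel_pub.publish(vel_cmd)
    self.publish_markers(action)

    # Unpause the simulation, let the action play out, then pause again
    rospy.wait_for_service("/gazebo/unpause_physics")
    try:
        self.unpause()
    except rospy.ServiceException:
        print("/gazebo/unpause_physics service call failed")

    time.sleep(TIME_DELTA)

    rospy.wait_for_service("/gazebo/pause_physics")
    try:
        self.pause()
    except rospy.ServiceException:
        print("/gazebo/pause_physics service call failed")
```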

Now we read the state.

Our callback functions constantly update velodyne_data in the background at a high frequency, so we will have received the newest laser observations right after propagating the state. With this information, we can calculate whether a collision has occurred and do some quick data manipulation to save the laser_state (we copy it so as not to mutate the velodyne data, and prepare the correct list form). From the odometry subscriber, we obtain the location information of the robot. The robot's heading is returned in the form of a quaternion. To turn it into something that a human can actually comprehend, we translate it into an Euler angle and round it to 4 digits after the decimal point.
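Continuing inside step(), the state reading could be sketched as follows (the quaternion-to-yaw conversion is written out explicitly here; the repository may use a helper library instead):

```python
import math

# Collision check and a copy of the laser readings
done, collision, min_laser = self.observe_collision(self.velodyne_data)
laser_state = [list(self.velodyne_data)]

# Robot position from the last odometry message
self.odom_x = self.last_odom.pose.pose.position.x
self.odom_y = self.last_odom.pose.pose.position.y

# Convert the orientation quaternion to a yaw (Euler) angle
q = self.last_odom.pose.pose.orientation
yaw = math.atan2(2.0 * (q.w * q.z + q.x * q.y),
                 1.0 - 2.0 * (q.y * q.y + q.z * q.z))
angle = round(yaw, 4)
```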

Now that we know the location and heading direction of the robot, we can calculate the polar coordinates of the goal.

For the distance to the goal, we use the numpy.linalg.norm() function, which returns the length of the given vector. If we pass in the coordinates of the goal with respect to the robot's location, this gives us the distance. Then we implement a bit of a hacky way to calculate the difference in heading between the vector towards the goal and the robot heading (feel free to update this part of the code and get back to me with a better solution). Essentially, we need to align the coordinate frames and obtain the difference between the angles. This gives us distance and theta, which represent the polar coordinates of the goal in the robot's frame.
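A functionally similar (slightly tidier) version of this calculation is sketched below; the repository's own implementation arrives at the same distance and theta through a dot product and manual sign handling:

```python
import math
import numpy as np

# Distance from the robot to the goal
distance = np.linalg.norm([self.odom_x - self.goal_x, self.odom_y - self.goal_y])

# Angle of the vector pointing from the robot to the goal, in the world frame
skew_x = self.goal_x - self.odom_x
skew_y = self.goal_y - self.odom_y
beta = math.atan2(skew_y, skew_x)

# Difference between the goal direction and the robot heading,
# wrapped to the (-pi, pi] interval
theta = beta - angle
if theta > np.pi:
    theta -= 2 * np.pi
elif theta < -np.pi:
    theta += 2 * np.pi
```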

With the obtained distance information, we can see if we have reached the goal.

If the distance is smaller than the GOAL_REACHED_DIST threshold (here it is 0.3 meters), we have arrived at our goal.

Finally, we can combine the full state information and return it along with other relevant information.

We combine the robot state with the laser state. Then we obtain the reward. In the end, we return the state, the reward, whether the episode has concluded for some reason, and whether we have reached the target.
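The end of the step function could then be sketched as follows (GOAL_REACHED_DIST as described above; the exact ordering of the state vector is an assumption):

```python
import numpy as np

GOAL_REACHED_DIST = 0.3  # meters

# Check whether the goal has been reached
target = distance < GOAL_REACHED_DIST
if target:
    done = True

# Polar goal coordinates plus the current action form the robot state
robot_state = [distance, theta, action[0], action[1]]
state = np.append(laser_state, robot_state)

reward = self.get_reward(target, collision, action, min_laser)
return state, reward, done, target
```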

Note: You will notice that we used a couple of class functions here. First, there is the function that checks whether we have collided with anything.

We check the laser data and select the minimum value. Naturally, if this minimum value is smaller than the COLLISION_DIST threshold, we return that a collision has happened. We also return the minimum laser value, which will be used in the reward calculation.
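A minimal sketch of this check (the COLLISION_DIST value here is an assumption):

```python
COLLISION_DIST = 0.35  # meters


@staticmethod
def observe_collision(laser_data):
    # A collision occurred if any laser reading falls below the threshold
    min_laser = min(laser_data)
    if min_laser < COLLISION_DIST:
        return True, True, min_laser  # done, collision, closest obstacle distance
    return False, False, min_laser
```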

Speaking of reward calculation…

Reward

We give a large positive reward if the robot has arrived at the goal, a negative reward if the robot has collided with anything, and an immediate reward otherwise. The immediate reward in its basic form can be expressed as:

r = v - |ω|

The idea behind it is that the robot needs to realize that it should be moving around and not just sitting in a single spot. By setting a positive reward for linear motion, the robot first learns that moving forward is good and rotating is not. Even though it crashes a lot in the beginning, the episode reward is still higher than just sitting in one spot and rotating. Soon it learns that taking a turn near obstacles is more beneficial than crashing; even though turning still carries a penalty, it is less negative than crashing. With this, the robot soon learns to avoid obstacles with smooth motion, as the smaller the rotation, the smaller the penalty for turning. While running around the environment, the robot will randomly end up reaching goals and in time realize that the benefit of reaching the goal outweighs the penalty for turning. Even if it ends up in a pocket, the robot will still know that being in forward motion gives it a positive reward and will thus unintentionally look for a way out of the pocket. However, this method does not give you the shortest possible path to the goal, as the robot will prefer to keep some linear motion even while rotating, since that gives a better reward. That means that the robot will take a wide turn just to turn around, if possible, which increases the total path length. But it generally gives the robot very smooth motion. The size of the turn can be reduced by giving a smaller penalty for turning.

Additionally, we add the term r3, which is calculated by our lambda function. It gives an additional negative reward if the robot is closer than 1 meter to any obstacle. This ‘repulsion’ makes the robot warier of obstacles in general and makes it move around them with a larger gap.
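Putting the pieces together, the reward function might look roughly like this (the exact reward magnitudes and scaling factors are assumptions; the structure follows the description above):

```python
@staticmethod
def get_reward(target, collision, action, min_laser):
    if target:
        return 100.0   # large positive reward for reaching the goal
    elif collision:
        return -100.0  # large negative reward for crashing
    else:
        # r3: 'repulsion' term, active only within 1 meter of an obstacle
        r3 = lambda x: 1 - x if x < 1 else 0.0
        # Immediate reward: encourage linear motion, penalize turning
        # and being close to obstacles
        return action[0] / 2 - abs(action[1]) / 2 - r3(min_laser) / 2
```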

Reset

If an episode has concluded (recall that an episode is a collection of sequential steps until a collision, reaching the target, or reaching the maximum number of steps), we need to reset the simulation. We can do this with the class function reset.

Here, we reset the simulation environment to its initial state.

However, we would like each episode to start with a different initial position and pose of the robot. This varies the experiences that the robot will encounter and helps our network generalize better.

For this reason, we sample a random Euler angle value and turn it into a quaternion. This will serve as a random robot heading direction. Afterward, we need to sample random X and Y coordinates as the robot's starting location for the episode. Our simulation environment is a walled section of size 10 by 10 meters, and we would like the robot to be randomly spawned within this walled section. Therefore, we want the random values to be in the range from -4.5 to 4.5 meters around the point of origin of the environment (keeping a 0.5-meter distance from the walls). But we also have other obstacles in the environment, and we do not want to spawn our robot on top of them. So we randomly select X and Y coordinates until they are not located within the ‘dead zones’ around the obstacles. Now that we have our coordinates and heading, we update the robot position and place it there with the set_state publisher. Finally, we save the coordinates as the odometry information.
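A sketch of this part of the reset function (the quaternion is built directly from the sampled yaw angle; set_self_state is assumed to be a ModelState message prepared during initialization):

```python
import math
import numpy as np
import rospy

# Reset the Gazebo world to its initial state
rospy.wait_for_service("/gazebo/reset_world")
try:
    self.reset_proxy()
except rospy.ServiceException:
    print("/gazebo/reset_world service call failed")

# Sample a random heading and convert it to a quaternion (rotation about Z)
angle = np.random.uniform(-np.pi, np.pi)
qz, qw = math.sin(angle / 2.0), math.cos(angle / 2.0)

# Sample a random position until it lies outside all obstacle 'dead zones'
position_ok = False
while not position_ok:
    x = np.random.uniform(-4.5, 4.5)
    y = np.random.uniform(-4.5, 4.5)
    position_ok = self.check_pos(x, y)

# Place the robot at the sampled pose
object_state = self.set_self_state
object_state.pose.position.x = x
object_state.pose.position.y = y
object_state.pose.orientation.z = qz
object_state.pose.orientation.w = qw
self.set_state.publish(object_state)

# Save the new position as the current odometry information
self.odom_x = x
self.odom_y = y
```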

Then we randomly place a goal in the environment by calling a function that we will explain in detail a bit later.

We have placed 4 cardboard boxes in our simulation. To vary the environment even more, we can randomly change their positions at the start of each episode as well. We will also explain this function in detail in a bit. Then we reset the markers that publish the action information. Since we do not take any action when resetting the environment, we can simply pass in zeros as the action values here.

Random starting positions and poses of the robot after every reset call. The cardboard boxes also change their locations on every reset.

After placing the robot, goal, and boxes in the environment, we can observe it and return the first state. The observation is very similar to what we already saw in the step function.

Propagate the state again to get the values back from our subscribers. Do the calculations, combine the laser and robot state information, and return the state.

Here we used three additional functions, the first of which is change_goal.

We set up the class variables upper and lower during the initialization. They describe the range in which the goal can be randomly placed. This allows us to use a simple ‘bootstrapping’ method to help the network learn that it needs to navigate to the goal. By placing the goal near the robot early on in the training, the robot has a higher chance of stumbling upon it. This means that it will collect more experiences of navigating to the goal, which helps the network learn from good samples. So when placing the random goal, we place it in the vicinity of the randomly placed robot position, at a range from lower to upper. Every time we place a new goal, we widen the lower and upper limits. Before accepting the goal position, we also need to check that it is not placed on any known obstacle.
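A sketch of change_goal (the increment applied to the limits on each call is an assumed value):

```python
import random


def change_goal(self):
    # Slowly widen the range in which goals can appear ('bootstrapping')
    if self.upper < 10:
        self.upper += 0.004
    if self.lower > -10:
        self.lower -= 0.004

    # Sample goal positions around the robot until one is obstacle-free
    goal_ok = False
    while not goal_ok:
        self.goal_x = self.odom_x + random.uniform(self.lower, self.upper)
        self.goal_y = self.odom_y + random.uniform(self.lower, self.upper)
        goal_ok = self.check_pos(self.goal_x, self.goal_y)
```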

Next, we have the random_box function, which places the boxes in random locations in the environment.

The logic here is very similar to the change_goal function, except we also want to check that the box placement is not too close to the robot or the goal. This is done sequentially for all 4 boxes in our environment.

We also have the publish_markers function, which publishes visual information about the action values as well as the goal position in RViz.

The final function that we need to look at is check_pos. This function simply checks whether the passed-in X and Y values fall within the range of any of the obstacles in the environment.

Note: These ranges are determined manually and are specific to the default map used in the repository. If you would like to use a different map for training, you need to manually input the ranges of the ‘dead zones’ around the obstacles in this function.
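A sketch of what such a check might look like; the dead-zone rectangles below are purely illustrative and must be replaced with the values matching your own map:

```python
def check_pos(self, x, y):
    # Illustrative dead zones as (x_min, x_max, y_min, y_max) rectangles
    dead_zones = [
        (-4.5, -2.5, 2.5, 4.5),
        (1.0, 3.0, -4.0, -2.0),
    ]
    for x_min, x_max, y_min, y_max in dead_zones:
        if x_min < x < x_max and y_min < y < y_max:
            return False
    # Reject positions outside the walled 10 x 10 m section
    if abs(x) > 4.5 or abs(y) > 4.5:
        return False
    return True
```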

This concludes the code overview of this repository, and I hope it is sufficient to start your own research in the field of Reinforcement Learning for mobile robot navigation in the ROS simulator. If nothing else, I hope it can help guide your own ideas and give an overview of why certain things are done, or simply show how to connect your network to the ROS simulator.

In Part 5, we will not be looking at the code anymore, but we will look at some tips and tricks on how to speed up training and other miscellaneous information.
