Accelerated robot training through simulation in the cloud with ROS and Gazebo

Introducing the robot_gym framework

Víctor Mayoral Vilches
9 min read · Sep 4, 2018

The content of this article comes from “robot_gym: accelerated robot training through simulation in the cloud with ROS and Gazebo”, available at https://arxiv.org/pdf/1808.10369.pdf. Written together with Alejandro Hernández, Asier Bilbao Calvo, Irati Zamalloa Ugarte and Risto Kojcev.

A modular articulated arm with 6DoF used to validate the robot_gym framework.

Rather than programming, training allows robots to achieve behaviors that generalize better and are capable of responding to real-world needs. However, such training requires a large amount of experimentation, which is not always feasible for a physical robot. In this work, we present robot_gym, a framework to accelerate robot training through simulation in the cloud that makes use of roboticists’ tools, simplifying the development and deployment processes on real robots. We show that, for simple tasks, simple 3DoF robots require more than 140 attempts to learn. For more complex 6DoF robots, the number of attempts increases to more than 900 for the same task. We demonstrate that, for simple tasks, our framework accelerates the robot training time by more than 33% while maintaining similar levels of accuracy and repeatability.

Introduction

Reinforcement Learning (RL) has recently gained attention in the robotics field. Rather than programming, it allows roboticists to train robots, producing results that generalize better and are able to comply with the dynamic environments typically encountered in robotics. Furthermore, RL techniques, if used in combination with modular robotics, could empower a new generation of robots that are more adaptable and capable of performing a variety of tasks without human intervention [1].

While some results showed the feasibility of using RL on real robots [2], such an approach is expensive: it requires hundreds of thousands of attempts (performed by a group of robots) over a period of several months. Since these capabilities are available only to a restricted few, training in simulation has gained popularity. The idea behind using simulation is to train a virtual model of the real robot until the desired behavior is learned and then transfer the knowledge to the real robot. The behavior can be further enhanced by exposing the robot to a restricted number of additional training iterations. Following the initial releases of OpenAI’s gym [3], many groups started using the MuJoCo [4] physics engine. Others have used the Gazebo robot simulator [5] in combination with the Robot Operating System (ROS) [6] to create gym_gazebo [7], an environment with the common tools used by roboticists.
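As a side note, gym_gazebo-style environments expose Gazebo through the standard gym interface, so the training code interacts with the simulator via the usual reset/step loop. The snippet below is a minimal sketch of that loop; the environment id is hypothetical and the 4-tuple returned by step() assumes the classic gym API.

```python
import gym
import gym_gazebo  # noqa: F401 -- registers the Gazebo-backed environments

env = gym.make("GazeboModularScara3DOF-v3")  # hypothetical environment id
observation = env.reset()

for _ in range(1000):
    action = env.action_space.sample()                   # random placeholder policy
    observation, reward, done, info = env.step(action)   # one simulation step in Gazebo
    if done:
        observation = env.reset()

env.close()
```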

In this work, we introduce an extension of gym_gazebo, called robot_gym, that makes use of container technology to deploy experiments in a distributed way, accelerating the training process through a framework for distributed RL. We aim to answer the following questions: by how much is it possible to accelerate robot training time with RL techniques, and what is the associated cost of doing so? Our experimental results show a significant reduction of the training time. Compared to standard RL approaches, we achieve time reductions of up to 50% for simple tasks.


The robot_gym framework

Most robots operate in a continuously changing environment, which makes generalizing a given task extremely hard. RL and, particularly, policy gradient methods are among the techniques that allow for the development of more adaptive behaviors. However, even the simplest tasks demand long periods of training time. This aspect becomes especially relevant in robotics, where the time spent gathering experience from the environment (rollouts) is significantly larger than the time spent computing gradients or updating the policy being trained.

The robot_gym architecture, where a) pictures the worker orchestration, b) corresponds to the policy initialization, c) represents each one of the workers, d) the policy merge task and e) the policy update process.

To reduce the overall training time, this work proposes robot_gym, a framework for deploying robotic experiments on distributed workers that aims to reduce the time spent gathering experience from the environment and, overall, the training time of robots when using RL. The figure above pictures the architecture of the framework, which has been inspired by previous work [8]. For a complete description of the framework, refer to the original article.
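To give a flavour of what the figure describes, here is a minimal, self-contained sketch of the loop: initialize a global policy (b), let each worker gather rollouts with a copy of it (c), merge the rollouts into a single batch (d) and update the global policy (e). The Policy and Worker classes below are placeholders standing in for the PPO learner and the simulated robots, not the actual robot_gym API.

```python
from dataclasses import dataclass, field
from typing import List
import random


@dataclass
class Policy:
    # stand-in for the global policy (theta in the figure)
    weights: List[float] = field(default_factory=lambda: [0.0, 0.0])

    def update(self, batch: List[float]) -> None:
        # placeholder "gradient step": nudge the weights using the batch mean reward
        mean_reward = sum(batch) / len(batch)
        self.weights = [w + 0.01 * mean_reward for w in self.weights]


class Worker:
    # stand-in for one simulated robot; in robot_gym this wraps a Gazebo/ROS instance
    def collect_rollouts(self, weights: List[float], n: int = 32) -> List[float]:
        return [random.random() for _ in range(n)]  # dummy per-step rewards


def train(policy: Policy, workers: List[Worker], num_updates: int) -> Policy:
    for _ in range(num_updates):
        rollouts = [w.collect_rollouts(policy.weights) for w in workers]  # (c) gather
        batch = [step for rollout in rollouts for step in rollout]        # (d) merge
        policy.update(batch)                                              # (e) update
    return policy


train(Policy(), [Worker() for _ in range(4)], num_updates=10)
```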

Experimental results

To validate the framework, we ran experimental tests in simulation and deployed the results both on simulated and on real robots, obtaining similar results. The robots used in our experiments were built using the H-ROS [9] technology, which simplifies the process of building, configuring and re-configuring robots.

Modular robots used for the experimental testing. Left, a modular robot arm in a SCARA configuration with 3DoF. Right, a modular articulated arm with 6DoF.

We experiment with two modular robots: a 3 Degrees-of-Freedom (DoF) robot in a SCARA configuration and a 6DoF modular articulated arm. We analyze the impact of distributing different numbers of workers across several machines on the number of iterations the robot needs to converge. The goal is to reach a specific position in space, and training is stopped when the algorithm reaches the target reward of zero. Rewards are heuristically determined from the distance to the target point.
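As an illustration of such a heuristic, the reward can simply be the negative Euclidean distance between the end effector and the target, which approaches zero as the robot reaches the goal; the exact shaping used in robot_gym may differ.

```python
import numpy as np

def reward(end_effector_xyz, target_xyz):
    # negative Euclidean distance: grows towards zero as the arm approaches the target
    return -float(np.linalg.norm(np.asarray(end_effector_xyz) - np.asarray(target_xyz)))

# Example: end effector 5 cm away from the target along x
print(reward([0.35, 0.0, 0.20], [0.40, 0.0, 0.20]))  # -> -0.05 (approximately)
```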

During our experimentation, we use Proximal Policy Optimization (PPO) [10], a state-of-the-art policy gradient technique which alternates between sampling data through interaction with the environment and optimizing a “surrogate” objective by clipping the policy probability ratio.
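For intuition, the clipped surrogate objective can be sketched in a few lines of NumPy: the probability ratio between the new and old policies is clipped to [1 − ε, 1 + ε] so that a single update cannot move the policy too far. This is only an illustrative sketch of the objective, not the actual implementation used in our experiments.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    # probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    advantages = np.asarray(advantages)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # PPO maximizes the element-wise minimum of the two terms
    return float(np.mean(np.minimum(unclipped, clipped)))
```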

Modular robotic arm in a SCARA configuration (3DoF)

We launched our experiment with 1, 2, 4 and 8 workers using 12 replicas. Within robot_gym, the Ray library is in charge of distributing the workers among the available replicas.
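The sketch below illustrates the idea with Ray core primitives only (remote actors collecting rollouts in parallel, scheduled by Ray on whatever nodes have joined the cluster); it is not robot_gym’s actual launch code, and the RolloutWorker class is a placeholder.

```python
import ray

ray.init()  # ray.init(address="auto") would instead join an existing multi-node cluster


@ray.remote
class RolloutWorker:
    def collect(self, weights, episode_len=100):
        # in robot_gym each worker steps its own Gazebo/ROS environment;
        # here we simply return a dummy rollout of the requested length
        return [0.0] * episode_len


num_workers = 8
workers = [RolloutWorker.remote() for _ in range(num_workers)]

weights = None  # the current global policy parameters would be passed here
futures = [w.collect.remote(weights) for w in workers]
rollouts = ray.get(futures)  # blocks until every worker has finished its rollout
print(len(rollouts), "rollouts collected in parallel")
```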

“time” vs “reward” during approximately 700,000 iterations with different numbers of workers.

The reward obtained during training is plotted against wall-clock time in the figure above. Each curve illustrates the same robot (the 3DoF modular arm in a SCARA configuration) trained with the robot_gym framework under a different number of workers distributed among the 12 available replicas. Using 1 worker, the robot takes approximately 600 seconds to reach a mean target reward of zero, that is, about 10 minutes of training time. Using 2 workers, the training time required to reach the target reward drops to approximately 400 seconds (6.6 minutes). When using 4, 8 or more workers, it drops to approximately 300 seconds (5 minutes). From these results, we conclude that, through the use of the robot_gym framework and by distributing the rollout acquisition across a number of workers, we are able to reduce the training time by half in a simple robotics motion task.


The second plot in the figure above illustrates the mean reward obtained versus the number of iterations required. With this plot, we aim to visualize the sample efficiency of using a variable number of workers. As can be seen, the more workers we use, the less sample efficient the training becomes. This remains an open problem that we have observed across a variety of experimental tests comprising different simple robotics tasks.

We deployed the resulting global model, denoted with θ, onto a) a simulated robot and b) a real robot. In both cases, the behavior displayed follows what was expected: the robots move their end effector towards the target point. Accuracy is calculated as the mean squared error between the target and the final end-effector position when executing a trained RL network. Repeatability is calculated as the mean squared error between the mean of 10 experimental runs and the final end-effector position when executing the network.
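The two metrics can be sketched as follows, computed over the final end-effector positions of the evaluation runs (10 in our case); the helper names are illustrative, not part of robot_gym.

```python
import numpy as np

def accuracy_mse(final_positions, target):
    # mean squared error between the target and each run's final end-effector position
    final_positions = np.asarray(final_positions, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.sum((final_positions - target) ** 2, axis=1)))

def repeatability_mse(final_positions):
    # mean squared error between the mean final position and each run's final position
    final_positions = np.asarray(final_positions, dtype=float)
    mean_position = final_positions.mean(axis=0)
    return float(np.mean(np.sum((final_positions - mean_position) ** 2, axis=1)))
```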

Accuracy and repeatability (in mm) obtained in a 3DoF modular robot in a SCARA configuration when trained with the robot_gym framework for a simple robot motion task (reach a given point in the workspace).

Modular robotic articulated arm (6 DoF)

In this second experiment, we train in simulation a 6DoF modular articulated arm, as shown in the picture above. The objective remains similar: reach a given position in space. However, in this case the robot includes additional degrees of freedom, which makes the search space much bigger and hence the overall task more complicated, requiring additional training time.

We launched our experiment with 1, 2, 4, 8 and 16 workers, using 12 replicas. From our experiments, we noted that this second scenario is much more sensitive to hyperparameters than the previous robot: fine-tuning was required for each combination of workers in order to make the training converge towards the goal.

Left: “time” vs “mean reward” with different numbers of workers. Right: “iterations” vs “mean reward” with different numbers of workers for the 6DoF robot.

The figure above displays the results of training the 6DoF robot with the robot_gym framework. For a single worker, the time required to train the model until the mean reward reaches zero is about 3000 seconds (50 minutes). Adding workers (4 and 8) reduces the time required to reach the target down to about 2000 seconds (33.3 minutes), a reduction of more than 33% compared to training with a single worker.

Conclusions and future work

This work introduced robot_gym, a framework to accelerate robot training using Gazebo and ROS in the cloud. We tackled the problem of accelerating the training time of robots that use reinforcement learning techniques by distributing the load between several replicas in the cloud. We described our method and showed the impact of using different numbers of workers across replicas.

We demonstrated with two different robots how the training time can be reduced by more than 33% in the worst case, while maintaining similar levels of accuracy. As to the question of how much it costs to train a robot in the cloud: our cloud solution provider charged us a total of 214.80 € for two weeks of experimentation. The 12 replicas were run on demand in parallel (only active when needed), and 1603 hours of cloud computing were used in total (0.134 €/hour per instance, or 1.606 €/hour with all the replicas running at the same time).
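As a back-of-the-envelope check of these figures (not part of the original article): 1603 hours at 0.134 €/hour indeed adds up to roughly 214.80 €.

```python
hours_used = 1603            # total instance-hours over the two weeks
rate_per_instance = 0.134    # EUR per hour and instance
replicas = 12

total_cost = hours_used * rate_per_instance       # ~214.80 EUR, matching the reported bill
all_replicas_rate = replicas * rate_per_instance  # ~1.61 EUR/hour with all 12 replicas running
print(round(total_cost, 2), round(all_replicas_rate, 3))
```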

The sample efficiency remains an open problem to tackle in future work.

References

[1] V. Mayoral, R. Kojcev, N. Etxezarreta, A. Hernandez, and I. Zamalloa. Towards self-adaptable robots: from programming to training machines. ArXiv e-prints, Feb. 2018.

[2] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen. Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. ArXiv e-prints, Mar. 2016.

[3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym, 2016.

[4] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.

[5] N. Koenig and A. Howard. Design and use paradigms for gazebo, an open-source multi-robot simulator. In Intelligent Robots and Systems, 2004.(IROS 2004). Proceedings. 2004 IEEE/RSJ International Conference on, volume 3, pages 2149–2154. IEEE, 2004.

[6] M. Quigley, B. Gerkey, K. Conley, J. Faust, T. Foote, J. Leibs, E. Berger, R. Wheeler, and A. Ng. Ros: an open-source robot operating system. In Proc. of the IEEE Intl. Conf. on Robotics and Automation (ICRA) Workshop on Open Source Robotics, Kobe, Japan, May 2009.

[7] I. Zamora, N. Gonzalez Lopez, V. Mayoral Vilches, and A. Hernandez Cordero. Extending the OpenAI Gym for robotics: a toolkit for reinforcement learning using ROS and Gazebo. ArXiv e-prints, Aug. 2016.

[8] E. Liang, R. Liaw, P. Moritz, R. Nishihara, R. Fox, K. Goldberg, J. E. Gonzalez, M. I. Jordan, and I. Stoica. Ray RLlib: A Framework for Distributed Reinforcement Learning. ArXiv e-prints, Dec. 2017.

[9] V. Mayoral, A. Hernandez, R. Kojcev, I. Muguruza, I. Zamalloa, A. Bilbao, and L. Usategi. The shift in the robotics paradigm: The hardware robot operating system (H-ROS); an infrastructure to create interoperable robot components. In 2017 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), pages 229–236, July 2017. doi:10.1109/AHS.2017.8046383.

[10] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
