Nuro Team · Jan 30, 2024

Solving autonomous driving while maintaining high levels of safety is a challenging task. At Nuro, we have a very high bar for safety and have been developing systems and methods to ensure we reach that bar. One way to ensure the safety of our driving system is to rely on hand-tuned rules; however, in keeping with our AI-first approach, we have been investigating ML methods that can learn safe driving behaviors at scale. One of the promising directions we have been investing in is the application of Safe Reinforcement Learning (Safe RL) in different areas of our stack. To effectively leverage reinforcement learning on a problem as challenging as driving, we need powerful distributed training systems that enable learning at scale.

The reinforcement learning problem

Reinforcement learning is, at its core, a method for training an ML model through trial and error. An RL training setup has two parts: a "learner" and a method for generating experience. Experience is generated through "exploration," in which the agent takes exploratory (sometimes random) actions to reach new situations and discover which states are safe and which are not. The resulting experience is sent back to the learner, which "reinforces" good behavior and penalizes bad behavior.
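
To make this loop concrete, here is a minimal sketch of the exploration/learner cycle on a toy problem. Everything in it (the environment, the tabular value estimates, the update rule) is an illustrative placeholder, not Nuro's actual setup.

```python
# Minimal sketch of the explore -> learn loop described above.
# The environment, policy, and update rule are illustrative placeholders.
import random

class ToyEnv:
    """Tiny stand-in for a simulator: states 0..9, actions +1/-1, state 9 is the goal."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, min(9, self.state + action))
        reward = 1.0 if self.state == 9 else -0.01   # reinforce reaching the goal
        done = self.state == 9
        return self.state, reward, done

def explore(env, q, epsilon=0.2, max_steps=50):
    """Generate one episode of experience, taking random actions with probability epsilon."""
    trajectory, state = [], env.reset()
    for _ in range(max_steps):
        greedy = 1 if q[state][1] >= q[state][0] else -1
        action = random.choice([-1, 1]) if random.random() < epsilon else greedy
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory

def learn(q, trajectory, lr=0.1, gamma=0.99):
    """The 'learner': reinforce actions that led to high reward (tabular Q-learning)."""
    for state, action, reward, next_state in trajectory:
        a = 0 if action == -1 else 1
        target = reward + gamma * max(q[next_state])
        q[state][a] += lr * (target - q[state][a])

q = [[0.0, 0.0] for _ in range(10)]   # value estimates per (state, action)
env = ToyEnv()
for episode in range(200):            # exploration feeds the learner, and the cycle repeats
    learn(q, explore(env, q))
```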

This approach has proven highly effective in many game-playing and robotics tasks, but it has not yet been explored much in the AV industry. At Nuro, we have been investing in building highly performant RL training systems to enable advanced RL research as part of a scalable approach to an ML-first autonomy stack. In fact, we have built a system that enables us to train on years' worth of driving experience in only a single hour.

The anatomy of an RL system

An efficient distributed training system is critical for training RL models on highly complex tasks. While a human can learn a task like driving with relatively few hours of experience, ML models (and RL algorithms in particular) require large amounts of data to learn effectively, and the data requirement grows with the complexity of the task. Gathering and learning from that much data requires a sophisticated training system.

Simulation

One of the key components of any RL training system is a method for gathering experience. In some domains, such as robotic manipulation, experience can be collected directly in the real world. In a safety-critical application like autonomous driving, however, it would be impossible to collect safety-critical data in large quantities in the real world. Instead, we use simulation to gather experience.

Using simulation poses a trade-off between fidelity and efficiency. The higher the fidelity, the more faithfully the simulation represents how our model would perform in the real world. However, high-fidelity simulation is often quite expensive and can take minutes to run, making it impractical for quick experimentation by researchers. Low-fidelity simulation, on the other hand, is fast but introduces a "simulation-to-reality gap" (sim2real), which is a challenging problem to solve.

Our system supports both of these forms of simulation with the ability to swap freely between them. Integrating a high-fidelity production simulation into training was a key part of the system design and required a large engineering effort and many optimizations to make it effective. Some of the optimizations that were required include: map data sharding and caching, remote model inference, and efficient inter-module data sharing.

Taken together, these optimizations allow us to run hundreds of simulations in parallel on a single cloud instance, each of which can independently run different scenes, different models, and even distinct exploration strategies.
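
The sketch below illustrates this pattern of many independent simulation workers on one machine, each running its own scenes with a swappable fidelity level. The class names and the fidelity interface are assumptions made for illustration; the production simulator and its optimizations are far more involved.

```python
# Illustrative sketch: many independent simulations in parallel on one machine,
# each with its own scenes, and a swappable low-/high-fidelity simulator.
# All names here are assumptions for illustration only.
import multiprocessing as mp
import random

class LowFidelitySim:
    def run(self, scene_id, policy_version):
        # cheap rollout; fast, but with a sim2real gap
        return {"scene": scene_id, "policy": policy_version, "return": random.random()}

class HighFidelitySim:
    def run(self, scene_id, policy_version):
        # stands in for the expensive production-grade simulator
        return {"scene": scene_id, "policy": policy_version, "return": random.random()}

def worker(task_queue, result_queue, fidelity):
    sim = HighFidelitySim() if fidelity == "high" else LowFidelitySim()
    while True:
        task = task_queue.get()
        if task is None:          # poison pill to shut the worker down
            break
        result_queue.put(sim.run(task["scene_id"], task["policy_version"]))

if __name__ == "__main__":
    tasks, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(tasks, results, "low")) for _ in range(8)]
    for w in workers:
        w.start()
    for scene in range(32):       # each worker independently picks up different scenes
        tasks.put({"scene_id": scene, "policy_version": 0})
    rollouts = [results.get() for _ in range(32)]
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()
```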

Communication and Synchronization

In a distributed training framework, different cloud instances need to communicate efficiently so that data can move between machines and the system can scale.

In order to accomplish this, we leverage an in-memory database for fast communication between different machines. Each simulation machine subscribes to model updates from the learner as well as requests to run different scenes; in turn, the simulation machine publishes the results of simulation to the learner and the process continues. With the publishing, subscription, and simulation all running in different processes, we can continually run simulations without blocking, allowing us to fully utilize the resources of the worker machine. Similarly, by offloading model weight publishing to another process, we can ensure that the learner is not blocked.
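
As a rough sketch of this publish/subscribe pattern, the snippet below uses Redis as the in-memory database purely for illustration (the post does not name a specific system); the channel names and message formats are likewise assumptions.

```python
# Sketch of the pub/sub pattern between learner and simulation workers.
# Redis, the channel names, and the message formats are illustrative assumptions.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def simulation_worker_loop(run_one_simulation):
    """On a simulation machine: listen for model updates and scene requests,
    publish rollout results back to the learner."""
    pubsub = r.pubsub()
    pubsub.subscribe("model_updates", "scene_requests")
    latest_weights = None
    for message in pubsub.listen():        # runs in its own process,
        if message["type"] != "message":   # so simulation itself never stalls
            continue
        channel = message["channel"].decode()
        payload = json.loads(message["data"])
        if channel == "model_updates":
            latest_weights = payload["weights"]        # swap in the newest policy
        elif channel == "scene_requests":
            result = run_one_simulation(payload["scene_id"], latest_weights)
            r.publish("rollout_results", json.dumps(result))

def learner_publish_weights(weights, version):
    """On the learner side: weight publishing offloaded so training is never blocked."""
    r.publish("model_updates", json.dumps({"version": version, "weights": weights}))
```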

With these design decisions, we are able to fully saturate the capabilities of widely available GPU and TPU cloud compute resources.

Learner

In our system, the learner is mainly responsible for the training loop. We have abstracted away all of the communication and simulation so that users and researchers can simply plug in new agent architectures and algorithms in order to test published state-of-the-art methods as well as those being developed in-house. The ability to swap in new agents and models is very important as we research more and more applications of RL across the entirety of the AV stack and push the limits on what is possible with closed-loop training.
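
A minimal sketch of what such an abstraction might look like is shown below; the Agent and Learner interfaces here are illustrative assumptions, not Nuro's internal API.

```python
# Sketch of a learner that hides communication and simulation behind simple interfaces,
# so a researcher only supplies an Agent. Interface names are assumptions.
from abc import ABC, abstractmethod

class Agent(ABC):
    """Anything a researcher wants to train: an architecture plus an update rule."""
    @abstractmethod
    def act(self, observation):
        ...

    @abstractmethod
    def update(self, batch_of_experience):
        ...

class Learner:
    def __init__(self, agent: Agent, experience_source, weight_publisher):
        self.agent = agent
        self.experience = experience_source   # e.g. rollouts streamed from simulation workers
        self.publish = weight_publisher       # e.g. offloaded to a separate process

    def train(self, num_steps):
        for step in range(num_steps):
            batch = self.experience.sample()  # distributed plumbing hidden from the researcher
            self.agent.update(batch)
            if step % 100 == 0:
                self.publish(self.agent)      # push fresh weights to simulation workers
```

Swapping in a new algorithm then only requires a new Agent implementation; the distributed plumbing stays the same.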

Scaling

With our approach to distributed RL training, we can scale out training both in the number of GPUs (or TPUs) and in the number of simulations we run in parallel. Both scaling directions enable us to train research models on decades' worth of driving experience in a matter of hours. The ability to massively scale the amount of driving experience through simulation is critical to achieving Nuro's high bar for safety across the diverse situations our vehicles find themselves in.
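
As a back-of-envelope illustration of how parallel simulation turns wall-clock hours into decades of driving experience (with entirely hypothetical numbers, not Nuro's actual figures):

```python
# Illustrative scaling arithmetic only; all values below are hypothetical assumptions.
parallel_sims_per_instance = 200   # "hundreds of simulations" on one machine
num_instances = 50                 # assumed size of the simulation fleet
realtime_factor = 5                # assumed speed-up of each simulation over real time

sim_hours_per_wall_hour = parallel_sims_per_instance * num_instances * realtime_factor
years_per_wall_hour = sim_hours_per_wall_hour / (24 * 365)
print(f"{sim_hours_per_wall_hour:,} simulated hours per wall-clock hour "
      f"= roughly {years_per_wall_hour:.1f} years of driving per hour")
```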

Conclusion

With this system, we have been able to successfully train and test models in simulation and on public roads, showing that, given a training system with enough scale, we can train a policy for driving safely in as little as a single work day. Above, you can see an example of one of our policies in simulation successfully navigating an unprotected intersection. For a more detailed look at some of what we have built on top of this system, check out what we have presented at CVPR '23 and ML4AD '23.

Of course, we are just getting started. We would like to see how much further we can push the scaling of this system, as well as the boundaries of what is possible with state-of-the-art safe reinforcement learning. If any of these problems interest you, check out our open positions.

By: Jonathan Booher, Wei Liu, Zixun Zhang, Joyce Huang, Ethan Tang
