In the realm of Autonomous Driving (AD), two major approaches have been applied for decades: Imitation Learning (IL) and Reinforcement Learning (RL). Both have their pros and cons. On one hand, IL is an effective supervised-learning approach that scales with data but is prone to accumulating error in “out-of-distribution” events. On the other hand, RL is designed to be adaptable in unseen scenarios, but obtaining human-like behavior with it can be very challenging (i.e., it is hard to define a reward function).

At Nuro, we propose to combine the best of these two worlds: Imitation and Reinforcement Learning. In this blog post, we will provide a brief introduction to the technical details of how this is achievable in the area of AD and will share some results justifying our method. More details were shared during the CVPR-2024 Workshop on Data-Driven Autonomous Driving Simulation and can be found in our recent arXiv preprint.

Two Methods for Trajectory Generation

Over 35 years ago, a supervised learning-based approach called ALVINN (Autonomous Land Vehicle In a Neural Network [1]) became the first successful application of Neural Networks in the driving domain. The approach it used is called Behavior Cloning (BC): ALVINN maps input information (2D and 2.5D visual information) to a future position on the road by cloning the behavior seen in data collected in advance.

IL is a general class of algorithms that enable learning directly from demonstrations. One of the most common methods is BC, which learns from and scales with expert human driving (if you’re interested in the theoretical and practical aspects of scaling, please take a look at our recent blog post ML Scaling Laws in Autonomous Driving). Unfortunately, it is not always possible to stay “in distribution” (e.g., “long-tail” events), and when facing “out-of-distribution” scenarios, prediction and planning errors can quickly accumulate. Another limiting factor is the reliance on expert-level driving, which requires a large amount of quality control, data sampling, or sample weighting in order to learn effectively.
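
To make the BC formulation concrete, here is a minimal sketch (in PyTorch) of a behavior-cloning objective: a small model regresses the logged expert’s future waypoints from scene features. The model, feature dimensions, and names are illustrative placeholders, not Nuro’s production stack.

```python
import torch
import torch.nn as nn

# Minimal behavior-cloning setup: regress the expert's future waypoints from
# scene features. PlannerNet and all shapes are illustrative placeholders.
class PlannerNet(nn.Module):
    def __init__(self, feature_dim: int = 256, horizon: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, 512), nn.ReLU(),
            nn.Linear(512, horizon * 2),  # (x, y) waypoint per future step
        )

    def forward(self, scene_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(scene_features)

def bc_loss(model: PlannerNet,
            scene_features: torch.Tensor,    # [batch, feature_dim]
            expert_waypoints: torch.Tensor   # [batch, horizon, 2], logged expert trajectory
            ) -> torch.Tensor:
    # Supervised regression toward the expert trajectory: "cloning" its behavior.
    pred = model(scene_features)
    return nn.functional.smooth_l1_loss(pred, expert_waypoints.flatten(start_dim=1))
```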

Another possible approach is pure RL-based generation. One previous attempt at RL for generation (Learning to Drive in a Day [2]) leveraged an online, off-policy algorithm called DDPG [3]. The RL formulation requires an action space (how to control the vehicle) as well as a reward function (to inform the model of proper behavior). The authors used a two-dimensional action space — steering angle in the range [-1, 1] and speed setpoint in km/h. They used the average distance traveled before an infraction of traffic rules as a reward function.
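
For intuition, the sketch below renders that formulation in code: a two-dimensional continuous action and a reward that accumulates with the distance driven until an infraction terminates the episode. This is a schematic reading of [2], not the authors’ implementation, and the names are illustrative.

```python
from dataclasses import dataclass

# Schematic of the RL formulation in "Learning to Drive in a Day" [2]:
# a 2D continuous action and a per-step reward proportional to progress.
# Field and function names are illustrative, not from the original code.
@dataclass
class DrivingAction:
    steering: float        # normalized steering angle in [-1, 1]
    speed_setpoint: float  # target speed in km/h

def step_reward(distance_delta_m: float, infraction: bool) -> float:
    # Reward accumulates with forward progress; an infraction terminates the
    # episode, so the return is the distance traveled before the infraction.
    return 0.0 if infraction else distance_delta_m
```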

RL by its nature is adaptable to unseen scenarios thanks to its ability to explore beyond the confines of expert data through simulated rollouts. At the same time, there is no principled way to define an optimal reward function (every method uses its own approach), and the reward definition is especially challenging when trying to capture human-like behavior. An additional challenge in applying RL is the reliance on reliable and scalable infrastructure (if you’re curious how our training system is constructed, consider reading our previous blog post Enabling Reinforcement Learning at Scale).

CIMRL: Combining Imitation and Reinforcement Learning

Currently in the AD domain, we have:

  • Very good imitation-based models (for both Prediction and Planning). These models could be based on LLMs for high-level reasoning.
  • Non-imitation methods that could be used for recall injection (e.g., heuristic plans or geometric rollouts)
  • RL policies that are not widely used due to challenges with reward design and scalable infrastructure

Our idea is the following:

  • Re-use existing imitation-based models
  • Use an RL algorithm to select (!) between multiple generators

This is what we call CIMRL: Combining Imitation and Reinforcement Learning. CIMRL takes multiple plan proposals (potentially even from different trajectory generators), evaluates them in a closed-loop manner over a longer horizon, and selects the safe proposal with the maximal score, accounting for near-term effects as well as long-term delayed rewards.

In the figure, the upper plan is too conservative, while the lower plan is too aggressive (eventually leading to a collision); both receive low scores. The safe, non-conservative plan is chosen as the one with the highest score.

The core of our Safe RL approach is a method called Recovery RL [4]. The method proposes to train two separate trajectory evaluation functions (Q-functions): one for rewards such as preference or progress (Qtask) and another to estimate the potential risk associated with the trajectories (Qrisk). Similarly, two policies are trained, one to maximize each of the Q-functions. After the task policy proposes an action, its Qrisk value is compared against a predefined threshold. If the value is below the threshold, the task action is executed; otherwise, the recovery policy (also called the failover policy) is executed.
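
A minimal sketch of this inference-time rule, assuming task_policy, recovery_policy, and q_risk are already-trained callables and eps_risk is the predefined risk threshold (names are illustrative):

```python
# Recovery RL action selection at inference time [4] (schematic).
def recovery_rl_act(state, task_policy, recovery_policy, q_risk, eps_risk: float):
    a_task = task_policy(state)            # action proposed to maximize Qtask
    if q_risk(state, a_task) <= eps_risk:  # estimated risk is within budget
        return a_task                      # execute the task policy's action
    return recovery_policy(state)          # otherwise fail over to the recovery policy
```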

A schematic diagram of the Recovery RL method during inference.

Defining risk itself is a non-trivial exercise. Usually, there are multiple sources of risk: collision with other road agents, collision with road boundaries, various kinematic constraints, etc. To handle the different sources of risk, we use a vectorized Qrisk value. If safe actions exist, we sample from the re-normalized task policy after removing the actions that violate at least one Qrisk threshold. Otherwise, we sample from the recovery policy.
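
Extending the sketch above to the vectorized case, the following illustrates masking out candidates that violate any risk source, re-normalizing the task policy over the remaining ones, and falling back to the recovery policy when no candidate is safe. Shapes and names are illustrative assumptions, not the exact CIMRL implementation.

```python
import torch

# task_logits:     [N]    unnormalized task-policy scores over N candidate trajectories
# recovery_logits: [N]    unnormalized recovery-policy scores over the same candidates
# q_risk:          [N, K] estimated risk of each candidate for each of K risk sources
# eps_risk:        [K]    per-source risk thresholds
def select_trajectory(task_logits: torch.Tensor,
                      recovery_logits: torch.Tensor,
                      q_risk: torch.Tensor,
                      eps_risk: torch.Tensor) -> int:
    # A candidate is safe only if it stays within budget for every risk source.
    safe_mask = (q_risk <= eps_risk).all(dim=-1)
    if safe_mask.any():
        # Remove unsafe candidates and re-normalize the task policy over the rest.
        masked = task_logits.masked_fill(~safe_mask, float("-inf"))
        probs = torch.softmax(masked, dim=-1)
    else:
        # No candidate satisfies all constraints: sample from the recovery policy.
        probs = torch.softmax(recovery_logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())
```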

Top: safe case, where a4 and a5 are unsafe actions, so we re-normalize for a1, a2, and a3 and sample from this truncated re-normalized task policy. Bottom: unsafe case, where every action violates at least one Qrisk, so we sample from the recovery policy.

Integration with Closed-loop Simulator

In order to train an RL policy and see the effect of long-horizon reasoning, we need to use closed-loop simulation (a schematic rollout is sketched after the list below). We demonstrate results on the open-source Waymax [5] simulator for several reasons:

  • It can be used for training
  • It has TPU / GPU support
  • It uses WOMD, which is a popular dataset in the AD space
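
For intuition, a generic closed-loop rollout looks like the sketch below: the policy plans from the simulated state rather than the logged one, so its own actions (and mistakes) carry forward over the horizon. The env and policy interfaces here are hypothetical stand-ins, not the actual Waymax API.

```python
# Generic closed-loop rollout sketch; env and policy are hypothetical interfaces.
def closed_loop_rollout(env, policy, max_steps: int = 80) -> float:
    state = env.reset()
    total_reward, done, step = 0.0, False, 0
    while not done and step < max_steps:
        action = policy(state)                  # act on the *simulated* state
        state, reward, done = env.step(action)  # simulator advances all agents one tick
        total_reward += reward
        step += 1
    return total_reward
```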

We compared one of the top publicly available solutions for trajectory prediction — MTR [6] — with our approach that uses MTR as the trajectory generator for a fair comparison. For both proximity to the logged expert data and collision rate, CIMRL improves by a wide margin over pure MTR. The off-road rate is slightly worse; however, we notice that architecturally, the MTR model (being used with a fixed set of goal points) is highly sensitive to accumulated yaw changes over the course of the scene. We hypothesize that using a different trajectory generator that is not reliant on a fixed set of goals would result in a better comparison.

Moreover, we evaluated CIMRL on our own data in our in-house closed-loop simulator, with the goal of understanding the applicability of using trajectory generators from multiple sources. We found that the CIMRL approach improves the collision rate over pure imitation, and that adding more trajectory sources significantly improves the progress rate!

Top: Comparison of CIMRL with MTR (two versions: using only the top-probability trajectory, or sampling according to the probability distribution provided by MTR). Bottom: CIMRL is better than the BC-based approach for both collision and progress rates, and the progress rate improves further when a new trajectory source, heuristic-based plans, is added.

Onroad Qualitative Examples

Not only did we train and test this approach on Nuro’s internal simulation stack, we have also tested it on the road. Here we present some examples from a recent onroad drive where the behavior is entirely controlled by CIMRL.

CIMRL learned to drive correctly without getting stuck in the middle of a T-shaped intersection when there is no space to move further (longer video).

CIMRL learned to drive safely without being overly conservative, performing a gradual “creeping” motion before crossing an intersection where cross traffic doesn’t stop (longer video).

Conclusion

In this blog post, we’ve described CIMRL, Nuro’s novel approach for safe Autonomous Driving, which is based on combining Imitation and Reinforcement Learning techniques. CIMRL is a scalable (IL benefits from data scaling and RL benefits from increased distributed training) and flexible framework for combining trajectory-generation approaches to achieve overall safe and performant driving. Its success is rooted in learning motion selection, which provides the long-horizon reasoning crucial for safe driving, while delegating trajectory generation to IL-based models for smooth and expert-level driving. More information was shared in our presentation during the CVPR-2024 Workshop on Data-Driven Autonomous Driving Simulation and can be found in the relevant arXiv preprint.

If you’d like to push the boundary of safety and performance for Autonomous Driving, join us!

By: Jonathan Booher, Khashayar Rohanimanesh, Junhong Xu, Aleksandr Petiushko

References

[1] Pomerleau, Dean A. “ALVINN: An autonomous land vehicle in a neural network.” Advances in Neural Information Processing Systems 1 (1988).

[2] Kendall, Alex, et al. “Learning to drive in a day.” 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019.

[3] Lillicrap, Timothy P., et al. “Continuous control with deep reinforcement learning.” arXiv preprint arXiv:1509.02971 (2015).

[4] Thananjeyan, Brijen, et al. “Recovery RL: Safe reinforcement learning with learned recovery zones.” IEEE Robotics and Automation Letters 6.3 (2021): 4915–4922.

[5] Gulino, Cole, et al. “Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research.” Advances in Neural Information Processing Systems 36 (2024).

[6] Shi, Shaoshuai, et al. “Motion transformer with global intention localization and local movement refinement.” Advances in Neural Information Processing Systems 35 (2022): 6531–6543.
