Since its establishment, Nuro’s perception system has extensively utilized artificial intelligence (AI). Initially, we focused on deploying multiple small models, each dedicated to a specific task and consuming a specific sensor or combination of sensors. However, it became apparent that the computation for extracting features from the same sensors could be shared, and that the rich features learned by multiple downstream tasks can help each individual task approach its performance ceiling, especially when some tasks have limited labels.
Overview of the unified perception model. The unified perception system takes in all sensors and serves a wide variety of tasks.
To address these issues, we developed a unified perception model. As shown in the figure above, the unified perception model takes synchronized inputs from almost 30 individual sensors, including multiple long-range and short-range cameras, lidars, and radars, to simultaneously address a multitude of perception and mapping tasks, such as detection & tracking, localization & online mapping, occupancy and flow, etc. Such integration not only elevates the system’s performance but also reduces redundant feature re-computation across separate models, conserving on-board computational resources and moving away from reliance on manually crafted features. This model has led to a systematic improvement in the efficiency and effectiveness of Nuro’s autonomy system.
Design Philosophy
In the development of the unified perception model, our design philosophy is articulated through the following foundational principles:
- Cross-platform Compatibility: Given the diversity in vehicle platforms and generations, alongside the goal to fully utilize large-scale data accumulated over the years, it is crucial to train models across a broad spectrum of vehicle platforms with distinct sensor configurations. This approach enables the training of a single foundational model that can be exported and deployed across various vehicle platforms with efficient post-training.
- Sensor Robustness: Sensor malfunctions can occur during vehicle operations. To ensure vehicle resilience, model training includes sensor dropout, which minimizes the model’s dependency on any single sensor (see the sketch following this list). This strategy enhances the model’s robustness, enabling it to maintain performance even when multiple sensors fail.
- Scalability: In the era of large models, model capacity and data volume are critical for emergent capabilities in handling both prevalent and long-tailed scenarios. Our design advocates a unified representation for each sensor modality that can scale with the model. In addition, large-scale pre-training and end-to-end training greatly enhance our system’s performance with all kinds of human and synthetic labels.
- Multi-task capabilities: Our system supports a comprehensive range of perception tasks, in addition to localization and mapping tasks, using shared feature extractors; this also opens up the opportunity for joint perception and behavior modeling. To do this efficiently, we utilize our custom training framework to concurrently train multiple tasks in a single model graph, ensuring rich feature representations for numerous downstream tasks.
These principles guide the design of a robust, adaptable, and efficient unified perception model that is capable of functioning effectively across varied operational scenarios and vehicle platforms.
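As a concrete illustration of the sensor-dropout principle above, the minimal sketch below randomly masks entire per-sensor feature maps during training so that downstream fusion cannot over-rely on any single sensor. The function, sensor names, and drop rate are hypothetical, not Nuro’s actual implementation.

```python
# Minimal sketch of sensor-dropout augmentation (hypothetical names and rates).
import torch

def sensor_dropout(features: dict[str, torch.Tensor],
                   drop_prob: float = 0.1,
                   training: bool = True) -> dict[str, torch.Tensor]:
    """Randomly zero out entire per-sensor feature maps during training."""
    if not training:
        return features
    out, kept_any = {}, False
    for name, feat in features.items():
        keep = bool(torch.rand(()) >= drop_prob)
        kept_any = kept_any or keep
        out[name] = feat if keep else torch.zeros_like(feat)
    if not kept_any:                      # guard: never drop every sensor at once
        first = next(iter(features))
        out[first] = features[first]
    return out

# Usage with, e.g., a front-camera and a top-lidar feature map:
feats = {"cam_front": torch.rand(1, 64, 32, 32), "lidar_top": torch.rand(1, 64, 32, 32)}
feats = sensor_dropout(feats, drop_prob=0.2)
```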
Architecture
To elucidate the architectural framework, the following figure presents a simplified depiction of the unified perception model. The model integrates inputs from all available sensors and processes each sensor’s features independently. Subsequently, a sensor fusion module transforms these features from their native formats into a unified voxel feature space and generates multi-modal spatial features. The temporal module aligns spatial features from T to T-n and further fuses them into spatial-temporal features. For certain tasks, the module performs stateful temporal modeling to enhance temporal consistency, conditioning on the spatial-temporal features at T and task features/queries from T-1 to T-n. At this stage, the features are equipped to support a diverse array of tasks.
Simplified illustration of the model architecture.
The independent multimodal sensor encoders and unified voxel representations are the key factors in achieving these design objectives. First, the sensor suite and compute budget vary across vehicle platforms; by reconfiguring the sensor encoders while leveraging a well-trained model, we can seamlessly transfer our system to new platforms. Second, edge latency and robustness are also improved: raw sensor processing is usually distributed among multiple chips, so the independent encoders can run inference in parallel without waiting for data copies, and, with sensor-dropout augmentation at training time, tolerate sensor failures on different nodes. Lastly, pretraining the image encoder on large-scale image datasets has been shown to improve the performance of all downstream tasks, thereby facilitating the seamless integration of other foundation models.
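To make the data flow concrete, here is a minimal PyTorch-style sketch of independent per-sensor encoders projecting into a shared voxel (BEV) feature space that multiple task heads consume. The module names, the simple sum-based fusion, and all shapes are illustrative assumptions, not Nuro’s actual architecture.

```python
# Sketch: independent sensor encoders -> unified BEV features -> shared task heads.
import torch
from torch import nn

class SensorEncoder(nn.Module):
    """Encodes one sensor's raw input and projects it into the shared BEV grid."""
    def __init__(self, in_channels: int, bev_channels: int, bev_size: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, bev_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Stand-in for a learned view transform (e.g. lift-splat or attention).
        self.to_bev = nn.AdaptiveAvgPool2d(bev_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.to_bev(self.backbone(x))

class UnifiedPerception(nn.Module):
    """Per-platform sensor encoders fused into one BEV feature shared by task heads."""
    def __init__(self, sensor_channels: dict[str, int],
                 bev_channels: int = 128, bev_size: int = 64):
        super().__init__()
        self.encoders = nn.ModuleDict({
            name: SensorEncoder(c, bev_channels, bev_size)
            for name, c in sensor_channels.items()
        })
        self.fuse = nn.Conv2d(bev_channels, bev_channels, 3, padding=1)
        self.heads = nn.ModuleDict({
            "detection": nn.Conv2d(bev_channels, 10, 1),
            "occupancy": nn.Conv2d(bev_channels, 1, 1),
        })

    def forward(self, inputs: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # Encoders run independently; a missing sensor simply contributes nothing.
        bev = sum(self.encoders[n](x) for n, x in inputs.items() if n in self.encoders)
        bev = self.fuse(bev)
        return {task: head(bev) for task, head in self.heads.items()}

# Usage on a hypothetical platform with two cameras and one lidar range image:
model = UnifiedPerception({"cam_front": 3, "cam_rear": 3, "lidar_top": 4})
out = model({"cam_front": torch.rand(1, 3, 256, 256),
             "lidar_top": torch.rand(1, 4, 128, 512)})
```

Because each encoder runs independently, a platform with a different sensor suite only requires swapping the encoder configuration, and a missing or dropped sensor simply contributes nothing to the fused feature.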
Through years of development, we observed that a cascaded model pipeline with decoupled tasks leads to diminishing marginal returns and an overreliance on heuristics, hindering system performance and team progress. Consequently, we opted to collocate all perception tasks and train an end-to-end model, which shares sensor capabilities and a dedicated training/inference infrastructure.
In perception and behavior, temporal reasoning over sequential states is necessary. To accurately estimate the world state, we leverage both spatial-temporal features from sensors and task-specific features/queries from downstream tasks. This design not only enables a variety of temporal modeling tasks but also consistently outperforms our already world-class baseline.
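As a rough sketch of the stateful design, the hypothetical module below keeps a small buffer of task queries from the previous n frames and lets the current frame’s queries cross-attend to the spatial-temporal BEV feature; the query/attention formulation and all names are assumptions for exposition, not Nuro’s exact module.

```python
# Sketch: stateful temporal modeling with task queries carried across frames.
import torch
from torch import nn

class StatefulTemporalHead(nn.Module):
    """Hypothetical head that conditions on its own queries from T-1 ... T-n."""
    def __init__(self, dim: int = 128, num_queries: int = 64, history: int = 4):
        super().__init__()
        self.history = history
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.query_buffer: list[torch.Tensor] = []   # task queries from T-1 ... T-n

    def forward(self, bev_feat: torch.Tensor) -> torch.Tensor:
        # bev_feat: (B, C, H, W) spatial-temporal feature at time T.
        b, c, h, w = bev_feat.shape
        tokens = bev_feat.flatten(2).transpose(1, 2)           # (B, H*W, C)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)        # (B, Q, C)
        if self.query_buffer:                                  # condition on T-1 ... T-n
            q = q + torch.stack(self.query_buffer).mean(dim=0)
        q, _ = self.attn(q, tokens, tokens)                    # cross-attend to BEV tokens
        self.query_buffer = (self.query_buffer + [q.detach()])[-self.history:]
        return q                                               # task queries at time T

# Usage across consecutive frames (batch size stays fixed between calls):
head = StatefulTemporalHead()
for _ in range(3):
    out = head(torch.rand(2, 128, 64, 64))   # (2, 64, 128) queries per frame
```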
Training and Inference
To train such a large model with cross-platform data and multiple tasks, one would typically need to curate a joint dataset labeled for every task, employ model parallelism across a large number of GPUs, and manage divergent sensor formats from different platforms. To work around these constraints, we developed a novel joint training framework internally, called split batch joint training. With this framework, we can train a single model with multiple independent datasets and multiple model graphs, each consisting of a subset of the full model. Analogously, the framework can be thought of as coordinate ascent over the space of subtasks, and it removes the constraints on dataset compatibility and training memory. It allows us to concurrently train across sensor platforms and/or tasks with a single model in a single training stage, and to smoothly train on data encompassing tens of millions of frames across three vehicle platforms while extracting maximum performance across multiple tasks.
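The toy loop below illustrates the split-batch idea under simplified assumptions: one shared model graph, several independent datasets with different sensor subsets and label types, and each step running only the subgraph that its batch supports, in the spirit of coordinate ascent over subtasks. All modules, datasets, and losses are stand-ins rather than the internal framework.

```python
# Sketch: split batch joint training over independent, partially labeled datasets.
import itertools
import torch
from torch import nn

# One shared model graph: per-sensor encoders, a shared trunk, and task heads.
encoders = nn.ModuleDict({
    "cam_front": nn.Linear(3, 32), "cam_rear": nn.Linear(3, 32), "lidar_top": nn.Linear(4, 32),
})
trunk = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
heads = nn.ModuleDict({"detection": nn.Linear(64, 10), "occupancy": nn.Linear(64, 1)})
opt = torch.optim.AdamW(
    list(encoders.parameters()) + list(trunk.parameters()) + list(heads.parameters()), lr=1e-4)

def fake_dataset(sensors: dict[str, int], task: str):
    """Stand-in for an independent dataset tied to one platform / label type."""
    while True:
        yield {s: torch.rand(8, c) for s, c in sensors.items()}, task

# Independent datasets, each covering only a subset of sensors and tasks.
datasets = [
    fake_dataset({"cam_front": 3, "cam_rear": 3}, "detection"),   # camera-only platform
    fake_dataset({"cam_front": 3, "lidar_top": 4}, "occupancy"),  # camera + lidar platform
]

# Round-robin over split batches: each step exercises only the relevant subgraph.
for step, data in zip(range(100), itertools.cycle(datasets)):
    inputs, task = next(data)
    fused = trunk(sum(encoders[s](x) for s, x in inputs.items()))
    out = heads[task](fused)
    loss = torch.nn.functional.mse_loss(out, torch.zeros_like(out))  # dummy label
    loss.backward()
    opt.step()
    opt.zero_grad()
```

Because gradients from every split batch flow into the same shared trunk, all tasks and platforms improve a single model without requiring one jointly labeled dataset.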
Successfully deploying a model with these capabilities on real vehicles poses significant constraints and challenges. As a result, in addition to standard practices such as custom kernel fusion and low-precision inference, our Perception and ML-Infra teams have collaborated closely to develop novel solutions in the ML compiler and runtime to run such a multi-sensor, multi-task model efficiently on limited onboard compute. For an in-depth look at how the model is deployed across multiple GPUs and how task heads are executed at varying priorities and frequencies, we recommend our previous blog post, FTL Model Compiler Framework.
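As a simplified illustration of heads running at different rates on top of a shared trunk, the sketch below reuses cached outputs for lower-frequency heads between runs; the rates, module names, and scheduling policy are hypothetical, and the actual priority-based, multi-GPU execution is handled by the FTL compiler and runtime described in that post.

```python
# Sketch: shared trunk every frame, task heads at their own (assumed) periods.
import torch
from torch import nn

trunk = nn.Linear(32, 64)
heads = nn.ModuleDict({"detection": nn.Linear(64, 10), "online_mapping": nn.Linear(64, 8)})
HEAD_PERIOD = {"detection": 1, "online_mapping": 5}   # in frames; hypothetical rates

@torch.inference_mode()
def run_frame(x: torch.Tensor, frame_idx: int, cache: dict) -> dict:
    feat = trunk(x)
    for task, head in heads.items():
        if frame_idx % HEAD_PERIOD[task] == 0:        # lower-priority heads run less often
            cache[task] = head(feat)
    return cache                                      # stale outputs reused in between

cache: dict = {}
for t in range(10):
    cache = run_frame(torch.rand(1, 32), t, cache)
```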
Future
Following the development of the unified perception model, our R&D focus has shifted toward two distinct yet interconnected domains aimed at advancing autonomy. First, we are augmenting the unified perception model with open-vocabulary capabilities, using feature extractors derived from multimodal language models (foundation models). Second, our team is dedicated to achieving an end-to-end learnable autonomy system that meets an L4 performance and safety bar, a testament to our comprehensive commitment to artificial intelligence. Stay tuned for future updates in these areas.
By: Zhenbang Wang, Shuvam Chakraborty, Qianhao Zhang, Zhuwen Li
The authors would like to acknowledge the contributions from Adi Ganesh, Akshat Agarwal, Andrea Allais, Aneesh Gupta, Charles Zhao, Chengyao Li, Dmytro Bobkov, Frank Julca Aguilar, Greg Long, Himani Arora, Jia Pu, Mazen Abdelfattah, Ning Xu, Prath Kini, Sam Bateman, Shihong Fang, Su Pang, Tiffany Huang, Viktor Liviniuk, Vince Gong, Xinjie Fan, Xuran Zhao, Yassen Dobrev, and other Nuro team members not mentioned here who provided help and support.
If you believe in our mission and want to help build the future of autonomous driving, join us! https://www.nuro.ai/careers