Towards the Next Level of Autonomous Driving via World Models
How close are we to achieving real autonomous driving? As defined by the SAE International standard, Level 5 autonomy means the vehicle can drive itself without being subject to any operating conditions. Current driving autonomy, however, depends heavily on its surrounding environment throughout development. Concretely, existing AI-based autonomous driving systems are trained on pre-recorded driving logs and then tested in controlled environments before real-world deployment, such as closed areas under human monitoring and simulators built on graphics engines. The fixed driving logs used in training prevent the policy from continuously improving itself through interactive exploration, and the restricted environments used in evaluation fall short of representing the diversity and complexity of the unstructured world. These strategies therefore hinder such systems from achieving trustworthy autonomous driving in the wild.
A promising way to address these issues is to establish a foundational driving world model that mimics real-world driving experience across diverse geographic locations and agent behaviors. In principle, such a neural network simulates the driving world by faithfully imagining the future outcomes of a perceived scenario as the consequence of different actions. This capability of reasoning over future evolution can greatly contribute to constructing, verifying, and improving real-world driving policies.
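To make the idea concrete, below is a minimal sketch of what such a world-model interface could look like. The class, method, and field names are illustrative assumptions for this article, not the actual GenAD or Vista API, and the placeholder prediction simply repeats the last observed frame.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class Action:
    """A candidate ego action; the fields are illustrative placeholders."""
    speed: float                           # target speed in m/s
    trajectory: np.ndarray | None = None   # future waypoints, shape (T, 2)


class DrivingWorldModel:
    """Hypothetical interface: imagine future frames conditioned on an action."""

    def imagine(self, frames: np.ndarray, action: Action, horizon: int) -> np.ndarray:
        """Predict `horizon` future frames given the observed history `frames`.

        frames: (T_obs, H, W, 3) array of past camera frames.
        Returns: (horizon, H, W, 3) imagined future frames.
        Placeholder behavior: repeat the last observed frame.
        """
        return np.repeat(frames[-1:], horizon, axis=0)


# Imagine two different futures for the same observation.
history = np.zeros((4, 128, 256, 3), dtype=np.float32)
model = DrivingWorldModel()
future_slow = model.imagine(history, Action(speed=5.0), horizon=10)
future_fast = model.imagine(history, Action(speed=15.0), horizon=10)
```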
However, distilling the dynamics of the driving world into a neural network is challenging. The scarcity of driving data and the inefficiency of existing model designs prevent previous methods from reasonably estimating and controlling the future in open scenarios. We identify two key ingredients to unlock this paradigm, namely a large-scale driving video dataset and a capable video diffusion model, presented in our recent line of research: GenAD and Vista. The resulting driving world model can imagine plausible future outcomes in open scenarios and provide rewards for taking different actions. It paves the way for developing and testing planning methods in a driving world imagined purely by neural networks.
Driving Video Dataset with the Largest Scale and Coverage
As demonstrated by foundation models in the language and vision-language domains, a substantial and diverse corpus of data is necessary for achieving generalization. However, existing driving datasets fail to meet this need, as they are limited in scale, geographic coverage, and scenario diversity due to their regulated collection processes.
We instead opt for the abundant driving videos on the web as the data source and carefully construct a large-scale driving dataset, OpenDV. As the largest public driving dataset to date, it contains more than 2,000 hours of driving videos, 374 times larger than the widely used nuScenes dataset. The benefits of this approach are two-fold: (a) the driving videos can be easily scaled up by crawling the web, without expensive real-world collection; and (b) online data at scale naturally captures the distribution of the driving world in terms of geographic locations, terrains, weather conditions, safety-critical scenarios, sensor settings, traffic elements, etc. The driving recordings collected from YouTube are pre-processed with rigorous human verification. To support multi-modal training, we further pair each video clip with diverse text-level conditions, including descriptions generated by LLMs and VLMs, and high-level driving instructions inferred by a video classifier.
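As a rough illustration of what one training sample could look like after this annotation pipeline, here is a hypothetical schema; the field names and values are assumptions made for this sketch rather than the released OpenDV format.

```python
from dataclasses import dataclass, field


@dataclass
class OpenDVClip:
    """Illustrative schema for one OpenDV training sample (fields are assumptions)."""
    video_path: str            # path to a pre-processed video clip
    fps: int                   # frame rate after pre-processing
    description: str           # scene description generated by an LLM/VLM
    command: str               # high-level driving instruction from a video classifier
    metadata: dict = field(default_factory=dict)  # e.g. weather, region, camera setup


# One hypothetical sample.
sample = OpenDVClip(
    video_path="opendv/clips/000123.mp4",
    fps=10,
    description="A rainy urban street with pedestrians crossing at a signalized intersection.",
    command="turn left",
    metadata={"weather": "rain", "region": "urban"},
)
```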
Driving World Model Excelling at Fidelity, Generalization, and Controllability
The challenges of building a model that can absorb data at such a scale lie in training efficiency, dynamics modelling, and action controllability. To improve training efficiency, a latent video diffusion model pre-trained on a general video corpus is used for initialization; after fine-tuning on the OpenDV dataset, this general video pre-training effectively transfers open-world knowledge to driving scenarios. Another crucial ingredient is modelling the intricate dynamics of the evolving scene. Novel losses are devised to emphasize learning moving objects over the static background, while preserving high-frequency visual details. As for action controllability, actions of different types and granularities, such as speed and trajectory, are incorporated during training to allow flexible control at inference. Since these action labels are not available in OpenDV, we additionally introduce the annotated nuScenes dataset into training. Through this data composition, the model learns scenario generalization from OpenDV samples and action controllability from nuScenes samples.
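The following is a minimal PyTorch sketch of the general idea behind such losses, assuming a simple frame-difference heuristic to locate moving regions and an image-gradient term for high-frequency detail. It is not Vista's exact formulation; the weights and the helper `dynamics_aware_loss` are hypothetical.

```python
import torch
import torch.nn.functional as F


def dynamics_aware_loss(pred, target, prev_target, dyn_weight=2.0, hf_weight=0.5):
    """Sketch of a dynamics-aware reconstruction loss (not Vista's exact losses).

    pred, target, prev_target: (B, C, H, W) predicted frame, ground-truth frame,
    and the previous ground-truth frame used to locate moving regions.
    """
    # Up-weight pixels that changed between consecutive frames (moving objects).
    motion_mask = (target - prev_target).abs().mean(dim=1, keepdim=True)
    weights = 1.0 + dyn_weight * motion_mask / (
        motion_mask.amax(dim=(2, 3), keepdim=True) + 1e-6
    )
    recon = (weights * (pred - target) ** 2).mean()

    # Preserve high-frequency details by also matching local image gradients.
    def grads(x):
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]

    pdx, pdy = grads(pred)
    tdx, tdy = grads(target)
    high_freq = F.l1_loss(pdx, tdx) + F.l1_loss(pdy, tdy)

    return recon + hf_weight * high_freq


# Example usage with random tensors.
pred = torch.rand(2, 3, 64, 64)
target = torch.rand(2, 3, 64, 64)
prev = torch.rand(2, 3, 64, 64)
loss = dynamics_aware_loss(pred, target, prev)
```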
The resulting driving world model, namely Vista, exhibits a wide capability spectrum: (a) predicting high-fidelity futures (576×1024 pixels at 10 Hz) in various scenarios; (b) extending its predictions to continuous, long horizons of up to 15 s; (c) controlling the agent with multi-modal actions; and (d) providing rewards for different actions without accessing ground-truth states.
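To illustrate capability (d), here is a hedged sketch of how a world model could score candidate actions without ground-truth states, reusing the illustrative `imagine` interface from the sketch above. The idea shown, ranking actions by how consistently the model imagines their outcomes, is an assumption for this example and not the exact reward formulation of Vista.

```python
import numpy as np


def rank_actions_by_reward(world_model, frames, candidate_actions, horizon=10, n_samples=4):
    """Rank candidate actions with rewards derived from the world model itself.

    Assumption (not the exact Vista procedure): sample several imagined futures
    per action and use the agreement across samples as a ground-truth-free
    reward, so confidently predictable outcomes score higher.
    """
    rewards = []
    for action in candidate_actions:
        rollouts = np.stack(
            [world_model.imagine(frames, action, horizon) for _ in range(n_samples)]
        )  # (n_samples, horizon, H, W, 3)
        disagreement = rollouts.std(axis=0).mean()  # spread across imagined futures
        rewards.append(-float(disagreement))        # lower disagreement -> higher reward
    order = np.argsort(rewards)[::-1]
    return [candidate_actions[i] for i in order], [rewards[i] for i in order]


# Example (using the illustrative DrivingWorldModel and Action from the earlier sketch):
# ranked_actions, action_rewards = rank_actions_by_reward(
#     model, history, [Action(speed=5.0), Action(speed=15.0)]
# )
```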
The Foreseeable Future of Autonomous Driving
Previously, some efforts were made to develop world models for autonomous driving, but these attempts were either conducted on small-scale datasets or implemented in rendered simulators with unrealistic visual artifacts. We took a significant step forward by establishing a large-scale driving world model that can simulate realistic futures in open-world driving domains. Nevertheless, the generated controllable futures are not yet realistic enough, and the model's performance on downstream tasks is not yet satisfactory for real-world deployment. Our investigation laid the foundation for robustly verifying and continuously improving real-world driving policies within the model's internally imagined world, waiving the need for costly deployment and testing in the physical world.