At Nuro, we have spent years developing scalable mapping tools, many of which have enabled our multi-city driverless deployments. As the scale of our deployments and operating domain grows, limitations that were once rare become frequent, and methods that worked at a smaller scale are replaced with more general, flexible approaches. A perfect example is HD mapping and the challenges of growing and maintaining an HD map over time. An HD map is a detailed representation of the physical and semantic features in an environment; for autonomous vehicles, this includes curbs, lane lines, stop signs, traffic signals, and more. In short, it encompasses everything needed to consistently understand and obey the traffic rules of an intersection or road, and to drive safely there, including how those rules differ from road to road. In the last few years, a great deal of academic and industrial interest has focused on developing online HD map systems that remove the need for universal labeling, change detection, and map maintenance.
In this blog post, we will provide a brief introduction to some of the ideas being explored in this space and highlight some of the recent work we are sharing at CVPR 2024’s Workshop on Autonomous Driving (WAD). We hope to inspire others with a peek at the interesting, challenging problems we work on every day here at Nuro. So let’s jump right into it!
What is HD Mapping and why is it hard?
Maps and geospatial information provide manifold value to an autonomous vehicle stack. Some of that value is immediately obvious: if a robot doesn’t understand the composition of the scene around it, such as lane lines, curbs, and traffic signals, it will struggle to propose a safe motion plan that satisfies all traffic rules. Other value is less direct: an AV can estimate its position with respect to a global map, a process called localization, and then follow a given route in that map. These are just two examples, but there are many other uses of maps, and thus many different types of maps, that AVs leverage to enable robust, safe driverless deployments.
One particularly important type of map is a High-Definition (HD) map, which addresses that first problem: knowledge and comprehension of lane lines, curbs, traffic signals, and so on. There are many ways to encode this knowledge, but most commonly it is encoded as some mixture of occupancy grids (spatial grids whose cells store traits of the area they cover, e.g., whether it is drivable), polylines (sets of connected line segments forming closed or open shapes, e.g., a curb, crosswalk, or lane line), and bounding box annotations (3D positions, orientations, and sizes representing physical objects, e.g., a traffic signal). When AV systems were first conceptualized, it was hard to imagine that an online perception system could detect all the features and attributes required for fully driverless deployments, let alone do so safely.
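To make these encodings concrete, here is a minimal sketch of what the three representations might look like as data structures. The class and field names are illustrative assumptions, not an actual production map schema.

```python
# Illustrative data structures for the three HD map encodings described above.
from dataclasses import dataclass

import numpy as np


@dataclass
class Polyline:
    """Connected line segments, e.g. a curb, crosswalk edge, or lane line."""
    points: np.ndarray        # (N, 2) ordered x/y vertices in the map frame
    feature_type: str         # e.g. "curb", "lane_line", "crosswalk"
    is_closed: bool = False   # True for closed shapes like crosswalk polygons


@dataclass
class BoundingBox3D:
    """An oriented 3D box for a physical object, e.g. a traffic signal."""
    center: np.ndarray        # (3,) x/y/z position in the map frame
    size: np.ndarray          # (3,) length/width/height in meters
    heading: float            # yaw in radians
    feature_type: str = "traffic_signal"


@dataclass
class OccupancyGrid:
    """A spatial grid where each cell stores a trait, e.g. drivability."""
    origin: np.ndarray        # (2,) map-frame coordinate of cell (0, 0)
    resolution: float         # meters per cell
    drivable: np.ndarray      # (H, W) boolean drivable-region mask
```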
An example HD map: collected data is merged into a geometric map, which is then labeled with HD map features representing the semantic knowledge of the scene.
To get around this limitation, many AV companies built detailed, centimeter-scale semantic maps of these features. There was concern that real-world changes would occur too frequently for this to be a reasonable strategy, but experience eventually showed that, outside of construction sites, most semantic features in a map were stable for months or years at a time, and changes were relatively isolated when they did happen. Companies following this approach realized they could detect map changes on the road and have human labelers repair the map later, letting them rely on an HD map for the long term.
An example intersection that underwent a map change due to construction (top), and the corresponding top-down polyline representation of its lane markings, curbs, driveways, and lane centers (bottom).
However, the cost and complexity of building and maintaining an HD map scale with geographic coverage, and in areas without high traffic, a business built on top of these HD maps may never provide a return on investment. On top of that, building HD maps can be slow, significantly limiting how quickly driverless systems can expand to new areas and domains. Over the past few years, much progress has been made in online perception of occupancy, object detection, and semantic segmentation. But polylines have remained a particularly sticky prediction target due to their high accuracy requirements and complicated interconnectivity, and they are often what people mean when they refer to the HD mapping problem.
How might one solve these problems?
The simplest solution is to accept the cost of HD mapping and transfer part of the scene-understanding challenge to human labelers. But this approach creates a bootstrapping problem: one needs to build and maintain massive HD maps for all deployment areas, which requires significant upfront operational cost, may significantly slow deployment rollout, and may limit driverless vehicles to densely populated locales that are capable of and willing to pay higher prices for a driverless-vehicle-based service.
High-level architecture for traditional HD maps. Maps are labeled by hand and passed directly onboard. During deployment, change detection systems catch discrepancies with the offboard map to ensure safe operation.
At the other end of the solution spectrum is attempting to learn an online ML perception model that predicts all the components of an HD map. In the past few years, academic work has made this possibility more compelling and feasible (e.g., MapTR, VectorMapNet). Such a system would require less labeled data to deploy in new regions than the full map-building strategy, and would likely be cheaper to deploy as a result. These systems typically fuse measurements from a number of sensors into an encoded 2D grid around the robot, called a Birds Eye View (BEV) representation of the sensor data; fittingly, the model that lifts the sensors into this representation is dubbed the BEV encoder. However, because of limited sensor range and field of view, these systems still fall short of the accuracy and the always-complete scene understanding an HD map provides, and both of those traits are desirable, and likely necessary, to reduce the risk of adverse events enough to enable large-scale driverless deployments.
High-level architecture for an online-only HD map prediction model. A model is trained to fuse sensor information in the Birds Eye View (BEV) encoder and decode it into predicted polyline features. At runtime, the downstream autonomy system uses these predictions directly to understand the environment from sensor data alone.
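To make this architecture more concrete, here is a heavily simplified, hedged sketch in PyTorch of a DETR-style polyline decoder over BEV features, loosely in the spirit of MapTR and VectorMapNet. All module names and sizes, and the assumption of a pre-fused BEV tensor, are illustrative; this is not the published architectures.

```python
import torch
import torch.nn as nn


class OnlineMapModel(nn.Module):
    def __init__(self, bev_channels=64, num_queries=50, pts_per_line=20,
                 num_classes=3):
        super().__init__()
        # Stand-in BEV encoder: real systems lift multi-camera images and/or
        # lidar into the BEV grid; here we assume an already-fused BEV tensor.
        self.bev_encoder = nn.Conv2d(bev_channels, bev_channels, 3, padding=1)
        # Learned queries, one per candidate polyline (DETR-style decoding).
        self.queries = nn.Embedding(num_queries, bev_channels)
        self.attn = nn.MultiheadAttention(bev_channels, 4, batch_first=True)
        # Each query regresses a fixed number of 2D vertices plus a class.
        self.point_head = nn.Linear(bev_channels, pts_per_line * 2)
        self.class_head = nn.Linear(bev_channels, num_classes)

    def forward(self, bev):                       # bev: (B, C, H, W)
        feats = self.bev_encoder(bev).flatten(2).transpose(1, 2)  # (B, HW, C)
        q = self.queries.weight.unsqueeze(0).expand(bev.shape[0], -1, -1)
        decoded, _ = self.attn(q, feats, feats)   # queries attend over BEV
        points = self.point_head(decoded)         # (B, Q, pts*2) vertex coords
        logits = self.class_head(decoded)         # (B, Q, num_classes)
        return points, logits
```

In practice, training such a model also involves Hungarian matching between predicted and ground-truth polylines; that machinery is omitted here for brevity.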
Building off this, some recent academic work (Mind the Map, Neural Map Prior, etc.) has proposed something in between: training a model that consumes both out-of-date offline semantic map features and online sensor measurements. This could be the best of both worlds: a method that learns to pass through an accurate offline HD map prior when it is correct, but is robust to map changes and low-quality labeling, requiring much less frequent map maintenance and relaxing the accuracy requirements on offline HD map labels. Such a model should lean on the prior where online sensor measurements cannot resolve semantic map features due to occlusion or sensor resolution, while providing real-time, accurate, and robust predictions nearer to the AV, learning during training to trade off between the two sources to maximize map accuracy and give the motion planner the most accurate possible representation of the world.
High-level architecture for a hybrid HD map prediction model, which learns to fuse information from an offboard HD map prior and onboard sensors to predict the final polylines.
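One simple way to realize this fusion, sketched below under our own assumptions (channel counts, a concat-and-convolve fusion rather than, say, cross-attention), is to rasterize the prior polylines into the same BEV grid as the sensor features and let the network learn when to trust each source. The fused features could then feed the same kind of polyline decoder as the online-only model above.

```python
import torch
import torch.nn as nn


class HybridFusion(nn.Module):
    def __init__(self, sensor_channels=64, prior_channels=8, out_channels=64):
        super().__init__()
        # Concatenate sensor and prior channels, then mix them with convs so
        # the network can learn where to pass the prior through and where to
        # override it with direct observation.
        self.fuse = nn.Sequential(
            nn.Conv2d(sensor_channels + prior_channels, out_channels, 3,
                      padding=1),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
        )

    def forward(self, sensor_bev, prior_raster):
        # sensor_bev:   (B, 64, H, W) from the BEV encoder
        # prior_raster: (B, 8, H, W), one channel per prior feature type
        return self.fuse(torch.cat([sensor_bev, prior_raster], dim=1))
```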
Deeper Dive: How well does a Hybrid Approach work in practice?
Although the hybrid ML HD map approach is very promising, it carries a crucial caveat for training: real discrepancies between offline maps and the world (i.e., map change events) are quite rare, and they vary greatly in scope and size. One solution, adopted in a variety of academic work on map change detection, is to generate synthetic map change events and train the model to fix them, with the hope that it will generalize to real-world events, as sketched below.
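Here is a minimal sketch, in Python with NumPy, of what such synthetic perturbation might look like. The perturbation types, probabilities, and magnitudes are illustrative assumptions, not the specific corruptions used in our work or in the cited papers.

```python
import numpy as np


def perturb_prior(polylines, rng, drop_prob=0.1, shift_std=0.5):
    """polylines: list of (N, 2) arrays of map-frame vertices, in meters."""
    perturbed = []
    for line in polylines:
        if rng.random() < drop_prob:
            continue                      # simulate a removed feature
        # A rigid shift simulates, e.g., a repainted or moved lane line.
        shift = rng.normal(0.0, shift_std, size=2)
        # Small per-vertex jitter simulates labeling noise.
        jitter = rng.normal(0.0, 0.05, size=line.shape)
        perturbed.append(line + shift + jitter)
    return perturbed


rng = np.random.default_rng(0)
prior = [np.array([[0.0, 0.0], [5.0, 0.0], [10.0, 0.5]])]
noisy_prior = perturb_prior(prior, rng)
```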
This approach shows great promise in the academic literature, but as an AV company, we are in the unique position of having a large historical backlog of out-of-date semantic HD maps alongside their up-to-date counterparts. This means we can train on synthetic map prior changes and test against a large set of real-world map changes.
Examples of the synthetic HD map prior changes we evaluated in our recent publication, ranging from minor to major changes to the positions or semantic meaning of the polylines in the region.
That’s exactly what we did in our recent publication at the CVPR 2024 Workshop on Autonomous Driving. We found that, as intuition would suggest, providing a map prior does improve the performance of a map prediction model. In scenes with minor map change events, such as small changes to or label errors on curbs, the model has little trouble integrating the prior and the sensors to match or exceed the accuracy of the prior alone, adapting to discrepancies as needed. But we also found that current methods of synthetic perturbation, and even some new ones, don’t provide a strong enough training signal to handle major map change events, such as a rebuilt intersection or a new median. In these cases, the model struggles to reject the prior map given sensor measurements, or simply gets confused. This is likely because the prior, even after being corrupted by various synthetic noises, is so reliable most of the time that even high-quality sensor data and direct observation can be a noisier signal than the incomplete, noisy prior. We are actively working on addressing these limitations, and this work uncovers many impactful research opportunities for the future.
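For readers curious how polyline accuracy is typically scored in this literature: benchmarks for MapTR-style models commonly match predictions to ground truth under a thresholded Chamfer distance. Below is a hedged sketch of the core distance only; the full AP-style metric and our paper’s exact evaluation protocol are not reproduced here.

```python
import numpy as np


def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between two polylines.

    pred, gt: (N, 2) and (M, 2) arrays of sampled vertices, in meters.
    """
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M)
    # Average nearest-neighbor distance in both directions.
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```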
Conclusions:
Ultimately, solving complex technical problems like this one is crucial to safe, large-scale driverless deployment. The unique challenges and data we work with regularly, as well as the incredible people with whom we collaborate, allow us to solve many interesting technical challenges and put driverless vehicles on the road safely. If you are interested in working with us on these kinds of problems, we are hiring!
Also, if you are interested in learning more about the work we will be sharing at CVPR 2024, feel free to check it out here and come say hi!
By: Samuel Bateman, Ning Xu, Charles Zhao, Yael Ben Shalom, Vince Gong, Greg Long, Will Maddern