Rethinking Maps for Self-Driving
Maps are a key component of building self-driving technology. Unlike the regular web map services in wide use today for turn-by-turn navigation, the specialized needs of autonomous vehicles (AVs) require a new class of high-definition (HD) maps. These HD maps need to represent the world at an unprecedented centimeter-level resolution, which is one to two orders of magnitude finer than the roughly meter-level resolution that web map services offer today.
AVs demand such high resolution because they routinely need to execute complex maneuvers, such as nudging into a bike lane to take a turn or safely passing bicyclists. For example, marked bike lanes in the United States are typically 4 feet (1.2 meters) wide, while unmarked ones can be as narrow as 2 feet (0.6 meters). The lane markings themselves are 4 inches (10 cm) wide. Centimeter-level accurate maps are a must for an AV to reason confidently about its position within a lane, assess its distance from the curb, and take action.
In this blog post, I’ll share how we at Level 5 are thinking about HD maps for self-driving and how the various pieces of information in the HD map are produced and organized for use by the AV. This treatment is meant for a general audience interested in learning about HD maps and how they power self-driving. Future posts will go deeper into the technical aspects of HD maps.
What is the map?
HD maps are not new. Several mapping and self-driving companies have already started to produce and consume HD maps. However, it is still early days in terms of how these maps are being built, the richness of information they contain, and how accurate they are. Companies are iterating quickly on making these HD maps better and as such there is little standardization between various providers and consumers.
At Level 5, the HD maps we are building are purpose-driven. Our goal is to accelerate the development of self-driving vehicles that serve ridesharing scenarios. The overall Level 5 mission is to become the first ridesharing network to operate self-driving vehicles at scale. Scale here implies hundreds of thousands of self-driving vehicles. Given this, we view the HD map as a specialized component in the autonomy stack.
What belongs in an HD map and what doesn’t is decided based on a set of principles. One can ask, why does this matter? If you build HD maps for one use case, can’t they be easily used for other purposes? The short answer is no, while a somewhat longer answer is maybe. Let’s consider two examples. HD maps can be built either for use by AVs in a ridesharing network or for use with augmented reality (AR) applications on mobile phones. The former would focus on road elements and on high-demand routes with low driver supply, while the latter would focus on street sides, storefronts and interiors, public landmarks, etc. There is little overlap between the two use cases. Having a clear set of principles helps us build the right HD mapping technology. Tight integration between the map build technology and the autonomy technology helps accelerate both by creating and leveraging shared components. For this reason, most HD maps built for autonomy use today are vertically integrated with the corresponding autonomy software stack.
HD Map Principles
There are four main principles that capture how we define the map and go about building it.
Mapping as pre-computation: From the point of view of self-driving technology, the mapping operation includes everything we can do to pre-compute things before the AV starts driving. In some cases, this pre-computation can result in completely solving sub-parts of the autonomy problem. For example, perception and localization of static objects in the world such as roads, intersections, street signs, etc. can be solved offline and in a highly accurate manner. Human operators can curate pre-computed data to ensure high quality. In other cases, the problem cannot be completely solved ahead of time, but one can pre-compute partial, approximate, or intermediate results that make real-time autonomy work easier. We use these partial results to build map priors, which represent Bayesian prior probabilities about dynamic parts of the world, such as observed speed profiles and unmarked parking spaces. Pre-computed results include both spatial and temporal aspects of the world and are indexed for efficient retrieval.
Mapping to improve safety: The use of maps for navigation is well understood. An AV that aspires to drive safely not only needs to perform expert navigation, but also adopt pragmatic best practices that reduce risk during driving. So, Level 5 HD maps are designed not only to contain speed limit information for each lane segment, but also speed profiles derived from actual human drivers on the Lyft network that meet our high bar for safety.
Map as a unique sensor: At runtime, the map is viewed by the autonomy system as a sensor with special perception and prediction capabilities. When compared with other sensors such as cameras and lidar, the map has no range limitations. It can sense things way beyond the 100–200m range that is typical of today’s AV sensors. It is also immune to runtime occlusion from dynamic objects like other vehicles. Viewing the map as yet another sensor allows us to design efficient map access patterns and integrate map data more naturally into the autonomy stack (e.g. sensor-fusion components).
Map as global shared state: AVs can achieve higher safety and efficiency levels by working together in AV fleets. You can think of the pre-computed data described above as the social memory of the whole AV fleet, accessible to every AV in the fleet as part of the map. This memory is very large and changes slowly. In addition, at runtime, you can view the AV fleet as sensing the world together in a distributed manner from various angles and points of view. The map then becomes a shared data structure that lives both in the cloud and on board each of the AVs. AVs use the map both to read from and to write to this social memory. Writing is how we think of sharing real-time information between the AVs in the fleet.
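Of these principles, the "map as a unique sensor" idea lends itself to a small illustration. The sketch below is only one way to picture it, not a description of the Level 5 stack: `MapObject`, `MapSensor`, `sense`, and the 200 m `LIDAR_RANGE_M` constant are all hypothetical names and values chosen for the example. The point is simply that a map query has no line-of-sight or range limit, so it can report static objects that onboard sensors cannot yet see.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MapObject:
    """A static object stored in the map (hypothetical schema)."""
    kind: str
    x: float  # meters, in an assumed local map frame
    y: float


class MapSensor:
    """Treats the map as a sensor that is queried by position and radius."""

    def __init__(self, objects):
        self._objects = list(objects)

    def sense(self, x, y, radius_m):
        """Return (distance, object) pairs within radius_m, nearest first."""
        hits = []
        for obj in self._objects:
            distance = math.hypot(obj.x - x, obj.y - y)
            if distance <= radius_m:
                hits.append((distance, obj))
        hits.sort(key=lambda hit: hit[0])
        return hits


# Assumed upper end of the 100-200 m sensor range mentioned in the text.
LIDAR_RANGE_M = 200.0

map_sensor = MapSensor([
    MapObject("stop_sign", 50.0, 0.0),
    MapObject("traffic_light", 450.0, 0.0),
])

# Query 1 km around the vehicle, far beyond what lidar or cameras can see.
detections = map_sensor.sense(0.0, 0.0, radius_m=1000.0)
beyond_lidar = [obj.kind for dist, obj in detections if dist > LIDAR_RANGE_M]
print(beyond_lidar)  # ['traffic_light']
```

Because the result has the same shape as a detection list, it could in principle be fed to a sensor-fusion component alongside camera and lidar detections.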
We build HD maps using a fleet of cars carrying state-of-the-art sensors such as HD cameras, lidar, radar, GPS, and inertial measurement units (IMUs). Early on we chose to keep the sensor configuration and hardware build the same as that used by our self-driving cars. This constraint makes it easy for us to ensure that the HD maps we build work correctly with the downstream autonomy software stack, which supports various self-driving functions such as localization, perception, prediction, and planning. In our experience, sharing cars interchangeably between mapping and autonomy feature development gets us better utilization of our autonomy fleet while keeping fleet management overhead low. Similarly, having a single hardware SKU makes it simpler for us to manage hardware development. Our map build process also uses the same data collection logs as autonomy feature development. This enables us to reuse any and all miles collected by the car to build better maps. Map build data collection runs don’t need the self-driving car to be engaged in autonomy mode, that is, with the car driving itself. However, we do keep many of the autonomy services running in passive mode, especially localization and perception, as their outputs help greatly in subsequent map build processing. For example, perception output can help with the removal of dynamic objects that aren’t part of the HD map, and localization match errors can quickly help us detect which parts of the map are stale.
A Layered Map
The information contained in an HD map is represented as layers. Organizing the information in layers makes it easy to independently design, build, test, and release new information. These layers are aligned with each other and indexed in a manner that allows for efficient parallel lookups of information, both for the current location of the AV and for its local neighborhood. We think of the basic road network data offered by web map services as being the bottommost layer. Each subsequent layer adds additional details to the map. At Lyft Level 5, our HD maps contain several layers. Four noteworthy HD layers are: the geometric map, the semantic map, map priors, and real-time knowledge.
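To make the layered organization concrete, here is a minimal sketch of layers that share one spatial index, so that a single query returns data for the AV's current location and its neighborhood from every layer. This is an illustration under assumptions, not the actual storage format: the 100 m tile size, the `tile_key` function, and the `Layer` and `LayeredMap` classes are all hypothetical.

```python
from dataclasses import dataclass, field

TILE_SIZE_M = 100.0  # assumed tile edge length, chosen for illustration


def tile_key(x, y):
    """Map a position in meters to the integer index of its tile."""
    return (int(x // TILE_SIZE_M), int(y // TILE_SIZE_M))


@dataclass
class Layer:
    """One map layer; items are bucketed by the shared tile index."""
    name: str
    tiles: dict = field(default_factory=dict)

    def put(self, x, y, item):
        self.tiles.setdefault(tile_key(x, y), []).append(item)

    def get(self, key):
        return self.tiles.get(key, [])


class LayeredMap:
    def __init__(self, layer_names):
        self.layers = {name: Layer(name) for name in layer_names}

    def lookup(self, x, y, ring=1):
        """Collect items from every layer for the current tile and its
        neighbors within `ring` tiles."""
        cx, cy = tile_key(x, y)
        result = {}
        for name, layer in self.layers.items():
            items = []
            for dx in range(-ring, ring + 1):
                for dy in range(-ring, ring + 1):
                    items.extend(layer.get((cx + dx, cy + dy)))
            result[name] = items
        return result


hd_map = LayeredMap(["geometric", "semantic", "priors", "real_time"])
hd_map.layers["semantic"].put(120.0, 40.0, "crosswalk")
hd_map.layers["priors"].put(130.0, 45.0, "parking_prior")
hd_map.layers["semantic"].put(900.0, 900.0, "stop_sign")  # far away

nearby = hd_map.lookup(90.0, 50.0)
print(nearby["semantic"])  # ['crosswalk']
```

Because every layer uses the same tile key, each layer's lookup is independent of the others, which is what would allow the per-layer queries to run in parallel in a real system.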
Geometric Map Layer
The geometric map layer contains 3D information about the world. This information is organized in very high detail to support precise calculations. Raw sensor data from lidar, various cameras, GPS, and IMUs is processed using simultaneous localization and mapping (SLAM) algorithms to first build a 3D view of the region explored by the mapping data collection run. The outputs of the SLAM algorithm are an aligned dense 3D point cloud and a very precise trajectory that the mapping vehicle took. Each 3D point is colored using the colors observed for that point in the corresponding camera images; in visualizations of this output, the vehicle trajectory is drawn in pink. The 3D point cloud is post-processed to produce derived map objects that are stored in the geometric map. Two important derived objects are the voxelized geometric map and the ground map. The voxelized geometric map is produced by segmenting the point cloud into voxels that are as small as 5 cm x 5 cm x 5 cm. During real-time operation, the voxelized map is the most efficient way to access point cloud information. It offers a good trade-off between accuracy and speed. To build the ground map, segmentation algorithms identify the 3D points in the point cloud that belong to the ground, defined as the driveable surface part of the map. These ground points are used to build a parametric model of the ground in small sections. The ground map is key for aligning the subsequent layers of the map, such as the semantic map.
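The voxelization step can be sketched in a few lines. The version below is a simplified illustration, assuming points are (x, y, z) tuples in meters; the `voxelize` function and the choice to keep a centroid and a point count per voxel are assumptions made for the example, not details given above.

```python
import math
from collections import defaultdict

VOXEL_SIZE_M = 0.05  # 5 cm voxels, the smallest size mentioned in the text


def voxelize(points, voxel_size=VOXEL_SIZE_M):
    """Group 3D points into cubic voxels.

    Returns a dict mapping an integer voxel index (i, j, k) to a
    (centroid, point_count) pair. Storing a centroid and a count is an
    assumption; a real map could store occupancy, color, or other data.
    """
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[key].append((x, y, z))

    voxels = {}
    for key, pts in buckets.items():
        n = len(pts)
        centroid = tuple(sum(p[i] for p in pts) / n for i in range(3))
        voxels[key] = (centroid, n)
    return voxels


cloud = [
    (0.01, 0.01, 0.01),
    (0.03, 0.02, 0.04),  # falls in the same 5 cm voxel as the first point
    (0.12, 0.01, 0.01),  # falls in a different voxel
]
voxels = voxelize(cloud)
print(sorted(voxels))  # [(0, 0, 0), (2, 0, 0)]
```

A lookup by voxel index is a constant-time dictionary access, which hints at why a voxelized representation is cheaper to query at runtime than the raw point cloud, at the cost of quantizing positions to the voxel size.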
Semantic Map Layer
The semantic map layer builds on the geometric map layer by adding semantic objects. Semantic objects include various 2D and 3D traffic objects such as lane boundaries, intersections, crosswalks, parking spots, stop signs, traffic lights, etc. that are used for driving safely. These objects carry rich metadata, such as speed limits and turn restrictions for lanes. While the 3D point cloud might contain all of the pixels and voxels that represent a traffic light, it is in the semantic map layer that a clean 3D object identifying the 3D location and bounding box of the traffic light and its various components is stored. We use a combination of heuristics, computer vision, and point classification algorithms to generate hypotheses for these semantic objects and their metadata. On its own, the output of these algorithms isn’t accurate enough for us to produce a high-fidelity map. Human operators post-process these hypotheses via rich visualization and annotation tools to both validate the quality and fix any misses. For example, to identify traffic lights, we first run a traffic light detector on the camera images. Visual SLAM is used to process multiple camera images to get a coarse location of the traffic light in 3D. Lidar points in the local neighborhood of this location are matched and processed to produce the bounding box and orientation of the traffic light and its sub-components. We also employ heuristics for solving simpler problems. One area where we’ve found heuristics to be useful is in the generation of lane hypotheses, yield relationships, and connectivity graphs at intersections. There is a lot of structure in how these are set up for roads, especially since there are local laws that ensure consistency. Feedback from the human curation and quality assurance steps is used to keep these up to date.
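The last step of the traffic light example, going from a coarse 3D location to a bounding box using nearby lidar points, can be illustrated with a deliberately simplified sketch. It assumes an axis-aligned box and a fixed 1 m search radius, and it ignores orientation and sub-components; `TrafficLightHypothesis`, `hypothesize_traffic_light`, and the `reviewed` flag are hypothetical names standing in for whatever a real pipeline uses.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficLightHypothesis:
    """Machine-generated hypothesis awaiting human validation (hypothetical)."""
    min_corner: tuple
    max_corner: tuple
    reviewed: bool = False  # set by a human operator in the curation step


def hypothesize_traffic_light(coarse_xyz, lidar_points, radius_m=1.0):
    """Build an axis-aligned box from lidar points near a coarse location.

    coarse_xyz stands in for the visual SLAM estimate; radius_m is an
    assumed search radius. Returns None if no lidar points are nearby.
    """
    near = [p for p in lidar_points if math.dist(p, coarse_xyz) <= radius_m]
    if not near:
        return None
    min_corner = tuple(min(p[i] for p in near) for i in range(3))
    max_corner = tuple(max(p[i] for p in near) for i in range(3))
    return TrafficLightHypothesis(min_corner, max_corner)


coarse_location = (10.0, 5.0, 5.0)
lidar_points = [
    (10.1, 5.0, 4.8),
    (9.9, 5.2, 5.4),
    (10.0, 4.9, 5.0),
    (14.0, 5.0, 5.0),  # too far from the coarse location, ignored
]
hypothesis = hypothesize_traffic_light(coarse_location, lidar_points)
print(hypothesis.min_corner, hypothesis.max_corner)
# (9.9, 4.9, 4.8) (10.1, 5.2, 5.4)
```

The `reviewed` flag reflects the workflow described above: the algorithm only proposes a hypothesis, and a human operator validates or corrects it before it enters the map.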
The geometric and semantic map layers provide information about the static and physical parts of the world that are important to the self-driving vehicle. They are built at a very high fidelity and there is very little ambiguity about what the ground truth is. At Level 5, we view the map as a component that not only captures our understanding of the physical and static parts of the world, but also dynamic and behavioral aspects of the environment. The map priors layer and real-time knowledge layer represent this information. Information in these layers is computed not only from logs from the AV fleet, but also from the Lyft ridesharing network comprising millions of Lyft drivers. This scale is necessary to achieve high coverage of the map priors and ensure freshness of the real-time information.
Map Priors Layer
The map priors layer contains derived information about dynamic elements and about human driving behavior. Information here can pertain to both semantic and geometric parts of the map. For example, derived information such as the order in which traffic lights at an intersection cycle through their various states, e.g. (red, protected-left, green, yellow, red) or (red, green, protected-left, yellow, red), and the amount of time spent in each state are encoded in the map priors layer. Time of day and day of week are used as keys so that different settings can be stored for different periods. These priors are approximate and serve as hints to the onboard autonomy systems. Another example is parking priors in the map. These parking priors are used by the prediction and planning systems to determine object velocities and make appropriate decisions. Parking priors are represented as polygonal regions on the lanes, with metadata that captures the probability of encountering a parked vehicle at that location in the lane. When the AV encounters a stationary vehicle in a map region with a high parking prior, it will more aggressively explore plans that route the AV around the vehicle and demote plans that queue up the AV behind the vehicle. Similarly, knowing where people normally park allows perception systems to be more alert to car doors opening and to detected pedestrians, who might be getting in and out of cars. Unlike information in the geometric and semantic layers of the map, the information in the map priors layer is designed to be approximate and to act as hints. Autonomy algorithms commonly consume these priors as model inputs or features, combining them with other real-time information.
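The parking prior example can be worked through with Bayes' rule. The sketch below is illustrative only: the polygon, the probabilities, the two likelihood values, and the 0.5 decision threshold are all made-up numbers, and `point_in_polygon`, `parking_prior_at`, and `posterior_parked` are hypothetical helpers. It looks up the prior for the region containing a stationary vehicle, then combines it with a real-time observation to decide whether plans that pass the vehicle should be favored.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as (x, y) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def parking_prior_at(x, y, regions, default=0.05):
    """Return the prior P(parked) for the first region containing the point."""
    for polygon, probability in regions:
        if point_in_polygon(x, y, polygon):
            return probability
    return default  # assumed background prior outside any region


def posterior_parked(prior, p_obs_given_parked, p_obs_given_queued):
    """Bayes' rule for the two hypotheses 'parked' and 'queued in traffic'."""
    numerator = prior * p_obs_given_parked
    return numerator / (numerator + (1.0 - prior) * p_obs_given_queued)


# A curbside strip where parked vehicles are common (made-up numbers).
regions = [([(0.0, 0.0), (20.0, 0.0), (20.0, 2.5), (0.0, 2.5)], 0.8)]

# Assumed likelihoods of the observation "stationary, no brake lights".
P_OBS_PARKED, P_OBS_QUEUED = 0.9, 0.3

curbside = posterior_parked(parking_prior_at(10.0, 1.0, regions),
                            P_OBS_PARKED, P_OBS_QUEUED)
mid_lane = posterior_parked(parking_prior_at(10.0, 5.0, regions),
                            P_OBS_PARKED, P_OBS_QUEUED)

# Favor passing plans only when the vehicle is more likely parked than queued.
plans = ["pass" if p > 0.5 else "queue" for p in (curbside, mid_lane)]
print(plans)  # ['pass', 'queue']
```

The same observation leads to different decisions in the two locations, which is the sense in which the prior acts as a hint rather than as ground truth.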
Real-Time Knowledge Layer
The real-time layer is the topmost layer in the map and is designed to be read/write capable. This is the only layer in the map designed to be updated while the map is in use by an AV serving a ride. It contains real-time traffic information such as observed speeds, congestion, newly discovered construction zones, etc. The real-time layer is designed to support gathering and sharing of real-time global information across a whole fleet of AVs.
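A read/write layer of this kind could be sketched as a small store keyed by lane segment, where any AV can write an observation and any other AV can read the freshest values. Everything here is a hypothetical illustration rather than the actual design: the `Observation` and `RealTimeLayer` names, the "latest timestamp wins" rule, and the 5-minute expiry are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    segment_id: str     # lane segment the observation refers to
    kind: str           # e.g. "speed_mps" or "construction"
    value: float
    timestamp_s: float  # seconds on a clock shared by the fleet (assumed)
    reporter: str       # which AV reported it


class RealTimeLayer:
    def __init__(self, ttl_s=300.0):
        self.ttl_s = ttl_s  # assumed expiry for stale observations
        self._latest = {}

    def write(self, obs):
        """Keep only the most recent observation per (segment, kind)."""
        key = (obs.segment_id, obs.kind)
        current = self._latest.get(key)
        if current is None or obs.timestamp_s >= current.timestamp_s:
            self._latest[key] = obs

    def read(self, segment_id, now_s):
        """Return the non-expired values for one segment, keyed by kind."""
        return {
            kind: obs.value
            for (seg, kind), obs in self._latest.items()
            if seg == segment_id and now_s - obs.timestamp_s <= self.ttl_s
        }


layer = RealTimeLayer(ttl_s=300.0)
layer.write(Observation("lane_42", "speed_mps", 8.0, 1000.0, "av_1"))
layer.write(Observation("lane_42", "construction", 1.0, 1010.0, "av_1"))
layer.write(Observation("lane_42", "speed_mps", 3.5, 1060.0, "av_2"))
layer.write(Observation("lane_42", "speed_mps", 9.0, 900.0, "av_3"))  # older, ignored

fresh = layer.read("lane_42", now_s=1100.0)
later = layer.read("lane_42", now_s=1330.0)  # construction report has expired
print(fresh)
print(later)
```

In a fleet setting the store would live in the cloud and be synchronized to each vehicle; that replication is outside the scope of this sketch.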
Each of the above map layers is built independently. Derived layers may rely on intermediate outputs from previous layers. For example, the semantic layer uses the ground map generated by the geometric layer to identify z-positions of the lane polygons. In the final step, alignment algorithms are used to stitch together all layers of the map before it is released as one consistent component to the self-driving vehicle.
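The dependency between the semantic and geometric layers can be illustrated with a small example. Assuming, purely for illustration, that each small section of the ground map is modeled as a plane z = a*x + b*y + c fitted to ground points by least squares, the z-position of a lane polygon vertex is obtained by evaluating that plane at the vertex. The plane model and the `fit_plane` and `lift_polygon` helpers are assumptions; the text above says only that the ground model is parametric.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c to (x, y, z) ground points.

    Solves the 3x3 normal equations with Cramer's rule.
    """
    n = float(len(points))
    sx = sum(x for x, _, _ in points)
    sy = sum(y for _, y, _ in points)
    sz = sum(z for _, _, z in points)
    sxx = sum(x * x for x, _, _ in points)
    syy = sum(y * y for _, y, _ in points)
    sxy = sum(x * y for x, y, _ in points)
    sxz = sum(x * z for x, _, z in points)
    syz = sum(y * z for _, y, z in points)

    normal = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    d = det3(normal)
    if abs(d) < 1e-12:
        raise ValueError("ground points are degenerate (e.g. collinear)")

    coeffs = []
    for col in range(3):
        m = [row[:] for row in normal]
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(det3(m) / d)
    return tuple(coeffs)  # (a, b, c)


def lift_polygon(polygon_xy, plane):
    """Attach a z-position from the ground plane to each 2D vertex."""
    a, b, c = plane
    return [(x, y, a * x + b * y + c) for x, y in polygon_xy]


# Ground points sampled from a gently sloping road section (made-up data).
ground_points = [(0.0, 0.0, 3.0), (10.0, 0.0, 3.2),
                 (0.0, 10.0, 2.9), (10.0, 10.0, 3.1)]
plane = fit_plane(ground_points)

lane_polygon_xy = [(2.0, 2.0), (8.0, 2.0), (8.0, 5.0), (2.0, 5.0)]
lane_polygon_xyz = lift_polygon(lane_polygon_xy, plane)
```

For these points the fit recovers a slope of roughly 0.02 in x and -0.01 in y, so the first lane vertex sits at a height of about 3.02 m.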
At Level 5, we are excited about building HD maps that make self-driving cars a reality. If these HD map topics interest you and you are passionate about building maps and autonomy components for self-driving cars, check out our careers page. We’re hiring!