A Visual Guide to the Software Architecture of Autonomous Vehicles

Justin Milner
8 min read · Sep 25, 2022


This article provides a high-level overview of common AV software architecture design as of late 2022.

While there is plenty of variation between AV projects at the low and mid levels, a predominant high-level architecture is largely in place in 2022. The most significant split in high-level architecture design centers on the use, or purposeful avoidance, of LiDAR and HD maps. Some engineers casually tag the two sides of this divide as Tesla-like architecture (no LiDAR or HD maps) and Waymo-like architecture. This article focuses on the Waymo-like architecture, as it is the more frequently deployed approach in 2022.

Waymo-like software architecture can be abstracted into the following technical areas: mapping, sensing, perception, localization, planning, prediction, control, actuation, and simulation. Note that in more detailed studies, the boundaries and connections between these areas may become less well-defined. The remainder of this article breaks down each module in more detail.

Mapping

Maps are provided as priors for an autonomous trip and contain information about previously observed permanent and semi-permanent structures, road networks, traffic laws, and more. The data is commonly provided in 2D, 3D, and graph formats. Examples of common attributes include the positions of pedestrian crossings, traffic lights, road signs, barriers, etc.

The information provided via maps is utilized primarily during:

  • Localization: where sensed/perceived info is referenced to map info. Notably, sensed/perceived data during an AV trip is typically also used to update or influence the state of the mapping database.
  • Planning: where maps are used to assist generation of global routes and influence behavioral/local plans.
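
As a concrete illustration, here is a minimal sketch of how such map priors might be structured in code. All class and field names here are hypothetical, for illustration only, and do not reflect any production map format:

```python
from dataclasses import dataclass, field

@dataclass
class MapElement:
    """A single permanent/semi-permanent map feature, e.g. a crosswalk."""
    element_id: int
    element_type: str   # "traffic_light", "crosswalk", "road_sign", ...
    position: tuple     # (x, y, z) in the map frame, meters

@dataclass
class LaneNode:
    """A node in the road-network graph used for routing."""
    node_id: int
    position: tuple                                  # (x, y) in the map frame
    successors: list = field(default_factory=list)   # reachable LaneNode ids
    speed_limit_mps: float = 13.4                    # traffic-law attribute

# A map prior bundles geometry, semantics, and the routing graph:
hd_map = {
    "elements": [MapElement(1, "crosswalk", (105.2, 44.0, 0.0))],
    "road_graph": {0: LaneNode(0, (100.0, 40.0), successors=[1])},
}
```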

External Sensation/Perception

A Waymo-like sensor stack includes three primary externally focused sensor types: camera, LiDAR, and RaDAR, each of which has its advantages and disadvantages. Cameras produce high-resolution 2D images, providing positional and color information. LiDAR also produces positional information, along with precise distance measurements. In addition to positional and distance measurements, RaDAR provides speed measurements and can observe some areas that are obstructed along the line of sight, albeit at a significantly lower resolution.

Processing can be done on each sensor type’s data individually, or jointly after sensor fusion. Most commonly, sensor fusion occurs in multiple stages with processing occurring both independently and jointly.

Processing techniques can be divided into detection of objects/semantics and monitoring/tracking of those detections. Detection is implemented either by declaring the boundaries of an object and assigning it a class (object detection/classification), or by labeling each point/pixel/voxel independently (segmentation).

Monitoring implementations typically utilize trajectory prediction models, where kinematic motion is propagated across time steps, or learning-based methods, where detection association between time steps is learned. Monitoring overlaps with the localization module, which will be covered next.
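
As a sketch of the kinematic-propagation approach, here is a minimal constant-velocity Kalman filter that propagates a track between time steps and corrects it with the next associated detection. The motion model, noise values, and time step are illustrative assumptions:

```python
import numpy as np

dt = 0.1  # assumed time step between sensor frames, seconds

# State: [x, y, vx, vy]; constant-velocity motion model.
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],           # we only observe position
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 0.01                  # process noise (assumed)
R = np.eye(2) * 0.25                  # measurement noise (assumed)

def predict(x, P):
    """Propagate the track forward one time step."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correct the track with an associated detection z = [x, y]."""
    y = z - H @ x                      # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P

# One predict/update cycle for a track initialized at the first detection:
x, P = np.array([5.0, 2.0, 0.0, 0.0]), np.eye(4)
x, P = predict(x, P)
x, P = update(x, P, np.array([5.1, 2.05]))
```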

Landmark-based Localization

In addition to the mapping and perception data, localization also receives input from three internally-focused sensor types:

  • Odometer: Distance traveled
  • IMU: Linear acceleration and angular velocity
  • GNSS: Geo-spatial positioning

Odometry and IMU data provide an indication of how the ego-vehicle is moving through space. This information is especially useful to the tracking effort because it provides a baseline for how external detection coordinates should transform relative to the ego-vehicle’s movements. Within the localization module, tracking is sometimes referred to as local association.

The identified objects (landmarks) from local association are then passed to map matching. During map matching, detected landmarks are projected and matched to landmarks detailed by the input map. Map matching helps to reinforce confidence in perceptions, as well as to orient the ego-vehicle spatially.

Geo-spatial positions provided by GNSS are used as an initialization to reduce the search space of the map-matching process. A sliding-window graph approach is also typically used to reduce the computational demand of the matching process.
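
A minimal sketch of this idea: use the GNSS fix to bound the search region, then match each perceived landmark to the nearest map landmark inside that region. The thresholds and data layout are illustrative assumptions:

```python
import numpy as np

def match_landmarks(detections, map_landmarks, gnss_xy,
                    search_radius=50.0, match_threshold=2.0):
    """Associate perceived landmarks with map landmarks near the GNSS fix.

    detections:    (N, 2) landmark positions projected into the map frame
    map_landmarks: (M, 2) landmark positions stored in the map prior
    gnss_xy:       (2,) rough ego position from GNSS, used to prune the search
    """
    # GNSS initialization: only consider map landmarks near the rough fix.
    near = map_landmarks[
        np.linalg.norm(map_landmarks - gnss_xy, axis=1) < search_radius]
    if len(near) == 0:
        return []
    matches = []
    for i, det in enumerate(detections):
        d = np.linalg.norm(near - det, axis=1)
        j = int(np.argmin(d))
        if d[j] < match_threshold:     # accept only close correspondences
            matches.append((i, j))
    return matches
```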

Planning

Each module discussed previously has been focused on the objective of generating a perceived environment and the ego-vehicle’s state within it. The planning module now begins the process of determining and performing actions in response to the supplied environment-state. Specifically, the planning module’s role is to output a high-resolution trajectory, which the ego-vehicle should then attempt to follow.

The planning module can first be broken into three primary hierarchical sub-components:

1) Global Planning: Generating low resolution waypoints given a start, end, and mapping data.

The global planning process, given a graph representation of road networks, can be formulated as a graph-based path-finding search problem; variations of A*-based algorithms are popular.

While running a thorough search on large road networks can be computationally costly, global planning is also typically a low-frequency process, only being invoked during initialization or re-routing.
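
To make the formulation concrete, here is a minimal A* sketch over a road-network graph, using straight-line distance as the admissible heuristic. The graph format and cost model are assumptions for illustration:

```python
import heapq, math

def a_star(graph, coords, start, goal):
    """graph: {node: [(neighbor, edge_cost), ...]}; coords: {node: (x, y)}."""
    def h(n):  # straight-line distance heuristic (admissible)
        return math.dist(coords[n], coords[goal])

    open_set = [(h(start), start)]
    g = {start: 0.0}
    came_from = {}
    while open_set:
        _, node = heapq.heappop(open_set)
        if node == goal:               # reconstruct low-resolution waypoints
            path = [node]
            while node in came_from:
                node = came_from[node]
                path.append(node)
            return path[::-1]
        for nbr, cost in graph.get(node, []):
            tentative = g[node] + cost
            if tentative < g.get(nbr, float("inf")):
                g[nbr] = tentative
                came_from[nbr] = node
                heapq.heappush(open_set, (tentative + h(nbr), nbr))
    return None  # no route found
```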

2) Behavioral Planning: Determining maneuver specification

Behavioral planning is the highly complex task of abstractly determining how the ego-vehicle should act given the global route and scene representation. For example, behavioral planning determines when the ego-vehicle should change lanes, proceed through an intersection, make a turn, change speed, etc.

Behavioral planning deals with the challenges many of us would designate as the most cognitively demanding tasks of driving. The addition of other traffic participants, as any realistic driving environment will have, multiplies the difficulty of the task.
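
One classical way to frame maneuver selection is a finite state machine over discrete behaviors. The states and transition rules below are toy assumptions, not any production policy:

```python
# Hypothetical maneuver states and a toy transition rule.
MANEUVERS = {"LANE_FOLLOW", "PREPARE_LANE_CHANGE", "LANE_CHANGE", "STOP"}

def next_maneuver(state, scene):
    """scene: dict of simple perceived facts supplied by perception."""
    if scene.get("red_light_ahead"):
        return "STOP"
    if state == "LANE_FOLLOW" and scene.get("slow_vehicle_ahead"):
        return "PREPARE_LANE_CHANGE"
    if state == "PREPARE_LANE_CHANGE" and scene.get("adjacent_gap_clear"):
        return "LANE_CHANGE"
    if state == "LANE_CHANGE" and scene.get("lane_change_complete"):
        return "LANE_FOLLOW"
    return state  # otherwise, keep the current maneuver

state = next_maneuver("LANE_FOLLOW", {"slow_vehicle_ahead": True})
```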

The ego-vehicle must be able to generate behavior which coheres with the behavior of other traffic participants. To do this, it needs to predict, with reasonable accuracy and over a modest time horizon, how the other traffic participants will act. Estimating the behavior of other driving agents is often categorized into its own module, prediction, and can be further broken into the following subtasks:

Prediction:

a) Intention Estimation: Infer what other drivers want to do in the future.

b) Trait Estimation: Infer the traits (e.g., cautious vs. aggressive) with which the driver’s goal will be pursued.

c) Motion Prediction: Predict the future states of each traffic participant.
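
The simplest baseline for subtask (c) is a physics-based rollout. Here is a minimal constant-velocity sketch; in practice, learned models replace this, as discussed below:

```python
import numpy as np

def constant_velocity_rollout(position, velocity, horizon_s=3.0, dt=0.1):
    """Predict future states of one traffic participant by propagating
    its current velocity over a modest time horizon."""
    steps = int(horizon_s / dt)
    t = np.arange(1, steps + 1)[:, None] * dt                # (steps, 1)
    return np.asarray(position) + t * np.asarray(velocity)   # (steps, 2)

# A car at (0, 0) moving 10 m/s along x, predicted 3 s ahead:
future = constant_velocity_rollout((0.0, 0.0), (10.0, 0.0))
```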

Currently, the most promising prediction/behavior estimation approaches utilize data-driven learning techniques. The learning methods supporting these techniques will be further discussed in the simulation section of this article.

3) Local Planning: Generating a high resolution trajectory

The local planning sub-module’s objective is to create a high-resolution trajectory that satisfies the low resolution waypoints from the global planning sub-module, the desired maneuver specified by the behavior planning sub-module, as well as the kinematic constraints of the vehicle.

Trajectory generation and evaluation, like maneuver specification, are very cognitively demanding processes:

Trajectory generation, due to its very large problem space, relies on heuristic-based strategies such as variational planning, where initial trajectories are proposed and then optimized iteratively. Various sampling-based and learning-based methods are also utilized.
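
As a sketch of the iterative-optimization idea: start from a coarse proposal through the waypoints and repeatedly nudge interior points to trade off smoothness against fidelity to the original proposal. The update rule and weights are illustrative assumptions, not a production planner:

```python
import numpy as np

def smooth_trajectory(waypoints, w_smooth=0.25, w_fidelity=0.1, iters=200):
    """Iteratively optimize a trajectory: each pass pulls interior points
    toward their neighbors (smoothness) and back toward the original
    proposal (fidelity). Endpoints stay fixed."""
    path = np.array(waypoints, dtype=float)
    orig = path.copy()
    for _ in range(iters):
        # Gradient-descent-style update on interior points only.
        smooth_grad = path[:-2] + path[2:] - 2 * path[1:-1]
        fidelity_grad = orig[1:-1] - path[1:-1]
        path[1:-1] += w_smooth * smooth_grad + w_fidelity * fidelity_grad
    return path

smoothed = smooth_trajectory([(0, 0), (1, 2), (2, -1), (3, 0), (4, 0)])
```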

Trajectory evaluation is faced with the challenge of creating an objective function by which the generated trajectories will be scored. Common criteria include safety, legality, perceived safety, comfort, route efficiency, and others.

While at an abstract level these criteria are easy to conceptualize, quantifiably scoring them is more difficult. Some criteria naturally have obscure or sparse feedback. For example, collisions are infrequent, and ‘near’ collisions are difficult to label on a continuous scale.

Not only is scoring individual criteria challenging; creating trade-offs between criteria poses an even larger challenge that often invokes convoluted logic. For example, if we treated safety risk as non-negotiable, we would never have the car leave its parked state.
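
A toy illustration of such a multi-criteria objective is below. The terms and weights are purely illustrative assumptions; picking them well is exactly the hard part described above:

```python
import numpy as np

def trajectory_cost(traj, obstacles, goal,
                    w_safety=10.0, w_comfort=1.0, w_efficiency=0.5):
    """Score a candidate trajectory (lower is better).

    traj: (T, 2) positions; obstacles: (K, 2); goal: (2,)
    Each term maps a hard-to-quantify criterion onto a number."""
    traj = np.asarray(traj, dtype=float)
    # Safety: penalize small clearance to the nearest obstacle.
    dists = np.linalg.norm(
        traj[:, None, :] - np.asarray(obstacles)[None], axis=2)
    safety = float(np.sum(1.0 / (dists.min(axis=1) + 1e-3)))
    # Comfort: penalize sharp accelerations (second differences).
    comfort = float(np.sum(np.diff(traj, n=2, axis=0) ** 2))
    # Route efficiency: penalize distance remaining to the goal.
    efficiency = float(np.linalg.norm(traj[-1] - np.asarray(goal)))
    return w_safety * safety + w_comfort * comfort + w_efficiency * efficiency

cost = trajectory_cost(traj=[(0, 0), (1, 0.1), (2, 0.1), (3, 0)],
                       obstacles=[(1.5, 1.0)], goal=(4, 0))
```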

Mid-to-mid planning agents:

Given the complex challenges associated with path planning, some AV projects are seeking to create “mid-to-mid” software architecture — where the entire planning module is a learned policy. Waymo has openly stated that they are targeting a mid-to-mid architecture and that their emphasis on simulation will be essential to this transition.

Simulation

Simulation strategies, and their importance, vary by AV project. The general objective is to model the driving environment and the ego-vehicle such that the learning models used by the vehicle can be trained on artificial data.

The rarity of some driving events, in combination with learning models requiring repetition in order to reliably learn behavior, makes the ability to train on artificial data important. Artificially generated data can be designed to represent driving events that challenge the ego-vehicle, yet appear too sparsely to be learned effectively. For example, real-world data gathered from instances of rare events can be tweaked and augmented to multiply the number of training instances representing that driving scenario.
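
A minimal sketch of the augmentation idea: take one logged rare scenario and perturb it into many training variants. The perturbation scheme below (position jitter plus speed rescaling) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_scenario(agent_trajectories, n_variants=50,
                     pos_noise=0.5, speed_scale=0.15):
    """Multiply one logged rare event into many perturbed training instances.

    agent_trajectories: (A, T, 2) positions of A agents over T steps.
    Each variant jitters positions and rescales speeds slightly, so the
    model sees the same challenging scenario under varied conditions."""
    base = np.asarray(agent_trajectories, dtype=float)
    variants = []
    for _ in range(n_variants):
        scale = 1.0 + rng.normal(0.0, speed_scale)   # speed up / slow down
        start = base[:, :1, :]
        warped = start + (base - start) * scale      # rescale motion
        warped += rng.normal(0.0, pos_noise, size=(base.shape[0], 1, 2))
        variants.append(warped)
    return variants
```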

Waymo’s end-to-end architecture search, or “ML-factory”

Simulation also allows learning models to be trained and evaluated more rapidly and thoroughly. The effectiveness of different ML model architectures varies widely by problem and can be difficult to infer prior to observing results. Simulation, in combination with robust architecture search, provides the ability to explore many model architectures and select among them based on empirical results.

Control

The controller’s objective is to generate acceleration/deceleration and steering-angle commands to follow the chosen trajectory supplied by the planning module. These commands are passed to the actuation module, which directly controls the mechanics of the vehicle. Given the noisy nature of real-world driving, control executes iteratively at a rapid pace, utilizing feedback to quickly adapt commands.
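
As a concrete sketch of such feedback control, here is a PID speed controller paired with a simple proportional steering law. The gains are illustrative assumptions; as noted below, production stacks increasingly favor optimization-based control instead:

```python
class PID:
    """Classic PID feedback controller for one command channel."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_error = 0.0, None

    def step(self, error, dt):
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# Illustrative gains; real values are tuned per vehicle.
speed_pid = PID(kp=0.8, ki=0.05, kd=0.1)

def control_step(target_speed, speed, cross_track_error, dt=0.02):
    """One rapid feedback iteration: acceleration + steering commands."""
    accel_cmd = speed_pid.step(target_speed - speed, dt)
    steer_cmd = -0.3 * cross_track_error   # proportional steering (assumed gain)
    return accel_cmd, steer_cmd
```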

Traditional control methods typically follow a reactive methodology, in which actions are determined mostly from past error (a retroactive perspective). Most modern AV control methods formulate the control task as an optimization problem, which places more emphasis on proactive control. Further, the most common modern approaches are model-based, either heuristic-based or learned.

Hopefully this article was useful to you! Please leave a comment if you think I left something important out, find an error/inaccuracy, or have any other feedback.

Youtube: https://www.youtube.com/@aiwithjustin2897

LinkedIn: https://www.linkedin.com/in/justin-milner-b190467b/

