By: George Hansel, Hardware Engineer, Lyft Level 5
Level 4 and 5 autonomous vehicles (AVs) must be designed to have appropriate levels of fault tolerance in both the hardware and software portions of the computational system. There are many ways to increase the fault tolerance of a complex system, but most involve building in redundancy, which means adding in extra hardware and software that can assist the primary system by taking over some tasks if the primary system suffers a fault.
Another differentiator between Level 4/5 AVs and traditional cars is their need to make plans for how to move through the world. Consequently, the perception, planning, and controls software can require computational power comparable to or exceeding the amount of power allocated to propulsion in traditional cars. Power generation and consumption in a car is limited by physics as much as economics: one cannot dissipate the megawatts of power associated with the most performant contemporary AIs in a small vehicle of reasonable characteristics and endurance. We don’t need to know exactly what the power constraint is to know that it exists. With that constraint in mind, it’s crucial to design the computational stack and software to use the available power efficiently.
In this blog post, we’ll talk through a simple and hypothetical model relating power consumption to the rate of two different types of faults that may occur. We’ll use this model to illustrate that expanding the design space of a fault-tolerance scheme from plain replication to an alternate structure of computation called an interactive proof system can decrease overall risk.
Why and What For
In the design of control or decision systems, there are a variety of different ways of answering the question “what’s the appropriate type and degree of redundancy?” To an extent, this diversity reflects the gamut of reasons that those systems incorporate redundancy — that is, tolerance to faults through some kind of multiplicity — including:
- to comply with regulation, as in 14 CFR 25.671 for aircraft that can be used in airline services
- to minimize the severity of failures: retransmissions in the Transmission Control Protocol that served this web page to you are a form of redundancy over time; they reflect genuine degradation of the channel, but the resulting loss of service is gradual
- to minimize the probability of failures, as in the error-correcting codes used in telecommunications, where we want no degradation of service up to a certain frequency of errors
- to minimize some type of risk metric, like the downtime of a datacenter or payment processing service
Classic redundant control systems collect replicated sensor data, feed it to three or more synchronized estimation/control units, then use a collection of “arbiter” systems to tally votes for what a given actuator physically executes. In some cases, the eventual voting is actually mechanical, with only small amounts of state electronically voted. This is the state of the art for systems where the only or predominant source of failure is a fault in hardware or software (i.e., the hardware or software failed to meet its specification). It is suitable for such systems because these voting schemes are “Byzantine fault tolerant”: they are capable of suppressing truly arbitrary hardware or software faults, up to the completeness and correctness of the overall specification. Such systems are hard to design correctly: voting can’t suppress “systemic faults” associated with the specification or decision being inherently and commonly wrong, and increased complexity of the specification renders the determination of its correctness more difficult.
Shown below is an example of such a quadruple-modular-redundant system, from the Space Shuttle Avionics System description.
In such systems, “correctness” of the final output in the presence of a bounded number of faults can be defined only as “interactive consistency” between the independent compute units: whether they eventually agree. This is an extremely general definition of correctness, and it comes at a significant cost in power. For instance, if we want to reliably compute the factorization of an integer in the presence of a fault, we don’t need three votes to an arbiter, and thus three expensive factorizations, to determine which is correct. It’s sufficient and much less expensive to multiply the proposed factors to verify the factorization.
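The factorization example can be sketched in a few lines of Python. This is a minimal illustration, not a production scheme: the function names are ours, trial division stands in for any expensive factorization routine, and the check confirms only that the proposed factors are nontrivial and multiply back to the input (it does not test primality).

```python
from math import prod

def factorize(n: int) -> list[int]:
    """Expensive step: trial division, on the order of sqrt(n) divisions."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def verify_factorization(n: int, factors: list[int]) -> bool:
    """Cheap step: one pass of multiplications, no search required."""
    return len(factors) > 0 and all(f > 1 for f in factors) and prod(factors) == n

n = 600851475143
claimed = factorize(n)

# Simulate a single bit flip in the last reported factor.
corrupted = claimed[:-1] + [claimed[-1] ^ 1]

print(verify_factorization(n, claimed))    # result from a healthy run
print(verify_factorization(n, corrupted))  # result after the injected fault
```

One factorization plus one cheap check replaces three factorizations and a vote, and the check catches the injected fault without knowing where it came from.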
The Model and the Allocation Problem
One way to frame the design of a system is to formulate an optimization problem. Given appropriate constraints, what design minimizes the number of failures per hour/mile? To do this, we need to express the model and the constraints in a numerically convenient way. Take a look at a particularly simple model:
Consider all faults as independently originating from “perception, planning, control” problems (pink) or from problems addressable by hardware redundancy (green). The pink and green curves are fictional but qualitatively descriptive — a curve of this nature could be the output of a systems engineering exercise or an empirical study. The important assumptions embedded in these two relationships are:
- Perception, planning, or control insufficiencies are likely to be responsible for more faults than plain hardware faults.
- Additional power to each function monotonically reduces, with diminishing returns, the frequency of its faults. This is the simplest behavior we could choose.
- Increasing sophistication of each strategy can be thought of as moving its respective line down or to the left.
While we’ve referred to the two categories of faults by the context in which each usually occurs, the actual differentiating characteristic is whether or not replications of each process have correlated outcomes. For example, bit errors in a computer memory are generally independent, whereas classifications of a particular image by a particular inference engine are for the most part highly correlated. One solution to bit errors is to add replicated bits, whereas decreasing the probability of misclassification can require adding depth, width, or connectivity to a neural net.
Consider a “failure” to be associated with either type of fault, so that the total failure frequency is the sum of the two fault rates. The allocation that minimizes the total failure rate is the one that equalizes the marginal utility of power under the constraint: set the derivatives of the pink and green curves with respect to power equal to each other, and use the remaining degree of freedom to saturate the constraint.
So for these hypothetically-asserted fault curves, an optimal allocation yields a mere 150 W toward dealing with hardware faults and the remainder to improving the capability of the perception and control system, a difference of orders of magnitude.
This is, like many models, approximate. Because what the hardware does is implement the perception, planning, and control, an unmodeled dependence exists between the pink and green curves. What exactly this dependence is constitutes part of the design question.
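The equal-marginal-utility condition can be sketched numerically. Everything specific here is an assumption made for illustration: the power-law form of the curves, their scales and exponents, and the 1000 W budget are ours, so the numbers will not reproduce the allocation quoted above. The sketch bisects on the power given to perception, planning, and control until the slopes of the two curves match, with the rest of the budget going to hardware redundancy.

```python
def power_law(scale, exponent):
    """Hypothetical fault curve rate(p) = scale / p**exponent, plus its slope."""
    def rate(p):
        return scale / p ** exponent
    def slope(p):
        return -exponent * scale / p ** (exponent + 1.0)
    return rate, slope

def equal_marginal_split(slope_a, slope_b, budget, tol=1e-9):
    """Find p_a such that slope_a(p_a) == slope_b(budget - p_a) by bisection."""
    lo, hi = 1e-9 * budget, (1.0 - 1e-9) * budget
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slope_a(mid) < slope_b(budget - mid):
            lo = mid  # curve a is still steeper: more power to a pays off
        else:
            hi = mid
    return 0.5 * (lo + hi)

BUDGET_W = 1000.0                                  # assumed power constraint
pink_rate, pink_slope = power_law(1e3, 1.0)        # assumed decision-fault curve
green_rate, green_slope = power_law(1e-3, 1.0)     # assumed hardware-fault curve

p_pink = equal_marginal_split(pink_slope, green_slope, BUDGET_W)
p_green = BUDGET_W - p_pink                        # saturate the constraint

print(f"perception/planning/control: {p_pink:.3f} W")
print(f"hardware redundancy:         {p_green:.3f} W")
print(f"total fault rate:            {pink_rate(p_pink) + green_rate(p_green):.6f}")
```

Bisection works here because both assumed curves are convex and decreasing, so the difference between the two slopes changes sign exactly once across the budget.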
Consider a system where “redundancy” is only manifested as a multiplication by N of the underlying resource of the perception, planning, and control. Replication merges the green and pink lines, since only hardware faults are observable among synchronized, identical systems, and the aggregate fault rate at the power constraint is the same as the aggregate fault rate of the original system if the power constraint had been divided by N. This is an increase in the number of aggregate faults as a result of misallocation of power to the processes least likely to cause them.
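As a rough worked example of that penalty, assume (purely for illustration) a decision-fault curve of the form scale / power and a 1000 W budget; neither comes from the curves above. Under N-fold replication, each identical replica sees only 1/N of the budget, and identical replicas share the same decision faults, so the rate is evaluated at the divided budget.

```python
def decision_fault_rate(power_w, scale=1e3, exponent=1.0):
    """Hypothetical decision-fault curve: scale / power**exponent."""
    return scale / power_w ** exponent

def replicated_rate(budget_w, n_replicas):
    """Fault rate when the budget is split evenly across identical replicas."""
    return decision_fault_rate(budget_w / n_replicas)

BUDGET_W = 1000.0  # assumed power constraint
baseline = replicated_rate(BUDGET_W, 1)
for n in (1, 2, 3, 4):
    ratio = replicated_rate(BUDGET_W, n) / baseline
    print(f"N={n}: decision-fault rate is {ratio:.2f}x the unreplicated rate")
```

With this particular curve the rate grows linearly in N; a curve with stronger diminishing returns would be penalized less, and one with a steeper slope would be penalized more.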
Then how does one create a system where the amount of power consumed in order to address hardware faults is separable from the amount of power consumed to address decision faults in autonomy? In the words of many a computational aerodynamicist, one must exploit the structure of the problem. The structure here arises from the fact that the results of calculations underlying the perception, planning, and control functions have meaning: a specific and definitional criterion for correctness. For example:
- Trajectories and some control commands are solutions to optimization problems which have convergence and constraint satisfaction criteria.
- Perception and localization algorithms compute accelerations, distances, and road topologies that must be physical and consistent.
- Commands are generated from non-stale data, or are otherwise executed in a definite order.
Each of these processes produces a result that can be checked for its definitional correctness by a means independent of the process of computing the result in the first place. The parts of the result, or the computational byproducts accompanying the result, that may be used to verify that the calculation happened correctly may be termed the certificate. For gradient-based optimization problems, this could be a Jacobian that satisfies the optimality criterion. If independent hardware checks the certificate and finds it to be valid, we’ve excluded implementation faults just as exhaustively as random ones. Just as in voted systems, we still need to make sure the certificate represents the behaviorally correct property, and by allocating power effectively to computing and checking as separate concepts, we can express the highest possible behavioral performance.
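To make the idea of a certificate concrete, here is a minimal sketch for one assumed problem class: a small convex quadratic program with box constraints. The problem data, the projected-gradient solver, and the function names are all ours, chosen for illustration. The check is a first-order (KKT) optimality test plus a feasibility test, and it costs one matrix-vector product no matter how many iterations the solver needed.

```python
def gradient(Q, c, x):
    """Gradient of 0.5 * x'Qx - c'x, i.e. Qx - c."""
    return [sum(q * xj for q, xj in zip(row, x)) - ci for row, ci in zip(Q, c)]

def solve_box_qp(Q, c, lo, hi, step=0.1, iters=2000):
    """Untrusted, expensive solver: projected gradient descent."""
    x = [min(max(0.0, lo), hi)] * len(c)
    for _ in range(iters):
        g = gradient(Q, c, x)
        x = [min(max(xi - step * gi, lo), hi) for xi, gi in zip(x, g)]
    return x

def check_certificate(Q, c, lo, hi, x, tol=1e-6):
    """Cheap check: feasibility and first-order optimality of x."""
    for xi, gi in zip(x, gradient(Q, c, x)):
        if not (lo - tol <= xi <= hi + tol):
            return False              # bound constraint violated
        at_lo, at_hi = xi <= lo + tol, xi >= hi - tol
        if gi > tol and not at_lo:
            return False              # objective would drop by decreasing xi
        if gi < -tol and not at_hi:
            return False              # objective would drop by increasing xi
    return True

Q = [[2.0, 0.5], [0.5, 1.0]]          # assumed symmetric positive definite
c = [1.0, 3.0]
x = solve_box_qp(Q, c, 0.0, 1.0)
x_faulty = [x[0] + 0.1, x[1]]         # simulated fault in the solver output

print(check_certificate(Q, c, 0.0, 1.0, x))
print(check_certificate(Q, c, 0.0, 1.0, x_faulty))
```

Because the assumed problem is convex, passing this check implies the result is a global minimizer, so the checker never has to repeat or trust the solver’s iterations.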
By instead viewing the computation task from a certificate verification standpoint, we replicate the decision (the ‘convincing’ of a verifier), rather than the resource. That mitigates hardware failures as desired while consuming an amount of power closer to the optimal allocation reflecting the (for now) lower failure rates of hardware compared to perception, planning, and control strategies.
The effectiveness of this strategy relies on the fact that for a very wide class of problems it is possible to verify a solution’s correctness in asymptotically much less time than is required to compute the solution in the first place. The pairing of an extremely powerful but untrustworthy “prover” with a substantially less powerful but trustworthy “verifier” is called an interactive proof system.
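A minimal sketch of the prover/verifier pairing, under assumptions of our own: the task is solving a dense linear system, the prover is Gaussian elimination (cubic in the problem size), and the verifier is a single residual check (quadratic). This is only the simplest, single-message case of the idea, and what the system should do after a rejection is left out of scope; here the wrapper just returns None.

```python
def solve_linear(A, b):
    """Untrusted prover: Gaussian elimination with partial pivoting, O(n^3)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]     # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def verify_solution(A, b, x, tol=1e-9):
    """Trusted verifier: one matrix-vector product, O(n^2)."""
    return all(abs(sum(a * xi for a, xi in zip(row, x)) - bi) <= tol
               for row, bi in zip(A, b))

def checked_solve(A, b, prover=solve_linear):
    """Accept the prover's answer only if the verifier is convinced."""
    x = prover(A, b)
    return x if verify_solution(A, b, x) else None

def faulty_prover(A, b):
    """Prover with a simulated fault in its first output component."""
    x = solve_linear(A, b)
    x[0] += 1.0
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
print(checked_solve(A, b))
print(checked_solve(A, b, prover=faulty_prover))
```

The verifier needs no knowledge of how the prover works, which is what lets the prover run on powerful but less trusted hardware while the check runs on something small and dependable.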
If this post interested you and you’d like to work on problems involving power system design, multidisciplinary optimization, redundancy theory, or interactive proof systems as applied to problems in autonomy, please get in touch! We’re hiring, and we’d love to talk to you. Learn more at lyft.com/level5.