Laying Out the Challenges in AI Safety

Team Five · Five Blog · Jun 4, 2020

Background

The development of new Automated Driving Systems (ADS) — in particular SAE Level 4 — is the great technological challenge of our time. The complexity of the environments where these systems will operate is astonishing. Tens of billions of dollars and tens of thousands of person-years have been dedicated to the problem since the DARPA grand challenges in 2004–2005.

As an industry, we’ve made some impressive leaps forward.

A few months ago, Five demonstrated that its reference ADS could operate safely on public roads in Europe’s only megacity, making over 100 public test drives in a single month across a complex 21km route in South London.

In the US, giants like Waymo, Uber ATG, and Cruise drive many thousands of miles a day autonomously. Waymo even offers rides to the public without safety drivers in a small area of Phoenix, AZ.

More broadly, modern passenger cars have felt the impact: Advanced Driver-Assistance Systems (ADAS) now give a level of automated assistance to highway driving, parking, and navigating traffic jams.

But although we’re making great strides, one issue continues to hold the industry back.

As these systems have started to move from proof of concept in the Bay Area towards a functionally and nominally safe, robust ADS, it has become crystal clear that there is a long way to go before the technology will be broadly deployed. The major outstanding variable is demonstrating the safety of AI systems. The size, difficulty and expense of this remaining challenge have been a major factor in the hype around self-driving fizzling out over the past year. Until the problem is tackled, the widespread commercialisation of ADS technology will remain hamstrung.

Myth-busting

In the early days, miles driven per disengagement was the measure of safety performance for an ADS. Furthermore, disengagements and ‘unexpected events’ were the primary mechanism for discovering and fixing problems with the stack.

The first myth to be shattered was the disengagement rate as the industry’s main metric of progress. It proved far too simplistic and inefficient a measure to drive the assurance efforts behind these types of systems.

It’s now clear to everyone that measuring progress simply as improvements in miles between disengagements hides the many failures that never bubble up to the level of a disengagement, whilst at the same time enforcing an extremely slow development cycle. That’s not to mention the need to physically drive hundreds of millions of miles to be statistically confident. That’s one heck of a lot of work for safety drivers, and unviable on many levels as the primary error-finding method.
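
To put a number on ‘statistically confident’, a standard back-of-the-envelope method (our sketch, not a prescribed industry requirement) is the ‘rule of three’: if you observe zero failures in n independent trials, the one-sided 95% upper confidence bound on the failure rate is roughly 3/n. Plugging in the better-than-human target of 1.5x10⁸ miles per serious injury cited later in this post:

```python
# Rule-of-three estimate: with zero failures observed in n miles, the 95%
# one-sided upper bound on the failure rate is roughly 3 / n. So to show a
# rate below the target, we need about 3 / target failure-free miles.

TARGET_RATE = 1 / 1.5e8  # better-than-human: one serious injury per 1.5e8 miles

miles_needed = 3 / TARGET_RATE
print(f"Failure-free miles needed for 95% confidence: {miles_needed:.2e}")
# => 4.50e+08 miles. And that is per software version, since every
# significant change to the stack resets the statistics.
```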

With miles per disengagement taking a back seat as the measure of a system’s safety, in comes the efficient and powerful use of simulation as the key to unlocking autonomy. In a simulated virtual world, an ADS can be given the ‘ride of its life’: packing more thrills and spills into a synthetic mile than you would see in thousands of real-world miles.

Simulation may hold the key but, at Five, we have come to understand that unlocking the value of simulation comes with its own set of challenges. Before we dig into them, let’s take a quick look at how an ADS is built:

Automated Driving 101

Almost all practical ADSs are built of three high-level components, which can be summarised as Sense, Plan & Act.

  • Sense develops an internal model of the world outside, including the location of the ADS within that world.
  • Plan develops a high-level trajectory for the ADS, based on its goals, the rules of the road, and an interpretation of that world model.
  • Act translates that plan into smooth and comfortable steering, acceleration, braking and signalling.

Sense is the component where ‘AI’ generally lives, often in the form of deep Convolutional Neural Networks (CNNs) interpreting LiDAR, radar and camera-based sensor outputs. The Plan component lies within the traditional domain of robotics, but with the added twist that it is required to plan under uncertainty, since the Sense component will never interpret the world with perfect accuracy.
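
A minimal sketch of this decomposition is below. The type and function names are ours and purely illustrative; this is not a description of Five’s stack:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class WorldModel:
    """Sense output: an internal model of the world, including ego location."""
    ego_pose: Tuple[float, float, float]   # x, y, heading
    objects: List[dict]                    # detected agents, with uncertainty

@dataclass
class Trajectory:
    """Plan output: a high-level trajectory for the ADS to follow."""
    waypoints: List[Tuple[float, float]]

def sense(lidar, radar, cameras) -> WorldModel:
    """Fuse raw sensor data into a world model (in practice, CNN-heavy)."""
    ...

def plan(world: WorldModel, goals, rules) -> Trajectory:
    """Plan a trajectory under uncertainty, respecting goals and road rules."""
    ...

def act(trajectory: Trajectory) -> dict:
    """Translate the plan into smooth steering, acceleration and braking."""
    ...

# The driving loop, repeated many times per second:
# controls = act(plan(sense(lidar, radar, cameras), goals, rules))
```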

The devil’s in the details

When it comes to simulation, we have a choice to make. Do we run a high fidelity simulation with full photo- and material-realistic rendering, or should we run a low fidelity simulation instead? Both come with significant implications for measuring safety.

Option 1: High Fidelity Simulation

If we opt for high fidelity simulation, we will need to digitise and render vast areas of the world in virtual space, not just for photo-realism but also for material-realism. Then, if we want to present the entire ADS with the inputs it would receive if tested in the real world, we will need accurate models of all the sensors in a vehicle to generate those inputs. Needless to say, this is a computationally (and financially) gargantuan task, and really hard to do faster than real time. But even if it were done, the developer would still be exposed to the domain adaptation problem.

CNNs turn out to be extremely sensitive to the statistics of their input domain. Constructing synthetic images that elicit the same outputs as real-world images remains a general problem in deep learning. Despite the emergence of promising approaches, the problem is still essentially unsolved, and this means a developer is left with a significant residual risk that potentially invalidates their testing.
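
One common way to quantify such a domain gap (our illustration; the post does not prescribe a metric) is to compare the statistics of a network’s feature activations on real versus synthetic images, for example with the Fréchet distance that underlies the FID score used for generative models:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_synth: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits to two sets of CNN feature
    vectors (one row per image). Larger values suggest a bigger domain gap."""
    mu_r, mu_s = feats_real.mean(axis=0), feats_synth.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_s = np.cov(feats_synth, rowvar=False)
    covmean = sqrtm(cov_r @ cov_s)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny
        covmean = covmean.real     # imaginary components; discard them
    return float(np.sum((mu_r - mu_s) ** 2)
                 + np.trace(cov_r + cov_s - 2 * covmean))

# Usage: gap = frechet_distance(features_from_real, features_from_synthetic)
```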

Option 2: Low Fidelity Simulation

The alternative of using low fidelity simulation requires us to generate only the feature representation at the interface between the Sense and Plan components; no need to render anything, no need to model any of the sensors. Often referred to as ‘headless’, low fidelity simulation removes the computationally expensive part of the simulation and, since Sense isn’t exercised at all, much of the expense of running an ADS.

Of course, the problems with this paradigm are twofold:

  • Firstly, since no testing of the Sense component is performed in simulation at all, separate real-world testing and driving of the Sense component is still needed in large quantities, meaning high expense, long timescales and low replicability.
  • Secondly, in presenting the Plan component with a perfect feature-space representation of a scenario in simulation (sometimes referred to as ground truth simulation), we are not in fact testing that component as it would be exercised in the real world, which always contains some perception errors. All perception systems are imperfect, and a planner needs to be robust in the presence of such errors (a toy version of this idea is sketched below). Again, an assurance engineer has a siren going off in their head: residual risk, residual risk, residual risk!

We call this the Simulation Fidelity Problem.
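
To make the second point concrete, here is a toy error model that degrades ground truth before it reaches the planner. Every rate in it is invented for illustration; building a statistically faithful version of such a model is precisely the hard part:

```python
import random

def perturb_ground_truth(objects, drop_rate=0.05, pos_sigma_m=0.3):
    """Toy perception-error model: feed the planner ground truth degraded by
    missed detections and position noise, rather than a perfect world state.
    All rates here are invented; a credible model must be fitted to, and
    validated against, the behaviour of the real perception stack."""
    noisy = []
    for obj in objects:
        if random.random() < drop_rate:   # simulate a missed detection
            continue
        x, y = obj["position"]
        noisy.append({**obj, "position": (random.gauss(x, pos_sigma_m),
                                          random.gauss(y, pos_sigma_m))})
    return noisy
```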

From Bad to Worse

As if this wasn’t bad enough, a second challenge arises when it comes to simulation realism: saliency. But before we talk about saliency, let’s just remind ourselves what we mean by an Operational Design Domain (ODD).

The Operational Design Domain (ODD)

One of the key concepts in autonomous driving is that of the ODD, which is defined as the “operating conditions under which a given ADS is specifically designed to function”. The ODD forms a high-level specification within which the ADS is designed to operate safely. This includes, for example, the types of environmental conditions it will see, the types of traffic and the roadway characteristics. We like to visualise the ODD as establishing a boundary that constrains the world of possibilities, where each possible scenario is a point within the ODD boundary against which we can measure coverage and performance, like this:

[Figure: scenarios visualised as dots inside the ODD boundary]

In simplistic terms, the objective is to demonstrate safety across all the dots (the scenarios) that fall within the ODD.
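
As a toy illustration of ‘scenarios as points inside an ODD boundary’ (the parameters and bounds below are invented, purely for flavour):

```python
# Toy ODD as a box in scenario-parameter space; each tested scenario is a
# point, and we can ask whether it falls inside the ODD boundary.
ODD = {"speed_limit_mph": (0, 30), "rain_mm_per_hr": (0, 8), "n_agents": (0, 20)}

def in_odd(scenario: dict) -> bool:
    """True if every parameter of the scenario lies inside the ODD bounds."""
    return all(lo <= scenario[k] <= hi for k, (lo, hi) in ODD.items())

tested = [{"speed_limit_mph": 20, "rain_mm_per_hr": 0, "n_agents": 4},
          {"speed_limit_mph": 20, "rain_mm_per_hr": 5, "n_agents": 12}]
print([in_odd(s) for s in tested])   # => [True, True]
```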

When all our testing was in the real world, we could more or less guarantee that it was relevant, since we were driving within our target ODD. And since all ADS developers seek better-than-human performance of, say, 1.5x10⁸ miles per serious injury, and we are certain to be far from covering that number of miles on any particular software version, any disengagement was almost certainly a failure of our system. That meant that the traffic situation which forced our disengagement was almost certain to occur again. It’s salient.

Now, if we push the majority of our testing into a simulated world, we can repeatedly reproduce that system error as a scenario — either hand-crafted or extracted from a real-world encounter — and fix our system.

That’s great, but we can still go much further. In simulation, we are able to create just about any traffic scenario that we could encounter in the real world. This is extremely powerful.

The problem comes, however, when we ask ourselves two questions:

  1. Is this scenario that I just created likely to happen in the real world (and if so, how likely)?
  2. If it did, would even a human driver have been able to negotiate it successfully? Should my better-than-human ADS have been able to negotiate it safely?

The problem we have just elucidated is essentially this: a single scenario in simulation gives a so-called micro-level assessment of safety, whereas what we seek is a macro-level assessment of safety. Knowing the difference, and knowing how to move between the two, is crucial. When we don’t, we face what we call the Simulation Saliency Problem.
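
One way to see the micro-to-macro step concretely: weight each scenario’s (micro-level) failure probability by how often that scenario occurs in the real-world ODD. The sketch below assumes those occurrence rates are known; estimating them is exactly what saliency modelling is about, and all the numbers are invented:

```python
def macro_failure_rate(scenarios):
    """Aggregate per-scenario (micro) results into a macro-level estimate by
    weighting each scenario's failure probability by how often it occurs in
    the real-world ODD.

    scenarios: list of (p_occurrence_per_mile, p_failure_given_scenario)."""
    return sum(p_occ * p_fail for p_occ, p_fail in scenarios)

# Invented numbers, for illustration only:
scenarios = [(1e-4, 0.02),   # common cut-in, rarely mishandled
             (1e-7, 0.40)]   # rare occlusion edge case, often mishandled
print(f"Estimated failures per mile: {macro_failure_rate(scenarios):.2e}")
```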

But It’s Just Residual Risk

It should be noted, of course, that one lens through which to view simulation-based testing is as a mechanism for reducing as much risk as possible before heading out onto the road. This is valid. But the level of on-road driving required by these naïve simulation-based testing approaches is too great: it leads not only to continued risk of further fatal accidents involving autonomous testing fleets, but also to a development cycle so long that it may very well put self-driving beyond the budget of even the largest players in this space.

Addressing the simulation fidelity problem and the simulation saliency problem is of the highest priority for our industry. That’s why we at Five have been working hard on two ideas that represent a step change in accelerating and facilitating the safe development of autonomous driving.

Two Powerful Ideas: A High Level Outline

At Five, we have developed two powerful ideas that go a long way towards developing a simulation-based development and assurance pipeline that is truly fit for purpose, and that addresses the biggest outstanding challenges in assurance today.

  1. PRISM™
    In response to the Simulation Fidelity Problem (outlined above), we’ve developed a perception verification and validation workflow that accurately models a perception system, warts and all. In use, it delivers a surrogate, statistical-twin model whose outputs are plausible and statistically indistinguishable from real-world outputs. That means the industry doesn’t have to tolerate the residual risk presented by the shortcomings of the existing high and low fidelity simulation options.
  2. Saliency models
    And, in response to the Simulation Saliency Problem, we take a similar approach. Our verification and validation workflow allows us to characterise the scenarios that we have seen in the real world and build models of realistic agent behaviours. This workflow then ensures that the scenarios and edge cases we test are directly connected to the ODD in which we intend to perform safely.

With both of these models, we give a dual purpose to our real-world driving: not only are we using it to test our ADS, but we are also directly validating the models of perception and saliency that we run in simulation.
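
To give a flavour of what ‘statistically indistinguishable’ can mean in practice (our illustration, not a description of PRISM internals), one could compare a sample of surrogate-model outputs against outputs logged from the real perception stack with a two-sample test:

```python
import numpy as np
from scipy.stats import ks_2samp

# Illustrative check: do surrogate-model position errors follow the same
# distribution as errors measured from the real perception stack?
rng = np.random.default_rng(0)
real_errors = rng.normal(0.0, 0.30, size=5_000)        # stand-in for logged data
surrogate_errors = rng.normal(0.0, 0.31, size=5_000)   # stand-in for model output

stat, p_value = ks_2samp(real_errors, surrogate_errors)
print(f"KS statistic={stat:.3f}, p={p_value:.3f}")
# A low p-value would flag a distribution mismatch, i.e. residual risk that
# the simulated perception does not behave like the real one.
```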

Stay Tuned for More Details

We are actively developing these ideas as part of a powerful end-to-end development and assurance platform that takes you from the formalisation of an ODD to the safety case that you need to develop to assure an Automated Driving System.

If you would like to find out more, stay tuned to the Five Blog.
