Murthy Nukala
15 min read · Jun 20, 2016

A constellation of innovations is needed for Autonomous Driving…

As the Autonomous Vehicle (AV) industry gathers steam, a constellation of innovations is needed to fulfill its promise. There are a range of critical problems that need solving for the market to emerge and take shape. Success will require both cooperation and competition on a scale that has scarcely been seen before, as no single company will be able to solve all of these problems on its own.

This is Article 2 in the series on Autonomous Vehicles, covering the structure of technical problems that need to be solved and the big open questions that remain. You can find Article 1 here, which covered the strategic landscape and tensions within the industry.

Summary of key takeaways:

Sensors are in the middle of a Moore’s-law-like exponential curve of cost reduction and performance improvement, but absorbing the sheer range and scope of innovation will require modular product design on the part of automakers, a departure from their traditional approach.

‘Prior maps’ need to balance freshness, authoritativeness and comprehensiveness. None of the current approaches offer all three, but expect fleet and crowdsourcing based approaches to beat survey vehicle based approaches.

Registration of road objects will require a ‘family of classifiers’ approach. Estimates range from 50 to 200 classifiers, with each classifier attacking a ‘long-tailed’ distribution. This is very hard.

Most current approaches in the reasoning layer use some flavor of deep learning. These are promising directions, but there is great need for more diversity of reasoning and learning approaches to go ‘up the reasoning stack’.

Globally referenced HD maps will be needed for a ‘global brain’ approach where each vehicle learns from many others.

But before we get started, let us try to define what ‘autonomy’ means.

Definition: Levels of Automation

The NHTSA has defined five levels of autonomous driving, shown in Figure 1 below.

Figure 1: The Five levels of Autonomous Driving. Source: NHTSA.

The first three levels are ‘driver assist’ and the last two require increasing degrees of autonomy, with Level 3 requiring the driver to stand by to recover control. Most industry observers believe this will be the hardest level, due to uncertainty about the alertness of the driver and the fact that humans need some time to adjust to an unfamiliar situation.

A notable aspect of Level 4 automation is the requirement for redundancy of the core driving systems, where a backup system takes over in case the primary system fails.

Complex systems where safety is of paramount interest will prioritize redundancy over efficiency, which in turn will have a significant impact on strategic dynamics. There will always be an appetite for experimentation with diverse solutions, which will reduce the likelihood of lock-in for any single solution and dampen the dynamics that typically lead to winner-take-all scenarios.

The Big Picture

A simplified picture of the processing loop for autonomous vehicles (AVs) is shown in Figure 2. It involves integrating data from multiple sensors (a process called sensor fusion) around 10 times per second to create a single ‘scene’ of the AV’s surroundings.

Figure 2: The big picture for Autonomous Driving

Sensor Fusion, Prior Maps and Scene Creation form a tightly linked loop in which the AV perceives what is around it, compares that against prior maps to localize itself to within 10 centimeters, and then registers the scene back to the prior map. This dynamic drives the insight that map creation and AV driving are two sides of the same coin.

Once the ‘scene’ is created, the AV needs to interpret the scene, plot its next set of moves and take action. The output control space is actually quite simple — comprising accelerator and brake settings, and steering wheel angle. Scene interpretation and path planning can get quite complex as the AV needs to interact with road objects, pedestrians and other automobiles.
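To make the loop concrete, here is a minimal, hypothetical sketch of one pass through it in Python. None of the function names refer to a real AV stack; they are placeholders for the sensor fusion, localization, scene interpretation and path planning stages, and the Control dataclass simply captures how small the output space is.

```python
from dataclasses import dataclass
import time

@dataclass
class Control:
    """The entire output control space: accelerator, brake, steering angle."""
    throttle: float       # 0.0 (off) to 1.0 (full)
    brake: float          # 0.0 (off) to 1.0 (full)
    steering_rad: float   # steering wheel angle in radians

def drive_loop(read_sensors, fuse, localize, interpret, plan, apply_control, hz=10.0):
    """Shape of the AV processing loop, repeated roughly 10 times per second.

    All arguments are caller-supplied callables; `plan` is assumed to return
    a Control. Nothing here is a real AV API -- it only shows how the stages connect.
    """
    period = 1.0 / hz
    while True:
        start = time.monotonic()
        readings = read_sensors()           # raw LIDAR / camera / radar / GPS / IMU data
        scene = fuse(readings)              # sensor fusion into a single 'scene'
        pose = localize(scene)              # compare against the prior map, ~10 cm accuracy
        objects = interpret(scene, pose)    # identify relevant road objects
        apply_control(plan(objects, pose))  # accelerator, brake, steering angle
        time.sleep(max(0.0, period - (time.monotonic() - start)))
```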

What is sensor fusion, and why is it important?

Sensor fusion refers to the technique of ‘fusing’ or combining the outputs of multiple individual sensors to gain a more accurate environmental picture.

It is a powerful technique based on a simple yet profound truth: when two measurements of the same quantity have uncorrelated errors, the fused estimate they produce has a variance significantly lower than either individual variance.

In Figure 3 below, fusing the pink curve and the green curve produces the blue curve, whose variance is much lower than that of either input.

Figure 3: Foundational principles of sensor fusion…

Source: http://www.bzarg.com/p/how-a-kalman-filter-works-in-pictures/?imm_mid=0d6ffb

In simple terms, these are mathematical forms of triangulation and smoothing, and they work like magic in error-prone, time-evolving systems. In the context of AVs, each sensor is independent and their errors are largely uncorrelated, so sensor fusion produces a more accurate result than any individual sensor can achieve.

These simple insights have been developed into a vast body of work on Kalman filters, with ever more exotic variants (such as the Extended Kalman Filter) devised to solve practical problems.

Kalman filters and their derivatives have had an immense impact; it is not an exaggeration to claim they were central to the success of the space program.
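Here is a minimal sketch of the underlying fusion step, assuming two independent one-dimensional measurements of the same quantity with Gaussian errors. This is the measurement-update idea that Kalman filters build on; the numbers are invented purely for illustration.

```python
def fuse(mean_a, var_a, mean_b, var_b):
    """Fuse two independent Gaussian estimates of the same quantity.

    The fused variance satisfies 1/var = 1/var_a + 1/var_b, so it is always
    smaller than either input variance.
    """
    var = 1.0 / (1.0 / var_a + 1.0 / var_b)
    mean = var * (mean_a / var_a + mean_b / var_b)
    return mean, var

# Illustrative example: a coarse GPS-like fix fused with a sharper LIDAR-map match.
gps_estimate = (100.0, 25.0)    # mean position (m), variance (m^2): ~5 m std dev
lidar_estimate = (101.2, 0.04)  # variance 0.04 m^2: ~0.2 m std dev

mean, var = fuse(*gps_estimate, *lidar_estimate)
print(mean, var ** 0.5)         # fused standard deviation drops below 0.2 m
```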

Deep dive into Autonomous Vehicle Sensors

Let’s get specific. An overview of the diverse range of sensors and the sensor fusion process within Autonomous vehicles is depicted in Figure 4.

Figure 4: Deep dive into sensor fusion for AV’s

Simplistically, we can divide the sensors into ‘Perception’ sensors — LIDAR, Cameras, Radar and Ultrasound; and ‘Localization’ sensors — GPS, Inertial Measurement Units and Wheel Odometry.

Each sensor has its strengths and weaknesses. For example, GPS is accurate to within 5 to 10 meters, updates its estimates about once per second, and does not work well in urban canyons, in tunnels and under thick foliage. IMUs compensate by working well in urban canyons and tunnels and by updating at 40 to 50 Hz, but their errors accumulate and cause drift within a few minutes. It is only by combining GPS, IMU, wheel odometry and prior maps that localization to within 10 cm or so is achievable.
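A highly simplified sketch of why the combination works, assuming a 1 Hz GPS fix and a 50 Hz IMU: the vehicle dead-reckons on the IMU between fixes, and each GPS update pulls the accumulated drift back toward truth. The fixed blending weight is an illustrative shortcut; a real system would weight the blend by the relative uncertainties, which is exactly what a Kalman filter does.

```python
def dead_reckon(position, velocity, accel, dt):
    """Integrate an IMU acceleration reading to propagate position between GPS fixes."""
    velocity = velocity + accel * dt
    position = position + velocity * dt
    return position, velocity

def localize_one_second(position, velocity, imu_accels, gps_fix, gps_weight=0.3):
    """Run 50 IMU updates (dt = 0.02 s), then blend in the once-per-second GPS fix."""
    for accel in imu_accels:                # 50 noisy accelerometer readings
        position, velocity = dead_reckon(position, velocity, accel, dt=0.02)
    # Pull the drifted dead-reckoned position back toward the (coarse) GPS fix.
    position = (1 - gps_weight) * position + gps_weight * gps_fix
    return position, velocity
```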

Most automakers plan to use both LIDAR and cameras, but Tesla has famously favored camera-based systems over LIDAR. The primary tradeoff seems to be quality of data (LIDAR produces highly accurate distance information for each object, and sees equally well in day and night) versus cost (cameras are extremely good and cheap compared to LIDAR). The pros and cons of LIDAR versus cameras are discussed in depth in this article.

Figure 5: Approximate costs of AV sensors.

Source: BCG Perspectives

Cost curves for sensors:

Sensors are deep in the middle of a Moore’s-law-like exponential curve, with performance increasing and cost declining on pretty much every dimension, as the promise of the ADAS market motivates intense research and innovation.

Some key trends in the cost and performance of sensors are:

LIDAR: The Velodyne 64-channel LIDAR used on the Google self-driving car famously cost about $80,000. Velodyne introduced its ‘Puck’ LIDAR at CES ’16 at a price of $8,000, and expects the cost of LIDAR to drop to about $200 in a year or two.

Quanergy in particular has an interesting innovation using solid state phased array systems with no moving parts. It is expected to cost $250 or so and will have a ~120 degree field of view (so 3–4 will be required per car). It is apparently shipping in ’17.

GPS Performance Improvements: The Global Positioning System is truly a wonder of the modern world. Conceived in the ’50s, developed in the ’80s and operational in the ’90s, its civilian signal was initially degraded (‘Selective Availability’) to reserve full accuracy for the military. This was reversed by presidential directive in 2000, leading to an explosion of devices and applications. With the advent of smartphones, GPS is a utility few of us can imagine life without.

A GPS signal has two components: a coded (C/A) signal chipped at 1.023 MHz and a carrier wave transmitted at 1575 MHz. The carrier wave has a wavelength of about 19 cm, while each chip of the coded signal spans roughly 300 meters, so the carrier offers about 1,500x the frequency and correspondingly finer resolving power.
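This gap in resolving power follows directly from the relationship wavelength = c / f; a quick check of the arithmetic:

```python
C = 299_792_458             # speed of light, m/s

code_chip_rate = 1.023e6    # C/A code chipping rate, Hz
carrier_freq = 1575.42e6    # L1 carrier frequency, Hz

print(C / code_chip_rate)   # ~293 m per code chip
print(C / carrier_freq)     # ~0.19 m carrier wavelength
```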

Mainline GPS uses the coded signal, and is accurate to about 5m to 10m. Efforts to make GPS more accurate involve using the carrier phase signal in addition to the coded signal, and correction of the major measurement errors.

The major sources of error include (a) atmospheric conditions in the ionosphere (charged particles speed up the carrier signal and slow down the code signal), (b) uncertainty in the precise location of satellites in their orbits (a few meters), (c) the quality of the antenna on the measuring device, (d) the quality of the clock on the receiving device and (e) the intrinsic resolving power of the GPS signal. Further, GPS systems have issues within dense city blocks due to ‘multi-path’ interference (the same signal reaching the receiver multiple times via reflection), within tunnels and under dense foliage.

To correct for these errors, industry sources report that some smartphones will include native error correction in their chipsets later this year, potentially leading to 1 m positional accuracy. Also, new GPS satellites (the Block IIIA generation) are due to go live in the next few years; their dual-frequency civilian signals can cancel out many of these errors, most notably the ionospheric delay.

Another set of methods, Differential GPS and RTK GPS, harnesses the GPS carrier wave together with corrections from a base station at a known location, and can reach accuracies of 10 cm or less. RTK and differential GPS receivers are also quite expensive (about $1,000), but these costs are dropping quickly too.

Figure 6: How RTK GPS works…
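The core idea of differential correction can be sketched in a few lines: a base station at a precisely surveyed location measures its own GPS error, and a nearby rover subtracts that same error from its fix, since the dominant error sources (ionosphere, satellite orbits and clocks) are shared by receivers in the same area. This is a deliberately simplified, code-phase-style sketch with invented numbers; real RTK additionally resolves carrier-phase ambiguities, which is where the centimeter-level accuracy comes from.

```python
def differential_correction(base_known, base_measured, rover_measured):
    """Apply a base station's measured GPS error to a nearby rover's fix.

    base_known     -- surveyed (true) position of the base station
    base_measured  -- position the base station's GPS receiver reports
    rover_measured -- position the rover's GPS receiver reports
    Positions are (east, north) tuples in meters; the correction is applied per axis.
    """
    correction = tuple(k - m for k, m in zip(base_known, base_measured))
    return tuple(r + c for r, c in zip(rover_measured, correction))

# Illustrative numbers: both receivers see a shared error of roughly 2 m per axis.
base_known = (1000.0, 2000.0)
base_measured = (1002.1, 2002.3)
rover_measured = (1502.0, 2502.4)
print(differential_correction(base_known, base_measured, rover_measured))
# -> approximately (1499.9, 2500.1): the shared error is largely removed
```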

This is a rich area of research. Some recent work from the University of Texas shows promising results using an external USB antenna to get centimeter-accurate location. There are also fascinating recent efforts involving ‘shadow mapping’: using 3D models of the city skyline, this technique exploits the fact that a particular satellite is obstructed from view to localize to within a few meters!

Cameras: While the cameras used in autonomous driving are improving quickly, the biggest innovation here is in post-processing. As custom ASICs are created to process images in real time, expect the ability to detect and classify objects from camera imagery to improve considerably.

Vendor strategies: As sensor costs drop, vendors such as Mobileye and the LIDAR makers are forward-integrating into post-processing and map creation, essentially creating labeled data sets and classifying the objects that are detected. This considerably increases the value of their sensors.

Sensor summary: the autonomous car market is fueling a fevered pace of innovation in the sensor space, driving a Moore’s law style reduction in cost and improvement in performance that is expected to continue for a long time.

While this is largely positive, this dynamic will be challenging for automakers who do not design their products in a modular fashion. When sensor cost and quality change rapidly, AV’s should incorporate the ability to upgrade these at a faster pace than the AV itself. Absorbing the rapid innovation into shipping product will challenge the traditional automaker design and production process.

Prior maps

We’ve discussed the need for Prior Maps in Article 1. The methods of creating prior maps are (a) survey vehicles, typically expensive, sensor-laden vehicles that map a given area in fine detail; (b) crowdsourcing, where people or fleets contribute data in the normal course of their driving; and (c) autonomous vehicles themselves, once they get on the road.

Prior maps balance three goals: freshness, authoritativeness and comprehensiveness.

Figure 7: The demands on Prior Maps

Today’s survey vehicles, fitted with top-of-the-line equipment, cost $150K to $300K. They drive around cities like Mountain View, Austin and Seattle to map them in detail, as described in this seminal paper from Thrun. They solve the authoritativeness problem, but are not comprehensive or fresh: Google has mapped only a few cities, and HERE and TomTom have mapped only a subset of the road network with a subset of the required data. Survey vehicles get the ball rolling, but are not the answer.

Dropping costs of sensors enables crowdsourcing models, and I expect to see new initiatives here.

Given the challenge of balancing these three dimensions, automated taxi services like the proposed service from Lyft and GM in San Francisco make a lot of sense, as they enable achieving all three within a contained region.

Summary: Survey vehicles are authoritative, but not fresh or comprehensive. Fleets are fresh and comprehensive, but not (yet) authoritative. Ditto with crowdsourcing. I see fleet and crowdsourcing models winning against survey vehicles.
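As a thought experiment, the three goals can be read as metadata that any prior-map tile would need to carry. The schema below is purely hypothetical and exists only to make the trade-off concrete: survey vehicles maximize the source and confidence fields, while fleets and crowdsourcing maximize freshness and coverage.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MapTile:
    """Hypothetical prior-map tile annotated with the three competing goals."""
    tile_id: str
    observed_objects: List[dict] = field(default_factory=list)  # lane markings, signs, curbs...
    last_updated_utc: str = ""    # freshness: when was this tile last observed?
    source: str = "crowd"         # authoritativeness: 'survey', 'fleet' or 'crowd'
    confidence: float = 0.5       # authoritativeness: how much the observations are trusted
    coverage_pct: float = 0.0     # comprehensiveness: fraction of the tile actually mapped
```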

Creating the Scene and a Family of Classifiers

Once the data is assembled from sensor fusion, and the AV localizes itself, the next step is to identify relevant objects in the scene and plan the next set of moves.

A taxonomy of road objects based on rate of change is shown in Figure 8.

Figure 8: Taxonomy of road objects…

Critically, each of these object types exhibits a ‘long-tailed’ distribution, with a few common representations and a large number of idiosyncratic ones. To further complicate matters, road objects have to be detected in all lighting conditions, all weather conditions, and even when they are partially obscured.

For example, Figure 9 shows some common road signs to be read and interpreted. While seemingly diverse, these are actually quite structured, although the signs themselves and their meanings change by city, state and country.

Figure 9: Road signs to be interpreted…

Here are some variants on traffic lights…

Figure 10: Some variants on traffic lights…

And some variants on stop signs…

Figure 11: Some variants on Stop signs…

These tend to be easy for humans, but really hard for machine learning classifiers. The effort required to identify and classify the long tail of a given object class becomes exponentially more difficult.

Figure 12: The long tail problem for classification…

So a ‘STOP sign’ classifier will require considerable effort to identify most stop signs in all lighting and weather conditions, and when the signs are partially obscured. Similarly, a ‘traffic light’ classifier will require a similar amount of effort, as will a ‘sidewalk’ classifier and a ‘lane marking’ classifier….

My estimate is that about 50 to 200 vertical classifiers will need to be built, each with this long-tailed, exponential-effort characteristic. I call this the ‘Family of Classifiers’ problem.

Figure 13: A family of deep vertical classifiers is required…

Summary: Each company needs to build between 50 and 200 vertical feature classifiers, with different rules and implications in various localities. The Family of Classifiers problem is one of the most daunting in the space, with no clear roadmap to solving it.
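A minimal sketch of what a ‘family of classifiers’ might look like structurally, assuming each road-object type gets its own detector trained on its own long-tailed data set. The classifier names and the registry pattern are illustrative only; as described above, the real difficulty lives inside each detector, not in this orchestration code.

```python
from typing import Callable, Dict, List

# Map each object type to a detector that returns detections for a fused scene.
# In practice each entry would be its own deep model with its own training data.
ClassifierFamily = Dict[str, Callable[[dict], List[dict]]]

def classify_scene(scene: dict, family: ClassifierFamily) -> List[dict]:
    """Run every vertical classifier over the scene and pool the detections."""
    detections = []
    for object_type, detect in family.items():
        for detection in detect(scene):
            detection["type"] = object_type
            detections.append(detection)
    return detections

# A tiny, hypothetical family; a production system might need 50 to 200 entries.
family: ClassifierFamily = {
    "stop_sign": lambda scene: [],       # placeholder detectors
    "traffic_light": lambda scene: [],
    "lane_marking": lambda scene: [],
    "pedestrian": lambda scene: [],
}
```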

‘Up the Reasoning Stack’

A layered approach for autonomous vehicles is proposed in Figure 14. At the bottom is the physical sensor layer, followed by a processing layer for each sensor. These outputs are then merged in the ‘sensor fusion’ layer and interpreted in the ‘localization’, ‘scene creation’ and ‘path planning’ layers.

Figure 14: Layers of abstraction for autonomous driving. The reasoning layer needs more diversity of approaches.

A critically important layer is likely to be the ‘reasoning’ layer, the brains that allow the automated driving system to navigate a diversity of situations, many of which it will encounter for the first time.

“A ball rolls onto the road. A typical human driver would anticipate a kid running into the street and slow down..”

“If you are in the lane over from a car that is parked behind a garbage truck, you would anticipate that the car is likely to pull into your lane..”

Most efforts today are focused on deep learning, which is in the midst of a renaissance thanks to the combination of software, chipsets and massive data. But even the most ardent fan would admit that these approaches do not work well when they have to learn unfamiliar situations from little data in real time.

Further, Level 4 automation as defined by the NHTSA will require redundancy of driving systems on board each vehicle, ideally built on different models so that all the driving systems are unlikely to fail simultaneously.

Summary: While deep learning is enjoying a Cambrian explosion, diverse reasoning and learning models that have different strengths need to be developed.
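One way to picture the combination of redundancy and diversity is an arbitration layer that consults several independently built reasoning modules and defers to the most conservative when they disagree. The sketch below is hypothetical, not a description of any production system; the module names and the ‘most cautious wins’ rule are assumptions made for illustration.

```python
from typing import Callable, List

# Each reasoning module maps a scene description to a proposed safe speed (m/s).
# For diversity, one might be a learned model and another a hand-written rule.
ReasoningModule = Callable[[dict], float]

def arbitrate(scene: dict, modules: List[ReasoningModule]) -> float:
    """Consult diverse, redundant reasoning modules and defer to the most cautious."""
    proposals = [module(scene) for module in modules]
    return min(proposals)   # the most conservative proposal wins

def learned_model(scene: dict) -> float:
    return scene.get("learned_speed_limit", 13.0)

def ball_in_road_rule(scene: dict) -> float:
    # 'A ball rolls onto the road': anticipate a child and slow right down.
    return 2.0 if scene.get("ball_detected") else 15.0

print(arbitrate({"ball_detected": True}, [learned_model, ball_in_road_rule]))  # -> 2.0
```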

Network architecture for a ‘Global brain’

Should the processing and decisioning happen on each AV, or in the network? Should the AV be largely self contained, or largely part of a fleet that learns together? How will the system behave under partially connected or disconnected conditions?

Simply put, value in the network layer and value on device are inversely related. If all data collection, processing and reasoning could be done on board each AV in real time, that would be an ideal solution, and there would be no need for a network layer. However, it is very hard to see how all of the processing required in Figure 14 can possibly be done on board in real time.

On the other hand, the AV will have to plan for partially connected or disconnected modes and needs the data, processing and computational ability to operate independently.

This would be more of a ‘thick client’ model, where the AV can operate independently for limited periods of time but synchronizes with the network to learn from other fleet vehicles (the global brain) and to re-initialize sensors to correct for drift. The trade-off of ‘on board’ versus ‘on network’ is likely to remain a rich area of discussion.
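A minimal sketch of the thick-client pattern under those assumptions: the vehicle keeps driving off its local state, records observations locally first, and opportunistically syncs with the fleet service when connectivity allows. The class, method names and queue structure are invented for illustration and do not correspond to any real system.

```python
import queue

class ThickClient:
    """Hypothetical on-vehicle sync agent for a fleet-learning ('global brain') model."""

    def __init__(self, local_map, network):
        self.local_map = local_map      # prior map cached on the vehicle
        self.outbox = queue.Queue()     # observations awaiting upload
        self.network = network          # fleet/map service, which may be unreachable

    def record_observation(self, observation):
        """Always record locally first, so the AV can run fully disconnected."""
        self.local_map.apply(observation)
        self.outbox.put(observation)

    def sync(self):
        """When connected, push local observations and pull fleet map updates."""
        if not self.network.is_connected():
            return                                   # keep driving on local state
        while not self.outbox.empty():
            self.network.upload(self.outbox.get())   # contribute to the shared map
        self.local_map.merge(self.network.download_updates(self.local_map.version))
```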

Figure 15 lays out one conception of the network architecture, which essentially views the network and prior maps as the conduit for coordination among multiple vehicles.

Figure 15: Network architecture for ‘Global Brain’

This model calls for each vehicle to be connected to a mapping system that then acts as the coordination layer, rather than the true peer-to-peer system envisaged by vehicle-to-vehicle (V2V) efforts.

But a ‘global brain’ strategy will require standards to integrate information into a cohesive system — most importantly for a global reference system for location, and a standard dictionary for road objects. Of these, the notion of Global Referencing is controversial and merits a short segue.

What is global referencing, and why is it important?

Global referencing refers to a global high definition map with absolute coordinates referenced to the center of the earth. Contrary to common belief, commonly accessed mobile maps don’t meet these exacting standards. George Musser from Scientific American has a great article explaining why… http://blogs.scientificamerican.com/critical-opalescence/what-happens-to-google-maps-when-tectonic-plates-move/.

The earth’s surface is constantly moving, due to tectonic plate movements and earthquakes. There are two primary geodetic reference systems, WGS 84 and NAD 83. WGS 84 is referenced to the earth’s center, so the tectonic plates ‘float’ on top of this global reference frame, and absolute GPS coordinates drift out of sync with ground truth over time.

Today’s mobile maps are mostly built from satellite imagery. These images undergo significant transformations and post processing to ‘drape’ the imagery on the surface of the earth — this process also introduces positional error of a few meters. Further errors are introduced by the assumption of the GPS Lat Long system that the earth is an ‘oblate ellipsoid’, which deviates slightly from reality, especially at higher latitudes.

These issues cause problems and involve major effort for any absolute referencing system. Several industry players including Google believe that local referencing (where a car knows the local area around the vehicle) is most important for autonomous driving, and that the benefits of global referencing are not worth the effort.

But I think this is misguided, and that the imperative for a ‘global brain’ learning model will amplify the need for a globally referenced map.

In order for multiple vehicles to contribute to a single map, each vehicle needs to know where it is with a high level of accuracy. In the absence of global referencing, new errors are introduced when comparing two data streams from vehicles whose positions are offset by unknown amounts. While it is challenging to execute, any global brain strategy will require global referencing.
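The scale of the problem is easy to illustrate with rough numbers: plate motion of a few centimeters per year eats through a 10 cm localization budget within a handful of years, and an unmodeled frame offset between two contributing vehicles shows up directly as disagreement about where a shared landmark sits. The velocities and offsets below are ballpark figures chosen for illustration, not survey-grade values.

```python
# Rough illustration: how quickly plate motion exceeds a 10 cm localization budget.
plate_velocity_cm_per_year = 4.0        # ballpark speed for a continental plate
localization_budget_cm = 10.0
print(localization_budget_cm / plate_velocity_cm_per_year)  # ~2.5 years before drift dominates

# Two vehicles report the same landmark in frames offset by an unknown ~30 cm.
vehicle_a_landmark = (100.00, 250.00)   # meters, in vehicle A's reference frame
vehicle_b_landmark = (100.25, 250.17)   # meters, in vehicle B's (offset) frame
disagreement_m = ((vehicle_b_landmark[0] - vehicle_a_landmark[0]) ** 2 +
                  (vehicle_b_landmark[1] - vehicle_a_landmark[1]) ** 2) ** 0.5
print(disagreement_m)                   # ~0.30 m, well above the 10 cm localization target
```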

This is a good time to repeat the Summary of key takeaways:

Sensors are in the middle of a Moore’s-law-like exponential curve of cost reduction and performance improvement, but absorbing the sheer range and scope of innovation will require modular product design on the part of automakers, a departure from their traditional approach.

‘Prior maps’ need to balance freshness, authoritativeness and comprehensiveness. None of the current approaches offer all three, but expect fleet and crowdsourcing based approaches to beat survey vehicle based approaches.

Registration of road objects will require a ‘family of classifiers’ approach. Estimates range from 50 to 200 classifiers, with each classifier attacking a long-tailed distribution. This is very hard.

Most approaches in the reasoning layer use some flavor of deep learning. These are promising directions, but more diversity of reasoning and learning approaches is needed to go ‘up the reasoning stack’.

Globally referenced HD maps will be needed for a ‘global brain’ approach where each vehicle learns from many others.

The innate structure of the problems being solved will defy the efforts of any single party or enterprise. From creating high-resolution maps of every road on earth, to detecting and classifying all relevant objects, to teaching a driving system to reason in unfamiliar situations, these problems will tax even the wealthiest and most innovative companies. No single party can do this alone, but the draw of one of the greatest prizes in human history will wring new strategic configurations out of large corporations.

Grab the popcorn.