It’s the Data, Stupid!

Published in ManhattanVenturePartners, Jul 9, 2019

By: Santosh Rao, Head of Research at Manhattan Venture Partners

We could have autonomous vehicles (AVs) today, but they would not be safe. To have safe AVs, we need about 10 years of data gathering. AVs use a type of artificial intelligence (AI) algorithm called a neural network, and the main thing to know about neural networks is that they are among the most data-hungry algorithms out there. Neural networks help computers make statistical predictions about the future based on past experiences, and those past experiences are the data we supply.

One important phrase to highlight is “statistical predictions,” and as we all know, statistics cannot be correct 100% of the time. Therefore, it all comes down to what the chances of making a mistake are.

We see neural network applications every day, and they sometimes make mistakes. Depending on the application, those mistakes can be tolerated, like when Netflix suggests a movie that you end up disliking, but a mistake by an AV can be fatal.
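To make "the chances of making a mistake" concrete, here is a minimal sketch. It assumes every driving decision fails independently with the same small probability, which is an illustrative simplification and not a claim about any real AV system. The function name and the numbers are hypothetical.

```python
def prob_at_least_one_mistake(per_decision_error: float, decisions: int) -> float:
    """Chance of one or more mistakes across `decisions` independent decisions.

    Assumption: each decision has the same, independent error probability.
    """
    return 1.0 - (1.0 - per_decision_error) ** decisions


if __name__ == "__main__":
    # Hypothetical one-in-a-million error rate per decision.
    for n in (1_000, 1_000_000, 10_000_000):
        print(n, round(prob_at_least_one_mistake(1e-6, n), 4))
```

Even a tiny per-decision error rate adds up once the number of decisions is large, which is why a level of accuracy that is acceptable for movie recommendations is not acceptable on the road.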

Not Just Data but Type of Data

The more data we feed the algorithm, the greater its accuracy, but why is that? Something many people get wrong is that the car does not literally understand its surroundings (AI has not reached that point yet); instead, it takes inputs from sensors such as radars and LIDARs and makes statistical decisions about its surroundings.

But it is not as simple as providing just any data. Engineers have to decide what type of data will help the algorithm improve its accuracy. As obvious as it seems, more precise data produces more precise driving decisions, but it also takes the computer more time to process.

In our previous article “In the Hunt for The Holy Grail,” we mentioned three sources that generate data: cameras, radars and LIDARs. Since the data from cameras is the most precise, we should only use cameras, right? Not really. Exhibit 1 below illustrates the key aspects of the three sources of data.

Exhibit 1: Total Data Generated per Second — By Sensor Type

Source: MVR and *Stephan Heinrich (Waymo’s Systems Architect)*

An average AV (sensors vary by company) has to process 3.15 GB every second, and today’s computers struggle to keep up, so we need not only data but also computers fast enough to process it. Data from LIDARs and radars is relatively easy to process because it already arrives in a form the computer can use directly, whereas cameras collect pixels that must first be interpreted, which takes extra computing power and time. To get around this problem, companies are increasingly adopting LIDARs and radars rather than relying on cameras. Seen in that light, Tesla’s decision to depend only on cameras, despite their issues, seems unreasonable.
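For a sense of scale, the 3.15 GB per second figure can be turned into hourly and daily volumes with a few lines of arithmetic. This is a rough sketch that assumes decimal units (1 TB = 1,000 GB) and a constant data rate; the function name and the eight-hour driving day are illustrative assumptions.

```python
GB_PER_TB = 1000.0     # assumption: decimal units
RATE_GB_PER_S = 3.15   # figure quoted in the article


def data_volume_tb(seconds: float, rate_gb_per_s: float = RATE_GB_PER_S) -> float:
    """Terabytes generated over `seconds` of driving at a constant data rate."""
    return rate_gb_per_s * seconds / GB_PER_TB


if __name__ == "__main__":
    print(f"{data_volume_tb(3600):.2f} TB per hour of driving")
    print(f"{data_volume_tb(8 * 3600):.2f} TB per 8-hour driving day")
```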

Scope is As Much if not More Important than Volume of Data

One more factor to consider with statistical models is that the scope of data is as much if not more important than the volume of data. Massive amounts of data about the same highway or road will not improve the algorithm’s accuracy by much. The algorithm needs data from different situations since, as mentioned before, the algorithm makes decisions based on past experiences.
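The difference between scope and volume can be sketched with a toy coverage measure. The example below treats each logged drive as a single scenario label and asks what fraction of a list of scenarios has been seen at least once. The scenario names and the function are hypothetical, and real AV data is far richer than one label per drive.

```python
def scenario_coverage(drive_log, scenarios):
    """Fraction of `scenarios` that appear at least once in `drive_log`."""
    seen = set(drive_log) & set(scenarios)
    return len(seen) / len(scenarios)


# Hypothetical scenario list, for illustration only.
SCENARIOS = ["highway", "city_intersection", "school_zone", "pedestrian_crossing"]

highway_only = ["highway"] * 10_000  # large volume, narrow scope
varied = ["highway", "city_intersection", "school_zone"] * 10  # small volume, wider scope

if __name__ == "__main__":
    print("highway only:", scenario_coverage(highway_only, SCENARIOS))
    print("varied:", scenario_coverage(varied, SCENARIOS))
```

Ten thousand highway drives cover one scenario out of four, while thirty varied drives cover three, which is the sense in which scope matters as much as volume.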

An algorithm that has never seen a kid jumping on a pogo stick in the middle of the street may not respond correctly, and that is a risk we cannot take. The algorithm needs a large number of AVs on the roads (each with a safety engineer until its accuracy meets safety standards) to collect more data, but current regulations restrict the number of AVs allowed on the road.

Simulations are good, but cannot fully substitute for real-life data in moving closer to foolproof, safe autonomous cars

Collecting data for multiple “scenarios” is a tedious and laborious task that will take years, since there is an effectively “infinite” number of possible scenarios a car could encounter on the road. Driving applications that encounter the fewest “scenarios” are going to be implemented first. Trucks are ideally suited to be first, since driving on a highway is more monotonous than driving in a major city such as New York City. Accordingly, autonomous trucks will very likely be the first application where human safety drivers are not present.

While gathering data is critical, it is not going to be easy. At this point the number of AVs collecting data is minimal, given the restrictions in most cities. Instead, companies have to rely on simulating different driving situations and conditions. One major challenge is simulating human behavior, which, as it turns out, is not always predictable. Ultimately, simulations are good for improving the accuracy of algorithms, but cannot fully substitute for real-life data in moving closer to foolproof, safe autonomous cars.

Exhibit 2: The Logic Ends Up Being Pretty Simple

Given all the variables and challenges, how many more years before we see a mass roll-out of autonomous cars? From today’s vantage point, the best forecast by market experts is 10 years, and we concur with that projection.

It’s the Data, Stupid!
