The ML Production Readiness of Tesla’s Autopilot

Read this post on my blog.

Disclaimer: I previously interned at Tesla on things unrelated to Autopilot, and I now work at Google, where I am compensated in part through Alphabet stock. I do not own any Tesla stock. Below are my personal views, not those of either company.

Many companies are scrambling to deploy ML and “AI” in their products, either to be genuinely innovative or at least to appear so. Like most other software, ML models are prone to bugginess and failure, although oftentimes in new and unexpected ways. In conventional software development, we test code to preemptively catch the kind of faults one might encounter in production and give some assurances of correctness. In ML model development, the best ways to test ML for production readiness are still fairly new and are often domain-specific.

More worryingly, very few companies test their ML models and code at all — simply “doing ML” is seen as enough. In conventional software development, having bugs in production services can lead to a negative user experience and lost revenue. Since ML models are being put to work widely in the real world and the failures are often more subtle, the overall effects can be more harmful. For an e-commerce company, a buggy model can lead to poor product recommendations. For a self-driving car company, a buggy model can lead to people dying.

The goal of this post is to assess how current ML production best practices might apply to the development of a self-driving car, and where Tesla’s efforts in particular may be falling short. This isn’t mean to be a Tesla “hit piece” — Tesla is just one of many companies deploying self-driving cars with a cavalier attitude towards thoroughness and safety. In particular, they have the largest deployed fleet of semi-autonomous vehicles, and don’t mind using the words “beta” and “self-driving car software” in the same sentence.

The analysis below is based on the paper “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction” and Andrej Karpathy’s recent talk, “Building the Software 2.0 Stack”. I recommend reading the entire paper and watching the entire talk if you have the time. The paper gives concrete steps a company can take to test ML models for production readiness, and the talk gives an honest look at some of the data-related challenges around building self-driving cars.

Data

Feature expectations are captured in a schema.
All features are beneficial.
No feature’s cost is too much.
Features adhere to meta-level requirements.
The data pipeline has appropriate privacy controls.
New features can be added quickly.
All input feature code is tested.

Karpathy’s talk focuses almost entirely on the collection and labelling of data as the main problem behind his team’s work. His reasoning is that where there were once engineers writing new algorithms for each task, there are now labellers generating high-quality data for predefined algorithms to learn each task from, so most ML problems are really just data problems.

Breck’s paper focuses on the practical aspects of testing data pipelines that feed models. These are standard software engineering best practices applied to data-ingestion techniques. We can’t assume what Tesla’s internal software engineering practices are like, but we can assert that any serious self-driving car program must have a robust data management operation, or it will not succeed.

Privacy is the only data principle that stands out as more than a mere technical detail. The idea of privacy controls applied to self-driving car data is a nascent area of study, though it has some precedent in Google’s Street View imagery. Some startups will gladly pay you in exchange for your driving data, but Tesla gets it for free from their fleet. They are at the very least asking for users’ permission to record video from their cars, though we can’t know whether or not the internal privacy controls on that data are satisfactory, and how they may or may not comply with GDPR. The privacy of those being recorded in public also comes into question — in this video from Cruise, for example, people’s faces are recorded and displayed without any anonymization.

Model Development

Model specs are reviewed and submitted.
Offline and online metrics correlate.
All hyperparameters have been tuned.
The impact of model staleness is known.
A simpler model is not better.
Model quality is sufficient on important data slices.
The model is tested for considerations of inclusion.

This section focuses on the process of model development itself. Though the Tesla Autopilot team likely spends a significant amount of effort on building and testing various models, Karpathy’s talk frames modeling as essentially a solved problem when compared to data collection. Indeed, current Autopilot versions appear to be using fairly standard architectures.

Still, model development has its nuances. Inference must be fast, so tradeoffs between performance and latency must be taken into account. Hyperparameter (or even architecture) tuning must be done to eek out any last remaining gains in performance on the desired data.

Model staleness in the case of self-driving cars is an interesting one. Consider, for example, the changing of the seasons. A model whose primary input is visual data would perform slightly differently in different weather conditions during different parts of the year, depending on the data coverage of those conditions. Since Tesla is a global company, ideally they would have coverage of all conditions in different areas and climates, and models would be tested on all available data.

The same ideas can be applied to inclusion — which types of roads and neighborhoods does Tesla have good training data for? Karpathy’s talk gets into the nuances of stop signs, road markers, inclines, and traffic lights in various parts of the world, so it is reasonable to expect that they’re testing their models on edge cases as much as possible and filling in the gaps in their datasets as soon as they appear.

Infrastructure

Training is reproducible.
Model specs are unit tested.
The ML pipeline is Integration tested.
Model quality is validated before serving.
The model is debuggable.
Models are canaried before serving.
Serving models can be rolled back.

Infrastructure tests are particularly interesting when applied to self-driving cars. In contrast to more conventional ML architectures, which are often served from centralized servers in data centers, self-driving cars do almost everything on the edge, in each individual car itself. As a result, a self-driving vehicle should have both hardware and software redundancy to be tolerant against failures. Tesla’s current lineup lacks hardware redundancy. Since software updates are delivered OTA, it is completely reasonable to expect that Tesla has mechanisms in place to allow Autopilot versions to be rolled back immediately in case there is an unexpected regression in performance, though such a rollback would still need to be triggered from some central source.

Tesla’s camera-centric hardware introduces its own set of problems. Considerations like white balance, color calibration, and various other sensor properties need to be consistent across data and devices, otherwise uncertainty can be introduced in both training and inference. Such differences can be especially pronounced in different lighting (daytime vs. nighttime) and weather conditions.

Additionally, simulation is an essential part of any robust self-driving car operation. Being able to test new iterations of models against all historical data is key to safe iteration that ensures there are no regressions for certain slices of the data. Simulation is especially important in such high-stakes scenarios when A/B testing in the real world isn’t an option. It’s unclear how much work Tesla is doing in the simulation space, though it’s likely not enough if we’re judging by their job postings.

Monitoring

Dependency changes result in notification.
Data invariants hold for inputs.
Training and serving are not skewed.
Models are not too stale.
Models are numerically stable.
Computing performance has not regressed.
Prediction quality has not regressed.

Monitoring a live system is especially important in ensuring that performance remains reliable as the system itself or the world around it change. In the case of Tesla’s vehicles, the hardware’s health would need to be monitored to detect any issues with cameras or other sensors, as well as the hardware that runs inference, as the components age. Any issues with other fundamental features of the car’s operating system can all affect the reliability of the parts of the system running model inference.

Model staleness that isn’t addressed through a varied enough dataset (for example, exceptionally smokey skies during a wildfire) would need to be addressed through regular OTA updates, which comes with its own set of infrastructure challenges. Again, because these systems are being run on the edge, Tesla has more control while creating and improving the model, but must provide failsafe guarantees when the model is deployed and out of their immediate control.

Conclusion

These are only a few of the most obvious concerns a company such as Tesla might encounter when building a reliable self-driving car, as viewed through the lens of building production ML systems for lower-stakes consumer software. Surely the Autopilot team has considered each of these scenarios, but considering them from the perspective of a battle-tested production ML checklist offers a helpful way of framing problems in better-understood domains to anticipate problems in this new one. From the simple analysis above, it seems that Tesla has a significant set of challenges to address before it can claim to have safe and reliable self-driving capabilities in its vehicles.