The Five Pillars of Tesla’s Large-Scale Fleet Learning
How over half a million Teslas help train neural networks for autonomous driving
Here are the five pillars of Tesla’s large-scale fleet learning approach to autonomous driving, as I see them:
- Automatic curation of rare, diverse, and high-entropy training examples for fully supervised learning of computer vision tasks (i.e. when a human annotator labels images or videos). Curation techniques may include deep learning-based queries, automatic object discovery, human interventions (e.g. Autopilot disengagements), disagreements between a human-driven path and the Autopilot path planner, and disagreements between different neural networks.
- Weakly supervised learning of computer vision tasks (e.g. semantic segmentation of free space) using behavioural cues from human driving to automatically label images and videos. An upload may be triggered where there is a conflict between the vision system’s output and a behavioural cue.
- Self-supervised learning of computer vision tasks (e.g. depth mapping) or self-supervised pre-training for tasks that are fine-tuned with fully supervised learning. Similar curation techniques to (1) may be used. (Self-supervised learning on video is probably what Tesla’s Dojo computer is intended to accelerate.)
- Self-supervised learning for behaviour prediction tasks (e.g. predicting cut-ins). The future automatically labels the past. An upload may be triggered when the system’s prediction is wrong.
- Imitation learning for planning tasks (e.g. path prediction), probably combined with an explicit, hand-coded planner, and possibly used to bootstrap some form of real world reinforcement learning. Human interventions and human-Autopilot “disagreements” are possible curation techniques.
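Several of the curation techniques above boil down to a simple on-vehicle trigger: upload a clip when something disagrees with something else. Here is a minimal sketch of such a trigger, assuming disagreements can be reduced to scalar comparisons (the function name, signature, and thresholds are all hypothetical, not Tesla's):

```python
def should_upload(planner_path, human_path, net_a_score, net_b_score,
                  path_tol=0.5, score_tol=0.3):
    """Hypothetical on-vehicle curation trigger: flag a clip for upload when
    a disagreement suggests it is a rare or valuable training example."""
    # Disagreement between the human-driven path and the Autopilot planner's
    # predicted path, reduced here to a lateral offset in metres (an assumption).
    human_vs_planner = abs(planner_path - human_path) > path_tol
    # Disagreement between two different neural networks scoring the same frame.
    net_vs_net = abs(net_a_score - net_b_score) > score_tol
    return human_vs_planner or net_vs_net
```

The point of a trigger like this is that the fleet filters itself: only the high-entropy tail of driving data is uploaded and sent to human annotators, rather than millions of near-identical highway frames.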
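The "future automatically labels the past" idea in (4) can be sketched concretely: buffer recent feature snapshots, and when a cut-in is actually observed, label the buffered snapshots that preceded it as positives; snapshots that age out with no cut-in become negatives. This is my own illustrative sketch, not Tesla's implementation, and the class and parameter names are invented:

```python
from collections import deque

class CutInAutoLabeller:
    """Sketch of self-supervised labelling for behaviour prediction:
    the future outcome labels the past observations."""

    def __init__(self, horizon_s=3.0):
        self.horizon_s = horizon_s   # how far back a cut-in reaches to label
        self.buffer = deque()        # (timestamp, features) awaiting a label
        self.labelled = []           # (features, label) training examples

    def observe(self, timestamp, features, cut_in_happened):
        # Retire snapshots older than the horizon: no cut-in followed them
        # within the window, so they become negative examples.
        while self.buffer and timestamp - self.buffer[0][0] > self.horizon_s:
            _, old = self.buffer.popleft()
            self.labelled.append((old, 0))
        if cut_in_happened:
            # Everything still in the buffer preceded the cut-in: positives.
            while self.buffer:
                _, past = self.buffer.popleft()
                self.labelled.append((past, 1))
        self.buffer.append((timestamp, features))
```

No human ever touches these labels, which is what makes the approach scale with fleet size: every cut-in any car witnesses becomes a training example automatically. A wrong prediction at inference time can double as an upload trigger.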
All the pillars except (1) rely on automatic labelling: weakly supervised learning, self-supervised learning, imitation learning, and reinforcement learning all generate their training labels without a human annotator in the loop.
With ~1,000x more vehicles than Waymo and ~500x more vehicles than all U.S. autonomous vehicle fleets combined, Tesla can collect commensurately more automatically labelled training data for (2), (3), (4), and (5) and commensurately higher-quality manually labelled training data with (1).
For (2)–(5), if performance scales with data the way top-1 error (i.e. correct on the first guess) does on the ImageNet image classification dataset, Tesla should see roughly 10x lower error on these tasks than competitors. If performance scales like top-5 error (i.e. correct within the first five guesses), the gap is roughly 30x. For (1), I’m not yet aware of any research that attempts to quantify anything analogous. Scaling rates differ from task to task anyway, so these numbers should be taken with a grain of salt.
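The arithmetic behind those 10x and 30x figures: if error follows a power law in the amount of training data, error ∝ data^(−k), then a 1,000x data advantage multiplies through as 1000^k. The exponents below are assumed values in the range reported by ImageNet-style scaling studies, not measurements of Tesla's stack:

```python
# Assumed power-law exponents (error ~ data ** -k); illustrative, not Tesla's.
K_TOP1 = 0.35          # assumed top-1 error scaling exponent
K_TOP5 = 0.49          # assumed top-5 error scaling exponent
DATA_ADVANTAGE = 1000  # ~1,000x more vehicles than Waymo

top1_gain = DATA_ADVANTAGE ** K_TOP1   # roughly 10x lower top-1-style error
top5_gain = DATA_ADVANTAGE ** K_TOP5   # roughly 30x lower top-5-style error
```

The conclusion is sensitive to the exponent: halve k and the multiplier drops to the square root, which is another reason to hold these numbers loosely.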
Lidar is an ongoing topic of debate. Regardless of whether a company uses lidar, it needs accurate and robust computer vision. This is not just for redundancy, but also because there are certain tasks lidar can’t help with. For example, only cameras can determine whether a traffic light is green, yellow, or red. At any point in the future, Tesla can deploy a small fleet of test vehicles equipped with high-grade lidar, just as companies like Waymo and Cruise are doing today. These lidar-equipped Teslas would combine the benefits of lidar and large-scale fleet learning.
My prediction: in the next 1–3 years, Tesla will surprise many observers with how much progress it makes through its large-scale fleet learning approach. Many folks think Waymo is far ahead and Tesla has no chance of catching them. I think Tesla will pull ahead eventually.