Tesla’s arc of AI progress

Tracing Karpathy’s talks through the years

Lance Martin
Sep 27, 2021

I’ve followed Karpathy’s work since taking CS231n back in 2015, but I’ve been admittedly skeptical of Tesla’s approach to camera-only autonomy. That said, I found Tesla’s recent AI day presentation impressive. I’ve watched Karpathy’s talks on their HydraNet architecture over the years and tried to put the recent AI day presentation in context with the prior talks (see table below).

Summary of Tesla’s HydraNet architecture and ML infrastructure since 2019. Image by author.

The challenge of multi-task learning

Computer vision research often focuses on specific tasks (e.g., 3D object detection, segmentation, tracking). Open datasets (e.g., Waymo, NuScenes, KITTI) host task-specific challenges and leaderboards. Top submissions to these challenges are often complex and specialized models for each task.

In industry applications, models often must perform many tasks. Karpathy gave a talk on this problem at ICML 2019: Tesla’s perception system must detect moving and static objects, signs, traffic lights, lane markings, etc. One model performing all of these tasks efficiently shares compute, but it also introduces a central challenge: how do you de-couple the tasks so that many developers can work on each one concurrently? In the worst case, improving one task (e.g., traffic lights) degrades performance on all the others, since the tasks compete for shared resources (e.g., model capacity, data during training, etc.).

Balancing performance across tasks can be hard. Image taken from Karpathy’s ICML 2019 talk.

HydraNet for multi-task learning

At PyTorch DevCon in 2019, Karpathy introduced HydraNet, Tesla’s approach to the multi-task learning problem described above. His talks at the Scaled Machine Learning Conference 2020, CVPR 2020, and CVPR 2021 emphasize a few points. HydraNet uses a shared backbone that amortizes compute across tasks, infrequent end-to-end training with feature caching, heads that can be fine-tuned concurrently from cached features, and an organization of tasks into terminals, which is likely informed by the literature on multi-task grouping (e.g., the papers here and here).

The HydraNet branching structure. Image taken from Karpathy’s CVPR 2021 talk.

This is laid out explicitly at AI Day, where the benefit of task-level decoupling is emphasized: tasks are fine-tuned on top of cached, shared video features.

The benefits of HydraNet. Image taken from Tesla’s AI day 2021.
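Tesla hasn’t released code, but the branching idea is straightforward to sketch in PyTorch: a shared backbone produces features that can be cached, and lightweight per-task heads are trained on top of them independently. Everything below (module names, sizes, task list) is an illustrative assumption, not Tesla’s implementation.

```python
import torch
import torch.nn as nn

class HydraNetSketch(nn.Module):
    """Minimal multi-task sketch: one shared backbone, one lightweight head per task."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # Shared trunk whose compute is amortized across all tasks
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # One small head per task (hypothetical task names)
        self.heads = nn.ModuleDict({
            "traffic_lights": nn.Conv2d(feat_dim, 4, kernel_size=1),
            "lane_markings": nn.Conv2d(feat_dim, 2, kernel_size=1),
            "objects": nn.Conv2d(feat_dim, 8, kernel_size=1),
        })

    def forward(self, images):
        feats = self.backbone(images)
        return {name: head(feats) for name, head in self.heads.items()}

model = HydraNetSketch()

# Task-level decoupling: compute backbone features once, cache them, and
# fine-tune a single head on the cached features without re-running the trunk.
with torch.no_grad():
    cached_feats = model.backbone(torch.randn(2, 3, 128, 128))
traffic_light_logits = model.heads["traffic_lights"](cached_feats)
```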

HydraNet evolution

The general branching structure of HydraNet appears unchanged since 2019. However, more details have been released about specific components over the years, culminating with the recent AI day presentation. I summarize them in the above table and use the following sections to discuss a few in more depth.

Compute

Tesla’s most recent training cluster has 720 nodes with 8 A100 GPUs per node, which is reported to be the 5th-largest supercomputer in the world. AI day also presented the Dojo chip. As noted here, it has higher power density than the Nvidia A100, higher transistor density than other high-performance chips (exceeded only by mobile chips and the Apple M1), and appears to use TSMC’s fan-out system-on-wafer packaging to densely pack chip tiles with favorable cooling and power consumption. In summary, they report 1.3x higher performance per watt with a 5x smaller footprint relative to Nvidia GPUs.

The Dojo training tile. Image taken from Tesla’s AI day 2021.
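For scale, the A100 cluster numbers above work out roughly as follows (a back-of-the-envelope calculation assuming Nvidia’s published ~312 TFLOPS of dense FP16/BF16 per A100; Tesla quotes the cluster totals directly):

```python
# Rough cluster arithmetic for the pre-Dojo A100 training cluster.
nodes = 720
gpus_per_node = 8
tflops_per_gpu = 312            # assumed: Nvidia's dense FP16/BF16 peak for one A100

total_gpus = nodes * gpus_per_node             # 5,760 A100s
peak_eflops = total_gpus * tflops_per_gpu / 1e6
print(total_gpus, round(peak_eflops, 2))       # 5760, ~1.8 EFLOPS peak
```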

Camera fusion

AI day gave more detail on the backbone, which can be broken into three pieces: per-camera feature extraction, camera fusion and view transform, and temporal feature caching. Per-camera feature extraction moved from a ResNet-50-like architecture to RegNets. Fusing features from Tesla’s eight cameras into a single representation benefits detection, especially when objects span camera stitching boundaries (below). The fusion happens in-network, using a rectification layer that translates each image into a common virtual camera, which overcomes the challenge of noisy camera calibration across their large vehicle fleet.

Truck detection that spans multiple cameras. Image taken from Tesla’s AI day 2021.
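Tesla didn’t spell out the rectification layer, but conceptually it amounts to warping each camera’s image (or feature map) into a shared “virtual” camera before fusion. A minimal sketch, assuming the virtual camera shares each physical camera’s optical center so a pure-rotation homography suffices; the function name and arguments are hypothetical:

```python
import numpy as np
import cv2

def rectify_to_virtual_camera(image, K_cam, R_cam_to_virtual, K_virtual, out_size):
    """Warp an image from its physical camera into a canonical virtual camera.

    With a shared optical center, the mapping is the pure-rotation homography
    H = K_virtual @ R @ inv(K_cam), which absorbs per-vehicle calibration noise
    into the per-camera rotation R.
    """
    H = K_virtual @ R_cam_to_virtual @ np.linalg.inv(K_cam)
    return cv2.warpPerspective(image, H, out_size)  # out_size = (width, height)
```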

View transform

Autonomous vehicles need to know the distance to obstacles in order to drive. Unlike LIDAR, cameras do not directly provide the distance (or “depth”) of each pixel. This is a well-studied challenge for camera-only perception. At CVPR 2020, Karpathy mentioned that Tesla used pseudo-lidar for this task: regress self-supervised depth for each image pixel, project it to a 3D pseudo-lidar point cloud, and then apply any mature detector that operates on LIDAR point clouds.

Generating a “pseudo-lidar” point cloud directly from camera images. Image taken from Wang et al. here.
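The back-projection step in the pseudo-lidar recipe is simple to write down: each pixel, together with its predicted depth and the camera intrinsics, is lifted to a 3D point. A minimal sketch (function name and shapes are my own):

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """Lift a per-pixel depth map (H, W) into an (H*W, 3) point cloud in the camera frame."""
    H, W = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) * depth / fx   # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```

Any error in the predicted depth slides the resulting point along its viewing ray, which is the source of the noise Karpathy highlights in the next figure.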

At AI day, Karpathy suggested that this approach is not good enough: lane lines regressed in the images (top) look noisy when projected into birds-eye-view (BEV, at bottom) because of error in the depth estimate of each pixel.

Lane lines look noisy when projected to BEV. Image taken from Tesla’s AI day 2021.

To address this limitation, Karpathy presented a transformer that maps pixels between the camera input and the BEV output, an interesting departure from convolutional architectures (CNNs). Transformers have clear appeal: they operate on sets, whereas CNNs require a fixed-grid (e.g., image) input, and they are general architectures, whereas CNNs have inductive bias baked in (e.g., we construct CNNs to generate multi-scale features across a gradually expanding receptive field). As Yannic Kilcher argues, this inductive bias may constrain performance in the large-data regime, and Tesla has a lot of data.

Transformers reached SOTA in NLP (e.g., machine translation). The vision transformer (ViT) showed that images can be fed into a transformer as a sequence of patches, just like a string of words. The embedding filters learned in ViT resemble the filters learned by CNNs, suggesting that this general architecture learns similar features to CNNs for image recognition.

Learned filters for embedding image patches in ViT. Image taken from Dosovitskiy et al. here.
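The patching step is just a strided linear projection; here is a minimal PyTorch sketch using the standard ViT-Base configuration (16x16 patches, 768-dim tokens), which is an assumption and not Tesla’s setup:

```python
import torch
import torch.nn as nn

# Split a 224x224 image into 16x16 patches and project each to a 768-dim token,
# turning the image into a "sequence" the transformer can consume like words.
patchify = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

image = torch.randn(1, 3, 224, 224)
tokens = patchify(image)                      # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)    # (1, 196, 768): 196 patch tokens
```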

Karpathy shows that transformers can be extended to this problem following the general recipe that works for NLP and image classification: encode the image input as a set of key-value pairs, then query those key-value pairs for each output pixel in BEV. The intuition is that the transformer can learn the mapping between input and output pixels given sufficient data.

Transformer for learning depth from camera images. Image taken from Tesla’s AI day 2021.
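This key/value/query framing maps directly onto standard cross-attention. Below is a minimal sketch in which each BEV cell carries a learned positional query and attends over flattened multi-camera image features; all shapes, sizes, and layer choices are illustrative assumptions, not Tesla’s network.

```python
import torch
import torch.nn as nn

class BEVCrossAttention(nn.Module):
    """Sketch of a camera-to-BEV transformer step: image features supply keys/values,
    learned per-cell BEV queries attend over them to produce BEV features."""
    def __init__(self, dim=256, num_heads=8, bev_cells=80 * 80):
        super().__init__()
        # One learned positional query per BEV cell (small illustrative grid)
        self.bev_queries = nn.Parameter(torch.randn(bev_cells, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats):
        # image_feats: (B, N, dim) flattened multi-camera feature map (keys and values)
        B = image_feats.shape[0]
        queries = self.bev_queries.unsqueeze(0).expand(B, -1, -1)    # (B, bev_cells, dim)
        bev_feats, _ = self.attn(queries, image_feats, image_feats)  # cross-attention
        return bev_feats   # reshape to a (B, dim, H, W) BEV grid downstream
```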

Temporal context

Since 2019, Karpathy has described HydraNets that use recurrence for memory: this is useful for temporarily occluded vehicles or for predicting road geometry from prior signs and road markings. At AI day, he clarified that multi-camera features in BEV are cached, along with kinematics, into a feature queue from which a spatial RNN reads. The RNN hidden state is spatially oriented around the car, with channels that track relevant obstacles and road geometry.

Spatial RNN video module predicting road geometry as the Tesla drives. Image taken from Tesla’s AI day 2021.
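The queue itself is the easy part to sketch: cache each timestep’s fused BEV features together with vehicle kinematics so the video module can consume a short history. The push policy (e.g., every N milliseconds or every M meters traveled), sizes, and field names below are my assumptions.

```python
import collections
import torch

class BEVFeatureQueue:
    """Sketch of a feature queue feeding a video module (e.g., a spatial RNN)."""
    def __init__(self, maxlen=16):
        self.buffer = collections.deque(maxlen=maxlen)   # oldest entries fall off automatically

    def push(self, bev_features, kinematics):
        # bev_features: (C, H, W) fused multi-camera features in BEV
        # kinematics:   e.g., a (2,) tensor of (velocity, yaw_rate) used to
        #               re-register features as the car moves
        self.buffer.append((bev_features, kinematics))

    def as_batch(self):
        feats = torch.stack([f for f, _ in self.buffer])   # (T, C, H, W)
        kins = torch.stack([k for _, k in self.buffer])    # (T, 2)
        return feats, kins
```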

Summary

The central HydraNet design principles have been consistent since 2019 and address some of the general challenges of concurrent multi-task learning:

  • Use a shared backbone that amortizes compute
  • Perform infrequent end-to-end training with feature caching
  • Parallelize terminal (task)-level fine-tuning from cached features

CNNs (ResNets or RegNets) for per-image feature generation have also been consistent since 2019. The camera-to-BEV transform, however, changed considerably, from a CNN-based pseudo-lidar approach to a transformer architecture.

The video module used recurrence in 2019 but appears to have undergone some architectural exploration recently: Karpathy notes that they considered transformers and 3D convolutions before settling on a spatial RNN. The overall architecture today moves from CNN to transformer to RNN as it goes from image feature extraction to view transform to temporal feature generation.

HydraNet architecture. Image taken from Tesla’s AI day 2021.

Regardless of your view on Tesla and their approach, their arc of progress over the years is worth studying. I appreciate that they’ve released these videos, and hopefully they will publish papers in the future, as many other AV companies do.


Lance Martin

Perception for automated trucks @ikerobotics. Previously @UberATG / @Otto, @stanford PhD