How we Scale Machine Learning

Paul Gresia
12 min read · May 26, 2020


by Daniel Lee on May 18th, 2020


At Scale AI, we label on the order of 10MM annotations per week. To deliver high-quality annotations for this enormous volume of data, we’ve developed a number of techniques including advanced sensor fusion to provide rich detail about complex environments, active tooling to accelerate the labeling process, and automated benchmarks to measure and maintain labeler (Tasker) quality. As we work with more customers, more Taskers, and more data, we continue to refine these methods to improve our labeling quality, efficiency, and scalability.

How we use ML

While this vast quantity of data provides Scale AI with invaluable opportunities to learn and to build upon our annotation processes, it also enables our Machine Learning team to train models that further augment our capabilities. We leverage ML models throughout our annotation pipeline including:

1. Pre-labeling to reduce Taskers’ manual work

2. Active tooling to maximize Taskers’ speed & efficiency

3. ML Linters to check Taskers’ annotations for potential errors

We have observed that ML models offer orders-of-magnitude improvement for every component of our annotation pipeline. For our customers, this equates to higher quality, higher throughput, and lower prices. Because ML has such high potential impact, it is important that our models are not only highly performant but also highly scalable. This means we have to be very intentional about how we scale machine learning across our many customers, task types, and datasets.

Challenges of Vast and Varied Data

The naive approach is to train a model for each customer dataset. But this doesn’t scale. It would be very expensive for an ML engineer to train and deploy a new model whenever a customer wants a new dataset. It can also take significant “ramp time” before we’ve even labeled enough data to train a useful model. More importantly, research from Google AI has demonstrated that “model performance increases logarithmically based on volume of training data.” Scale AI has a vast and ever-growing quantity of labeled data. This means training separate models, each on a small fraction of the available data, would waste enormous potential performance gains.

If we train a model per dataset, we need to wait for the “ramp time” to annotate a sufficiently large training set before we can actually leverage ML. Training models across datasets increases performance, improves generalization, and eliminates this ramp time.

To effectively leverage our data advantage, we need to train models across customers’ datasets. Doing so reduces the cost of developing a custom model per dataset, eliminates the ramp time needed to train a useful model, and drastically improves model performance. Ultimately, this enhances the product for our customers. However, using models across customer datasets is complicated because each dataset is unique. The two primary differences between customer datasets are the data domain and label taxonomy.

Data Domain

Images from different datasets (COCO, Mapillary, KITTI, NuScenes, Waterloo) highlighting domain differences.

In the context of machine learning, a domain is the underlying distribution your data is sampled from. For example, within the general domain of driving images, there are a number of factors that affect the data distribution such as:

  1. Weather and lighting conditions. (E.g. day vs. night, snow vs. rain, sunny vs. overcast)
  2. The type of sensor. This can affect image perspective, object dimensions, color contrast, etc. Cameras at the front of a car will capture different data than cameras on the sides.
  3. The location. An urban environment will feature more occluded objects than a highway. Motorcycles will appear more frequently in Thailand than in Germany.

Models perform best when the training data distribution is representative of the target distribution. Ideally, the training and target data come from the same domain. However, our customers’ datasets exhibit all these variations and more.

Label Taxonomy

NuScenes image labeled according to three different customer-taxonomies exhibits naming and missing label issues.

Even if the training data comes from the same underlying distribution as the target data, the training labels must also be representative of the target labels. However, all of our customers want their data labeled differently. For a given dataset, the customer defines a set of rules specifying which objects to label, what names to give them, how tight to make the bounding boxes, etc. A taxonomy is the combination of all these rules. Variations among customer-taxonomies pose several challenges for training models.

The most obvious issue is that customer-taxonomies use different names for labeling the same object classes. For example, some customers may label any person as “Human” while others distinguish between “Pedestrian_adult” and “Pedestrian_child.” Over the course of my internship I’ve learned more nuanced labels for people, vehicles, and signs than I’d like to admit.

We face an even greater challenge when customer-taxonomies label different sets of object classes. For example, in the figure above, Taxonomy A only labels people, Taxonomy C only labels vehicles, and Taxonomy B labels both people and vehicles. When training an object detector for people and vehicles, we’d want to use data from all three taxonomies to maximize performance. However, images from Taxonomy A with unlabeled vehicles will encourage the model not to detect vehicles while images from Taxonomy C with unlabeled people will encourage the model not to detect people. This issue is particularly problematic because these “missing labels” are systematic. As we’ve mentioned in Quantity is no Panacea, random labeling errors may be mitigated by increasing the quantity of training data, but systematic errors will severely impact model performance.

Tackling the Taxonomy Dilemma

In this section, we’ll walk you through the steps we took to overcome these challenges. Training across customer datasets should provide our models with a large, varied training set to cope with the differences in data domain. But we need more innovative ideas to cope with differences in taxonomy. For simplicity, we’ll focus on 2D object detection and use EfficientDet, a state-of-the-art object detection model recently released by Google Brain. We’ll detail three approaches we’ve explored to tackle the taxonomy dilemma: using the model architecture, the dataset, and the loss function. As you read on, you’ll see why modifying the loss function via Taxonomy Loss Masking yields the best solution.

Approach 1: Separate Tasks

EfficientDet backbone with three pairs of taxonomy-specific classification and regression heads (modified from EfficientDet).

Our initial approach takes the perspective of multitask learning — we treat prediction for each customer-taxonomy as a separate task. In this setting, we have a shared backbone which generates a taxonomy-agnostic feature embedding, and a pair of classification and regression heads for each of the target customer-taxonomies (Figure above). To distinguish differences in taxonomies, the taxonomy-specific heads are only trained on examples labeled according to that customer-taxonomy. For training/inference on an example from customer dataset A, the Taxonomy A classification and regression heads perform the prediction.
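The routing between a shared backbone and taxonomy-specific heads can be sketched roughly as follows. This is a toy illustration, not EfficientDet code: `shared_backbone`, `make_head`, `FEATURE_DIM`, and the label names "Car" and "Vehicle" are hypothetical stand-ins, and only a classification head is shown (the regression heads would be routed the same way).

```python
import random

random.seed(0)
FEATURE_DIM = 4  # toy embedding size (assumption, for illustration only)


def shared_backbone(image):
    """Stand-in for the shared backbone: returns a taxonomy-agnostic embedding."""
    # Toy "features": pad or truncate the input to FEATURE_DIM values.
    return (list(image) + [0.0] * FEATURE_DIM)[:FEATURE_DIM]


def make_head(num_classes):
    """Stand-in for a classification head: one weight vector per class."""
    return [[random.uniform(-1, 1) for _ in range(FEATURE_DIM)]
            for _ in range(num_classes)]


# Hypothetical label sets; each customer-taxonomy gets its own head.
TAXONOMY_CLASSES = {
    "A": ["Human"],
    "B": ["Pedestrian_adult", "Pedestrian_child", "Car"],
    "C": ["Vehicle"],
}
heads = {name: make_head(len(classes))
         for name, classes in TAXONOMY_CLASSES.items()}


def predict(image, taxonomy):
    """Route the shared embedding through the head of the given taxonomy."""
    features = shared_backbone(image)
    scores = [sum(w * f for w, f in zip(weights, features))
              for weights in heads[taxonomy]]
    return dict(zip(TAXONOMY_CLASSES[taxonomy], scores))
```

Calling `predict(image, "B")` only ever touches the Taxonomy B head, so the number of heads to train grows with the number of customer-taxonomies.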

Unfortunately, like the naive approach of training a model for each dataset, this multitask approach requires a “ramp time” before we’ve annotated enough examples to train a new taxonomy-specific head. Additionally, the cost of this approach scales with the number of customer-taxonomies because we have to train a pair of heads for each. Furthermore, this multitask approach is stymied by the classic problem in multitask learning where each head “fights for capacity” (Andrej Karpathy, ICML 2019). There exist complex correlations between tasks, so the backbone struggles to learn a representation that satisfies them all. We observe this phenomenon in practice: our model achieves only mediocre performance.

This first approach uses the model architecture, namely taxonomy-specific heads, to cope with differences in customer-taxonomies. However, we neglect to tell our model crucial information: customer-taxonomies are closely related.

Super Taxonomy

The next two approaches rely on the notion of a “super-taxonomy.” Instead of directly distinguishing between customer-taxonomies, we can define a general “super-taxonomy” that encompasses them all. Then, we define a mapping from labels in the customer-taxonomies to labels in the super-taxonomy. This allows us to encode our priors about relationships between the customer-taxonomies by treating each as a subset of the super-taxonomy.

The mapping from customer-taxonomies to the super taxonomy.

By mapping customer labels to labels in the super-taxonomy, the individual customer datasets can be combined into a “super-dataset” with a consistent naming scheme. However, the super-dataset suffers from the systematic ‘missing label issue’ we discussed in the Label Taxonomy section.
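A minimal sketch of such a mapping, assuming a two-class super-taxonomy. The names `SUPER_TAXONOMY`, `LABEL_MAP`, and the labels "Car" and "Truck" are hypothetical; "Human", "Pedestrian_adult", and "Pedestrian_child" are the examples mentioned earlier.

```python
# Hypothetical super-taxonomy and per-customer label mappings.
SUPER_TAXONOMY = ["Person", "Vehicle"]

LABEL_MAP = {
    "A": {"Human": "Person"},
    "B": {"Pedestrian_adult": "Person",
          "Pedestrian_child": "Person",
          "Car": "Vehicle"},
    "C": {"Car": "Vehicle", "Truck": "Vehicle"},
}


def to_super_labels(taxonomy, annotations):
    """Rename (customer_label, box) pairs to (super_label, box) pairs."""
    mapping = LABEL_MAP[taxonomy]
    return [(mapping[label], box) for label, box in annotations]


def missing_classes(taxonomy):
    """Super-taxonomy classes that this customer-taxonomy never labels."""
    return set(SUPER_TAXONOMY) - set(LABEL_MAP[taxonomy].values())
```

Here `missing_classes("A")` returns `{"Vehicle"}`: the renaming gives a consistent naming scheme, but every Taxonomy A image still lacks vehicle labels, which is exactly the systematic gap in the super-dataset.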

As a quick baseline, we tried training directly on this super-dataset. We were optimistic because we read that “(randomly) dropping 30% of the annotations… only drops (performance) by 5% on the PASCAL VOC dataset” (Soft Sampling for Robust Object Detection). But even after adding extra tricks like Background Recalibration Loss, the resulting model had abysmal recall — likely because our missing labels are systematic, not random.

Approach 2: Separate Datasets

In this second approach, we address the missing label problem by dividing the super-dataset into separate class-specific super-datasets, one for each label in the super-taxonomy. As shown in the Figure below, we eliminate the missing label issue in each of the class-specific super-datasets by including only customer-taxonomies which label the corresponding class. This enables us to train a class-specific model for each class-specific super-dataset. For inference, we simply run all of these class-specific models on an image and combine their outputs.

Class-specific super-datasets only include customer taxonomies which label the corresponding class. We train a single class-specific model for each class-specific super-dataset. All of these models are combined for inference.
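The partitioning can be sketched as below. This is an illustrative sketch under assumed data shapes: the example dictionaries, `TAXONOMY_CLASSES`, and the callable "models" are hypothetical, not the production pipeline.

```python
from collections import defaultdict

# Hypothetical: super-taxonomy classes labeled by each customer-taxonomy.
TAXONOMY_CLASSES = {
    "A": {"Person"},
    "B": {"Person", "Vehicle"},
    "C": {"Vehicle"},
}


def build_class_datasets(examples):
    """Split a super-dataset into one dataset per super-taxonomy class.

    Each example is a dict with keys "image", "taxonomy", and "annotations"
    (a list of (super_label, box) pairs). An image is added to a class's
    dataset only if its taxonomy labels that class, so an empty box list
    there means "no such object", not "not labeled".
    """
    datasets = defaultdict(list)
    for example in examples:
        for cls in TAXONOMY_CLASSES[example["taxonomy"]]:
            boxes = [box for label, box in example["annotations"] if label == cls]
            datasets[cls].append({"image": example["image"], "boxes": boxes})
    return dict(datasets)


def combine_predictions(image, class_models):
    """Run every class-specific model on the image and merge the detections."""
    detections = []
    for cls, model in class_models.items():
        detections.extend((cls, box) for box in model(image))
    return detections
```

Inference cost in this sketch is one model call per class, which is why the approach scales with the size of the super-taxonomy.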

This approach yields good performance because labels within each dataset are consistent and each model learns a good general representation for its class. Using so many models would be impractical for most of our self-driving customers due to the constrained computational resources and strict latency requirements associated with perception. However, these limitations don’t apply to Scale AI because labeling operates on a much longer timescale than perception. Using separate models has the added benefit of decoupling performance across classes — we can re-train our Vehicle model without affecting the performance on any other class.

In this approach, the cost of training and inference scales with the number of classes in our super-taxonomy, rather than the number of customer-taxonomies. This is a marked improvement over the multitasking approach if we have a small super-taxonomy. However, our super-taxonomies are often fairly large. This approach is undesirable because it requires having many separate models that should theoretically be able to share features. We’d prefer to use just a single model.

Approach 3: Separate Loss — Taxonomy Loss Masking

What if, instead of telling our model about differences in customer-taxonomies through the architecture or through separate datasets, we make this distinction in the loss function? To answer this question, let’s take a closer look at how our model is trained.

Focal Loss mitigates class imbalance by re-weighting binary cross entropy loss.

Single-stage detectors, like EfficientDet, detect objects by making predictions at many predefined anchor boxes. Let A be the number of anchor boxes and K be the number of target object classes. The EfficientDet classifier head predicts a matrix of probabilities $P \in \mathbb{R}^{A \times K}$, where $P_{ij}$ is the probability that there is an instance of class j at anchor i. The model is trained using Focal Loss, a standard loss function for mitigating extreme class imbalance in single-stage detectors. As demonstrated in the Figure below, this loss is applied to predictions for “positive” and “negative” anchors. The loss is not applied to predictions for “unassigned” anchors; these values are masked during training because they provide ambiguous signals to the model.
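To make the anchor-level masking concrete, here is a small pure-Python sketch of Focal Loss over the A×K probability matrix. The defaults `alpha=0.25` and `gamma=2.0` are the values commonly used with Focal Loss; the function names, the input layout, and the normalization by the number of positive anchors are assumptions made for illustration, not a description of our training code. The IoU thresholds follow the figure caption (positive at IoU ≥ 0.5, negative below 0.4).

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for one predicted probability p and a target in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)            # clamp to avoid log(0)
    p_t = p if target == 1 else 1.0 - p        # probability of the true outcome
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def anchor_state(iou, pos_thresh=0.5, neg_thresh=0.4):
    """Classify an anchor by its IoU with the nearest ground-truth box."""
    if iou >= pos_thresh:
        return "positive"
    if iou < neg_thresh:
        return "negative"
    return "unassigned"


def classification_loss(probs, ious, gt_classes):
    """Focal loss over an A x K matrix, skipping unassigned anchors (rows).

    probs: A rows of K probabilities; ious: best IoU per anchor;
    gt_classes: class index of the nearest ground-truth box per anchor.
    """
    total, num_positive = 0.0, 0
    for row, iou, gt in zip(probs, ious, gt_classes):
        state = anchor_state(iou)
        if state == "unassigned":
            continue                           # masked: contributes no loss
        if state == "positive":
            num_positive += 1
        for j, p in enumerate(row):
            target = 1 if (state == "positive" and j == gt) else 0
            total += focal_loss(p, target)
    return total / max(num_positive, 1)
```

Because of the `(1 - p_t) ** gamma` factor, a confident correct prediction contributes far less loss than an uncertain one, which is how the many easy negative anchors are down-weighted.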

We realized that, because the detector is class-aware (as opposed to the more common class-agnostic Region Proposal Network in two-stage detectors), we can use a similar loss masking scheme to deal with the taxonomy dilemma. In addition to masking loss coming from anchors (rows) based on IoU, we mask loss coming from target object classes (columns) if the class is “missing” in the corresponding customer-taxonomy. If a training example comes from a customer-taxonomy that doesn’t include “Vehicle,” we know “Vehicle” objects are not labeled, so the probability of “Vehicle” for every anchor box is ignored and the corresponding loss values are masked. We define a “missing label mask” $m \in \{0,1\}^K$ for each customer-taxonomy that tells us which classes (columns) to mask when training on an example from that customer-taxonomy. During inference, the model makes predictions for all classes. We call this Taxonomy Loss Masking.

Left: Masking in Focal Loss. We depict $P \in \mathbb{R}^{A \times K}$, the matrix of probabilities predicted by the classification head, where $P_{ij}$ is the probability that there is an instance of class j at anchor i. We show the anchors (rows) sorted by IoU with the nearest ground truth annotation: green rows have IoU ≥ 0.5 and red rows have IoU less than 0.4. Focal Loss applies to these “positive” and “negative” anchors respectively. However, anchors with IoU in [0.4, 0.5) are neither positive nor negative. They are “unassigned” and their loss is masked. Right: Taxonomy Loss Masking. When training on an example from a customer-taxonomy which only labels Traffic Light and Person, the loss for all other classes (columns) is masked.
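A minimal sketch of Taxonomy Loss Masking in pure Python, combining the row (anchor) mask with a column (class) mask. Everything named here is illustrative: `SUPER_TAXONOMY`, the four hypothetical taxonomies in `TAXONOMY_CLASSES`, and the convention that a mask value of 1 means "keep the loss" and 0 means "mask it" are assumptions, not our production code.

```python
import math

SUPER_TAXONOMY = ["Person", "Vehicle", "Traffic Light"]

# Hypothetical customer-taxonomies and the super-taxonomy classes they label.
TAXONOMY_CLASSES = {
    "A": {"Person"},
    "B": {"Person", "Vehicle"},
    "C": {"Vehicle"},
    "D": {"Traffic Light", "Person"},
}


def missing_label_mask(taxonomy):
    """Per-class mask: 1 if the taxonomy labels the class, 0 if it is missing."""
    labeled = TAXONOMY_CLASSES[taxonomy]
    return [1 if cls in labeled else 0 for cls in SUPER_TAXONOMY]


def focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for one predicted probability p and a target in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def masked_classification_loss(probs, targets, anchor_mask, taxonomy):
    """Sum focal loss over anchors (rows) and classes (columns) that are kept.

    probs, targets: A x K lists; anchor_mask: length-A list where 0 marks an
    "unassigned" anchor; taxonomy: which customer-taxonomy labeled the example.
    """
    class_mask = missing_label_mask(taxonomy)
    total = 0.0
    for row_p, row_t, keep_anchor in zip(probs, targets, anchor_mask):
        if not keep_anchor:
            continue                      # row masked by IoU
        for p, t, keep_class in zip(row_p, row_t, class_mask):
            if keep_class:                # column masked if class is missing
                total += focal_loss(p, t)
    return total
```

With this masking, a confident "Vehicle" prediction on a Taxonomy A image (which never labels vehicles) adds nothing to the loss, so the model is not pushed to suppress vehicles it detects correctly.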

In essence, Taxonomy Loss Masking is multitasking across super-taxonomy classes without any task-specific parameters — using a fully-shared architecture, and only altering the loss function. This approach leverages maximum information about our priors: the taxonomies are closely related, some taxonomies are systematically “missing labels,” and the model should be able to share features across classes. Not only is this approach simpler than the previous two, but it also yields excellent performance, it’s cheaper, and it’s more scalable.

Although we primarily focused on 2D object detection for the domain of driving images, we can use variants of Taxonomy Loss Masking across domains and across task types including 3D object detection, 2D/3D semantic segmentation, etc.

In Case You’re Wondering

“The model only predicts labels in the super-taxonomy, not the actual customer-taxonomies. Isn’t this incomplete?” Good observation. We can predict labels in customer-taxonomies by using a hierarchy of classifiers and a hierarchical super-taxonomy. Each level of the classifier hierarchy will make a more granular prediction in the super-taxonomy and use a different “missing label mask.” The classifiers can even predict object attributes: e.g. whether a vehicle is parked or occluded (Scale AI offers attribute labeling too)!

But remember, our models are used to augment and accelerate our Taskers. For Taskers doing object detection, the most difficult part of labeling is detection. Classification is very easy because it’s just a multiple choice question. Although we try, we don’t actually need to solve the entire problem; we only need to focus on the expensive part — detection. I found this distinction particularly interesting and important during my internship.

“Are there other differences in customer taxonomies that you have to deal with?” Some customers have different labeling rules. For example, some want bounding boxes for vehicles to include the side mirrors while others want just the main body of the vehicle. We can address this issue by weighting the regression loss based on labeling rules. Let’s say we want the model to include side mirrors. If we’re training on data from Taxonomy A which includes side mirrors and Taxonomy B which doesn’t, we’ll weight the regression loss higher for instances from Taxonomy A. Another way to teach the model to include side mirrors is to oversample data from Taxonomy A during training, or to fine-tune the model on Taxonomy A after training.

Differences in labeling rules. The left bounding box includes side mirrors. The right does not.
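The loss weighting idea can be sketched like this. The weights in `REGRESSION_WEIGHTS` are made-up values, and smooth L1 is used only as a common stand-in for a box regression loss; neither is taken from our actual training setup.

```python
# Hypothetical weights: emphasize Taxonomy A, whose boxes include side mirrors.
REGRESSION_WEIGHTS = {"A": 2.0, "B": 0.5}


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for one coordinate: quadratic near zero, linear beyond beta."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def weighted_box_loss(pred_box, target_box, taxonomy):
    """Box regression loss scaled by the weight of the example's taxonomy."""
    weight = REGRESSION_WEIGHTS.get(taxonomy, 1.0)  # unknown taxonomies: weight 1
    return weight * sum(smooth_l1(p, t) for p, t in zip(pred_box, target_box))
```

The same box error costs four times more on a Taxonomy A instance than on a Taxonomy B instance in this sketch, nudging the regressor toward Taxonomy A’s box convention.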

Operation Vacation

Scale’s machine learning models supercharge our data annotation cycle. As long as our customers continue sending us data and our Taskers continue labeling, our models will continue improving, accelerating the labeling process, and perpetuating this virtuous cycle.

As we mentioned earlier, it is important that our models are trained on as much data as possible. The quantity of our training data grows along two primary dimensions — the number of customer datasets and the time we spend labeling these datasets. Taxonomy Loss Masking enabled us to scale our model’s training data across the customer dimension. Since our Taskers are continuously labeling data and the size of our datasets grows over time, it’s important that we also scale across the time dimension. In other words, we should continue training our models as we get more labeled data.

In a recent collaboration with PyTorch, Scaliens Daniel Havir and Nathan Hayflick demonstrated how we use asynchronous data streaming to train on large, growing datasets. This technique, along with innovative cloud infrastructure and distributed training, enables us to automatically train models as we accumulate more data. We use this system to push the limits of model performance for both training on super-taxonomies, and fine-tuning on individual customer datasets.

Daniel and Nathan also demonstrated how we use hashing to achieve a consistent train/test split when we have a growing dataset. This means that our train and test sets should grow at the same rate. Holding the test set constant, we’d expect increased model performance as we continue training on a growing train set. On the other hand, holding the model constant, we can examine performance on cohorts of older vs. newer examples in our test set. Changes in model performance along this dimension indicate drift in the data distribution over time, meaning that our customer is sending us different data. This is an example of how we can automatically monitor model performance.
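One common way to get such a split is to hash a stable example identifier and threshold the result. The sketch below assumes string IDs and a 10% test fraction; the function name, the choice of SHA-256, and those parameters are illustrative and may differ from what Daniel and Nathan presented.

```python
import hashlib


def assign_split(example_id, test_fraction=0.1):
    """Deterministically assign an example to "train" or "test" by hashing its ID.

    The assignment depends only on the ID, so it never changes as the dataset
    grows, and roughly test_fraction of new examples land in the test set.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"
```

Because no example ever moves between splits, a model trained last month was never exposed to anything in today’s test set, and both sets grow at about the same rate.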

In the same vein as Andrej Karpathy’s “Operation Vacation”, we use these training and monitoring systems to automate the model learning process. This saves engineering hours and enables our Machine Learning team to focus on finding new ways to enhance our annotation pipeline with ML. As long as our customers continue sending us data and our Taskers continue labeling, we can “take a vacation.” Our models will continue improving, accelerating the labeling process, and perpetuating this virtuous cycle.

Ultimately, scalable machine learning helps us continuously improve our labeling quality and efficiency while striving toward our mission: “To accelerate the development of AI by democratizing access to intelligent data.”