Scalable Active Learning for Autonomous Driving
- To address inefficiencies in training data selection for autonomous driving DNNs, we implemented a scalable active learning approach on our internal production-grade AI platform called MagLev.
- Focusing on nighttime detection tasks, our methodology shows that data selected with active learning yields relative improvements in mean average precision of 3x on pedestrian detection and 4.4x on detection of bicycles over data selected manually.
- Given the right infrastructure, this methodology can be fully automated to produce systematic data-driven DNN improvements.
Because autonomous driving is safety-critical, deep learning models for it must provide highly accurate predictions. To achieve this accuracy, deep neural networks (DNNs) require a large amount of training data. Collecting this level of data for autonomous driving is a major undertaking, and selecting the “right” training data that captures all the possible conditions the AI system must operate under poses a significant challenge. In this post, we present results on a method to automatically select the right data in order to build better training sets, faster: active learning.
Data Selection for Autonomous Driving
In the context of training deep neural nets for autonomous driving, there are several important reasons to optimize the way training data is selected:
- Scale: A simple back-of-the-envelope calculation shows that a fleet of 100 cars driving eight hours per day would require more than 1 million labelers to annotate all frames from all cameras at 30fps for object detection. This is impractical.
- Cost: Manually labeling frames takes an inordinate amount of time and labor, so we need to make sure we select the frames which give the highest increase in model performance.
- Performance: Selecting the “right” frames for training which, for example, contain rare scenarios that the model is not yet comfortable with, will lead to better results and accuracy.
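The estimate in the Scale bullet can be set up as a short calculation. The sketch below is illustrative only: the fleet size, driving hours, and frame rate come from the bullet, while the number of cameras per car, the labeling time per frame, and the labeler shift length are assumptions chosen for the example, not measured values.

```python
import math


def labelers_needed(cars=100, hours_per_day=8, fps=30,
                    cameras_per_car=6,        # assumption, not from the post
                    seconds_per_frame=60,     # assumption: labeling time per frame
                    labeler_hours_per_day=8): # assumption: labeler shift length
    """Labelers required to annotate one day of fleet recordings within one day."""
    frames_per_day = cars * hours_per_day * 3600 * fps * cameras_per_car
    labeling_seconds = frames_per_day * seconds_per_frame
    return math.ceil(labeling_seconds / (labeler_hours_per_day * 3600))


print(labelers_needed())  # 1080000 under the assumed values
```

Changing the assumed values moves the result, but across plausible inputs it stays far beyond what a labeling team can handle.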
Data Selection Methods
There are a few “classical” ways to select data:
- Random sampling: sample data uniformly (say 0.1%) from the available pool of data. By definition, this will capture the most common scenarios quickly but neglect the rare but informative patterns in the long tail of events.
- Metadata-based sampling: sample data based on cheap labels (rain, night, etc.), or other metadata coming from the car. This scales well and can help with rare scenarios. However, it ignores what the model itself already knows or has trouble with.
- Manual curation: sample data based on metadata, but also visualize it to manually select the most “helpful” frames. This is better than the methods above at finding rare events, but it is still likely to miss data that the model has trouble with. It is also error-prone and difficult to scale.
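The first two methods are simple enough to sketch in a few lines. The frame representation here (a dict with an `id` and a set of `tags`) and the function names are hypothetical, chosen only for illustration.

```python
import random


def random_sample(frames, fraction=0.001, seed=0):
    """Uniformly sample a fraction (0.1% by default) of the pool."""
    rng = random.Random(seed)
    k = max(1, round(len(frames) * fraction))
    return rng.sample(frames, k)


def metadata_sample(frames, required_tags):
    """Keep only frames whose cheap metadata tags include all required tags."""
    return [f for f in frames if required_tags <= f["tags"]]


# Hypothetical pool: every tenth frame is tagged as nighttime.
pool = [{"id": i, "tags": {"night"} if i % 10 == 0 else {"day"}}
        for i in range(10_000)]
night_frames = metadata_sample(pool, {"night"})
```

Neither function looks at model predictions, which is the limitation noted above.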
In this blog, we present an experiment to evaluate active learning, a more formal and automated process of finding the right training data. Specifically, we focus on answering the following question: to improve object detection using a deep neural network, what is the advantage of selecting training data using active learning compared to manual curation?
In contrast to public research in this area, we work with unlabeled data at scale and “in the wild”. Typical object detection research datasets make it possible to simulate the selection from a pool of at most ~200k frames (e.g., MS COCO). These frames have already been pre-selected for labeling when the dataset was created and thus contain mostly informative frames. In our experiments, we select from a pool of 2 million frames stemming from recordings collected by cars on the road, so they contain noisy and possibly irrelevant frames. A smart selection is absolutely required in this case.
We start by describing the active learning methodology we chose, discuss the experiment setup, and then present the results.
Active Learning Methodology
Our methodology uses pool-based active learning and an acquisition function based on disagreement between models. The acquisition function can be applied to unlabeled frames and is designed to identify the frames which are most informative to the model. We enable a repetitive loop that performs the following operations:
1. TRAIN: Train N models initialized with different random parameters on all currently labeled training data.
2. QUERY: Select examples from the unlabeled pool using an acquisition function that leverages “disagreement” between the N models.
3. ANNOTATE: Send selected examples to human labelers.
4. APPEND: Append newly labeled examples to the training data.
5. Go back to 1.
This is illustrated in the diagram below:
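In code form, the same loop might look like the minimal sketch below. It is not the MagLev implementation: `train_fn`, `score_fn`, and `annotate_fn` are hypothetical callables standing in for model training, the acquisition function, and the human labeling step.

```python
def active_learning_loop(labeled, unlabeled, train_fn, score_fn, annotate_fn,
                         n_models=8, budget=19_000, iterations=1):
    """Run pool-based active learning; returns the updated (labeled, unlabeled)."""
    for _ in range(iterations):
        # TRAIN: one model per random seed on all currently labeled data.
        models = [train_fn(labeled, seed=s) for s in range(n_models)]

        # QUERY: rank the unlabeled pool by acquisition score, highest first.
        order = sorted(range(len(unlabeled)),
                       key=lambda i: score_fn(models, unlabeled[i]),
                       reverse=True)
        chosen = order[:budget]

        # ANNOTATE: send the selected examples to (human) labelers.
        newly_labeled = [annotate_fn(unlabeled[i]) for i in chosen]

        # APPEND: grow the training set and shrink the pool.
        labeled = labeled + newly_labeled
        chosen_set = set(chosen)
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in chosen_set]
    return labeled, unlabeled
```

Passing the three steps in as functions keeps the loop independent of any particular model, acquisition function, or labeling platform.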
A/B Test: Nighttime Detection of Pedestrians and Bicycles
In our experiment, we run a single iteration of the active learning loop above. We select data with the specific goal of improving nighttime detection of pedestrians and bicycles/motorcycles (Vulnerable Road Users). Bad illumination and low contrast make this a challenging setting for a DNN (and sometimes also for humans) and, we believe, a good candidate for an active learning experiment.
- Choose an object detection DNN based on a relatively mainstream U-Net architecture that outputs bounding box candidates (coordinates and associated probabilities). The DNN detects objects for the classes: car, bicycle (which includes motorcycles), person, traffic light and road sign.
- Start with an initially labeled dataset of 850k images that contains only a few nighttime images for bicycle and person classes.
- Use an ensemble of eight models to compute “disagreement” for acquisition scores on the unlabeled data.
- Select the “best” ~19k frames out of a pool of 2M (nighttime) frames with the goal of improving detection accuracy of person and bicycle classes at nighttime.
The A/B test varies only the “QUERY” step. Method A selects 19k frames using active learning, while Method B uses the “Manual Curation” method described above to select 19k frames. The results are observed on a variety of test sets and described below.
In the following sections we describe the implementation of the above loop, step-by-step.
Step 1: TRAIN
On our selected training dataset of 850k images, we train eight models with different initial random parameters, but otherwise the same architecture and training schedule. Each training run uses eight GPUs and takes approximately two days.
Step 2: QUERY (Method A — Active Learning)
The query step aims at finding the 19k frames which are considered most informative according to our acquisition function.
The obstacle detection DNN we use outputs a 2D map of probabilities per class (bicycle, person, car, etc.). Each cell in this map corresponds to a patch of pixels in the input image, and the probability specifies whether an object of that class has a bounding box centered there.
We follow a Bayesian approach and use disagreement within our ensemble to compute mutual information between the predicted probabilities. We end up with one 2D-map of mutual information scores per class that allows us to visualize interesting heatmaps:
Our acquisition function is then the average of the scores in each map, computed separately for each class. In earlier experiments, we also tried various other acquisition functions ranging from simple entropy-based methods to more complex metrics based on detected bounding boxes or MC-Dropout, but overall the method above worked best.
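To make the scoring concrete, here is a minimal sketch of a disagreement-based mutual information score for one class. It assumes each grid cell is treated as an independent Bernoulli variable and uses the common ensemble formulation (entropy of the mean prediction minus the mean of the per-model entropies). The exact formulation used in the experiment is not spelled out here, so treat this as one plausible instance; the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in nats of a Bernoulli variable with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def mutual_information_map(prob_maps):
    """prob_maps[m][r][c] is model m's probability at grid cell (r, c)."""
    n_models = len(prob_maps)
    rows, cols = len(prob_maps[0]), len(prob_maps[0][0])
    mi = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            cell = [prob_maps[m][r][c] for m in range(n_models)]
            mean_p = sum(cell) / n_models
            expected_entropy = sum(binary_entropy(p) for p in cell) / n_models
            # Entropy of the averaged prediction minus average entropy;
            # clamp at zero to absorb floating-point rounding.
            mi[r][c] = max(0.0, binary_entropy(mean_p) - expected_entropy)
    return mi


def acquisition_score(mi_map):
    """Frame-level score for one class: mean over all cells of the map."""
    cells = [v for row in mi_map for v in row]
    return sum(cells) / len(cells)
```

A cell scores zero when all models predict the same probability and scores highest when confident models contradict each other, so the frame-level average rewards frames with many contested cells.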
We apply this acquisition function to our pool of unlabeled data. In this experiment, we restricted ourselves to an unlabeled pool of 2M nighttime images which we obtained by leveraging the metadata tags we add to all our recordings.
After applying the acquisition function, we selected the top 19k frames in a round-robin fashion over our two classes of interest (person and bicycle).
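One way to implement a round-robin selection over classes is sketched below. The data layout (a score per frame for each class) and the handling of frames that rank highly for both classes (each frame is selected once, and the class moves on to its next candidate) are assumptions made for illustration.

```python
def round_robin_top_k(scores_by_class, k):
    """scores_by_class maps class name -> {frame_id: acquisition score}."""
    # Rank frames per class, highest score first.
    rankings = {cls: sorted(scores, key=scores.get, reverse=True)
                for cls, scores in scores_by_class.items()}
    positions = {cls: 0 for cls in rankings}
    selected, seen = [], set()
    while len(selected) < k:
        progressed = False
        for cls, ranking in rankings.items():
            # Skip frames already picked on behalf of another class.
            while positions[cls] < len(ranking) and ranking[positions[cls]] in seen:
                positions[cls] += 1
            if positions[cls] < len(ranking):
                frame = ranking[positions[cls]]
                selected.append(frame)
                seen.add(frame)
                positions[cls] += 1
                progressed = True
                if len(selected) == k:
                    break
        if not progressed:  # every ranking is exhausted
            break
    return selected
```

Alternating between classes keeps one class with systematically higher scores from consuming the whole labeling budget.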
Here are some examples of selected frames:
Step 2: QUERY (Method B — Manual Curation)
The “Manual Curation” process leverages metadata tags to select segments of videos at nighttime. Additionally, other filters such as geolocation were used to select areas with a higher likelihood of presence of pedestrians and bicycles.
Using the above selection, the curation team then scrolls through the videos to find narrower segments and selects frames where bicycles and pedestrians are present until the dataset size exceeds the required number of 19k frames.
Note: The active learning methodology, in comparison, doesn’t just look for frames with person or bicycle objects, but where it is the most uncertain about those classes. One intuitive consequence of this is that human curation could lead to building datasets that lack hard negative examples, e.g., frames that contain no actual bicycle but are confusing to the model nonetheless.
Step 3 & 4: ANNOTATE AND APPEND
The frames from Step 2 are then enqueued for labeling in our labeling platform. A trained team of labelers annotates the bounding boxes, with several quality assurance steps to ensure correctness.
Once the data has been labeled, we obtain two new labeled datasets of 19k frames each, one from active learning (AL) and one from manual curation (MANUAL).
A/B Test Results
After labeling was completed, we first observed that the overall labeling cost per frame was roughly the same for both datasets (within 5%, measured via annotation time). We then compared the number of identified objects in each dataset. The figure below shows the number of objects for several selected classes (not all object classes shown are actually used in training). Overall, the number of objects selected by active learning is around 12% higher.
We can clearly see that the active learning selection contains more person and bicycle objects, while fewer car and other objects were present. Hence the data selected by active learning is more directed.
In the right-hand diagram, we can also see that the active learning selection picked frames from many more driving sessions. This is a clear advantage of an automatic selection, as it can afford to select only the few most informative frames from each driving session. Humans typically resort to selecting subsequences of many frames within a session. We therefore expect the active learning selection to be more diverse.
Evaluation Setup & Metrics
We follow a cross-validation approach to evaluate specifically nighttime performance for bicycle and person classes. To this end, from each newly labeled dataset (MANUAL and AL), we split off (additional) training and test sets in a 90/10 ratio three times.
We then train new models by adding the training portion of the split to our initially labeled training data. In addition, we trained a REFERENCE model on only the initially labeled data.
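The repeated 90/10 split can be sketched as follows. This assumes a simple random split at the frame level with a fixed seed. Whether the actual splits were made per frame or per driving session is not stated here, and the function name is hypothetical.

```python
import random


def repeated_splits(frames, n_repeats=3, test_fraction=0.1, seed=0):
    """Return n_repeats independent (train, test) splits of the frames."""
    rng = random.Random(seed)
    n_test = round(len(frames) * test_fraction)
    splits = []
    for _ in range(n_repeats):
        shuffled = list(frames)
        rng.shuffle(shuffled)
        # First n_test shuffled frames form the test set, the rest train.
        splits.append((shuffled[n_test:], shuffled[:n_test]))
    return splits
```

Each training portion would then be added to the initially labeled data before training, and metrics averaged over the repeats.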
First, we evaluated all models on existing global test data for mainly daylight conditions, showing that the active learning data performs on par with the data from manual curation.
Next, we evaluated the models on the test data for nighttime derived from both MANUAL and AL data. We focus on weighted mean average precision (wMAP) which averages MAP across several object sizes and report improvements over the REFERENCE model.
As we can see in the figure above, both manual and active learning selection improve over the REFERENCE model. For the person class, the data selected by active learning improves weighted mean average precision (wMAP) by 3.3%, while the manually selected data improves only by 1.1%, which means the relative improvement by AL is 3x.
For the bicycle class, the AL-selected data improves wMAP by 5.9%, compared to an improvement by 1.4% for manually curated data, i.e., AL-selected data gives a relative improvement of 4.4x. This confirms that AL-selected data performs significantly better on a test set for bicycle and person classes at night, the main goal of this data selection.
In addition, we report MAP for large- and medium-sized objects. These are important since large objects are usually close and sometimes in front of the car. Note in the table below how the active learning data considerably outperforms the manually curated data, especially for large objects.
The results above are computed on a test set that (also) contains data selected via active learning. To verify that the active learning selection is not biased and generalizes to data selected via manual curation, we also evaluate on the test set composed of only the test splits from MANUAL data.
Again, both manual and active learning selection improve over the reference model. For bicycle objects, the improvement due to active learning data is again considerably higher (wMAP increase of 3.2% vs 2.3%), i.e., still a 1.4x improvement, while performing on par for persons.
The table below shows more detailed results. In particular, the performance increase on large bicycle objects is considerable. This demonstrates that the data selected by active learning is at least as good as manually curated data, even when testing exclusively on data from manual curation.
Overall, the results show a strong improvement from data selected via active learning compared to a manual curation by experts.
A Word on Infrastructure
The experiment above ran a single iteration of an active learning loop on our internal AI platform MagLev, which provided, among other things, the scalable infrastructure required for training and inference and a centralized data platform for metadata. Implementing an automatic active learning loop in production and running it for many iterations requires even more. To this end, MagLev enables such workloads via:
- A large compute and data cluster: Active learning requires continuously training new models and running inference on unlabeled data at scale (billions of frames). This requires high-performance hardware for training and inference as well as large and efficient data storage.
- A scalable workflow management platform: The ability to describe complex task dependencies with high parallelism, traceability, caching and auto-scheduling is essential to build automation.
- A large-scale data management platform: Active learning requires fast scalable access to structured metadata and unstructured data like images. For example, at each point in time we need to be able to identify and access all unlabeled data, train on the latest labeled training dataset, and store uncertainty information associated with each frame.
- Traceability: At that scale, it is necessary to store immutable versions of all involved data, models, and code and to track all experiments and their metrics to ensure reproducibility.
- A high-throughput and programmatic labeling platform: A programmatic interface is needed to orchestrate a labeling platform which dispatches labeling tasks to a large workforce for high throughput and quality.
Our MagLev infrastructure provides this and more.
We applied active learning in an autonomous driving setting to improve nighttime detection of pedestrians and bicycles. We compared two methods of selecting data: automatically via active learning and manually via a selection by experts. Results show very strong performance improvements for the automatic selection, in some cases giving more than 4x the mean average precision improvement compared to the manual selection. Labeling costs in this experiment were the same (within 5% difference). This validates that active learning is a very promising avenue to continuously and automatically improve performance for autonomous driving.
We believe that the applied disagreement-based method is generic (could be applied to a large range of data types and tasks) and scalable (no limitations on the dataset size) provided the right underlying infrastructure. Besides a scalable infrastructure, regular and thorough evaluation on a carefully selected test set covering all relevant areas is necessary to ensure no “blind spots” are introduced (as is the case for manually selected training data). As major future work, we plan long-running experiments with several iterations of selecting and annotating data for varying scenarios.
NVIDIA Authors: Elmar Haussmann, Jan Ivanecky, Michele Fenzi, Jose Alvarez, Kashyap Chitta, Donna Roy, and Akshita Mittel.
Thanks also to management (Clement Farabet, Nicolas Koumchatzky) and the help from our MagLev Infrastructure team. Their support enabled us to deliver this A/B test at the required scale.