Addressing the Long Tail of Autonomous Vehicle Perception

Jun 17 · 3 min read

Humans can effortlessly and reliably perceive the roadway and other vehicles in any locale, under almost any weather, lighting, or environmental conditions. People who learned to drive on the wide, smooth roads of sunny California can instantly adapt to driving with the awful weather, poorly maintained roads, and different signage and road markings of Massachusetts without having to relearn to perceive the environment. In contrast, the performance of data-driven learning algorithms for autonomous vehicle (AV) perception is highly dependent on the training domain: a model trained on California roads and weather can fail catastrophically in Boston, or vice versa.

Inspired by the human ability to adapt perceptual learning between domains, and given the need for generalization to the long tail of AV sensor inputs, we’ve developed methods that let AV computer vision systems adapt experience collected in one domain to a range of other domains.

Unsupervised image translation for generalization to new conditions and locales

When we drive in a new environment, we relate the new situation to our previous driving experience and use that experience to interpret what we see. For example, lane markings follow similar geometric shapes across environments, but their appearance may differ depending on the locale, lighting, and weather conditions. Prior knowledge of lane geometry can help us detect lanes with very different appearances in new environments.

Inspired by this, we adopt recent techniques for style transfer to translate images from an old environment into realistic images in the style of a new environment, while preserving the semantic content of the old images. Because the content is preserved, the labels of the original images remain valid for the translated ones, so we can transfer labeled data to new environments and conditions and train models that generalize beyond their previous learning experiences. Specifically, we use unsupervised image-to-image translation methods to learn a mapping from images collected in one locale and set of conditions to the distribution of images from another locale and set of conditions. This lets us adapt the distribution of images from publicly available annotated datasets to better match corner-case weather and environmental conditions.
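The label-transfer idea can be illustrated with a minimal sketch. It assumes only what the paragraph above states: a translator changes appearance but not scene geometry, so each translated image can reuse the annotation of its source image. The function `toy_translate`, the `styles` list, and the array shapes are hypothetical stand-ins chosen for illustration; a learned unsupervised image-to-image model would take the translator's place.

```python
import numpy as np


def toy_translate(image, gain, bias):
    """Hypothetical stand-in for a learned translator: a global intensity shift.

    A real image-to-image model would change texture, lighting, and weather.
    This toy only rescales pixel values, so the scene geometry is untouched.
    """
    return np.clip(image * gain + bias, 0.0, 1.0)


def augment_with_translations(images, labels, styles):
    """Pair every translated image with the label of its source image."""
    out_images, out_labels = [], []
    for image, label in zip(images, labels):
        out_images.append(image)
        out_labels.append(label)
        for gain, bias in styles:
            out_images.append(toy_translate(image, gain, bias))
            out_labels.append(label)  # the annotation is reused unchanged
    return out_images, out_labels


rng = np.random.default_rng(0)
images = [rng.random((4, 6, 3)) for _ in range(2)]   # two toy RGB images
labels = [np.array([[0, 1], [3, 4]]),                # toy lane points (row, col)
          np.array([[1, 2], [2, 5]])]
styles = [(0.5, 0.1), (0.8, 0.2)]                    # two toy target "styles"

aug_images, aug_labels = augment_with_translations(images, labels, styles)
print(len(aug_images), len(aug_labels))  # 6 6
```

No new annotation is created at any point: the labeled set grows from two images to six, and every added image inherits an existing label.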

Our approach translates the original sunny California image (top left) to different Massachusetts styles but still preserves the major semantic structure of the scene.

Here we consider the lane detection task as a testbed for our style transfer approach.

Without any additional annotation, FastDraw, our lane detection model, adapts to new conditions and achieves high accuracy on challenging Massachusetts images, even though its training data was collected in sunny California. The color of each lane represents the uncertainty of the lane prediction.

An end-to-end neural network model for lane detection that runs at 90 FPS

We built a lane detection model that integrates the decoding step directly into the network. Our model, called FastDraw, “draws” lanes on an image in an autoregressive manner: it starts from a single pixel and sequentially predicts the local lane shape to extend the lane across the entire image. At test time, we decode the global lane by following the local contours predicted by the CNN. Because decoding is largely carried out by the convolutional backbone, we are able to optimize the network to run at 90 frames per second on a GTX 1080 GPU.
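To make the "follow the local contours" step concrete, here is a minimal sketch of autoregressive decoding. It is not the FastDraw implementation. It assumes a hypothetical output format in which the network scores, for every pixel, a small set of horizontal offsets for the lane's position one row higher; the name `decode_lane`, the `offsets` list, and the array layout are illustrative choices.

```python
import numpy as np


def decode_lane(offset_logits, start_row, start_col, offsets):
    """Trace one lane upward from a seed pixel by following per-pixel predictions.

    offset_logits: array of shape (H, W, K), a hypothetical network output.
        Entry [r, c, k] scores moving from (r, c) to (r - 1, c + offsets[k]).
    offsets: list of K integer column shifts.
    Returns the list of (row, col) pixels visited, starting at the seed.
    """
    _, width, _ = offset_logits.shape
    row, col = start_row, start_col
    lane = [(row, col)]
    while row > 0:
        k = int(np.argmax(offset_logits[row, col]))  # most likely local step
        col = col + offsets[k]
        row -= 1
        if not 0 <= col < width:
            break  # the lane has left the image
        lane.append((row, col))
    return lane


# Toy prediction: every pixel votes for "one column to the right".
offsets = [-1, 0, 1]
logits = np.zeros((5, 7, 3))
logits[:, :, 2] = 1.0

lane = decode_lane(logits, start_row=4, start_col=1, offsets=offsets)
print(lane)  # [(4, 1), (3, 2), (2, 3), (1, 4), (0, 5)]
```

Each step only reads the prediction at the current pixel, so decoding costs one lookup per image row on top of the single forward pass that produced `offset_logits`.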

In contrast, previous models of lane detection generally follow three steps that are not end-to-end differentiable. First, the likelihood that each pixel is part of a lane is estimated. Second, pixels that clear a certain threshold probability of being part of a lane are collected. Lastly, these pixels are clustered, for instance with RANSAC, into individual lanes. Because the second and third steps, in which road structure is inferred from a point cloud of candidate pixels, are in general not differentiable, the performance of models that follow this template is limited by the performance of the initial segmentation. These post-processing steps also tend to be slow and computationally expensive.
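The three-step template can be sketched as follows. This is a generic illustration rather than any specific published pipeline: it assumes the per-pixel probability map (step one) is already available, fits a single straight lane of the form col = a * row + b, and uses a basic RANSAC loop. The names `ransac_line` and `detect_lane`, the tolerance, and the toy data are all assumptions made for the example.

```python
import numpy as np


def ransac_line(points, n_iters=200, tol=1.0, seed=0):
    """Fit col = a * row + b to (row, col) points with a basic RANSAC loop."""
    rng = np.random.default_rng(seed)
    rows = points[:, 0].astype(float)
    cols = points[:, 1].astype(float)
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        if rows[i] == rows[j]:
            continue  # two pixels in one row cannot define col as a function of row
        a = (cols[i] - cols[j]) / (rows[i] - rows[j])
        b = cols[i] - a * rows[i]
        mask = np.abs(cols - (a * rows + b)) <= tol
        if mask.sum() > best_mask.sum():
            best_mask = mask
    if best_mask.sum() < 2:
        raise ValueError("RANSAC found no usable line")
    # Refine the best candidate with a least-squares fit on its inliers.
    a, b = np.polyfit(rows[best_mask], cols[best_mask], deg=1)
    return a, b, best_mask


def detect_lane(prob_map, threshold=0.5):
    """Steps two and three: threshold the probability map, then group pixels."""
    candidates = np.argwhere(prob_map > threshold)  # hard, non-differentiable cut
    return ransac_line(candidates)


# Toy probability map: one lane along col = row + 2, plus three spurious pixels.
prob_map = np.zeros((20, 20))
for r in range(16):
    prob_map[r, r + 2] = 0.9
for r, c in [(2, 18), (10, 0), (17, 5)]:
    prob_map[r, c] = 0.8

a, b, mask = detect_lane(prob_map)
print(f"{a:.2f} {b:.2f} {int(mask.sum())}")
```

Both the hard threshold and the random sampling in the fit are discrete operations, so no gradient from the final lane estimate reaches the segmentation network. That is the limitation the paragraph above describes.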

For more details, please read our paper, published at CVPR 2019.

To learn more about our research, please visit our webpage.