Literature Review: MultiNet

David Silver
Published in Self-Driving Cars
May 31, 2017

Today I downloaded the academic paper “MultiNet: Real-Time Joint Semantic Reasoning for Autonomous Driving” by Teichmann et al., as they say in the academy.

I thought I’d try to summarize it, mostly as an exercise in trying to understand the paper myself.

Background

This paper appears to originate out of the lab of Raquel Urtasun, the University of Toronto professor who just joined Uber ATG. Prior to Uber, Urtasun helped compile the KITTI benchmark dataset.

KITTI has a great collection of images for autonomous driving, with corresponding leaderboards for various tasks, like visual odometry and object tracking. The MultiNet paper is an entry on the KITTI Lane Detection leaderboard.

Right now, MultiNet sits at 15th place on the leaderboard, but it’s the top entry that’s been formally written up in an academic paper.

Goals

Interestingly, the goal of MultiNet is not exactly to win the KITTI Lane Detection competition. Rather, it’s to train a network that can segment the road quickly, in real time. Adding complexity, the network also detects and classifies vehicles on the road.

Why not?

Architecture

The MultiNet architecture is three-headed. The beginning of the network is just VGG16, without the three fully connected layers at the end. This part of the network is the “encoder” part of the standard encoder-decoder architecture.

Conceptually, the “CNN Encoder” reduces each input image down to a set of features. Specifically, 512 features, since the output tensor (“Encoded Features”) of the encoder is 39x12x512.

For each region of an input image, this Encoded Features tensor captures a measure of how strongly each of 512 features is represented in that region.

Since this is a neural network, we don’t really know what these features are, and they may not even really be things we can explain. It’s just whatever things the network learns to be important.
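
To make the encoder concrete, here’s a minimal sketch in Keras. The 384x1248 input size matches the KITTI images the paper works with; everything else (the use of Keras, the pretrained ImageNet weights) is my own illustration, not the authors’ code.

```python
import tensorflow as tf

# A minimal sketch of the shared "CNN Encoder": VGG16 with its three fully
# connected layers dropped. Input size matches KITTI images; the rest is
# illustrative, not the paper's actual implementation.
encoder = tf.keras.applications.VGG16(
    include_top=False,           # drop the three fully connected layers
    weights="imagenet",          # start from pretrained weights
    input_shape=(384, 1248, 3),  # (height, width, channels)
)

image = tf.zeros((1, 384, 1248, 3))   # dummy batch of one image
encoded_features = encoder(image)
print(encoded_features.shape)         # (1, 12, 39, 512): 512 features per region
```

The 39x12 grid comes from VGG’s five pooling layers, each of which halves the spatial resolution: 1248/32 = 39 and 384/32 = 12.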

The three output heads are more complex.

Classification: Actually, I just lied. This output head is pretty straightforward. The network applies a 1x1 convolution to the encoded features (I’m not totally sure why), then adds a fully connected layer and a softmax function. Easy.

(Update: Several commenters have added helpful explanations of 1x1 convolutional layers. My uncertainty was actually more about why MultiNet adds a 1x1 convolutional layer in this precise place. After chewing on it, though, I think I understand. Basically, the precise features encoded by the encoder sub-network may not be the best match for classification. Instead, the classification output performs best if the shared features are used to build a new set of features specifically tuned for classification. The 1x1 convolutional layer transforms the common encoded features into that new set of features specific to classification.)
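
For intuition, here’s a hedged sketch of what such a head might look like, continuing from the encoder sketch above. The layer widths and the number of classes are my guesses for illustration, not the paper’s values.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative classification head: a 1x1 convolution re-mixes the shared 512
# features into a classification-specific set, then a fully connected layer
# and a softmax produce class scores. Sizes are assumptions, not the paper's.
classification_head = tf.keras.Sequential([
    layers.Conv2D(30, kernel_size=1, activation="relu"),  # re-mix features per grid cell
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),                 # e.g. two scene classes (assumed)
])

# encoded_features comes from the encoder sketch earlier
class_scores = classification_head(encoded_features)
```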

Detection: This output head is complicated. The authors say it’s inspired by YOLO and Faster R-CNN, and it involves a series of 1x1 convolutions that output a tensor containing bounding box coordinates.

Remember, however, that the encoded features have spatial dimensions of only 39x12, while the original input image is a whopping 1248x384. Apparently 39x12 winds up being too coarse to produce accurate bounding boxes. So the network has “rezoom layers” that combine the first pass at bounding boxes with some of the less down-sampled VGG convolutional outputs.

The result is more accurate bounding boxes, but I can’t really say I understand how this works, at least on a first readthrough.
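
With that caveat, here’s a rough sketch of just the first pass of the detection head (the coarse grid of boxes before rezooming), again continuing from the encoder sketch. The channel counts and the per-cell output layout are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# First-pass detection head: 1x1 convolutions map each of the 39x12 grid cells
# to a confidence score plus bounding box parameters. The rezoom refinement
# described in the paper is omitted here.
detection_head = tf.keras.Sequential([
    layers.Conv2D(500, kernel_size=1, activation="relu"),
    layers.Conv2D(6, kernel_size=1),   # per cell: confidence + box parameters (assumed layout)
])

# encoded_features comes from the encoder sketch earlier
coarse_boxes = detection_head(encoded_features)
print(coarse_boxes.shape)              # (1, 12, 39, 6): one coarse prediction per grid cell
```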

Segmentation: The segmentation output head applies fully convolutional upsampling layers to blow the encoded features up from 39x12x512 back to the original image size of 1248x384x2.

The “2” at the end is because this head actually outputs a mask, not the original image. The mask is binary and just marks each pixel in the image as “road” or “not road”. This is actually how the network is scored for the KITTI leaderboard.
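
Here’s a simplified sketch of that upsampling path. The kernel and stride choices are illustrative, and I’ve left out the skip connections to earlier VGG layers that a full FCN-style decoder typically uses.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Simplified segmentation head: score each grid cell, then use a transposed
# convolution to upsample the 39x12 grid back to full 1248x384 resolution,
# with two output channels ("road" / "not road").
segmentation_head = tf.keras.Sequential([
    layers.Conv2D(2, kernel_size=1),                                        # per-cell road scores
    layers.Conv2DTranspose(2, kernel_size=64, strides=32, padding="same"),  # upsample 32x
    layers.Softmax(axis=-1),
])

# encoded_features comes from the encoder sketch earlier
road_mask = segmentation_head(encoded_features)
print(road_mask.shape)                 # (1, 384, 1248, 2)
```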

Training

The paper includes a detailed discussion of loss function and training. The main point that jumped out at me is that there are only 289 training images in the KITTI lane detection training set. So the network is basically relying on transfer learning from VGG.

It’s pretty amazing that any network can score at levels of 90%+ road accuracy, given a training set of only 289 images.

I’m also surprised that the 200,000 training steps don’t result in severe overfitting.
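
For intuition, a joint objective for a network like this might be assembled roughly as below: one loss term per output head, summed into a single scalar that is minimized end to end. The individual loss choices and the equal weighting are my stand-ins for illustration; the paper specifies its own losses for each head.

```python
import tensorflow as tf

# Hedged sketch of a joint loss over the three heads. Loss forms and weights
# are illustrative assumptions, not the paper's exact choices.
cls_loss_fn = tf.keras.losses.CategoricalCrossentropy()
seg_loss_fn = tf.keras.losses.CategoricalCrossentropy()  # per-pixel road / not-road
det_loss_fn = tf.keras.losses.Huber()                    # stand-in for box regression loss

def multinet_loss(cls_true, cls_pred, seg_true, seg_pred, det_true, det_pred):
    return (cls_loss_fn(cls_true, cls_pred)
            + seg_loss_fn(seg_true, seg_pred)
            + det_loss_fn(det_true, det_pred))
```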

Summary

MultiNet seems like a really neat network, in that it accomplishes several tasks at once, really fast. The writeup is also pretty easy to follow, so kudos to the authors for that.

If you’re so inclined, it might be worth downloading the KITTI dataset and trying some of this out on your own.
