(PPS) DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

Kevin Shen · Mini Distill · Jun 19, 2018


This paper deals with semantic segmentation: the task of labeling each pixel in an image by its object class. The goal of semantic segmentation is usually to derive precise object boundaries in an image, and it has far-reaching applications in autonomous driving and other fields. What’s cool about this particular paper is that it applies dilated convolutions to semantic segmentation and discusses their benefits in an actual application. In short, a dilated convolution is a way of designing your convolution filter so that the receptive field grows more quickly than with vanilla convolution filters. I’ll only go through the basic idea of dilated convolutions; Ferenc has a nice blog post with a more detailed explanation.

Global consistency is an important property in a lot of image processing tasks. For a model to be globally consistent means its output makes sense within the context of each region of the input image. As an example, suppose we want to segment cars and people in an image. We would want our segmentation to be globally consistent in that it takes into account the high-level objects in the image: knowing that some pixels belong to a car, the model should output circular wheels, and knowing that some pixels belong to a person, it should output an oval head. Note that we don’t have to explicitly build the idea of cars or people into our model; sometimes we can get global consistency for free. If our model weren’t globally consistent, it might overfit and produce rectangular wheels or square heads. Global consistency is closely tied to the idea of receptive fields: the receptive field of a location (u, v) in a feature map M is the set of pixels in the input image that affect the value of M(u, v). If the receptive field is too small, the output at a given location can’t “see” much of the input image and therefore will not be globally consistent.

First let’s consider the receptive field of regular convolution filters. A 3⨯3 filter has a receptive field of 3⨯3. Two 3⨯3 filters in series have a receptive field of 5⨯5: consider a value in the final feature map; it is affected by a 3⨯3 grid in the previous feature map, and each corner of that grid is in turn affected by a 3⨯3 grid around it in the input image. The receptive field of vanilla convolutions therefore grows linearly with the number of layers (3⨯3, 5⨯5, 7⨯7, etc.). Furthermore, even when the receptive field is 7⨯7, the effective receptive field is arguably smaller because points in the middle of the 7⨯7 box have much greater influence on the feature map than points near the edge. For image segmentation, this slow-growing receptive field is a problem because you may want to segment one region of the image based on what the image looks like in another region. Dilated convolutions aim to solve this problem.
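To make the arithmetic concrete, here is a minimal sketch of the standard receptive-field recurrence (stride-1 layers assumed throughout; the function name is my own):

```python
# Receptive field of a stack of stride-1 convolutions: each layer with
# kernel size k and dilation d adds (k - 1) * d to the receptive field.
def receptive_field(kernel_sizes, dilations):
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# Stacked vanilla 3x3 convolutions (dilation 1): linear growth.
print([receptive_field([3] * n, [1] * n) for n in (1, 2, 3)])  # [3, 5, 7]
```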

Below is a visualization of 3 layers of 3⨯3 dilated convolutions and the resulting receptive field:

3 dilated 3⨯3 convolution layers and resulting receptive field.

The red dots are where the dilated convolution is computed. Unlike a regular convolution, which multiplies a 3⨯3 filter against a contiguous 3⨯3 patch of the input, a dilated convolution multiplies a 3⨯3 filter against the red dots. Blue is a visualization of the receptive field for the center point in the third-layer feature map: the darker the square, the more influence it has on the value of that point.

In the first layer, we perform a regular 3⨯3 convolution at the red dots (shown in a). In the second layer, we perform another 3⨯3 convolution, but the points we convolve against are spread apart (red dots in b). Now our receptive field is 7⨯7 instead of the 5⨯5 we would get with 2 layers of normal convolutions. Finally we perform another 3⨯3 convolution with the points even further apart (red dots in c). The receptive field is now 15⨯15. The pattern is to increase the distance between the red points exponentially (in this case the dilation doubles with each layer: 1, 2, 4), so the receptive field grows exponentially as well. Using dilated convolutions allows distant regions of an image to affect each other.
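Here is a minimal PyTorch sketch of this three-layer stack, using the built-in `dilation` argument (the channel widths are arbitrary choices of mine, and padding is set so each layer preserves spatial size):

```python
import torch
import torch.nn as nn

# Three 3x3 convolutions with dilations 1, 2, 4, as in the figure.
# With kernel size 3, padding equal to the dilation gives "same" output size.
layers = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, dilation=1, padding=1),   # RF 3x3
    nn.Conv2d(16, 16, kernel_size=3, dilation=2, padding=2),  # RF 7x7
    nn.Conv2d(16, 16, kernel_size=3, dilation=4, padding=4),  # RF 15x15
)

x = torch.randn(1, 3, 64, 64)
print(layers(x).shape)  # torch.Size([1, 16, 64, 64]) -- resolution preserved
```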

The authors present a second motivation for using dilated convolutions. The output of semantic segmentation has the same dimensions as the input image, because the output specifies the class that each pixel belongs to. However, convolutions (especially when used with pooling) reduce the dimensions of the feature map, so what people tend to do is use an upsampling layer to return the feature map to the original resolution. This is shown in the top pathway below:

Comparison of regular convolution with upsampling versus dilated convolutions. Upsampling produces holes in the feature map.

The top pipeline uses regular convolution with downsampling (pooling) and upsampling (deconvolution). The result is holes in the output feature map. The bottom pipeline uses dilated convolution and gets a feature map without holes. The authors argue that holes in the feature map are detrimental to good semantic segmentation predictions.
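As a toy illustration of the two pipelines (the channel counts, strided convolution, and nearest-neighbor upsampling here are my own stand-ins, not the paper's exact layers):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)

# Top pipeline: downsample with a strided convolution, then upsample back.
# The result is 32x32 again, but each 2x2 block just repeats one value --
# the "holes" that had to be filled in by interpolation.
down = nn.Conv2d(8, 8, kernel_size=3, stride=2, padding=1)
up = nn.Upsample(scale_factor=2, mode="nearest")
top = up(down(x))

# Bottom pipeline: a dilated convolution at full resolution, no holes.
bottom = nn.Conv2d(8, 8, kernel_size=3, dilation=2, padding=2)(x)

print(top.shape, bottom.shape)  # both torch.Size([1, 8, 32, 32])
```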

The final trick the authors used was to post-process the semantic segmentation prediction using a fully connected Conditional Random Field (CRF). As seen below, the CRF helps produce sharper lines and corners in the output.

Semantic segmentation of a plane. The output of the CNN with dilated convolutions is blurry/smooth. Applying a CRF to the output makes the edges and corners sharper.
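For reference, here is a hedged sketch of this post-processing step using the open-source pydensecrf package, which implements a fully connected CRF of this kind; the kernel parameters below are illustrative values, not the paper's:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, n_iters=5):
    """Refine CNN softmax output `probs` (n_labels, H, W, float32) over
    the original RGB `image` (H, W, 3, uint8) with a dense CRF."""
    n_labels, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, n_labels)
    d.setUnaryEnergy(unary_from_softmax(probs))  # unary term: -log(probs)

    # Pairwise terms: a smoothness kernel on pixel position, and an
    # appearance kernel on position + color that encourages nearby pixels
    # with similar colors to take the same label (sharpening boundaries).
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)

    q = d.inference(n_iters)                   # approximate mean-field inference
    return np.argmax(q, axis=0).reshape(h, w)  # refined per-pixel label map
```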
