Computer Vision Part 6: Semantic Segmentation, classification on the pixel level.

Ilias Mansouri · Published in Analytics Vidhya · Dec 29, 2019
Semantic Segmentation using Deeplab

In the two previous chapters, we broadly discussed how neural network architectures are constructed, and the rationale behind them, to either classify images or to detect objects within an image and draw bounding boxes around them. It is possible to go even finer in granularity by looking at each pixel and determining to which object or class it belongs. As briefly outlined here and seen below, semantic segmentation assigns each pixel to a class but does not distinguish multiple occurrences within the same class, whereas instance segmentation makes this differentiation and identifies unique occurrences within a category.

Semantic Segmentation vs Instance Segmentation.

1. Concepts

We could simply stack a set of convolutional layers which, as we know, capture local features in images, creating a hierarchy that extracts increasingly broad structure. Through successive convolutional layers that capture increasingly complex features in the image, a CNN can encode an image as a compact representation of its contents. The architecture then learns a direct mapping between the input image and its segmentation output through this hierarchical representation. For such a stack of conv layers, 'same' padding must be used across all layers, which preserves the resolution of the input image but greatly increases the computational cost.

From previous parts of this series, we know that through successive convolutional layers a CNN can capture increasingly complex features in the image by increasing the number of feature maps. Additionally, compressing the spatial resolution with pooling and/or strided convolutions lowers the computational load. This encoder configuration is perfectly suited for image classification because it only cares about the content of the image and not the location. For the segmentation task, however, a full-resolution mask of the pixel-wise predictions is needed. As such, it is not uncommon to find architectures with an encoder-decoder structure as seen above.
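To make the encoder-decoder idea concrete, here is a minimal sketch, assuming PyTorch; the layer sizes, class count and depth are illustrative and do not correspond to any published architecture. The encoder halves the resolution twice and the decoder restores it, ending with a per-pixel class score map.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                      # 1/2 resolution
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                      # 1/4 resolution
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU(inplace=True),  # back to 1/2
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(inplace=True),   # full resolution
            nn.Conv2d(32, num_classes, 1),        # pixel-wise class scores
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

scores = TinySegNet()(torch.randn(1, 3, 128, 128))
print(scores.shape)  # torch.Size([1, 21, 128, 128])
```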

A convolutional decoder takes the low-resolution output of the convolutional encoder and upsamples it. The last step of this decoder is to generate an array which stores the pixel-wise labeling of the image. How does the upsampling happen? Intuitively, we could do the reverse of pooling (unpooling), which consists of taking a single pixel value and distributing it over a higher-resolution region. But what if the architecture could learn how best to upsample? This is where transposed convolutions come into the picture.

convolution vs transposed convolution

As seen above, whereas a convolution takes the dot product of the input with the kernel, the transposed convolution takes each input value and multiplies it with all the kernel weights, scattering the results onto the output. Transposed convolution can be easily implemented since the forward and backward passes of a convolution are simply reversed.
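A tiny sketch in PyTorch (shapes chosen for illustration): a stride-2 convolution halves the spatial size, while a stride-2 transposed convolution learns to map it back to the original size.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)
down = nn.Conv2d(1, 1, kernel_size=2, stride=2)        # 4x4 -> 2x2
up = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2)  # 2x2 -> 4x4, weights are learned
print(down(x).shape)      # torch.Size([1, 1, 2, 2])
print(up(down(x)).shape)  # torch.Size([1, 1, 4, 4])
```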

2. Architectures

2.1. Fully Convolutional Network

In 2015, Long et al. introduced the first method to train a fully convolutional network by adapting classification networks (AlexNet, GoogLeNet and VGG) and fine-tuning them for the segmentation task.

Adapting classifier architectures to fully convolutional versions happens by converting the fully connected layers into convolutions whose kernels span the entire input region. This conversion has an important consequence: the converted network outputs a coarse heatmap of class scores, which makes training quite straightforward for both the forward and backward passes.

Overview of the FCN Architecture. Convolutional layers and converted fully connected layers are omitted.

Above we can observe the FCN architecture. We see that the encoder compresses the image into a lower-resolution representation. The grids representing the pooling layers also illustrate the relative spatial coarseness. After the 5th pooling layer, the resolution has decreased 32 times, which results in a coarse segmentation after the upsampling operation. In the paper, the balance between finding the what and the where is discussed. To understand what is present in an image, global (coarse) information is necessary, whereas to pinpoint where it is present, local (fine) information is necessary. As such, skip connections (not unlike ResNets) and gradual upsampling were combined.
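A minimal sketch of this skip-connection idea in PyTorch, assuming FCN-16s-style fusion; tensor shapes and the bilinear upsampling (standing in for the paper's learned deconvolutions) are illustrative. Scores predicted from an intermediate (stride-16) feature map are added to the upsampled coarse (stride-32) scores before the final upsampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 21
pool4 = torch.randn(1, 512, 32, 32)                   # stride-16 backbone features (illustrative)
coarse_scores = torch.randn(1, num_classes, 16, 16)   # stride-32 class scores (illustrative)

score_pool4 = nn.Conv2d(512, num_classes, 1)(pool4)   # 1x1 conv -> class scores at stride 16
fused = F.interpolate(coarse_scores, scale_factor=2, mode="bilinear",
                      align_corners=False) + score_pool4          # combine coarse "what" with finer "where"
full_res = F.interpolate(fused, scale_factor=16, mode="bilinear", align_corners=False)
print(full_res.shape)  # torch.Size([1, 21, 512, 512])
```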

2.2. U-Net

Based on the FCN, O. Ronneberger et al. (2015) presented a network that "consists of a contracting path to capture context and a symmetric expanding path that enables precise localization", along with a training strategy "that relies on the strong use of data augmentation to use the available annotated samples more efficiently."

U-Net architecture where blue boxes correspond to multi-channel feature maps with the number of channels denoted on top of each box. The lower left edge of each box represents the x-y-size and white boxes represent copied feature maps.

As can be seen in the figure above, the first part consists of the usual set of convolution, ReLU and max pooling operations. Each 2x2 max pooling operation with stride 2 results in a downsampling step where the size is halved and the number of feature channels is doubled. The second part consists of a sequence of upsampling the feature map, concatenating the corresponding cropped feature map from the contracting path (cropping is necessary due to the loss of border pixels after each convolution), convolving and applying ReLU.

Similarly to FCN, high-resolution features from the contracting path are combined with the upsampled output, which is then fed to a series of convolutional layers. The main difference with FCN is that in the upsampling part a large number of feature channels is present, which allows the network to propagate contextual information to higher-resolution layers. Furthermore, the network does not contain fully connected layers but only uses convolutional outputs.
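One decoder step of this scheme might look like the following PyTorch sketch (channel counts and sizes are illustrative): upsample with a transposed convolution, center-crop the skip feature map from the contracting path to compensate for the border pixels lost by unpadded convolutions, concatenate along the channel axis, then convolve.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def center_crop(feat, target_hw):
    # Crop a feature map to the target height/width around its center.
    _, _, h, w = feat.shape
    th, tw = target_hw
    top, left = (h - th) // 2, (w - tw) // 2
    return feat[:, :, top:top + th, left:left + tw]

skip = torch.randn(1, 128, 136, 136)   # from the contracting path (illustrative)
bottom = torch.randn(1, 256, 64, 64)   # from the layer below (illustrative)

up = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)(bottom)  # -> 128x128
merged = torch.cat([center_crop(skip, up.shape[2:]), up], dim=1)    # 256 channels
out = F.relu(nn.Conv2d(256, 128, kernel_size=3)(merged))            # unpadded 3x3 conv
print(out.shape)  # torch.Size([1, 128, 126, 126])
```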

2.3. FC-DenseNet

The FC-DenseNet, or One Hundred Layers Tiramisu, is a segmentation technique built upon the DenseNet architecture for image classification. DenseNet is based on the paradigm where shortcut connections are made from early layers to later layers. What makes DenseNet so special is that all layers are connected with each other.

A 5-layer dense block with a growth rate of k = 4, where k refers to the number of feature maps each layer contributes to the subsequent layers. Each layer takes all preceding feature-maps as input.

Each layer passes its own feature-maps to all subsequent layers. Where ResNet uses element-wise addition to combine features, DenseNet uses concatenation. As such, each layer receives a collective set of knowledge from all preceding layers. Perhaps counterintuitively, this requires fewer parameters than traditional approaches, as there is no need to relearn redundant feature-maps.
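A sketch of this dense connectivity pattern in PyTorch (channel counts, depth and the bare 3x3 convolutions are illustrative, not the full DenseNet composite function): each layer receives the concatenation of everything produced so far and contributes k new feature maps.

```python
import torch
import torch.nn as nn

k, in_channels, num_layers = 4, 16, 5   # growth rate, block input channels, layers per block
layers = nn.ModuleList(
    nn.Conv2d(in_channels + i * k, k, kernel_size=3, padding=1)
    for i in range(num_layers)
)

x = torch.randn(1, in_channels, 32, 32)
features = [x]
for layer in layers:
    new = torch.relu(layer(torch.cat(features, dim=1)))  # use everything produced so far
    features.append(new)

block_output = torch.cat(features, dim=1)   # in_channels + num_layers * k feature maps
print(block_output.shape)  # torch.Size([1, 36, 32, 32])
```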

Below, we find an overview of the architecture. Each layer produces k output feature-maps, which coincides with the aforementioned growth rate. These are then fed through the bottleneck (a 1x1 convolution) of the next composite block to reduce the number of input feature-maps and thus improve its computational efficiency. Before being passed to the next dense block, the feature maps are compressed by going through a transition layer.

Below, we can observe how DenseNet was adapted to create FC-DenseNet for image segmentation. We can discern an encoder-decoder structure similar to the one found in the FCN and U-Net.

Architecture of FC-DenseNet
Diagram of 4 layer Dense Block

For segmentation, dense blocks contain only a succession of composite blocks where a dropout of 0.2 is applied. Care must be taken to avoid an explosion of feature maps. As such, each layer within a block only receives the concatenation built up so far, and the last layer's output is the concatenation of the outputs of all layers in the block, thus containing 4 * k feature maps for a 4-layer block. The transition up/down convolution which occurs after a dense block is applied only to the feature maps produced by that last dense block, and not to the full concatenation, because the linear growth in the number of features would otherwise be too memory demanding.

Building blocks of Transition Down/Up convolutions

Due to the pooling layers, some information from earlier dense blocks is lost. Hence the skip connections, which help reuse feature maps so that the upsampling path can recover some spatially detailed information from the downsampling path.

2.4. DeepLab

No reputable post would dare to call itself complete without mentioning Google's take on the challenge. As such, true to Google's modus operandi, we will take a look at a sequence of interesting improvements over a main concept which debuted in 2015.

2.4.1. DeepLab v1

DeepLab v1 introduces 2 main ideas: Atrous Convolution and Fully Connected Conditional Random Field (FC CRF). Below, we find the architecture:

A Deep Convolutional Neural Network (DCNN) backbone (VGG or ResNet) is used with atrous convolution to downsample the signal. Afterwards, the feature map is enlarged to the original size. Finally, a FC CRF is applied to refine the segmentation result.

Atrous convolution gets its name from the French "à trous", meaning "with holes". Atrous convolution is also referred to as dilated convolution. Below, we see that a dilated convolution is a standard convolution where you uniformly skip a number of pixels in both dimensions. To be more precise, below we observe a dilated convolution of rate 2, as it samples every 2 pixels from the input to pass to the convolutional kernel.

We see that the receptive field of a dilated convolution is larger compared to the standard equivalent. Why is this relevant you may wonder?

All previously discussed architectures were using a multi-scale CNN relying on spatial pooling to create an encoder-decoder type of structure. This was done to combine and balance the:

  • pixel-level accuracy
  • global knowledge of the image

Instead of your typical pooling, DeepLab uses dilated layers to address this balancing issue. By controlling the field-of-view of atrous convolutions we can find the best trade-off between accurate localization (small field-of-view) and context assimilation (large field-of-view) without increasing the number of parameters too much. This can be seen below:

Multiple layers of dilated convolutions, where the dilation factor increases exponentially after each layer, result in an effective receptive field which also grows exponentially. The receptive field grows (exponentially) at a faster rate than the number of parameters (linearly).
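A small PyTorch sketch of this effect (dilation rates and input size are illustrative): doubling the dilation at every 3x3 layer keeps the resolution and the parameter count per layer constant while the receptive field grows rapidly.

```python
import torch
import torch.nn as nn

# 3x3 convolutions with dilation doubling each layer; padding=d preserves the size.
layers = nn.Sequential(*[
    nn.Conv2d(1, 1, kernel_size=3, dilation=d, padding=d)
    for d in (1, 2, 4, 8)
])

x = torch.randn(1, 1, 64, 64)
print(layers(x).shape)  # torch.Size([1, 1, 64, 64]) -- resolution preserved
# Receptive field after successive layers: 3 -> 7 -> 15 -> 31 pixels,
# while each layer still only has a single 3x3 kernel's worth of weights.
```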

Once the feature maps are created and upscaled, a FC CRF is applied. A CRF is a statistical learning method that takes context into account. This context can be understood as dependencies between predictions. In Natural Language Processing, CRFs model sequential dependencies between predictions, whereas in Computer Vision the dependencies are defined between nearby pixels. As the name implies, a FC CRF uses all pixel pairs to create a long-range model which can be used to smooth noisy segmentation maps. The main challenge here is the computational explosion due to the fully connected nature of the model. In 2012, a paper introduced a highly efficient inference algorithm for FC CRF models. This model was ultimately used in DeepLab v1 and DeepLab v2, with the consequence that neither architecture can be trained as an end-to-end learning framework, since the CRF is applied as a separate post-processing step.
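For reference, the energy the fully connected CRF minimizes, as formulated in the DeepLab paper, combines a unary term taken from the CNN's per-pixel label probabilities with a pairwise term defined over all pixel pairs:

$$E(\mathbf{x}) = \sum_i \theta_i(x_i) + \sum_{i,j} \theta_{ij}(x_i, x_j), \qquad \theta_i(x_i) = -\log P(x_i)$$

where the pairwise potential $\theta_{ij}$ penalizes assigning different labels to pixels that are close in position and similar in color, using Gaussian kernels.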

The first row is the score/feature map and the second row represents the result of the softmax function.

2.4.2. DeepLab v2

Where DeepLab v1 uses a VGGNet as backbone, DeepLab v2 uses a ResNet and also introduces Atrous Spatial Pyramid Pooling (ASPP). These are the main distinctions between the two architectures. As such, let us examine ASPP, departing from DeepLab v1 as a base.

As the name suggests, this is essentially an atrous version of SPP. In 2014, Spatial Pyramid Pooling was introduced, addressing the concern that CNNs required fixed-size input images because fully connected layers have a fixed-size input by design. It is therefore the transition from the convolutional layers to the fully connected layers which imposes this size restriction. SPP is a new layer placed between the convolutional layers and the fully connected layers to map inputs of any size to a fixed-size output. Below, an illustration of SPP can be seen. The feature maps of the last convolutional layer are divided into spatial bins whose sizes are proportional to the image size, meaning that the number of bins is fixed regardless of image size. As we see, bins are created at different levels of granularity, with the coarsest level consisting of a single bin covering the whole image. Finally, each spatial bin of each filter is pooled using max pooling. Since the number of bins is known, we can concatenate the different outputs into a fixed-length representation, an F*B-dimensional vector where F is the number of filters and B the number of bins, which is then fed to the fully connected layers.
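A minimal PyTorch sketch of the SPP idea (bin sizes and channel count are illustrative): adaptive max pooling produces a fixed number of bins per level no matter the input size, so the concatenated vector always has the same length.

```python
import torch
import torch.nn as nn

def spp(feature_maps, bin_sizes=(4, 2, 1)):
    # Pool into 4x4, 2x2 and 1x1 bins, flatten each level and concatenate.
    pooled = [nn.AdaptiveMaxPool2d(b)(feature_maps).flatten(start_dim=1)
              for b in bin_sizes]
    return torch.cat(pooled, dim=1)   # length = F * (16 + 4 + 1)

print(spp(torch.randn(1, 256, 13, 13)).shape)  # torch.Size([1, 5376])
print(spp(torch.randn(1, 256, 24, 17)).shape)  # torch.Size([1, 5376]) -- same length
```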

Atrous SPP is similar to SPP but instead of using bins it will use multiple filters in parallel with different sampling rates. The extracted features are then fused to generate the final result.

ASPP addresses the fact that objects of the same class can have different scales in an image which leads to improved accuracy.
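The sketch below shows the ASPP idea in PyTorch; the dilation rates follow the commonly cited (6, 12, 18, 24) configuration, while the channel counts and the fusion choice (concatenation followed by a 1x1 convolution, as in later DeepLab versions, rather than summing per-branch scores) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates over one feature map."""
    def __init__(self, in_ch=256, out_ch=256, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)  # fuse the multi-scale branches

    def forward(self, x):
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))

print(ASPP()(torch.randn(1, 256, 32, 32)).shape)  # torch.Size([1, 256, 32, 32])
```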

2.4.3. DeepLab v3

In Rethinking Atrous Convolution for Semantic Image Segmentation, the main concepts from DeepLab v2 were revised and improved upon, resulting in a new architecture which significantly outperforms the previous ones. The main improvements that we will discuss are:

  • Atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates
  • Improvements on ASPP

Cascaded modules without and with atrous convolution; atrous convolution with rate > 1 is applied after block3 when output stride = 16

First of all, we can see above that block 4 is duplicated several times, with atrous convolution applied in cascade. Compared to the architecture without atrous convolution, we can keep the output stride constant while enlarging the field-of-view with a minimal number of extra parameters, and we retain larger feature maps, making it easier to capture long-range information in the deeper blocks.

Secondly, ASPP was revisited by introducing four parallel atrous convolutions with different atrous rates, applied on top of the feature map to capture multi-scale information. However, it was discovered that "as the sampling rate becomes larger, the number of valid filter weights becomes smaller". To counter this, global average pooling is applied on the last feature map. The result is then fed to a 1x1 convolution, batch normalized, and finally bilinearly upsampled to the required dimensions.

Parallel modules with atrous convolution (ASPP), augmented with image-level features
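A sketch of that image-level feature branch in PyTorch (sizes illustrative): global average pooling collapses the feature map to 1x1, a 1x1 convolution with batch norm processes it, and bilinear upsampling brings it back to the feature map's size so it can be concatenated with the other ASPP branches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 256, 32, 32)   # illustrative encoder feature map

branch = nn.Sequential(
    nn.Conv2d(256, 256, 1),
    nn.BatchNorm2d(256),   # eval mode below: a single 1x1 sample can't be batch-normed in training
    nn.ReLU(inplace=True),
).eval()

pooled = F.adaptive_avg_pool2d(x, 1)                       # (1, 256, 1, 1) image-level features
image_features = F.interpolate(branch(pooled), size=x.shape[2:],
                               mode="bilinear", align_corners=False)
print(image_features.shape)  # torch.Size([1, 256, 32, 32])
```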

2.4.4. DeepLab v3+

Below, we can see that the encoder part is already provided by DeepLab v3. By extending DeepLab v3 with a decoder module, v3+ is born.

Combination of SPP (left) and encoder-decoder structure (middle) results in DeepLab v3+ (right) where the encoder part provides for rich semantic information while more detailed object boundary info is provided from the decoder.

DeepLab v3+ also modifies atrous convolution based on depthwise separable convolution. A normal convolution can be factorized into a depthwise convolution followed by a pointwise convolution. By using depthwise separable convolutions, one can drastically reduce the amount of computation needed. Applying the same factorization to atrous convolutions was shown to reduce the computational complexity as well.
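The following PyTorch sketch shows the factorization for an atrous case (channel counts and rate are illustrative): a depthwise 3x3 convolution with dilation, one filter per channel, followed by a 1x1 pointwise convolution, uses far fewer parameters than the equivalent full 3x3 convolution.

```python
import torch
import torch.nn as nn

in_ch, out_ch, rate = 256, 256, 2
depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=rate, dilation=rate, groups=in_ch)  # per-channel 3x3
pointwise = nn.Conv2d(in_ch, out_ch, 1)                                            # mix channels

x = torch.randn(1, in_ch, 32, 32)
print(pointwise(depthwise(x)).shape)  # torch.Size([1, 256, 32, 32])

full = nn.Conv2d(in_ch, out_ch, 3, padding=rate, dilation=rate)  # the unfactorized equivalent
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(depthwise) + count(pointwise), "vs", count(full))    # ~68k vs ~590k parameters
```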

Below, we can observe the architecture for DeepLab v3+. From this illustration it is clear to see that DeepLab v3 indeed represents the encoder.

The features from the encoder are first bilinearly upsampled by a factor of 4 and then concatenated with the corresponding low-level features from the backbone network. These low-level features are first passed through a 1x1 convolution to reduce their number of channels, which would otherwise outweigh the importance of the richer encoder features. Afterwards, another set of convolutions is applied before upsampling again by a factor of 4. These last 3x3 convolutions help refine the features before scaling the prediction up.
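Put together, the decoder could look like this PyTorch sketch (channel counts, the 48-channel reduction and the class count are illustrative): upsample 4x, concatenate with the 1x1-reduced low-level features, refine with 3x3 convolutions, then upsample 4x to full resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 21
encoder_out = torch.randn(1, 256, 32, 32)    # DeepLab v3 encoder output, output stride 16
low_level = torch.randn(1, 256, 128, 128)    # early backbone features, output stride 4

low = nn.Conv2d(256, 48, 1)(low_level)       # reduce channels so they don't dominate
up = F.interpolate(encoder_out, scale_factor=4, mode="bilinear", align_corners=False)
refined = nn.Sequential(
    nn.Conv2d(256 + 48, 256, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, num_classes, 1),
)(torch.cat([up, low], dim=1))
logits = F.interpolate(refined, scale_factor=4, mode="bilinear", align_corners=False)
print(logits.shape)  # torch.Size([1, 21, 512, 512])
```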

As backbone, both ResNet-101 and Xception were investigated, and it was shown that state-of-the-art (SOTA) results were achieved with a DeepLab v3+ Xception architecture on the PASCAL VOC 2012 dataset.

2.5. Fast FCN

In 2019, the principle of dilated convolution was replaced with Joint Pyramid Upsampling (JPU). The authors discuss the heavy computational complexity and memory footprint introduced by dilated convolutions, aiming to tackle these issues with JPU.

With the same backbone as the dilated FCN, the JPU module takes the last three feature maps as input and generates a high-resolution feature map.

The difference between Fast FCN's backbone and DilatedFCN's lies in the last two convolution stages. In DilatedFCN, the input feature map is first processed by a regular convolution layer and then by a series of dilated convolutions. Fast FCN conceptually processes the input feature map with a strided convolution and then employs several regular convolutions to generate the output, which lightens the computational burden compared to DilatedFCN. Conceptually, because that is the main gist of the idea: it was found that a long convergence time was required during gradient descent. As such, JPU was created to approximate this optimization process.

JPU block

As we see above, each feature map is passed through a regular convolutional block. Afterwards, the feature maps are upsampled and concatenated, and the result is passed through four parallel convolutions with different dilation rates. Finally, the convolutions' outputs are concatenated again and passed through a final convolution layer. Where ASPP only exploits the information in the last feature map, JPU extracts multi-scale context information from multi-level feature maps, which leads to better performance.
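A JPU-style sketch in PyTorch (channel counts, spatial sizes and the use of plain instead of separable convolutions are illustrative assumptions): the three backbone feature maps are convolved, brought to the same resolution, concatenated, and then processed by parallel dilated convolutions before a final fusion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feats = [torch.randn(1, 512, 64, 64),    # stride 8
         torch.randn(1, 1024, 32, 32),   # stride 16
         torch.randn(1, 2048, 16, 16)]   # stride 32

convs = nn.ModuleList(nn.Conv2d(c, 128, 3, padding=1) for c in (512, 1024, 2048))
merged = torch.cat([
    F.interpolate(conv(f), size=(64, 64), mode="bilinear", align_corners=False)
    for conv, f in zip(convs, feats)
], dim=1)                                                        # (1, 384, 64, 64)

dilated = torch.cat([
    nn.Conv2d(384, 64, 3, padding=r, dilation=r)(merged) for r in (1, 2, 4, 8)
], dim=1)                                                        # (1, 256, 64, 64)
out = nn.Conv2d(256, 256, 1)(dilated)                            # final fusion convolution
print(out.shape)  # torch.Size([1, 256, 64, 64])
```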

2.6. Gated-SCNN

In the summer of 2019, Gated Shape CNNs for Semantic Segmentation was introduced. GSCNN proposes a new CNN architecture with 2 streams: the classical stream as discussed in the previous architectures, and a shape stream. The rationale is that there is an inherent inefficiency in the aforementioned architecture designs, since color, shape and texture information are all processed together inside one deep CNN, while they likely contain very different amounts and types of information.

GSCNN Architecture. The regular stream can be any backbone architecture. The shape stream focuses on shape processing through a set of residual blocks and gated convolutional layers. Lastly, the 2 streams are fused with an Atrous Spatial Pyramid Pooling for a refined semantic segmentation output.

Although the tasks of semantic segmentation and semantic boundary detection are closely related, the shape stream does not directly fuse the regular stream's features into its own representation. Instead, Gated Convolutional Layers (GCL) help the shape stream process only the relevant information by filtering out the rest. The regular stream forms a high-level understanding of the scene; using GCL, we can ensure that the shape stream focuses only on the boundary-relevant information, which is then passed on to the next layer in the shape stream for further processing. Intuitively, the shape stream can be seen as a succession of processes which generates an attention map where areas with important boundary information increasingly get heavier weights. Finally, since ground-truth edges can be derived from the semantic segmentation masks, a supervised binary cross-entropy loss can be used on the output boundaries to supervise the shape stream.
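A simplified sketch of the gating idea in PyTorch (this is an illustration of attention-style gating, not the exact GCL formulation from the paper; all shapes and channel counts are assumptions): a 1x1 convolution over the concatenated shape- and regular-stream features produces a sigmoid attention map that re-weights the shape features pixel by pixel.

```python
import torch
import torch.nn as nn

shape_feat = torch.randn(1, 32, 128, 128)     # shape stream features (illustrative)
regular_feat = torch.randn(1, 32, 128, 128)   # regular stream features, resized to match

attention = torch.sigmoid(
    nn.Conv2d(64, 1, 1)(torch.cat([shape_feat, regular_feat], dim=1))
)
gated = shape_feat * attention                # down-weight boundary-irrelevant regions
print(gated.shape, attention.shape)  # torch.Size([1, 32, 128, 128]) torch.Size([1, 1, 128, 128])
```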

The last step is to fuse the regular and shape streams using ASPP to ensure that multi-scale contextual information is preserved. This improvement leads to an architecture that produces sharper predictions around object boundaries and significantly boosts performance on thinner and smaller objects.

Conclusion

We have discussed several pivotal architectures for semantic segmentation. In each of them, we can observe an encoder-decoder structure whose goal is to extract and combine fine-grained location information with coarse-grained content information. Segmentation is critical in applications such as autonomous driving, healthcare, robotic navigation, localization, and scene understanding.

In the next chapter, we will discuss instance segmentation, which provides us a means to obtain the individual instances of all classes in an image.
