Fast-SCNN explained and implemented using Tensorflow 2.0

Kshitiz Rimal
Deep Learning Journal
8 min read · May 7, 2019
Fig 1: Image extracted from the original paper: https://arxiv.org/abs/1902.04502

Fast Segmentation Convolutional Neural Network (Fast-SCNN) is an above-real-time semantic segmentation model for high-resolution image data, suited to efficient computation on embedded devices with low memory. The authors of the original paper are Rudra PK Poudel, Stephan Liwicki and Roberto Cipolla. The code used in this article is not the official implementation from the authors but my own attempt to reconstruct the model as described in the paper.

With the rise of autonomous vehicles, there is a strong demand for models that can process input in real time. Several state-of-the-art offline semantic segmentation models already exist, but they are large, memory-hungry and computationally expensive. Fast-SCNN addresses all of these problems.

Some key aspects of Fast-SCNN are:

  1. Above-real-time segmentation on high-resolution images (1024 x 2048 px)
  2. Yields an accuracy of 68% mean intersection-over-union (mIoU)
  3. Processes input at 123.5 frames per second on the Cityscapes dataset
  4. No large pre-training required
  5. Combines spatial detail at high resolution with deep features extracted at lower resolution

Moreover, Fast-SCNN uses techniques found in other state-of-the-art models to achieve the performance listed above: the Pyramid Pooling Module (PPM) used in PSPNet, the inverted residual bottleneck layers used in MobileNetV2, and the feature fusion module used in models such as ContextNet, which combines deep features extracted from low-resolution data with spatial detail from high-resolution data for better and faster segmentation.

Let us now begin with the exploration and implementation of Fast-SCNN. Fast-SCNN is constructed from 4 major building blocks. They are:

  1. Learning to Down-sample
  2. Global Feature Extractor
  3. Feature Fusion
  4. Classifier
Table 1: Fast-SCNN architecture as described in the paper

1. Learning to Down-sample

We know that the first few layers of a deep convolutional neural network extract low-level features, such as edges and corners, from the image. To exploit this and make these features reusable by later layers, the Learning to Down-sample module is used. It is a coarse global feature extractor that can be reused and shared by other modules in the network.

The Learning to Down-sample module uses 3 layers to extract these global features: a Conv2D layer followed by 2 depthwise separable convolutional layers. In the implementation, each Conv2D and depthwise separable convolution layer is followed by a BatchNorm layer and a ReLU activation, as is standard practice after such layers. All 3 layers use a stride of 2 and a kernel size of 3x3.

Now, let us begin by implementing this module. First, let's install Tensorflow 2.0. It's easier than ever to do this now: we can simply use Google Colab and begin our implementation. As of now, only the alpha version of Tensorflow 2.0 is available, which you can install using the following command:

!pip install tensorflow-gpu==2.0.0-alpha0

Here, ‘-gpu’ means that my Google Colab notebook uses a GPU; if you prefer not to use one, simply remove ‘-gpu’ and the Tensorflow installation will use the CPU of the system instead.

After that, let’s import Tensorflow:

import tensorflow as tf

Now, let’s first create the Input layer for our model. In Tensorflow 2.0, using the TF.Keras high-level API, we can do so as follows:
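The exact input shape below is my assumption, based on the 1024 x 2048 px Cityscapes resolution used in the paper; the variable name input_layer is also my own choice.

# Input layer: height x width x channels for full-resolution Cityscapes images
input_layer = tf.keras.layers.Input(shape=(1024, 2048, 3), name='input_layer')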

This Input layer is our entry point to the model we are going to build. Here we are using the TF.Keras functional API. The reason to use the functional API instead of the Sequential API is that it provides the flexibility we need to construct this particular model.

Moving on, let us now define the layers of the Learning to Down-sample module. To make the process easy and reusable, I have created a custom function that checks whether the layer to add is a Conv2D layer or a depthwise separable layer, and whether to add a ReLU at the end of the layer. This code block, shown below, keeps the implementation easy to understand and is reused throughout.

In TF.Keras, the convolutional layer is defined as tf.keras.layers.Conv2D and the depthwise separable layer as tf.keras.layers.SeparableConv2D.
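Here is a sketch of what such a helper can look like; the signature and the 'conv'/'ds' flag are my own choices.

def conv_block(inputs, conv_type, filters, kernel_size, strides, padding='same', relu=True):
    # Choose between a standard convolution and a depthwise separable convolution.
    if conv_type == 'ds':
        x = tf.keras.layers.SeparableConv2D(filters, kernel_size, padding=padding, strides=strides)(inputs)
    else:
        x = tf.keras.layers.Conv2D(filters, kernel_size, padding=padding, strides=strides)(inputs)
    # BatchNorm after every convolution, with an optional ReLU on top.
    x = tf.keras.layers.BatchNormalization()(x)
    if relu:
        x = tf.keras.layers.Activation('relu')(x)
    return x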

Now, let’s add the layers for the module by simply calling our custom function with the proper parameters:
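A sketch using that helper; the filter counts 32, 48 and 64 follow Table 1, while the variable name lds_layer is mine.

# Learning to Down-sample: Conv2D + two depthwise separable convolutions,
# all with 3x3 kernels and stride 2.
lds_layer = conv_block(input_layer, 'conv', 32, (3, 3), strides=(2, 2))
lds_layer = conv_block(lds_layer, 'ds', 48, (3, 3), strides=(2, 2))
lds_layer = conv_block(lds_layer, 'ds', 64, (3, 3), strides=(2, 2))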

2. Global Feature Extractor

This module aims to capture the global context for segmentation. It takes the output of the Learning to Down-sample module directly. In this section, different bottleneck residual blocks are introduced, along with a special module called the Pyramid Pooling Module (PPM), which aggregates different region-based context information.

Let’s begin with the bottleneck residual block.

Table 2: Bottleneck residual block from the paper

Above is the description of the bottleneck residual block from the paper. Following it, let us now implement the block using the tf.keras high-level API.

We begin by defining some custom functions as per the table above. The residual block calls our custom conv_block function to add a 1x1 Conv2D, then adds a depthwise convolution layer followed by a point-wise convolution layer, as described in the table. Finally, the output of the point-wise convolution is added to the original input to make the block residual.
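A sketch of such a block; the function name and argument names are my own, while the expand / depthwise / project structure follows Table 2.

def _res_bottleneck(inputs, filters, kernel, t, s, r=False):
    # Expansion: 1x1 convolution that widens the channels by the factor t.
    tchannel = tf.keras.backend.int_shape(inputs)[-1] * t
    x = conv_block(inputs, 'conv', tchannel, (1, 1), strides=(1, 1))

    # Depthwise convolution with the given stride, plus BatchNorm and ReLU.
    x = tf.keras.layers.DepthwiseConv2D(kernel, strides=(s, s), depth_multiplier=1, padding='same')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation('relu')(x)

    # Linear point-wise projection back to 'filters' channels (no ReLU).
    x = conv_block(x, 'conv', filters, (1, 1), strides=(1, 1), relu=False)

    # Residual connection, used only when input and output shapes match.
    if r:
        x = tf.keras.layers.add([x, inputs])
    return x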

The bottleneck residual block here is inspired by the implementation used in MobileNetV2.

This bottleneck residual block is added multiple times in the architecture; the number of repetitions is denoted by the ’n’ parameter in Table 1 above. So, as per the architecture described in the paper, in order to add it ’n’ times, we introduce another custom function that does just that.
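A sketch of that repetition helper; only the first block in each stage uses the given stride, and the remaining n-1 blocks use stride 1 with a residual connection. The names are again my own.

def bottleneck_block(inputs, filters, kernel, t, strides, n):
    # The first block may change stride and channel count, so no residual connection.
    x = _res_bottleneck(inputs, filters, kernel, t, strides)
    # The remaining n-1 blocks keep the shape and add the skip connection.
    for _ in range(1, n):
        x = _res_bottleneck(x, filters, kernel, t, 1, r=True)
    return x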

Now let’s add these bottleneck blocks to our model.
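A sketch of the three bottleneck stages; the expansion factor t=6, the filter counts, strides and repetition counts follow Table 1, and gfe_layer is my own variable name.

gfe_layer = bottleneck_block(lds_layer, 64, (3, 3), t=6, strides=2, n=3)
gfe_layer = bottleneck_block(gfe_layer, 96, (3, 3), t=6, strides=2, n=3)
gfe_layer = bottleneck_block(gfe_layer, 128, (3, 3), t=6, strides=1, n=3)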

Here, you will notice that the first input to these bottleneck blocks is the output of the Learning to Down-sample module.

The final block of the Global Feature Extractor section is the Pyramid Pooling Module, or PPM for short.

Fig 2: Diagram taken from PSPNet original paper: https://arxiv.org/abs/1612.01105

The PPM takes the feature maps from the last convolutional layer, applies average pooling and upscaling over multiple sub-region sizes to harvest different sub-region representations, and then concatenates them. The result carries both local and global context information from the image, making the segmentation more accurate.

To implement this using TF.Keras, I have used another custom function:
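Here is a sketch of that function, assuming the input resolution chosen earlier so that the feature map's static height and width are known; the 1x1 convolution after each pooling step and the bilinear resize are my own choices in the spirit of PSPNet.

def pyramid_pooling_block(input_tensor, bin_sizes):
    concat_list = [input_tensor]
    # Static spatial size of the incoming feature map.
    _, h, w, _ = tf.keras.backend.int_shape(input_tensor)

    for bin_size in bin_sizes:
        # Average-pool the map into roughly bin_size x bin_size sub-regions.
        x = tf.keras.layers.AveragePooling2D(
            pool_size=(h // bin_size, w // bin_size),
            strides=(h // bin_size, w // bin_size))(input_tensor)
        x = tf.keras.layers.Conv2D(128, (1, 1), strides=(1, 1), padding='same')(x)
        # Resize each pooled map back to the original spatial size.
        x = tf.keras.layers.Lambda(
            lambda t, hh=h, ww=w: tf.image.resize(t, (hh, ww)))(x)
        concat_list.append(x)

    # Concatenate the original map with all pooled-and-resized maps.
    return tf.keras.layers.concatenate(concat_list)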

Let’s add this PPM module, which takes its input from the last bottleneck block:
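For example (the bin sizes [2, 4, 6, 8] are the ones I used; adjust them to whatever set you prefer):

gfe_layer = pyramid_pooling_block(gfe_layer, [2, 4, 6, 8])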

The second argument is the list of bin sizes provided to the PPM module; the bin sizes used here follow the paper. These bin sizes determine the sub-regions over which average pooling is applied, as described in the custom function above.

3. Feature Fusion

Fig 3: Taken from Fast-SCNN original paper

In this module, two inputs are added together for a better segmentation representation. The first is the higher-resolution feature map extracted by the Learning to Down-sample module; its output is passed through a point-wise convolution before being added to the second input. No activation is applied at the end of this point-wise convolution.

The second input is the output of the Global Feature Extractor. Before the addition, it is first upsampled by a factor of (4, 4), then passed through a depthwise convolution, and finally through another point-wise convolution. No activation is applied to the point-wise convolution output; the activation is introduced only after the two inputs are added.

Table 3: Feature Fusion Module from the original paper

Here are the lower-resolution operations implemented using TF.Keras:
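A sketch of this branch; the 3x3 depthwise kernel and the 128 output channels of the point-wise convolution are my assumptions, and ff_layer2 is my own name.

ff_layer2 = tf.keras.layers.UpSampling2D((4, 4))(gfe_layer)
# Depthwise convolution on the upsampled map, followed by BatchNorm and ReLU.
ff_layer2 = tf.keras.layers.DepthwiseConv2D((3, 3), strides=(1, 1), depth_multiplier=1, padding='same')(ff_layer2)
ff_layer2 = tf.keras.layers.BatchNormalization()(ff_layer2)
ff_layer2 = tf.keras.layers.Activation('relu')(ff_layer2)
# Point-wise convolution with no activation at the end.
ff_layer2 = tf.keras.layers.Conv2D(128, (1, 1), strides=(1, 1), padding='same', activation=None)(ff_layer2)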

Now, let us add the two inputs together to complete the feature fusion module:
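A sketch of the fusion, with the higher-resolution branch reduced to the same number of channels by a point-wise convolution (relu=False, as described above) before the element-wise addition:

# Higher-resolution input: point-wise convolution on the Learning to Down-sample output.
ff_layer1 = conv_block(lds_layer, 'conv', 128, (1, 1), strides=(1, 1), relu=False)

# Add the two branches, then apply BatchNorm and ReLU on the fused result.
ff_final = tf.keras.layers.add([ff_layer1, ff_layer2])
ff_final = tf.keras.layers.BatchNormalization()(ff_final)
ff_final = tf.keras.layers.Activation('relu')(ff_final)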

4. Classifier

In the classifier section, 2 depthwise separable convolutional layers are introduced, followed by 1 point-wise convolutional layer. Each of these layers is followed by a BatchNorm layer and a ReLU activation.

One thing to note here is that the original paper’s table (Table 1 above) does not mention the upscaling and dropout layers after the point-wise convolutional layer, but a later part of the paper describes that these layers are added there. Therefore, during the implementation, I have also introduced these two layers as written in the paper.

After upscaling to the desired final output resolution, a SoftMax activation is introduced as the final layer.
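A sketch of the classifier; the 19 output classes match the Cityscapes setup mentioned below, while the dropout rate of 0.3 and the placement of Dropout before the upsampling are my own assumptions.

classifier = tf.keras.layers.SeparableConv2D(128, (3, 3), padding='same', strides=(1, 1))(ff_final)
classifier = tf.keras.layers.BatchNormalization()(classifier)
classifier = tf.keras.layers.Activation('relu')(classifier)

classifier = tf.keras.layers.SeparableConv2D(128, (3, 3), padding='same', strides=(1, 1))(classifier)
classifier = tf.keras.layers.BatchNormalization()(classifier)
classifier = tf.keras.layers.Activation('relu')(classifier)

# Point-wise convolution mapping the features to the number of classes.
classifier = conv_block(classifier, 'conv', 19, (1, 1), strides=(1, 1))

# Dropout and upscaling from 1/8 of the input resolution back to full size, then SoftMax.
classifier = tf.keras.layers.Dropout(0.3)(classifier)
classifier = tf.keras.layers.UpSampling2D((8, 8))(classifier)
classifier = tf.keras.layers.Activation('softmax')(classifier)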

Compiling the model

Now that we have added all the layers, let’s create our final model and compile it. To create the model, as mentioned above, we use the functional API from TF.Keras. The input to the model is the initial Input layer described in the Learning to Down-sample module, and the output is the final classifier output.
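A sketch of the model construction (the model name is my own choice):

fast_scnn = tf.keras.Model(inputs=input_layer, outputs=classifier, name='Fast_SCNN')
fast_scnn.summary()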

Now, let’s compile the model with an optimizer and a loss function. In the original paper, the authors used the SGD optimizer with a momentum of 0.9 and a batch size of 12 during training. They also used a polynomial (“poly”) learning-rate schedule with a base value of 0.045 and a power of 0.9. For the sake of simplicity, I have not used any learning-rate scheduling here, but you can add it for your own training process if required. In general, it is a good idea to start with the Adam optimizer and move to other variants if required, but in this particular case, on the Cityscapes dataset, the authors used only SGD. For the loss function, the authors used cross-entropy loss, and so have I in this implementation.
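A minimal sketch of the compilation step, assuming one-hot encoded segmentation masks (use sparse_categorical_crossentropy for integer masks instead); the fixed learning rate of 0.045 is simply the paper's base value, since no poly schedule is applied here.

# SGD with momentum, as used by the authors.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.045, momentum=0.9)

fast_scnn.compile(optimizer=optimizer,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])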

In the paper, the authors used 19 categories of the Cityscapes dataset for training and evaluation. With this implementation, you can tweak the model to produce any number of output classes required for your particular project.

Here are some of the validation results from Fast-SCNN, compared with the input images and ground truth.

Fig 4: Picture taken from the original paper

I hope you enjoyed this implementation article. If you think I have made any mistakes in the implementation or the explanation, feel free to correct me or suggest changes.

In this way, we can easily implement Fast-SCNN using Tensorflow 2.0 and its high-level API, TF.Keras. Below are the references I used while implementing the model.

For full code, you can visit my GitHub Repo:

References

  1. Link to the Original paper: https://arxiv.org/abs/1902.04502
  2. Link to PSPNet Original paper: https://arxiv.org/pdf/1612.01105.pdf
  3. Link to ContextNet original paper: https://arxiv.org/abs/1805.04554
  4. CityScapes Dataset official website: https://www.cityscapes-dataset.com/
  5. Official Guide to Tensorflow 2.0: https://www.tensorflow.org/alpha
  6. Full code for the implementation: https://github.com/kshitizrimal/Fast-SCNN/blob/master/tf_2_0_fast_scnn.py
  7. Official Implementation of ContextNet: https://www.toshiba.eu/eu/Cambridge-Research-Laboratory/Computer-Vision/Resources/ContextNet/?fbclid=IwAR1T-eLK_xLq1Hu7Xz161YCaKzoZBtQMyvUFTySxbEqM6NNHY7xWV7nq9rA
  8. Pyramid Pooling Module code, inspired from the PSPNet Implementation: https://github.com/dhkim0225/keras-image-segmentation/blob/master/model/pspnet.py
  9. Bottleneck Residual block code inspired from MobileNet V2 Implementation: https://github.com/xiaochus/MobileNetV2/blob/master/mobilenet_v2.py
