BiSeNet for Real-Time Segmentation Part I

In this post, I’m going to give you an introduction to the Bilateral Segmentation Network (BiSeNet) for real-time semantic segmentation.

(Image: Facebook AI team’s Detectron)

For the last two weeks I have been reading and distilling an interesting research paper I found on arXiv.org. It aims to address the accuracy, inference speed, and several other problems that our state-of-the-art approaches in the field of semantic segmentation struggle with.

What is semantic segmentation & why should you care?

Semantic segmentation is an essential area of research in computer vision for image analysis tasks. Its main goal is to assign a semantic label (car, house, person, …) to each pixel of an image.

Example of semantic segmentation
In other words, I think of semantic segmentation as a type of object detection system that can trace out (segment) each object and assign a label to every pixel of an input image or video.
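
To make the per-pixel labeling idea concrete, here is a minimal sketch in PyTorch. The class names and the tiny 4x4 “image” are made up for illustration, and the random logits stand in for whatever a real segmentation network would output:

```python
import torch

# Hypothetical per-pixel class scores from a segmentation network
# (3 classes on a tiny 4x4 image; real models output e.g. 19 classes at full resolution).
class_names = ["background", "car", "person"]
logits = torch.randn(len(class_names), 4, 4)         # shape: (C, H, W)

# Semantic segmentation = assign the highest-scoring class to every pixel.
segmentation_map = logits.argmax(dim=0)              # shape: (H, W), values in 0..C-1
print(segmentation_map)
print("top-left pixel is labeled:", class_names[segmentation_map[0, 0].item()])
```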

I have another post with more detailed information about this topic. So, if you want to know more about it, such as the areas where semantic segmentation can be applied and its core tasks, check out this post about it.


Previous approaches

They all are wrong, the difference is one is less wrong.

Semantic segmentation extracts important features from the input image/video using two components:

  • Rich spatial information &
  • Sizable receptive field

However, modern approaches usually compromise spatial resolution or receptive field (or both) to achieve real-time inference speed, which leads to poor performance.

How does compromising either of these components, or both, affect the model’s predictions?

Recent works in the field of semantic segmentation have shown that there are three main ways to accelerate a model’s inference speed.

Just like Cap, they tried their best but they failed in the end.

ICNet and Real-time Image Segmentation via Spatial Sparsity, for example, focus on building a practically fast semantic segmentation system with decent prediction accuracy. In other words, they make semantic segmentation run fast and reduce computational cost without sacrificing too much quality.

  • The two papers mentioned above use the first of the three approaches: restricting the input size to reduce computational complexity by cropping or resizing. Though the method is simple and effective, the loss of spatial detail corrupts the prediction, especially around object boundaries, and the accuracy drops both in the metrics and in the visualizations.
Restrict input size — limit the size of the input image, e.g. if the network’s input is restricted to 512x512, images with a higher resolution will be resized or cropped.
  • Instead of resizing the input image, some works, such as Xception, prune the channels of the network to boost inference speed, especially in the early stages of the base model.
Pruning channels — channel pruning directly reduces the feature map width, which shrinks the network into a thinner one. It is efficient on both CPU and GPU because no special implementation is required.
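
As a rough back-of-the-envelope sketch of why these two tricks help (the layer sizes and channel counts below are made up, not taken from the papers), here is how the cost of a single convolution shrinks when you restrict the input resolution or prune the channels:

```python
import torch
import torch.nn.functional as F

# Rough multiply-accumulate count for one stride-1 k x k convolution layer.
def conv_cost(in_ch, out_ch, h, w, k=3):
    return in_ch * out_ch * k * k * h * w

full    = conv_cost(64, 64, 1024, 2048)   # illustrative full-resolution input
resized = conv_cost(64, 64, 512, 512)     # input restricted to 512x512
pruned  = conv_cost(32, 32, 1024, 2048)   # channels pruned to half the width

print(f"restricting the input: {full / resized:.1f}x fewer operations")
print(f"pruning the channels:  {full / pruned:.1f}x fewer operations")

# Restricting the input in practice is just a resize/crop before the network:
image = torch.rand(1, 3, 1024, 2048)
small = F.interpolate(image, size=(512, 512), mode="bilinear", align_corners=False)
print(small.shape)                         # torch.Size([1, 3, 512, 512])
```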

For more info about pruning, check out this paper here.

  • For the last case, ENet (Efficient Neural Network) proposes to drop the downsampling operations in the last stage, relying on upsampling operations instead, which results in a poor discriminative ability. In their paper they state that downsampling images has two main drawbacks. First, reducing the feature map resolution implies a loss of spatial information. Second, full-pixel segmentation requires the output to have the same resolution as the input. However, downsampling has one big advantage: filters operating on downsampled images have a bigger receptive field, which allows them to cover larger objects.
Downsampling — reducing the sampling rate of a signal; for images, this means reducing the number of pixels, which makes it a form of image resampling/reconstruction.
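
Here is a quick illustration of downsampling in code, using plain strided max pooling rather than ENet’s actual blocks (which are more elaborate): the feature map shrinks, the effective receptive field of later filters grows, and fine spatial detail is gone.

```python
import torch
import torch.nn.functional as F

feature_map = torch.rand(1, 64, 256, 256)             # (batch, channels, H, W)

# 2x downsampling with strided max pooling: H and W are halved, spatial detail is lost.
downsampled = F.max_pool2d(feature_map, kernel_size=2, stride=2)
print(downsampled.shape)                               # torch.Size([1, 64, 128, 128])

# A 3x3 filter applied after this step "sees" roughly a 6x6 region of the original
# feature map, i.e. the receptive field grows, which helps cover larger objects.
```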

Overall, all of the above methods trade accuracy for speed.

Researchers also try to remedy the loss of spatial details mentioned above by utilizing a U-shape structure. By fusing the hierarchical features of the backbone network, the U-shape structure gradually increases the spatial resolution and fills in some of the missing details. However, this technique has two weaknesses.

  • The complete U-shape structure can reduce the speed of the model due to the extra computation.
  • Most of the spatial information lost in pruning cannot be easily recovered.
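
A minimal sketch of one U-shape fusion step, with hypothetical feature maps and a simple upsample-and-concatenate (real U-shape networks wrap this in extra convolutions):

```python
import torch
import torch.nn.functional as F

# Hypothetical backbone features: deeper stages are lower resolution but more semantic.
shallow = torch.rand(1, 64, 128, 128)    # early stage: fine spatial details
deep    = torch.rand(1, 256, 32, 32)     # late stage: strong semantics, coarse resolution

# One U-shape fusion step: upsample the deep features back to the shallow resolution
# and combine them with the shallow ones to fill in some of the missing details.
upsampled = F.interpolate(deep, size=shallow.shape[2:], mode="bilinear", align_corners=False)
fused = torch.cat([shallow, upsampled], dim=1)
print(fused.shape)                        # torch.Size([1, 320, 128, 128])

# Doing this at every stage means extra computation at high resolution,
# which is exactly where the speed penalty of the U-shape structure comes from.
```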

The solution to the problems mentioned above

Let me show you how it's done!

To address the dilemma of sacrificing accuracy for speed, the paper proposes the Bilateral Segmentation Network (BiSeNet) with two parts:

  • Spatial Path
  • Context Path
I already made a post explaining these two parts and how they work in detail, so click here to check it out.

The researchers designed a Spatial Path (SP) with a small stride to preserve the spatial information and generate high-resolution features. They also designed a Context Path (CP), which runs in parallel to the SP and employs a fast downsampling strategy to obtain a sufficient receptive field. In pursuit of better accuracy without loss of speed, they fuse the two paths and refine the final prediction. They propose a new Feature Fusion Module (FFM) to combine the features efficiently and an Attention Refinement Module (ARM) to refine the features of each stage. The ARM employs global average pooling to capture the global context and refine the output feature of each stage in the CP. It doesn’t require any upsampling operation, so it demands little extra computation.
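
Based on the description above, an Attention Refinement Module could be sketched roughly like this: global average pooling followed by a channel-wise re-weighting. The exact layers (1x1 convolution, batch norm, sigmoid) are a common implementation choice and should be treated as an assumption, not the authors’ official code:

```python
import torch
import torch.nn as nn

class AttentionRefinementModule(nn.Module):
    """Sketch of an ARM: global average pooling -> 1x1 conv -> BN -> sigmoid,
    used to re-weight the channels of a Context Path feature map."""

    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # global average pooling
        self.conv = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        attention = self.sigmoid(self.bn(self.conv(self.pool(x))))   # (N, C, 1, 1)
        return x * attention                                 # refine features, no upsampling

# Usage on a hypothetical Context Path feature map (eval mode so BatchNorm
# uses its running statistics for this single-image demo):
arm = AttentionRefinementModule(128).eval()
features = torch.rand(1, 128, 32, 64)
print(arm(features).shape)                                   # torch.Size([1, 128, 32, 64])
```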

Just like that, I rendered your work obsolete, BOOM!

Summary

The proposed architecture strikes a good balance between speed and segmentation performance on the Cityscapes, CamVid, and COCO-Stuff datasets. Specifically, for a 2048×1024 input, it achieves 68.4% mean IoU (Intersection over Union) on the Cityscapes test dataset at a speed of 105 FPS (frames per second) on one NVIDIA Titan XP card, which is significantly faster than existing methods with comparable performance.
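
For readers unfamiliar with the metric, here is a tiny sketch of how per-class IoU is computed from a predicted and a ground-truth mask (toy 4x4 arrays, not Cityscapes data); mean IoU is simply the average over classes:

```python
import numpy as np

# Toy 4x4 prediction and ground truth with classes {0, 1, 2}.
pred = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1],
                 [2, 2, 1, 1],
                 [2, 2, 2, 0]])
gt   = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [2, 2, 2, 1],
                 [2, 2, 2, 2]])

ious = []
for c in range(3):
    intersection = np.logical_and(pred == c, gt == c).sum()
    union = np.logical_or(pred == c, gt == c).sum()
    ious.append(intersection / union)       # IoU for class c

print("per-class IoU:", ious)
print("mean IoU:", np.mean(ious))
```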