BiSeNet for Real-Time Segmentation Part I
In this post, I'm going to give you an introduction to BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation.
For the last two weeks I have been reading and distilling this interesting research paper that I found on arXiv.org. It aims to improve accuracy and inference speed, and to address several other problems that current state-of-the-art approaches have in the field of semantic segmentation.
What is semantic segmentation & why should you care?
Semantic segmentation is an essential area of computer vision research for image analysis tasks. Its main goal is to assign a semantic label (car, house, person…) to each pixel in an image.
In other words, I think of semantic segmentation as a finer-grained relative of object detection: it traces (segments) an object and assigns a label to every pixel of an input such as an image or a video frame.
I have another post with more detailed information about this topic. If you want to know more, such as the areas where semantic segmentation can be applied and its core tasks, check out that post.
Semantic segmentation models rely on two components to extract important features from the input image or video:
- Rich spatial information
- Sizable receptive field
However, modern approaches usually compromise spatial resolution, receptive field, or both to achieve real-time inference speed, which leads to poor performance.
How does compromising one or both of these components affect the model's prediction?
Recent work in the field of semantic segmentation has shown that there are three ways to accelerate a model's inference speed.
ICNet and Real-time Image Segmentation via Spatial Sparsity, for example, focus on building a practically fast semantic segmentation system with decent prediction accuracy. In other words, they aim to make semantic segmentation run fast by reducing computational cost without sacrificing too much quality.
- The two papers mentioned above use the first of the three approaches: restricting the input size by cropping or resizing to reduce computational complexity. Though the method is simple and effective, the loss of spatial detail corrupts the prediction, especially around boundaries, which lowers accuracy in both metrics and visualizations.
Restrict input size: Limit the size of the input image. For example, if the network input is restricted to 512x512, images with a higher resolution will be resized or cropped.
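To make the cost of restricting the input size concrete, here is a minimal NumPy sketch (my own illustration, not code from any of the papers). The helper names `resize_nearest` and `center_crop`, and the toy label map with a one-pixel-wide "pole", are assumptions made purely for the example.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return img[rows][:, cols]

def center_crop(img, out_h, out_w):
    """Cut an (out_h, out_w) window out of the middle of the image."""
    in_h, in_w = img.shape[:2]
    top = (in_h - out_h) // 2
    left = (in_w - out_w) // 2
    return img[top:top + out_h, left:left + out_w]

# Hypothetical 1024x2048 label map: background (0) plus a thin pole (1).
labels = np.zeros((1024, 2048), dtype=np.uint8)
labels[:, 1001] = 1

small = resize_nearest(labels, 512, 512)
crop = center_crop(labels, 512, 512)

print("pole pixels after resize:", int(small.sum()))
print("pole pixels after crop:  ", int(crop.sum()))
```

In this toy case the resize samples only every fourth column, so the thin pole disappears entirely, while the crop keeps the pole but throws away everything outside the window. Real pipelines usually use smoother interpolation, but fine structures and boundaries still suffer in a similar way.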
- Instead of resizing the input image, some works, such as Xception, prune the channels of the network to boost inference speed, especially in the early stages of the base model.
Pruning channels: Channel pruning directly reduces feature map width, which shrinks a network into a thinner one. It is efficient on both CPU and GPU because no special implementation is required.
For more info about pruning, check out this paper here.
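As a rough illustration of what "shrinking a network into a thinner one" means, the sketch below prunes the output channels of one convolution and the matching input channels of the next. Ranking channels by the L1 norm of their filters is an assumption I picked for simplicity, not necessarily the criterion used in the works above, and `prune_channels` is a hypothetical helper.

```python
import numpy as np

def prune_channels(w1, w2, keep):
    """Keep the `keep` output channels of w1 with the largest L1 norm.

    w1: (out1, in1, k, k) weights of a conv layer
    w2: (out2, out1, k, k) weights of the conv layer that follows it
    Returns the thinner w1, the matching slice of w2, and the kept indices.
    """
    scores = np.abs(w1).sum(axis=(1, 2, 3))      # one L1 score per output channel
    kept = np.sort(np.argsort(scores)[-keep:])   # indices of the strongest channels
    return w1[kept], w2[:, kept], kept

rng = np.random.default_rng(0)
w1 = rng.normal(size=(64, 3, 3, 3))      # toy early-stage conv: 3 -> 64 channels
w2 = rng.normal(size=(128, 64, 3, 3))    # next conv: 64 -> 128 channels

p1, p2, kept = prune_channels(w1, w2, keep=32)
print("layer 1 weights:", w1.size, "->", p1.size)
print("layer 2 weights:", w2.size, "->", p2.size)
```

The result is still a pair of ordinary dense convolutions, only narrower, which is why no special implementation is needed on CPU or GPU.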
- For the last case, ENet (Efficient Neural Network) proposes to drop the downsampling operations in the last stage and relies on upsampling, the opposite of downsampling, instead. Without that downsampling the receptive field is too small to cover large objects, which results in poor discriminative ability. In their paper, the ENet authors state that downsampling images has two main drawbacks. Firstly, reducing feature map resolution implies a loss of spatial information. Secondly, full pixel segmentation requires that the output has the same resolution as the input. However, downsampling has one big advantage: filters operating on downsampled images have a bigger receptive field, which allows them to cover larger objects.
Downsampling: Making a digital signal smaller by lowering its sampling rate or sample size. For images, this means reducing the number of pixels; it is a form of image resampling or image reconstruction.
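The link between downsampling and receptive field can be checked with a little arithmetic. The sketch below uses the standard recurrence for stacked convolutions (each layer adds `(kernel - 1) * jump` to the receptive field, and the jump is multiplied by the stride). The three-layer configurations are toy examples of my own, not architectures taken from the papers.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, applied in order.

    Returns (receptive field in input pixels, total downsampling factor).
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (k - 1) jumps
        jump *= stride              # striding spreads later taps further apart
    return rf, jump

no_downsampling = [(3, 1), (3, 1), (3, 1)]
with_downsampling = [(3, 2), (3, 2), (3, 2)]

print(receptive_field(no_downsampling))    # (7, 1)
print(receptive_field(with_downsampling))  # (15, 8)
```

With the same number of 3x3 layers, the strided stack sees a 15-pixel-wide window instead of 7, but its output is 8 times smaller in each dimension. That is exactly the trade-off between receptive field and spatial detail described above.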
Overall, all of the above methods sacrifice accuracy for speed.
Researchers have also tried to remedy the loss of spatial detail mentioned above by using a U-shape structure. By fusing the hierarchical features of the backbone network, the U-shape structure gradually increases the spatial resolution and fills in some of the missing details. However, this technique has two weaknesses.
- The complete U-shape structure can reduce the speed of the model because of the extra computation.
- Most of the spatial information lost in pruning or cropping cannot be easily recovered.
The solution to the problems mentioned above
To address the dilemma of sacrificing accuracy for speed, the paper proposes the Bilateral Segmentation Network (BiSeNet), which has two parts:
- Spatial Path
- Context Path
I already made a post explaining these two parts and how they work in detail. Click here to check it out.
The researchers designed a Spatial Path (SP) with a small stride to preserve spatial information and generate high-resolution features. In parallel with the SP, they designed a Context Path (CP) that uses a fast downsampling strategy to obtain a sufficient receptive field.
In pursuit of better accuracy without loss of speed, they also studied how to fuse the two paths and refine the final prediction. They propose a new Feature Fusion Module (FFM) to combine features efficiently and an Attention Refinement Module (ARM) to refine the features of each stage. ARM employs global average pooling to capture the global context and refine the output feature at each stage in the CP. It doesn't require any upsampling operation, so it demands little extra computation.
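To give a feel for why global average pooling is a cheap way to inject global context, here is a minimal NumPy sketch of an ARM-style block. Only the global average pooling step comes from the description above. The channel projection (`weight`, `bias`), the sigmoid gate, and the function name `attention_refine` are assumptions of mine for illustration, and the real module may include further layers such as normalization.

```python
import numpy as np

def attention_refine(feat, weight, bias):
    """ARM-style refinement of one feature map (a sketch under assumptions).

    feat:   (C, H, W) feature map from a Context Path stage
    weight: (C, C) hypothetical channel projection (a 1x1 conv on a 1x1 map)
    bias:   (C,) hypothetical bias
    """
    context = feat.mean(axis=(1, 2))           # global average pooling -> (C,)
    logits = weight @ context + bias           # mix channels of the pooled vector
    attn = 1.0 / (1.0 + np.exp(-logits))       # sigmoid gate in (0, 1) per channel
    return feat * attn[:, None, None], attn    # re-weight every channel of feat

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 16, 32))            # toy feature map: 8 channels, 16x32
weight = rng.normal(size=(8, 8)) * 0.1
bias = np.zeros(8)

refined, attn = attention_refine(feat, weight, bias)
print(refined.shape, attn.round(2))
```

Notice that the output has the same spatial size as the input and that the attention vector has only one value per channel. The extra work is a mean over the map plus a small matrix-vector product, with no upsampling involved.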
The proposed architecture strikes a good balance between speed and segmentation performance on the Cityscapes, CamVid, and COCO-Stuff datasets. Specifically, for a 2048×1024 input, they achieve 68.4% Mean IOU (Intersection Over Union) on the Cityscapes test dataset at a speed of 105 FPS (Frames Per Second) on one NVIDIA Titan XP card, which is significantly faster than existing methods with comparable performance.
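If Mean IOU is new to you, the following sketch shows one common way to compute it from a predicted and a ground-truth label map. The function name `mean_iou` and the tiny 2x4 example are my own. Benchmark evaluation scripts also handle details such as ignored labels, which this sketch leaves out.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across the classes that actually occur."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(target * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)                                   # correctly labelled pixels
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter     # predicted + true - overlap
    present = union > 0                                     # skip classes never seen
    return float((inter[present] / union[present]).mean())

target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])

print(mean_iou(pred, target, num_classes=2))
```

Here class 0 has an IoU of 4/5 and class 1 has an IoU of 3/4, so the Mean IOU is 0.775 even though only one of the eight pixels is wrong.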
This concludes Part I of this series about BiSeNet. Stay tuned for more amazing content and for Part II, with code implementing this state-of-the-art real-time semantic segmentation network.
Thank you for reading. If you have any thoughts, comments, or criticism, please leave a comment down below.
If you like it, please give me a round of applause 👏👏👏 and share it with your friends.