BANet

陳德瑋
30 Days Study Project
3 min read · Feb 8, 2021

Source: Access
GitHub: My Implementation

Contributions:
1. Proposed the refinement block, which consists of a channel attention branch and a spatial attention branch.
2. Proposed the pooling fusion block, which lets high-level feature maps be recovered under the guidance of low-level feature maps and makes the fusion of high-level and low-level information more effective.

I. Method:
1. Channel and Spatial Attention Blocks:
1.1 Attention Mechanism: Let the input feature map F be of shape (C×h×w). The channel attention mask M_c is then of shape (C×1×1), and the spatial attention mask M_s is of shape (1×h×w).
The output feature map F_out is computed by

F_out = F⊗Bs(M_c), F_out = F⊗Bc(M_s) — (1)
where ⊗ denotes element-wise multiplication, Bs and Bc stand for spatial and channel broadcast operation respectively.
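Equation (1) is just element-wise multiplication after broadcasting each mask to the full (C×h×w) shape. A minimal sketch in PyTorch, where broadcasting implements Bs and Bc implicitly (the shapes C=64, h=w=32 are arbitrary choices for illustration):

```python
import torch

# Hypothetical shapes: C=64 channels, h=w=32.
F = torch.randn(64, 32, 32)   # input feature map F (C×h×w)
M_c = torch.rand(64, 1, 1)    # channel attention mask M_c (C×1×1)
M_s = torch.rand(1, 32, 32)   # spatial attention mask M_s (1×h×w)

# PyTorch broadcasting implements Bs / Bc implicitly:
# M_c is broadcast over the spatial dims, M_s over the channel dim.
F_ch = F * M_c                # F ⊗ Bs(M_c)
F_sp = F * M_s                # F ⊗ Bc(M_s)
assert F_ch.shape == F.shape and F_sp.shape == F.shape
```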

1.2 Implementation:
The design is motivated by CBAM, one of the earliest models to adopt the attention mechanism in computer vision. Details are shown in the figure below (figure 3 in the paper). Note that all convolution operations use 1x1 kernels, © stands for concatenation of arrays, and ⊗ denotes element-wise multiplication.
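To make the two branches concrete, here is a minimal CBAM-style sketch. The 1x1 convolutions follow the text above; the reduction ratio, the use of global average pooling, and the sigmoid gating are my assumptions, not exact details from figure 3:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Produces M_c of shape (C, 1, 1) from a (C, h, w) feature map."""
    def __init__(self, channels, reduction=4):  # reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, x):
        # Global average pooling squeezes spatial dims; 1x1 convs score channels.
        return torch.sigmoid(self.mlp(x.mean(dim=(2, 3), keepdim=True)))

class SpatialAttention(nn.Module):
    """Produces M_s of shape (1, h, w) from a (C, h, w) feature map."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)  # 1x1 conv as in the text

    def forward(self, x):
        return torch.sigmoid(self.conv(x))

x = torch.randn(2, 64, 32, 32)
m_c = ChannelAttention(64)(x)   # mask of shape (2, 64, 1, 1)
m_s = SpatialAttention(64)(x)   # mask of shape (2, 1, 32, 32)
```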

2 Pooling Fusion Block
2.1 Motivation:
a. The high-level feature map has accurate semantic information but poor spatial information, while the low-level feature map is the opposite. (The larger the receptive field, the better the semantic information.)
b. The upsampled high-level feature map may bring incorrect inter-class boundaries into the model, so the authors use the relatively accurate boundary information from the low-level feature map to guide the high-level one. The concept is illustrated in figure 4 below. (The term ‘boundary’ here means the distinction between classes; read this for more detail.)

2.2 Implementation:
The authors use the average pooling operation to alleviate the impact of false boundaries; details are shown in figure 5 below. Note that the average pooling kernel is 3x3, and each convolution operation is followed by batch normalization and the ReLU function.
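A sketch of the pooling fusion block under those constraints (3x3 average pooling; 1x1 conv followed by BN and ReLU). The exact wiring of figure 5 is not reproduced here: the bilinear upsampling, the concatenation-based fusion, and the channel counts are my assumptions, with 128 output filters taken from the ablation study below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingFusionBlock(nn.Module):
    """Illustrative sketch: fuse an upsampled high-level map with a
    low-level map; the exact wiring of figure 5 may differ."""
    def __init__(self, high_ch, low_ch, out_ch=128):  # 128 filters per ablation
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)  # 3x3 avg pool
        self.conv = nn.Sequential(                    # 1x1 conv -> BN -> ReLU
            nn.Conv2d(high_ch + low_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, high, low):
        # Upsample high-level features to the low-level resolution,
        # smooth false boundaries with average pooling, then fuse.
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                             align_corners=False)
        high = self.pool(high)
        return self.conv(torch.cat([high, low], dim=1))

pfb = PoolingFusionBlock(high_ch=256, low_ch=64)
out = pfb(torch.randn(1, 256, 16, 16), torch.randn(1, 64, 64, 64))
# out has the low-level spatial resolution: (1, 128, 64, 64)
```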

2.3 The Whole Model
With all of the above, the model is designed as below, with the loss function:

L_total = L_main + λ ∗ L_aux — (2)
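Loss (2) combines a main head and an auxiliary head. A minimal sketch, assuming cross-entropy for both terms (the usual choice in semantic segmentation; the paper's exact per-head losses are not reproduced here), with λ = 0.1 taken from the ablation study below:

```python
import torch
import torch.nn.functional as F

def total_loss(main_logits, aux_logits, target, lam=0.1):  # λ = 0.1 per ablation
    """L_total = L_main + λ * L_aux, both assumed to be cross-entropy."""
    l_main = F.cross_entropy(main_logits, target)
    l_aux = F.cross_entropy(aux_logits, target)
    return l_main + lam * l_aux

main = torch.randn(2, 19, 64, 64)          # 19 Cityscapes classes
aux = torch.randn(2, 19, 64, 64)           # auxiliary head at the same resolution
target = torch.randint(0, 19, (2, 64, 64)) # per-pixel labels
loss = total_loss(main, aux, target)       # scalar loss for backprop
```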

II. Training Details for Cityscapes dataset
.Learning rate scheduler policy: polynomial w/ power= 0.9
.Initial learning rate: 0.1
.Optimizer: SGD w/ momentum=0.9 & weight decay of 0.0005
The model is trained for 80000 iterations on input images of size 1024x1024, which are randomly scaled between 0.5 and 2 and randomly flipped.
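The polynomial schedule above is a pure function of the iteration count; with power 0.9, base learning rate 0.1, and 80000 iterations it can be written as:

```python
def poly_lr(iteration, max_iter=80000, base_lr=0.1, power=0.9):
    """Polynomial decay: lr = base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1 - iteration / max_iter) ** power

# The rate starts at base_lr and decays smoothly to 0 at the last iteration.
assert poly_lr(0) == 0.1
assert poly_lr(80000) == 0.0
```

In PyTorch this same schedule could be attached to the SGD optimizer via `torch.optim.lr_scheduler.LambdaLR`, but the bare function shows the shape of the decay.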

III. Ablation Studies
3.1 Loss Function
For λ in loss function (2), the authors tried values from {0.01, 0.05, 0.1, 0.25, 0.5}; the results show that the best λ is 0.1.
3.2 Number of Filters of Conv. in PFB
For the number of filters of the 1x1 convolution operation in the PFB (pooling fusion block, fig. 5), the authors tried values from {32, 64, 128, 256}; taking both the results and the parameter count into consideration, they set the number of filters to 128.

IV. Reference:
Chengli Peng, Jiayi Ma, Chen Chen, Xiaojie Guo, "Bilateral attention decoder: A lightweight decoder for real-time semantic segmentation," Neural Networks, 2021, ISSN 0893-6080, https://doi.org/10.1016/j.neunet.2021.01.021

Here’s the end of my note on BANet. I did not go through all the details of the paper, so if you are interested, just read it!
All rights to the figures I use are reserved to Elsevier Ltd.; if there is any copyright infringement, please let me know.
