Understanding Attention Modules: CBAM and BAM — A Quick Read

Understanding two of the most interesting attention mechanisms used in convolutional neural networks.

Shreejal Trivedi
VisionWizard
7 min read · Jun 12, 2020


In this article, we will quickly go through two papers: Bottleneck Attention Module (BAM)¹ and Convolutional Block Attention Module (CBAM)².

Recently, many SOTA networks have leveraged these attention mechanisms and significantly improved their results while remaining suitable for real-time use.

Their lightweight design and straightforward implementation make them easy to drop directly into the feature-extraction backbone of convolutional neural networks.

So, let’s get started by paying some ATTENTION to these topics.

PS: Stay tuned for more exciting articles like this in the future by following the VisionWizard page.

1. Attention Module: What is it?

  • Attention modules are used to make a CNN learn and focus on the important information rather than on non-useful background. In the case of object detection, the useful information is the target objects that we want to classify and localize in an image.
  • The attention module consists of a simple 2D convolutional layer (or an MLP in the case of channel attention) with a sigmoid function at the end that generates a mask of the input feature map.
Fig. 1 Base structure of attention module.
  • It takes a C×H×W feature map as input and produces a 1×H×W (or C×H×W in the case of a 3D attention map) attention map as output. This attention map is then multiplied element-wise with the input feature map to get a more refined and highlighted output (a minimal sketch of this pattern follows Fig. 2 below).
  • Generally, attention mechanisms are applied along the spatial and channel dimensions. These two kinds of maps, spatial attention and channel attention, can be generated either sequentially (as shown in Fig. 2) or in parallel.
  • These attention mechanisms were first experimented with in residual architectures. The figure below shows the overall structure of an attention layer.
Fig. 2 Types of attention mechanisms: Spatial and Channel Attention Modules. They can be applied sequentially or in parallel. Source: [2].
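To make the base structure concrete, here is a minimal PyTorch sketch of such an attention gate. It is only a toy illustration of the mask-then-multiply pattern described above; the single 1×1 convolution and the class name SimpleSpatialGate are my own choices and are not taken from either paper.

```python
import torch
import torch.nn as nn

# A minimal attention gate: a small sub-network produces a mask, a sigmoid
# squashes it into (0, 1), and the mask re-weights the input element-wise.
class SimpleSpatialGate(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.mask = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),  # C x H x W -> 1 x H x W
            nn.Sigmoid(),
        )

    def forward(self, x):
        attn = self.mask(x)   # 1 x H x W attention map per sample
        return x * attn       # broadcast multiply over the channel dimension


x = torch.randn(2, 64, 32, 32)
print(SimpleSpatialGate(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```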

2. Bottleneck Attention Module (BAM)

Fig. 3 Structure of the Bottleneck Attention Module. Source: [1].
  • As discussed, the attention module as a whole consists of a Channel Attention Module and a Spatial Attention Module. For a given input feature map F ∈ C×H×W, BAM infers a 3D attention map M(F) ∈ C×H×W, and the refined feature map is computed as F′ = F + F ⊗ M(F).
  • Here ⊗ denotes element-wise multiplication. BAM adopts a residual learning scheme along with the attention mechanism to facilitate gradient flow: after multiplying by the attention mask, the result is added back to the input tensor F. To design an efficient module, the channel attention Mc(F) ∈ C×1×1 and the spatial attention Ms(F) ∈ 1×H×W are first computed in two separate branches and then combined as M(F) = σ(Mc(F) + Ms(F)), as shown in Fig. 3 above. Here σ is the sigmoid function, and both branch outputs are broadcast to C×H×W before the addition.

2.1 Channel Attention Module

  • Steps to generate the channel attention map (a sketch follows this list):
  1. Apply Global Average Pooling (GAP) to the feature map F to get a channel vector Fc ∈ C×1×1.
  2. Pass Fc through a small MLP with one hidden layer of dimension C/r. Here r is a reduction ratio for the hidden layer (for example, if the channel vector length is 1024 and the reduction ratio r is 16, the hidden layer has 64 neurons).
  3. Apply a Batch Normalization layer to the output of this MLP to produce the channel attention Mc(F).
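A possible PyTorch sketch of this channel branch, following the three steps above (GAP, an MLP with hidden size C/r, and Batch Normalization on the MLP output). The class and layer names are mine; for the exact code, refer to the official implementation linked at the end of the article.

```python
import torch
import torch.nn as nn

class BAMChannelAttention(nn.Module):
    """Channel branch of BAM: GAP -> MLP (C -> C/r -> C) -> BatchNorm."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)          # C x H x W -> C x 1 x 1
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x):
        b, c, _, _ = x.shape
        fc = self.gap(x).view(b, c)      # channel vector Fc
        mc = self.bn(self.mlp(fc))       # normalized channel attention
        return mc.view(b, c, 1, 1)       # Mc(F) with shape C x 1 x 1


print(BAMChannelAttention(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 1, 1])
```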

2.2 Spatial Attention Module

  • Steps to generate the spatial attention map (a sketch follows this list):
Fig. 4 Effect of dilation value: d during a convolution operation on a feature map.
  1. The input feature map F is passed through a chain of 1×1 and 3×3 convolutional layers. The 3×3 layers use a dilation value of d=4, which increases the effective receptive field of the network. Fig. 4 shows the effect of an increased dilation value: the higher the dilation, the larger the receptive field.
  2. The first 1×1 layer reduces the channel dimension to C/r. This reduced feature map is passed through two 3×3 convolutional blocks with d=4. Finally, the last 1×1 convolutional layer, followed by Batch Norm, reduces the output to 1×H×W to generate Ms(F), as shown in Fig. 3.
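A matching sketch of the spatial branch, assuming the same reduction ratio r and the dilation value d=4 mentioned above. The placement of the ReLU activations is my own choice for readability and may differ from the official implementation.

```python
import torch
import torch.nn as nn

class BAMSpatialAttention(nn.Module):
    """Spatial branch of BAM: 1x1 reduce -> two dilated 3x3 convs -> 1x1 -> BN."""

    def __init__(self, channels: int, reduction: int = 16, dilation: int = 4):
        super().__init__()
        mid = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),            # reduce C -> C/r
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, kernel_size=1),                   # collapse to 1 channel
            nn.BatchNorm2d(1),
        )

    def forward(self, x):
        return self.body(x)   # Ms(F) with shape 1 x H x W


print(BAMSpatialAttention(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 1, 32, 32])
```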

2.3 Combination of Channel and Spatial Attention Maps

  • Once we have Mc and Ms, they are added element-wise and passed through the sigmoid to get the final attention map M(F). The authors chose addition over multiplication because it gives smoother gradient flow during backpropagation.
  • Since the two maps have different dimensions (C×1×1 and 1×H×W), each is broadcast along the spatial and channel dimensions respectively to C×H×W before the addition (see the toy example below).
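A tiny toy example of this fusion step, with random tensors standing in for the real branch outputs, just to show how the broadcasting works:

```python
import torch

C, H, W = 64, 32, 32
Mc = torch.randn(1, C, 1, 1)   # channel attention, C x 1 x 1
Ms = torch.randn(1, 1, H, W)   # spatial attention, 1 x H x W

# Broadcasting expands both maps to C x H x W before the element-wise sum;
# the sigmoid then squashes the fused map into (0, 1).
M = torch.sigmoid(Mc + Ms)
print(M.shape)                 # torch.Size([1, 64, 32, 32])
```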

2.4 What to do with the generated attention map?

  • The attention map M(F) is multiplied element-wise with the input feature map, and the result is added back to the input to get a more refined and highlighted map, i.e. F′ = F + F ⊗ M(F). This residual structure resembles the ResNet family (see the toy example below).
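Continuing the toy example, this is how the fused map is applied, matching the residual formulation F′ = F + F ⊗ M(F):

```python
import torch

F = torch.randn(1, 64, 32, 32)                 # input feature map
M = torch.sigmoid(torch.randn(1, 64, 32, 32))  # stand-in for the fused map M(F)

F_refined = F + F * M   # element-wise multiply, then add the identity back
print(F_refined.shape)  # torch.Size([1, 64, 32, 32])
```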

2.5 Where to keep BAM?

  • The module is placed at every bottleneck of the ResNet architecture, i.e. at the points between stages where the feature maps are downsampled. BAM suppresses low-level distractors such as background texture in the early stages and then gradually focuses on the exact target, which carries high-level semantics (a placement sketch follows Fig. 5).
Fig. 5 Placement of BAM in ResNet.
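An illustrative (and heavily simplified) way to place such a module between the stages of a torchvision ResNet. This is not the official code; nn.Identity() is used as a stand-in for a real BAM module so the snippet runs on its own.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()                  # randomly initialized ResNet-50
bam = lambda channels: nn.Identity()   # placeholder: swap in a real BAM(channels)

model = nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, bam(256),         # BAM at the bottleneck after stage 1
    backbone.layer2, bam(512),
    backbone.layer3, bam(1024),
    backbone.layer4,
)
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 2048, 7, 7])
```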

3. Convolutional Block Attention Module (CBAM)

Fig. 6 Structure of Spatial and Channel Attention in CBAM. Source: [2].
  • Given an intermediate feature map F ∈ C×H×W as input, CBAM sequentially infers a 1D channel attention map Mc ∈ C×1×1 and a 2D spatial attention map Ms ∈ 1×H×W, as shown in Fig. 6. The overall attention process can be summarized as F′ = Mc(F) ⊗ F and F″ = Ms(F′) ⊗ F′.
  • Here ⊗ denotes element-wise multiplication. During multiplication, the attention values are broadcasted (copied) accordingly: channel attention values are broadcasted along the spatial dimension, and vice versa. F’’ is the final refined output. Fig. 6 depicts the computation process of each attention map.

3.1 Channel Attention Map

  • The channel attention map follows the same generation process as in BAM, but alongside Average Pooling, Max Pooling is also used to obtain more distinctive channel features, as shown in Fig. 6. It is computed as Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W1(W0(Fcavg)) + W1(W0(Fcmax))) (a sketch follows below).
  • Here σ denotes the sigmoid function, W0 ∈ C/r×C, and W1 ∈ C×C/r. Note that the MLP weights W0 and W1 are shared for both inputs, and W0 is followed by the ReLU activation function.
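A possible sketch of this channel attention, with the shared MLP applied to both the average-pooled and the max-pooled descriptors. Again, the naming is mine and not from the official implementation.

```python
import torch
import torch.nn as nn

class CBAMChannelAttention(nn.Module):
    """CBAM channel attention: shared MLP over GAP and MaxPool descriptors."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.shared_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),   # W0, followed by ReLU
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),   # W1
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.shared_mlp(x.mean(dim=(2, 3)))         # MLP(AvgPool(F))
        mx = self.shared_mlp(x.amax(dim=(2, 3)))          # MLP(MaxPool(F))
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)   # Mc(F), C x 1 x 1


print(CBAMChannelAttention(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 1, 1])
```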

3.2 Spatial Attention Map

  • Steps to generate the spatial attention map (a sketch follows this list):
  1. Take the input feature map F and generate two intermediate feature maps, Fsavg and Fsmax ∈ 1×H×W, by applying average pooling and max pooling along the channel axis.
  2. Concatenate these two pooled outputs and pass them through a small convolutional block with a 7×7 kernel. Unlike BAM, which relies on dilation to enlarge the receptive field, CBAM uses a large kernel size to accomplish the same; the convolution here is a plain one with d=1.
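A matching sketch of the spatial attention, following the two steps above (channel-wise average and max pooling, concatenation, then a single 7×7 convolution):

```python
import torch
import torch.nn as nn

class CBAMSpatialAttention(nn.Module):
    """CBAM spatial attention: channel-wise avg/max pooling -> 7x7 conv -> sigmoid."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)         # Fsavg: 1 x H x W
        mx, _ = x.max(dim=1, keepdim=True)        # Fsmax: 1 x H x W
        stacked = torch.cat([avg, mx], dim=1)     # 2 x H x W
        return torch.sigmoid(self.conv(stacked))  # Ms(F), 1 x H x W


print(CBAMSpatialAttention()(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 1, 32, 32])
```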

3.3 How to leverage the generated attention maps

Fig. 7 Placement of Spatial and Channel Attention Modules sequentially.
  • As shown in Fig. 7, the generated attention maps are applied sequentially: first the channel attention map is applied to the input feature map, then the spatial attention map is applied to the result. Finally, F″ is added to the block's input through the shortcut connection of that particular ResNet block (see the toy example below).
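Putting the two sketches together, a toy forward pass that follows the sequential order and the shortcut addition described above. It assumes the CBAMChannelAttention and CBAMSpatialAttention classes from the previous snippets are already defined.

```python
import torch

x = torch.randn(2, 64, 32, 32)
channel_attn = CBAMChannelAttention(64)   # sketch from section 3.1
spatial_attn = CBAMSpatialAttention()     # sketch from section 3.2

f1 = x * channel_attn(x)     # F'  = Mc(F)  (x) F
f2 = f1 * spatial_attn(f1)   # F'' = Ms(F') (x) F'
out = x + f2                 # shortcut addition inside the ResNet block
print(out.shape)             # torch.Size([2, 64, 32, 32])
```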

3.4 Where to keep CBAM?

  • Unlike BAM, CBAM modules are placed inside every residual block rather than only at the bottlenecks between stages (a placement sketch follows Fig. 8).
Fig. 8 Placement of CBAM module in ResNet architecture.
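An illustrative sketch of that placement inside a basic residual block: the attention refines the block output just before the shortcut addition. This is a simplified block of my own, not the official code, and nn.Identity() stands in for the CBAM sketches so the snippet runs on its own.

```python
import torch
import torch.nn as nn

class BasicBlockWithCBAM(nn.Module):
    """A simplified ResNet basic block with CBAM applied before the shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.cbam = nn.Identity()        # placeholder for channel + spatial attention

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.cbam(out)             # attention applied to the block output
        return self.relu(out + x)        # residual (shortcut) addition


print(BasicBlockWithCBAM(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```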

3.5 What is the difference between CBAM and BAM?

In the case of BAM, only average pooling (GAP) was used to obtain the statistics of the feature map along the spatial and channel dimensions, whereas CBAM additionally uses max pooling alongside average pooling. The authors showed that max pooling extracts the most salient features of the feature map and complements the GAP output, which encodes the global statistics more softly.

In the case of BAM, the convolutional operation in the spatial branch uses a dilation value of d=4 to increase the receptive field as we go deeper in the network, whereas CBAM uses a larger 7×7 kernel with a normal convolution (d=1) to achieve the same effect.

In the case of BAM, the spatial and channel attention maps are generated in parallel and then added to obtain the final attention map. In CBAM, a sequential approach is used instead: the channel attention map is computed first, and the spatial map is then obtained from the resulting intermediate feature map (channel map × input feature map).

The order of the sequential arrangement in CBAM is Channel Attention → Spatial Attention.

Official Pytorch Implementation of BAM and CBAM: [code]

If you have managed to reach here, then I believe you are part of an elite group with a thorough understanding of the attention modules CBAM and BAM.

Please feel free to share your thoughts and ideas in the comment section below.

If you think this article was helpful, please do share it, and also a clap (or a few) would hurt no one.
