Squeeze-and-Excitation block explained

Taha Samavati
4 min read · Sep 27, 2021


Back in 2017, a network built from SE (Squeeze-and-Excitation) blocks won the ImageNet (ILSVRC 2017) classification challenge, reducing the top-5 error to roughly 2.25%, about a 25% relative improvement over the previous year's winning entry. The SE block models the interdependencies between the feature channels of a CNN. In this article, we'll see why and how these blocks can improve the performance of virtually any convolutional neural network at little extra computational cost.

The Main Idea

In a standard convolutional layer, every output feature is produced by summing the responses of all input channels with fixed, learned kernels; there is no explicit mechanism that judges how important each channel is for the current input, so all channels are effectively treated equally. The SE block introduces an adaptive approach: the importance of each channel is assessed individually, based on the global context of the input, and the channels are reweighted accordingly before being passed on.
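
To make the contrast concrete, here is a tiny PyTorch sketch (my own illustration, not from the SE paper): a plain convolution mixes all input channels with fixed kernels, while an SE-style recalibration multiplies each channel by its own input-dependent scalar.

import torch
import torch.nn as nn

x = torch.randn(1, 3, 8, 8)                  # a 3-channel feature map

# Standard convolution: each output mixes all input channels with fixed weights
conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)
y_conv = conv(x)                             # channel importance is baked into the kernels

# SE-style recalibration: each channel is scaled by its own (input-dependent) weight
w = torch.sigmoid(torch.randn(1, 3, 1, 1))   # stand-in for the learned per-channel weights
y_se = x * w                                 # channel 0 scaled by w[0], channel 1 by w[1], ...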

The Details

Now, let’s explore the inner workings of the SE block to understand how it achieves the adaptive weighting.

  1. Squeeze Phase: In the first step, the SE block gains a global understanding of each channel by squeezing it into a single number. This is done with global average pooling over the spatial dimensions of each channel, producing a vector of size n, where n is the number of channels in the input tensor.
  2. Excitation Phase: This vector of size n is then fed into a small two-layer feed-forward network. The first layer shrinks it by a reduction ratio (a bottleneck that keeps the added parameter count low) and the second layer expands it back to size n, letting the network capture the dependencies among the channels. The output is again a vector of size n, containing the learned importance weight for each channel.
  3. Scale and Combine: The final step uses the n weights from the excitation phase to scale the corresponding channels of the input tensor. Applying these adaptive weights highlights the most informative channels and suppresses the less relevant ones.

The Implementation

The SE block is defined below as a Keras function. It takes the feature map and the number of channels as inputs. GlobalAveragePooling2D squeezes each channel into a single value (the squeeze step), a stack of two Dense layers then transforms those n values into n per-channel weights (the excitation step), and the output is obtained by multiplying each channel of the input by its weight.

from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, multiply

def se_block(in_block, ch, ratio=16):
    # Squeeze: global average pooling reduces each channel to a single value
    y = GlobalAveragePooling2D()(in_block)

    # Excitation: two Dense layers transform the n values into n channel weights
    y = Dense(ch // ratio, activation='relu')(y)   # bottleneck layer with ReLU
    y = Dense(ch, activation='sigmoid')(y)         # gating layer with sigmoid

    # Scale and Combine: apply the weights to the channels by element-wise multiplication
    return multiply([in_block, y])

The first Dense layer uses a ReLU activation and reduces the dimensionality by the reduction ratio, while the second (last) layer uses a sigmoid activation that acts as a smooth gating function, producing weights between 0 and 1.
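
As a quick, hypothetical usage sketch (the input shape and layer sizes are purely illustrative), the function can be dropped between any two layers of a Keras model:

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Conv2D

inputs = Input(shape=(32, 32, 3))
x = Conv2D(64, 3, padding='same', activation='relu')(inputs)
x = se_block(x, ch=64)                              # recalibrate the 64 channels
outputs = Conv2D(64, 3, padding='same', activation='relu')(x)
model = Model(inputs, outputs)
model.summary()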

PyTorch Implementation

import torch
import torch.nn as nn

class SE_Block(nn.Module):
    def __init__(self, c, r=16):
        super().__init__()
        # Squeeze: global average pooling over the spatial dimensions
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        # Excitation: bottleneck MLP that produces one weight per channel
        self.excitation = nn.Sequential(
            nn.Linear(c, c // r, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(c // r, c, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        bs, c, _, _ = x.size()
        y = self.squeeze(x).view(bs, c)           # (bs, c, 1, 1) -> (bs, c)
        y = self.excitation(y).view(bs, c, 1, 1)  # per-channel weights
        return x * y.expand_as(x)                 # scale each channel
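
A quick sanity check (the shapes are illustrative) shows that the block preserves the input shape while rescaling its channels:

se = SE_Block(c=64, r=16)
x = torch.randn(8, 64, 32, 32)    # batch of 8 feature maps with 64 channels
out = se(x)
print(out.shape)                  # torch.Size([8, 64, 32, 32])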

With this implementation, the SE block can be dropped into almost any CNN architecture to enhance feature learning and improve performance on a range of computer vision tasks.

Performance

Below you can see the vanilla ResNet block (left) and its SE-enhanced counterpart, the SE-ResNet block (right).

By adding SE blocks to ResNet-50, the network delivers almost the same accuracy as ResNet-101, while SE-ResNet-50 needs only about half of ResNet-101's computational cost. The table below, taken from the original paper, compares the SE-enhanced models with their plain counterparts in terms of ImageNet error and GFLOPs.
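
To illustrate where the SE block sits inside a residual block, here is a minimal PyTorch sketch of an SE-ResNet-style basic block. The SEBasicBlock name and the stride-1, equal-channel simplification are mine; it reuses the SE_Block class from above. The key point, following the paper, is that the SE block recalibrates the output of the residual branch just before it is added to the identity shortcut.

import torch.nn as nn

class SEBasicBlock(nn.Module):
    # A simplified ResNet basic block with channel recalibration on the residual branch
    def __init__(self, c, r=16):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c)
        self.conv2 = nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c)
        self.se = SE_Block(c, r)          # reuse the SE block defined earlier
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.se(out)                # recalibrate channels before the skip addition
        return self.relu(out + x)         # identity shortcut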

References

  1. Hu, Jie, Li Shen, and Gang Sun. “Squeeze-and-excitation networks.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. 2018.
  2. Paul-Louis Pröve’s post
  3. PyTorch code repo
