SENet and Coordinate Attention modules and their integration into the YOLOv5 object detection pipeline

Fletcher
6 min read · Jul 19, 2022


YOLOv5 is an object detection algorithm developed by Ultralytics. Attention mechanisms are a tool used in object detection that may or may not improve detection performance, depending on the type of problem your detection pipeline is trying to solve. This article outlines and explains these technologies, and gives clear instructions on how to incorporate attention into YOLOv5.

SENet

SENet¹ (Squeeze-and-Excitation Networks) won the image classification task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2017) and essentially pioneered the mechanism of channel attention in computer vision. An SE block consists of two parts: the squeeze operation (Fsq(·)) and the excitation operation (Fex(·, W)). The main purpose of adding an SE block to a deep CNN is to model the importance of each channel of the input tensor and re-weight the feature maps accordingly, so the network learns more relevant features, thereby increasing the recognition capability of the detection model. What is most beneficial about SENet is that it is small, and often achieves a better performance-to-cost ratio than simply increasing the number of parameters in the network.

Fsq(·): Following a convolution operation, the input to the SE block is a 4-dimensional feature tensor of size B × H × W × C (B = batch size, H and W = height and width of each feature map, C = number of channels). The squeeze operation extracts the global information from each channel of the feature map through a global average pooling (GAP) operation, which reduces the shape of the input feature map to B × 1 × 1 × C. Hence each feature map of size H × W is reduced to a single scalar.

Fex(·, W): Following the squeeze operation, the excitation consists of two successive fully-connected (FC) layers forming a small multi-layer perceptron (MLP). A ReLU activation is applied after the first FC layer and a Sigmoid after the second, producing a per-channel weight vector in [0, 1]. Finally, this weight vector is multiplied channel-wise with the original feature map, rescaling each channel by its learned importance; the output keeps the original shape B × H × W × C.

The transformation of the attention vector: B × 1 × 1 × C → B × 1 × 1 × C/r → B × 1 × 1 × C, where r is the reduction ratio that shrinks the bottleneck of the MLP.
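Putting the squeeze and excitation together, below is a minimal PyTorch sketch of an SE block. The class and variable names are my own, not taken from the official code, and note that PyTorch orders tensors as B × C × H × W rather than the B × H × W × C used above.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: c = channel count, r = reduction ratio."""
    def __init__(self, c, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)   # Fsq: GAP down to B x C x 1 x 1
        self.excitation = nn.Sequential(         # Fex: FC -> ReLU -> FC -> Sigmoid
            nn.Linear(c, c // r, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(c // r, c, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)            # B x C channel descriptor
        w = self.excitation(w).view(b, c, 1, 1)   # per-channel weights in [0, 1]
        return x * w                              # rescale; shape is unchanged
```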

Coordinate Attention

Squeeze-and-Excitation networks have shown that channel attention is incredibly useful for improving model performance. However, SENet's 2D GAP operation squeezes each feature map down to a single value, which means spatial information is lost. Coordinate Attention encodes both channel and spatial information into its attention vectors by factorising the global pooling into two 1D pooling operations, one along the height axis and one along the width axis. Coordinate Attention is similar in spirit to CBAM, but improves the modelling power by capturing long-range interdependencies along one spatial direction while preserving precise positional information along the other.
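Below is a PyTorch sketch of a Coordinate Attention block, paraphrased from the official repository. The (inp, oup, reduction) signature follows that repository; the original uses a hard-swish activation, which I approximate here with nn.Hardswish, and exact details may differ between commits.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate Attention: pools along H and W separately so that positional
    information survives the squeeze step."""
    def __init__(self, inp, oup, reduction=32):
        super().__init__()
        mip = max(8, inp // reduction)                 # width of the shared bottleneck
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # B x C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # B x C x 1 x W
        self.conv1 = nn.Conv2d(inp, mip, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mip)
        self.act = nn.Hardswish()                      # stand-in for the paper's hard swish
        self.conv_h = nn.Conv2d(mip, oup, kernel_size=1)
        self.conv_w = nn.Conv2d(mip, oup, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                           # B x C x H x 1
        x_w = self.pool_w(x).permute(0, 1, 3, 2)       # B x C x W x 1
        y = torch.cat([x_h, x_w], dim=2)               # share one 1x1 conv across both
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # B x C x H x 1
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # B x C x 1 x W
        return x * a_h * a_w                           # attend over both spatial axes
```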

YOLOv5 Object Detection model overview (Version 6.1)

YOLOv5 is an open-source object detection model provided by Ultralytics and is the model used for detection in this project. YOLOv5 is in constant development, hence the latest release is used here: version 6.1, released on 22nd February 2022. The YOLOv5 architecture has changed in various ways since the first release in 2020, so many of the diagrams used in YOLOv5-related research papers are outdated and misleading today. See below a diagram I created that clearly demonstrates the architecture of YOLOv5 version 6.1. The diagram divides the architecture into Backbone, Neck, and Head. The flow of information is shown by the black arrows, beginning at the input image and ending with the prediction. The entire YOLO family of object detection models treats detection as a single regression problem, where the whole input image is fed through the network once and a prediction is made. For this reason, YOLOv5 is fast at making predictions, as one pass of the network is sufficient.

The architecture of YOLOv5 version 6.1 divided into three parts: Backbone, Neck and Head.

YOLOv5 Neck and Head

The YOLOv5 head is the last stage of the network pipeline, is responsible for making the final predictions, and is identical to the head used in YOLOv3 and YOLOv4. The head applies anchor boxes to the features and generates the final output vectors with class probabilities, objectness scores, and bounding boxes. The neck is mainly used to generate feature pyramids, which help the model identify the same object at different sizes and scales (object scaling) and therefore generalise well to unseen data. The YOLOv5 neck is a CSP-PANet, which uses PANet as its neck architecture and consists of 4 C3 modules, 3 Conv modules, and 2 Upsample modules. Furthermore, 2 Concat modules connect the neck to lower-level feature maps in the backbone, which adds to the performance of multi-scale object detection.

YOLOv5 Backbone

The model backbone is mainly used to extract rich and informative features from the input image. YOLOv5 uses CSP-Darknet-53 as its backbone, which is based on CSPNet. CSPNet has shown significant improvements in processing time with deeper networks compared to other backbones, which is one of the reasons YOLOv5 has low inference times. The YOLOv5 backbone consists of 5 Conv layers, 4 C3 modules, and a single SPP (Spatial Pyramid Pooling) module at the end of the backbone, connecting the backbone to the neck of the network. Before version 6.0, YOLOv5 had a Focus module instead of the first Conv module; this was replaced in version 6.0 by an equivalent Conv module with a kernel size of 6, a stride of 2, and a padding of 2 (6×6 Conv2d). Version 6.0 also swapped the last layer of the backbone from an SPP module to an SPPF module, which yields roughly twice the inference speed of its predecessor due to fewer operations.
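As a quick sanity check of the replacement stem, the sketch below (my own, not from the YOLOv5 source) shows that the 6×6/stride-2/padding-2 convolution halves the spatial resolution exactly as the old Focus module did:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 640, 640)  # a typical 640 x 640 RGB input

# The v6.0+ stem: a single 6x6 convolution, stride 2, padding 2
stem = nn.Conv2d(3, 64, kernel_size=6, stride=2, padding=2)

print(stem(x).shape)  # torch.Size([1, 64, 320, 320]) -- same halving Focus produced
```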

Adding SENet and/or Coordinate Attention into YOLOv5 (instructions)

Editing a large open-source code-base like YOLOv5 while keeping everything working can be intimidating, so my goal here is to make the process as simple as possible for the reader.

You must ensure that the attention mechanisms are located in the models/common.py file, as this is the location of all other modules, such as the Conv, C3 and SPPF modules described previously. For the PyTorch implementation of the attention mechanisms, see the code repository associated with the original Coordinate Attention paper (which can be found via Google Scholar). In order to train YOLOv5 using the attention module, an argument must also be passed to train.py so that the modified model configuration is actually built, for example by supplying the edited model YAML via the --cfg flag.
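As a quick check that the module really lives in models/common.py and runs, something like this can be used (a sketch assuming the CoordAtt class from earlier has been pasted into that file):

```python
# Assumes the CoordAtt class above was pasted into models/common.py
import torch
from models.common import CoordAtt

m = CoordAtt(512, 512)               # input and output channels must match
x = torch.randn(1, 512, 20, 20)      # a P5-sized feature map
print(m(x).shape)                    # torch.Size([1, 512, 20, 20]) -- shape preserved
```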

Without this change, the model will still recognise the mechanism but will not use it in training, which essentially renders it useless. models/yolo.py was also changed to include the attention mechanism on line 269, and the attention module was added to the backbone architecture.
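For reference, the spot in models/yolo.py (around line 269 in v6.1) is the membership test inside parse_model() that resolves each module's channel arguments. Below is a hedged sketch of the edit; the exact tuple may differ slightly between commits, and an SE block would need a (c1, c2, r) signature like CoordAtt's before it could be added the same way.

```python
# models/yolo.py, inside parse_model(): add the attention class to this tuple
# so its input/output channel arguments are resolved like any other module.
if m in (Conv, GhostConv, Bottleneck, GhostBottleneck, SPP, SPPF, DWConv, MixConv2d,
         Focus, CrossConv, BottleneckCSP, C3, C3TR, C3SPP, C3Ghost, CoordAtt):
    c1, c2 = ch[f], args[0]
```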

Finally, the choice of where to add the attention layer is yours (the options are backbone, neck or head). I went with the backbone; to do so, edit the YOLOv5 model YAML file so the architecture includes the attention module (see the code below for what you need to add).
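As a hedged example of the YAML edit, here is the yolov5s.yaml (v6.1) backbone with a CoordAtt layer inserted just before SPPF, CoordAtt being the class name used in the sketches above:

```yaml
# yolov5s.yaml (v6.1) backbone with a CoordAtt layer added before SPPF
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],   # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],     # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],     # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],     # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],    # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, CoordAtt, [1024]],      # 9 -- added attention layer
   [-1, 1, SPPF, [1024, 5]],       # 10 (was 9)
  ]
```

Note that inserting a layer shifts every subsequent index by one, so the `from` references in the head section that point at layer 9 or later (in the stock file these are the 10, 14 and [17, 20, 23] entries) must each be incremented; references to backbone layers 4 and 6 are unaffected.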

Final Thoughts

Personally, I have seen better results with the Coordinate Attention module, and I would suggest experimenting with the code parameters before assuming you have achieved the best model possible. Feel free to message me for code review or help with anything discussed in this post ;)
