My Experiments with Yolov5: Almost everything you want to know about Yolov5 Series - Part 2

Manjusha Sithik
8 min read · Dec 26, 2023


Image by DALL.E

Changing the scaling-related architecture to improve small object detection

Let's start with a quick introduction to the Yolov5 architecture.

Yolo is a single-stage detector: it processes the entire image with a single neural network, divides it into regions, and predicts bounding boxes and class probabilities for each region. These bounding boxes are weighted by the predicted probabilities. The method "looks only once" at the image in the sense that it makes predictions after a single forward pass through the network; it then delivers the detected objects after non-max suppression, which ensures each object is reported only once. Yolov5 models consist of three main architectural blocks.

1. Backbone

Yolov5 employs CSPDarknet, a backbone built from cross-stage partial (CSP) networks, for feature extraction from images. CSPDarknet draws on DenseNet, which was designed to reduce the vanishing-gradient problem and strengthen feature propagation, while reusing features and reducing the number of network parameters.
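In the config files this backbone is declared layer by layer. Below is the v6.0 backbone section of yolov5s.yaml as I recall it (a reference sketch from memory; check models/yolov5s.yaml in the repo for the exact file). The P1/2 ... P5/32 comments mark the stride of each stage, and these labels come up again further down:

# YOLOv5 v6.0 backbone (yolov5s.yaml, sketch from memory)
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],   # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],     # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],     # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],     # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],    # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]],       # 9
  ]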

2. Neck

The neck contains a feature pyramid network (FPN) and a path aggregation network (PANet). The FPN works top-down, using up-sampling to transfer and fuse information and produce the predicted feature maps; PANet adds a bottom-up path that carries localization information upward and fuses information from different backbone layers, further enhancing the detection capability.

3. Head

The head is responsible for the final predictions generated from the anchor boxes. Its tasks mainly relate to non-max suppression and the bounding-box loss function (GIoU). The detailed architecture of Yolov5 can be seen in Figure 1 - Yolov5 Architecture.

https://github.com/ultralytics/yolov5/issues/280
Architecture implementation

Let's try to understand it starting from the head. The final output is drawn from three inputs: P3 small (layer #17), P4 medium (layer #20) and P5 large (layer #23), where 17, 20 and 23 are the layer numbers in the architecture if you compare with the diagram (the head section of the config is reproduced just below for reference). The main difference between them is scale: P3 is the small scale (P3/8), P4 is the medium scale (P4/16) and P5 is the large scale (P5/32). These layers form the neck of the architecture, where the extracted features are scaled in multiples of 8. The P2 output is not passed to the detector head.
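To make those layer numbers concrete, here is the v6.0 head section of yolov5s.yaml as I recall it (again a sketch from memory, so check the repo for the exact file). Layers 17, 20 and 23 are the ones fed to Detect:

# YOLOv5 v6.0 head (yolov5s.yaml, sketch) - FPN + PAN neck plus Detect
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],           # cat backbone P4
   [-1, 3, C3, [512, False]],           # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],           # cat backbone P3
   [-1, 3, C3, [256, False]],           # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],          # cat head P4
   [-1, 3, C3, [512, False]],           # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],          # cat head P5
   [-1, 3, C3, [1024, False]],          # 23 (P5/32-large)

   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]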

A very important point to note here: the P2 feature map is the largest, followed by P3, P4 and P5 respectively.

What are P2, P3, P4, P5? They are the feature maps at different strides; the bigger the stride, the smaller the resulting feature map. The 2 in P2 is the power of two used for the stride, i.e. 2² = 4. Similarly, P3 corresponds to a stride of 2³ = 8, P4 to 2⁴ = 16, and P5 to 2⁵ = 32.

If you look at the notation in the backbone, it says P2/4: for a 64x64 input, the P2 output will be of size 16x16 (64/4), P3/8 would be 8x8 (64/8), P4/16 would be 4x4 (64/16) and P5/32 would be 2x2 (64/32). In summary, P3, P4 and P5 are the same object features at different scales, and this information can be used for bounding-box refinement.

Why is this important? To deal with multi-scale object detection. The model generates multiple feature maps from the same feature by rescaling at different ratios. For example, from a 64x64 feature the model produces 8x8, 4x4 and 2x2 versions and learns from all of them, so that when an object shows up at a small scale (say 8x8) it can still be detected easily.

If you want to feed P2 into the final output as well, you just have to include the input from step 13 in the detector head, as below:

[[13, 17, 20, 23], 1, Detect, [nc, anchors]], # Detect(P2, P3, P4, P5)

Adding the P2/4 output like this is usually done to help with very small objects, since P2 is the highest-resolution feature map, but it also increases the amount of computation in the head. For this use case I am leaving the option out and instead improving small-object detection through the neck, with BiFPN.

BiFPN

BiFPN was introduced in EfficientDet to improve how low-level information is aggregated before being passed to the classifier/head of the detector. The neck takes the features extracted by the backbone, aggregates them and feeds them to the head; the more information it can pass along, the better the model becomes. Initially FPN, which is one-directional (top-down), was used for multi-scale object detection. Later the bidirectional Path Aggregation Network (PANet) was introduced, which adds a bottom-up path to FPN so that low-level information also reaches the higher layers. BiFPN optimizes this network further: it removes nodes that have only one input, and it adds an extra edge from the original input to the output node when they are at the same level, so that more features are fused (see the FPN / PANet / BiFPN architecture figure below). This makes it lighter and more efficient than PANet and FPN.

Yolov5 uses PAN + FPN as its feature-integration modules, and this study experiments with replacing them with BiFPN to improve small-object detection. BiFPN aggregates the maximum amount of features extracted by the backbone: it limits the information loss from earlier levels by adding or concatenating one extra input feature map directly from the same level of the backbone. Features of small objects are preserved and fed to the detector head this way, contributing to better classification.

Bidirectional PANet (https://arxiv.org/pdf/1803.01534.pdf)

Yolov5 has only three feature maps, P3, P4 and P5 (strides 8, 16 and 32 respectively), going from the backbone to the head, whereas BiFPN uses five feature maps, P3 through P7. Hence, when fusing BiFPN into Yolov5, P6 and P7 are dropped. P3 and P5, being the first and last feature maps, don't get inputs from the same-level layer; only P4 gets an extra input directly from the same level of the backbone, skipping the intermediate layers and preventing the loss of small-object information.

Yolo + BiFPN

That is exactly what is implemented in the BiFPN version of Yolov5. In layer 19, along with the output of layer 14, layer 6 (the backbone P4) is also concatenated as an input, as highlighted below. That is the only change needed to fuse BiFPN into the Yolov5 architecture! Amazing, isn't it?

Yolov5 with BiFPN
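In text form, the head of yolov5-bifpn.yaml (in models/hub) looks roughly like this, with the extra P4 edge on layer 19. This is a sketch from memory, so check the repo file for the exact version:

# YOLOv5 v6.0 BiFPN head (yolov5-bifpn.yaml, sketch)
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],           # cat backbone P4
   [-1, 3, C3, [512, False]],           # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],           # cat backbone P3
   [-1, 3, C3, [256, False]],           # 17 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14, 6], 1, Concat, [1]],       # 19 - cat P4, plus backbone P4 <-- the BiFPN change
   [-1, 3, C3, [512, False]],           # 20 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],          # cat head P5
   [-1, 3, C3, [1024, False]],          # 23 (P5/32-large)

   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]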

Result

This method had a large positive impact on the overall performance, especially on the Without_Helmet class, which is our main class of interest. Precision of this class improved by 6%, recall by 3% and mAP50 by 6%. The With_Helmet class recall went from 47% to 61%, a remarkable improvement of 14%, and its mAP50 rose by 11%. The overall recall reached 59% from 51%, and the overall mAP50 jumped from 53% to 62%, a difference of 9%. It is worth noting that the precision of the With_Helmet class was the only value that dropped, by 7%. The ratios of correctly identified With_Helmet and Without_Helmet instances improved by 0.07 and 0.04 respectively. Detailed results are shown in Figure - BiFPN results.

BiFPN result
Baseline

There are other variant architectures fused into Yolov5 that are readily available to use alongside BiFPN (see https://github.com/ultralytics/yolov5/tree/master/models/hub for the full list of implementations).

yolov5-p7.yaml, for example, shows how the P3, P4, P5, P6 and P7 feature-map outputs are concatenated in the detector head, as given below. Layers 9 and 11 are added to the backbone to define P6 and P7 respectively.

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],   # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],     # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],     # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],     # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [768, 3, 2]],     # 7-P5/32
   [-1, 3, C3, [768]],
   [-1, 1, Conv, [1024, 3, 2]],    # 9-P6/64
   [-1, 3, C3, [1024]],
   [-1, 1, Conv, [1280, 3, 2]],    # 11-P7/128
   [-1, 3, C3, [1280]],
   [-1, 1, SPPF, [1280, 5]],       # 13
  ]

# YOLOv5 v6.0 head with (P3, P4, P5, P6, P7) outputs
head:
  [[-1, 1, Conv, [1024, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 10], 1, Concat, [1]],     # cat backbone P6
   [-1, 3, C3, [1024, False]],     # 17

   [-1, 1, Conv, [768, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 8], 1, Concat, [1]],      # cat backbone P5
   [-1, 3, C3, [768, False]],      # 21

   [-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],      # cat backbone P4
   [-1, 3, C3, [512, False]],      # 25

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],      # cat backbone P3
   [-1, 3, C3, [256, False]],      # 29 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 26], 1, Concat, [1]],     # cat head P4
   [-1, 3, C3, [512, False]],      # 32 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 22], 1, Concat, [1]],     # cat head P5
   [-1, 3, C3, [768, False]],      # 35 (P5/32-large)

   [-1, 1, Conv, [768, 3, 2]],
   [[-1, 18], 1, Concat, [1]],     # cat head P6
   [-1, 3, C3, [1024, False]],     # 38 (P6/64-xlarge)

   [-1, 1, Conv, [1024, 3, 2]],
   [[-1, 14], 1, Concat, [1]],     # cat head P7
   [-1, 3, C3, [1280, False]],     # 41 (P7/128-xxlarge)

   [[29, 32, 35, 38, 41], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5, P6, P7)
  ]

Does it improve performance on your small-object detection use case? Please try it yourself and let me know!

NOTE: A transformer fusion is also available (yolov5-transformer.yaml).

That brings us to the end of Part 2. Part 3 can be found here: https://medium.com/@manjusha.bs/my-experiments-with-yolov5-almost-everything-you-want-to-know-about-yolov5-series-part-3-8d930eb36677

Part 1: https://medium.com/@manjusha.bs/my-experiments-with-yolov5-almost-everything-you-want-to-know-about-yolov5-series-part-1-b38feef5359a



Manjusha Sithik

A Data Scientist Passionate about Computer Vision and Time Series Forecasting