PANet: Path Aggregation Network In YOLOv4

Miracle R
Published in Clique Community · 6 min read · Jul 3, 2020

Image segmentation is one of the most important computer vision processes: it partitions an image into multiple smaller segments so that its representation becomes simpler and further analysis easier. Its applications range from locating tumors in medical images and biometric identification to machine vision for object detection. Image segmentation is divided into two main parts: semantic segmentation and instance segmentation.

Semantic segmentation refers to the process of classifying the pixels of an image into meaningful classes of objects, such as sky, road, or bus.

Instance segmentation consists of identifying, classifying, and localizing the individual instances (objects) present in an image at the pixel level, and it requires retaining the finest features present in the image. It is one of the most demanding tasks related to object detection. Previously, Mask R-CNN [4] was the most popular technique for instance segmentation. YOLO, the best-known family of single-shot detectors, used the Feature Pyramid Network (FPN) [2] in its third version, YOLOv3 [5], to aggregate features. The latest version, YOLOv4 [3], instead adopts the Path Aggregation Network [1] (PANet, or just PAN for short), which was originally proposed for instance segmentation. Let's understand this technique in a bit of detail.

PANet:

PANet is located in the neck of the YOLOv4 model, where it is incorporated mainly to enhance instance segmentation by preserving spatial information.

Source: YOLOv4 — Part 3: Bag of Specials | VisionWizard

Properties of PANet:

PANet was chosen for instance segmentation in YOLOv4 because of its ability to preserve spatial information accurately, which helps localize pixels properly for mask formation.

PANet architecture (Source)

The properties which make PANet so accurate are:

1. Bottom-up Path Augmentation:

Source: Path Aggregation Network for Instance Segmentation

As the image passes through the layers of a neural network, the complexity of the features increases while the spatial resolution decreases. As a result, pixel-level masks cannot be localized accurately from high-level features alone.

FPN, which is used in YOLOv3, adds a top-down path to combine semantically rich high-level features with precise low-level localization information. But the path from the low-level features up to the topmost levels can be excessively long, spanning 100+ layers of the backbone, so fine spatial information weakens as it propagates.

PANet, on the other hand, adds a bottom-up path on top of the top-down path taken by FPN. Clean lateral connections from the lower layers shorten the route to the top: this "shortcut" is fewer than 10 layers long. A minimal sketch of the idea follows.
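To make the two paths concrete, here is a minimal PyTorch-style sketch of FPN's top-down fusion followed by PANet's bottom-up augmentation. The channel counts, layer names (lateral, downsample), and use of 1×1 lateral convs with stride-2 3×3 downsampling convs are illustrative assumptions, not taken from any particular implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class PathAggregation(nn.Module):
    """Sketch: FPN top-down fusion + PANet bottom-up path augmentation."""

    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project backbone features to a common width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        # Stride-2 3x3 convs implement the bottom-up downsampling steps.
        self.downsample = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3,
                      stride=2, padding=1)
            for _ in in_channels[:-1]
        )

    def forward(self, c3, c4, c5):
        # FPN top-down path: semantics flow from deep levels to shallow ones.
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")

        # PANet bottom-up augmentation: localization flows back up.
        # Each step is a single stride-2 conv, so fine spatial detail from
        # p3 reaches the top level in just a few layers (the "shortcut").
        n3 = p3
        n4 = p4 + self.downsample[0](n3)
        n5 = p5 + self.downsample[1](n4)
        return n3, n4, n5
```

The point of the sketch is the second block: instead of fine detail having to climb back up through the entire backbone, it reaches the top level through a couple of stride-2 convolutions.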

2. Adaptive Feature Pooling:

Source: Path Aggregation Network for Instance Segmentation

Previously used techniques like Mask R-CNN made mask predictions from the features of a single pyramid level, chosen according to the size of the region of interest: ROIAlign pooling extracted features from higher levels for larger proposals. Although quite accurate, this can still lead to undesirable results, since two proposals differing by as little as 10 pixels can be assigned to two different levels, even though they are in fact quite similar proposals.

To avoid this, PANet pools features from all the levels and lets the network decide which ones are useful. It performs the ROIAlign operation on each feature map to extract features for each proposal, then applies an element-wise max fusion so the network can adaptively combine the features.
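A minimal sketch of adaptive feature pooling, using torchvision's roi_align. The pyramid strides (8, 16, 32) are an assumption for the demo, and where the PANet paper fuses the levels after the first layer of the prediction head, this sketch fuses the pooled grids directly for brevity.

```python
import torch
from torchvision.ops import roi_align

def adaptive_feature_pooling(feature_maps, boxes, output_size=7):
    """Pool every proposal from every pyramid level, then fuse with max.

    feature_maps: list of (1, C, H_l, W_l) tensors, finest level first,
                  assumed to have strides 8, 16, 32.
    boxes: (N, 4) tensor of proposals in image coordinates.
    """
    strides = [8, 16, 32]  # assumed pyramid strides
    pooled_per_level = []
    for fmap, stride in zip(feature_maps, strides):
        # RoIAlign maps each box onto this level's resolution.
        pooled = roi_align(
            fmap, [boxes], output_size=output_size,
            spatial_scale=1.0 / stride, sampling_ratio=2,
        )  # -> (N, C, output_size, output_size)
        pooled_per_level.append(pooled)
    # Element-wise max fusion lets the network pick, per position,
    # whichever level carries the strongest response.
    return torch.stack(pooled_per_level, dim=0).max(dim=0).values
```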

3. Fully-Connected Fusion:

Source: Path Aggregation Network for Instance Segmentation

In Mask R-CNN, a Fully Convolutional Network (FCN) is used instead of fully-connected layers to predict masks, because it preserves spatial information and reduces the number of parameters in the network. However, since its parameters are shared across all spatial positions, the model cannot exploit absolute pixel locations when making predictions; it cannot learn, for example, that sky tends to appear in the top part of an image and roads in the bottom part.

Fully-connected layers, on the other hand, are location-sensitive and can adapt to different spatial locations.

PANet fuses the outputs of both kinds of layers to produce a more accurate mask prediction.
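A minimal sketch of such a fused mask head, assuming illustrative sizes (a 14×14 RoI, 256 channels, 80 classes). In the paper the fc branch taps an intermediate layer of the conv branch; here it reads the RoI features directly for brevity.

```python
import torch.nn as nn

class FusedMaskHead(nn.Module):
    """Sketch: conv (FCN) branch predicts per-class masks, fc branch
    predicts one class-agnostic mask, and the two are summed."""

    def __init__(self, in_channels=256, num_classes=80, roi=14):
        super().__init__()
        self.fcn = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1),      # per-class mask logits
        )
        self.out = roi * 2                        # upsampled mask side
        # The fc branch flattens the RoI, so every output unit sees the
        # whole region and can use absolute position (location-sensitive).
        self.fc = nn.Linear(in_channels * roi * roi, self.out * self.out)

    def forward(self, x):                         # x: (N, C, roi, roi)
        conv_mask = self.fcn(x)                   # (N, classes, 2*roi, 2*roi)
        fc_mask = self.fc(x.flatten(1))           # (N, (2*roi)**2)
        fc_mask = fc_mask.view(-1, 1, self.out, self.out)
        return conv_mask + fc_mask                # broadcast over classes
```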

Modifications for YOLOv4:

PANet conventionally adds the neighbouring feature maps together when performing adaptive feature pooling. This approach is slightly tweaked when PANet is employed in YOLOv4: instead of adding the neighbouring layers, a concatenation operation is applied to them, which improves the accuracy of predictions.

Source: YOLOv4: Optimal Speed and Accuracy of Object Detection
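The difference is easy to see on two feature maps of the same spatial size; the tensor shapes below are illustrative. Addition mixes the two inputs with fixed 1:1 weights, while concatenation keeps both feature sets intact and lets the following convolution learn the mixing, at the cost of a wider tensor.

```python
import torch

# Two neighbouring PAN levels brought to the same spatial size
a = torch.randn(1, 256, 32, 32)
b = torch.randn(1, 256, 32, 32)

fused_add = a + b                     # original PANet: (1, 256, 32, 32)
fused_cat = torch.cat([a, b], dim=1)  # YOLOv4 variant: (1, 512, 32, 32)

print(fused_add.shape, fused_cat.shape)
```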

Performance Analysis:

Using a ResNet-50 backbone and training with multi-scale images, PANet already outperforms Mask R-CNN and the winner of the 2016 COCO challenge. It went on to win the 2017 COCO Instance Segmentation challenge and ranked second in the Object Detection task, without large-batch training.

Source: Path Aggregation Network for Instance Segmentation

It also consistently outperforms Mask R-CNN on the Cityscapes dataset, and when pre-trained on COCO, the model outperforms Mask R-CNN by 4.4 points.

Source: Path Aggregation Network for Instance Segmentation

Due to its simple implementation and high performance, PANet was employed in YOLOv4, contributing to a detector that is more accurate and roughly twice as fast as EfficientDet at comparable performance.

Source: YOLOv4: Optimal Speed and Accuracy of Object Detection

In terms of AP, YOLOv4 achieves 43.5% AP (65.7% AP₅₀) on the MS COCO dataset at a real-time speed of ~65 FPS on a Tesla V100, making it one of the fastest and most accurate detectors of its time. Compared with YOLOv3, YOLOv4 improves AP and FPS by roughly 10% and 12% respectively, and the replacement of FPN with PANet is among the changes behind this gain.

Source: YOLOv4: Optimal Speed and Accuracy of Object Detection

Conclusion:

PANet is fast, simple, and very effective. Its components boost information propagation through the pipeline: it pools features from all pyramid levels, shortens the distance between the lowest and topmost layers, and enriches the features of each level through augmented paths.

It has shown excellent results in YOLOv4, considerably strengthening its feature aggregation and securing its place in the neck of the model.

REFERENCES:

[1] S. Liu, L. Qi, H. Qin, J. Shi and J. Jia, “Path Aggregation Network for Instance Segmentation”, arXiv.org, 2020. [Online]. Available: https://arxiv.org/abs/1803.01534. [Accessed: 25- Jun- 2020].

[2] T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan and S. Belongie, “Feature Pyramid Networks for Object Detection”, Openaccess.thecvf.com, 2020. [Online]. Available: http://openaccess.thecvf.com/content_cvpr_2017/html/Lin_Feature_Pyramid_Networks_CVPR_2017_paper.html. [Accessed: 01- Jul- 2020].

[3] A. Bochkovskiy, C. Wang and H. Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection”, arXiv.org, 2020. [Online]. Available: https://arxiv.org/abs/2004.10934v1. [Accessed: 01- Jul- 2020].

[4] K. He, G. Gkioxari, P. Dollar and R. Girshick, "Mask R-CNN", Openaccess.thecvf.com, 2020. [Online]. Available: https://openaccess.thecvf.com/content_ICCV_2017/papers/He_Mask_R-CNN_ICCV_2017_paper.pdf. [Accessed: 01- Jul- 2020].

[5] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement”, arXiv.org, 2020. [Online]. Available: https://arxiv.org/abs/1804.02767. [Accessed: 01- Jul- 2020].
