“FCOS”, One-Stage Anchor-Free Object Detection

Neil Wu
Published in LSC PSD
4 min read · Dec 24, 2019
No More Anchor (Photo by Stogafy)

Introduction

Computer vision has been studied for years across many tasks and is now a well-studied field, and object detection may be its hottest subject. Anchor-based one-stage object detection models such as SSD and YOLOv2 have dominated the subject for years. However, detection that relies on preset anchors comes with multiple problems. A number of anchor-free object detectors (e.g. DenseBox) have therefore been published, but their performance was mediocre.

Today I’m going to introduce an anchor-free object detection model named FCOS (FCOS: Fully Convolutional One-Stage Object Detection), which outperforms anchor-based models. Other recent anchor-free detectors such as FoveaBox and CenterNet also outperform anchor-based ones; I simply feel that the concepts behind FCOS make more sense :D.

Anchor-Free

As mentioned earlier, anchor-based object detection has some unsolved issues.

  1. The number of hyperparameters to set
    Anchor-based models need anchors to be set manually. Anyone who has tried to tune these hyperparameters knows how painful it is to decide the aspect ratios and scales for each feature map. Anchor-free models don’t need that.
  2. Imbalance between positive and negative samples
    Anchor-based models assign positive boxes (boxes containing an object) by computing the IoU between each anchor box and the ground-truth boxes. However, an anchor-based model like SSD produces roughly 9,000 boxes at a time and labels only a few of them positive while labeling most as negative (background). While Focal Loss (RetinaNet) re-weights the loss to reduce the effect of this imbalance, the problem remains. Anchor-free models balance the samples from the very beginning, when positives and negatives are chosen, which leads to a better recall rate.
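To see where the imbalance in point 2 comes from, here is a minimal NumPy sketch of IoU-based anchor assignment (the 0.5 threshold and the tiny anchor set are illustrative, not SSD’s exact configuration):

```python
import numpy as np

def iou(boxes, gt):
    """IoU between N anchor boxes and one ground-truth box, all (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes[:, 0], gt[0])
    y1 = np.maximum(boxes[:, 1], gt[1])
    x2 = np.minimum(boxes[:, 2], gt[2])
    y2 = np.minimum(boxes[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_g - inter)

anchors = np.array([[0, 0, 10, 10], [5, 5, 15, 15], [50, 50, 60, 60]], dtype=float)
gt_box = np.array([4, 4, 14, 14], dtype=float)
positive = iou(anchors, gt_box) >= 0.5   # only well-overlapping anchors become positive
print(positive)  # [False  True False]
```

Scale this up to thousands of anchors per image and the negatives vastly outnumber the positives.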

Model Structure

In general, FCOS uses a Feature Pyramid Network (FPN) to create feature maps, and adds a head after every feature map to train classification, bounding-box regression, and a novel index called center-ness.
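A rough PyTorch sketch of one such head, shared across FPN levels, might look like this (the class name is mine; channel counts and the four 256-channel convs per branch follow the usual RetinaNet/FCOS convention, but treat the details as an assumption rather than the paper’s exact code):

```python
import torch
import torch.nn as nn

class FCOSHead(nn.Module):
    """Shared head applied to every FPN level: classification,
    box regression (l, t, r, b), and a center-ness branch."""
    def __init__(self, in_channels=256, num_classes=80, num_convs=4):
        super().__init__()
        def tower():
            layers = []
            for _ in range(num_convs):
                layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                           nn.GroupNorm(32, in_channels),
                           nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)
        self.cls_tower = tower()
        self.box_tower = tower()
        self.cls_logits = nn.Conv2d(in_channels, num_classes, 3, padding=1)
        self.box_pred = nn.Conv2d(in_channels, 4, 3, padding=1)    # (l, t, r, b)
        self.centerness = nn.Conv2d(in_channels, 1, 3, padding=1)  # center-ness branch

    def forward(self, feature):
        cls_feat = self.cls_tower(feature)
        box_feat = self.box_tower(feature)
        return (self.cls_logits(cls_feat),
                torch.exp(self.box_pred(box_feat)),  # distances must be positive
                self.centerness(box_feat))

head = FCOSHead()
cls, box, ctr = head(torch.randn(1, 256, 32, 32))
print(cls.shape, box.shape, ctr.shape)
```

Every spatial location of every level thus predicts a class score, four edge distances, and one center-ness value.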

Label Mapping

Left: FCOS is using (l, t, r, b) annotation. Right: ambiguous sample annotation

For every point on a feature map, if the point falls inside a ground-truth bounding box, it is treated as a positive sample. If a point falls inside multiple ground-truth bounding boxes, it is treated as an ambiguous sample and is assigned to the ground-truth box with the minimal area among them.
Ambiguous samples might look like a problem, but in fact their share can be reduced to ~4% after applying FPN.

Unlike other object detection models, the regression target of a bounding box has been changed to the distances from the point to the four edges of the box, (l, t, r, b), instead of corner coordinates (x1, y1, x2, y2). By doing so, FCOS is able to pick more positive samples and reduce the imbalance.
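As a sketch, the (l, t, r, b) target for a point (x, y) in image coordinates is just its distance to each edge of the box (the function name is mine, not from the paper’s code):

```python
import numpy as np

def ltrb_targets(points, gt_box):
    """Distances from each (x, y) point to the four edges of a
    ground-truth box (x1, y1, x2, y2); all four positive => point is inside."""
    x, y = points[:, 0], points[:, 1]
    x1, y1, x2, y2 = gt_box
    l, t = x - x1, y - y1
    r, b = x2 - x, y2 - y
    targets = np.stack([l, t, r, b], axis=1)
    inside = targets.min(axis=1) > 0  # positive sample iff the point lies in the box
    return targets, inside

points = np.array([[50.0, 40.0], [5.0, 5.0]])
targets, inside = ltrb_targets(points, (10.0, 10.0, 90.0, 70.0))
print(targets)
print(inside)  # [ True False]
```

Any point inside the box yields a valid target, which is why FCOS gets far more positive samples than IoU-matched anchors do.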

Feature Pyramid Network(FPN)

Structure of FPN (Photo by Sik-Ho Tsang)

A Feature Pyramid Network is a method for creating multiple levels of feature maps. It is designed to share the rich semantics of the higher-level feature maps with all the other levels: by upsampling a high-level feature map and adding it to a lower one, the lower-level feature maps become semantically strong as well. In addition, each level detects objects of a different size range, with higher levels in charge of bigger objects and lower levels in charge of smaller ones, dividing the task among the layers.
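The top-down pathway can be sketched in a few lines of PyTorch (a toy version under my own assumptions: three backbone stages, illustrative channel counts, nearest-neighbor upsampling):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Top-down FPN: 1x1 lateral convs unify channels, then each level is
    upsampled and added to the level below, followed by a 3x3 smoothing conv."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):  # feats: [C3, C4, C5], highest resolution first
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down: add upsampled level above
            laterals[i] = laterals[i] + F.interpolate(laterals[i + 1], scale_factor=2)
        return [s(x) for s, x in zip(self.smooth, laterals)]

fpn = TinyFPN()
c3 = torch.randn(1, 256, 32, 32)
c4 = torch.randn(1, 512, 16, 16)
c5 = torch.randn(1, 1024, 8, 8)
p3, p4, p5 = fpn([c3, c4, c5])
print(p3.shape, p4.shape, p5.shape)  # all 256 channels, resolutions 32/16/8
```

Each output level keeps its own resolution but now carries semantics from the levels above it.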

For more details on FPN, I recommend this review written by Jonathan Hui:

Center-ness

FCOS treats every point in a ground-truth box as a positive sample. That results in a lot of low-quality predicted bounding boxes produced by locations far away from the center of an object.
To suppress such predicted boxes, the authors added an effective index called center-ness. Center-ness describes how far a point is from the center of its ground-truth box, and it is added as a branch after the feature maps. In terms of the (l, t, r, b) targets it is defined as:

centerness = sqrt( (min(l, r) / max(l, r)) × (min(t, b) / max(t, b)) )

The range of center-ness is 0~1. During training, the center-ness branch produces a Loss_center computed with BCE loss (binary cross-entropy). At inference time, the center-ness is multiplied into the classification score to down-weight the low-quality bounding boxes, so they can easily be removed by non-maximum suppression.
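The definition above takes only a few lines (the function name is mine):

```python
import math

def centerness(l, t, r, b):
    """Center-ness of a point from its (l, t, r, b) regression targets:
    1.0 at the exact center of the box, approaching 0 near the edges."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

print(centerness(40, 30, 40, 30))           # 1.0 -> point sits at the box center
print(round(centerness(5, 5, 75, 55), 3))   # near 0 -> point sits near a corner
```

Multiplying this value into the classification score is what pushes the off-center boxes below the NMS cut.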

Results

FCOS successfully outperforms the anchor-based state of the art, RetinaNet, although FPS is not reported. Anchor-free object detection still needs some improvement, but so far it has achieved its milestones.
The concept of the anchor box itself is not intuitive; if we are seeking machines that perform like the human brain, anchor boxes are certainly not the answer. Therefore, I hope there will be a fascinating anchor-free SoTA soon.

The GitHub repo of FCOS is here, but since it requires installation, I personally prefer this fork by rosinality, which is more legible.

if you like(this_article):
please(CLAPS)
follow(LSC_PSD)
# Thanks :)
