Paper Review: “OTA: Optimal Transport Assignment for Object Detection”

In this paper, I will review OTA which is an advanced label assignment for object detection from a global perspective and formulated the assigning procedure as an Optimal Transport problem.

7 min readJan 23, 2022

Original paper: https://arxiv.org/pdf/2103.14259.pdf

1. Label Assignment

CNN-based object detectors like YOLO, RetinaNet, FCOS, CenterNet,… detect objects by predicting classification labels and regressing offsets for a predefined set of anchors (For anchor-free detectors like FCOS, feature points can be viewed as shrunk anchor boxes. The authors refer to anchor box and anchor point as “anchor”). So to train the detector, defining cls and reg targets for each anchor is a necessary procedure, which is called label assignment in object detection.

Different models have different strategies for matching the ground-truth (gt) object or background for each anchor. As shown figure 1(a), RetinaNet uses IoU to divide the anchor boxes from different pyramid levels into positives and negatives. First, it labels the best anchor box of each object and the anchor boxes with IoU > θ_p as positives then regards the anchor boxes with IoU < θ_n as negatives, finally other anchor boxes are ignored during training. For the FCOS model, it divides the anchor points from various pyramid levels using spatial and scale constraint. It examines the ground-truth box anchor points as candidate positive samples, then picks the final positive samples from candidates depending on the scale range provided for each pyramid level, and lastly considers the unselected anchor points as negative samples.

**Fig. 2:** An illustration of ambiguous anchor points in object detection ([2])

However, above methods only focus on individual gt objects while failing to consider context information from a global perspective. Example in Fig 2, the above methods have difficulty dealing with ambiguous anchors (anchors that are qualified as positive samples for multiple gts simultaneously). Assigning ambiguous anchors to any gt (or background) can cause problems during training and inference.

Thus a better assigning strategy should get rid of the convention of pursuing optimal assignment for each gt independently and turn to the ideology of global optimum, in other words, finding the global high confidence assignment for all gts in an image. ([2])

To solve this problem, the authors propose to formulate label assignment as an Optimal Transport (OT) problem — a special form of Linear Programming (LP) in Optimization Theory.

2. Method

2.1 Optimal Transport Problem

The Optimal Transport (OT) problem descibles as follows: supposing there are m suppliers and n demanders in a certain area. The i-th supplier holds s_i units of goods while the j-th demander needs d_j units of goods. Transporting cost for each unit of good from supplier i to demander j is denoted by c_{ij}. The goal of OT problem is to find a transportation plan π*={π_{i, j}|i=1,2,…m, j = 1, 2, …, n} according to which all goods from suppliers can be transported to demanders at a minimal transportation cost:

This is a linear program with a polynomial time solution. we use a rapid iterative approach called Sinkhorn-Knopp to solve this problem (details in [2]).

2.2 OT for Label Assignment

Supposing there are m gt targets and n anchors for an image input I. We treat each gt as a supplier that hold k units of positive labels (i.e., s_i = k)and each anchor as a demander who needs one unit of label (i.e., d_j = 1). The cost c^{fg} for transporting one unit of positive label from gt_{i} to anchor a_j is defined as the weighted summation of their cls and reg losses:

**Fig. 3:** An iIlustration of IoU loss ([3])

Besides, a large set of anchors are treated as negative samples during training. The authors introduce an other supplier-background , who only provides negative labels. The number of negative labels that background can supply as n − m × k. The cost for transporting one unit of negative label from background to a_j is defined as:

where ∅ means the background class. The supplying vector s should be correspondingly updated as:

As we already have the cost matrix, supplying vector s and demanding vector d, the optimal transportation plan π* can be obtained by solving this OT problem via the off-the-shelf Sinkhorn-Knopp Iteration. Noted OTA only increases the total training time by less than 20% and is totally cost-free in testing phase.

2.3 Advanced Designs

**Fig. 4:** An iIlustration of Optimal Transport Assignment. ([2])

Previous methods only select positive anchors from the center region of objects with limited areas, called Center Prior. For general detection datasets like COCO, the authors find the Center Prior still benefit the training of OTA. Hence, they impose a Center Prior to the cost matrix. For each gt, they select r^2 closest anchors from each FPN level according to the center distance between anchors and gts. As for anchors not in the r^2 closest list, their corresponding entries in the cost matrix c will be subject to an additional constant cost to reduce the possibility they are assigned as positive samples during the training stage.

Dynamic k Estimation. Normally, the appropriate number of positive anchors for each gt should be different and based on many factors like objects’ sizes, scales, and occlusion conditions, etc. The authors propose a simple but effective method to roughly estimate the appropriate number of positive anchors for each gt based on the IoU.

Specifically, for each gt, we select the top q predictions according to IoU values. These IoU values are summed up to represent this gt’s estimated number of positive anchors. We name this method as Dynamic k Estimation. Such an estimation method is based on the following intuition: The appropriate number of positive anchors for a certain gt should be positively correlated with the number of anchors that well-regress this gt. ([2])

Conclusion, Optimal Transport Assignment algorithm is depicted in the figure 5 (including Center Prior and Dynamic k Estimation).

**Fig. 5:** Optimal Transport Assignment algorithm. ([2])

3. Experiments and Results

3.1 FCOS

In the experiments, the author uses FCOS — an anchor-free detector because of its simplicity. By eliminating the predefined set of anchor boxes, FCOS avoids computation related to anchor boxes such as calculating overlapping during training. It also avoids all hyper-parameters related to anchor boxes, which are often very sensitive to the final detection performance. we conduct extensive experiments on MS COCO 2017 which contains about 118k, 5k and 20k images for train, val, and test-dev sets, respectively.

**Fig. 6:** The network architecture of FCOS. ([4])

3.1 The results

The pictures below are some comparison results obtained (details in the original paper).

**Fig. 7:** Performances of different label assigning strategies([2])

**Fig. 7:** Visualizations of assigning results ([2])

**Fig. 8:** Performance comparison with state-of-the-art one-stage detectors on MS COCO 2017 ([2])

**Fig. 9:** Performance comparison on the CrowdHuman validation
set. ([2])

Conclusion

OTA turns the labeling process in object detection into an Optimal Transport issue, with the goal of conveying labels from ground-truth objects and backgrounds to anchors for the least amount of cost. However, OTA uses Sinkhorn-Knopp Iteration algorithm which brings 20% extra training time, which is quite expensive for training.

You can find official source code of paper at: