DETR: Object Detection with Transformers

Neil Wu · Published in LSC PSD · 3 min read · Jun 1, 2020
Every article about Transformers starts with a transformer

TL;DR

  1. Facebook AI Research (FAIR) published an object detection model named DETR, which uses an encoder-decoder transformer.
  2. It successfully removes hand-designed components that required prior knowledge in previous object detection models.
  3. It outperforms the Faster R-CNN baseline on the COCO dataset. (Yes, the same Faster R-CNN published in 2015.)
  4. Code & pretrained models are available: https://github.com/facebookresearch/detr

In May 2020, FAIR published the first object detection model that adopts a transformer as part of the detection pipeline, named DETR (DEtection TRansformer). The paper, “End-to-End Object Detection with Transformers,” can be found here.

For those who aren’t familiar with the Transformer, it’s worth reading an introduction to the architecture first.

What’s Good

Compared to previous state-of-the-art object detection models, DETR has significantly fewer hyperparameters to set. DETR doesn’t need the number of anchor boxes, aspect ratios, default coordinates of bounding boxes, or even a threshold for non-maximum suppression. DETR hands all those tasks to the encoder-decoder transformer and bipartite matching, and achieves a more general model for diversified usage.

Architecture

The architecture of DETR

The architecture of DETR has three main components: a CNN backbone that extracts a compact feature representation, an encoder-decoder transformer, and feed-forward networks (FFNs).

After feature extraction by the CNN, a 1x1 convolution reduces the channel dimension of the CNN’s final output. Since the transformer is permutation-invariant, fixed positional encodings are added to supplement the features before they enter the transformer encoder.
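The steps above can be sketched in a few lines of numpy. This is a toy illustration with made-up sizes and random weights, not DETR's actual code: the real model uses a ResNet backbone (C = 2048 channels) projected down to d = 256, and a 2-D positional encoding rather than the 1-D sinusoidal one shown here.

```python
import numpy as np

# Toy dimensions (illustrative only; DETR uses C=2048, d=256)
C, H, W, d = 8, 4, 4, 8   # backbone channels, feature map size, model dim

features = np.random.rand(C, H, W)      # final feature map from the CNN backbone

# A 1x1 convolution is just a per-pixel linear projection from C to d channels
proj = np.random.rand(d, C)
reduced = np.einsum('dc,chw->dhw', proj, features)   # (d, H, W)

# Flatten the 2-D feature map into a sequence of H*W tokens of dimension d
tokens = reduced.reshape(d, H * W).T    # (H*W, d)

# Fixed sinusoidal positional encoding over sequence positions
# (1-D for simplicity; DETR uses a 2-D variant)
pos = np.arange(H * W)[:, None]         # (H*W, 1)
i = np.arange(d // 2)[None, :]          # (1, d/2)
angles = pos / (10000 ** (2 * i / d))
pos_enc = np.empty((H * W, d))
pos_enc[:, 0::2] = np.sin(angles)
pos_enc[:, 1::2] = np.cos(angles)

encoder_input = tokens + pos_enc        # sequence fed to the transformer encoder
print(encoder_input.shape)              # (16, 8)
```

Because the transformer has no built-in notion of order, dropping the positional encoding would make the model blind to where each feature came from in the image.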

The transformer decoder differs from the original: for N inputs, it decodes N outputs in parallel instead of decoding one element at a time. The final predictions are computed by a feed-forward network (FFN). The FFN predicts the normalized center coordinates, height, and width of each box, and a linear layer predicts the class via a softmax function.
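A rough numpy sketch of those two prediction heads follows. The sizes and weights are made up for illustration; the real DETR uses N = 100 object queries, d = 256, a 3-layer MLP for the box head, and an extra "no object" class slot, which is kept here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

N, d, num_classes = 5, 8, 3            # toy sizes; DETR uses N=100, d=256
decoder_out = np.random.randn(N, d)    # one embedding per decoded object query

# Box head: predicts (cx, cy, w, h), squashed into [0, 1] (normalized
# image coordinates). A single linear layer here; DETR uses a 3-layer MLP.
W_box = np.random.randn(d, 4)
boxes = sigmoid(decoder_out @ W_box)             # (N, 4)

# Class head: linear layer + softmax over classes plus a "no object" slot
W_cls = np.random.randn(d, num_classes + 1)
class_probs = softmax(decoder_out @ W_cls)       # (N, num_classes + 1)

print(boxes.shape, class_probs.shape)
```

Each of the N outputs therefore yields one box and one class distribution, and predictions whose most likely class is "no object" are simply discarded at inference time.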

What’s New

Besides the transformer in its architecture, DETR also adopts two major components from previous research.

  • Bipartite Matching Loss
  • Parallel Decoding

Bipartite Matching Loss

The loss in DETR is the sum of the bipartite matching losses

Unlike other object detection models, which label bounding boxes (or points, as in methods like Objects as Points) by matching multiple predicted boxes to one ground-truth box, DETR uses bipartite matching, which is one-to-one matching.

By performing one-to-one matching, DETR is able to significantly reduce low-quality predictions and eliminate output-reduction steps like NMS.

The bipartite matching loss is designed based on the Hungarian algorithm. I won’t go over the details here; please check the paper for further information.
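The idea of one-to-one matching can be shown with a toy sketch using only the standard library. Here each "box" is a single 1-D coordinate and the matching cost is plain absolute distance; these are simplifications for illustration (the real DETR cost combines class probability, L1 box distance, and generalized IoU, and uses the Hungarian algorithm instead of brute force).

```python
from itertools import permutations

# Toy 1-D "boxes": one coordinate per prediction / ground truth
predictions = [0.9, 0.1, 0.5]
ground_truth = [0.48, 0.12]            # fewer targets than predictions

# Pad targets with None ("no object") so the matching is one-to-one
padded = ground_truth + [None] * (len(predictions) - len(ground_truth))

def cost(pred, target):
    # Matching a prediction to "no object" costs nothing in this toy setup
    return 0.0 if target is None else abs(pred - target)

# Brute-force search over all one-to-one assignments; the paper uses
# the Hungarian algorithm, which finds the same optimum in O(N^3)
best = min(permutations(range(len(padded))),
           key=lambda p: sum(cost(predictions[i], padded[j])
                             for i, j in enumerate(p)))

matches = [(i, padded[j]) for i, j in enumerate(best) if padded[j] is not None]
print(matches)   # [(1, 0.12), (2, 0.48)]
```

Because every ground-truth box claims exactly one prediction, duplicate detections of the same object are penalized during training rather than pruned afterwards, which is what makes NMS unnecessary.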

Parallel Decoding

As mentioned above, the transformer decoder decodes N outputs in parallel instead of decoding one element at a time.

Performance

It outperformed the SoTA (in 2015) model, Faster R-CNN!

The performance of DETR was compared with Faster R-CNN on the COCO dataset. To be honest, comparing against a SoTA model published years ago doesn’t seem quite fair.

However, it’s undoubtedly a big step for the object detection field. After the transformer was published, researchers tried hard to implement transformers in computer vision models in a reasonable way, but the characteristics of the transformer aren’t well suited to two-dimensional (image) input. DETR achieves it by extracting features with a CNN and flattening the CNN’s final output into one-dimensional data. This implementation of the transformer is not only reasonable, it’s brilliant.

Why Important?

For everyone who experienced 2018, we all know how big an impact the transformer brought to Natural Language Processing. Basically every prior model was replaced by transformers and outperformed.

Although computer vision doesn’t look like a field for transformers, one still appeared and achieved good scores. What’s more, it overcomes some problems that current object detection models can’t solve.

History may repeat itself. It’d be better to understand it early.

if you like(this_article):
please(CLAPS)
follow(LSC_PSD)
# Thanks :)
