DETR's Flight of Fancy: Carving Out a Clean Object Detection Pipeline with Transformers

Published in AI Blog TW · 18 min read · Sep 3, 2020

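DETR (DEtection TRansformer) is a paper from the FAIR team, presented as an oral at ECCV 2020. From the title you might assume it simply rides on the Transformer, NLP's star tool, and swaps in a different network architecture. In fact, it is a bold rethinking of object detection put into practice: by borrowing heavily from techniques in natural language processing, it arrives at a novel pipeline that cuts across the usual one-stage vs. two-stage and anchor-based vs. anchor-free divides.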

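DETR's final numbers on COCO are not especially outstanding, and it is noticeably weaker at detecting small objects. Still, the pipeline itself is a striking innovation, and it is particularly interesting for readers who already know sequence-to-sequence models and related NLP ideas.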

Figure: Object detection with a Transformer (sources of the small images in the composite figure)

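Difficulty: ★★★☆☆
Reading notes: Because DETR follows classic NLP ideas, the first part of this article briefly introduces several landmark developments in NLP (sequence-to-sequence, attention, Transformer). It then returns to the Transformer's application in computer vision: the ideas behind DETR and how it is implemented.
Recommended background: object detection, sequence-to-sequence, attention, self-attention, Transformer.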

A First Look at the Innovations

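Broadly speaking, the innovation of DETR [1] is that it gives object detection a simpler, cleaner pipeline: the network needs no proposals, anchors, or window centers, and post-processing needs no non-maximum suppression. This new pipeline follows from DETR treating object detection as a set prediction problem and, during training, enforcing a bipartite matching between the predicted set and the ground-truth set.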
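To make the bipartite matching idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the cost function `match_cost` (L1 box distance minus the probability given to the true class) is a simplified, hypothetical stand-in, and the brute-force search over permutations only works for toy sizes. Real implementations typically use a Hungarian solver such as `scipy.optimize.linear_sum_assignment`. The sketch also assumes there are at least as many predictions as ground-truth objects.

```python
from itertools import permutations


def match_cost(pred, gt):
    """Hypothetical pairwise cost: L1 box distance minus the probability of the true class."""
    box_l1 = sum(abs(p - g) for p, g in zip(pred["box"], gt["box"]))
    class_prob = pred["probs"][gt["label"]]
    return box_l1 - class_prob


def bipartite_match(preds, gts):
    """Return (pred_index, gt_index) pairs of the minimum-total-cost one-to-one matching.

    Brute force over all assignments, so only sensible for tiny examples.
    Assumes len(preds) >= len(gts).
    """
    best_cost, best_pairs = float("inf"), []
    # Each permutation assigns a distinct prediction to every ground-truth object.
    for pred_ids in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(pred_ids))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(pred_ids)]
    return best_pairs


# Boxes are (cx, cy, w, h) normalized to [0, 1]; probs are indexed by class id.
preds = [
    {"box": (0.50, 0.50, 0.20, 0.20), "probs": [0.1, 0.9]},
    {"box": (0.20, 0.30, 0.10, 0.10), "probs": [0.8, 0.2]},
    {"box": (0.80, 0.80, 0.30, 0.30), "probs": [0.5, 0.5]},
]
gts = [
    {"box": (0.21, 0.31, 0.10, 0.12), "label": 0},
    {"box": (0.52, 0.49, 0.20, 0.18), "label": 1},
]

matches = bipartite_match(preds, gts)
print(matches)  # prediction 1 pairs with object 0, prediction 0 with object 1
```

Each ground-truth object ends up paired with exactly one prediction, which is why duplicate boxes do not need to be suppressed afterwards. Prediction 2 is left unmatched here; in the DETR paper, unmatched predictions are trained to output a "no object" class.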

Holistic prediction

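Architecturally, the image first goes through a conventional CNN backbone (ResNet, for example) to extract features. Those features are fed directly into a Transformer encoder-decoder network, which directly outputs a set of N predictions, each consisting of a box center, a box size (scale), and an object class.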
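The shape of this output can be illustrated with a short sketch of how N fixed prediction slots might be turned into final detections without non-maximum suppression. This is an illustration only, not DETR's code: the dictionary layout, the `decode` helper, and the `min_score` threshold are hypothetical, and the special "no object" class for unused slots is taken from the DETR paper rather than from the text above.

```python
NO_OBJECT = "no object"  # assumed name for the class that marks an unused slot


def to_corners(cx, cy, w, h):
    """Convert a (center x, center y, width, height) box to (x_min, y_min, x_max, y_max)."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def decode(predictions, class_names, min_score=0.5):
    """Keep every slot whose most likely class is a real object with enough confidence."""
    detections = []
    for pred in predictions:
        probs = pred["probs"]
        best = max(range(len(probs)), key=probs.__getitem__)
        if class_names[best] == NO_OBJECT or probs[best] < min_score:
            continue  # empty or low-confidence slot: simply dropped, no NMS step
        detections.append((class_names[best], probs[best], to_corners(*pred["box"])))
    return detections


class_names = ["cat", "dog", NO_OBJECT]
# N = 3 slots; boxes are (cx, cy, w, h) normalized to [0, 1].
predictions = [
    {"box": (0.5, 0.5, 0.25, 0.5), "probs": [0.90, 0.05, 0.05]},
    {"box": (0.1, 0.1, 0.10, 0.1), "probs": [0.05, 0.05, 0.90]},
    {"box": (0.7, 0.7, 0.20, 0.2), "probs": [0.30, 0.40, 0.30]},
]

detections = decode(predictions, class_names)
print(detections)
```

Only the first slot survives: the second is a "no object" slot and the third falls below the hypothetical confidence threshold.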

Written by Jia-Yau Shiau