DETR and Efficient DETR

Zixuan Yan
11 min read · Mar 20, 2022


Motivation for Using Transformers in Computer Vision

Since the publication of the paper “Attention Is All You Need” in 2017, the transformer has become increasingly popular and has achieved great success in Natural Language Processing (NLP) tasks. At the same time, there has been a trend of exploring how transformers can be used in the Computer Vision (CV) field. This kind of attempt is worthwhile for at least two reasons:

  1. Unify models for the NLP and CV domains. NLP and CV are the two most important application fields for deep learning models. However, RNN-based models have traditionally worked best for NLP because sentences are sequential data, while CNN-based models work best for CV because of their local connectivity. This separation makes multimodal tasks like VQA difficult. If we can unify the models for NLP and CV (by using transformers), it will be a big step forward.
  2. Leverage global attention to compensate for the CNN’s local receptive field. Although the local receptive field helps a CNN exploit spatial information, it results in a lack of global context. The transformer’s attention mechanism is a perfect complement, since every input token attends to every other token with its own attention weight.

Object Detection with Transformers

Among all CV tasks, object detection was a pioneer in taking advantage of transformers. DETR, proposed in the paper “End-to-End Object Detection with Transformers”, is the first transformer-based model for object detection. It has three main advantages over traditional object detection models like Faster R-CNN and YOLO.

  1. It is an end-to-end model without any hand-designed components such as anchors, region proposals or non-maximum suppression.
  2. It achieves results comparable to SOTA models while using fewer parameters and FLOPs, which indicates efficiency.
  3. Thanks to the global attention mechanism, DETR achieves better performance on large objects.

In this post, we will cover the technical details of DETR and Efficient DETR. Some knowledge of the Transformer and Deformable DETR will help with understanding, but neither will be discussed in much depth here. You can refer to this excellent post for further information.

As a brief introduction, DETR views object detection as an image-to-set problem. It uses a CNN + Transformer encoder to extract features from the image; these features are stored in the encoder memory. Then, a set of object queries is initialized as an initial guess of the objects in the image and sent into the Transformer decoder. After several stages of refinement, the object queries have collected enough information about each object and are sent to a shared FFN that finally predicts the object classes and bounding boxes.

Fig1. Flow chart of DETR architecture

We’ll delve into more details about how DETR works in the next section.

DETR

The entire workflow of DETR (as shown in Fig1) can be broken into three parts.

Part I: Feature Extraction

Fig2 Feature extraction part of DETR

This part aims at extracting useful information from the original image. It plays the same role as the CNN backbone in traditional object detection models, but also includes a transformer encoder to learn global dependencies.

The input image first passes through a CNN model and is transformed into a smaller feature map. For example, if the initial image has shape (3, H₀, W₀), the output feature map has shape (C, H, W), where H and W are much smaller than H₀ and W₀. Then, an additional 1×1 convolution is used to reduce the channel dimension from C to d. After that, the spatial dimensions are flattened, because the encoder expects a sequence as input, resulting in a d×HW feature map.
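Below is a minimal PyTorch sketch of this step. It assumes a ResNet-50 backbone (so C = 2048) and a hidden dimension d = 256; the variable names are illustrative, not DETR’s actual code.

```python
import torch
import torch.nn as nn
import torchvision

# Feature extraction sketch: CNN backbone -> 1x1 conv -> flatten to a sequence.
backbone = nn.Sequential(*list(torchvision.models.resnet50(weights=None).children())[:-2])
input_proj = nn.Conv2d(2048, 256, kernel_size=1)   # reduce channels C -> d

image = torch.randn(1, 3, 800, 1066)               # (B, 3, H0, W0)
feat = backbone(image)                             # (B, C, H, W), H = H0/32, W = W0/32
feat = input_proj(feat)                            # (B, d, H, W)
tokens = feat.flatten(2).permute(2, 0, 1)          # (HW, B, d): a sequence of HW tokens
```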

Then, a positional encoding is added to this newly generated feature map. The positional encoding is used to recover the spatial information that is lost during flattening. Its input is a spatial position (h, w) and its output is a positional vector of length d that can be added to the feature vector. The encoding can either be learned during training or fixed. The core principle is that the encoding should be different for every position, so that we can tell any two positions apart. A typical example uses sin/cos functions.
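As an illustration, here is one common way to build a fixed 2-D sine/cosine encoding (a sketch, not DETR’s exact implementation; the function name and frequency scaling are my own choices).

```python
import math
import torch

def sine_position_encoding(h, w, d):
    # Half of the d channels encode the row index, the other half the column
    # index, each at a range of frequencies, so every (h, w) gets a unique vector.
    assert d % 4 == 0
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    freqs = torch.exp(torch.arange(d // 4) * (-math.log(10000.0) / (d // 4)))
    y_enc = ys.flatten()[:, None] * freqs[None, :]     # (h*w, d/4)
    x_enc = xs.flatten()[:, None] * freqs[None, :]
    return torch.cat([y_enc.sin(), y_enc.cos(), x_enc.sin(), x_enc.cos()], dim=1)

pos = sine_position_encoding(25, 34, 256)              # (HW, d), added to the tokens
```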

After that, there are HW tokens, each with d channels. They are sent into the encoder as input. Inside the encoder, the tokens learn dependencies on each other and finally become higher-level features stored in the encoder memory. The advantage of the encoder over a traditional CNN originates here: in a CNN, a neuron in the next layer only has access to a set of adjacent feature points from the previous layer, while in the encoder every token has access to all feature points, each with its own attention weight. This gives a much larger, effectively global receptive field.
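A minimal sketch of this step using PyTorch’s built-in encoder is shown below. Note that the reference DETR adds the positional encoding to the queries and keys inside every attention layer; adding it once to the input, as done here, is a simplification.

```python
import torch
import torch.nn as nn

d, HW, B = 256, 850, 1                    # hidden dim, number of tokens, batch size
tokens = torch.randn(HW, B, d)            # flattened backbone features
pos = torch.randn(HW, 1, d)               # positional encoding (see sketch above)

encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
memory = encoder(tokens + pos)            # (HW, B, d) encoder memory
```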

Part II: Object Queries & Decoder

Fig3 Object Queries & Decoder part of DETR

After the first part, the extracted feature information is stored in the encoder output. In the second part, we need to collect and assign this information to each object, which is accomplished using object queries. Object queries are K vectors of length d (the same length as the input tokens) that serve as inputs to the decoder. Each object query can be thought of as storing information for one unique object inside the image. They are randomly initialized (yes, it sounds unreasonable) and are updated after each decoder layer.
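In practice, the object queries are usually just a learned embedding table; a minimal sketch (assuming K = 100 and d = 256) is:

```python
import torch.nn as nn

K, d = 100, 256
query_embed = nn.Embedding(K, d)          # randomly initialized, learned with the model
object_queries = query_embed.weight       # (K, d), refined by each decoder layer
```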

Let’s take a deeper look inside each decoder layer, as visualized in Fig3. There are two kinds of attention: self-attention and cross-attention. In the self-attention layer, the values, keys and queries are all projected from the object queries. This layer can be thought of as learning the dependencies between different objects, which are common in real life, e.g. there are usually humans on bicycles (sometimes monkeys).

In the cross-attention layer, only the queries come from the object queries, while the values and keys come from the encoder output, which is fixed after the first part. This layer can be thought of as searching for useful information inside a warehouse (the encoder output) that holds all the information about the image. Each object query searches for information about its corresponding object, assigning high attention weights to highly correlated tokens.

At the first decoder layer, the object queries are random vectors with no information. However, after 6 layers of refinement, they have collected enough information about the objects and can be used for the final prediction.
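Here is a minimal sketch of this refinement with PyTorch’s built-in decoder, which runs self-attention over the queries followed by cross-attention into the encoder memory in each layer. In the reference DETR the decoder input starts from zeros and the query embedding is re-added at every layer; this simplified version feeds the queries in directly.

```python
import torch
import torch.nn as nn

K, d, HW, B = 100, 256, 850, 1
object_queries = torch.randn(K, B, d)     # from the learned embedding table
memory = torch.randn(HW, B, d)            # encoder output from Part I

decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
hs = decoder(object_queries, memory)      # (K, B, d) refined object queries
```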

One thing worth noting is that we predict all objects in a single pass, instead of auto-regressively as the original transformer does when generating a sentence. That is because there is no sequential dependency between objects: all objects are independent of each other, so we can predict them simultaneously.

Part III: Final prediction & Bipartite Matching

Fig4 Final prediction & Bipartite Matching part of DETR

Now the information about each object is contained in its corresponding object query, and it is time to predict the object from that query. A feed-forward network (FFN) is used here and shared by all object queries. Its input is a (d,) vector and its output is the class and bounding box of one object. Finally, we obtain the detection result for the image by combining the outputs of all object queries.
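A sketch of such shared prediction heads is shown below, assuming d = 256, K = 100 queries, and 91 classes plus one extra ‘no object’ class; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

K, d, num_classes = 100, 256, 91
hs = torch.randn(K, d)                    # refined object queries from the decoder

class_head = nn.Linear(d, num_classes + 1)        # +1 for the 'no object' class
bbox_head = nn.Sequential(                        # small MLP for box regression
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4),
)
logits = class_head(hs)                   # (K, num_classes + 1) class scores
boxes = bbox_head(hs).sigmoid()           # (K, 4) normalized (cx, cy, w, h)
```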

There remains one problem to solve: since we are designing an object detection pipeline for arbitrary images, we need to fix the number of object queries (K), and this number must be much larger than the number of possible objects in any image, e.g. K=100 in the original paper. However, when the model is run on a specific image, there may be only N objects (N < K). At test time, it is reasonable to expect that most of the object queries will predict ‘no object’, and we simply keep the queries that have valid object predictions.

For training, however, we need to match the K predictions with the N ground truths to compute the loss. We first pad the N ground-truth objects up to K, adding K-N ‘no-object’ entries. Then we have two sets of K elements that must be matched one-to-one. This can be viewed as a Bipartite Matching (BM) problem: we connect the two sets with edges such that every endpoint has exactly one edge, and we maximize the sum of edge weights. Here the weight of an edge is defined as the IoU between the predicted and ground-truth bounding boxes (the full DETR matching cost also takes the class probability and an L1 box term into account). An example is shown in Fig5.

Fig5 Bipartite Matching Example

A common way to solve the BM problem is the well-known Hungarian algorithm, which will not be explained here. We recommend this post for reference if you are interested in how the algorithm works.
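For a concrete picture, here is a toy matching sketch using SciPy’s implementation of the Hungarian algorithm (linear_sum_assignment) with IoU-based edge weights; the box generator and the shapes are made up for illustration.

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import box_iou

def rand_boxes(n):
    # Random but valid (x1, y1, x2, y2) boxes, just for the toy example.
    xy = torch.rand(n, 2)
    wh = torch.rand(n, 2) * 0.3
    return torch.cat([xy, xy + wh], dim=1)

pred_boxes = rand_boxes(100)                    # K predictions
gt_boxes = rand_boxes(5)                        # N ground-truth objects

cost = -box_iou(pred_boxes, gt_boxes).numpy()   # negate: the solver minimizes cost
pred_idx, gt_idx = linear_sum_assignment(cost)  # one-to-one matched (prediction, gt) pairs
```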

Results

Fig6 Results comparison between DETR and Faster R-CNN

After this detailed description of the DETR workflow, let’s take a closer look at its performance. The experiments shown in Fig6 are performed on the COCO 2017 detection dataset. We can see that DETR achieves a result (42.0 mAP) similar to the Faster RCNN-R101-FPN model, with fewer parameters (41M vs 60M) and roughly one-third the GFLOPs (86 vs 246). With a more complex architecture, it can achieve even better results.

Another important observation is that most of the performance improvement comes from medium and large objects. This is reasonable, since the advantage of the transformer model is its larger receptive field, which naturally benefits the detection of larger objects.

While DETR is a novel and well-performing object detection model, it also has some cons:

  1. DETR needs more time to converge. More specifically, the training time of DETR is 10x longer than that of Faster R-CNN, even though it has fewer parameters and FLOPs.
  2. DETR performs worse on small objects. We can see a drop on small objects (21.4 to 20.5), which suggests that the global attention mechanism sometimes makes it hard for the model to concentrate on small regions.

These two main weaknesses are what our next paper, Efficient DETR, sets out to solve.

Efficient DETR

The first milestone variant proposed to address the two weaknesses discussed above is Deformable DETR. It improves small-object performance by incorporating multi-scale feature maps and reduces training time by replacing dense attention with sparse deformable attention. Another important concept introduced by this work is the reference point, which provides better interpretability for attention. We will not go into detail here; you can refer to another post for more information about Deformable DETR.

Efficient DETR is built upon Deformable DETR, but utilizes a dense prior to better initialize the object containers and thereby reduce training time.

Recall that in DETR and Deformable DETR, the object containers (object queries + reference points) are randomly initialized, which is counterintuitive because we want them to contain as much information as possible about the objects. Several tests also show that removing one decoder layer hurts the model more than removing one encoder layer, which is again counterintuitive because a decoder layer only updates the object queries. These two phenomena lead to a hypothesis: is poor object-container initialization the reason for the long training time?

To answer this question, the authors designed two investigative tests:

1st test

For the first test, the authors visualized the distribution of reference points for the first and last decoder layers, as shown in Fig7.

Fig7 Test1 — Reference points distribution

According to the figure, one observation is that although the reference points are randomly initialized, those of the first layer learn a nearly uniform distribution after training. This is reasonable: if we do not know what the input image will look like, an object could appear anywhere in the image, so a uniformly distributed attention is always the best choice for the first layer. Another observation is that after 6 layers of refinement, the reference points are almost all focused on the potential objects in the image (like the cows in the left picture and the humans in the right picture). This suggests that what the decoder layers do is refine the attention distribution stage by stage.

2nd test

For the second test, the authors tried different initialization patterns and observed the differences in performance.

Fig8 Test2 — Different initialization pattern and performance

We can see that with different initialization patterns, the performance of the first decoder layer changes a lot, while both the distribution and the performance of the last layer remain similar. One thing to note is that with the dense prior initialization proposed in this paper, the first layer can already achieve a result close to that of the sixth layer (39.0 vs 42.x), which means we can drop most, or even all, of the following layers. But how do we generate such an initialization? Let’s look at the architecture of Efficient DETR.

Architecture

Fig9 Efficient DETR architecture

The whole architecture of Efficient DETR, as shown in Fig9, can be divided into three parts. Part I and Part III are a normal transformer encoder and decoder, exactly the same as in Deformable DETR. The only difference is that, thanks to the better object container initialization, there are only 3 encoder layers and 1 decoder layer, which is where ‘efficient’ comes from.

The core idea is in Part II, whose input is the extracted features and whose output is K initialized object containers. There are 3 steps to generate the dense prior initialization (a small code sketch follows the three steps).

Step1: Use an RPN (Region Proposal Network) head to propose N region candidates with objectness scores.

Step2: Rank all candidates by objectness score and take the top-K as 4-d reference points. Note that here a reference point is a 4-d point (a region) instead of the 2-d point used in Deformable DETR, because experiments show that this scheme captures more object information.

Step3: Pick the encoder features corresponding to each selected region as the (d,) object query. The output feature map is set to have the same number of channels (d), so a selected region has a feature map of size (d, h, w). A global average pooling layer is used to collapse the spatial dimensions and generate a (d,) vector as the object query.
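The sketch below strings the three steps together in PyTorch. The scoring head, the placeholder boxes and the RoI pooling size are illustrative stand-ins, not the paper’s exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

d, K = 256, 100
feat = torch.randn(1, d, 25, 34)                   # encoder output reshaped to (B, d, H, W)

# Step 1: an RPN-style head scores every location for objectness.
score_head = nn.Conv2d(d, 1, kernel_size=1)
scores = score_head(feat).flatten(1)               # (B, H*W) objectness scores

# Step 2: keep the top-K locations and turn them into 4-d reference points (regions).
topk = scores.topk(K, dim=1).indices[0]            # indices of the best locations
# ... converting each location into an (x1, y1, x2, y2) box is omitted here.
ref_boxes = torch.rand(K, 4).sort(dim=1).values * 20.0   # placeholder reference regions

# Step 3: pool the encoder features inside each region, then average to a (d,) query.
rois = torch.cat([torch.zeros(K, 1), ref_boxes], dim=1)  # (batch_idx, x1, y1, x2, y2)
region_feats = roi_align(feat, rois, output_size=(7, 7)) # (K, d, 7, 7)
object_queries = region_feats.mean(dim=(2, 3))           # (K, d) initial object queries
```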

Results

Fig10 Results comparison of DETRs

Observing the results shown in Fig10, we can tell that the deformable attention mechanism and the dense prior initialization significantly reduce the training time of DETR, from 10x longer than Faster R-CNN to roughly the same. They also make up for the weaker detection of small objects (20.5 -> 26.4 -> 28.4), with only a small drop on large objects.

Conclusion

In this article, we have given a detailed illustration of a novel object detection model, DETR, and its variants. We have seen the pros and cons of DETR and how some creative tricks mitigate, or even solve, these problems. DETR may not be the best object detection model for now, but one thing is worth noting: DETR took just 2 years to catch up with Faster R-CNN and YOLO, which took nearly 8 to 10 years to develop. Therefore, we believe DETR still has great potential for improvement.

References

  1. Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).
  2. Carion, Nicolas, et al. “End-to-end object detection with transformers.” European conference on computer vision. Springer, Cham, 2020.
  3. Zhu, et al. “Deformable DETR: Deformable transformers for end-to-end object detection.” arXiv preprint arXiv:2010.04159 (2020).
  4. Yao, Zhuyu, et al. “Efficient DETR: Improving end-to-end object detector with dense prior.” arXiv preprint arXiv:2104.01318 (2021).
  5. Ren, Shaoqing, et al. “Faster R-CNN: Towards real-time object detection with region proposal networks.” Advances in neural information processing systems 28 (2015).
