A Milestone in Object Detection with Transformers

mingzhe
Mar 26, 2022


Motivation

Back in 2017, the groundbreaking transformer architecture gave a huge boost to the natural language processing (NLP) field. More and more NLP tasks benefit from its ability to model long-range dependencies without gradient explosion or vanishing.

In an effort to unify computer vision (CV) and NLP, more and more CV tasks are shifting from convolutional neural networks (CNNs) to transformer-based models; for instance, both object re-identification and multi-object tracking now use transformers as their backbones. Those tasks can leverage temporal information across consecutive frames, but object detection has no temporal information to exploit, since its inputs are static images. Moreover, the boom of YOLO variants seems to suggest that CNNs should dominate object detection. Yet one major problem remains: dense detection causes repetition. The YOLO family segments images with grids, followed by anchors of different sizes and aspect ratios; CenterNet detects at every pixel. Why can none of these CNN-based methods get rid of dense detection? Because convolution has a limited receptive field: kernels are kept small for the sake of computational efficiency. In other words, convolution can only provide local attention.

Fig 1. The receptive field provides only limited attention, which is why we need deeper neural networks

Is sparse detection feasible? The answer is YES! As an attention-based model, a transformer can provide both local and non-local attention, even global attention. A transformer takes feature maps as encoder input and candidate embeddings as decoder input. Recall that a transformer consumes embeddings: these are dimensionally much smaller than either a raw collection of image pixels (an embedding compresses the image with learnable parameters) or a set of anchors (anchors overlap, causing even more pixel repetition).
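
To make the input layout concrete, here is a minimal PyTorch sketch; the sizes and the plain nn.Transformer are illustrative assumptions, not DETR's exact configuration. The CNN feature map is flattened into a token sequence for the encoder, while a small set of learnable query embeddings feeds the decoder.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not DETR's exact configuration.
d_model, num_queries = 256, 100
feat = torch.randn(1, d_model, 25, 34)           # CNN feature map: (B, C, H, W)

# Encoder input: flatten the spatial grid into a sequence of H*W tokens.
src = feat.flatten(2).permute(2, 0, 1)           # (H*W, B, C)

# Decoder input: a small, fixed set of learnable "object query" embeddings.
query_embed = nn.Embedding(num_queries, d_model)
tgt = query_embed.weight.unsqueeze(1)            # (num_queries, B, C), B = 1 here

transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6)
out = transformer(src, tgt)                      # one output embedding per query
print(out.shape)                                 # torch.Size([100, 1, 256])
```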

Transformer

Prior to introducing transformer-based object detection algorithms, let's take a look at the transformer itself, focusing on its core elements.

Attention Module

We may have looked at the following figure hundreds of times and still not be clear why it can provide attention at all scales. The transformer has both self-attention and cross-attention: self-attention happens within the encoder or the decoder, while cross-attention happens when information flows from the encoder to the decoder. All attention is expressed as query-key pairs, and the computation is carried out by multi-head attention, one of the most crucial components, which is built from scaled dot products.

Fig 2. Transformer Architecture
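
As a hedged illustration of where the queries, keys, and values come from in the two cases, the following PyTorch sketch uses nn.MultiheadAttention with arbitrary tensor sizes.

```python
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
attn = nn.MultiheadAttention(d_model, n_heads)

enc_tokens = torch.randn(850, 1, d_model)   # encoder sequence (S, B, C)
dec_tokens = torch.randn(100, 1, d_model)   # decoder sequence (T, B, C)

# Self-attention: queries, keys, and values all come from the same sequence.
self_out, _ = attn(enc_tokens, enc_tokens, enc_tokens)

# Cross-attention: queries come from the decoder, keys/values from the encoder.
cross_out, _ = attn(dec_tokens, enc_tokens, enc_tokens)
```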

The name of the scaled dot product tells you everything it does. "Scaled" means applying an appropriate scaling factor to avoid gradient explosion or vanishing. "Dot product" means the computation becomes matrix multiplication: there is no receptive field anymore, and every element participates in the computation! If the keys and values come from the same sequence as the queries, it is self-attention; otherwise, it is cross-attention.

Fig 3. Scaled dot-product attention
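
Here is a minimal sketch of the scaled dot product in Fig. 3 (assuming PyTorch; the shapes are illustrative): queries and keys are multiplied as matrices, scaled by the square root of the key dimension to keep the softmax well behaved, and the resulting weights are used to mix the values.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q: (..., T, d_k), k: (..., S, d_k), v: (..., S, d_v)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., T, S): every query meets every key
    weights = scores.softmax(dim=-1)
    return weights @ v                                   # (..., T, d_v)

q = torch.randn(1, 100, 64)
k = torch.randn(1, 850, 64)
v = torch.randn(1, 850, 64)
print(scaled_dot_product_attention(q, k, v).shape)       # torch.Size([1, 100, 64])
```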

Skip Connection & MLP

In 2021, researchers found that the success of the transformer does not mean attention is everything. The skip connections (add & norm) and the MLP (feed-forward) shown in Fig. 2 are also of great importance. Without them, a pure self-attention network collapses to rank 1 doubly exponentially as information flows through the layers.
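
To make the role of the skip connections and the MLP concrete, here is a hedged sketch of a single post-norm encoder block as drawn in Fig. 2; the hyper-parameters are illustrative.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        # The feed-forward MLP that, together with the skip connections, prevents rank collapse.
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                             # x: (S, B, C)
        x = self.norm1(x + self.attn(x, x, x)[0])     # skip connection around attention
        x = self.norm2(x + self.mlp(x))               # skip connection around the MLP
        return x
```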

Positional Encoding

We may also notice that an encoding is added after the embedding. Since the core operation is matrix multiplication and each attention score is a scalar, it is easy to tell which query-key pair has the highest attention, but hard to tell where the best-matching tokens are located; in other words, positional context must also be included. That is why we need positional encoding.

The encoding should be insensitive to the sequence length, since input length varies and positional encoding must work at all scales, from tokens to whole sentences; it should also be bounded to a relatively small range. One-hot encoding would incur a huge memory cost. Sequential labeling followed by normalization makes the distribution sensitive to the sequence length. So we want a scheme that varies with position yet naturally keeps its values in a small, fixed range. Luckily, sine and cosine are a good fit.
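
Below is a compact sketch of the sinusoidal encoding from the original transformer paper (assuming PyTorch): each position is mapped to interleaved sine and cosine values of different frequencies, so the values stay bounded no matter how long the sequence is.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1)                                      # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)) # frequencies
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe                            # values stay in [-1, 1] for any seq_len

pe = sinusoidal_positional_encoding(seq_len=50, d_model=256)
print(pe.shape)   # torch.Size([50, 256])
```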

Teacher-forcing Training

Now that we have had a thorough review of the transformer architecture, let's take a brief look at the training strategy. When training a sequence model, we tend to think of an autoregressive scheme: take the prediction as the next input state. This severely limits parallelism. In the transformer, we can maximize parallelism with the teacher-forcing strategy: take the ground truth as the next input state. The same idea carries over to the transformer-based object detectors discussed in the following chapters.
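
As a hedged sketch of the difference (assuming PyTorch's nn.Transformer and already-embedded target vectors), teacher forcing feeds the shifted ground truth to the decoder in one parallel forward pass, while autoregressive decoding must loop step by step.

```python
import torch
import torch.nn as nn

d_model = 256
model = nn.Transformer(d_model=d_model, nhead=8)
src = torch.randn(20, 1, d_model)            # encoder input sequence
gt  = torch.randn(10, 1, d_model)            # ground-truth target sequence (already embedded)

# Teacher forcing: the decoder sees the shifted ground truth at every step,
# so all target positions are processed in one parallel forward pass.
start = torch.zeros(1, 1, d_model)
tgt_in = torch.cat([start, gt[:-1]], dim=0)
causal_mask = model.generate_square_subsequent_mask(tgt_in.size(0))
out = model(src, tgt_in, tgt_mask=causal_mask)   # (10, 1, d_model), computed in parallel

# Autoregressive decoding: each new input is the model's own previous output,
# which forces a sequential loop and kills parallelism.
tokens = start
for _ in range(10):
    step = model(src, tokens)[-1:]               # keep only the newest position
    tokens = torch.cat([tokens, step], dim=0)
```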

For a consistent reading of transformer-based object detectors, it is suggested to start with the explanation of DETR first, continue with the introduction of Deformable DETR in the next chapter, and then return to the Efficient DETR part of the referenced post, a further optimized version of Deformable DETR that exploits prior knowledge.

Deformable DETR

We assume you have read the introduction to DETR. It is interesting and works surprisingly well, doesn't it? But you may feel a bit disappointed when you look at DETR's results on the COCO dataset: neither the training speed nor the mAP is as good as expected. The natural goal is to address these two issues, or at least partially address them. Hence Deformable DETR was proposed.

:( Small object detection is not strong enough

Compared with Faster R-CNN, DETR's mAP on small objects is not satisfactory. This is a common challenge in object detection. DETR tries to alleviate it with ResNet50-DC5, a backbone that uses dilated convolution in the last stage of ResNet-50 to increase feature resolution, and achieves a better mAP. However, this minor optimization does not elegantly solve the problem: although the dilated stage enlarges the feature map, the computational cost (FLOPs) is shocking, roughly twice that of the original DETR. Looking back at the history of dealing with small objects, there have been successful attempts with multi-scale features, for instance FPN, a top-down path that combines feature maps of different scales. Inspired by deformable convolution, which samples a sparse set of locations instead of a dense grid, Deformable DETR was born.

At a brief glance at Fig. 4, you may quickly associate the multi-scale attention with FPN, but it is not FPN; in fact, it leaves FPN behind. Since feature maps at all scales are fed into the encoder, information can be exchanged across feature maps automatically, thanks to multi-scale self-attention (discussed in the following paragraphs). This information flow is strong enough that there is no improvement even when FPN is explicitly added.

Taking a closer look at Fig. 4, you may find that multi-scale self-attention only happens in the encoder. This is because the object queries in the decoder are randomly initialized and carry no multi-scale spatial information.

Note that pixels from all feature levels are mixed together in the encoder, so we need to distinguish which pixel comes from which feature map; hence an extra scale-level embedding is added on top of the positional embedding.

Fig 4. Deformable DETR Architecture
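
The following is a hedged sketch of this bookkeeping (assuming PyTorch; the level sizes and the zeroed positional embedding are placeholders, not the authors' implementation): feature maps from several levels are flattened, tagged with a learned scale-level embedding, and concatenated into one encoder input sequence.

```python
import torch
import torch.nn as nn

d_model, num_levels = 256, 4
# Illustrative multi-scale feature maps (B, C, H, W) from four backbone stages.
feats = [torch.randn(1, d_model, s, s) for s in (64, 32, 16, 8)]

scale_embed = nn.Embedding(num_levels, d_model)    # one learned vector per feature level

tokens = []
for lvl, f in enumerate(feats):
    x = f.flatten(2).transpose(1, 2)               # (B, H*W, C)
    pos = torch.zeros_like(x)                      # placeholder for the usual positional embedding
    x = x + pos + scale_embed.weight[lvl]          # scale-level embedding tells the encoder
    tokens.append(x)                               # which level each pixel came from

src = torch.cat(tokens, dim=1)                     # (B, sum of H*W over levels, C)
print(src.shape)                                   # torch.Size([1, 5440, 256])
```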

:( Training is too slow

Knowing that DETR's decoder input is nothing but randomly initialized learnable embeddings, we are surprised to find it working at all. Although global attention is appealing, it takes roughly 500 epochs and thousands of GPU hours to converge. It is high time we adopted a sparse attention mechanism.

There are three mainstream candidates: a fixed sliding window, exploiting the low-rank property for dimensionality reduction, and data-dependent sparse attention. A fixed sliding window restricts attention to neighboring pixels but loses global attention; we would still need a rubric for selecting pixels from the input feature maps. Dimensionality reduction is another candidate, with successful attempts such as applying a linear projection to the key elements or reducing channels, but these methods are tricky and require a lot of experimentation. Data-dependent sparse attention is the one Deformable DETR chooses: we sample the input feature maps so that only a small number of pixels participate in the attention computation.
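
As a rough illustration of what "linear projection on key elements" means, here is a tiny hedged sketch with hypothetical sizes: the key sequence is projected from S tokens down to k summary tokens before attention, shrinking the QKᵀ matrix accordingly.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: S spatial tokens are projected down to k "summary" tokens.
S, k, d_model = 5440, 256, 256
proj = nn.Linear(S, k)                      # learned projection over the sequence dimension

keys = torch.randn(1, S, d_model)
low_rank_keys = proj(keys.transpose(1, 2)).transpose(1, 2)   # (1, k, d_model)
# Attention is now computed against k keys instead of S, cutting the QK^T cost from S*S to S*k.
```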

The query features are still learnable embeddings, but each of them goes through two extra linear projections: one predicts sampling offsets with respect to the current reference point (each pixel is a reference point), regardless of the spatial size of the feature map; the other predicts attention weights, normalized by a softmax. The output is the weighted sum of the values sampled at those offset locations. Since we sample only a small number of pixels of interest, 4 in this case, we tremendously reduce the computational cost of attention.

Fig 5. Sampled attention in deformable DETR
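
Here is a heavily simplified, hedged sketch of this sampling for a single head and a single feature level (assuming PyTorch and a particular offset-normalization convention); the real module uses multiple heads, multiple levels, and a custom CUDA kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-head, single-level sketch of the sampling in deformable attention."""

    def __init__(self, d_model=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offset_proj = nn.Linear(d_model, n_points * 2)  # (dx, dy) per sampling point
        self.weight_proj = nn.Linear(d_model, n_points)      # one attention weight per point

    def forward(self, query, ref_points, feat):
        # query:      (B, Nq, C)   query features
        # ref_points: (B, Nq, 2)   reference points in [0, 1], (x, y) order
        # feat:       (B, C, H, W) value feature map
        B, Nq, _ = query.shape
        H, W = feat.shape[2:]

        # Offsets and weights are predicted from the query itself (data-dependent sparsity).
        offsets = self.offset_proj(query).view(B, Nq, self.n_points, 2)
        weights = self.weight_proj(query).softmax(dim=-1)           # (B, Nq, n_points)

        # Sampling locations = reference point + predicted offset (normalized by map size).
        norm = torch.tensor([W, H], dtype=feat.dtype, device=feat.device)
        locs = ref_points.unsqueeze(2) + offsets / norm              # (B, Nq, n_points, 2)
        grid = 2 * locs - 1                                          # to [-1, 1] for grid_sample

        # Bilinearly sample only n_points values per query instead of attending to all H*W pixels.
        sampled = F.grid_sample(feat, grid, align_corners=False)     # (B, C, Nq, n_points)
        out = (sampled * weights.unsqueeze(1)).sum(-1)               # weighted sum -> (B, C, Nq)
        return out.transpose(1, 2)                                   # (B, Nq, C)

# Hypothetical usage with illustrative sizes.
attn = SimpleDeformableAttention()
q = torch.randn(1, 100, 256)                  # 100 queries
refs = torch.rand(1, 100, 2)                  # one reference point per query
feat = torch.randn(1, 256, 32, 32)
print(attn(q, refs, feat).shape)              # torch.Size([1, 100, 256])
```

With these illustrative sizes, each query attends to only 4 sampled locations instead of all 1024 pixels of the feature map.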

Consider an extreme case: with only one attention head, a single sampled offset, and fixed identity weights, deformable attention degrades to deformable convolution. Consider the other extreme: if we sample every pixel of the input feature map, deformable attention generalizes to global attention. In other words, the method is built upon deformable convolution and is a sparse optimization of global attention.

Further Optimizations

We have covered all the core features of Deformable DETR, but the authors made two further optimizations: iterative bounding box refinement and a two-stage Deformable DETR. Iterative bounding box refinement is inspired by the iterative refinement used in optical flow estimation: each decoder layer refines the bounding boxes based on the predictions of the previous layer. The two-stage variant leverages region proposals from the encoder output as priors, an intuition similar to Efficient DETR: in the first stage, an encoder-only Deformable DETR generates region proposals, and in the second stage the top-scoring proposals are fed to the decoder as object queries.
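
Below is a hedged sketch of the refinement loop (the decoder-layer and box-head calls are hypothetical placeholders, not the authors' code): each layer predicts a correction that is added to the previous boxes in inverse-sigmoid space.

```python
import torch

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

def refine_boxes(decoder_layers, box_heads, queries, memory, init_boxes):
    # init_boxes: (B, Nq, 4) normalized (cx, cy, w, h) in [0, 1]
    boxes = init_boxes
    for layer, head in zip(decoder_layers, box_heads):
        queries = layer(queries, memory)                     # hypothetical decoder-layer call
        delta = head(queries)                                # (B, Nq, 4) predicted correction
        boxes = (inverse_sigmoid(boxes) + delta).sigmoid()   # refine the previous layer's boxes
        boxes = boxes.detach()                               # block gradients between layers
    return boxes
```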

One of the anonymous reviewers of Deformable DETR asked whether these optimizations could be applied to the original DETR. While the theoretical answer is yes, the authors did not test it, given the huge computational cost.

Experiments

Based on the aforementioned two concerns, we focus on training time and mAP on small object detection.

Fig 6. Performance Comparison between DETR and Deformable DETR

Fig. 6 quantitatively presents the performance of DETR and Deformable DETR on the COCO 2017 dataset. Deformable DETR needs only about one tenth of DETR's epochs to converge, so training is much faster. The metric AP@S represents the AP on small objects, and there is a 6% improvement at 50 epochs.

Fig 7. More quantitative evaluation on COCO 2017 dev set

Fig. 7 shows more quantitative comparisons. Deformable DETR does relieve the burden of learnable parameters and FLOPs, and training is much faster.

Comments on Deformable DETR

Following up on Fig. 7, both DETR and Deformable DETR have slower inference than Faster R-CNN while striving for competitive mAP. There is room for improving inference speed, as we may not need 6 encoder plus 6 decoder layers if appropriate prior information is available. You will understand this better after reviewing Efficient DETR.

References

[1] Nikolas Adaloglou, Understanding the receptive field of deep convolutional networks, https://theaisummer.com/receptive-field/

[2] Vaswani, Ashish & Shazeer, Noam & Parmar, Niki & Uszkoreit, Jakob & Jones, Llion & Gomez, Aidan & Kaiser, Lukasz & Polosukhin, Illia. (2017). Attention Is All You Need.

[3] Dong, Yihe & Cordonnier, Jean-Baptiste & Loukas, Andreas. (2021). Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth.

[4] Zhu, Xizhou & Su, Weijie & Lu, Lewei & Li, Bin & Wang, Xiaogang & Dai, Jifeng. (2020). Deformable DETR: Deformable Transformers for End-to-End Object Detection.
