Revolutionary Object Detection Algorithm from Facebook AI

A 2020 approach to state-of-the-art object detection

Deval Shah
VisionWizard
10 min read · Jun 14, 2020



A detailed breakdown of "Detection Transformer" (DETR), an unorthodox object detection algorithm from Facebook AI released in May 2020.

Let’s jump right in.

Table of Contents

  1. Introduction
  2. Overall Idea of DETR
  3. How does DETR differ from other object detection methods?
  4. What is a transformer?
  5. DETR Pipeline
  6. DETR Architecture
  7. Loss Function
  8. Results
  9. Code
  10. Conclusion
  11. References

Introduction

In the object detection world, research keeps pivoting to new approaches and techniques to further improve accuracy. Object detection is one of the most researched fields in Artificial Intelligence, with major universities and companies publishing papers every year.

In the initial stage of object detection research, researchers focused on improving the image features produced by the backbone. The idea was to design a backbone architecture that produces the most suitable features for the subsequent box classification and prediction.

Furthermore, the introduction of FPNs, ResNet modules, Inception modules, etc. brought the idea of effectively routing features so that later layers can refine them further, which increased the accuracy of object detection models. Afterwards, researchers started focusing on the efficiency side of the task. One-stage detectors that can run in real time, such as YOLO and SSD, along with lightweight architectures like MobileNet and SqueezeNet, came along the way.

Now, after years of research, object detection has turned into a beautiful mess of many moving parts. Today's complex object detectors offer capabilities such as high accuracy, real-time inference, and small-object prediction in low-resolution images, all of which were unheard of less than a decade ago.

Still, object detection lacks the simplicity of the classification task in terms of training, testing, and a unified architecture that generalizes well with a limited number of parameters.

To overcome this, Facebook AI released the paper "DETR: End-to-End Object Detection with Transformers" in May 2020. The authors used an intersection of NLP (Natural Language Processing) and Computer Vision to accomplish the object detection task.

Overall Idea of DETR

In DETR, the object detection problem is modeled as a direct set prediction problem.

The approach does not require hand-crafted components like the non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. This makes the detection pipeline a simple, end-to-end unified architecture.

The two novel components of the new framework, called DEtection TRansformer or DETR, are:

  • a set-based global loss that forces unique predictions via bipartite matching.
  • a transformer encoder-decoder architecture.

Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel.[1]

How does DETR differ from other object detection methods?

DETR formulates the object detection task as an image-to-set problem.

Given an image, the model must predict an unordered set (or list) of all the objects present, each represented by its class, along with a tight bounding box surrounding each one. The transformer acts as a reasoning agent between the image features and the prediction. [2]

If you want to treat the transformer as a black box, you can skip the next part. If you have never heard of what a transformer does, I would recommend reading it; the transformer is a concept from Natural Language Processing (NLP).

What is a transformer?

The paper ‘Attention Is All You Need’ introduces a novel architecture called the Transformer. As the title indicates, it uses the attention mechanism.

Like LSTM-based models, the Transformer is an architecture for transforming one sequence into another with the help of two parts (an Encoder and a Decoder), but it differs from previously existing sequence-to-sequence models because it does not use any recurrent networks (GRU, LSTM, etc.).[3]

Fig 1 : Architecture of the Transformer, from ‘Attention Is All You Need’ by Vaswani et al.

The Encoder is on the left and the Decoder is on the right. Both the Encoder and the Decoder are composed of modules that can be stacked on top of each other multiple times, which is indicated by Nx in the figure. We see that the modules consist mainly of Multi-Head Attention and Feed Forward layers.[3]

Transformers rely on a simple yet powerful mechanism called attention, which enables AI models to selectively focus on certain parts of their input and thus reason more effectively.
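To make the idea concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. This is not the authors' implementation; the single-head formulation and the shapes are simplifying assumptions.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Single-head attention; q, k, v have shape (sequence_length, d_model)."""
    d_k = q.size(-1)
    # Similarity between every query and every key, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Softmax turns the scores into weights; each output is a weighted sum of the values.
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Toy self-attention over 10 tokens with 64-dimensional embeddings.
tokens = torch.randn(10, 64)
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # torch.Size([10, 64])
```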

Transformers have been widely applied on problems with sequential data, in particular in natural language processing (NLP) tasks such as language modeling and machine translation, and have also been extended to tasks as diverse as speech recognition, symbolic mathematics, and reinforcement learning. But, perhaps surprisingly, computer vision has not yet been swept up by the Transformer revolution. [2]

DETR completely changes the architecture compared with previous object detection systems. It is the first object detection framework to successfully integrate Transformers as a central building block in the detection pipeline.

DETR Pipeline

Fig 2 : DETR Algorithm [Source]

Inference

  1. Compute image features with a CNN backbone.
  2. Pass them through the transformer encoder-decoder.
  3. Predict the final set of detections (all at once).

Training

  1. Compute image features with a CNN backbone.
  2. Pass them through the transformer encoder-decoder.
  3. Feed the outputs to a set loss function, which performs bipartite matching between predicted and ground-truth objects to remove false or extra detections (a rough sketch of this flow follows below).
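Putting the inference and training steps together, a purely illustrative sketch of one forward pass could look like this; `backbone`, `transformer`, and `prediction_heads` are placeholder callables, not the actual DETR modules:

```python
# Purely illustrative; the real DETR modules are more involved.
def detr_forward(image, backbone, transformer, prediction_heads, object_queries):
    features = backbone(image)                          # 1. image features from a CNN
    embeddings = transformer(features, object_queries)  # 2. encoder-decoder over the features
    classes, boxes = prediction_heads(embeddings)       # 3. N predictions, all at once
    return classes, boxes

# During training, (classes, boxes) are matched one-to-one with the ground truth
# via bipartite matching before the set loss is computed.
```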

Advantages of the DETR pipeline

  • Easy to use
  • No custom layers
  • Easy extension to other tasks
  • No prior information about anchors or hand-crafted components like NMS is needed

DETR Architecture

It contains three main components, as follows:

1. CNN backbone to extract a compact feature representation

2. An encoder-decoder transformer

3. A simple feed forward network (FFN) that makes the final detection prediction.

Fig 3 : Pipeline of DETR [Source]

Backbone

The authors used ResNet-50 as the backbone of DETR. In principle, any backbone can be used, depending on the complexity of the task at hand.

The backbone provides a compact, lower-resolution representation of the image with refined features.
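As a rough illustration, a backbone feature map can be obtained from a standard torchvision ResNet-50 by dropping its pooling and classification layers. This mirrors the spirit of the paper but is not the authors' exact implementation:

```python
import torch
import torchvision

# ResNet-50 without the average-pool and fc layers: outputs a C x H/32 x W/32 feature map.
resnet = torchvision.models.resnet50(pretrained=True)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

image = torch.randn(1, 3, 800, 1066)   # stand-in for a preprocessed COCO-sized image
with torch.no_grad():
    features = backbone(image)
print(features.shape)                  # torch.Size([1, 2048, 25, 34]) -> C = 2048
```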

Before we move to the details of the transformer encoder and decoder, I recommend going through this great explanation before proceeding further.

Transformer

Fig 4 : Transformer Architecture in DETR [1]

As you can see, it is very similar to the original transformer block, with minor differences adapted to this task.

Encoder

First, a 1x1 convolution reduces the channel dimension of the high-level activation map from C to a smaller dimension d, creating a new d×H×W feature map. The encoder expects a sequence as input, so the spatial dimensions are collapsed into one, resulting in a d×HW feature map.

Each encoder layer has a standard architecture and consists of a multi-head self-attention module and a feed forward network (FFN).

Since the transformer architecture is permutation-invariant, they supplement it with fixed positional encodings that are added to the input of each attention layer.
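A minimal sketch of the encoder input preparation, under the simplifying assumption that the positional encoding is a learned tensor added once at the input (the paper uses fixed sine encodings and adds them at every attention layer):

```python
import torch
import torch.nn as nn

C, d, H, W = 2048, 256, 25, 34                   # assumed dimensions, for illustration
features = torch.randn(1, C, H, W)               # backbone output

project = nn.Conv2d(C, d, kernel_size=1)         # 1x1 conv: C -> d
x = project(features)                            # (1, d, H, W)
x = x.flatten(2).permute(2, 0, 1)                # (H*W, 1, d): a sequence of H*W "tokens"

pos = nn.Parameter(torch.randn(H * W, 1, d))     # stand-in positional encoding
encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
memory = encoder(x + pos)                        # (H*W, 1, d) encoder output ("memory")
```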

For in-depth details about the multi-head attention module, I would recommend reading the detailed explanation provided in the supplementary material of [1].

Decoder

The decoder follows the standard architecture of the transformer, transforming N embeddings of size d using multi-headed self- and encoder-decoder attention mechanisms.

The difference from the original transformer is that the DETR model decodes the N objects in parallel at each decoder layer.

These N input embeddings are learnt positional encodings that the authors refer to as object queries, and similarly to the encoder, they are added to the input of each attention layer.

The N object queries are transformed into output embeddings by the decoder. They are then independently decoded into box coordinates and class labels by a feed forward network (FFN), resulting in N final predictions.

Using self- and encoder-decoder attention over these embeddings, the model reasons about all objects jointly, based on the entire context of the image and the pair-wise relations between them.

The decoder receives queries (initially set to zero), output positional encodings (the object queries), and the encoder memory, and produces the final set of predicted class labels and bounding boxes through multiple layers of multi-head self-attention and encoder-decoder attention. The self-attention layer in the first decoder layer can be skipped.[1]
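A rough sketch of the decoder side using PyTorch's built-in transformer decoder, with N = 100 learned object queries. This simplifies the paper's decoder, which re-adds the queries at every attention layer:

```python
import torch
import torch.nn as nn

d, N = 256, 100                                        # hidden dim, number of object queries
memory = torch.randn(25 * 34, 1, d)                    # encoder output from the previous sketch

object_queries = nn.Parameter(torch.randn(N, 1, d))    # learned "object queries"
tgt = torch.zeros(N, 1, d)                             # decoder input, initially set to zero

decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

# All N object slots are decoded in parallel: self-attention lets them reason about
# each other, encoder-decoder attention lets them look at the image features.
hs = decoder(tgt + object_queries, memory)             # (N, 1, d): one embedding per slot
```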

Feed-Forward Network (FFN)

The FFN is a 3-layer perceptron with the ReLU activation function and hidden dimension d, followed by a linear projection layer. FFN layers are effectively multi-layer 1x1 convolutions, which have Md input and output channels.

The FFN predicts the normalized center coordinates, height and width of the box w.r.t. the input image, and the linear layer predicts the class label using a softmax function.

Since we predict a fixed-size set of N bounding boxes, where N is usually much larger than the actual number of objects of interest in an image, an additional special class label ∅ is used to represent that no object is detected within a slot. This class plays a role similar to the “background” class in standard object detection approaches.[1]
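A minimal sketch of the prediction heads on top of the N decoder embeddings; `num_classes + 1` accounts for the extra "no object" class ∅ (the class count here is illustrative):

```python
import torch
import torch.nn as nn

d, N, num_classes = 256, 100, 91                 # 91 COCO classes, for illustration

class BoxFFN(nn.Module):
    """3-layer MLP with ReLU that regresses normalized (cx, cy, w, h)."""
    def __init__(self, d):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, 4),
        )

    def forward(self, x):
        return self.mlp(x).sigmoid()             # keep coordinates in [0, 1]

class_head = nn.Linear(d, num_classes + 1)       # +1 for the "no object" class
box_head = BoxFFN(d)

hs = torch.randn(N, 1, d)                        # decoder output embeddings
class_logits = class_head(hs)                    # (N, 1, num_classes + 1); softmaxed in the loss
boxes = box_head(hs)                             # (N, 1, 4) normalized center-x, center-y, w, h
```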

Loss Function

The following explanation may seem like a lot to grasp, but trust me, when you read it carefully it boils down to just two simple steps.

  1. Calculate the best match between predictions and the given ground truths using a graph technique with a cost function (unique to DETR).
  2. Next, define a loss to penalize the class and box predictions (the usual step).

A short detour into what bipartite matching is, from GeeksForGeeks:

  • A matching in a Bipartite Graph is a set of edges chosen in such a way that no two edges share an endpoint. A maximum matching is a matching of maximum size (maximum number of edges).
  • In a maximum matching, if any edge is added to it, it is no longer a matching. There can be more than one maximum matching for a given Bipartite Graph.

Fig 5 : Bipartite Matching [4]

In the object detection context, we have to find the best predicted box for each ground truth. This step eliminates the need for Non-Maximum Suppression (NMS) and the anchors used in today's object detectors.

The matching is formulated as an optimal bipartite matching problem. Allow me to simplify it.

  • Let y be the set of ground truths in the image I.
  • Let ŷ be the set of predictions from the network.

The prediction set ŷ has a fixed size N, which is assumed to be much larger than the number of objects in any image. For the matching, the ground-truth set is also considered to have size N, padded with "no object" (∅) labels.

Next, they find the bipartite matching between these two sets by searching over permutations of N elements for the one with the lowest matching cost, as follows:

Fig 6 : Best match between pred and gt with lowest cost [1]

where L_match(y_i, ŷ_σ(i)) is a pair-wise matching cost between the ground truth y_i and the prediction with index σ(i). It is formulated as an assignment problem with m ground truths and n predictions and is computed efficiently with the Hungarian algorithm over an m×n cost matrix.

The matching cost takes into account both the class prediction(classification) and the similarity of predicted and ground truth boxes(regression).

Each element i of the ground truth set can be seen as y_i = (c_i, b_i), where c_i is the target class label (which may be ∅) and b_i ∈ [0, 1]⁴ is a vector with four attributes: the normalized center coordinates, height, and width of the ground truth box relative to the image size. For the prediction with index σ(i), we define the probability of class c_i as p̂_σ(i)(c_i) and the predicted box as b̂_σ(i). The first part of the cost handles the class prediction and the second part handles the box prediction. The matching cost is defined as follows: [1]

Fig 7 : Matching Cost Function [1]
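In code, the optimal assignment can be computed with the Hungarian algorithm via `scipy.optimize.linear_sum_assignment` over an N×M cost matrix. The sketch below uses only the class and L1 box terms with illustrative weights; the paper's cost also includes a generalized IoU term:

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_classes, gt_boxes):
    """pred_probs: (N, num_classes + 1), pred_boxes: (N, 4),
    gt_classes: (M,), gt_boxes: (M, 4), with M <= N."""
    # Classification term: the higher the probability of the true class, the lower the cost.
    cost_class = -pred_probs[:, gt_classes]               # (N, M)
    # Box term: L1 distance between predicted and ground-truth boxes.
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)     # (N, M)
    cost = cost_class + 5.0 * cost_box                    # illustrative weighting
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, gt_idx   # matched pairs; unmatched prediction slots become "no object"
```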

After receiving all matched pairs for the set, the next step is to compute the loss function, the Hungarian loss.

Fig 8 : Hungarian Loss between pred and gt [1]

where σ̂ is the optimal assignment (the best match) computed in the matching step above.

It computes a negative log-likelihood over all N matched pairs of predictions and ground truths to penalize extra and incorrect boxes and their corresponding classifications. This is the same as in most other object detectors.

In the paper, the authors down-weight the log-probability term when c_i = ∅ (no object) by a factor of 10 to account for class imbalance. This is similar to how Faster R-CNN and other two-stage detectors account for the positive-to-negative imbalance ratio.[1]
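In code terms, this down-weighting can be expressed with per-class weights in the cross-entropy loss. A small sketch, where the last class index stands for ∅ and the tensors are placeholders:

```python
import torch
import torch.nn.functional as F

N, num_classes = 100, 91
class_weights = torch.ones(num_classes + 1)
class_weights[num_classes] = 0.1                  # down-weight "no object" by a factor of 10

logits = torch.randn(N, num_classes + 1)          # per-slot class logits (placeholder)
target_classes = torch.full((N,), num_classes)    # all slots start as "no object"
# Matched slots would get their ground-truth class, e.g. target_classes[pred_idx] = gt_classes[gt_idx]
loss_ce = F.cross_entropy(logits, target_classes, weight=class_weights)
```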

Box Loss Function (L_box)

The paper uses a linear combination of the L1 loss and the Generalized IoU loss (which is scale-invariant in nature).

Fig 9 : Box Loss Function [1]

This loss helps predict the box directly, without any anchor reference or scaling issues. These two losses are normalized by the number of objects inside the batch.
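A sketch of the box loss over matched pairs, using `torchvision.ops.generalized_box_iou` after converting boxes from center format to corner format. The weights of 5 and 2 follow the values reported in the paper, but treat the exact wiring here as an assumption:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_convert, generalized_box_iou

def box_loss(pred_boxes, gt_boxes, num_objects):
    """pred_boxes, gt_boxes: (M, 4) matched pairs in normalized (cx, cy, w, h) format."""
    l1 = F.l1_loss(pred_boxes, gt_boxes, reduction="sum")
    giou = generalized_box_iou(
        box_convert(pred_boxes, in_fmt="cxcywh", out_fmt="xyxy"),
        box_convert(gt_boxes, in_fmt="cxcywh", out_fmt="xyxy"),
    )
    loss_giou = (1 - torch.diag(giou)).sum()
    # Both terms are normalized by the number of objects in the batch.
    return (5 * l1 + 2 * loss_giou) / num_objects
```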

Results

Fig 10 : DETR Results on COCO [1]

Code

The authors have provided the inference code (fewer than 50 lines of PyTorch) on the last page of the paper [1].

Github Link : https://github.com/facebookresearch/detr

Fig 11 : DETR Inference code [1]
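For a quick experiment, a pretrained DETR can also be loaded through `torch.hub` from the same repository, following the usage shown in the repo; the confidence threshold below is an arbitrary choice:

```python
import torch

# Pretrained DETR with a ResNet-50 backbone from the official repository.
model = torch.hub.load("facebookresearch/detr", "detr_resnet50", pretrained=True)
model.eval()

image = torch.randn(1, 3, 800, 1066)        # stand-in for a normalized, resized image tensor
with torch.no_grad():
    outputs = model(image)

# outputs["pred_logits"]: (1, 100, 92) class logits; outputs["pred_boxes"]: (1, 100, 4)
probs = outputs["pred_logits"].softmax(-1)[0, :, :-1]   # drop the "no object" column
keep = probs.max(-1).values > 0.7                        # keep only confident slots
print(outputs["pred_boxes"][0, keep])                    # normalized (cx, cy, w, h) boxes
```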

Conclusion

  • DETR is a fresh design for object detection systems based on transformers and bipartite matching loss for direct set prediction.
  • It achieves significantly better performance on large objects than Faster R-CNN, likely thanks to the processing of global information performed by the self-attention.
  • It underperforms on smaller objects compared to other object detectors of similar size.
  • It requires long training times and is not real time.
  • The transformer architecture leads to significant overhead in training/inference.

We have not covered the extensive results and ablation studies given in the paper. I would recommend going through them in the paper, as they are fairly straightforward. Additionally, we have skipped the panoptic segmentation discussion, as it is beyond the scope of this article.

I hope you found the content meaningful. If you want to stay updated with the latest research in AI, please follow us at VisionWizard.
