Transformers for Vision/DETR

Maharshi Yeluri
Published in The Startup · Jul 11, 2020

Transformers are widely known for their accomplishments in the field of NLP. Recent work shows that the architecture has an inherent ability to generalize and fit many different tasks, and this capability has become the primary reason for adopting transformers in the vision field.

What is DETR? Why DETR?

Well, DETR stands for “DEtection TRansformer”. It was proposed by FAIR (Facebook AI Research), and its benchmarks are comparable to, and in some cases slightly better than, Faster R-CNN.

Existing object detection frameworks are carefully crafted around different design principles. Two-stage detectors (Faster R-CNN) predict boxes w.r.t. proposals, whereas single-stage methods (YOLO) make predictions w.r.t. anchors or a grid of possible object centers. The performance of these systems is influenced by post-processing steps that collapse near-duplicate predictions (NMS) and by the exact way these initial guesses are set (anchor boxes). DETR proposes an end-to-end architecture that eliminates these hand-crafted components and predicts bounding boxes directly w.r.t. the input image (absolute box prediction).

DETR architecture

DETR mainly comprises four blocks, as depicted in the diagram below:

  1. Backbone
  2. Transformer Encoder
  3. Transformer Decoder
  4. Prediction heads
Figure: the DETR architecture (from “End-to-End Object Detection with Transformers”)

Backbone

Starting from the initial image, a CNN backbone generates a lower-resolution activation map. The input images are batched together, applying zero-padding where needed so that they all have the same dimensions (H, W) as the largest image in the batch.
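As a rough sketch (not the exact DETR code), here is how a ResNet-50 backbone can produce that lower-resolution activation map and project it to the transformer's hidden dimension; ResNet-50 and a hidden size of 256 follow the paper's defaults.

```python
import torch
import torchvision
from torch import nn

# Sketch of the backbone: a ResNet-50 truncated before the pooling/FC layers,
# followed by a 1x1 conv that projects 2048 channels down to the transformer
# hidden dimension (256 in the paper).
resnet = torchvision.models.resnet50()
backbone = nn.Sequential(*list(resnet.children())[:-2])
proj = nn.Conv2d(2048, 256, kernel_size=1)

images = torch.randn(2, 3, 800, 1066)   # zero-padded batch (B, 3, H, W)
feat = proj(backbone(images))           # (B, 256, H/32, W/32) activation map
print(feat.shape)                       # e.g. torch.Size([2, 256, 25, 34])
```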

Transformer Encoder

The transformer encoder expects a sequence as input, so we collapse the feature map from the backbone into a 1-D sequence. Since the transformer architecture has no notion of order in the sequence (why do we need an order at all?), we supplement it with fixed positional encodings that are added to the input before it is passed into a multi-head attention module. I am not explaining transformers in detail here, since there are many great tutorials, but at a high level the attention mechanism is the key to the success of this architecture. The latent features extracted by the CNN backbone are passed to the multi-head attention module, where each feature is allowed to interact with every other feature via the query, key, and value mechanism. This helps the network figure out which features belong to a single object and distinguish it from nearby objects, even when they belong to the same class. That is why transformers are so powerful at generalizing.
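A minimal sketch of this step with PyTorch's stock transformer modules (the official DETR adds the positional encodings inside every attention layer; adding them once to the input, as described above, keeps the idea simple):

```python
import torch
from torch import nn

# Sketch: flatten the (B, 256, H', W') feature map into a sequence of
# H'*W' tokens, add positional encodings (random here, fixed sinusoidal
# in DETR) and run a standard 6-layer transformer encoder.
B, d_model, Hp, Wp = 2, 256, 25, 34
feat = torch.randn(B, d_model, Hp, Wp)

src = feat.flatten(2).permute(2, 0, 1)        # (H'*W', B, d_model)
pos = torch.randn(Hp * Wp, 1, d_model)        # stand-in for fixed positional encodings

enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
memory = encoder(src + pos)                   # (H'*W', B, d_model)
```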

Transformer Decoder

A standard transformer decoder expects three inputs: queries, keys, and values. The outputs from the encoder are passed as keys and values to the decoder, but what are the queries? Well, the formulation of the decoder is fairly cool: the queries are simply the numbers 1 to N, passed through an embedding layer, and the number of objects the model can detect is equal to N. So what do these queries do? Each query effectively asks what object lies at a specific position; e.g., the embedding for value 1 might ask for objects present in the bottom-left corner, as shown in the diagram below.

Visualization of the query embeddings 1 to 20

This also explains, to some extent, the need for positional encodings in the encoder block.
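Here is a rough sketch of that idea with PyTorch's stock decoder: N learned query embeddings (N = 100 in the paper) attend to the encoder output, which plays the role of keys and values. (In the official implementation the queries are added as positional encodings to a zero-initialized target, but the spirit is the same.)

```python
import torch
from torch import nn

# Sketch: N learned object queries are decoded in parallel; each output
# slot will later be turned into one box + class prediction.
d_model, num_queries, B = 256, 100, 2
query_embed = nn.Embedding(num_queries, d_model)         # "numbers 1..N" as embeddings

dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)

memory = torch.randn(25 * 34, B, d_model)                # encoder output (keys/values)
tgt = query_embed.weight.unsqueeze(1).repeat(1, B, 1)    # (N, B, d_model)
hs = decoder(tgt, memory)                                # (N, B, d_model): one slot per object
```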

Prediction Heads

An FFN predicts the normalized center coordinates, height, and width of the box w.r.t. the input image, and a linear layer predicts the class label using a softmax function.
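A minimal sketch of the two heads, assuming the decoder output from the previous block (91 classes follows the COCO setup, plus one extra “no object” class):

```python
import torch
from torch import nn

# Sketch of the prediction heads applied to each decoder output slot.
d_model, num_classes = 256, 91       # 91 COCO classes; +1 for "no object"

class_head = nn.Linear(d_model, num_classes + 1)
bbox_head = nn.Sequential(           # small FFN (3-layer MLP in the paper)
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, 4),
)

hs = torch.randn(100, 2, d_model)    # decoder output: (N, B, d_model)
class_logits = class_head(hs)        # softmax over these at loss/inference time
boxes = bbox_head(hs).sigmoid()      # normalized (cx, cy, w, h) in [0, 1]
```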

Set predictions and Set-based loss

Set prediction is nothing new: the network predicts a tuple of a class label and the coordinates of the object w.r.t. the input image. Given a predefined number N, the network predicts N such tuples; if there is no object for a given tuple, the network defaults it to the “no object” class (∅).

The network infers a fixed-size set of N predictions in a single forward pass, where N is set significantly larger than the typical number of objects in an image. Now consider a single image in the batch: since N is larger than the actual number of ground-truth objects, the ground truth is padded with the “no object” class to cover the extra predictions made by the network.

Each prediction is assigned to the ground-truth object it matches most closely, and the leftover predictions are mapped to the “no object” class; in the paper this is termed “bipartite matching” and is computed with the Hungarian algorithm.
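A minimal sketch of this matching for a single image, using the Hungarian algorithm from SciPy; the cost here only combines class probability and an L1 box distance, whereas the paper also adds a generalized-IoU term.

```python
import torch
from scipy.optimize import linear_sum_assignment

# Sketch: build an (N predictions x num_gt objects) cost matrix and solve
# the one-to-one assignment with the Hungarian algorithm.
num_preds, num_classes = 100, 91
pred_prob = torch.rand(num_preds, num_classes + 1).softmax(-1)
pred_boxes = torch.rand(num_preds, 4)
gt_labels = torch.tensor([3, 17, 42])          # 3 ground-truth objects
gt_boxes = torch.rand(3, 4)

cost_class = -pred_prob[:, gt_labels]          # higher class prob -> lower cost
cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)
cost = cost_class + 5.0 * cost_bbox            # 5.0: L1 weight from the paper

pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
# Each ground-truth object gets exactly one prediction; all remaining
# predictions are assigned to the "no object" class.
```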

The loss is then computed over these matched pairs: for each pair we compute a cross-entropy classification loss and a bounding-box loss (the bounding-box loss is skipped if the ground-truth class is “no object”). The loss for the “no object” class is down-weighted by a factor of 10 to counter the class imbalance caused by the large value of N. The bounding-box loss is a weighted sum of an IoU loss and an L1 loss between predicted and ground-truth boxes, normalized by the number of ground-truth objects.
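Continuing the sketch above, the per-image loss (with the GIoU term omitted for brevity) might look like this; the 0.1 weight on “no object” and the L1 weight of 5 follow the paper.

```python
import torch
import torch.nn.functional as F

# Sketch of the set-based loss for one image, given the matching above.
num_preds, num_classes = 100, 91
no_object = num_classes                                  # index of the "no object" class

class_logits = torch.randn(num_preds, num_classes + 1)
pred_boxes = torch.rand(num_preds, 4)
gt_labels = torch.tensor([3, 17, 42])
gt_boxes = torch.rand(3, 4)
pred_idx = torch.tensor([7, 55, 90])                     # matched predictions (from Hungarian)

targets = torch.full((num_preds,), no_object, dtype=torch.long)
targets[pred_idx] = gt_labels                            # unmatched slots stay "no object"

weights = torch.ones(num_classes + 1)
weights[no_object] = 0.1                                 # down-weight "no object" by 10x
loss_ce = F.cross_entropy(class_logits, targets, weight=weights)

# Box loss only on matched pairs, normalized by the number of GT objects.
loss_l1 = F.l1_loss(pred_boxes[pred_idx], gt_boxes, reduction="sum") / len(gt_labels)
loss = loss_ce + 5.0 * loss_l1                           # GIoU term omitted for brevity
```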

Conclusion and Results

The approach achieves results comparable to an optimized Faster R-CNN baseline on the COCO dataset. In addition, DETR performs significantly better on large objects than Faster R-CNN, but lags a bit on smaller objects.

References

Carion et al., “End-to-End Object Detection with Transformers”: https://arxiv.org/abs/2005.12872
