DEtection TRansformer (DETR) vs. YOLO for object detection.

Fahim Rustamy, PhD
9 min read · Aug 20, 2023


Ever wondered how computers can analyze images, identifying and localizing the objects within them? That's exactly what object detection accomplishes in the world of computer vision. DEtection TRansformer (DETR) and You Only Look Once (YOLO) are two of the most prominent approaches to object detection. YOLO has earned its reputation as the go-to model for real-time object detection and tracking problems. Meanwhile, DETR, a rising contender powered by transformers, has the potential to reshape computer vision the way transformers reshaped natural language processing. In this blog post, I will explore these two methods to understand how they work their magic!

You can find the code in this GitHub repo:

https://github.com/RustamyF/detr-vision

Since 2012, computer vision has undergone a revolutionary transformation driven by the arrival of Convolutional Neural Networks (CNNs) and deep learning architectures. Notable among these architectures are AlexNet (2012), GoogLeNet (2014), VGGNet (2014), and ResNet (2015), which stacked ever more convolutional layers to push image classification accuracy. While image classification assigns a label to an entire image, like categorizing a picture as a dog or a car, object detection not only identifies what is in an image but also pinpoints where each object is located within it.

Example of object detection and classification on images.

The original YOLO (2015) paper was a breakthrough in real-time object detection when it was released, and it is still one of the most used models in practical vision applications. It replaced the multi-stage pipelines of detectors such as R-CNN and Fast R-CNN with a single-stage convolutional network, delivering real-time speed while remaining competitive in accuracy with the state-of-the-art detectors of the time. The architecture has evolved since the original paper, with successive versions adding hand-crafted features to improve accuracy. Here is an overview of the first three versions of YOLO and their differences.

YOLO v1 (2015) was the original version and set the foundation for subsequent iterations. It used a single deep convolutional neural network (CNN) to predict bounding boxes and class probabilities. YOLO v1 divided the input image into a grid and made predictions at each cell of the grid. Each cell was responsible for predicting a fixed number of bounding boxes and their corresponding class probabilities. This version achieved real-time object detection with impressive speed but had some limitations in detecting small objects and accurately localizing overlapping objects.

Architecture of the original YOLO v1 (2015)

YOLO v2 (2016) addressed some of the limitations of the original YOLO model. It introduced anchor boxes, which helped to better predict bounding boxes of different sizes and aspect ratios. YOLO v2 used a more powerful backbone network, Darknet-19, and was trained not only on the original dataset (PASCAL VOC) but also on the COCO dataset, which significantly increased the number of detectable classes. The combination of anchor boxes and multi-scale training helped improve the detection performance for small objects.

YOLO v3 (2018) further improved object detection performance. This version introduced multi-scale predictions inspired by feature pyramid networks, with multiple detection layers that allow the model to detect objects at different scales and resolutions. YOLO v3 used a larger backbone with 53 convolutional layers, called Darknet-53, which improved the model's representational capacity. YOLO v3 detects at three different scales, using 13x13, 26x26, and 52x52 grids (for a 416 x 416 input), and predicts three bounding boxes per grid cell at each scale using anchors of different sizes.

Architecture of YOLO v3. (taken from Reference)

Wait, how many bounding boxes are we predicting? YOLO v1 (which used a 448 x 448 input) predicts 7 x 7 x 2 = 98 boxes, two per grid cell. YOLO v2, at a resolution of 416 x 416, predicts 13 x 13 x 5 = 845 boxes, five per grid cell using five anchors. YOLO v3 goes further and predicts boxes at three different scales: for the same 416 x 416 image, the number of predicted boxes is 13 x 13 x 3 + 26 x 26 x 3 + 52 x 52 x 3 = 10,647. Non-Maximum Suppression (NMS), a post-processing technique, is used to filter out redundant and overlapping bounding box predictions. In the NMS algorithm, boxes below a certain confidence score are first removed from the prediction list. Then the prediction with the highest confidence score is taken as the "current" prediction, and all lower-confidence predictions whose IoU with it exceeds a certain threshold (e.g., 0.5) are marked as redundant and suppressed. For implementing NMS in PyTorch from scratch, please refer to this YouTube video.
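
As a quick illustration of the idea, here is a minimal sketch of confidence filtering followed by NMS using torchvision's built-in torchvision.ops.nms; the boxes, scores, and thresholds below are made up for demonstration.

import torch
from torchvision.ops import nms

# toy predictions: boxes in (x1, y1, x2, y2) format plus confidence scores
boxes = torch.tensor([
    [100., 100., 210., 210.],   # box A
    [105., 105., 215., 215.],   # heavily overlaps box A
    [300., 300., 400., 400.],   # a separate object
])
scores = torch.tensor([0.90, 0.75, 0.80])

# step 1: drop low-confidence predictions
conf_threshold = 0.5
keep = scores > conf_threshold
boxes, scores = boxes[keep], scores[keep]

# step 2: suppress boxes whose IoU with a higher-scoring box exceeds 0.5
keep_idx = nms(boxes, scores, iou_threshold=0.5)
print(boxes[keep_idx])  # box A and the separate object survive; the overlap is suppressed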

DETR (DEtection TRansformer) is a relatively new object detection algorithm introduced in 2020 by researchers at Facebook AI Research (FAIR). It is based on the transformer architecture, a powerful sequence-to-sequence model that has been used for various natural language processing tasks. Traditional object detectors (e.g., R-CNN and YOLO) are complex, have gone through many revisions, and rely on hand-designed components such as anchor boxes and NMS. DETR, on the other hand, is a direct set prediction model that uses a transformer encoder-decoder architecture to predict all objects at once. This approach is conceptually simpler than traditional detection pipelines and achieves comparable performance on the COCO dataset.
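
A key ingredient of this set prediction formulation is the bipartite (Hungarian) matching used during training, which assigns each ground-truth object to exactly one predicted slot. Here is a minimal sketch of that idea with scipy.optimize.linear_sum_assignment on a toy cost matrix; the costs are invented for illustration, whereas DETR's real matcher combines classification probability, L1 box distance, and generalized IoU.

import numpy as np
from scipy.optimize import linear_sum_assignment

# toy cost matrix: rows = 4 predicted slots, columns = 2 ground-truth objects;
# each entry is the cost of matching that prediction to that ground-truth box
cost = np.array([
    [0.9, 0.2],
    [0.1, 0.8],
    [0.7, 0.6],
    [0.4, 0.3],
])

# Hungarian algorithm: one-to-one assignment with minimal total cost
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gt_idx)))  # [(0, 1), (1, 0)]
# prediction slots left unmatched are trained to output the "no object" class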

The DETR architecture is simple and consists of three main components: a CNN backbone (e.g., ResNet) for feature extraction, a transformer encoder-decoder, and a feed-forward network (FFN) for the final detection predictions. The backbone processes the input image and produces an activation map. A 1x1 convolution reduces its channel dimension, and the resulting features are flattened and passed through the transformer encoder, which applies multi-head self-attention and feed-forward layers. The transformer decoder takes a fixed set of N learned object queries and decodes them in parallel; each output embedding is then independently mapped to box coordinates and a class label by the FFN. Through attention, DETR reasons about all objects jointly using pair-wise relations and the whole image context.

DETR Architecture Taken From the Original Paper

The following code (taken from DETR's official GitHub repository) defines a minimal DETR model and its forward pass, which runs the input through the convolutional backbone and the transformer. I included the output shape from each layer of the network in the comments to give a sense of all the data transformations.

import torch
from torch import nn
from torchvision.models import resnet50


class DETRdemo(nn.Module):
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6):
        super().__init__()

        # 2. create ResNet-50 backbone
        self.backbone = resnet50()
        del self.backbone.fc

        # create conversion layer
        self.conv = nn.Conv2d(2048, hidden_dim, 1)

        # 3. create a default PyTorch transformer
        self.transformer = nn.Transformer(
            hidden_dim, nheads, num_encoder_layers, num_decoder_layers)

        # 4. prediction heads, one extra class for predicting non-empty slots
        # note that in baseline DETR linear_bbox layer is 3-layer MLP
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)

        # 5. output positional encodings (object queries)
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))

        # spatial positional encodings
        # note that in baseline DETR we use sine positional encodings
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        # propagate inputs through ResNet-50 up to avg-pool layer
        # input: torch.Size([1, 3, 800, 1066])
        x = self.backbone.conv1(inputs)   # torch.Size([1, 64, 400, 533])
        x = self.backbone.bn1(x)          # torch.Size([1, 64, 400, 533])
        x = self.backbone.relu(x)         # torch.Size([1, 64, 400, 533])
        x = self.backbone.maxpool(x)      # torch.Size([1, 64, 200, 267])

        x = self.backbone.layer1(x)       # torch.Size([1, 256, 200, 267])
        x = self.backbone.layer2(x)       # torch.Size([1, 512, 100, 134])
        x = self.backbone.layer3(x)       # torch.Size([1, 1024, 50, 67])
        x = self.backbone.layer4(x)       # torch.Size([1, 2048, 25, 34])

        # convert from 2048 to 256 feature planes for the transformer
        h = self.conv(x)                  # torch.Size([1, 256, 25, 34])

        # construct positional encodings
        H, W = h.shape[-2:]
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)            # torch.Size([850, 1, 256])

        # flatten the feature map and add the positional encoding (encoder input),
        # and use the learned object queries as the decoder input
        src = pos + 0.1 * h.flatten(2).permute(2, 0, 1)  # torch.Size([850, 1, 256])
        target = self.query_pos.unsqueeze(1)             # torch.Size([100, 1, 256])

        # propagate through the transformer
        h = self.transformer(src, target).transpose(0, 1)  # torch.Size([1, 100, 256])

        # finally project transformer outputs to class labels and bounding boxes
        linear_cls = self.linear_class(h)                # torch.Size([1, 100, 92])
        linear_bbx = self.linear_bbox(h).sigmoid()       # torch.Size([1, 100, 4])
        return {'pred_logits': linear_cls,
                'pred_boxes': linear_bbx}

Here’s an explanation of the code step by step:

  1. Initialization: The __init__ method defines the structure of the DETR module. It takes several hyperparameters as inputs, including the number of classes (num_classes), hidden dimensions (hidden_dim), number of attention heads (nheads), and the number of layers for the encoder and decoder (num_encoder_layers and num_decoder_layers).
  2. Backbone and Conversion Layer: The code creates a ResNet-50 backbone (self.backbone) and removes its fully connected (fc) layer since it won't be used for detection. The conv layer (self.conv) is added to convert the output of the backbone from 2048 channels to hidden_dim channels.
  3. Transformer: A PyTorch transformer is created using the nn.Transformer class (self.transformer). It contains both the encoder and decoder parts of the model, with the number of encoder and decoder layers and other parameters set from the provided hyperparameters.
  4. Prediction Heads: The model defines two linear layers for prediction: self.linear_class, which predicts class logits (with one extra class that serves as the "no object" label for empty slots, hence num_classes + 1), and self.linear_bbox, which predicts the bounding box coordinates. The .sigmoid() function ensures the bounding box coordinates stay within the [0, 1] range.
  5. Positional Encodings: Positional encodings are crucial for transformer-based models. The model defines the query positional encoding (self.query_pos) and spatial positional encodings (self.row_embed and self.col_embed). These encodings help the model understand the spatial relationships between different elements.
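
To make the shape bookkeeping concrete, here is a small usage sketch of the DETRdemo class defined above; no pretrained checkpoint is loaded, so the predictions are meaningless, but the output shapes match the comments in the code.

import torch

# instantiate the demo model with COCO's 91 classes
# (the head therefore outputs 92 logits, including the extra "no object" class)
model = DETRdemo(num_classes=91)
model.eval()

# a dummy 3-channel image matching the shapes commented above
dummy = torch.rand(1, 3, 800, 1066)
with torch.no_grad():
    out = model(dummy)

print(out["pred_logits"].shape)  # torch.Size([1, 100, 92])
print(out["pred_boxes"].shape)   # torch.Size([1, 100, 4])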

The model always produces a fixed set of 100 predictions, many of which correspond to the extra "no object" class. We keep only the outputs whose class probability exceeds a chosen threshold and discard all the other predictions.
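
Continuing the sketch above, a minimal way to apply that thresholding (the 0.7 cutoff here is arbitrary; server.py later uses 0.9) looks like this:

# out is the dict returned by the model above
probas = out["pred_logits"].softmax(-1)[0, :, :-1]  # drop the extra "no object" column
keep = probas.max(-1).values > 0.7                  # arbitrary confidence threshold
boxes = out["pred_boxes"][0, keep]                  # normalized (cx, cy, w, h) boxes
print(keep.sum().item(), "of 100 predictions kept")
# the kept boxes still need to be rescaled to pixel coordinates
# (the repo's rescale_bboxes helper does exactly that)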

Example

In this section, I showcase an example project from my GitHub repository, where I ran the DETR and YOLO models on a real-time video stream. The project's objective was to compare the performance of DETR on a real-time video stream against YOLO, the de facto model for most real-time applications in industry. The server.py script shown below uses YOLO v8 from Ultralytics and a pre-trained DETR model from Torch Hub.

import torch
from ultralytics import YOLO
import cv2
from dataclasses import dataclass
import time
from utils.functions import plot_results, rescale_bboxes, transform
from utils.datasets import LoadWebcam, LoadVideo
import logging

logging.basicConfig(
    level=logging.DEBUG, format="%(asctime)s - %(levelname)s - %(message)s"
)


@dataclass
class Config:
    source: str = "assets/walking_resized.mp4"
    view_img: bool = False
    model_type: str = "detr_resnet50"
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    skip: int = 1
    yolo: bool = True
    yolo_type = "yolov8n.pt"


class Detector:
    def __init__(self):
        self.config = Config()
        self.device = self.config.device
        if self.config.source == "0":
            logging.info("Using stream from the webcam")
            self.dataset = LoadWebcam()
        else:
            logging.info("Using stream from the video file: " + self.config.source)
            self.dataset = LoadVideo(self.config.source)
        self.start = time.time()
        self.count = 0

    def load_model(self):
        if self.config.yolo:
            if self.config.yolo_type is None or self.config.yolo_type == "":
                raise ValueError("YOLO model type is not specified")
            model = YOLO(self.config.yolo_type)
            logging.info(f"YOLOv8 Inference using {self.config.yolo_type}")
        else:
            if self.config.model_type is None or self.config.model_type == "":
                raise ValueError("DETR model type is not specified")
            model = torch.hub.load(
                "facebookresearch/detr", self.config.model_type, pretrained=True
            ).to(self.device)
            model.eval()
            logging.info(f"DETR Inference using {self.config.model_type}")
        return model

    def detect(self):
        model = self.load_model()
        for img in self.dataset:
            self.count += 1
            if self.count % self.config.skip != 0:
                continue
            if not self.config.yolo:
                # DETR: preprocess the frame, run inference, and keep only
                # predictions with 0.9+ confidence
                im = transform(img).unsqueeze(0).to(self.device)
                outputs = model(im)
                probas = outputs["pred_logits"].softmax(-1)[0, :, :-1]
                keep = probas.max(-1).values > 0.9
                bboxes_scaled = rescale_bboxes(
                    outputs["pred_boxes"][0, keep].to("cpu"), img.shape[:2]
                )
            else:
                # YOLOv8: the Ultralytics model handles preprocessing internally
                outputs = model(img)
            logging.info(
                f"FPS: {self.count / self.config.skip / (time.time() - self.start)}"
            )
            if self.config.view_img:
                if self.config.yolo:
                    annotated_frame = outputs[0].plot()
                    cv2.imshow("YOLOv8 Inference", annotated_frame)
                    if cv2.waitKey(1) & 0xFF == ord("q"):
                        break
                else:
                    plot_results(img, probas[keep], bboxes_scaled)
        logging.info("************************* Done *****************************")


if __name__ == "__main__":
    detector = Detector()
    detector.detect()

The server.py script is responsible for fetching data from sources such as webcams, IP cameras, or local video files. The source, model, and other options can be changed in the Config data class in server.py, as shown in the sample tweak below. In my tests, the yolov8m.pt model achieved an impressive processing speed of about 55 frames per second (FPS) on a Tesla T4 GPU, while the detr_resnet50 model ran at about 15 FPS.
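
For example, a hypothetical tweak of the Config defaults to read from the webcam and visualize DETR predictions could look like this (same fields as in server.py above):

@dataclass
class Config:
    source: str = "0"                  # "0" switches the input to the webcam
    view_img: bool = True              # display the annotated frames
    model_type: str = "detr_resnet50"  # DETR variant pulled from Torch Hub
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    skip: int = 1                      # process every frame
    yolo: bool = False                 # False -> run DETR instead of YOLOv8
    yolo_type = "yolov8n.pt"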

Conclusion

In conclusion, YOLO is an excellent choice for applications that require real-time detection with a focus on speed, making it suitable for video analysis and live object tracking. DETR, on the other hand, shines in tasks demanding improved accuracy and the handling of complex interactions between objects, which can be particularly important in fields like medical imaging, fine-grained object detection, and scenarios where detection quality outweighs real-time processing speed. It's important to recognize, however, that a new iteration of DETR, known as RT-DETR (Real-Time DETR), was published in 2023, claiming superior performance in both speed and accuracy compared to YOLO detectors of similar scale. This innovation, although not covered in this blog, underscores the dynamic nature of the field and may further shift the choice between YOLO and DETR depending on specific application requirements.
