COMPUTER VISION

TransVOD: The Next Big Thing in Video Object Detection?

Hansa Hettiarachchi
5 min read · Oct 23, 2023

TransVOD paper explained - Part 1

[Image source: https://aisuperior.com/blog/how-computer-vision-is-transforming-the-healthcare-industry/]

What is video object detection?

Video object detection is the task of identifying and tracking objects in video sequences. It is a challenging task because it requires the model to not only detect objects in each frame of the video, but also to track them over time.

Video object detection has a wide range of applications in areas such as autonomous driving, video surveillance, and medical imaging. For example, video object detection can be used to develop self-driving cars that can safely navigate the road, to develop video surveillance systems that can detect and track suspicious activity, and to develop medical imaging systems that can help doctors diagnose diseases.

Traditional video object detection methods typically rely on a hand-crafted pipeline that includes components such as optical flow, recurrent neural networks, and relation networks. These components can be complex and difficult to train.

Traditional video object detection methods have a number of limitations. First, they can be slow and computationally expensive. Second, they can be difficult to generalize to new datasets and tasks. Third, they can be sensitive to noise and other artefacts in video sequences.

What is TransVOD?

TransVOD is a new end-to-end video object detection framework based on Transformer architecture. Transformers are a type of neural network that has been shown to be very effective for various natural language processing tasks. TransVOD adapts the Transformer architecture to video object detection by using a spatial-temporal transformer to learn both spatial and temporal dependencies between objects in a video clip.

As the authors describe it: "TransVOD [is] a novel end-to-end video object detection framework based on a spatial-temporal Transformer architecture. Our TransVOD views video object detection as an end-to-end sequence decoding/prediction problem. For the current frame, it takes multiple frames as inputs and directly outputs the current frame detection results via a Transformer-like architecture. In particular, we design a novel temporal Transformer to link each object query and outputs of memory encodings simultaneously." [1]

How does TransVOD work? (simply explained)

TransVOD first extracts feature maps from every frame of a video clip using a shared CNN backbone. These feature maps are passed to the spatial-temporal transformer, which learns both spatial and temporal dependencies between objects in the clip. A set of learned object queries attends to the transformer's output and converges to the positions and sizes of the objects. Finally, the detection head predicts a class, bounding box, and confidence score for each object in the current frame.
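To make the data flow concrete, here is a toy, shape-level sketch of that pipeline in plain numpy. Every module is mocked with random tensors; the names, dimensions, and the simple mean-pooling step are my own illustrative choices, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, HW, D, Q = 4, 16, 32, 8   # frames, feature tokens per frame, feature dim, object queries

def spatial_transformer(feat):
    """Stand-in for the per-frame Spatial Transformer:
    returns (object queries, feature memory) for one frame."""
    memory = feat                           # encoder output, shape (HW, D)
    queries = rng.standard_normal((Q, D))   # decoded object queries, shape (Q, D)
    return queries, memory

# 1. Per-frame features from a shared CNN backbone (mocked as random tensors).
frames = [rng.standard_normal((HW, D)) for _ in range(T)]

# 2. Spatial Transformer per frame: object queries + compact feature memory.
queries, memories = zip(*(spatial_transformer(f) for f in frames))

# 3. TDTE: link the per-frame memories along the temporal dimension.
temporal_memory = np.concatenate(memories, axis=0)    # (T*HW, D)

# 4. TQE: aggregate per-frame queries into one temporal query set (mean here).
temporal_query = np.mean(np.stack(queries), axis=0)   # (Q, D)

# 5. TDTD + detection head: one box (4 values) + class score (1) per query.
W_head = rng.standard_normal((D, 5))
outputs = temporal_query @ W_head                     # (Q, 5)

print(outputs.shape)  # (8, 5)
```

The point of the sketch is only the shapes: multiple frames go in, one set of per-query detections for the current frame comes out.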

Digging deeper into the TransVOD architecture

The architecture of TransVOD consists of four main components:

  1. Spatial Transformer for single-frame object detection, extracting both object queries and a compact feature representation
  2. Temporal Deformable Transformer Encoder (TDTE) to fuse memory outputs from the Spatial Transformers
  3. Temporal Query Encoder (TQE) to link objects in each frame along the temporal dimension
  4. Temporal Deformable Transformer Decoder (TDTD) to obtain the final outputs for the current frame
Figure: The whole pipeline of TransVOD (from [1]). A shared CNN backbone extracts features of multiple frames. Next, a series of shared Spatial Transformer Encoders (STE) produce the feature memories, and these memories are linked and fed into the Temporal Deformable Transformer Encoder (TDTE). Meanwhile, the Spatial Transformer Decoder (STD) decodes the spatial object queries. A Temporal Query Encoder (TQE) then models the relations between different queries and aggregates them, enhancing the object query of the current frame. Both the temporal object query and the temporal feature memories are fed into the Temporal Deformable Transformer Decoder (TDTD) to learn the temporal contexts across different frames. The whole framework can be trained in a fully end-to-end manner.
  1. Spatial Transformer for single-frame object detection, extracting both object queries and a compact feature representation

The spatial transformer is a module that learns to transform the spatial features of an image. This is useful for object detection because it allows the model to learn to focus on the most important parts of the image and to ignore irrelevant background information.

The spatial transformer in TransVOD is implemented using a series of self-attention layers. Self-attention is a mechanism that allows the model to learn long-range dependencies in the image. This is important for object detection because it allows the model to learn relationships between objects that are far apart in the image.
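Self-attention, as used above, can be sketched in a few lines. This is a minimal single-head version in plain numpy for intuition only; the actual model uses multi-head (and, in the deformable variants, sparse) attention, and the weights here are random stand-ins.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention: every token attends to every other token.
    x: (n_tokens, d) image feature tokens."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over tokens
    return weights @ v                                # (n, d) mixed features

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal((10, d))                      # 10 image tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (10, 16)
```

Because the attention weights couple every pair of tokens, features at one image location can directly incorporate information from distant locations, which is exactly the long-range dependency property described above.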

The spatial transformer in TransVOD also extracts object queries and a compact feature representation (the memory for each frame). The object queries represent candidate objects in the image, while the compact feature representation is a compressed version of the image features that stores information about those objects.

2. Temporal Deformable Transformer Encoder (TDTE) to fuse memory outputs from Spatial Transformers:

The temporal deformable transformer encoder (TDTE) is a module that fuses the memory outputs from the spatial transformers for all frames in a video clip. The TDTE uses deformable attention, meaning each position attends to a small, learned set of sampling locations rather than to every position. This keeps temporal fusion efficient while still letting the model relate objects that are far apart in time.
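The "fusing" step can be pictured as flattening the per-frame memories into one temporal sequence that tokens then attend over. The numpy sketch below is a toy illustration of that idea with plain dense attention for a single token; the real TDTE uses multi-head temporal deformable attention, and all dimensions here are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
T, HW, D = 4, 16, 32                          # frames, tokens per frame, feature dim
memories = rng.standard_normal((T, HW, D))    # per-frame spatial-transformer memories

# Link the memories along time into one temporal sequence.
fused = memories.reshape(T * HW, D)           # (T*HW, D)

# One token from the current frame attends over the fused temporal memory.
token = memories[-1, 0]                       # a single current-frame token, (D,)
scores = fused @ token / np.sqrt(D)           # (T*HW,) similarity to every token
weights = np.exp(scores - scores.max())
weights /= weights.sum()                      # softmax over all frames' tokens
context = weights @ fused                     # (D,) temporal context vector
print(context.shape)  # (32,)
```

The resulting context vector mixes evidence from all frames, which is what lets a blurry or occluded object in the current frame borrow features from frames where it is clearly visible.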

3. Temporal Query Encoder (TQE) to link objects in each frame along the temporal dimension:

The temporal query encoder (TQE) is a module that links the object queries in each frame along the temporal dimension. This allows the model to track objects over time.
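The linking idea can be sketched as one attention step in which the current frame's queries attend to the queries of every frame and absorb their information. This numpy toy is my own simplification; the actual TQE is a Transformer encoder over the query sets, and the names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T, Q, D = 4, 8, 32                              # frames, queries per frame, dim
queries = rng.standard_normal((T, Q, D))        # per-frame object queries

current = queries[-1]                           # (Q, D) current-frame queries
all_q = queries.reshape(T * Q, D)               # queries from every frame

# Current-frame queries attend over all frames' queries (softmax rows).
scores = current @ all_q.T / np.sqrt(D)         # (Q, T*Q)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
enhanced = current + w @ all_q                  # temporally enhanced queries
print(enhanced.shape)  # (8, 32)
```

Because a query tends to attend most strongly to queries representing the same object in other frames, this aggregation is what implicitly tracks each object across time.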

4. Temporal Deformable Transformer Decoder (TDTD) to obtain final outputs for the current frame:

The temporal deformable transformer decoder (TDTD) is a module that decodes the fused temporal features from the TDTE and outputs the final detection results for the current frame.

Overall, the four components of TransVOD work together to detect objects in a video clip. The spatial transformer extracts object queries and a compact feature representation (the memory for each frame). The temporal deformable transformer encoder fuses the memory outputs from the spatial transformers across all frames. The temporal query encoder links the object queries in each frame along the temporal dimension. The temporal deformable transformer decoder decodes the fused temporal features and outputs the final detection results for the current frame.

Key findings of the TransVOD paper:

  1. TransVOD achieves state-of-the-art performance on the ImageNet VID benchmark, outperforming all previous methods, including those that use recurrent neural networks and relation networks.
  2. TransVOD is able to track objects over time with high accuracy.
  3. TransVOD is a simple, effective, and scalable approach to video object detection.
  4. TransVOD has the potential to be used in a wide range of applications, such as autonomous driving, video surveillance, and medical imaging.

Overall, TransVOD is a promising new approach to video object detection: simple, effective, and scalable, with potential applications ranging from autonomous driving to video surveillance and medical imaging.

Results of TransVOD

Comparison with state-of-the-art methods on ImageNet VID with a ResNet-50 backbone (table in [1])

TransVOD outperforms the state-of-the-art methods by a large margin. In particular, it achieves 79.9% with a ResNet-50 backbone, a 1.3%–2.6% absolute improvement over the best competitor, MEGA [1].

In summary, TransVOD is an end-to-end video object detection framework built on the Transformer architecture. It achieves state-of-the-art performance on the ImageNet VID benchmark, the standard dataset for video object detection, tracks objects over time with high accuracy, and can handle videos of different resolutions and frame rates.

References

[1] TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers — https://arxiv.org/pdf/2201.05047.pdf

[2] Official TransVOD implementation — https://github.com/SJTU-LuHe/TransVOD
