Published in

Geek Culture

Video-level Computer Vision Advances Business Insights

From Spatial to Spatio-temporal Visual Processing

Image By mnbb on Getty Images

Instance-level classification, segmentation, and object detection in images are fundamental problems in computer vision. Unlike image-level information retrieval, video-level problems aim to detect, segment, and track object instances in the spatio-temporal domain, which has both space and time dimensions.

Video domain learning is crucial for spatio-temporal understanding in camera and drone-based systems, with applications in video editing, autonomous driving, pedestrian tracking, augmented reality, robot vision, and many more. It also helps us turn raw spatio-temporal data into meaningful insights, because video carries richer content than purely spatial visual data. Adding the temporal dimension to our decoding process gives us further information about

  • Motion
  • Viewpoint variations
  • Illumination changes
  • Occlusions
  • Deformations
  • Local ambiguities

from the video frames. As a result, video-level information retrieval has gained popularity as a research area and continues to attract researchers working on video understanding.
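To make the role of the temporal dimension concrete, here is a minimal sketch that treats a video as an array with a time axis and measures motion by differencing consecutive frames. The toy video, its size, and the variable names are assumptions made for illustration; real systems use far richer motion cues.

```python
import numpy as np

# Toy grayscale video: T frames of H x W pixels (values are illustrative only).
T, H, W = 4, 6, 6
video = np.zeros((T, H, W), dtype=np.float32)

# A 2x2 bright square that moves one pixel to the right in every frame.
for t in range(T):
    video[t, 2:4, t:t + 2] = 1.0

# Temporal difference between consecutive frames: non-zero where something changed.
frame_diff = np.abs(video[1:] - video[:-1])      # shape (T - 1, H, W)
motion_energy = frame_diff.sum(axis=(1, 2))      # one number per frame pair

print(video.shape)        # (4, 6, 6)
print(motion_energy)      # [4. 4. 4.]
```

A single frame of this video cannot tell us that the square is moving; the signal only appears once frames are compared along the time axis.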

Conceptually, video-level information retrieval algorithms are mostly adapted from image-level methods by adding heads that capture temporal information. Aside from the simpler video-level classification and regression tasks, the most common problems are video object detection, video object tracking, video object segmentation, and video instance segmentation. I briefly describe these problems below.

Gif on Gfycat

To start with, let’s recall the image-level instance segmentation problem.

Image-level Instance Segmentation

Instance segmentation not only groups pixels into different semantic classes but also groups them into different object instances [1]. A two-stage paradigm is usually adopted: it first generates object proposals using a Region Proposal Network (RPN) and then predicts object bounding boxes and masks from aggregated RoI features [1]. Unlike semantic segmentation, which only separates semantic classes, instance segmentation also separates the individual instances of each class. The figure below shows the difference.

Left Figure: Semantic Segmentation, Right Figure: Instance Segmentation (Image by Vinorth Varatharasan/ResearchGate [6])
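The difference between the two outputs can also be sketched with label maps. The tiny arrays and the class assignment below are made up for illustration; they are not taken from the figure.

```python
import numpy as np

# Instance map: 0 is background, every other integer is one object instance.
instance_map = np.array([
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
    [3, 3, 0, 0, 0, 0],
])

# Hypothetical class assignment per instance: two people and one car.
instance_to_class = {1: "person", 2: "person", 3: "car"}
class_ids = {"background": 0, "person": 1, "car": 2}

# Semantic map: replace each instance id with the id of its class.
semantic_map = np.zeros_like(instance_map)
for inst_id, cls in instance_to_class.items():
    semantic_map[instance_map == inst_id] = class_ids[cls]

num_instances = len(np.unique(instance_map)) - 1   # exclude background
num_classes = len(np.unique(semantic_map)) - 1     # exclude background

print(num_instances)  # 3
print(num_classes)    # 2
```

Going from the instance map to the semantic map loses information: the two people collapse into a single "person" label, which is exactly what instance segmentation avoids.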

Video Classification

The video classification task is a direct adaptation of image classification to the video domain. Instead of single images, the model receives video frames as input. These frames form temporally correlated sequences, and the learning algorithm combines spatial and temporal visual features to produce classification scores.

Image by anubhavmaity/GitHub
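As a rough sketch of that idea, the snippet below pools per-frame features over time and feeds the result to a linear classification head. Everything here is an assumption for illustration: the features are random stand-ins for the output of an image backbone, the weights are untrained, and average pooling is only one simple way to combine frames.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, NUM_CLASSES = 8, 16, 3   # frames, feature size, classes (all illustrative)

# Stand-in for per-frame features from an image backbone (random here).
frame_features = rng.normal(size=(T, D))

# Temporal pooling: average the per-frame features into one clip descriptor.
clip_feature = frame_features.mean(axis=0)   # shape (D,)

# Linear classification head with random, untrained weights.
W = rng.normal(size=(D, NUM_CLASSES))
b = np.zeros(NUM_CLASSES)
logits = clip_feature @ W + b

# Softmax turns logits into classification scores that sum to 1.
scores = np.exp(logits - logits.max())
scores /= scores.sum()

print(scores.shape)   # (3,)
```

Note that average pooling ignores the order of the frames. Models that need to reason about ordering typically replace it with a temporal component such as 3D convolutions, recurrent layers, or attention over time.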

Video Captioning

Video captioning is the task of generating captions for a video by understanding the actions and events in it, which in turn helps retrieve the video efficiently through text [5]. In other words, given the video frames, we generate natural language that describes the concept and context of the video.

Video Captioning Example (Image By Author)

Video Object Detection (VOD)

Video object detection aims to detect objects in videos and was first proposed as part of the ImageNet visual challenge [2]. Even though associating detections and assigning identities improves detection quality, this challenge is limited to spatial, per-frame evaluation metrics and does not require joint object detection and tracking [3]. In other words, unlike the other video-level tasks described below, it involves no joint detection, segmentation, and tracking.

Video Object Detection (Gif by Author)

Video Object Tracking (VOT)

Video object tracking approaches are generally divided into detection-based and detection-free tracking. In detection-based tracking algorithms, objects are jointly detected and tracked so that the tracking part improves the detection quality, whereas in detection-free approaches we are given an initial bounding box and try to track that object across video frames [3, 4].

Video Object Tracking (Image By Zhongdao/GitHub)
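To illustrate how detections can be linked into tracks, here is a minimal greedy tracker that matches each box to the best-overlapping box from the previous frame. It is a simplified sketch and not the method of the cited works: the `iou` and `track` helpers, the threshold of 0.3, and the toy boxes are all assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def track(frames, iou_threshold=0.3):
    """Assign a track id to every detection, frame by frame.

    frames: list of per-frame lists of boxes.
    Returns a list of per-frame lists of (track_id, box).
    """
    next_id = 0
    previous = []          # (track_id, box) pairs from the previous frame
    output = []
    for boxes in frames:
        current = []
        used = set()       # track ids already matched in this frame
        for box in boxes:
            # Pick the best-overlapping, still unmatched track from the previous frame.
            best_id, best_iou = None, iou_threshold
            for track_id, prev_box in previous:
                overlap = iou(box, prev_box)
                if track_id not in used and overlap >= best_iou:
                    best_id, best_iou = track_id, overlap
            if best_id is None:      # no match: start a new track
                best_id = next_id
                next_id += 1
            used.add(best_id)
            current.append((best_id, box))
        output.append(current)
        previous = current
    return output


frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 0, 11, 10), (51, 50, 61, 60)],
    [(52, 50, 62, 60), (2, 0, 12, 10)],   # detections arrive in a different order
]
for t, dets in enumerate(track(frames)):
    print(t, [track_id for track_id, _ in dets])
# 0 [0, 1]
# 1 [0, 1]
# 2 [1, 0]
```

Each object keeps its identity even when the detector returns boxes in a different order. This sketch only looks one frame back, so an object that is missed in a single frame receives a new id when it reappears; practical trackers add motion models or appearance features to handle that.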

Video Instance Segmentation (VIS)

Video instance segmentation is a recently introduced computer vision task that aims at joint detection, segmentation, and tracking of instances in the video domain. Because the task is supervised, it requires high-quality human annotations of bounding boxes and binary segmentation masks with predefined categories. Because it requires both segmentation and tracking, it is more challenging than image-level instance segmentation. Hence, in contrast to the previous fundamental computer vision tasks, video instance segmentation calls for multi-disciplinary, aggregated approaches. VIS is like a contemporary all-in-one computer vision task that combines several general vision problems.

Video Instance Segmentation Prediction (Image By Author)
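One way to picture a VIS prediction is as a category, a confidence score, and one mask per frame for each tracked instance. The sketch below uses that layout and compares a predicted mask sequence with a ground-truth one by accumulating intersection and union over all frames. The dictionary fields, the `video_mask_iou` and `square` helpers, and the toy masks are assumptions for illustration, not a specific benchmark's API.

```python
import numpy as np


def video_mask_iou(pred_masks, gt_masks):
    """Spatio-temporal IoU between two mask sequences.

    Each argument is a list with one entry per frame: a boolean H x W array,
    or None when the instance is not visible in that frame.
    """
    intersection, union = 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        if pred is None and gt is None:
            continue
        if pred is None:
            pred = np.zeros_like(gt)
        if gt is None:
            gt = np.zeros_like(pred)
        intersection += np.logical_and(pred, gt).sum()
        union += np.logical_or(pred, gt).sum()
    return float(intersection / union) if union > 0 else 0.0


def square(top, left, size=2, shape=(4, 4)):
    """Helper that builds a boolean mask with one filled square."""
    mask = np.zeros(shape, dtype=bool)
    mask[top:top + size, left:left + size] = True
    return mask


# One predicted instance: a category, a confidence score, and one mask per frame.
prediction = {
    "category": "person",   # hypothetical label
    "score": 0.9,           # hypothetical confidence
    "masks": [square(0, 0), square(0, 1), square(0, 2), None],
}
ground_truth_masks = [square(0, 0), square(0, 1), square(0, 2), square(0, 2)]

print(video_mask_iou(prediction["masks"], ground_truth_masks))  # 0.75
```

The prediction is pixel-perfect in the first three frames but loses the object in the last one, so the score drops to 0.75. This is why VIS is harder than per-frame segmentation: a model has to keep both the mask and the identity correct over time.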

Knowledge Brings Value: Video-level Information Retrieval in Action

Knowing the technical boundaries of video-level information retrieval tasks improves your understanding of business concerns and customer needs from a practical perspective. For example, when a client says, “we have videos and want to extract only the locations of pedestrians from the videos,” you’ll recognize that your task is video object detection. What if they want to both localize and track the pedestrians? Then your problem becomes video object tracking. Let’s say that they also want to segment them across frames. Your task is now video instance segmentation. However, if a client says that they want to generate automatic captions for videos, then from a technical point of view your problem can be formulated as video captioning. Understanding the scope of the project and drawing up the technical requirements depends on the kind of insights clients want to derive, and it is crucial for technical teams to formulate the problem as an optimization problem.
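The mapping from client request to task can be written down as a small decision function. This is only a sketch of the reasoning above: the `pick_task` name and its flags are hypothetical, and the fallback to video classification when nothing else is requested is my assumption.

```python
def pick_task(localize=False, track=False, segment=False, caption=False):
    """Map the outputs a client asks for to the task names used in this article."""
    if caption:
        return "video captioning"
    if segment:
        return "video instance segmentation"
    if track:
        return "video object tracking"
    if localize:
        return "video object detection"
    # Assumption: with no localization, tracking, masks, or text requested,
    # a clip-level label is the simplest remaining formulation.
    return "video classification"


print(pick_task(localize=True))                            # video object detection
print(pick_task(localize=True, track=True))                # video object tracking
print(pick_task(localize=True, track=True, segment=True))  # video instance segmentation
print(pick_task(caption=True))                             # video captioning
```

The checks run from the most demanding output to the least demanding one, mirroring how each additional client requirement moves the problem to a richer task.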

That’s the end of the article. If you have any issues, let me know.


[1] G. Bertasius, L. Torresani, and J. Shi. Object detection in video with spatiotemporal sampling networks, 2018.

[2] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, C. C. Loy, and D. Lin. Hybrid task cascade for instance segmentation, 2019.

[3] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks, 2017.

[4] X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia, and C. Shen. Twins: Revisiting the design of spatial attention in vision transformers, 2021.

[5] A. Singh, T. D. Singh, and S. Bandyopadhyay. NITS-VC system for VATEX video captioning challenge 2020. arXiv preprint arXiv:2006.04058, 2020.

[6] V. Varatharasan, H.-S. Shin, A. Tsourdos, and N. Colosimo. Improving learning effectiveness for object detection and classification in cluttered backgrounds, pp. 78–85, 2019. doi:10.1109/REDUAS47371.2019.8999695.


