The Surveillance phenomenon you must know about: Multi Object Tracking

Published in VisionWizard, Apr 26, 2020

Before we dive deep into the world of object tracking, it helps to understand the ‘whys’ and ‘whats’ of the surveillance world. Countries across the globe have millions of cameras in place but hardly anyone to monitor them. The ratio of human operators to cameras is very low.

Over the last decade, intelligent software has been taking over surveillance, driven by the spread of deep learning techniques in this space. Complex problems such as people tracking, traffic density estimation, and theft prevention are being tackled by researchers across the globe, with promising results.

This article will introduce you to one of the most critical topics in intelligent surveillance — Object Tracking.

Introduction

Object tracking means estimating the state of the target object present in the scene from previous information.
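To make "estimating the state from previous information" a little more concrete, here is a minimal sketch. It assumes the state is just the box centre plus a per-frame velocity, and it uses a simple fixed-gain (alpha-beta style) correction. The names `State`, `predict` and `update`, and the gain values, are made up for illustration and do not come from any particular tracker.

```python
from dataclasses import dataclass


@dataclass
class State:
    """Hypothetical target state: box centre (x, y) and velocity in pixels per frame."""
    x: float
    y: float
    vx: float
    vy: float


def predict(state: State, dt: float = 1.0) -> State:
    """Constant-velocity assumption: shift the centre by velocity * dt."""
    return State(state.x + state.vx * dt, state.y + state.vy * dt, state.vx, state.vy)


def update(predicted: State, measured_xy, alpha: float = 0.5, beta: float = 0.5,
           dt: float = 1.0) -> State:
    """Correct the prediction with a new measured centre using fixed gains."""
    mx, my = measured_xy
    rx, ry = mx - predicted.x, my - predicted.y  # residual: measurement minus prediction
    return State(
        predicted.x + alpha * rx,        # pull the position towards the measurement
        predicted.y + alpha * ry,
        predicted.vx + beta * rx / dt,   # adjust the velocity by the same residual
        predicted.vy + beta * ry / dt,
    )


previous = State(x=10.0, y=20.0, vx=2.0, vy=-1.0)
guess = predict(previous)                  # where we expect the target in the next frame
refined = update(guess, (13.0, 19.0))      # corrected once a new observation arrives
```

Real trackers typically replace the fixed gains with a Kalman filter or a learned motion model, but the predict-then-correct loop is the same idea.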

At a high level, there are two main categories of object tracking.

Single Object Tracking (SOT)
Multiple Object Tracking (MOT)

Object tracking is not limited to 2D sequence data and can be applied to 3D domains as well.

In this article, let’s take an in-depth look at one sub-domain of object tracking: Multiple Object Tracking (MOT) in a 2D video sequence using deep learning.

The strength of Deep Neural Networks (DNN) resides in their ability to learn rich representations and to extract complex and abstract features from their input.[1]

Multiple Object Tracking (MOT), also called Multi-Target Tracking (MTT), is a computer vision task that aims to analyse videos to identify and track objects belonging to one or more categories, such as pedestrians, cars, animals and inanimate objects, without any prior knowledge about the appearance and number of targets.[1]

While in Single Object Tracking (SOT) the appearance of the target is known a priori, in MOT a detection step is necessary to identify the targets that can leave or enter the scene. The main difficulty in tracking multiple targets simultaneously stems from the various occlusions and interactions between objects that can sometimes also have a similar appearance. Thus, merely applying SOT models directly to solve MOT leads to poor results, often incurring in target drift and numerous ID switch errors, as such models usually struggle in distinguishing between similar-looking intra-class objects.[1]

In recent years, due to the exponential rise in the research of deep learning methods, there have been tremendous gains in accuracy and performance of the detection and tracking approaches.

Most state-of-the-art tracking approaches follow the ‘Tracking by Detection’ scheme: they first detect the objects in each frame and then associate those detections across frames, linking each object to its position in the next frame to build up its track (short track segments are often called tracklets).
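The association step can be sketched in a few lines. The example below is only an illustration, not the method of any specific paper: it assumes boxes are `(x1, y1, x2, y2)` tuples and matches existing tracks to new detections greedily by intersection over union (IoU). The function names and the `iou_threshold` value are assumptions chosen for this sketch.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_threshold=0.3):
    """Greedily match track boxes to detection boxes, best IoU first.

    tracks: dict mapping track_id -> last known box
    detections: list of boxes produced by the detector for the current frame
    Returns (matches, unmatched), where matches maps track_id -> detection index
    and unmatched lists the detection indices that could start new tracks.
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        key=lambda p: p[0],
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, j in pairs:
        if score < iou_threshold:
            break  # every remaining pair overlaps even less
        if tid in matches or j in used:
            continue  # this track or detection is already taken
        matches[tid] = j
        used.add(j)
    unmatched = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [(21, 20, 31, 30), (1, 0, 11, 10), (50, 50, 60, 60)]
matches, unmatched = associate(tracks, detections)
```

Greedy matching is the simplest choice. Many trackers solve the same assignment with the Hungarian algorithm instead, and add appearance features on top of box overlap.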

Today’s detectors perform exceptionally well and adapt to a wide range of scenes. Consequently, detector output has become the standard input to tracking algorithms.

There are other approaches, such as Lucas-Kanade optical flow and SORT-based tracking, which build on traditional computer vision methods and are valuable in their own right.

Challenges

While solving the object tracking problem, a number of issues arise that can lead to poor results. Algorithms have tried to tackle these issues over the years, but we have not yet arrived at a foolproof solution, which keeps this an open area of research.

Variations due to geometric changes, e.g., pose, articulation, and scale of objects.
Differences due to photometric factors, e.g., illumination and appearance.
Non-linear motion.
Limited resolution, e.g., video captured from a low-end phone.
Similar objects in the scene, e.g., same colour of clothes, accessories, etc.
Highly crowded scenarios such as streets, concerts, stadiums, and markets.
Track initiation and termination. Before starting a tracking algorithm, you need prior information about the object you want to track, and initialising the algorithm with a target object is not always possible.
Tracks can get merged or switched due to sudden changes in movement, a sharp change in camera quality, etc.
IDs of target objects can get switched due to similar characteristics such as clothes, face structure, glasses, skin complexion, height, etc.
Drift due to a wrong update of the target model. One wrong update can lead to continued updates in the wrong direction, so the tracker loses the correct target for the rest of the video.
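Track initiation and termination, from the list above, is often handled with simple counters. The sketch below is one hypothetical way to do it, not a reference implementation: a new track stays tentative until it has been matched in `min_hits` frames, and it is dropped after more than `max_misses` consecutive frames without a match. The class names and both thresholds are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Track:
    track_id: int
    hits: int = 1     # frames in which a detection was assigned to this track
    misses: int = 0   # consecutive frames without an assigned detection


class TrackManager:
    """Hypothetical bookkeeping for track initiation and termination."""

    def __init__(self, min_hits=3, max_misses=5):
        self.min_hits = min_hits
        self.max_misses = max_misses
        self.tracks = {}
        self._next_id = 1

    def step(self, matched_ids, num_unmatched_detections):
        """Update the counters for one frame and return the confirmed track IDs."""
        for tid, track in list(self.tracks.items()):
            if tid in matched_ids:
                track.hits += 1
                track.misses = 0
            else:
                track.misses += 1
                if track.misses > self.max_misses:
                    del self.tracks[tid]  # termination: the target is presumed gone
        for _ in range(num_unmatched_detections):
            # initiation: every unmatched detection starts a tentative track
            self.tracks[self._next_id] = Track(self._next_id)
            self._next_id += 1
        return [tid for tid, t in self.tracks.items() if t.hits >= self.min_hits]


manager = TrackManager(min_hits=2, max_misses=1)
first = manager.step(set(), 1)    # one new detection: tentative track 1 is created
second = manager.step({1}, 0)     # matched again: track 1 becomes confirmed
third = manager.step(set(), 0)    # one miss is tolerated
fourth = manager.step(set(), 0)   # second consecutive miss: track 1 is removed
```

Requiring a few hits before confirming a track filters out spurious detections, while tolerating a few misses lets a track survive short occlusions. Both thresholds trade latency against robustness and usually need tuning per scene.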

Literature Survey

Now that you can appreciate the magnitude of the problem, let us dive into some of the most exciting research work in the object tracking domain.

The criteria for handpicking leading research methods in the object tracking domain are based on five key factors:

Leading conferences (e.g., CVPR, NeurIPS, ICCV, ICML, ECCV)
Benchmark results (e.g., MOT, KITTI, VOT, CVPR19 challenge)
Publicly available code (by the author or a third party) supporting the results given in the paper
Citations
Novel idea

There is a plethora of exciting research work. Still, if the authors can’t provide code to reproduce the results in the paper (which can happen for several reasons), we have to take the reported results with a grain of salt.

The papers listed here relate to 2D MOT, yet some of their ideas can be extended to 3D as well.

Interesting research papers

Tracking without bells and whistles. [Paper] [Github]
Extending IOU Based Multi-Object Tracking by Visual Information. [Paper] [Github]
Tracking Objects as Points. [Paper] [Github]
Fast Visual Object Tracking with Rotated Bounding Boxes. [Paper] [Github, not very reliable]
ODESA: Object Descriptor that is Smooth Appearance-wise for object tracking tasks. (Not yet released; No. 1 on the CVPR MOT 2019 challenge)
Online Multiple Pedestrian Tracking using Deep Temporal Appearance Matching Association. [Paper] [Github: NA]

These are just a few handpicked out of the collection of highly regarded research papers out there. We will try to demystify some of these papers in our upcoming articles. Stay tuned.

If you have made it this far, then I believe you are part of an elite group with enough understanding to get started on the captivating problem of multi-object tracking.

References

[1] Deep Learning in Video Multi-Object Tracking: A Survey.

[2] Alexandre Alahi. Lecture 5: Visual Tracking. Stanford Vision Lab.

[3] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. Journal on Image and Video Processing, 2008:1, 2008.

[4] Rainer Stiefelhagen and John Garofolo. Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities and Relationships, CLEAR 2006, Southampton, UK, April 6–7, 2006, Revised Selected Papers, volume 4122. Springer, 2007.

[5] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. European Conference on Computer Vision, pages 17–35. Springer, 2016.

[6] https://motchallenge.net/

[7] https://neurohive.io/en/datasets/new-datasets-for-object-tracking/