A tour of Video Object Tracking — Part I: Presentation

Sep 22, 2019

This article is the first of a series about Video Object Tracking, which I wrote during my internship at Wintics, with the great help of Levi Viana (CTO at Wintics) and Emeline Fay (Data scientist at Wintics).

- Part I: Presentation of Video Object Tracking
- Part II: Single Object Tracking
- Part III: Multiple Object Tracking

Object classification and detection have been widely reviewed, but it is easy to get lost in the Video Object Tracking literature. This series aims to provide an overview of tracking for anyone who wants a quick start in the field.

In this part, I will introduce you to the world of tracking. The next parts are more technical and assume basic knowledge of computer vision.

Structure of this article

1. What is Video Object Tracking?
2. How powerful are today’s trackers?
3. Good articles to read
4. Conclusion

1. What is Video Object Tracking?

In Video Object Tracking, the aim is to locate one or multiple objects of interest (the “targets”) in each frame of a video. We usually locate a target by drawing the smallest possible rectangle (the “bounding box”) that contains it. Video Object Tracking has a wide range of applications, for example autonomous driving, surveillance, human-computer interaction and sports analytics.
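To make the notion of a bounding box concrete, here is a minimal sketch that computes the tightest axis-aligned rectangle around a set of pixel coordinates. The function name and the sample pixels are made up for illustration; real trackers usually predict the box directly rather than deriving it from labelled pixels.

```python
def bounding_box(points):
    """Return the tightest axis-aligned box (x_min, y_min, x_max, y_max) around (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical pixels belonging to one target in one frame.
target_pixels = [(12, 40), (15, 38), (20, 45), (14, 50)]
box = bounding_box(target_pixels)  # (12, 38, 20, 50)
```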

Fig 0: An example of how tracking is used at Wintics

How is tracking different from detection?

Tracking and detection are closely related. Detection consists of locating one or several objects in a given image, whereas tracking aims to locate these objects throughout a whole video while keeping track of which object is which from frame to frame. To track an object, you first need to give the tracking algorithm an image of that object. This is done either by a detection algorithm (detection-based trackers) or manually (detection-free trackers).

A naive way to perform tracking is to apply a detection algorithm to each frame of a video, but there are several reasons why tracking is necessary or useful:
- Tracking maintains object identities from one frame to the next, which independent per-frame detections do not.
- Detection is computationally expensive, so running a detector on every frame can be costly.
- Detection-free trackers make it possible to track objects for which no detector has been trained.
- Tracking can help with common challenges such as changes of illumination, motion blur, changes of scale, occlusions (when the target is partially or completely hidden by another object for part of the video) and poor image quality.

Two main approaches for tracking: Single Object Tracking (SOT) and Multiple Object Tracking (MOT)

In SOT, the bounding box of the target in the first frame is given to the tracker. The goal of the tracker is then to locate the same target in all the other frames. Single Object Trackers are considered detection-free trackers because the first bounding box is given. They should be able to track any object without having been trained on it.
Siamese network-based trackers and Correlation Filter-based trackers are the top performers for short-term tasks (i.e. without complete occlusion). If you want to know more, check out the next part of the series about Single Object Tracking.
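The SOT setting can be sketched with a deliberately simple tracker. The example below is neither a Siamese nor a Correlation Filter tracker: it is a plain template-matching baseline (exhaustive sum-of-squared-differences search around the previous position), run on a synthetic video generated for illustration. All names and parameters, such as `track_template` and `search_radius`, are assumptions made for this sketch.

```python
import numpy as np


def track_template(frame, template, prev_xy, search_radius=8):
    """Search near the previous top-left corner (x, y) for the patch closest to the template.

    Uses the sum of squared differences as the matching cost and returns the best (x, y).
    """
    th, tw = template.shape
    fh, fw = frame.shape
    px, py = prev_xy
    best_xy, best_cost = prev_xy, np.inf
    for y in range(max(0, py - search_radius), min(fh - th, py + search_radius) + 1):
        for x in range(max(0, px - search_radius), min(fw - tw, px + search_radius) + 1):
            patch = frame[y:y + th, x:x + tw]
            cost = np.sum((patch - template) ** 2)
            if cost < best_cost:
                best_xy, best_cost = (x, y), cost
    return best_xy


# Synthetic video: a textured 5x5 target drifting over a dark background.
rng = np.random.default_rng(0)
target = rng.uniform(0.5, 1.0, size=(5, 5))


def make_frame(x, y, size=40):
    frame = np.zeros((size, size))
    frame[y:y + 5, x:x + 5] = target
    return frame


true_positions = [(10, 10), (12, 11), (15, 13), (19, 14)]
frames = [make_frame(x, y) for x, y in true_positions]

# The first bounding box is given: crop the template from it.
x0, y0 = true_positions[0]
template = frames[0][y0:y0 + 5, x0:x0 + 5]

# Follow the target through the remaining frames.
estimate = (x0, y0)
track = [estimate]
for frame in frames[1:]:
    estimate = track_template(frame, template, estimate)
    track.append(estimate)
```

On this clean synthetic sequence, `track` matches `true_positions`. The fixed template is exactly what breaks down in real videos when appearance, scale or illumination change, which is the problem the trackers mentioned above try to solve.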

In MOT, there are multiple objects to track. The tracking algorithm is expected first to determine the number of objects in each frame, and second to keep track of their identities.
MOT is a more difficult problem, and it is harder to single out one class of algorithms that outperforms the rest. If you want to know more about MOT, check out Part III of this series.
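To illustrate what "keeping track of identities" means in a detection-based setting, here is a minimal sketch that links the detections of consecutive frames by greedy IoU (intersection over union) matching. It is only one simple association strategy, not the method of any particular tracker, and the boxes, the `associate` helper and the `min_iou` threshold are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, next_id, min_iou=0.3):
    """Greedily match new detections to the previous frame's tracks by IoU.

    tracks: dict {track_id: box} from the previous frame.
    Unmatched detections start new tracks; unmatched tracks are dropped.
    Returns the updated dict and the next unused id.
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    updated, used_tracks, used_dets = {}, set(), set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if tid in used_tracks or j in used_dets:
            continue
        updated[tid] = detections[j]
        used_tracks.add(tid)
        used_dets.add(j)
    for j, det in enumerate(detections):
        if j not in used_dets:
            updated[next_id] = det
            next_id += 1
    return updated, next_id


# Hypothetical detector output for two consecutive frames.
frame1 = [(10, 10, 50, 90), (100, 20, 140, 100)]
frame2 = [(104, 22, 144, 102), (13, 11, 53, 91), (200, 30, 230, 90)]

tracks, next_id = associate({}, frame1, next_id=0)
tracks, next_id = associate(tracks, frame2, next_id)
```

After the second call, the two objects from the first frame keep ids 0 and 1 even though the detector lists them in a different order, and the newly appearing object gets id 2. This sketch forgets a target as soon as it is missed in one frame, which is why occlusions are hard for such simple schemes.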

2. How powerful are today’s trackers?

For object classification and detection tasks, huge progress has been made, and algorithms achieve performance comparable to that of humans. But while it is easy for a human to track an object, today’s trackers are still far from that goal, which requires capturing the spatial and temporal relations between targets.

To give you an idea, on MOTA, the summary metric of MOT, the best algorithms do not go beyond 0.6 (the best possible score is 1.0). Understanding this metric properly requires going into the details, but the score still means that there is room for improvement compared to the ideal case.
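For reference, MOTA (Multiple Object Tracking Accuracy) is commonly defined as 1 minus the ratio of errors (missed targets, false positives and identity switches) to the number of ground-truth objects, all summed over the frames. The sketch below computes it from such counts; the numbers in the example are invented for illustration and are not results from any benchmark.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with all counts summed over every frame."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


# Made-up counts: 400 errors for 1000 ground-truth boxes gives a MOTA of roughly 0.6.
score = mota(false_negatives=300, false_positives=80, id_switches=20, num_gt=1000)
```

Note that the score can be negative when the tracker makes more errors than there are ground-truth objects, so 0 is not the lowest possible value.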

Fig 2: The LSST tracker, ranked 1st (average MOTA 0.54) on the MOT2017 dataset. You can check out the results of other MOT trackers on the MOT challenge website.

Although the tracker in Fig 2 manages to track easy targets in the foreground, it often loses people when they move further into the background or are occluded. Moreover, this video is of good quality and the people in it are moving slowly.
In real-life applications, the quality of the video might be degraded by the camera or the weather (rain, fog, illumination), the targets or the camera might move faster, and the scene might be more crowded. Besides, industrial applications often require excellent performance, and it is usually not enough to roughly track some easy targets for half of their trajectory.

3. Good articles to read

My internship supervisor’s guidance on which authors and articles to read saved me a lot of time, so the following references might be helpful to you as well.

SOT articles:
- VOT Challenge 2018 report (2019 should be coming soon!)
- MOSSE tracker: pioneer of Correlation-Filter based trackers.
- SiamFC: pioneer of Siamese-Network based trackers (you can also check SiamRPN which came first in the VOT 2018 real-time challenge)

MOT articles:
- MOT Challenge
- Multiple Object Tracking: A Literature Review: as its name indicates, this article gives a review of MOT. It gives good basics on different techniques which have been employed in MOT.
- Tracking without bells and whistles: this article gives good insights on where the field of MOT is currently at.

If you want to keep up with the current research in the field, I suggest you look at the latest releases of the MOT and VOT challenges. You can also set up a Google Scholar alert for your preferred authors. Well-known authors in the field include Alan Lukezic, Junjie Yan (SenseTime), Anton Milan, Laura Leal-Taixé and Wei Xu (Horizon Robotics).

4. Conclusion

In the world of Video Object Tracking literature, there are two main approaches: Single Object Tracking and Multiple Object Tracking.

In Single Object Tracking, the current best trackers for short-term tasks perform rather well and are based on either Correlation Filter or Siamese Network methods. Long-term tracking is a more interesting problem for real-life applications, but the field is still too young to draw any conclusions. (see the next article about SOT)

In Multiple Object Tracking, it is hard to single out a particular class of algorithms that clearly outperforms the others, and the field is still growing. (see the next article about MOT)

In both cases, there is still room for improvement in order to meet the needs of the industry.

If you have any questions, remarks or feedback, feel free to comment below!