Object Tracking State of the Art 2022

Pedro Azevedo
9 min read · Jun 8, 2022


Learn about the current SOTA trackers: ByteTrack, DeepSORT, StrongSORT, and OC-SORT.

First, we introduce the key metrics used in the literature to compare different trackers; after that, we discuss the SOTA algorithms for object tracking.

Object detection tries to answer the question: “What objects are in this image and where are they?”

Object tracking answers the question: “Where are these objects going between frames?”

Image credit: https://viso.ai/deep-learning/object-tracking/

Multiple Object Tracking (MOT)

Multiple object tracking is defined as the problem of automatically identifying multiple objects in a video and representing them as a set of trajectories with high accuracy. (Vidushi Meel)

This series of articles is based on my master’s thesis, so all references are linked to the numbering used in the thesis. You can check it out at the link below.

To understand the algorithms discussed next, we first need to introduce some key metrics used in object tracking.

Key Metrics

MOTP — Multiple Object Tracking Precision

The Multiple Object Tracking Precision is the average dissimilarity between all true positives and their corresponding ground truth targets. MOTP thereby gives the average overlap between all correctly matched hypotheses and their respective objects and ranges between 50% and 100% [36].

MOTP = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t}

where c_t denotes the number of matches in frame t and d_{t,i} is the bounding box overlap of target i with its assigned ground-truth object.
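As a minimal sketch (variable names are my own), MOTP can be computed by accumulating the overlaps of all matches across frames:

```python
def motp(overlaps_per_frame):
    """MOTP from per-frame lists of IoU overlaps of matched pairs.

    overlaps_per_frame: list where entry t holds the overlap d_{t,i}
    of every match made in frame t.
    """
    total_overlap = sum(sum(frame) for frame in overlaps_per_frame)  # sum of d_{t,i}
    total_matches = sum(len(frame) for frame in overlaps_per_frame)  # sum of c_t
    return total_overlap / total_matches if total_matches else 0.0
```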

MOTA — Multiple Object Tracking Accuracy

The MOTA [51] is perhaps the most widely used metric to evaluate a tracker’s performance.

It combines three sources of errors:

MOTA = 1 - \frac{\sum_t \left( FN_t + FP_t + IDSW_t \right)}{\sum_t GT_t}

where t is the frame index, GT is the ground-truth object count, FN is the number of false negatives, FP is the number of false positives, and IDSW is the number of identity switches [36].
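A correspondingly minimal sketch for MOTA, again with my own variable names:

```python
def mota(fn, fp, idsw, gt):
    """MOTA from per-frame counts (lists indexed by frame t)."""
    errors = sum(fn) + sum(fp) + sum(idsw)  # total FN + FP + IDSW
    return 1.0 - errors / sum(gt)           # can go negative if errors exceed GT
```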

HOTA and IDF1

In addition to these two metrics, HOTA and IDF1 are also used. HOTA (Higher Order Tracking Accuracy) is a higher-order metric for evaluating multi-object tracking, composed of a family of sub-metrics, each requiring several equations to compute; to spare the reader unnecessary depth on this one metric, those equations are not shown here. HOTA scores typically align better with human visual evaluation of tracking performance [31]. IDF1 is the ratio of correctly identified detections over the average number of ground-truth and computed detections.
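For reference, IDF1 is commonly written in terms of the identity-level true positives (IDTP), false positives (IDFP), and false negatives (IDFN):

IDF1 = \frac{2\,IDTP}{2\,IDTP + IDFP + IDFN}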

I will now introduce some SOTA object trackers. The first three are the trackers included in the DeepStream SDK, NVIDIA’s development kit for rapid computer vision AI development and deployment on robotic and autonomous systems.

IOU Tracker

The Intersection-Over-Union (IOU) tracker uses the IOU values among the detector’s bounding boxes between two consecutive frames to perform the association between them, or assigns a new target ID if no match is found. This tracker includes logic to handle false positives and false negatives from the object detector. However, it can be considered the bare-minimum object tracker, and may serve only as a baseline against which to compare other trackers [42].
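A minimal sketch of this idea (my own simplified version; the actual tracker adds the false positive/negative handling mentioned above):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, iou_threshold=0.5):
    """Greedily match previous-frame tracks to current detections by IoU."""
    matches, unmatched = {}, list(detections)
    for track_id, prev_box in tracks.items():
        if not unmatched:
            break
        best = max(unmatched, key=lambda d: iou(prev_box, d))
        if iou(prev_box, best) >= iou_threshold:
            matches[track_id] = best  # identity carried over to the new frame
            unmatched.remove(best)
    # each remaining detection would start a new track with a fresh ID
    return matches, unmatched
```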

NvDCF Tracker

The NvDCF tracker is a visual tracker based on the discriminative correlation filter (DCF) [32], which learns target-specific correlation filters and uses them to localize the same target in subsequent frames. In a typical MOT implementation, correlation filter learning and localization are carried out on a per-object basis, creating a potentially large number of small CUDA kernel launches when processed on a GPU. This inherently poses challenges to maximizing GPU utilization, especially when a large number of objects from multiple video streams are to be tracked on a single GPU.

NvDCF addresses these issues: its GPU-accelerated operations are designed to execute in batch-processing mode to maximize GPU utilization despite the small CUDA kernels inherent to per-object tracking. Batch processing is applied across the entire tracking operation, including bounding box cropping and scaling, visual feature extraction, correlation filter learning, and localization. This can be viewed as similar to batched cuFFT (CUDA Fast Fourier Transform) or batched cuBLAS (CUDA Basic Linear Algebra Subroutine) calls, but differs in that the batched MOT execution model spans many operations at a higher level. The batch-processing capability extends from multi-object batching to the batching of multiple streams, for even greater efficiency and scalability.

This tracker is compatible with the NVIDIA development kit, which makes it easier to stack multiple detectors: taking the detections from a primary detector (PGIE) and feeding them to a secondary detector (SGIE). This will be discussed further in the section on NVIDIA hardware and software. Thanks to its visual tracking capability, the NvDCF tracker can localize and keep track of targets even when the PGIE detector misses them (i.e., false negatives), potentially for an extended period caused by partial or full occlusions, resulting in more robust tracking. This is called shadow tracking: a target keeps being tracked in the background for a period of time even when it is not associated with a detected object. This enhanced robustness allows the use of a higher maxShadowTrackingAge value (the maximum time an object is tracked in the background) for longer-term object tracking, and also allows the PGIE’s interval to be higher, at the cost of only a slight degradation in accuracy. In addition to the visual tracker module, the NvDCF tracker employs a Kalman filter-based state estimator to better estimate and predict the states of the targets [42].

NvDCF tracks each target by defining a search region around its predicted location in the next frame, large enough for the same object to be detected within that region.
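As a rough sketch of this idea (my own illustration, not NVIDIA’s implementation; the padding_scale parameter is hypothetical, loosely echoing DeepStream’s searchRegionPaddingScale setting), the search region can be thought of as the predicted box padded proportionally to the object’s size:

```python
import math

def search_region(x, y, w, h, padding_scale=1.0):
    """Pad a predicted bounding box (top-left x, y, width, height)
    into a larger search region. Hypothetical sketch: the padding
    grows with the box's geometric mean size.
    """
    pad = padding_scale * math.sqrt(w * h)  # padding proportional to object size
    return (x - pad / 2, y - pad / 2, w + pad, h + pad)
```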

This tracker is proposed by NVIDIA as part of the NVIDIA DeepStream SDK (covered in greater detail in later articles), which makes its integration with NVIDIA hardware much simpler than for a typical tracker. On the other hand, there is currently no published paper with metrics to compare it to other SOTA trackers; however, NVIDIA’s documentation describes its performance as comparable to DeepSORT.

DeepSORT

Simple Online and Realtime Tracking (SORT) is a pragmatic approach to multiple object tracking with a focus on simple, effective algorithms [59]. It performs Kalman filtering in image space and frame-by-frame data association using the Hungarian method, with an association metric that measures bounding box overlap, achieving favorable performance at high frame rates. In this way, SORT combines location and motion cues in a simple manner [65]. Despite this high performance, however, SORT produces a high number of identity switches. The DeepSORT authors improved on classical SORT by integrating appearance information into the algorithm; by doing this, they are able to track objects through longer periods of occlusion, reducing the number of identity switches. This is done through an offline pre-training stage where a deep association metric is learned on a large-scale person re-identification dataset. These extensions to the original SORT reduced identity switches by 45%.
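To make SORT’s association step concrete, here is a hedged sketch that matches Kalman-predicted track boxes to detections with the Hungarian method (scipy’s linear_sum_assignment), reusing the iou helper from the sketch above; the Kalman prediction itself and DeepSORT’s appearance features are omitted:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sort_association(predicted_boxes, detections, iou_threshold=0.3):
    """Match Kalman-predicted track boxes to detections, SORT-style.

    Boxes are (x1, y1, x2, y2). Returns (matched (track, detection)
    index pairs, unmatched track indices, unmatched detection indices).
    """
    cost = np.ones((len(predicted_boxes), len(detections)))
    for i, p in enumerate(predicted_boxes):
        for j, d in enumerate(detections):
            cost[i, j] = 1.0 - iou(p, d)      # low cost = high overlap
    rows, cols = linear_sum_assignment(cost)  # Hungarian method
    matches = [(i, j) for i, j in zip(rows, cols)
               if 1.0 - cost[i, j] >= iou_threshold]
    matched_t = {i for i, _ in matches}
    matched_d = {j for _, j in matches}
    unmatched_tracks = [i for i in range(len(predicted_boxes)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```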

BYTE and ByteTrack

Most methods obtain identities by associating detection boxes whose scores are higher than a threshold; objects with low detection scores, e.g. occluded objects, are simply thrown away, which causes a non-negligible number of missed true objects and fragmented trajectories. The authors of [65] propose an association method that tracks by associating almost every detection box instead of only the high-score ones, utilizing similarities with tracklets (small sets of paths associated with individual detections in consecutive frames) to recover true objects and filter out background detections. This method improved tracking performance across a large number of SOTA trackers. The authors of [65] also proposed a new tracker named ByteTrack, which achieved state-of-the-art performance on MOT17 with 30 FPS running on a V100 GPU, as shown in Figure 2.25, as well as SOTA results on MOT20, HiEve, and BDD100K. This tracker is built by equipping the high-performance detector YOLOX [19] with the association method BYTE.

One of the great contributions of this paper comes from the way it performs data association. Data association is essential to efficient tracking: it computes the similarity between tracklets and detection boxes, and leverages different strategies to match them according to that similarity. SORT combines location and motion cues by adopting a Kalman filter to predict the location of the tracklets in the new frame, then computes the IoU between the detection boxes and the predicted boxes as the similarity. DeepSORT [59] adopts a stand-alone Re-ID model to extract appearance features from the detection boxes [65]. After similarity computation, a matching strategy assigns identities to the objects; this can be done with the Hungarian algorithm [27] or by greedy assignment [69]. SORT [2] matches the detection boxes to the tracklets in a single matching step, while DeepSORT [59] proposes a cascaded matching strategy that first matches the detection boxes to the most recent tracklets and then to the lost ones. Other authors have proposed yet different ways to perform the matching. These methods, however, focus on how to design better association methods; [65] argues that the way detection boxes are utilized determines the upper bound of data association, so the authors focus instead on how to make full use of detection boxes, from high scores to low ones. They call this new data association method BYTE.

Using all the detection boxes, BYTE first associates the high-score detection boxes to the tracklets; Figure 2.26 illustrates the difference this data association method provides. In doing so, some tracklets are left unmatched because no appropriate high-score detection box matches them, which usually happens under occlusion, motion blur, or size changes. The authors then associate the low-score detection boxes with these unmatched tracklets, simultaneously recovering the objects behind low-score detections and filtering out the background. BYTE is very flexible and can be combined with other trackers and detectors. In the paper, the authors compared DeepSORT with BYTE using a light YOLOX model with a modified backbone; the data association method BYTE combined with the YOLOX detector is the foundation of what the authors call ByteTrack. This paper shows a large improvement over the current SOTA and opens new doors to exploring tracking with other SOTA detectors such as YOLOv5 and YOLOR.

Figure 2.25: MOTA-IDF1-FPS comparisons of different trackers on the MOT17 test set. The horizontal axis is FPS (running speed), the vertical axis is MOTA, and the radius of each circle is IDF1. ByteTrack achieves 80.3 MOTA and 77.3 IDF1 on the MOT17 test set at 30 FPS, outperforming all previous trackers [65].
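Schematically, BYTE’s two-stage association might look like the following simplified sketch (my own condensation of the paper’s idea, reusing sort_association from above; the real implementation also initializes new tracklets from unmatched high-score boxes):

```python
def byte_association(track_boxes, detections, scores, high_thresh=0.6):
    """Two-stage BYTE-style association (simplified sketch).

    track_boxes: Kalman-predicted boxes of existing tracklets.
    detections/scores: current-frame boxes and confidence scores.
    """
    high = [d for d, s in zip(detections, scores) if s >= high_thresh]
    low = [d for d, s in zip(detections, scores) if s < high_thresh]

    # Stage 1: associate tracklets with high-score detections first.
    matches_high, leftover, unmatched_high = sort_association(track_boxes, high)

    # Stage 2: give leftover tracklets a second chance against the
    # low-score boxes, which are often occluded or blurred true objects.
    leftover_boxes = [track_boxes[i] for i in leftover]
    matches_low, still_unmatched, _ = sort_association(leftover_boxes, low)

    # Unmatched low-score boxes are treated as background and dropped;
    # unmatched high-score boxes would start new tracklets.
    # (matches_low indexes into leftover/low; kept simple for the sketch.)
    return matches_high, matches_low, unmatched_high, still_unmatched
```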

OC-SORT

Most current motion models in MOT (multiple object tracking) typically assume that object motion is linear within a small time window and require continuous observations, so these methods are sensitive to occlusions and non-linear motion and require high-framerate videos. The work in [5] shows that a simple motion model can obtain state-of-the-art tracking performance without other cues such as appearance. OC-SORT [5] brings multiple innovations to SORT. It adds an Observation-centric Online Smoothing (OOS) strategy to alleviate the error accumulation in the Kalman filter caused by a lack of observations. The authors also incorporate the direction consistency of tracklets into the cost matrix for better matching between tracklets and observations. Finally, to deal with objects going untracked due to occlusion within a short time window, they propose to recover them by associating their last observations with the new observations, which they call Observation-Centric Recovery (OCR). The authors named this method Observation-Centric SORT, or OC-SORT. It remains simple, online, and real-time, but improves robustness to occlusion and non-linear motion, achieving 63.2 and 62.1 HOTA on MOT17 and MOT20 [5].
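As a hedged illustration of the direction-consistency idea (my own simplification, not the paper’s exact formulation), a term like the following can be added, with a small weight, to the association cost matrix:

```python
import math

def direction(p, q):
    """Angle of motion from point p to point q (e.g., box centers)."""
    return math.atan2(q[1] - p[1], q[0] - p[0])

def direction_cost(prev_obs, last_obs, detection_center):
    """Angular disagreement between a tracklet's observed motion
    direction and the direction implied by a candidate detection."""
    track_dir = direction(prev_obs, last_obs)
    cand_dir = direction(last_obs, detection_center)
    diff = abs(track_dir - cand_dir)
    return min(diff, 2 * math.pi - diff)  # wrap the angle to [0, pi]
```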

StrongSORT

The authors of [14] revisited the classical tracker DeepSORT and upgraded it in various aspects, i.e., detection, embedding, and association. The resulting tracker, called StrongSORT, sets new HOTA and IDF1 records on MOT17 and MOT20. In addition, the authors propose two plug-and-play algorithms to further refine the tracking results: AFLink and Gaussian-smoothed interpolation (GSI). Applying both of these to StrongSORT yields StrongSORT++, which ranks first on MOT17 and MOT20 in terms of the HOTA and IDF1 metrics, as shown in Figure 2.27 and Table 2.5 [14].

Figure 2.27: IDF1-MOTA-HOTA comparisons of state-of-the-art trackers with StrongSORT and StrongSORT++ on the MOT17 and MOT20 test sets. The horizontal axis is MOTA, the vertical axis is IDF1, and the radius of each circle is HOTA. “*” next to DeepSORT denotes the version reproduced by [14].
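To give a flavor of GSI (a sketch under my own assumptions, using scikit-learn’s Gaussian process regressor rather than the authors’ code), each box coordinate of a track can be regressed against the frame index and re-evaluated on the missing frames:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gsi_smooth(frames, boxes, all_frames, length_scale=10.0):
    """Gaussian-smoothed interpolation of one track (sketch).

    frames: (N,) frame indices where the track was observed.
    boxes:  (N, 4) observed boxes. all_frames: frame indices to fill.
    """
    t = np.asarray(frames, dtype=float).reshape(-1, 1)
    t_all = np.asarray(all_frames, dtype=float).reshape(-1, 1)
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=length_scale))
    smoothed = []
    for k in range(4):  # smooth each box coordinate independently
        gpr.fit(t, np.asarray(boxes)[:, k])
        smoothed.append(gpr.predict(t_all))
    return np.stack(smoothed, axis=1)  # (len(all_frames), 4) smoothed boxes
```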


Pedro Azevedo

Master’s student at the University of Aveiro, Portugal. Focused on Deep Learning and Computer Vision for Autonomous Driving.