Object Tracking using DeepSORT in TensorFlow 2

Published in Analytics Vidhya, Oct 25, 2020 (8 min read)

Detecting and tracking objects are among the most prevalent and challenging tasks that a surveillance system has to accomplish in order to identify meaningful events and suspicious activities. In this article we introduce the concept of object tracking, its challenges, and traditional methods, and then implement such a system in TensorFlow 2.0.

Car tracking using YOLO and DeepSORT tracking

Object Tracking

A video is a sequence of images, each called a frame, displayed at a high enough frequency that the human eye perceives its content as continuous. Any image processing technique can therefore be applied to individual frames. In addition, the contents of two consecutive frames are usually closely related.

Object detection in videos involves verifying the presence of an object in an image sequence and possibly locating it precisely for recognition. Object tracking monitors an object’s spatial and temporal changes during a video sequence, including its presence, position, size, and shape. This is done by solving the temporal correspondence problem: matching the target region in successive frames of a sequence of images taken at closely spaced time intervals. The two processes are closely related because tracking usually starts with detecting objects, while repeatedly detecting an object in subsequent frames is often necessary to support and verify tracking.

Object tracking is the task of taking an initial set of object detections, creating a unique ID for each of them, and then following each object as it moves across the frames of a video while maintaining the ID assignment.

Challenges

Whenever there is a moving object in a video, there are cases in which its visual appearance is not clear. In such cases detection fails, while tracking can still succeed because it also relies on a motion model and the history of the object.

Occlusion problem on pedestrian tracking

Here are some challenges in object tracking:

Occlusion: This occurs when the object we are tracking is hidden (occluded) by another object, such as two people walking past each other or a car driving under a bridge. The problem is deciding what to do when an object disappears and later reappears.
Scale change: The apparent size of the object changes as it moves toward or away from the camera.
Background clutter: The background near the object has a similar color or texture to the object itself. Hence, it becomes harder to track a small object against a cluttered background.
Appearance change: Different viewpoints of an object may look very different visually, especially without context. Hence, it becomes very difficult to identify the object using visual detection alone.

Traditional methods

Optical Flow

Optical flow, or motion estimation, is a fundamental method of calculating the motion of image intensities, which may be ascribed to the motion of objects in the scene. It provides a concise description of both the regions of the image undergoing motion and the velocity of motion. In practice, computation of optical flow is susceptible to noise and illumination changes.

Optical flow can be used to detect independently moving objects, even in the presence of camera motion. However, optical-flow-based techniques are computationally complex and hence require fast hardware and software to implement. Since optical flow is fundamentally a differential quantity, its estimation is highly susceptible to noise, and reducing that noise sensitivity can further increase complexity. Therefore, smart-camera-based video surveillance systems that use optical-flow calculations of some type must be equipped with substantial computational resources. Indeed, it is because such technologies are being introduced to smart cameras that video surveillance systems with significant intelligent capability are being envisioned and realized.
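To make the "differential quantity" idea concrete, here is a minimal sketch of a Lucas-Kanade style estimate, one common way of computing optical flow (the article does not say which variant it has in mind). It assumes brightness constancy and a single flow vector for the whole patch. The synthetic `image` pattern, the patch size, and the function names are illustrative assumptions, not part of any library.

```python
import math

def image(x, y, dx=0.0, dy=0.0):
    """Synthetic smooth intensity pattern, shifted by (dx, dy) pixels (illustrative)."""
    return math.sin(0.3 * (x - dx)) + math.cos(0.25 * (y - dy))

def make_frame(size, dx=0.0, dy=0.0):
    """Build a size x size frame as a list of rows."""
    return [[image(x, y, dx, dy) for x in range(size)] for y in range(size)]

def lucas_kanade(frame0, frame1):
    """Estimate one (u, v) flow vector for the whole patch by least squares.

    Solves Ix*u + Iy*v + It = 0 over all interior pixels, where Ix, Iy are
    spatial gradients (central differences) and It is the frame difference.
    """
    size = len(frame0)
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(1, size - 1):
        for x in range(1, size - 1):
            ix = (frame0[y][x + 1] - frame0[y][x - 1]) / 2.0
            iy = (frame0[y + 1][x] - frame0[y - 1][x]) / 2.0
            it = frame1[y][x] - frame0[y][x]
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy  # near zero means the patch has too little texture
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v

frame0 = make_frame(24)
frame1 = make_frame(24, dx=0.2, dy=-0.1)  # scene moved by a known sub-pixel amount
u, v = lucas_kanade(frame0, frame1)
print(f"estimated flow: u={u:.3f}, v={v:.3f}")
```

Because the gradients are finite differences, any noise in the frames feeds directly into `u` and `v`, which is the sensitivity described above.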

Meanshift

Mean Shift is a non-parametric iterative algorithm that can be used for many purposes, such as finding modes and clustering. It has been widely used in target tracking for many years because it needs few iterations and offers good real-time performance. However, because the traditional Mean Shift algorithm represents the target with only a single color histogram, it does not track well in some cases, especially under complicated conditions.
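The mode-finding iteration itself is short. The sketch below is a one-dimensional toy version with a flat kernel. In a tracker the points would instead be pixel positions weighted by how well they match the target's color histogram. The function name, bandwidth, and sample data are assumptions made for illustration.

```python
def mean_shift_mode(points, start, bandwidth=1.0, tol=1e-4, max_iter=100):
    """Shift `start` toward the nearest density mode using a flat kernel."""
    center = start
    for _ in range(max_iter):
        # Keep only the samples that fall inside the current window.
        window = [p for p in points if abs(p - center) <= bandwidth]
        if not window:
            break  # empty window: nothing to move toward
        new_center = sum(window) / len(window)  # mean of the window
        if abs(new_center - center) < tol:
            return new_center  # converged
        center = new_center
    return center

# Two clusters of samples, around 1 and around 5 (made-up data).
points = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1]
low_mode = mean_shift_mode(points, start=1.5)
high_mode = mean_shift_mode(points, start=4.5)
print(low_mode, high_mode)
```

Each starting point converges to the center of the cluster it begins nearest to, which is how a Mean Shift tracker re-centers its window on the target from one frame to the next.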

Kalman Filters

The Kalman filter for tracking moving objects estimates a state vector comprising the parameters of the target, such as position and velocity, based on a dynamic/measurement model. Different movement conditions and occlusions can hinder visual tracking of an object. The Kalman filter is a natural choice here because it tolerates short occlusions and complex object movements.

The Kalman filter is a recursive estimator: to estimate the current state, it needs only the previous state estimate and the current measurement. These two are sufficient. The filter combines the predicted system state with the new measurement using a weighted average.

The Kalman filter is one of the simplest linear state-space models and is widely used when the system and measurement models are known to be linear. When the unknown state variables are estimated recursively over time, the measured values carry uncertainty in the form of process noise and measurement noise, and the Kalman filter assumes this noise is Gaussian. Because of this noise, the system cannot be modeled entirely deterministically. The Kalman filter therefore uses linear equations with white Gaussian noise as its standard model.
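The weighted-average view is easiest to see in the scalar case. The following is a minimal sketch of one predict/update cycle for a single, roughly constant quantity. The noise values `q` and `r` and the measurements are illustrative assumptions.

```python
def kalman_step(x, p, z, q=0.01, r=1.0):
    """One predict/update cycle of a scalar Kalman filter.

    x, p : previous state estimate and its variance
    z    : new measurement
    q, r : process and measurement noise variances (assumed values)
    """
    # Predict: the state is modeled as constant, so only the uncertainty grows.
    x_pred = x
    p_pred = p + q
    # Update: the Kalman gain k is the weight given to the measurement.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)  # same as (1 - k) * x_pred + k * z
    p_new = (1.0 - k) * p_pred
    return x_new, p_new, k

estimate, variance = 0.0, 1.0
for z in [10.2, 9.8, 10.1, 9.9]:  # made-up noisy measurements
    estimate, variance, gain = kalman_step(estimate, variance, z)
    print(f"z={z:5.1f}  estimate={estimate:6.3f}  variance={variance:.3f}  gain={gain:.3f}")
```

Only the previous estimate and the current measurement enter each step, and the gain shrinks as the estimate becomes more certain.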

Deep SORT

One of the most widely used object tracking frameworks is Deep SORT, an extension of SORT (Simple Online and Realtime Tracking).

we integrate appearance information to improve the performance of SORT. Due to this extension we are able to track objects through longer periods of occlusions, effectively reducing the number of identity switches.

Source: Simple Online and Realtime Tracking with a Deep Association Metric, 2017

Person Tracking using DeepSORT Algorithm. Source: https://arxiv.org/pdf/1703.07402.pdf

Kalman Filter

Our tracking scenario is defined on the eight-dimensional state space (u, v, γ, h, ẋ, ẏ, γ̇, ḣ), which contains the bounding box center position (u, v), aspect ratio γ, height h, and their respective velocities in image coordinates. A standard Kalman filter with a constant-velocity motion model and a linear observation model is used, where the bounding box coordinates (u, v, γ, h) are taken as direct observations of the object state.
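As a rough sketch of what this state looks like in code, the snippet below converts a box to the observed quantities (u, v, γ, h) and applies one constant-velocity prediction to the state mean. The top-left `(x, y, w, h)` input format, the function names, and the time step `dt` are assumptions for illustration. The covariance update is omitted.

```python
def to_measurement(x, y, w, h):
    """Convert a top-left (x, y, width, height) box to (u, v, gamma, h)."""
    return [x + w / 2.0, y + h / 2.0, w / h, h]

def transition_matrix(dt=1.0):
    """8x8 constant-velocity matrix: each of u, v, gamma, h gains dt * its velocity."""
    f = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
    for i in range(4):
        f[i][i + 4] = dt
    return f

def predict_mean(mean, dt=1.0):
    """Propagate the 8-dimensional state mean one step forward."""
    f = transition_matrix(dt)
    return [sum(f[i][j] * mean[j] for j in range(8)) for i in range(8)]

# Linear observation model: a 4x8 matrix that picks (u, v, gamma, h) from the state.
H = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(4)]

measurement = to_measurement(10, 20, 40, 80)
state = measurement + [2.0, -1.0, 0.0, 0.5]  # append made-up velocities
predicted = predict_mean(state)
observed = [sum(H[i][j] * predicted[j] for j in range(8)) for i in range(4)]
print(predicted)
print(observed)
```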

For each detection, we create a “Track” that holds all the necessary state information. Each track also keeps a counter of the frames since its last successful detection, so that tracks whose last match was long ago can be deleted, as those objects have probably left the scene. To eliminate duplicate tracks, a new track must reach a minimum number of detections within its first few frames.
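One possible way to express this lifecycle is a small state machine, sketched below. The parameter names `n_init` and `max_age`, their default values, and the rule that an unconfirmed track is dropped on its first miss are assumptions chosen for illustration, not necessarily what a given implementation does.

```python
class Track:
    """Bookkeeping for a single tracked object (Kalman state omitted)."""

    def __init__(self, track_id, n_init=3, max_age=30):
        self.track_id = track_id
        self.n_init = n_init          # detections needed before the track is trusted
        self.max_age = max_age        # missed frames tolerated before deletion
        self.hits = 1                 # the detection that created the track
        self.time_since_update = 0
        self.state = "tentative"

    def mark_matched(self):
        """Call when a detection was associated with this track in the current frame."""
        self.hits += 1
        self.time_since_update = 0
        if self.state == "tentative" and self.hits >= self.n_init:
            self.state = "confirmed"

    def mark_missed(self):
        """Call when no detection was associated with this track in the current frame."""
        self.time_since_update += 1
        if self.state == "tentative" or self.time_since_update > self.max_age:
            self.state = "deleted"

track = Track(track_id=1)
track.mark_matched()
track.mark_matched()
print(track.state)  # confirmed after three detections in total
```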

Once the Kalman filter has predicted the new bounding boxes for the existing tracks, the next problem is associating new detections with these predictions. Since detections and predictions are produced independently, we face an assignment problem: we do not know in advance which track_i corresponds to which incoming detection_k.

To solve this, we need 2 things: A distance metric to quantify the association and an efficient algorithm to associate the data.

The Deep SORT authors use the squared Mahalanobis distance (an effective metric when dealing with distributions) to incorporate the uncertainties from the Kalman filter. Thresholding this distance gives a good indication of which associations are plausible. This metric is more informative than the Euclidean distance because it measures how far a detection lies from the predicted distribution, taking its uncertainty into account (every state estimate under the Kalman filter is a distribution). For the assignment itself, the authors use the standard Hungarian algorithm, a simple and effective combinatorial optimization algorithm that solves the assignment problem.
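The two ingredients can be sketched as follows. To stay dependency-free, this example assumes a diagonal covariance and replaces the Hungarian algorithm with a brute-force search over permutations, which gives the same optimal assignment on tiny problems but does not scale. A library routine such as SciPy's `linear_sum_assignment` would normally take its place. The gate value is the 0.95 quantile of the chi-square distribution with 4 degrees of freedom. All data and function names are illustrative.

```python
from itertools import permutations

GATE = 9.4877  # chi-square 0.95 quantile, 4 degrees of freedom

def squared_mahalanobis(detection, mean, variances):
    """Squared Mahalanobis distance, assuming a diagonal covariance matrix."""
    return sum((d - m) ** 2 / s for d, m, s in zip(detection, mean, variances))

def assign(cost, gate=GATE):
    """Minimum-cost assignment by brute force (stand-in for the Hungarian algorithm).

    cost[i][j] is the distance between track i and detection j.
    Assumes there are at least as many detections as tracks.
    Matches whose distance exceeds the gate are dropped afterwards.
    """
    n_tracks, n_dets = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for perm in permutations(range(n_dets), n_tracks):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    if best is None:
        return []
    return [(i, j) for i, j in enumerate(best) if cost[i][j] <= gate]

# Predicted track states (u, v, gamma, h) with made-up per-component variances.
track_means = [[100.0, 100.0, 0.5, 80.0], [300.0, 120.0, 0.5, 90.0]]
variances = [25.0, 25.0, 0.01, 25.0]
detections = [[302.0, 121.0, 0.5, 91.0], [101.0, 99.0, 0.5, 80.0], [500.0, 400.0, 0.5, 60.0]]

cost = [[squared_mahalanobis(d, m, variances) for d in detections] for m in track_means]
matches = assign(cost)
print(matches)  # list of (track index, detection index) pairs
```

The third detection stays unmatched and would start a new track.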

The appearance feature vector

Up to this point, we have an object detector giving us detections, a Kalman filter tracking those detections and predicting tracks through missed detections, and the Hungarian algorithm solving the association problem.

Despite its effectiveness, the Kalman filter fails in many real-world scenarios, such as occlusions and changing viewpoints.

So, to improve this, the authors of Deep SORT introduced another distance metric based on the “appearance” of the object.

The authors first built a classifier over the dataset, trained it until it achieved reasonably good accuracy, and then stripped the final classification layer. Assuming a classical architecture, we are left with a dense layer producing a single feature vector, waiting to be classified. That feature vector becomes our “appearance descriptor” of the object.

Deep Appearance Descriptor. Source: https://arxiv.org/pdf/1703.07402.pdf

The output of the “Dense 10” layer shown in the figure above is our appearance feature vector for a given crop. Once the network is trained, we just need to pass the crop of each detected bounding box through it to obtain a 128 × 1 feature vector.

The updated distance metric will be:

D = λ · Dk + (1 − λ) · Da

where Dk is the Mahalanobis distance, Da is the cosine distance between the appearance feature vectors, and λ is the weighting factor.
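A minimal sketch of this combined metric, using plain Python lists as stand-ins for the 128-dimensional appearance vectors, could look like this. The function names, the short example vectors, and the value chosen for the weighting factor (`lam` in the code) are illustrative assumptions.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def combined_cost(d_k, d_a, lam):
    """Weighted sum of the motion distance Dk and the appearance distance Da."""
    return lam * d_k + (1.0 - lam) * d_a

track_feature = [0.9, 0.1, 0.4]        # stored appearance descriptor (made up)
detection_feature = [0.8, 0.2, 0.5]    # descriptor of a new detection (made up)

d_a = cosine_distance(track_feature, detection_feature)
d_k = 0.24                             # Mahalanobis distance from the motion model (made up)
print(combined_cost(d_k, d_a, lam=0.5))
```

Setting `lam` to 0 uses appearance only, while setting it to 1 uses motion only.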

A simple distance metric combined with a powerful deep learning technique is all it took for Deep SORT to become an elegant and widely used object tracker.

For further reading, see the paper “Comparison and study of Pedestrian Tracking using Deep SORT and state of the art detectors”, which evaluates and compares pedestrian tracking using the Deep SORT algorithm in conjunction with various state-of-the-art object detectors: YOLO, SSD, and Faster RCNN.