Want Object Tracking? Try Deep Sort

Ahmedabdullah · Published in Red Buffer · Feb 15, 2022 · 9 min read

Photo by Tim Hüfner on Unsplash

Hello, fellow Computer Vision (CV) engineers. If you have landed on this article, chances are you are one of the many CV practitioners who have faced the problem of object tracking and are now looking for a solution that handles it exceptionally well. If so, you are in the right spot!

For those who are relatively new to the domain or are reading to acquire knowledge, let me give you a brief explanation of the problem. Often in a CV application, we need to track certain detected objects to monitor their behavior or activity over a period of time, which means we need to follow each object continuously. Some might argue: if that's the case, why not just detect the object in every frame and monitor it? Well, let me make the problem a bit more complex. Say there's a football player with the ball, and the detection model does a good job of detecting the ball initially; but as the game moves on and the player starts dribbling, there will be frames where the ball is only partially visible. The model won't be able to retain the detection there.

Let's take it a step further and imagine that you're the coach of Real Madrid and you want your algorithm to see how Modric and Casemiro each perform with the ball. You give each of them a ball, and now the problem isn't single object detection; instead, you have to detect both balls and associate each one with a player. What if the balls cross each other? How will you then maintain the association?

Last but not least, what about the computational cost of running a detector on every frame? Why not detect once and then keep tracking? This is why object tracking is required.

Now that we're clear on why it's required, let's move to the next step, which is formally distinguishing detection from tracking.

Object Detection vs Object Tracking

In object detection, we detect an object in a frame, put a bounding box or a mask around it, and classify it into one of the classes. The job of the detector ends there: it processes each frame independently and identifies the objects in that particular frame. An object tracker, on the other hand, needs to follow a particular object across the entire video. If the detector detects 3 cars in a frame, the tracker has to identify the 3 separate detections and follow them across the subsequent frames (with the help of unique IDs).

Now that we know the what and why of object tracking, along with the difference between tracking and detection, let's move to the main questions: why use Deep SORT, and how do traditional approaches solve this problem?

Traditional Methods

If you look up object tracking, some of the most basic and easy-to-implement algorithms you will find are the native cv2 tracking algorithms built into OpenCV. Examples and comparisons are explained in the following pyimage link; a minimal usage sketch follows the list.

  1. BOOSTING Tracker: Based on the same algorithm used to power the machine learning behind Haar cascades (AdaBoost), but like Haar cascades, is over a decade old. This tracker is slow and doesn’t work very well. Interesting only for legacy reasons and comparing other algorithms. (minimum OpenCV 3.0.0)
  2. MIL Tracker: Better accuracy than BOOSTING tracker but does a poor job of reporting failure. (minimum OpenCV 3.0.0)
  3. KCF Tracker: Kernelized Correlation Filters. Faster than BOOSTING and MIL, but like MIL it does not handle full occlusion well. (minimum OpenCV 3.1.0)
  4. CSRT Tracker: Discriminative Correlation Filter (with Channel and Spatial Reliability). Tends to be more accurate than KCF but slightly slower. (minimum OpenCV 3.4.2)
  5. MedianFlow Tracker: Does a nice job reporting failures; however, if there is too large of a jump in motion, such as fast-moving objects, or objects that change quickly in their appearance, the model will fail. (minimum OpenCV 3.0.0)
  6. TLD Tracker: I’m not sure if there is a problem with the OpenCV implementation of the TLD tracker or the actual algorithm itself, but the TLD tracker was incredibly prone to false-positives. I do not recommend using this OpenCV object tracker. (minimum OpenCV 3.0.0)
  7. MOSSE Tracker: Very, very fast. Not as accurate as CSRT or KCF but a good choice if you need pure speed. (minimum OpenCV 3.4.1)
  8. GOTURN Tracker: The only deep learning-based object tracker included in OpenCV. It requires additional model files to run (not covered in this post). It reportedly handles viewing changes well, but my initial experiments didn't confirm this, and it was a bit of a pain to use.
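To give a feel for the API, here is a minimal sketch of using one of these built-in trackers. It assumes opencv-contrib-python is installed; the video path is a placeholder, and on newer OpenCV builds some constructors live under cv2.legacy.

import cv2

cap = cv2.VideoCapture("match.mp4")        # placeholder video path
ok, frame = cap.read()

tracker = cv2.TrackerCSRT_create()         # or cv2.legacy.TrackerCSRT_create()
bbox = cv2.selectROI("init", frame)        # draw the initial box by hand
tracker.init(frame, bbox)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    success, bbox = tracker.update(frame)  # one tracking step per frame
    if success:
        x, y, w, h = map(int, bbox)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

Swapping in another tracker from the list above is a one-line change, e.g. cv2.TrackerKCF_create().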

Let's review a few more techniques. Assume that we have bounding box information for all objects in the frame; in real-world applications we need to produce those detections first, so the tracker has to be combined with a detector, but for now let's focus on the tracking part alone. Given bounding box information for each ID in frame 1, how do we assign IDs in subsequent frames?

  1. Centroid-based ID assignment: In its simplest form, we can assign IDs by looking at the bounding box centroids. We calculate the centroid of each bounding box in frame 1. In frame 2, we look at the new centroids and assign IDs based on their distance from the previous centroids. The basic assumption is that from frame to frame a centroid only moves a little. This simple approach works quite well as long as the centroids are spaced apart from each other; as you can imagine, it fails when objects get close to each other, since it may then switch IDs (see the sketch after this list).
  2. Kalman Filter: The Kalman filter is an improvement over simple centroid-based tracking (this blog does a great job of explaining it). It lets us model tracking based on the position and velocity of an object and predict where it is likely to be next. It models future position and velocity using Gaussians; when it receives a new measurement, it can probabilistically associate the measurement with its prediction and update itself. It is light on memory and fast to run, and since it uses both position and velocity, it gives better results than centroid-based tracking.
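As promised, here is a minimal sketch of centroid-based ID assignment. The function name and data layout are mine, and it assumes both frames actually contain boxes; a real tracker would also handle objects entering and leaving the scene.

import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_ids(prev_centroids, new_centroids):
    """prev_centroids: {track_id: (x, y)}; new_centroids: list of (x, y).
    Returns {index into new_centroids: track_id} via nearest-centroid matching."""
    ids = list(prev_centroids)
    prev = np.array([prev_centroids[i] for i in ids])  # (P, 2)
    new = np.array(new_centroids)                      # (N, 2)
    # Pairwise Euclidean distances between old and new centroids.
    dist = np.linalg.norm(prev[:, None, :] - new[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(dist)           # globally optimal matching
    return {int(c): ids[r] for r, c in zip(rows, cols)}

prev = {1: (100.0, 50.0), 2: (300.0, 80.0)}
new = [(305.0, 82.0), (98.0, 55.0)]
print(assign_ids(prev, new))  # new box 0 keeps ID 2, new box 1 keeps ID 1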

Once you get your hands on one of these techniques, you'll see that they do not perform as well as a real-world scenario might demand. So what's next? The thought that pops into mind is: why not use a good detector like YOLO and then, based on the bounding boxes it produces, compute frame-to-frame differences to track the object? If that was your intuition, you're right; it is a step up from the native cv2 tracking algorithms, and it is what we call SORT.

SORT

SORT (Simple Online and Realtime Tracking) is the approach that uses the Kalman filter. Using the bounding boxes detected by YOLOv3, we can assign an ID and track an object by matching bounding boxes of similar size and similar motion across the previous and following frames. On the basis of the assigned IDs we can now track the objects and monitor their actions for the entire video segment, and life's good. Unless...
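To make the Kalman filter part concrete, here is a minimal constant-velocity filter for a single track's centroid, using the filterpy library (which the original SORT implementation also builds on). Note that SORT's real state additionally tracks the box scale and aspect ratio, and all numbers below are illustrative.

import numpy as np
from filterpy.kalman import KalmanFilter

kf = KalmanFilter(dim_x=4, dim_z=2)      # state: [x, y, vx, vy]; measure: [x, y]
dt = 1.0                                 # one frame per step
kf.F = np.array([[1, 0, dt, 0],          # constant-velocity transition model
                 [0, 1, 0, dt],
                 [0, 0, 1,  0],
                 [0, 0, 0,  1]], dtype=float)
kf.H = np.array([[1, 0, 0, 0],           # we only observe the centroid
                 [0, 1, 0, 0]], dtype=float)
kf.R *= 10.0                             # measurement noise
kf.P *= 100.0                            # initial state uncertainty

kf.x[:2] = np.array([[320.0], [240.0]])  # initialise at the first detection
for z in [(322.0, 243.0), (325.0, 247.0)]:  # new centroid measurements
    kf.predict()                         # where should the object be now?
    kf.update(np.array(z))               # correct with the new detection
print(kf.x[:2].ravel())                  # filtered position estimate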

Problems with SORT?

SORT performs well as long as the objects it is tracking do not collide or overlap. Let me clarify that statement. As we all know, one of the major problems in CV is that we are left with a 2D representation of the world unless we can do depth estimation. In a 2D representation, even if a person walks several meters behind another, the crossing appears as a collision. As soon as such a collision occurs, the SORT tracker assigns the objects new IDs, and all the previous actions or activities of each object are lost. The solution?

Solution?

What if we could somehow make our tracker smart enough that, once it gets a detection from the detector, it learns what the object actually looks like and what its attributes are, so that after a collision it can remember which object had which ID and assign the same IDs back instead of issuing new ones? Wouldn't that solve all our problems?

Deep SORT

Deep SORT is the solution that does exactly what we discussed above. Let's dive deeper into how it does that.

Architecture

In Deep SORT, the process is as follows.

  1. Compute bounding boxes using YOLOv3 (detections)
  2. Use SORT (Kalman filter) and ReID (a re-identification model) to link bounding boxes and tracks
  3. If no link can be made, a new ID is assigned and the detection is added to tracks.

What is referred to as "detections" is the list of people in one frame, and "tracks" is the list of objects currently being tracked. Each item in tracks is assigned an ID, and by assigning a bounding box to each of those items, you can assign an ID to the person.

ReID is mainly used when linking bounding boxes and tracks. To link them, we compute the distance between the feature vectors that ReID extracts from the image of the current tracking target (tracks) and the feature vectors that ReID extracts from the person image cropped out by the YOLOv3 bounding box (detections). Simply put, the detection with the smallest distance is considered to be the same object instance and is assigned that track's ID. To calculate the distance, feature vectors from the last 100 frames of each track are used. At this point, the coordinate information of the track is not taken into account.
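To make that concrete, here is a minimal sketch of the appearance-matching step, assuming each track keeps a gallery of up to 100 past ReID feature vectors. The function name and array layout are mine, not the reference implementation's.

import numpy as np

def appearance_cost(track_gallery, det_feature):
    """track_gallery: (K, D) past ReID features of one track (K <= 100);
    det_feature: (D,) ReID feature of the current detection.
    Returns the smallest cosine distance over the gallery."""
    gallery = track_gallery / np.linalg.norm(track_gallery, axis=1, keepdims=True)
    det = det_feature / np.linalg.norm(det_feature)
    cosine_dist = 1.0 - gallery @ det   # distance to every stored feature
    return float(cosine_dist.min())     # best match wins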

In the paper, the association cost is defined as λ times the SORT (motion) distance plus (1 − λ) times the ReID (appearance) distance. Empirically, λ = 0 turned out to give good results, so the coordinate information is not taken into account in the cost itself.

As the Deep SORT authors put it, during their experiments they found that setting λ = 0 is a reasonable choice when there is substantial camera motion. In this setting, only appearance information is used in the association cost term; however, the Mahalanobis gate is still used to disregard infeasible assignments based on the possible object locations inferred by the Kalman filter.

If a detection is too far from the position predicted from SORT's past tracking information for the current frame, the ID will not be assigned. When a bounding box is left without any ID, SORT alone is used to assign one.
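Putting the pieces together, here is a minimal sketch of the gated association cost described above. The variable names are mine; λ = 0 follows the paper's choice, and the gate value is the chi-square 95% quantile for a 4-dimensional measurement used in the paper.

LAMBDA = 0.0    # paper: appearance-only works well under camera motion
GATE = 9.4877   # chi-square 95% quantile, 4 degrees of freedom

def association_cost(mahalanobis_dist, appearance_dist):
    """Combine motion and appearance distances; gate out infeasible pairs."""
    if mahalanobis_dist > GATE:
        return float("inf")  # too far from the Kalman prediction: no link
    return LAMBDA * mahalanobis_dist + (1.0 - LAMBDA) * appearance_dist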

If a track's bounding box stays "lost" for 70 frames, the track is removed from tracking.

The ReID model is trained on over 1,100,000 images of 1,261 pedestrians from a large-scale person re-identification dataset (MARS).

Comparison of Deep SORT to other techniques

Usage

To use Deep SORT with the ailia SDK, use the sample below. In this sample, Deep SORT tracks people detected by YOLOv3. You can use the following command to run tracking on your webcam.

$ python3 deepsort.py -v 0

You can also calculate the similarity of an object by giving it two still images.

$ python3 deepsort.py --pairimage IMAGE_PATH1 IMAGE_PATH2

Note that scipy is required, since the Kalman filter computation uses it.

Conclusion

There are many implementations of Deep SORT; feel free to check out one repository on GitHub here. I have personally used Deep SORT on two of my past projects, and believe me, the results are promising. Many of you might be wondering: if this article is about Deep SORT, why explain everything else? It's easy to use a technique off the shelf, but it's far more useful to know how the techniques compare so you can choose one over another. It also helps you avoid overkill in situations where a simple tracking algorithm would have sufficed. I had been meaning to write about this particular algorithm for a while, and I appreciate the work done here, from which I took help for my write-up. Credit where it's due!
