DeepSORT — Deep Learning applied to Object Tracking

Ritesh Kanjee · Augmented Startups
12 min read · Aug 31, 2020

Introduction

So in this article, I'm going to give you a clear and simple explanation of how DeepSORT works and why it's so amazing compared to other models like Tracktor++, TrackR-CNN and JDE. But to understand how DeepSORT works, we first have to go back, waaay back, and understand the fundamentals of object tracking and the key innovations that had to happen along the way for DeepSORT to emerge.

Before we get started, if you are interested in developing object tracking apps, then check out my course in the link down below, where I show you how you can fuse the popular YOLOv4 with DeepSORT for robust, real-time applications.

Okay, so back to object tracking. Now let's imagine that you are working for SpaceX, and Mr. Musk has tasked you with ensuring that on launch, the ground camera is always pointing at the Falcon 9 as it thrusts into the atmosphere. As excited as you are to be personally chosen by Elon to work on this task, you ask yourself,

“How will I go about this?”

Well, given that you have a PTZ (pan-tilt-zoom) camera aimed at the rocket, you will need to implement a way to track the rocket and keep it at the center of the image. So far so good…? Just note that if you do not track it properly, your PTZ motion will stray off the target and you'll end up with a really disappointed Musk. And you cannot screw this up, because this is your first job and you really want Elon Musk to be impressed. I mean, who wouldn't, right?

Soo, question… how will you track the rocket? Well, you might say,

Well Ritz, you did a whole tutorial series on object detection, why don't we just track by detection, you know, umm, using something like YOLOv4 or Detectron2?

Hahaha, okay okay okay, let's see what happens if we use this method.

So the Falcon 9 launches on a day with clear blue skies, and you are armed with state-of-the-art detection models to keep the camera centered on the rocket. Everything is going well until, all of a sudden, a stray pigeon swoops in front of the camera (cue the "You see me rollin', they hatin'" song), occluding the rocket, and just like that the rocket is out of sight… The boss is not happy. Deep down inside you feel your heart sink and your soul crushed by the disappointment.

But you light up some greens, he chills out and after a smoke or two, he decides to give you another chance.

The high has also given you a chance to reflect on why this did not work. You conclude that while detection works great on single frames, there needs to be a correlation of tracked features between sequential frames of the video. Otherwise, with any sort of occlusion, you will lose the detection and your target may slip out of the frame.

Optical Flow and Mean Shift

So you dig a little deeper in an attempt not to disappoint Mr. Musk again, and you go back to traditional methods such as mean shift and optical flow.

Mean Shift

Starting with mean shift, you find out that it works by taking our object of interest, which you can visualize as a blob of pixels, so not just location, but also size. In this case the Falcon 9 rocket that we are detecting is our blob. Then you go to the next frame and search within a larger region of interest, known as the neighborhood, for the same blob. You'll want to find the blob of pixels or features in the next frame that best represents our rocket, by maximizing a similarity function.

This strategy makes a lot of sense. If your dog goes missing, you won't just drive to the countryside, but instead start by searching your immediate neighborhood for your best friend. Unless of course you have a dog like Lassie. In that case, she'll find you. [Woof]
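If you want to poke at this idea yourself, here is a minimal mean shift sketch using OpenCV. The video path, the initial box and the hue-only histogram are illustrative assumptions, not a real launch pipeline:

```python
import cv2

# Minimal mean shift sketch (illustrative values throughout).
cap = cv2.VideoCapture("launch.mp4")        # hypothetical footage
ok, frame = cap.read()
x, y, w, h = 300, 200, 100, 50              # hypothetical initial blob
roi = frame[y:y + h, x:x + w]

# Model the blob by its hue histogram, then back-project each new frame.
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

# Stop after 10 iterations or when the window moves less than 1 pixel.
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
track_window = (x, y, w, h)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    # Shift the window toward the densest region of the back-projection.
    _, track_window = cv2.meanShift(back_proj, track_window, term_crit)
```

Notice that the search is purely local: the window only slides toward nearby probability mass, which is exactly why a fast-moving target can escape it.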

Optical Flow

The other tool you look into is optical flow, which looks at the motion of features due to the relative motion between the scene and the camera across frames. So say, for example, you have your rocket in the image and it moves upward: you will be able to estimate its motion vectors in frame 2 relative to frame 1.

Now if your object is moving at a certain velocity, you will be able to use these motion vectors to track and even predict the trajectory of the object in the next frame. A popular optical flow model that you could use for this is Lucas-Kanade.
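As a rough sketch of what this looks like in practice, here is sparse Lucas-Kanade flow with OpenCV; the video file and parameter values are placeholder assumptions:

```python
import cv2

cap = cv2.VideoCapture("launch.mp4")        # hypothetical footage
_, prev_frame = cap.read()
prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)

# Pick corner-like features (hopefully on the rocket) to follow.
p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                             qualityLevel=0.3, minDistance=7)

_, next_frame = cap.read()
next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)

# Estimate where each feature moved; status == 1 marks features found again.
p1, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None,
                                           winSize=(15, 15), maxLevel=2)
motion_vectors = p1[status == 1] - p0[status == 1]   # frame 2 minus frame 1
```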

Cool, so now you've got another shot at impressing Mr. Musk. He was only a little annoyed… that's right, only a little annoyed… that you lost his rocket. So to save Elon a buck or two, you decide to model this in simulation and test the viability of optical flow and mean shift (fast and furious shifting). You find out some interesting things from this experiment.

After running your simulations you discover that while the traditional methods have good tracking performance, they are computationally complex and, in the case of optical flow, prone to noise. And mean shift is unreliable if the object happens to go beyond the neighborhood region of interest. Move too fast, lose the track. And that's not even considering any type of significant occlusion.

So as much as you want to show this off to Mr. Musk, you have a gut feeling telling you that you can do better… way better. You go to your shrine and meditate for a bit, spend some time crunching the numbers and the reasons why you were better off working somewhere else. But then you stumble across an amazing technique used almost everywhere, known as the Kalman filter.

Kalman Filter

Now I have a whole video on what the Kalman filter is and how you can use it to catch Pokémon. But essentially its premise is this: say you are tracking a ball rolling in one dimension. You can easily detect it within each frame. That detection is your input signal, which you can rely on as long as there is a clear line of sight to the ball, with very low noise. Now during detection, you decide to simulate cloudy conditions using that fog machine you used at the last office party. You can still see the ball, but now your vision sensor has noise in it, decreasing your confidence in where the ball is.

Now let's make it a bit more complex and throw in another scenario where the ball travels behind a box which occludes it. How do you track something that you can't see? Well, this is where the Kalman filter comes in. Assuming a constant velocity model and a Gaussian distribution, you can guesstimate where the ball is based on a model of its motion. When the ball can be seen, you rely more on the sensor data and thus put more weight on it. When it is partially occluded, you place weight, or reliance, on both the motion model and the sensor measurement.

And if it's fully occluded, you shift most of the weight onto the motion data. And the best part of the Kalman filter is that it is recursive: we take the previous state to predict the current state, then use the measurements to update our prediction. Now of course there is a lot more to the Kalman filter than we can cover in just one article.
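To make that concrete, here is a toy one-dimensional constant-velocity Kalman filter in plain NumPy. The matrices and noise values are illustrative for the rolling-ball example, not DeepSORT's actual parameters:

```python
import numpy as np

dt = 1.0
F = np.array([[1, dt], [0, 1]])   # motion model: position += velocity * dt
H = np.array([[1, 0]])            # we observe position only
Q = np.eye(2) * 1e-3              # process noise (model uncertainty)
R = np.array([[0.5]])             # measurement noise (sensor / fog)

x = np.array([[0.0], [1.0]])      # initial guess: position 0, velocity 1
P = np.eye(2)                     # initial state uncertainty

def predict(x, P):
    # Roll the motion model forward; used alone while the ball is occluded.
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    # Blend prediction and measurement; gain K decides whom to trust more.
    y = z - H @ x                            # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    return x + K @ y, (np.eye(2) - K @ H) @ P

for z in [1.1, 2.0, None, None, 4.9]:        # None = ball behind the box
    x, P = predict(x, P)
    if z is not None:
        x, P = update(x, P, np.array([[z]]))
```

When the measurement is missing, only predict() runs, which is precisely the "shift the weight onto motion data" behavior described above.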

But by now you're probably wondering: Ritz, the title of this article is on DeepSORT, why are you going on about Kalman filters and traditional tracking algorithms from the good ol' days?! What's going on here, man?! Hold up, hold up, we are getting there, just bear with me. The Kalman filter is a crucial component in DeepSORT. Let's explore why.

The next launch is coming up soon, where multiple projectiles may need to be tracked, so you are required to find a way for your camera to track your designated rocket. The Kalman filter looks promising, but the Kalman filter alone may not be enough.

Simple Online and Realtime Tracking (SORT)

Enter SORT — Simple Online and Realtime Tracking. You learn that SORT comprises four core components:

  1. Detection
  2. Estimation
  3. Association, &
  4. Track Identity creation and destruction.

Hmmm, this is where it all starts to come together. You start with detection.

Detection

As you learned earlier, detection by itself is not enough for tracking. However, the quality of the detections has a significant impact on tracking performance. Bewley et al. used Faster R-CNN (VGG16) back in 2016.

Estimation

So we've got detections, now what the f*[Bleep] do we do with them? We now need to propagate the detections from the current frame to the next using a linear constant velocity model. Remember the homework you did earlier on the Kalman filter? Yes, that time was not wasted. When a detection is associated to a target, the detected bounding box is used to update the target state, where the velocity components are solved optimally via the Kalman filter framework.

However, if no detection is associated to the target, its state is simply predicted without correction using the linear velocity model.
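Here is what that estimation step can look like in code, using the filterpy library and a SORT-style state of [u, v, s, r] plus velocities, where (u, v) is the box center, s its area and r its aspect ratio. The numbers and the detection_matched flag are made up for illustration; this mirrors the paper's idea rather than reproducing the reference implementation:

```python
import numpy as np
from filterpy.kalman import KalmanFilter

kf = KalmanFilter(dim_x=7, dim_z=4)

# Constant-velocity transition: center and scale drift by their velocities.
kf.F = np.eye(7)
kf.F[0, 4] = kf.F[1, 5] = kf.F[2, 6] = 1.0

# We only measure [u, v, s, r] from the detector.
kf.H = np.zeros((4, 7))
kf.H[0, 0] = kf.H[1, 1] = kf.H[2, 2] = kf.H[3, 3] = 1.0

kf.x[:4] = np.array([[320.0], [240.0], [5000.0], [0.5]])  # first detection

kf.predict()                    # propagate the box to the next frame
detection_matched = True        # hypothetical flag from the association step
if detection_matched:
    kf.update(np.array([[325.0], [236.0], [5100.0], [0.5]]))
# else: the predicted state stands, with no correction this frame
```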

Target Association

In assigning detections to existing targets, each target’s bounding box geometry is estimated by predicting its new location in the latest frame. The assignment cost matrix is then computed as the intersection-over-union (IOU) distance between each detection and all predicted bounding boxes from the existing targets.

The assignment is solved optimally using the Hungarian algorithm. This works particularly well when one target occludes another. In your face Swooping Pigeon!!
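A compact sketch of this association step, with hypothetical boxes and an illustrative IOU_MIN threshold, might look like this (SciPy's linear_sum_assignment implements the Hungarian method):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # Intersection-over-union of two boxes given as [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical predicted track boxes and fresh detections.
tracks = [[100, 100, 150, 200], [300, 50, 360, 120]]
detections = [[104, 98, 152, 205], [298, 55, 358, 118]]

# Cost matrix = negative IOU, since the Hungarian solver minimizes cost.
cost = np.array([[-iou(t, d) for d in detections] for t in tracks])
track_idx, det_idx = linear_sum_assignment(cost)

IOU_MIN = 0.3   # reject weak matches below this overlap
matches = [(t, d) for t, d in zip(track_idx, det_idx)
           if -cost[t, d] >= IOU_MIN]
```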

Track Identity Life Cycle

When objects enter and leave the image, unique identities need to be created or destroyed accordingly. For creating trackers, we consider any detection with an overlap less than IOUmin to signify the existence of an untracked object. The tracker is initialized using the geometry of the bounding box with the velocity set to zero. Since the velocity is unobserved at this point the covariance of the velocity component is initialized with large values, reflecting this uncertainty. Additionally, the new tracker then undergoes a probationary period where the target needs to be associated with detections to accumulate enough evidence in order to prevent tracking of false positives.

Tracks are terminated if they are not detected for T_Lost frames; you can specify how many frames T_Lost should be. Should an object reappear, tracking will implicitly resume under a new identity.
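The bookkeeping itself is simple; below is a minimal sketch of the create/probation/terminate logic. The names T_LOST and N_INIT and their values are illustrative, not the paper's exact identifiers:

```python
T_LOST = 3    # frames a track may go undetected before termination
N_INIT = 3    # probation: consecutive hits required to confirm a track

class Track:
    _next_id = 1

    def __init__(self, box):
        self.id = Track._next_id      # fresh identity for a new object
        Track._next_id += 1
        self.box = box
        self.hits = 1                 # consecutive associated detections
        self.misses = 0               # frames since the last association

    def mark_hit(self, box):
        self.box, self.hits, self.misses = box, self.hits + 1, 0

    def mark_miss(self):
        self.misses += 1

    @property
    def confirmed(self):
        return self.hits >= N_INIT    # survived the probation period

    @property
    def dead(self):
        return self.misses > T_LOST   # terminate; a reappearance gets a new id
```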

DeepSORT

Wow, you are absolutely on fire now. All this SORT power is consuming you; you power up even more, surging, power level over 9000, screaming until you transform from SORT to your ultimate form: DeepSORT. Super Saiyans, be proud.

Now you're almost there. You explore your newfound powers and learn what separates SORT from the upgraded DeepSORT. In SORT we learnt that we use a CNN for detection, but what makes DeepSORT so different? Let's analyze the full title: Simple Online and Realtime Tracking (SORT) with a Deep Association Metric.

Hmmm, okay Ritz, I really hope you are going to explain this deep association metric.

We'll discuss this in the next article… hahah, just kidding. I can't leave you hanging like that. Especially when we are so close to completing the project for the Falcon 9 launch.

Okay, so where is the deep learning in all of this?

Well, we have an object detector that provides us with detections, the almighty Kalman filter tracking them and filling in missing tracks, and the Hungarian algorithm associating detections to tracked objects. You ask: so, is deep learning really required here?

Well, while SORT achieves good overall performance in terms of tracking precision and accuracy, and despite the effectiveness of the Kalman filter, it returns a relatively high number of identity switches and struggles to track through occlusions, changing viewpoints and so on.

So, to improve this, the authors of DeepSORT introduced another distance metric based on the “appearance” of the object.

The Appearance Feature Vector

So a classifier is built on our dataset and trained meticulously until it achieves reasonably good accuracy. Then we take this network and strip the final classification layer, leaving behind a dense layer that produces a single feature vector per detection. This feature vector is known as the appearance descriptor.
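As an illustration of the "strip the head" idea, here is how you might turn a stock classifier into a feature extractor in PyTorch. An off-the-shelf ResNet-18 stands in here; DeepSORT's actual appearance network is a small re-identification CNN trained on person data:

```python
import torch
import torchvision.models as models

# Load a trained classifier and chop off its classification head.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()    # strip the final classification layer
backbone.eval()

crop = torch.randn(1, 3, 224, 224)   # stand-in for a detected-object crop
with torch.no_grad():
    appearance = backbone(crop)      # the appearance descriptor
print(appearance.shape)              # torch.Size([1, 512])
```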

Now, how this works is that after the appearance descriptor is obtained, the authors use nearest neighbor queries in visual appearance space to establish the measurement-to-track association. Measurement-to-track association, or MTA, is the process of determining the relation between a measurement and an existing track. On the motion side, DeepSORT uses the Mahalanobis distance between predicted Kalman states and new measurements, as opposed to a plain Euclidean distance, because it accounts for the uncertainty in the state estimate.
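Here is a tiny sketch of a Mahalanobis gate, assuming mean and cov come from a track's Kalman filter. The values are toys, and the example is 2-D where DeepSORT's measurement is 4-D (for which it gates at the chi-square 95% threshold of 9.4877):

```python
import numpy as np

def mahalanobis_sq(z, mean, cov):
    # Squared Mahalanobis distance: scales each axis by its uncertainty.
    d = z - mean
    return float(d.T @ np.linalg.inv(cov) @ d)

mean = np.array([320.0, 240.0])              # predicted measurement
cov = np.array([[25.0, 0.0], [0.0, 100.0]])  # more vertical uncertainty
z = np.array([326.0, 252.0])                 # new detection

d2 = mahalanobis_sq(z, mean, cov)
print(d2, d2 < 5.9915)   # 5.9915 = chi-square 95% threshold for 2 dof
```

A detection 12 pixels off vertically counts for less here than 6 pixels off horizontally, because the filter is less certain in the vertical direction; a plain Euclidean distance would ignore that.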

Day of the Launch

So, tensions are mounting on the dawn of launch day. You quickly run your simulation and find that the deep extension to the SORT algorithm reduces the number of identity switches by 45% while achieving overall competitive performance at high frame rates.

Just like that you find yourself standing alongside Elon in the bunker moments before the commencement of the launch. You clench your fists and feel the sweat on your brow, saying,

"This is it… this is the moment of truth."

Elon raises the same question that you have on your mind

“So will it work?”

You stammer a little, but answer with a confident

"I'm sure it will."

Elon looks forward as the countdown begins: 3… 2… 1… We have lift-off. Your PTZ camera is set on the target as the rocket lifts up from the ground… So far so good, we have a track. However, the rocket passes through some clouds that partially occlude the target. The camera is still on target; the DeepSORT model is holding up quite well. Very well, actually, as you notice the swooping pigeon occluding the camera on multiple occasions without hindrance to the tracker.

YES! Mission Accomplished

Elon looks at you and extends his hand outwards to shake yours and says

“Well done, that was quite impressive.”

You can now relax and pop some champagne with the team. Job well done! That was quite an adventure, through which you have learned about object tracking, particularly the DeepSORT model. Just out of curiosity, you search the net for DeepSORT alternatives and create a quick comparison against three of them:

  • Tracktor++ which is pretty accurate, but one big drawback is that it is not viable for real-time tracking. Results show an average execution of 3 FPS. If real-time execution is not a concern, this is a great contender.
  • TrackR-CNN is nice because it provides segmentation as a bonus. But as with Tracktor++, it is hard to utilize for real-time tracking, having an average execution of 1.6 FPS.
  • JDE displayed decent performance of 12 FPS on average. It is important to note that the input size for the model is 1088x608, so accordingly, we should expect JDE to reach lower FPS if the model is trained on Full HD. Nevertheless, it has great accuracy and should be a good selection.
  • DeepSORT is the fastest of the bunch, thanks to its simplicity. It produced 16 FPS on average while still maintaining good accuracy, definitely making it a solid choice for multiple object detection and tracking.

If you are interested in enrolling in my best-selling courses on YOLOR and YOLOX, then sign up over here when they get released — Click Here


Ritesh Kanjee
CEO, Augmented Startups — M(Eng) Electronic Engineer, YouTuber with 100,000+ subscribers.