Visual Perception for Self-Driving Cars! Part 2: Multiple Object Tracking

Learn concepts by coding! Explore how deep learning and computer vision are used for different visual tasks in autonomous driving.

Shahrullohon Lutfillohonov
6 min read · Aug 8, 2022

This article is part of a series. Check out the full series: Part 1, Part 2, Part 3, Part 4, Part 5, Part 6!

In Part 1, I introduced the topic of visual perception for self-driving cars and talked about 2D object detection. (If you haven’t already read Part 1, read it now!)

Now we know how to use a detection model in our autonomous vehicles. Unfortunately, merely detecting objects on the road does not help much on its own; it is, however, the starting point for tracking them. Only with tracking can our car sense its surroundings and respond appropriately while moving on the road.

In this article, we will consider one of the most essential topics for autonomous driving: object tracking. First, I will give a brief introduction to object tracking, then dive into its use case in self-driving cars.

A Brief Introduction To Object Tracking

Tracking an object means determining the target object’s current state using prior knowledge. In this article, we will implement one of the SOTA tracking algorithms, StrongSORT, jointly with YOLOv5, and test it on the MCMOT dataset by 42dot.ai.

Multi-Camera Multi-Object Tracking — Source link

There are two main families of tracking algorithms: 1) Single Object Tracking (SOT) and 2) Multiple Object Tracking (MOT). In this article, we will consider only the MOT approach, as it is directly related to our autonomous driving task.

Multiple Object Tracking (MOT)

Multiple Object Tracking (MOT) is the computer vision task of tracking every object of interest in a sequence of images or a video. MOT usually follows a paradigm known as tracking-by-detection: an independent detector is first applied to every frame to produce candidate detections, and a tracker is then run on that set of detections. A unique ID is assigned to each detected object’s bounding box, and estimation algorithms track each moving object over time without losing its assigned ID. In summary, at a high level the majority of MOT algorithms share the following three steps:

  • Detect objects
  • Create a unique ID for each detected object
  • Track objects as they move, maintaining the assigned IDs (a schematic loop is sketched after the figure below)
Object Tracking for AD: each object is given unique ID — Image by Author
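To make these three steps concrete, here is a minimal, self-contained sketch of a tracking-by-detection loop. It is not StrongSORT (there is no Kalman filter or appearance model here), only a greedy nearest-centroid matcher over toy detections; greedy_match, the 50-pixel threshold, and the toy frames are all illustrative assumptions.

# A minimal sketch of the detect -> assign-ID -> track loop above,
# using greedy centroid matching (real trackers use IoU/appearance + Kalman filters).
import numpy as np

def greedy_match(tracks, dets, max_dist=50.0):
    """Greedily match each track's centroid to the nearest unmatched detection."""
    matched, unmatched = [], list(range(len(dets)))
    for tid, centroid in tracks.items():
        if not unmatched:
            break
        dists = [np.linalg.norm(np.subtract(centroid, dets[d])) for d in unmatched]
        j = int(np.argmin(dists))
        if dists[j] < max_dist:
            matched.append((tid, unmatched.pop(j)))
    return matched, unmatched

next_id, tracks = 0, {}
# Toy "detections" per frame, standing in for a real detector's output (step 1)
frames = [[(100, 100), (300, 300)], [(105, 102), (298, 305), (500, 50)]]
for dets in frames:
    matched, unmatched = greedy_match(tracks, dets)
    for tid, d in matched:        # step 3: update matched tracks, keeping their IDs
        tracks[tid] = dets[d]
    for d in unmatched:           # step 2: unseen object -> assign a fresh unique ID
        tracks[next_id] = dets[d]
        next_id += 1
    print(tracks)

In the second frame, the two existing objects keep IDs 0 and 1 even though the detections arrive in a different order, and the newcomer at (500, 50) receives the new ID 2.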

In the following sections, we will touch on some advanced topics. If any of them is unfamiliar, please refer to the attached links for further information!

StrongSORT: Make DeepSORT Great Again

As the title implies, StrongSORT is a stronger successor to the DeepSORT algorithm, which is itself an extension of the SORT (Simple Online and Realtime Tracking) technique. While DeepSORT integrates deep learning into SORT to improve its performance, StrongSORT equips DeepSORT with various advanced techniques to push that performance further. To understand the StrongSORT algorithm, we first review the structure of DeepSORT and then detail the new methods attached to it.

DeepSORT

DeepSORT is one of the most widely used real-time object tracking methods. It combines deep learning with classical approaches, namely the Kalman filter and the Hungarian algorithm, for motion prediction and data association, estimating each object’s location and tracking it with higher accuracy and fewer failures. DeepSORT comprises the following components:

  • Bounding Box Prediction: detect the objects of interest in the image. This can be done with any object detection algorithm; we showed the use of YOLOv7 for object detection in our previous article.
  • State Estimation: a Kalman filter is applied to predict the future location of the target by optimally estimating its velocity components. The bounding box detected in the previous step is then used to update the target state.
  • Target Association: the Kalman filter only estimates the object’s new location, which still has to be matched to a detection. DeepSORT introduces a matching cascade that solves a series of subproblems, giving priority to more frequently seen objects to encode the notion of probability spread in the association likelihood. Intersection over union (IoU) association is then run in the final matching stage to assign the remaining detections to existing targets, with the Hungarian algorithm solving the assignment optimally (a minimal sketch of this step follows the figure below). This helps deal with occlusions and maintain the IDs. Mahalanobis distance and a deep appearance descriptor are introduced as association metrics in the DeepSORT algorithm. The full structure is given in the diagram below.
Architecture of DeepSORT — Image by Author
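To make the association step more tangible, here is a minimal, self-contained sketch of the final IoU matching stage using SciPy’s linear_sum_assignment (the Hungarian method). It deliberately omits the matching cascade, Mahalanobis gating, and appearance features of the full DeepSORT pipeline; the 0.3 threshold and the toy boxes are illustrative assumptions.

# Simplified IoU association: build a (1 - IoU) cost matrix between the
# tracks' predicted boxes and the new detections, then solve the assignment
# with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(track_boxes, det_boxes, iou_threshold=0.3):
    """Match predicted track boxes to detections; returns (track, det) index pairs."""
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)  # minimize total (1 - IoU)
    # keep only matches whose overlap clears the IoU threshold
    return [(int(t), int(d)) for t, d in zip(rows, cols)
            if cost[t, d] <= 1.0 - iou_threshold]

# Two existing tracks, two new detections (deliberately shuffled)
tracks = [(100, 100, 150, 150), (300, 300, 360, 360)]
dets = [(305, 298, 362, 358), (102, 103, 149, 152)]
print(associate(tracks, dets))  # track 0 <-> det 1, track 1 <-> det 0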

StrongSORT

StrongSORT is built by equipping DeepSORT with a variety of advanced approaches that demonstrate the effectiveness of its paradigm.

Notably, the original simple CNN in the deep appearance descriptor is replaced by BoT with a ResNeSt50 backbone. Furthermore, an exponential moving average (EMA) of each track’s appearance embedding is applied to enhance the matching quality and reduce the time consumption.
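As a rough illustration, the EMA update of a track’s embedding e with a new detection feature f is e ← α·e + (1−α)·f. A minimal sketch follows, assuming unit-normalized embeddings; the 512-dimensional size and the re-normalization step are assumptions for illustration, while α = 0.9 follows the StrongSORT paper.

# EMA update of a track's appearance embedding, replacing a per-track feature bank.
import numpy as np

def ema_update(track_emb, det_emb, alpha=0.9):
    """Blend a track's stored embedding with the newest detection's ReID feature.
    alpha is the EMA momentum (0.9 in the StrongSORT paper)."""
    e = alpha * track_emb + (1.0 - alpha) * det_emb
    return e / np.linalg.norm(e)  # re-normalize (illustrative assumption)

# Usage: update a stored track embedding with a fresh ReID feature
track_emb = np.random.randn(512); track_emb /= np.linalg.norm(track_emb)
det_emb = np.random.randn(512); det_emb /= np.linalg.norm(det_emb)
track_emb = ema_update(track_emb, det_emb)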

Next, the enhanced correlation coefficient (ECC) model is adopted for camera motion compensation, since ECC is invariant to photometric distortions. Moreover, the vanilla Kalman filter is sensitive to low-quality detections and ignores information about the scale of the detection noise. It is therefore replaced by the Noise Scale Adaptive (NSA) Kalman filter, which adaptively modifies the measurement noise scale according to the quality of the object detection.
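The NSA idea fits in one line: the measurement noise covariance is scaled by the detection confidence, R̃ = (1 − c)·R, so a confident detection shrinks the noise and pulls the Kalman update harder. A minimal sketch, where the diagonal of R over an (x, y, aspect, height) measurement is an illustrative assumption:

# NSA Kalman noise adaptation: R_nsa = (1 - c) * R
import numpy as np

def nsa_measurement_noise(R, confidence):
    """Scale the measurement noise covariance by detection confidence c.
    High confidence -> small noise -> the update trusts the detection more;
    low confidence -> large noise -> the detection is down-weighted."""
    return (1.0 - confidence) * R

R = np.diag([1.0, 1.0, 0.1, 0.1])  # illustrative noise over (x, y, aspect, height)
print(nsa_measurement_noise(R, confidence=0.95))  # small -> strong update
print(nsa_measurement_noise(R, confidence=0.30))  # large -> cautious update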

Lastly, the matching cascade is replaced by a vanilla global linear assignment, as the cascade’s additional prior constraints would limit the matching performance. The updated version of the DeepSORT architecture, known as StrongSORT, is shown below. New components are given in orange boxes.

StrongSORT: Make DeepSORT Great Again — Image by Author

StrongSORT Implementation

Like other tracking algorithms, StrongSORT can be applied to various real-life scenarios. As promised earlier, we will implement StrongSORT for self-driving cars. However, we are not going to train the model from scratch; instead, we will use pre-trained models and run inference on a custom dataset.

Dataset — MCMOT: Multi-Camera Multi-Object Tracking

42dot.ai has recently released the MCMOT dataset for multi-camera multi-object tracking. It provides annotations that assign unique track IDs to the objects captured by three frontal cameras. The camera at the front center has a field of view (FOV) of 60 degrees, and the two cameras on the front sides (left and right) have a FOV of 120 degrees. All three front cameras have a resolution of 1920x1208. Full descriptions are given on the dataset website.

Model Inference

Let’s prepare the dataset and run the pre-trained models on it.

Create a new environment and install StrongSORT dependencies

It is always helpful to create a virtual environment to manage dependencies and isolate our project.

# Create new conda environment
conda create -n (your env name) python=3.9 jupyter

and activate it

# activate the conda environment
conda activate (your env name)

Now let’s install StrongSORT with YOLOv5 for the detection part. Thanks to Mikel Brostrom for sharing such an awesome repo! More details can be found in the repository.

# Download the StrongSORT repository
!git clone --recurse-submodules https://github.com/mikel-brostrom/Yolov5_StrongSORT_OSNet.git
# enter the repository, then install dependencies
%cd Yolov5_StrongSORT_OSNet
!pip install -r requirements.txt

Select the object detection and ReID models

We will use yolov5m pre-trained weights for object detection and osnet_x0_25_market1501 weights for tracking.

!python track.py --source {path for data} --yolo-weights yolov5m.pt --strong-sort-weights osnet_x0_25_market1501.pt --save-vid

Depending on your hardware, tracking the objects may take some time.

Now, let’s see the results.

Inference results from the StrongSORT object tracking algorithm — Image by Author

Conclusion

StrongSORT has shown fairly good and promising performance for tracking multiple objects for self-driving cars. We have not shown the full details of a custom code implementation, as there are some really good repositories on GitHub to refer to; the one we used is by Mikel Brostrom.

I hope you enjoyed reading. If you have any question or suggestion, please feel free to leave a comment. You can also find me on LinkedIn or email me directly. I’d love to hear from you!

We will discuss visual perception for self-driving cars further in the following posts.
