Getting Started With Multi-Object Tracking: Part 1

In this article, I’ll discuss some basic (frequently used) terminology that you should know to get started with Multi-Object Tracking. After this 8-minute read, you will be able to understand the majority of the literature published in this sub-domain of Computer Vision. I’ve listed a few recently published S.O.T.A. approaches at the bottom. Note: In the next article, I’ll cover the technical details of Object Re-Identification.

Achleshwar
cvrs.notepad
8 min read · Dec 28, 2020


The image is taken from FairMOT. (Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, Wenyu Liu)

Object Detection

Object Detection (OD) is a technique to localize (a regression problem) and identify (a classification problem) objects in images or videos. Detectors can be broadly categorized into two main types: one-stage detectors such as YOLO, SSD, RetinaNet, CenterNet, CornerNet, and ExtremeNet, and two-stage detectors such as the R-CNN family, SPPNet, and FPN, to name a few. Both take the same input (an image or a stack of video frames) and produce the same output (class scores and bounding boxes). The only major difference is that two-stage detectors have an extra stage for the extraction of object proposals. See the figure below.
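To make that shared input/output interface concrete, here is a minimal sketch that runs a pretrained two-stage detector (Faster R-CNN) from torchvision. The model choice and the score threshold are illustrative assumptions, not prescriptions.

```python
# A minimal sketch: running a pretrained two-stage detector (Faster R-CNN)
# from torchvision. The model choice and the 0.5 score threshold are
# illustrative, not the only reasonable settings.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)  # stand-in for a real RGB frame scaled to [0, 1]
with torch.no_grad():
    prediction = model([image])[0]  # list of images in, list of dicts out

# The output interface is shared with one-stage detectors:
# bounding boxes, class labels, and class scores.
keep = prediction["scores"] > 0.5
print(prediction["boxes"][keep], prediction["labels"][keep])
```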

Re-Identification

Re-Identification is the other core task of Object Tracking. Viewing tracking as a retrieval problem: given a detected person in frame T and a set of detected persons in frame T-1, we retrieve the most similar persons and consider the possibility of those detections forming a trajectory. In Deep Similarity Learning, we learn a function that measures how similar two objects are. Face recognition is a good example of a re-identification task.
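As a toy illustration of that retrieval view, the sketch below ranks a gallery of detections from frame T-1 against a query detection from frame T by cosine similarity of their embeddings. The random vectors and the 128-dimensional size are stand-ins for the output of a trained re-ID network.

```python
# A toy sketch of re-ID as retrieval. Real embeddings would come from a
# trained network; the random vectors and 128-d size are stand-ins.
import torch
import torch.nn.functional as F

query = torch.randn(1, 128)    # embedding of a detection in frame T
gallery = torch.randn(5, 128)  # embeddings of detections in frame T-1

similarity = F.cosine_similarity(query, gallery)  # one score per gallery entry
ranking = similarity.argsort(descending=True)
print("most similar detections from the previous frame:", ranking.tolist())
```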

Online Tracking vs Offline Tracking

In online tracking, we consider only the current frame and the previous frame and try to find object-object relations to form trajectories, whereas in offline tracking we take a whole stack of frames and make our predictions jointly. Online tracking is more useful in real-time applications such as autonomous driving, while offline tracking finds its applications in video querying and video analysis.

Anchor Boxes


During region proposal, we generate multiple bounding boxes with different scales and aspect ratios, centered on each pixel; these are the anchor boxes. IoU, also called the Jaccard index, measures the similarity of two bounding boxes: it is the ratio of the intersection area to the union area of the two boxes. During prediction, we remove near-duplicate predicted bounding boxes using Non-Max Suppression (NMS) to simplify the results.
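For concreteness, here is a plain-Python IoU for two axis-aligned boxes in (x1, y1, x2, y2) format; torchvision.ops.nms provides a ready-made NMS built on the same quantity.

```python
# IoU (Jaccard index) of two axis-aligned boxes in (x1, y1, x2, y2) format.
def iou(box_a, box_b):
    # intersection rectangle (empty if the boxes do not overlap)
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```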

One-Shot Learning

Face recognition using Siamese Networks (implementation) is the first thing that comes to mind when I hear One-Shot Learning, and it is helpful for the re-ID part of Object Tracking. Typical classification involves training our model on thousands of examples of a single class, but in one-shot learning the model is given just one (or a few) examples and must then make predictions on many unseen examples. It does this by learning a low-dimensional feature representation together with a distance function. To optimize such a network, we use triplet loss for learning face embeddings and contrastive loss for dimensionality reduction (both covered below).

In case of one-shot learning, a single exemplar of an object class is presented to the algorithm — Knowledge Transfer in Learning to Recognize Visual Objects’ Classes
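A minimal Siamese-style sketch, assuming a toy CNN as the shared embedding network: both inputs pass through the same weights, and a distance between the two embeddings decides the match. The architecture and the threshold are illustrative only.

```python
# A minimal Siamese-style sketch: one shared embedding network, two inputs,
# and a distance between the embeddings. The tiny CNN and the threshold
# are illustrative stand-ins, not a production setup.
import torch
import torch.nn as nn

embed = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 64),  # low-dimensional feature representation
)

a = torch.rand(1, 3, 96, 96)  # stand-ins for two face crops
b = torch.rand(1, 3, 96, 96)
distance = torch.pairwise_distance(embed(a), embed(b))
print("same identity:", bool(distance.item() < 1.0))  # threshold is illustrative
```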

ROI-Align

ROI-Align differs from ROI-Pooling in that the object proposals obtained from the RPN are not snapped to fit the feature map. Instead, each object proposal is divided into a number of bins, and a few points are sampled inside each bin. The values at those points are obtained using bilinear interpolation.

As in the figure below, the feature map is shown by dotted blue lines and the proposal by solid black lines. The object proposal is divided into 2x2 bins and the feature map into 5x5 cells; the proposal is not perfectly aligned with the feature map, and in each bin we sample 4 points. Now consider the top-left point: we obtain its value by bilinear interpolation from the 4 nearest values in the feature map, and similarly for the rest of the points in each bin. Finally, we take the maximum value in each bin; therefore, we obtain a 2x2 feature map (1 value from each bin).

Source: Mask R-CNN
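torchvision ships this operation as torchvision.ops.roi_align; a minimal sketch follows. Note that torchvision averages the sampled points in each bin rather than taking the maximum, so treat the pooling rule as an implementation detail that varies.

```python
# A minimal RoI-Align sketch with torchvision.ops.roi_align. The proposal
# keeps its fractional coordinates (no snapping to the feature-map grid);
# sampling_ratio=2 draws a 2x2 grid of bilinearly interpolated points per
# bin. torchvision averages those points rather than taking the max.
import torch
from torchvision.ops import roi_align

feature_map = torch.rand(1, 256, 5, 5)           # N, C, H, W
proposal = torch.tensor([[0.7, 0.8, 4.2, 4.1]])  # x1, y1, x2, y2 (fractional)

pooled = roi_align(feature_map, [proposal], output_size=(2, 2), sampling_ratio=2)
print(pooled.shape)  # torch.Size([1, 256, 2, 2])
```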

Skip Connections

As we switch to very deep neural networks, optimization gets harder: gradients can vanish and accuracy can degrade. By using a skip connection, we provide an alternative path for the gradient during backpropagation. It has been experimentally validated that these additional paths often help the model converge. Skip connections in deep architectures, as the name suggests, skip some layers in the neural network and feed the output of one layer as input to later layers (instead of only the next one).
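A minimal residual block sketch in PyTorch; the channel count is arbitrary. The addition `out + x` is the skip connection that gives gradients a direct path.

```python
# A minimal residual block: the identity skip connection adds the block's
# input to its output, giving gradients a direct path during backprop.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # skip connection: x bypasses both convs

block = ResidualBlock(64)
print(block(torch.rand(1, 64, 32, 32)).shape)  # (1, 64, 32, 32), shape preserved
```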

Deformable Convolution

A CNN kernel uses a fixed rectangular window (Figure (a)) to sample the input feature map at fixed locations, and the pooling layer uses the same rectangular-shaped kernel to reduce the spatial resolution at a fixed ratio.

In deformable convolutions, to account for the different scales of objects and give each location a receptive field matched to that scale, 2D offsets are added to the regular grid sampling locations of the standard convolution operation, thereby deforming the constant receptive field of the preceding activation unit. The offsets are learned from the preceding feature maps using additional convolutional layers. Thus the deformation depends on the input features in a local, dense, and adaptive manner.

(a) Simple Convolution (b) Deformable Convolution (c,d) Special cases of Deformable Convolution with scaling and rotation respectively.
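torchvision exposes this operation as torchvision.ops.DeformConv2d. In the sketch below, an ordinary convolution predicts the 2D offsets (two values per kernel sampling location) and the deformable convolution samples the input at the shifted locations; all sizes are illustrative.

```python
# A minimal deformable convolution sketch with torchvision.ops.DeformConv2d.
# A regular conv predicts 2 offset values (x, y) for each of the 3x3 = 9
# kernel sampling locations, hence 18 offset channels.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

x = torch.rand(1, 32, 16, 16)

offset_predictor = nn.Conv2d(32, 2 * 3 * 3, kernel_size=3, padding=1)
deform_conv = DeformConv2d(32, 64, kernel_size=3, padding=1)

offsets = offset_predictor(x)  # learned from the preceding feature map
out = deform_conv(x, offsets)
print(out.shape)  # torch.Size([1, 64, 16, 16])
```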

Encoder-Decoder

An encoder is a CNN that takes the input and outputs a feature map. The decoder is again a CNN (usually mirroring the encoder’s structure) that takes the feature vector from the encoder and produces the closest possible match to the actual input image. The encoder and decoder are trained together, and no labels are needed (hence unsupervised). The loss function computes the delta between the actual input and the reconstructed image, and the optimizer trains both encoder and decoder to lower this reconstruction loss.

Source: NYU’s Deep Learning Course
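A minimal convolutional autoencoder sketch, assuming 32x32 RGB inputs: the decoder mirrors the encoder with transposed convolutions, and an MSE reconstruction loss trains both jointly without labels.

```python
# A minimal convolutional autoencoder: the decoder mirrors the encoder,
# and the reconstruction loss compares the output against the raw input.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # 32x32 -> 16x16
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 8x8
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),    # 8x8 -> 16x16
    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 16x16 -> 32x32
)

x = torch.rand(8, 3, 32, 32)          # unlabeled images (hence unsupervised)
reconstruction = decoder(encoder(x))
loss = F.mse_loss(reconstruction, x)  # delta between input and reconstruction
loss.backward()                       # updates encoder and decoder together
```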

Bipartite Matching

Before getting into what bipartite matching is, let’s talk about bipartite graphs. In simple English, a bipartite graph is one that can be divided into two parts in such a way that all the edges go between the two parts. In other words, a graph G = (V, E) is bipartite if the vertex set V can be partitioned into two sets X and Y in such a way that no edge in E has both endpoints in the same set of the bipartition.

To give this some context: let X be the students in a class and Y be the presentation topics. We put an edge from a vertex x ∈ X to a vertex y ∈ Y if student x would like to present on topic y. No student in X has an edge to another student, only to the topics in Y that they want to choose. This is a bipartite graph. Of course, some students would want to present on more than one topic, so their vertex would have a degree greater than 1. Now suppose you are the coordinator of that class and you want to assign each student a single topic. For this task, you would pick a subset of the edges so that each student is assigned only one topic and no topic is matched with two students. This is known as bipartite matching.

The Hungarian Algorithm can be used to find a maximum-weight matching in a bipartite graph.
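In tracking, this matching shows up as data association, and scipy provides an implementation via linear_sum_assignment. The cost values in the sketch below are made up; in practice they might be 1 - IoU or an embedding distance.

```python
# Bipartite matching for data association. Rows: tracks from frame T-1,
# columns: detections in frame T. The cost values here are made up; in
# practice they could be 1 - IoU or an embedding distance.
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.9, 0.3]])

rows, cols = linear_sum_assignment(cost)        # optimal one-to-one assignment
print(list(zip(rows.tolist(), cols.tolist())))  # [(0, 0), (1, 1), (2, 2)]
```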

Contrastive Loss

Contrastive Loss is used for dimensionality reduction. Converting a highly complex input, like an image, to a low-dimensional vector is a difficult task, but recent research has shown that low-dimensional re-ID features help improve both the performance and the inference speed of MOT. Contrastive Loss was shown to be effective for such tasks in the 2006 paper ‘Dimensionality Reduction by Learning an Invariant Mapping’. It is calculated between pairs of inputs, and it penalizes the model based on whether the inputs belong to the same class or to different classes.
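A minimal sketch of that pairwise formulation: similar pairs (label 1) are pulled together, and dissimilar pairs (label 0) are pushed apart up to a margin. The embeddings and margin value are stand-ins.

```python
# A minimal contrastive loss sketch (Hadsell et al., 2006 formulation):
# label 1 = same class (pull together), label 0 = different (push apart).
import torch

def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    d = torch.pairwise_distance(emb_a, emb_b)
    pull = label * d.pow(2)                                     # similar pairs
    push = (1 - label) * torch.clamp(margin - d, min=0).pow(2)  # dissimilar pairs
    return 0.5 * (pull + push).mean()

emb_a, emb_b = torch.randn(4, 64), torch.randn(4, 64)
label = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(contrastive_loss(emb_a, emb_b, label))
```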

Triplet Loss

Triplet Loss was introduced by Google in the paper titled “FaceNet: A Unified Embedding for Face Recognition and Clustering.” It works much like Contrastive Loss, but instead of taking a pair of images it considers three: an anchor, a positive, and a negative. The loss penalizes the model so that the distance between matching examples is reduced while the distance between the anchor and the negative increases.

It requires face triplets, and then it minimizes the distance between an anchor and a positive sample of the same identity and maximizes the distance between the anchor and a negative sample of a different identity.

Source: Pinterest
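PyTorch ships this loss as nn.TripletMarginLoss; a minimal sketch with random stand-in embeddings follows. The margin of 0.2 matches the value used in the FaceNet paper.

```python
# A minimal triplet loss sketch. The embeddings are random stand-ins;
# margin=0.2 is the value used in the FaceNet paper.
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2)

anchor = torch.randn(8, 128)    # embeddings of some identities
positive = torch.randn(8, 128)  # same identities, different images
negative = torch.randn(8, 128)  # different identities
print(triplet(anchor, positive, negative))  # pulls a-p together, pushes a-n apart
```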

Group Loss

The Group Loss, proposed by the Dynamic Vision and Learning Group (TUM), takes into consideration the whole mini-batch of samples. The researchers argue that what matters for a classification network is that the output is correct, not that the embeddings of samples belonging to the same class are similar. Since each sample is classified independently, two images of the same class may have two distant embeddings that both allow for correct classification.

Source: Dynamic Vision and Learning Group

Given a mini-batch B consisting of n images, we consider the problem of assigning a class label to each image in B. The method needs a matrix X of soft label assignments for each image (computed via the softmax layer of a CNN) and a similarity matrix W encoding the similarity between the images. The high-level description of the method is as follows; a code sketch follows below:

1) Initialization: Initialize X, the image-label assignment using the softmax outputs of the neural network. Compute the n×n pairwise similarity matrix W using the neural network embedding.

2) Refinement: Iteratively, refine X considering the similarities between all the mini-batch images, as encoded in W, as well as their labeling preferences.

3) Loss computation: Compute the cross-entropy loss of the refined probabilities and update the weights of the neural network using backpropagation.

https://arxiv.org/abs/1912.00385
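A hedged sketch of the three steps above, assuming a replicator-dynamics-style update for the refinement (the exact update rule is in the paper linked above); all tensors here are random stand-ins.

```python
# A hedged sketch of the three Group Loss steps above. The refinement is a
# replicator-dynamics-style re-weighting; consult the paper for the exact
# update rule. All tensors are random stand-ins.
import torch
import torch.nn.functional as F

n, num_classes = 6, 3
X = torch.softmax(torch.randn(n, num_classes), dim=1)  # 1) init from softmax
W = torch.rand(n, n)                                   #    pairwise similarity
W.fill_diagonal_(0)                                    #    no self-support

for _ in range(5):                                     # 2) iterative refinement
    support = W @ X                      # how strongly each label is backed by
    X = X * support                      # similar images in the mini-batch
    X = X / X.sum(dim=1, keepdim=True)   # renormalize rows to probabilities

labels = torch.tensor([0, 1, 2, 0, 1, 2])
loss = F.nll_loss(torch.log(X + 1e-8), labels)  # 3) cross-entropy on refined X
```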

Thanks for going through this article. Looking forward to your feedback and constructive criticism. In the next article, I’ll cover the implementation of FairMOT. Feel free to connect with me on LinkedIn.

Publications:

  1. FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking
  2. Towards Real-Time Multi-Object Tracking
  3. 3D Multi-Object Tracking: A Baseline and New Evaluation Metrics
  4. Tracking without bells and whistles
