Getting Started With Multi-Object Tracking: Part 2

In this article, we’ll discuss some basics of object re-identification, which is an important component of MOT, and walk through a few recently published S.O.T.A. approaches. Note: In the next article, we’ll cover Multiple Object Detection.

Achleshwar
cvrs.notepad
10 min read · Oct 6, 2021


We — Pahwa Esha, Preyansh Agrawal, Dwij Mehta, Yogya Modi, and Achleshwar

Source: Unsplash

Introduction

Object re-identification (ReID) aims at retrieving an object of interest across multiple non-overlapping cameras. Given a query object of interest, the goal of ReID is to determine whether this object has appeared in another place at a distinct time, captured by a different camera, or even by the same camera at a different time instant. The query object can be represented by an image, a video sequence, or even a text description. The most common application of ReID is video surveillance: when multiple cameras are installed around a shopping mall, parking lot, university, or any other location, re-identification and tracking models let us follow the path a person takes and flag anything illegal or inappropriate. Vehicle surveillance is another application, where vehicles can be tracked and re-identified in areas where access is restricted. Due to the urgent demand for public safety and the increasing number of surveillance cameras, person ReID is imperative in intelligent surveillance systems, with significant research impact and practical importance.

Source: Mang Ye et al.

There are 3 main modules in a ReID model:

1. Feature Representation Learning, which focuses on developing feature construction strategies.

2. Deep Metric Learning, which aims at designing the training objectives with different loss functions such as triplet loss, contrastive loss, etc., and different sampling strategies (a minimal triplet-loss sketch follows this list).

3. Ranking Optimization, which concentrates on optimizing the retrieved ranking list.
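As a concrete illustration of the metric-learning module, here is a minimal triplet-loss sketch in PyTorch; the margin value and the embedding size are arbitrary choices for illustration, not tied to any paper discussed below.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Pull same-identity embeddings together, push different identities apart."""
    d_pos = F.pairwise_distance(anchor, positive)  # anchor vs. same identity
    d_neg = F.pairwise_distance(anchor, negative)  # anchor vs. different identity
    return F.relu(d_pos - d_neg + margin).mean()

# toy usage: a batch of 8 random 128-dim embeddings per role
anchor, positive, negative = (torch.randn(8, 128) for _ in range(3))
print(triplet_loss(anchor, positive, negative))
```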

Datasets

Some of the popular publicly available datasets for this problem include LPW (Labelled Pedestrian in the Wild), Market-1501, PRID 2011, and MARS (Motion Analysis and Re-identification Set). Labelled Pedestrian in the Wild is a video-based dataset collected in three scenes on the street; the identities include adults and children, and the poses vary from running to cycling. The dataset has been manually cleaned up to remove failed detections. Market-1501 contains a large number of identities, and each identity has several images from six disjoint cameras. PRID 2011 has 385 trajectories from one camera view and 749 trajectories from another; only 200 people appear in both views. The MARS dataset is an extended version of Market-1501 and is the first large-scale video-based person re-ID dataset.

Building a Custom Dataset

If you want to solve this problem on your own custom dataset, follow these simple instructions. The first step will probably be the toughest and most time-consuming one, and we can’t really help you with that.

Disclaimer: We’ll be using Torchreid for this blog.

Step 1 — Take your camera, set up the scene, plan how you’re going to record the videos (videos are better than taking pictures because you can later take screenshots at different time intervals, “manually hehe”), and then go get ’em videos.

Step 2 — Here we assume that you have already collected your dataset. Now split it into three splits (not train-val-test, but train-query-gallery). You can follow the directory structure shown below. We used 4 camera views to collect our dataset.
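A layout along these lines is what we mean (folder and file names are illustrative; one subfolder per camera view):

```
custom_dataset/
├── train/      # images used for training
│   ├── cam1/  cam2/  cam3/  cam4/
├── query/      # probe images used at test time
│   ├── cam1/  cam2/  cam3/  cam4/
└── gallery/    # images the queries are matched against
    ├── cam1/  cam2/  cam3/  cam4/
```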

Step 3 — Follow the code below. (Here PIDS refers to pig IDs and CIDS refers to camera IDs; both are Python dictionaries.)
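Here is a minimal sketch of what such a dataset class can look like with Torchreid’s ImageDataset API. The PigReID class name, the file-naming scheme (e.g. pig_01_frame0001.jpg), and the example PIDS/CIDS entries are illustrative assumptions rather than our exact code.

```python
import glob
import os.path as osp

from torchreid.data import ImageDataset

# Illustrative mappings, as mentioned above: raw names -> integer ids.
PIDS = {'pig_01': 0, 'pig_02': 1}                     # pig identity -> label
CIDS = {'cam1': 0, 'cam2': 1, 'cam3': 2, 'cam4': 3}   # camera folder -> camera id


class PigReID(ImageDataset):
    dataset_dir = 'custom_dataset'

    def __init__(self, root='', **kwargs):
        self.root = osp.abspath(osp.expanduser(root))
        self.dataset_dir = osp.join(self.root, self.dataset_dir)

        train = self._process_dir(osp.join(self.dataset_dir, 'train'))
        query = self._process_dir(osp.join(self.dataset_dir, 'query'))
        gallery = self._process_dir(osp.join(self.dataset_dir, 'gallery'))

        super(PigReID, self).__init__(train, query, gallery, **kwargs)

    def _process_dir(self, dir_path):
        # Torchreid expects each sample as a (img_path, pid, camid) tuple.
        data = []
        for img_path in glob.glob(osp.join(dir_path, '*', '*.jpg')):
            cam_name = osp.basename(osp.dirname(img_path))        # e.g. 'cam1'
            pid_name = osp.basename(img_path).split('_frame')[0]  # e.g. 'pig_01'
            data.append((img_path, PIDS[pid_name], CIDS[cam_name]))
        return data
```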

Step 4 — Register your dataset in Torchreid and create a datamanager (i.e., Torchreid’s dataloader), as sketched below.
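Continuing the illustrative PigReID class from the previous sketch (dataset name, paths, and image sizes are example values):

```python
import torchreid

# Register the dataset class defined above under a name of our choosing.
torchreid.data.register_image_dataset('pig_reid', PigReID)

# The datamanager wraps the train/query/gallery splits into PyTorch dataloaders.
datamanager = torchreid.data.ImageDataManager(
    root='path/to/data/root',   # parent folder of custom_dataset/
    sources='pig_reid',
    height=256,
    width=128,
    batch_size_train=32,
    batch_size_test=100,
    transforms=['random_flip', 'random_crop']
)
```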

Feature Representation Learning

Now that we have our dataset ready, we will be working on generating feature maps. In this blog, we’ll discuss three architectures/approaches that we found quite unique and efficient.

a. MuDeep: Multi-scale Deep Learning Architectures for Person Re-identification — Learning discriminative feature representations is a vital task in any re-identification problem. Global features such as gender or size may be easily distinguishable, but simple 2D CNN models can find it strenuous to differentiate individuals based on local features such as hair color, the type of handbag, or the type of shoes a person is wearing. To add to this, the said features are often not of the same size (for example, shoes of different sizes), which makes this task a bit tedious. Fortunately, MuDeep provides an efficient and effective way to tackle this.

Source: Xuelin et al.

MuDeep is a novel person re-identification approach, introduced in 2017, that tackles this by bringing in two multi-scale stream layers that use different filter sizes in the convolution operation to handle discriminative features of varying sizes. It also introduces a saliency-based fusion layer, which helps the model focus on the important features of the foreground while ignoring background information.

Each input contains a pair of images: one query image and one gallery image. Pre-processed images are then fed into four branches (four streams) of the model, which consists of four components, namely Tied Convolution Layers, Tied Multi-Scale Layers A and B, a Tied Reduction Layer, and a Saliency-based Learning Fusion Layer. The details of these layers are given in Table 1 here. It should be noted that filters of size 1x1, 3x3, and 5x5 are used to handle multi-scale features. The 5x5 filter is further factorized into two stacked 3x3 filters, which increases the width and depth of the layer. In Multi-Scale-B, the 3x3 filter is divided into 3x1 and 1x3 filters, which reduces the overall computational cost of a 3x3 filter. Adding a reduction layer increases the effectiveness of the approach: it produces a vector of the same dimensionality as a max-pooling layer but without loss of information, by applying convolution operations across the different streams.

For all four streams in the Multi-Scale B network, a feature map of size (39 x 14 x 256) is obtained. In the saliency layer, the output value for the jth channel is calculated using the following formula:
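Written out roughly in our own notation from the description that follows, the fusion is a weighted sum over the four streams:

F_j = Σ_i α_ij · x_ij   (i = 1…4 indexes the streams, j = 1…256 the channels, and x_ij is the jth channel of the ith stream’s feature map)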

Hence, for each channel, a summation across all streams is computed. α_ij is the scalar weight for the jth channel of the ith feature map; it is initially assigned a random value and is then learned at each iteration while training the network.

The resultant vector is then passed to a 4096-dimensional FC layer, followed by a softmax classification subnet for both the query and the gallery image (containing as many nodes as there are unique identities) and a verification subnet, which decides whether the two input images belong to the same person using the following formula:

Difference D = (G1 − G2) .* (G1 − G2), where .* denotes element-wise multiplication.
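As a toy illustration in PyTorch (G1 and G2 here are random stand-ins for the two 4096-dimensional FC outputs):

```python
import torch

G1, G2 = torch.randn(4096), torch.randn(4096)
D = (G1 - G2) * (G1 - G2)  # element-wise squared difference fed to the verification subnet
```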

The above approach achieves strong results on the CUHK01 and VIPeR datasets, with a top-5 accuracy of 97% on the former and 74% on the latter.

b. Multi-Level Factorisation Net (MLFN) — MLFN is a novel network architecture that factorizes the visual appearance of a person into latent discriminative factors at multiple semantic levels without manual annotation. MLFN achieves state-of-the-art results on three Re-ID datasets, as well as compelling results on the general object categorization dataset CIFAR-100.

Most recent person Re-ID approaches employ deep neural networks (DNNs) to learn view-invariant discriminative features, i.e., features that remain the same across multiple views of a particular person, such as gender or shirt color. For matching, the features are typically extracted from the very top feature layer of a trained model. A problem thus arises: a DNN comprises multiple feature extraction layers stacked one on top of another, and it is widely acknowledged that, when progressing from the bottom to the top layers, the visual concepts captured by the feature maps tend to become more abstract and of a higher semantic level. However, for Re-ID purposes, discriminative factors at multiple semantic levels should ideally be preserved in the learned features.

Source: Xiaobin Chang et al.

As we can see in the diagram above, there are N MLFN blocks, denoted by Bn. Within each Bn there are two key components: multiple Factor Modules (FMs) and a Factor Selection Module (FSM). Each FM is a subnetwork with multiple convolutional and pooling layers of its own, powerful enough to model a latent factor at the corresponding level n. Each block Bn consists of Kn FMs with identical network architecture.

The training procedure of MLFN for Person Re-ID follows the standard identity classification paradigm where each person’s identity is treated as a distinct class for recognition. A final fully connected layer is added above the representation R that projects it to a dimension matching the number of training classes (identities), and the cross-entropy loss is used. MLFN is then end-to-end trained. It discovers latent factors with no supervision other than person identity labels for the final classification loss. During testing, appearance representations R are extracted from gallery and probe images, and the L2 distance is used for matching.
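A minimal sketch of that test-time matching step, with random tensors standing in for the representations R (the feature dimension and gallery size are illustrative, not from the paper):

```python
import torch

probe_feats = torch.randn(10, 2048)     # 10 probe images, 2048-dim representations R
gallery_feats = torch.randn(100, 2048)  # 100 gallery images

# L2 (Euclidean) distance between every probe and every gallery image
dist = torch.cdist(probe_feats, gallery_feats, p=2)

# For each probe, gallery images sorted from best to worst match
ranking = dist.argsort(dim=1)
```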

MLFN is evaluated on three person Re-ID benchmarks, Market-1501, CUHK03, and DukeMTMC-reID, and achieves state-of-the-art performance on all of them: an R1 score of 90 and 74.3 mAP on Market-1501, and 81 R1 and 62.8 mAP on DukeMTMC-reID.

c. Omni-Scale Network (OSNet) — As mentioned before, neural networks face two main challenges in performing person re-identification accurately. One, variations in lighting and camera perspective create large dissimilarities even between very similar target IDs. Two, different targets can look very similar depending on the clothes and accessories they wear and their distance from the cameras.
The solution to these problems is to learn distinguishing features at multiple scales. Enter OSNet: a re-ID convolutional neural network that extracts features at multiple scales and combines them, providing state-of-the-art performance while being extremely lightweight compared to similar networks.

Conv vs. Lite Conv. Source: Kaiyang Zhou et al.

The lightweight part comes from using pointwise → depthwise convolutions (termed ‘Lite’ convolutions from now on) instead of depthwise → pointwise convolutions, as shown in the figure above. The computational cost is reduced from h×w×k²×c×c’ to h×w×(k²+c)×c’, and the number of parameters from k²×c×c’ to (k²+c)×c’, where the input tensor has dimensions h×w×c (height × width × channels), k is the kernel size, and c’ is the output channel width.
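A rough PyTorch sketch of such a Lite convolution (a simplified stand-in, not the official OSNet code; the BN/ReLU placement is our choice):

```python
import torch
import torch.nn as nn

class LiteConv3x3(nn.Module):
    """Pointwise (1x1) followed by depthwise (3x3) convolution, as described above."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.depthwise = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                                   padding=1, groups=out_channels, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.depthwise(self.pointwise(x))))

x = torch.randn(1, 64, 32, 16)        # (batch, c, h, w)
print(LiteConv3x3(64, 128)(x).shape)  # torch.Size([1, 128, 32, 16])
```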

This lite convolution is used in a residual bottleneck to form the building block of OSNet.

One unit of a building block in the OSNet.

These building blocks are stacked in different quantities in different streams to introduce the multi-scale aspect of learning within one single network, as shown here.

The OSNet Architecture. Source: Kaiyang Zhou et al.

A stack of t Lite layers results in a receptive field of (2t+1) x (2t+1). Therefore, the network above implements receptive fields of 3, 5, 7, and 9 in the four streams seen in the figure.

Source: Kaiyang Zhou et al.

The most important addition is the Aggregation Gate, used to combine all these features dynamically, where the combination weights are themselves produced by a learnable sub-network. Due to this gate, the final residual that is added at the very last ReLU layer (just before the output) looks like:

x̃ = Σ_t G(x_t) ⊙ x_t, summed over the T streams, where G(x_t) is a vector whose length spans the entire channel dimension of x_t and ⊙ denotes the Hadamard product. G is implemented as a mini-network composed of a non-parametric global average pooling layer and a multi-layer perceptron (MLP) with one ReLU-activated hidden layer, followed by a sigmoid activation. To reduce parameter overhead, the hidden dimension of the MLP is reduced by a reduction ratio, which is set to 16.
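A rough sketch of what such a gate could look like in PyTorch (a simplified stand-in, not the official implementation; the names are ours):

```python
import torch
import torch.nn as nn

class AggregationGate(nn.Module):
    """Global average pooling + bottleneck MLP + sigmoid, producing per-channel
    weights that gate one stream's feature map x_t."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # non-parametric global average pooling
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck with reduction ratio 16
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.mlp(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return w * x                                     # Hadamard product G(x_t) ⊙ x_t

# The block's residual is then roughly y = ReLU(x + sum_t gate(x_t)),
# with the gate shared across the streams t.
```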

(more about this in “Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In ECCV, 2018” and “Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018”)

When trained from scratch, OSNet gave the best R1 and mAP scores at the time of publication (2019) on four of the most widely used datasets: Market-1501, CUHK03, Duke, and MSMT17. Further, the architecture has drawn wide interest as a backbone for visual recognition in general, not just re-ID.

Building Model using Torchreid

Good for us, we don’t have to code these convoluted models from scratch. Torchreid comes with implementations of these models along with pre-trained weights! You can follow the code guidelines below to build a model on your own. We used OSNet because of its light weight and its domain generalization ability.
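For reference, a rough Torchreid training flow along the lines of the library’s README (hyperparameters and the save directory are illustrative; datamanager is the one created in Step 4 above):

```python
import torchreid

# Build OSNet with pretrained weights; num_classes = identities in our train split.
model = torchreid.models.build_model(
    name='osnet_x1_0',
    num_classes=datamanager.num_train_pids,
    loss='softmax',
    pretrained=True
).cuda()

optimizer = torchreid.optim.build_optimizer(model, optim='adam', lr=0.0003)
scheduler = torchreid.optim.build_lr_scheduler(optimizer, lr_scheduler='single_step', stepsize=20)

engine = torchreid.engine.ImageSoftmaxEngine(
    datamanager, model, optimizer=optimizer, scheduler=scheduler, label_smooth=True
)

engine.run(save_dir='log/osnet', max_epoch=60, eval_freq=10, print_freq=10)
```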

Activation Maps

There’s another great functionality available in Torchreid: visualizing activation maps. This is extremely helpful for explainability, i.e., understanding where your model focuses when predicting the identity of the input. You need to set test_only and visrank to True while running your engine.
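Continuing the same hypothetical engine from the previous snippet, the evaluation-only run would look roughly like this:

```python
# Skip training, evaluate the model, and dump the visualizations to save_dir.
engine.run(
    test_only=True,     # evaluation only, no training
    visrank=True,       # write the visualizations described above
    visrank_topk=10,    # number of gallery matches drawn per query
    save_dir='log/osnet'
)
```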

Thanks for reading this. Hope it helps!
