PAMTRI: Pose-Aware Multi-Task Learning for Vehicle Re-Identification
Using Highly Randomized Synthetic Data

A radical approach to solving the vehicle re-identification problem, from NVIDIA Labs

Deval Shah
VisionWizard
10 min read · Aug 21, 2020


Today, we will discuss an unorthodox paper by NVIDIA Labs on vehicle re-identification.

I would like to point out that the paper doesn’t propose a novel research idea; instead, it presents an interesting engineering approach to the vehicle re-identification problem.

Let’s jump right in.

Table Of Contents

  1. Introduction
  2. Novel Contributions
  3. Literature Review
  4. Approach
  5. Datasets
  6. Results
  7. Code
  8. Conclusion

Introduction

The paper attempts to solve the long-standing problem of vehicle re-identification.

What is vehicle re-identification? Why is it required?

Consider vehicle re-identification as a facial recognition problem, except for vehicles: you see a vehicle, look for a similar vehicle in a database, and decide whether you have a match or not.

There is a plethora of applications in the vehicle ReID (re-identification) space, like matching vehicles at the entry and exit of parking lots, automatic toll collection, criminal investigations, traffic control, etc.

What challenges do researchers face while solving the vehicle ReID problem?

Vehicle ReID is an open research problem, with many researchers working extremely hard to devise new algorithms to solve it. However, there are many challenges to overcome before it works reliably in diverse situations.

Challenges:-

  1. high intra-class variability (caused by the dependency of shape and appearance on viewpoint), and
  2. small inter-class variability (caused by the similarity in shape and appearance between vehicles produced by different manufacturers).[1]
  3. Due to the limited number of color options and vehicle models in any particular region, matching is hard: two different vehicles may look nearly identical.

Despite these challenges, there has been great research in the past couple of years to effectively tackle the problem at scale.

Why can’t we use license plates to match vehicles?

In many scenarios, the camera viewpoint or low resolution makes it impossible to read license plates. For example, CCTV cameras are often mounted at an oblique angle on street lights. In such cases, we need a matching mechanism that uses distinctive car features like color, headlights, bonnet, etc. Hence, vehicle re-identification comes into the picture.

Novel Contributions

The authors point out that the person ReID problem has seen tremendous advancements compared to vehicle ReID, thanks to the large amount of annotated person data available.

To distinguish one vehicle from another, the authors identify a few cues that are distinguishable across different vehicles. They believe that the key to vehicle ReID is to exploit viewpoint-invariant information such as color, type, and deformable shape models encoding pose. [1]

In this work, the authors propose a novel framework named PAMTRI, for Pose-Aware Multi-Task Re-Identification.

1. PAMTRI embeds keypoints, heatmaps and segments from pose estimation into the multi-task learning pipeline for vehicle ReID, to guide the network to pay attention to viewpoint-related information. [1]

2. PAMTRI is trained with large-scale synthetic data that include randomized vehicle models, colors, and orientations under different backgrounds, lighting conditions, and occlusions. Annotations of vehicle identity, color, type, and 2D pose are automatically generated for training. The dataset is created using the Unity engine. [1]

3. The proposed method achieves significant improvement over the state-of-the-art on two mainstream benchmarks: VeRi and CityFlow-ReID. [1]

Additional experiments validate that the unique architecture exploiting explicit pose information, along with the use of randomized synthetic data for training, is key to the method’s state-of-the-art results.

Literature Review

The first well-known approach to vehicle re-identification was presented in the paper PROVID: Progressive and Multimodal Vehicle Reidentification for Large-Scale Urban Surveillance.

The paper showcased the concept of using a contrastive loss to train a Siamese neural network.

It also introduced the VeRi dataset, the first large-scale dataset for vehicle ReID.

Fig : PROVID Framework

Furthermore, In Defense of the Triplet Loss for Person Re-Identification popularized the triplet loss for person ReID, an idea that carries over directly to the vehicle ReID task.

Fig : Triplet Loss Training

Other approaches focus on features that stay invariant across multiple viewpoints (multi-camera target tracking).

Orientation Invariant Feature Embedding and Spatial Temporal Regularization for Vehicle Re-identification embeds local region features from extracted vehicle key-points for training with cross-entropy loss.[1]

Fig : Feature Descriptor based orientation invariance

Vehicle Re-identification by Adversarial Bi-directional LSTM Network uses a generative adversarial network (GAN) to generate multi-view features, which are selected by a viewpoint-aware attention model. [1]

Fig : Using Reconstruction loss to train GAN between real and fake for vehicle ReID

Approach

The authors take a multi-task learning approach to the vehicle ReID problem. They use different cues, like viewpoint information (pose), type (e.g., sedan), and color (e.g., blue), to create a strong and distinctive vehicle representation.

The three main pillars of the approach presented in the paper are:

Synthetic Data Generation

It is highly infeasible to hand-annotate a 36-keypoint structure on car images, given the time and effort required to collect a dataset of decent size. Hence, the authors generated a synthetic dataset.

A popular approach to overcome the so-called reality gap is domain randomization , in which a model is trained with extreme visual variety so that when presented with a real-world image the model treats it as just another variation.[1]

Synthetic data, if generated as per the context, can give considerable boost to the accuracy of the model.

The entire process of generating the synthetic data is described in Section 3.1 of the paper. I am skipping this part, as I feel there are too many details that might derail the focus from the main approach.
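To give a flavor of what domain randomization looks like in practice, here is a minimal sketch in Python. The attribute names and value ranges are my own illustration of the idea, not the paper’s actual Unity generation parameters:

```python
import random

# Hypothetical randomization ranges -- purely illustrative, not the
# paper's actual Unity generation parameters.
VEHICLE_MODELS = ["sedan_01", "suv_03", "pickup_02", "hatchback_05"]
BACKGROUNDS = ["highway", "parking_lot", "urban_street", "random_texture"]

def sample_scene():
    """Sample one randomized scene configuration for rendering."""
    return {
        "model": random.choice(VEHICLE_MODELS),
        "color_rgb": [random.random() for _ in range(3)],  # random body paint
        "yaw_deg": random.uniform(0, 360),                 # vehicle orientation
        "background": random.choice(BACKGROUNDS),
        "light_intensity": random.uniform(0.2, 2.0),       # lighting variation
        "occlusion_ratio": random.uniform(0.0, 0.4),       # partial occluders
    }

# The engine knows the ground truth for every render, so identity,
# color, type, and 2D keypoint annotations come for free.
print(sample_scene())
```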

Fig : Vehicle Keypoint Data Generation using Unity Engine [1]

There are 36 keypoints and 13 segments per vehicle. Each segment covers a particular portion of the vehicle, like the windshield, bottom, side glass, etc.

Vehicle Pose Estimation

The idea behind the pose estimation step is to obtain relevant viewpoint information about the vehicle during matching. Especially in multi-camera matching, where the viewpoints can be quite different, the keypoints shared between views can help.

The training is done using the synthetic data.

Instead of explicitly locating keypoint coordinates, the pose estimation network is trained to estimate response maps only, and semantic attributes are not exploited at this stage.

The architecture chosen for the pose estimation network is HRNet instead of the stacked hourglass network. Stacked hourglass architectures were excluded because of their unilateral pathway from low to high resolution, whereas HRNet maintains high-resolution representations and gradually adds high-to-low resolution sub-networks with multi-scale fusion.

As a result, the predicted keypoints and heatmaps are more accurate and spatially more precise, which benefits the embedding for multi-task learning. [1]
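To make the heatmap representation concrete, here is a standard argmax decoding sketch that recovers keypoint coordinates and confidences from response maps. This is only an illustration of the heatmap-to-coordinate step; as described next, PAMTRI’s network also regresses coordinates through a fully-connected layer, so this is not the repository’s code:

```python
import torch

def decode_heatmaps(heatmaps: torch.Tensor):
    """Recover (x, y) coordinates and confidences from response maps.

    heatmaps: (K, H, W) tensor, one map per keypoint (K = 36 here).
    Returns (K, 2) pixel coordinates and (K,) peak confidences.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.view(k, -1)
    conf, idx = flat.max(dim=1)                    # peak value per map
    ys = torch.div(idx, w, rounding_mode="floor")  # row of each peak
    xs = idx % w                                   # column of each peak
    return torch.stack([xs, ys], dim=1), conf

heatmaps = torch.rand(36, 64, 64)  # dummy response maps for 36 keypoints
coords, conf = decode_heatmaps(heatmaps)
print(coords.shape, conf.shape)    # torch.Size([36, 2]) torch.Size([36])
```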

The authors present two approaches to incorporate viewpoint information from the pose estimation network:

  1. In one approach, after the final deconvolutional layer, the 36 heatmaps, one per keypoint used to capture the vehicle shape and pose, are extracted.
  2. In the other approach, the predicted keypoint coordinates from the final fully-connected (FC) layer are used to segment the vehicle body. There are 13 predefined segmentation masks based on the keypoint information. If a certain keypoint is not visible, the corresponding segment is set to blank.

The feedback of heatmaps or segments from the pose estimation network is then scaled and appended to the original RGB channels for further processing, as sketched below.
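In practice, this amounts to stacking extra channels onto the image tensor before it enters the backbone. A minimal sketch, assuming 36 heatmap channels and bilinear rescaling (the exact scaling scheme is my assumption):

```python
import torch
import torch.nn.functional as F

def build_pose_aware_input(rgb: torch.Tensor, pose_maps: torch.Tensor) -> torch.Tensor:
    """Concatenate pose feedback (heatmaps or segments) with RGB channels.

    rgb:       (B, 3, H, W) normalized image batch.
    pose_maps: (B, C, h, w) -- 36 heatmaps or 13 segment masks per image.
    Returns a (B, 3 + C, H, W) tensor for the modified backbone.
    """
    # Rescale the pose maps to the input resolution before stacking.
    pose_maps = F.interpolate(pose_maps, size=rgb.shape[-2:],
                              mode="bilinear", align_corners=False)
    return torch.cat([rgb, pose_maps], dim=1)

rgb = torch.rand(4, 3, 256, 256)
heatmaps = torch.rand(4, 36, 64, 64)       # 36 keypoint heatmaps
x = build_pose_aware_input(rgb, heatmaps)  # shape: (4, 39, 256, 256)
print(x.shape)
```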

Multi Task Learning for ReID

Pose-aware representations are beneficial to both ReID and attribute classification tasks.

  1. Vehicle pose describes the 3D shape model that is invariant to the camera viewpoint, and thus the ReID sub-branch can learn to relate features from different views.
  2. The vehicle shape is directly connected to the car type to which the target belongs.
  3. The segments defined by the 2D keypoints enable the color classification sub-branch to extract the main vehicle color while neglecting non-painted areas such as windshields and wheels.

The end goal is to create a strong vehicle representation using color, type, and pose information that can be used for vehicle re-identification.

Architecture Information

Fig : Architecture given in paper for Mutli Task learning based ReID

The backbone used for multi-task learning (type/color) is a modified version of DenseNet121. The authors modified the initial layer to incorporate the features from the pose estimation network and stack them with the RGB input.

Fig : DenseNet Architecture [Link]

For an extensive explanation of the DenseNet121 architecture, please visit the link.

The concatenated feature vector (the vehicle signature) is fed to three separate branches for multi-task learning: one branch for vehicle ReID and two other branches for color and type classification.
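Here is a hedged sketch of what this might look like with torchvision’s DenseNet121: a widened first convolution for the extra pose channels, the classifier replaced by an identity so the 1024-dimensional signature is exposed, and three linear heads on top. The class counts and channel counts are illustrative assumptions, not the repository’s exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.models import densenet121

class PoseAwareMultiTaskNet(nn.Module):
    """DenseNet121 backbone with a widened input and three task heads."""

    def __init__(self, num_ids, num_colors, num_types, extra_channels=36):
        super().__init__()
        self.backbone = densenet121(weights=None)
        # Widen the first conv to accept RGB + pose channels (3 + 36 here).
        self.backbone.features.conv0 = nn.Conv2d(
            3 + extra_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
        feat_dim = self.backbone.classifier.in_features  # 1024 for DenseNet121
        self.backbone.classifier = nn.Identity()         # expose raw features
        self.id_head = nn.Linear(feat_dim, num_ids)      # vehicle ReID branch
        self.color_head = nn.Linear(feat_dim, num_colors)
        self.type_head = nn.Linear(feat_dim, num_types)

    def forward(self, x):
        feats = self.backbone(x)  # (B, 1024) vehicle signature
        return feats, self.id_head(feats), self.color_head(feats), self.type_head(feats)

model = PoseAwareMultiTaskNet(num_ids=576, num_colors=10, num_types=9)
outputs = model(torch.rand(2, 39, 224, 224))
print([t.shape for t in outputs])
```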

Loss Function

The final loss function of the network is the combined loss of the three tasks. For vehicle ReID, the hard-mining triplet loss is combined with a cross-entropy loss to jointly exploit distance metric learning and identity classification.

Final_Loss = L_ID + λ_color * L_COLOR + λ_type * L_TYPE

Here, λ_color and λ_type are small weighting coefficients for the attribute losses (see the note on regularizing parameters below).

Fig : REID Loss [1]

L_ID = c1*hard_triplet_loss + c2*cross_entropy_loss

Here, c1 and c2 are regularizing coefficients.

The Hard Triplet Loss is an extension of the original triplet loss. The idea is to mine, within each mini-batch, the triplets on which the model currently performs worst.

The process is to sample P identities and K images per identity, giving a mini-batch of P·K images. For each anchor in this batch, the hardest positive (the farthest image with the same identity) and the hardest negative (the closest image with a different identity) are selected, and the triplet loss is computed on these hard triplets rather than on randomly chosen ones, as sketched after the figure below.

Fig : Hard Triplet Loss [1]

I have explained this algorithm in detail in this article.
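For reference, here is a minimal batch-hard triplet loss sketch in PyTorch. It follows the standard formulation from the triplet-loss paper referenced above; it is not the exact implementation from the PAMTRI repository:

```python
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Batch-hard triplet loss over a (P identities x K images) mini-batch.

    embeddings: (N, D) feature vectors; labels: (N,) identity labels.
    """
    dist = torch.cdist(embeddings, embeddings, p=2)    # (N, N) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-identity mask
    # Hardest positive: farthest sample sharing the anchor's identity.
    hardest_pos = (dist * same.float()).max(dim=1).values
    # Hardest negative: closest sample with a different identity
    # (positives are masked out with a large constant).
    hardest_neg = (dist + same.float() * 1e6).min(dim=1).values
    return torch.relu(hardest_pos - hardest_neg + margin).mean()

emb = torch.randn(32, 1024)                    # P=8 identities, K=4 images each
labels = torch.arange(8).repeat_interleave(4)
print(batch_hard_triplet_loss(emb, labels))
```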

Cross-entropy loss, as the name suggests, measures the discrepancy between the ground-truth class and the predicted class distribution. It is used for the color and type classification branches as well.

Fig : Cross Entropy Loss [1]
Fig : Color and Type Loss [1]
Fig : Final Loss of entire network [1]

The regularizing parameters for type and color are set to low values compared to the ReID loss. This is because, in some circumstances, vehicle ReID and attribute classification are conflicting tasks, i.e., two vehicles of the same color and/or type may not share the same identity. [1] The authors wanted the ReID loss to dominate feature formation.
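Putting the pieces together, the total training objective might look like the sketch below, reusing the batch_hard_triplet_loss function from above. The coefficient values are illustrative assumptions; the only constraint stated here is that the attribute weights are kept small so that the ReID loss dominates:

```python
import torch.nn.functional as F

# Illustrative coefficients -- the exact values come from the paper/repo.
C1, C2 = 1.0, 1.0                    # weights inside L_ID (triplet + cross-entropy)
L_COLOR_W, L_TYPE_W = 0.125, 0.125   # kept small so the ReID loss dominates

def total_loss(feats, id_logits, color_logits, type_logits, ids, colors, types):
    """Combined multi-task loss: ReID + color + type classification."""
    l_id = (C1 * batch_hard_triplet_loss(feats, ids)
            + C2 * F.cross_entropy(id_logits, ids))
    l_color = F.cross_entropy(color_logits, colors)
    l_type = F.cross_entropy(type_logits, types)
    return l_id + L_COLOR_W * l_color + L_TYPE_W * l_type
```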

Inference

At the inference stage, the final ReID classification layer is removed. For each vehicle image, a 1024-dimensional feature vector is extracted from the last FC layer. The features of each query-test image pair are compared using the Euclidean distance to determine their similarity. This is a standard procedure for any similarity-matching task using feature vectors.
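A minimal sketch of this ranking step, reusing the multi-task model sketched earlier (which returns the 1024-dimensional signature as its first output; the function name is mine):

```python
import torch

@torch.no_grad()
def rank_gallery(model, query_img, gallery_imgs):
    """Rank gallery images by Euclidean distance to the query embedding."""
    model.eval()
    q_feat, *_ = model(query_img.unsqueeze(0))       # (1, 1024) signature
    g_feats, *_ = model(gallery_imgs)                # (G, 1024) signatures
    dists = torch.cdist(q_feat, g_feats).squeeze(0)  # (G,) distances
    return torch.argsort(dists)                      # best match first

# Usage: ranked = rank_gallery(model, query_tensor, gallery_tensors)
```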

Datasets

Fig : Datasets used for training and evaluation [1]

Please visit the links for detailed information about the datasets:

VeRi Dataset, CityFlow-ReID

Both datasets are available for download at the owners’ discretion, strictly for academic purposes.

The synthetic dataset is not provided by the authors.

Results

I strongly believe that the results given in a paper should not be taken at face value, because these numbers (generally speaking) are produced under specific constraints. Unless you can reproduce them easily, it is hard to confirm their authenticity. Having said that, the results provided in a paper give a general idea of how the approach stands relative to others.

Below are the ReID accuracy numbers from the paper for the VeRi and CityFlow-ReID datasets, the color/type classification results, and the results of the pose estimation network (trained separately with an HRNet backbone).

Fig : Accuracy numbers for VeRi Dataset [1]
Fig : Accuracy numbers for CityFlow-ReID dataset [1]
Fig : Accuracy numbers for color and type classification [1]
Fig : Accuracy numbers for Pose Estimation network [1]

The training hyper-parameters are given in the paper. All the findings are clearly stated in Section 4 of the paper.

If you have any queries about the results, please mention them in the comment section below.

Code

The authors have provided an official code repository to replicate the training and testing setup: Github Link

Conclusion

Vehicle attributes such as color and type are highly related to the deformable vehicle shape expressed through pose representations.

Estimated heatmaps or segments are embedded with the input images for training, and the predicted keypoint coordinates and confidences are concatenated with the deep features for multi-task learning.

The idea has merit, as it takes into account pose features for viewpoint information, as well as color and type for exhaustive attributes. However, as the synthetic data has not been made public by the authors, it is difficult to trust these results unless you generate similar data yourself.
