PAMTRI: Pose-Aware Multi-Task Learning for Vehicle Re-Identification
Using Highly Randomized Synthetic Data
A radical approach to the vehicle re-identification problem from NVIDIA Labs
Today, we will discuss an unorthodox paper by NVIDIA Labs on vehicle re-identification.
Note that the paper does not propose a novel research idea; rather, it presents an interesting engineering approach to the vehicle re-identification problem.
Let’s jump right in.
Table Of Contents
- Introduction
- Novel Contributions
- Literature Review
- Approach
- Dataset
- Results
- Code
- Conclusion
Introduction
The paper attempts to solve the long-standing problem of vehicle re-identification.
What is vehicle re-identification? Why is it required?
Consider vehicle re-identification as a facial recognition problem, except for vehicles: you see a vehicle, search for a similar vehicle in a database, and decide whether you have a match.
There is a plethora of applications in the vehicle ReID (re-identification) space, such as matching vehicles at the entry and exit of parking lots, automatic toll collection, criminal investigations, and traffic control.
What are different challenges faced by researchers while solving vehicle re-id problem?
Vehicle ReID is an open research problem, with many researchers working hard to devise new algorithms to solve it. However, several challenges must be addressed before it can be solved reliably in diverse situations.
Challenges:-
- high intra-class variability (caused by the dependency of shape and appearance on viewpoint), and
- small inter-class variability (caused by the similarity in shape and appearance between vehicles produced by different manufacturers).[1]
- Due to the limited number of colour options and vehicle models in any particular region, matching is difficult: two vehicles may look similar yet be different.
Despite these challenges, there has been great research work in past couple of years to effectively tackle the problem at scale.
Why can’t we use license plates to match vehicles?
In many scenarios, the camera viewpoint or low resolution makes license plates unreadable. For example, CCTV cameras are often mounted at an oblique angle on street lights. In such cases, we need a matching mechanism that uses distinctive car features such as colour, headlights, and bonnet. This is where vehicle re-identification comes into the picture.
Novel Contributions
The authors point out that the person ReID problem has seen tremendous advancements compared to vehicle ReID, due to the large amount of annotated person data available.
To distinguish one vehicle from another, the authors identify a few cues that differ across vehicles. They believe that the key to vehicle ReID is to exploit viewpoint-invariant information such as color, type, and deformable shape models encoding pose. [1]
In this work, the authors propose a novel framework named PAMTRI, for Pose-Aware Multi-Task Re-Identification.
1. PAMTRI embeds keypoints, heatmaps and segments from pose estimation into the multi-task learning pipeline for vehicle ReID, to guide the network to pay attention to viewpoint-related information. [1]
2. PAMTRI is trained with large-scale synthetic data that include randomized vehicle models, color and orientation under different backgrounds, lighting conditions and occlusion. Annotations of vehicle identity, color, type and 2D pose are automatically generated for training.
The dataset is created using the Unity engine. [1]
3. The proposed method achieves significant improvement over the state of the art on two mainstream benchmarks: VeRi and CityFlow-ReID. [1]
Additional experiments validate that the unique architecture exploiting explicit pose information, along with the use of randomized synthetic data for training, is key to the method's state-of-the-art results.
Literature Review
The first well-known approach to vehicle re-identification was presented in PROVID: Progressive and Multimodal Vehicle Reidentification for Large-Scale Urban Surveillance.
The paper showcased the concept of using contrastive loss to train a siamese neural network.
It also introduced the VeRi dataset, the first large-scale dataset for vehicle ReID.
Furthermore, In Defense of the Triplet Loss for Person Re-Identification refined the popular triplet loss, an idea that has been extended to the vehicle ReID task.
Other approaches focused on viewpoint-invariant features across multiple cameras (multi-camera target tracking).
Orientation Invariant Feature Embedding and Spatial Temporal Regularization for Vehicle Re-identification embeds local region features from extracted vehicle key-points for training with cross-entropy loss.[1]
Vehicle Re-identification by Adversarial Bi-directional LSTM Network use a generative adversarial network (GAN) to generate multi view features to be selected by a viewpoint-aware attention model. [1]
Approach
The authors take a multi-task learning approach to the vehicle ReID problem. They use different cues, such as viewpoint information (pose), type (e.g., sedan), and color (e.g., blue), to create a strong and distinctive vehicle representation.
The approach presented in the paper rests on three main pillars:
Synthetic Data Generation
It is highly infeasible to annotate a 36-keypoint structure on car images, given the time and effort required to collect a dataset of reasonable size. Hence, the authors generated a synthetic dataset instead.
A popular approach to overcome the so-called reality gap is domain randomization , in which a model is trained with extreme visual variety so that when presented with a real-world image the model treats it as just another variation.[1]
Synthetic data, if generated appropriately for the context, can give a considerable boost to the model's accuracy.
The entire process of synthetic data generation is described in Section 3.1 of the paper. I am skipping this part, as I feel there are too many details that might derail the focus from the main approach.
There are 36 keypoints and 13 segments per vehicle. Each segment covers a particular portion of the vehicle, such as the windshield, bottom, or side glass.
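To make the keypoint/segment relationship concrete, here is a minimal sketch of how a segment mask could be derived from keypoints. The segment names, keypoint indices, image size, and the bounding-box fill (a simplification of a proper polygon fill) are all my own illustrative assumptions, not the paper's actual definitions.

```python
import numpy as np

H, W = 8, 8  # tiny image for illustration

# hypothetical mapping from segment name to its defining keypoint indices
SEGMENTS = {"windshield": [0, 1, 2, 3], "side_glass": [4, 5, 6, 7]}

def segment_mask(keypoints, visibility, kp_ids, h=H, w=W):
    """Fill the bounding box of a segment's keypoints; blank if any is hidden."""
    mask = np.zeros((h, w), dtype=np.uint8)
    if not all(visibility[i] for i in kp_ids):
        return mask  # an invisible keypoint leaves the segment blank, as in the paper
    xs = [keypoints[i][0] for i in kp_ids]
    ys = [keypoints[i][1] for i in kp_ids]
    mask[min(ys):max(ys) + 1, min(xs):max(xs) + 1] = 1
    return mask

kps = [(1, 1), (5, 1), (5, 3), (1, 3), (2, 4), (6, 4), (6, 6), (2, 6)]
vis = [True] * 8
m = segment_mask(kps, vis, SEGMENTS["windshield"])  # filled 5x3 box
```

The key behaviour to note is the fallback to an all-zero mask when any defining keypoint is occluded.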
Vehicle Pose Estimation
The idea behind pose estimation is to obtain relevant viewpoint information about the vehicle during matching. Especially during multi-camera matching, where viewpoints can differ considerably, the overlapping pose points can help.
The training is done using the synthetic data.
Instead of explicitly locating keypoint coordinates, the pose estimation network is trained for estimating response maps only, and semantic attributes are not exploited in their framework.
The architecture chosen for the pose estimation network is HRNet rather than the stacked hourglass network. The stacked hourglass architecture was excluded because of its unilateral pathway from low to high resolution, whereas HRNet maintains high-resolution representations and gradually adds high-to-low resolution sub-networks with multi-scale fusion.
As a result, the predicted keypoints and heatmaps are more accurate and spatially more precise, which benefit our embedding for multi-task learning.[1]
The authors present two approaches to incorporate viewpoint information from the pose estimation network:
- In one approach, the 36 heatmaps (one per keypoint) are extracted after the final deconvolutional layer to capture the vehicle shape and pose.
- In the other approach, the predicted keypoint coordinates from the final fully-connected (FC) layer are used to segment the vehicle body into 13 predefined segmentation masks. If a certain keypoint is not visible, the corresponding segment is set to blank.
The heatmaps or segments from the pose estimation network are then scaled and appended to the original RGB channels for further processing.
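The channel-appending step above can be sketched in a few lines. The channel counts (3 RGB plus 36 heatmaps) follow the paper; the spatial size and random data are placeholders.

```python
import numpy as np

H, W = 32, 32                          # illustrative spatial size
rgb = np.random.rand(3, H, W)          # the original image channels
heatmaps = np.random.rand(36, H, W)    # one response map per keypoint

# stack pose information with the RGB channels to form the network input
pose_input = np.concatenate([rgb, heatmaps])  # shape: (39, H, W)
```

With segments instead of heatmaps, the same concatenation would yield a 16-channel (3 + 13) input.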
Multi Task Learning for ReID
Pose-aware representations are beneficial to both ReID and attribute classification tasks.
- Vehicle pose describes the 3D shape model that is invariant to the camera viewpoint, so the ReID sub-branch can learn to relate features from different views.
- The vehicle shape is directly connected with the car type to which the target belongs.
- The segments derived from 2D keypoints enable the color classification sub-branch to extract the main vehicle color while neglecting non-painted areas such as windshields and wheels.
The end goal is to create a strong vehicle representation using color, type, and pose information that can be used for vehicle re-identification.
Architecture Information
The backbone used for multi-task learning (type/color) is a modified version of DenseNet121. The authors modified the initial layer to incorporate and stack the features from the pose estimation network.
For extensive explanation on DenseNet121 architecture, please visit link.
The concatenated feature vector (the vehicle signature) is fed to three separate branches for multi-task learning: one branch for vehicle ReID and two others for color and type classification.
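A minimal sketch of this three-branch head, with each branch reduced to a single linear layer. Only the 1024-dimensional signature size comes from the paper; the numbers of identities, colors, and types here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative dimensions: 1024-d signature (from the paper), hypothetical class counts
feat_dim, n_ids, n_colors, n_types = 1024, 575, 10, 9

# one linear classifier per branch (real branches have more layers)
W_id    = rng.standard_normal((n_ids, feat_dim)) * 0.01
W_color = rng.standard_normal((n_colors, feat_dim)) * 0.01
W_type  = rng.standard_normal((n_types, feat_dim)) * 0.01

signature = rng.standard_normal(feat_dim)  # vehicle signature from the backbone

# the shared signature feeds all three task branches
id_logits, color_logits, type_logits = (W @ signature for W in (W_id, W_color, W_type))
```

The design point is that all three heads read the same shared signature, so gradients from the attribute tasks shape the ReID features and vice versa.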
Loss Function
The final loss function of the network is the combined loss of the three tasks. For vehicle ReID, the hard-mining triplet loss is combined with the cross-entropy loss to jointly exploit distance metric learning and identity classification:
Final_Loss = L_ID + L_COLOR + L_TYPE
L_ID = c1 * hard_triplet_loss + c2 * cross_entropy_loss
Here, c1 and c2 are regularizing coefficients.
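Putting the formulas above into code, this combination is simple arithmetic. The coefficient values here are placeholders of my own; the paper only states that the color/type weights are kept small so the ReID loss dominates.

```python
# illustrative coefficients (not the paper's actual values)
c1, c2 = 1.0, 1.0           # triplet vs. cross-entropy weights inside L_ID
w_color, w_type = 0.1, 0.1  # small weights so the attribute losses stay secondary

def total_loss(triplet, ce_id, ce_color, ce_type):
    """Combine the per-task losses into the final training objective."""
    l_id = c1 * triplet + c2 * ce_id
    return l_id + w_color * ce_color + w_type * ce_type
```

For example, with every component loss equal to 1.0, the total is 1.0 + 1.0 + 0.1 + 0.1 = 2.2, showing how the ReID terms dominate.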
Hard triplet loss is an extension of the original triplet loss. The idea is to find, within a mini-batch, the triplets on which the model produces poor representations.
The process is to sample P identities and K images of each identity. From these P×K images, rather than choosing triplets at random, each anchor is paired with the positive and negative on which the loss is highest (the samples the model handles worst), forming a "hard" mini-batch of triplets.
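The mining procedure above can be sketched as a batch-hard triplet loss in NumPy: each anchor is paired with its farthest same-identity sample and its closest different-identity sample. The margin value is illustrative, and the sketch assumes at least two images per identity in the batch.

```python
import numpy as np

def batch_hard_triplet(features, labels, margin=0.3):
    """Batch-hard triplet loss over a P*K mini-batch.

    For each anchor: hardest positive = farthest sample with the same ID,
    hardest negative = closest sample with a different ID.
    """
    # pairwise Euclidean distance matrix
    d = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    n = len(labels)
    losses = []
    for a in range(n):
        pos = d[a][same[a] & (np.arange(n) != a)]  # same ID, excluding self
        neg = d[a][~same[a]]                       # different ID
        losses.append(max(0.0, margin + pos.max() - neg.min()))
    return float(np.mean(losses))
```

A well-separated batch (every positive closer than every negative by more than the margin) yields a loss of zero, so the gradient comes entirely from the hard anchors.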
I have explained this algorithm in detail in this article.
Cross-entropy loss, as the name suggests, measures the discrepancy between the ground-truth class and the estimated class distribution. It is used for the color and type classification branches as well.
The regularizing coefficients for type and color are set to low values compared to the ReID loss. This is because, in some circumstances, vehicle ReID and attribute classification are conflicting tasks, i.e., two vehicles of the same color and/or type may not share the same identity. [1] The authors wanted the ReID loss to dominate feature formation.
Inference
At the inference stage, the final ReID classification layer is removed. For each vehicle image, a 1024-dimensional feature vector is extracted from the last FC layer. The features of each query and test image pair are compared using Euclidean distance to determine their similarity. This is the standard procedure followed for any similarity matching task based on feature vectors.
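The matching step can be sketched as follows: compute the Euclidean distance from the query feature to every gallery feature and sort. The 1024-dimensional size matches the paper's FC output; the random features and gallery size are placeholders.

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted by Euclidean distance to the query."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    return np.argsort(dists)

rng = np.random.default_rng(1)
gallery = rng.standard_normal((5, 1024))             # 5 stored vehicle signatures
query = gallery[3] + 0.01 * rng.standard_normal(1024)  # near-duplicate of item 3
ranking = rank_gallery(query, gallery)               # ranking[0] should be 3
```

In practice the top-ranked gallery entries form the candidate matches, and metrics such as rank-1 accuracy and mAP are computed from this ordering.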
Datasets
Please visit the links for detailed information about the datasets
Both the datasets are available to download at the owner’s discretion for strictly academic purposes.
The synthetic dataset is not provided by authors.
Results
I strongly believe that the results given in a paper should not be taken at face value, because such numbers are generally produced under specific constraints. Unless you can reproduce them easily, it is hard to confirm their authenticity. Having said that, the reported results do give a general idea of how the approach compares to others.
Below are the ReID accuracy numbers from the paper for the VeRi and CityFlow-ReID datasets, along with the color/type classification results and the pose estimation results (the pose network is trained separately with an HRNet backbone).
The training hyperparameters are given in the paper. All the findings are clearly stated in Section 4 of the paper.
If you have any queries about the results, please mention them in the comment section below.
Code
The authors have provided an official code repository to replicate the training and testing setup: Github Link
Conclusion
The vehicle attributes such as color and type are highly related to the deformable vehicle shape expressed through pose representations.
Estimated heatmaps or segments are embedded with input batch images for training, and the predicted keypoint coordinates and confidence are concatenated with the deep learning features for multi-task learning.
The idea has merit, as it accounts for pose features (viewpoint information) as well as color and type for exhaustive attributes. However, since the synthetic data has not been made public by the authors, it will be difficult to trust these results unless you generate similar data yourself.