Reforming Person Re-Identification with Local Convolutional Neural Networks
This article is part of the Academic Alibaba series and is taken from the paper entitled “Local Convolutional Neural Networks for Person Re-Identification” by Jiwei Yang, Xu Shen, Xinmei Tian, Houqiang Li, Jianqiang Huang, and Xian-Sheng Hua, and accepted by ACM MM 2018. The full paper can be read here.
Person re-identification — or ReID — refers to the process of finding and identifying a specified individual from a database of images so that they can be located across multiple cameras or within the same camera but at different times. This field of computer vision has received a lot of attention in recent years, and the technology is now being used extensively in vision-related applications, such as video retrieval, video surveillance, and CCTV identification.
But person ReID is not without its challenges. Dramatic variations in background across cameras placed in different settings, changes in visual appearance caused by shifting postures and camera angles, and physical occlusions all make person ReID a complicated task. More pertinently, however, these factors highlight the need for a more powerful and robust ReID feature, a fact that piqued the interest of the Alibaba team.
While prior studies have focused primarily on extracting low-level features, such as shapes and local descriptors, the Alibaba team pursued a different approach: convolutional neural networks (CNNs). The team's novel proposal, Local CNN, is a set of local operations: a family of building blocks for synthesizing local and global information in any CNN layer. This building block can be inserted into any convolutional module, even with only limited prior knowledge of the approximate locations of local parts.
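The paper's exact formulation is not reproduced here, but the core idea of a drop-in block that fuses stripe-level (local) and layer-level (global) information inside a single layer can be sketched roughly as follows. The function name, the horizontal-stripe pooling, and the random per-part weights are illustrative assumptions for this sketch, not the authors' implementation:

```python
import numpy as np

def local_global_block(x, num_parts=4, seed=0):
    """Hedged sketch of a local-global fusion block.

    x: a CNN feature map of shape (C, H, W).
    The height axis is split into `num_parts` horizontal stripes
    (standing in for approximate body-part locations). Each stripe is
    pooled to a descriptor, transformed, and broadcast back over the
    stripe, so local information is added onto the global feature map
    within the same layer rather than in a separate branch.
    """
    C, H, W = x.shape
    assert H % num_parts == 0, "stripe height must divide H"
    rng = np.random.default_rng(seed)
    # Hypothetical per-part weights; in a real network these are learned.
    w = rng.standard_normal((num_parts, C, C)) * 0.01

    stripes = x.reshape(C, num_parts, H // num_parts, W)
    local = np.empty_like(stripes)
    for p in range(num_parts):
        # Pool the stripe to a C-dim descriptor and transform it.
        desc = stripes[:, p].mean(axis=(1, 2))   # (C,)
        desc = w[p] @ desc                        # per-part transform
        # Broadcast the local descriptor over the stripe's extent.
        local[:, p] = desc[:, None, None]
    # Residual-style fusion: global path plus local path.
    return x + local.reshape(C, H, W)
```

The residual-style addition at the end is one simple way to let the two information streams interact; the actual building block in the paper is more general and can be placed in any convolutional module.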
Earlier research into this topic has consistently treated local and global features as separate branches, meaning that local and global information could not interact with or reinforce each other. In contrast to standard backbones such as GoogLeNet, ResNet, and DenseNet, the Alibaba team decided to merge the two, incorporating local information to improve these backbone vision models.
With Local CNN, local operations can be combined with any existing architecture. For person ReID, a simple yet effective form of Local CNN was implemented, and the resulting model was found to outperform other state-of-the-art attention-based and part-based methods. The framework developed by the Alibaba team is also believed to be the first in the industry to enable the interaction of global and local information in any CNN layer.
Local CNN was tested against three of the most widely used ReID datasets, namely Market-1501, CUHK03, and DukeMTMC-reID. The tables below illustrate how, when tested against state-of-the-art attention-based and part-based models, Local CNN comes out on top. Results from the experiments also indicated that it is more effective to fuse global and local information together, rather than splitting them into separate branches.
In Table 1, the Market-1501 dataset is used to demonstrate the performance of multiple granularity-based methods. MG indicates that the same architecture and training scheme as the cited work is used, and RK refers to reranking.
In Table 2, the Market-1501 dataset is used to demonstrate the performance of different methods (single query, without reranking).
In Table 3, the performance of different methods is illustrated using the DukeMTMC-reID dataset (single shot, without reranking), whereas in Table 4, the CUHK03 dataset is used (single query, without reranking).