NeuroNuggets: CVPR 2018 in Review, Part II

Today, we continue our series on the recent CVPR (Computer Vision and Pattern Recognition) conference, a top conference in the world for computer vision. Neuromation successfully participated in the DeepGlobe workshop there, and now we are taking a look at the papers from the main conference. In the first part of our CVPR review, we briefly reviewed the most interesting papers devoted to generative adversarial networks (GAN) for computer vision. This time, we delve into the works that apply computer vision to us, humans: tracking human bodies and other objects in videos, estimating poses and even full 3D body shapes, and so on. Again, the papers are in no particular order, and our reviews are very brief, so we definitely recommend to read the papers in full.

The human touch: person identification, tracking, and pose estimation

Humans are very good at recognizing and identifying other humans, much more so than at recognizing other objects. In particular, there is a special part of the brain, called fusiform gyrus, which is believed to contain the neurons responsible for face recognition, and those neurons are believed to do their jobs a bit differently from the neurons that recognize other things. This is where those illusions about upside-down faces (the Thatcher effect) come from, and there is even a special cognitive disorder, prosopagnosia, where a person loses the ability to recognize human faces… but still perfectly well recognizes tables, chairs, cats or English letters. It’s not all that well understood, of course, and there are probably no specific “individual face neurons”, but faces are definitely different. And humans in general (their shapes, silhouettes, body parts) also have a very special place in our hearts and brains: “basic shapes” for our brain probably include triangles, circles, rectangles… and human silhouettes.

Recognizing humans is a central problem for humans, and so it has been for computer vision. Back in 2014 (a very long time ago in deep learning), Facebook claimed to reach superhuman performance on face recognition, and regardless of contemporary criticism by now we can basically assume that face recognition is indeed solved very well. However, plenty of tasks still remain; e.g., we have already posted about age and gender estimation and pose estimation for humans. On CVPR 2018, most human-related papers were either about finding poses in 3D or about tracking humans in video streams, and this is exactly what we concentrate on today. For good measure, we also review a couple of papers on object tracking that are not directly related to humans (but where humans are definitely one of the most interesting subjects).

Detect-and-Track: Two-Step Tracking with Pose Estimation

R. Girdhar et al., Detect-and-Track: Efficient Pose Estimation in Videos

We have already written about segmentation with Mask R-CNN, one of the most promising approaches to segmentation that appeared in 2017. Over the last year, several extensions and modifications of the basic Mask R-CNN appeared, and this collaboration between Carnegie Mellon, Facebook, and Dartmouth presents another: the authors propose a 3D Mask R-CNN architecture that uses spatiotemporal convolutions to extract features and recognize poses directly on short clips. Then they proceed to show that a two-step algorithm with 3D Mask R-CNN as the first step (and bipartite matching to link keypoint predictions as the second) beats state of the art methods in pose estimation and human tracking. Here is the 3D Mask R-CNN architecture which is sure to find more applications in the future:

Pose-Sensitive Embeddings for Person Re-Identification

M. Saquib Sarfraz et al., A Pose-Sensitive Embedding for Person Re-Identification with Expanded Cross Neighborhood Re-Ranking

Person re-identification is a challenging problem in computer vision: as examples above show, changes in the camera view and pose can make the two pictures not alike at all (although we humans still immediately identify that this is the same person). This problem is usually solved with retrieval-based methods that derive proximity measures between the query image and stored images from some embedding space. This work by German researchers proposes a novel way to incorporate information about the pose directly into the embedding, improving re-identification results. Here is a brief overview picture, but we suggest to read the paper in full to understand how exactly the pose is added to the embedding:

3D Poses from a Single Image: Constructing a 3D Mesh from 2D Pose and 2D Silhouette

G. Pavlakos et al., Learning to Estimate 3D Human Pose and Shape from a Single Color Image

Pose estimation is a well-understood problem; we have written about it before and already mentioned it in this post. Making a full 3D shape of a human body is quite another matter, however. This work presents a very promising and quite surprising result: they generate the 3D mesh of a human body directly through an end-to-end convolutional architecture that combined pose estimation, segmentation of human silhouettes, and mesh generation (see picture above). The key insight here is based on using SMPL, a statistical body shape model that provides a good prior for the human body shape. As a result, this approach manages to construct a 3D mesh of a human body from a single color image! Here are some illustrative results, including some very challenging cases from the standard UP-3D dataset:

FlowTrack: Looking at Video with Attention for Correlation Tracking

Z. Zhu et al., End-to-end Flow Correlation Tracking with Spatial-temporal Attention

Discriminative correlation filters (DCF) are a state of the art learning technique for object tracking. The idea is to learn a filter — that is, a transformation of an image window, usually simply a convolution — which would correspond to the object you want to track and then apply it to all frames in the video. As it often happens with neural networks, DCFs are far from a new idea, dating back to a seminal 1980 paper, but they had been nearly forgotten until 2010; the MOSSE tracker started a revival, and now DCFs are all the rage. However, classical DCFs do not make use of the actual video stream and process each frame separately. In this work, the Chinese researchers present an architecture that does involve a spatial-temporal attention mechanism able to attend across different time frames; they report much improved results. Here is the general flow of their model:

Back to the Classics: Correlation Tracking

C.Suni et al., Correlation Tracking via Joint Discrimination and Reliability Learning

This paper, just like the previous one, is devoted to tracking objects in videos (a very hot topic right now), and just like the previous one, it uses correlation filters for tracking. But, in stark contrast to the previous one, this paper does not use deep neural networks at all! The basic idea here is to explicitly include in the model reliability information, i.e., add a term to the objective function that models how reliable the learned filter is. The authors report significantly improved tracking and also show learned reliability maps that often look very plausible:

That’s all folks!

Thank you for your attention! Join us next time — there are many more interesting papers from CVPR 2018… and, just as a sneak peek, the ICLR 2019 deadline has passed, its submitted papers are online, and although we won’t know which are accepted for a few more months we are already looking at them!

Sergey Nikolenko
Chief Research Officer, Neuromation

Aleksey Artamonov
Senior Researcher, Neuromation