Research highlights from an unprecedented year at NeurIPS, the world’s most-watched AI conference: Computer vision and graphics (Part 2)

J. Rafael Tena
Acrisure Technology Group
8 min read · Mar 31, 2021
Image: Yariv and colleagues, NeurIPS 2020

Whether through the visual media we consume on our smartphones, the video games we play, or the special effects animating our favorite TV series, we routinely experience the progress that has come from the marriage of computer vision, graphics, and AI. We highlight just a few of the latest projects in computer vision from NeurIPS 2020 that stood out to us. Check out part 1 of this blog post here for a round-up of other great research.

Reconstructing and processing 3D data

Surface and appearance reconstruction from multiple camera views. Image: Yariv and colleagues, NeurIPS 2020

The majority of the digital media we consume on a daily basis consists of 2D images. Whether photographs or videos, the data is neatly arranged in grids of pixels. This consistent structure is leveraged by the ConvNet architectures at the heart of deep learning for computer vision applications. However, images are lossy 2D representations of the 3D world we inhabit. The information that is lost, namely depth, is critical for understanding spatial relationships in our world. Accordingly, developing techniques that work directly with data in the 3D domain is critical for many applications. This year, NeurIPS had several papers targeting the pipeline for working in the 3D domain.

A necessary step for developing 3D techniques is to actually ingest 3D information into the digital domain via 3D reconstruction. In Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance, Yariv et al. from the Weizmann Institute of Science present an end-to-end neural network architecture that, given a set of images of an object from multiple viewpoints and corresponding segmentation masks of the object, can reconstruct the 3D geometry of the object, its appearance, and the locations of the cameras that created the input images. The work builds upon recent developments in neural rendering and surface representation to formulate the traditional multiview reconstruction problem as a neural network learning problem. The proposed architecture includes two MLPs: one that learns the 3D geometry, initialized to represent a unit sphere; and a second renderer MLP that is a function of the surface properties, the incoming radiance, and the viewing direction, which dictate the colors of the pixels in the image. The training loss consists of three terms: one that penalizes pixel color discrepancies; another that penalizes segmentation mask discrepancies; and a regularization term. Because geometry and appearance are captured by different networks, the architecture makes it possible to apply the material properties learned in one reconstruction to the geometry of a different object by transferring the rendering MLP.
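
As a rough illustration of this two-network structure, here is a minimal PyTorch-style sketch. The class names, layer sizes, and loss weights are our own assumptions, and the differentiable ray-surface intersection and camera parameters that the authors also handle are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only: layer sizes, names, and loss weights are assumptions.

def mlp(sizes):
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.Softplus(beta=100))
    return nn.Sequential(*layers)

class GeometryMLP(nn.Module):
    """Maps a 3D point to a signed distance; the zero level set is the surface."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = mlp([3, hidden, hidden, 1])

    def forward(self, points):                 # (N, 3)
        return self.net(points)                # (N, 1) signed distance

class RendererMLP(nn.Module):
    """Maps a surface point, its normal, and the viewing direction to a color."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = mlp([9, hidden, hidden, 3])

    def forward(self, points, normals, view_dirs):
        return torch.tanh(self.net(torch.cat([points, normals, view_dirs], -1)))

def three_term_loss(pred_rgb, gt_rgb, mask_logits, gt_mask, sdf_gradients,
                    mask_weight=100.0, reg_weight=0.1):
    """Pixel-color term + segmentation-mask term + regularization on the SDF."""
    color_term = (pred_rgb - gt_rgb).abs().mean()          # (P, 3) tensors
    mask_term = F.binary_cross_entropy_with_logits(mask_logits, gt_mask)
    reg_term = ((sdf_gradients.norm(dim=-1) - 1.0) ** 2).mean()  # eikonal-style
    return color_term + mask_weight * mask_term + reg_weight * reg_term
```

With this split, transferring appearance between objects amounts to pairing one object's GeometryMLP with another object's RendererMLP.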

Creating model representations of objects requires processing multiple instances of the same class. When working with images, the pixel grid provides a natural representation that makes it possible to compare the value of a given pixel across multiple samples. To work with 3D objects, however, it is first necessary to establish explicit correspondences between samples.

Establishing correspondences across different objects of the same class. Image: Liu and Liu, NeurIPS 2020

LoopReg: Self-supervised Learning of Implicit Surface Correspondences, Pose and Shape for 3D Human Mesh Registration by Bhatnagar et al. from MPI Saarland and Google Research presents an end-to-end semi-supervised learning framework to register a corpus of scans to a common 3D human model. The approach encompasses two mappings: the first, parameterized by a neural network, predicts a corresponding point on the common 3D human model for every 3D point in the input scan; the second, parameterized by the human model, transforms those corresponding points back to the input scan based on the model parameters (pose and shape), closing the loop. Compared to traditional approaches, which optimize an objective function one scan at a time, LoopReg can leverage information across a corpus of unlabelled scans. Two key contributions make LoopReg possible: representing the surface of the human model implicitly and diffusing that representation to the whole 3D domain. These allow the method to find correspondences even when the neural network predictions do not land exactly on the model’s surface.
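
A minimal sketch of the loop-closure idea follows, assuming a simple point-wise MLP for the forward mapping and a toy rigid transform standing in for the parametric human model; the real backward mapping uses pose- and shape-dependent skinning.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: CorrespondenceNet and the toy warp are placeholders,
# not the authors' implementation.

class CorrespondenceNet(nn.Module):
    """Forward mapping: predicts, for each scan point, a corresponding
    location in the canonical space of the common human model."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, scan_points):             # (N, 3) raw scan points
        return self.net(scan_points)            # (N, 3) canonical correspondences

def warp_to_scan(canonical_points, rotation, translation):
    """Toy stand-in for the backward mapping: a single rigid transform.
    The real human model re-poses each correspondence using pose and shape."""
    return canonical_points @ rotation.T + translation

def loop_closure_loss(net, scan_points, rotation, translation):
    canonical = net(scan_points)                              # scan -> model
    reposed = warp_to_scan(canonical, rotation, translation)  # model -> scan
    # Self-supervision: closing the loop should land back on the scan point.
    return (reposed - scan_points).norm(dim=-1).mean()
```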

Learning Implicit Functions for Topology-Varying Dense 3D Shape Correspondence by Liu and Liu from Michigan State University tackles the dense correspondence problem for topology-varying classes such as chairs or tables, where correspondences are more readily defined at the part level. For instance, a table has a top and legs, but different tables can have different numbers of legs. Their approach consists of a novel implicit function for surface representation that produces a part embedding latent vector for each 3D point, which is assumed to be similar to the latent vector of a corresponding point on another 3D surface of the same object category. Dense correspondences are then established through an inverse function mapping from the part embedding to a corresponded 3D point. Similar to an encoder-decoder pair, both functions are jointly learned using a loss function that combines an occupancy term, a self-reconstruction term, and a cross-reconstruction term.
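
A minimal sketch of the forward/inverse pair and the three-term loss appears below; the network sizes and the exact form of the cross-reconstruction term are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only: architecture sizes and the cross-reconstruction
# formulation are assumptions, not the authors' code.

class PartEmbeddingNet(nn.Module):
    """Forward function: (shape code, 3D point) -> (occupancy, part embedding)."""
    def __init__(self, code_dim=256, embed_dim=64, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(code_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.occupancy_head = nn.Linear(hidden, 1)
        self.embedding_head = nn.Linear(hidden, embed_dim)

    def forward(self, shape_code, points):       # (code_dim,), (N, 3)
        code = shape_code.expand(points.shape[0], -1)
        h = self.trunk(torch.cat([code, points], dim=-1))
        return self.occupancy_head(h), self.embedding_head(h)

class InverseNet(nn.Module):
    """Inverse function: (shape code, part embedding) -> corresponded 3D point."""
    def __init__(self, code_dim=256, embed_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, shape_code, embeddings):   # (code_dim,), (N, embed_dim)
        code = shape_code.expand(embeddings.shape[0], -1)
        return self.net(torch.cat([code, embeddings], dim=-1))

def training_loss(forward_net, inverse_net, code_a, code_b, points_a, occ_a):
    """Occupancy + self-reconstruction + cross-reconstruction (one direction).
    occ_a: (N, 1) float inside/outside labels for points_a on shape A."""
    occ_logits, embed_a = forward_net(code_a, points_a)
    occupancy = F.binary_cross_entropy_with_logits(occ_logits, occ_a)
    self_rec = (inverse_net(code_a, embed_a) - points_a).abs().mean()
    # Map A's embeddings onto shape B, then check the embeddings are preserved.
    points_on_b = inverse_net(code_b, embed_a)
    _, embed_on_b = forward_net(code_b, points_on_b)
    cross_rec = (embed_on_b - embed_a).abs().mean()
    return occupancy + self_rec + cross_rec
```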

After establishing correspondences across a collection of meshes, learning a latent representation for the collection is a core task for multiple applications such as compression, animation, and simulation. The paper Fully Convolutional Mesh Autoencoder using Efficient Spatially Varying Kernels by Zhou et al. from Adobe Research, Facebook Reality Labs, Pinscreen, and the University of Southern California presents a fully convolutional mesh autoencoder for arbitrary registered mesh data. Given that 3D data lacks the consistent grid available in 2D images, the authors’ main contributions are their convolution and (un)pooling operators, learned with globally shared weights but locally varying coefficients to capture the spatially varying contents presented by irregular mesh connections. The convolution kernel is defined with weights on a standard grid referred to as the “weight basis,” and any given vertex and its neighbors on the mesh can be projected onto it. The weight basis is shared throughout the mesh, while the actual weights for each vertex are sampled from the weight basis by unique functions, which are learned through training.
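
A minimal sketch of such a weight-basis convolution for a mesh with fixed connectivity is shown below; the basis size, neighbor padding, and initialization are assumptions rather than the authors’ choices.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a shared weight basis mixed into per-vertex,
# per-neighbor kernels by locally varying coefficients.

class WeightBasisConv(nn.Module):
    def __init__(self, num_vertices, in_ch, out_ch, num_basis=17, max_neighbors=9):
        super().__init__()
        # Globally shared weight basis: num_basis kernels of size in_ch x out_ch.
        self.basis = nn.Parameter(0.02 * torch.randn(num_basis, in_ch, out_ch))
        # Locally varying coefficients: one mixing vector per (vertex, neighbor),
        # learned for the fixed template connectivity.
        self.coeffs = nn.Parameter(
            0.02 * torch.randn(num_vertices, max_neighbors, num_basis))

    def forward(self, x, neighbor_idx):
        # x: (V, in_ch) vertex features; neighbor_idx: (V, max_neighbors) indices
        # of each vertex's neighbors (padded/repeated for low-valence vertices).
        neighbors = x[neighbor_idx]                              # (V, M, in_ch)
        # Mix the shared basis into a per-vertex, per-neighbor kernel...
        kernels = torch.einsum('vmk,kio->vmio', self.coeffs, self.basis)
        # ...and aggregate neighbor features through those kernels.
        return torch.einsum('vmi,vmio->vo', neighbors, kernels)  # (V, out_ch)
```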

Working directly on the 2D image grid

Multiresolution image blending. Image: Burt and Adelson, ACM Transactions on Graphics 1983

Many previous techniques in computer vision and image processing have been improved by adopting multiresolution approaches with pyramids that repeatedly smooth and subsample an image. An optimization algorithm would begin working at the lowest level of the pyramid, where the data is smooth, and move up to slowly incorporate more detail while keeping the optimization stable. The paper Curriculum by Smoothing by Sinha et al. from the University of Toronto, NVIDIA, and MILA follows a similar path to improve the training of CNNs. In the earlier stages of training, distortion artifacts due to noise can have adverse effects from which the network may not recover. To avoid this problem, the authors propose a curriculum-based training scheme that smooths the feature embeddings of a CNN using Gaussian low-pass filters. More specifically, the output of each convolutional layer is convolved with a Gaussian kernel with a high value of sigma to reduce the high-frequency information. As training progresses, sigma is annealed toward zero, slowly allowing higher frequencies back in and finally reverting to traditional CNN training. The authors experiment with their curriculum-based training and find performance improvements in classification tasks, transfer learning, and unsupervised representation learning with VAEs.
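
A minimal sketch of this idea follows, with the kernel size and annealing schedule chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only: blur each conv layer's output with a Gaussian
# whose sigma is annealed toward zero during training.

def gaussian_kernel2d(sigma, size=5):
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    return g.unsqueeze(1) * g.unsqueeze(0)       # (size, size), sums to 1

class SmoothedConv(nn.Module):
    """A conv layer whose output is low-pass filtered with strength sigma."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        self.out_ch = out_ch

    def forward(self, x, sigma):
        x = self.conv(x)
        if sigma < 1e-3:                          # curriculum finished: plain CNN
            return x
        k = gaussian_kernel2d(sigma).to(x.device)
        k = k.expand(self.out_ch, 1, *k.shape).contiguous()   # depthwise kernel
        return F.conv2d(x, k, padding=k.shape[-1] // 2, groups=self.out_ch)

# Example annealing schedule: start with a large sigma and decay it each epoch.
layer = SmoothedConv(3, 16)
images = torch.randn(8, 3, 32, 32)
sigma0, decay = 1.0, 0.9
for epoch in range(5):
    sigma = sigma0 * decay ** epoch
    features = layer(images, sigma)               # heavily blurred early on
```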

Attention is used to find correspondences between a query image and its support set. Image: Doersch and colleagues, NeurIPS 2020

Deep learning has dramatically improved the ability of computers to correctly classify images. This is particularly true when all image classes are known to be the same as those encountered during training. However, general-purpose vision systems must be able to adapt and learn new classes on the fly from a few examples, also known as few-shot learning. Meta-learning research benchmarks adaptability, that is, how good a model is at learning to learn. In CrossTransformers: spatially-aware few-shot transfer, Doersch and colleagues from DeepMind and the University of Oxford introduce a new architecture called CrossTransformers for few-shot classification. Meta-learners usually learn to classify using the image as a whole; however, objects and scenes are composed of smaller parts that can be leveraged when learning new classes. This key insight of exploiting local part-based comparisons and accounting for spatial alignment is operationalized in practice by establishing soft correspondences using the attention mechanism of transformers. Most approaches to few-shot learning involve learning an embedding for each image, followed by a classifier, in this case the CrossTransformer. To improve the image embeddings, the authors use SimCLR, a self-supervised learning technique recently introduced by Chen and colleagues. Rather than using SimCLR as an auxiliary loss on the embedding, they reformulate the approach as an episodic training strategy that implicitly has the same effect but can be easily applied to any episodic learner. This new strategy is randomly applied to 50% of the training episodes. Even without leveraging SimCLR for the image embeddings, the CrossTransformer matched or outperformed other state-of-the-art methods on benchmark datasets; with SimCLR it was the best performer. Finally, the authors recognize that while their work has pushed the state of the art, the performance of systems for understanding rare objects remains well below human performance, and such systems would pose a high risk if used in safety-critical applications.
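
A minimal sketch of the spatially-aware attention at the core of this idea is shown below, with feature dimensions and module names assumed for illustration.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: cross-attention between a query image's spatial
# features and one class's support-set features yields a query-aligned
# prototype, which is scored by distance.

class CrossAttentionScore(nn.Module):
    def __init__(self, feat_dim=64, key_dim=128):
        super().__init__()
        self.to_query = nn.Linear(feat_dim, key_dim)   # from query features
        self.to_key = nn.Linear(feat_dim, key_dim)     # from support features

    def forward(self, query_feats, support_feats):
        # query_feats:   (Nq, D)     spatial features of the query image
        # support_feats: (S, Ns, D)  spatial features of the class's support images
        S, Ns, D = support_feats.shape
        support_flat = support_feats.reshape(S * Ns, D)
        q = self.to_query(query_feats)                 # (Nq, K)
        k = self.to_key(support_flat)                  # (S*Ns, K)
        attn = torch.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)
        # Soft correspondences build a query-aligned prototype for this class.
        aligned = attn @ support_flat                  # (Nq, D)
        # Score the class by the (negative) distance between query and prototype.
        return -((query_feats - aligned) ** 2).sum(dim=-1).mean()
```

Classification then amounts to computing this score against each class’s support set and picking the class with the highest score.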

Onward

As these highlighted projects show, computer vision, along with the broader field of deep learning, continues to develop rapidly. The advances we are seeing promise to impact many aspects of our lives, from the (now pervasive) video calls we attend, to the visual media we consume for entertainment, and eventually, even the cars that drive us. It is indeed an exciting time for AI!

Eric Allen is the Chief Technology Officer of Acrisure Technology Group. Before that he was an SVP at Two Sigma, conducting deep learning research on the Core AI team, and was a principal investigator at Sun Labs. He has a PhD in computer science from Rice University.

Brendan McCord is President of Acrisure Technology Group. Prior, he was founding CEO of two AI companies that were acquired in 2020 for $400 million, led the formation of the applied AI organization for the Department of Defense, and authored the first DoD AI Strategy.

Rafael Tena is Sr. Staff AI Researcher at Acrisure Technology Group. Prior, he helped companies like FIGS accelerate growth using ML while at Tulco Labs, and was Sr. Research Engineer at Disney Research. He has a PhD in computer vision from University of Surrey.
