Our take on CVPR 2020
Rafael Tena and Eric Allen, Tulco Labs
CVPR serves as the premier conference on computer vision research, and every year we see more of its program devoted to work that is relevant to the broader machine learning community. In this blog, we highlight many of the more general papers, as well as some of the more interesting vision-specific work that caught our attention.
As with many conferences held since the onset of the Covid-19 pandemic, CVPR replaced its scheduled gathering at a physical venue (Seattle, WA) with a fully virtual conference. We discussed some of the pros and cons of hosting a virtual scientific gathering in our take on ICLR, and it is interesting to see the community evolve the virtual conference experience; there certainly are many advantages to virtual conferences to go along with their limitations! Technology to support these virtual venues is advancing rapidly, and it’s especially interesting to see a conference like CVPR, which showcases research directly relevant to video conferencing, take advantage of that same technology. One wonders what these conferences will look like in a year’s time. Maybe some of the papers highlighted below will help us get there…
General Interest
Computing the Testing Error Without a Testing Set
Ciprian A. Corneanu, Meysam Madadi, Sergio Escalera, and Aleix M. Martinez
Universitat de Barcelona, The Ohio State University and Centre de Visio per Computador
One of the pitfalls of training machine learning models is not understanding the difference between training, validation, and testing errors. In particular, measuring the testing error on a hold-out set is key to determining whether a model generalizes well. The hold-out set is meant to be used only once and then discarded to avoid overfitting, which makes that dataset an extremely scarce resource. In this work, the authors present a methodology for measuring the generalization error without using a hold-out set. This enables experimentation to improve generalization without burning through the testing set. We find the notion of directly estimating generalization as a supplement to validation to be an interesting one; we are far more skeptical of the authors’ claim that a test set can be avoided entirely. Nevertheless, some a priori measure of generalization (in the context of an accepted inductive bias) could be quite valuable. The authors derive persistent topology measures that identify when a DNN is learning to generalize to unseen samples. Their methodology is supported by experiments with multiple architectures and computer vision tasks.
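The full persistent-topology machinery in the paper is more involved than we can reproduce here, but the basic ingredients can be sketched: build a functional graph from correlations between unit activations, then summarize how its connectivity changes as the correlation threshold sweeps. The helper below is our own toy simplification (only the zeroth-order connected-component count, computed with a union-find), not the authors’ code.

```python
import numpy as np

def betti0(adjacency: np.ndarray) -> int:
    """Number of connected components of an undirected graph (union-find)."""
    n = adjacency.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if adjacency[i, j]:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj
    return len({find(i) for i in range(n)})

def connectivity_profile(activations: np.ndarray, thresholds) -> list:
    """activations: (num_units, num_samples) responses of network units to a
    probe set. Returns Betti-0 of the functional correlation graph at each
    threshold -- a crude stand-in for the paper's persistence summaries."""
    corr = np.abs(np.corrcoef(activations))
    return [betti0(corr >= t) for t in thresholds]

# Toy usage: 64 units responding to 200 probe inputs.
acts = np.random.randn(64, 200)
print(connectivity_profile(acts, thresholds=np.linspace(0.1, 0.9, 9)))
```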
Improving Confidence Estimates for Unfamiliar Examples
Zhizhong Li and Derek Hoiem
University of Illinois Urbana Champaign
When a machine learning model is deployed, it is not uncommon for it to be presented with data from a significantly different distribution than the one used during development. As a consequence, the error rate on predictions for which the model is highly confident can increase dramatically. In this paper, Li and Hoiem compare several methods to mitigate overconfidence for both familiar and unfamiliar samples. In their methodology, they split the data into familiar and unfamiliar samples according to the samples’ attributes, and the unfamiliar samples are reserved for the testing set. For instance, in a dogs vs. cats classification task, the test set would contain breeds not seen during training and validation. Their experiments show that ensembles of models calibrated using temperature scaling, as proposed by Guo et al. in their work “On Calibration of Modern Neural Networks,” are the least prone to making overconfident predictions on unfamiliar samples.
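Temperature scaling itself is easy to sketch: fit a single scalar T on the validation set to minimize negative log-likelihood, then divide logits by T before the softmax at prediction time. The grid-search fit below is our own minimal stand-in for the procedure described by Guo et al., not their implementation.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z -= z.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Negative log-likelihood of the true labels at temperature T."""
    probs = softmax(logits, T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature that minimizes validation NLL (here via a simple
    grid search rather than gradient-based optimization)."""
    return min(grid, key=lambda T: nll(val_logits, val_labels, T))

# Toy usage: artificially overconfident logits get softened by T > 1.
val_logits = np.random.randn(1000, 10) * 5.0
val_labels = np.random.randint(0, 10, size=1000)
T = fit_temperature(val_logits, val_labels)
calibrated = softmax(val_logits, T)
```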
Proxy Anchor Loss for Deep Metric Learning
Sungyeon Kim, Dongwon Kim, Minsu Cho, and Suha Kwak
POSTECH
Deep Learning can be used to learn semantic distance metrics, that is, embeddings in which semantically similar data, such as images of the same class, are grouped closely together and away from clusters of dissimilar data. The quality of the embedding space is mostly governed by the loss used to train the network, and most losses can be categorized as either pair-based or proxy-based. Pair-based losses take a pair of embedding vectors as input, pulling them together if they are of the same class and pushing them apart otherwise. Because the number of pairs increases polynomially with the size of the dataset, pair-based training is expensive and convergence is slow. Conversely, proxy-based losses estimate a single proxy for each class, and data points are paired only with these proxies. This reduces the number of available pairs, improving convergence at the expense of embedding quality. Kim and colleagues propose a hybrid loss that overcomes the limitations of proxy-based losses while retaining their benefits. The main idea is to take each proxy as an anchor and associate it with all data in a batch: the loss pulls the proxy and its most dissimilar positive examples in the batch together, and pushes away its most similar negative examples.
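For the curious, here is a numpy forward pass of the Proxy-Anchor loss as we read the paper’s formulation: each proxy with positives in the batch contributes a soft (log-sum-exp style) term over its positives, and every proxy contributes one over its negatives, so the hardest examples dominate the gradients. The hyperparameters alpha (scale) and delta (margin) follow the paper’s notation, but treat this as an illustrative sketch rather than a reference implementation.

```python
import numpy as np

def proxy_anchor_loss(embeddings, labels, proxies, alpha=32.0, delta=0.1):
    """Forward pass of the Proxy-Anchor loss (Kim et al., CVPR 2020).
    embeddings: (B, D) batch embeddings; proxies: (C, D), one per class;
    labels: (B,) integer class ids. Similarity is cosine similarity."""
    def normalize(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

    sim = normalize(embeddings) @ normalize(proxies).T          # (B, C)
    pos_mask = labels[:, None] == np.arange(proxies.shape[0])   # (B, C)
    neg_mask = ~pos_mask

    # Proxies that have at least one positive sample in the batch.
    with_pos = pos_mask.any(axis=0)

    pos_term = np.log1p(
        np.sum(np.exp(-alpha * (sim - delta)) * pos_mask, axis=0))
    neg_term = np.log1p(
        np.sum(np.exp(alpha * (sim + delta)) * neg_mask, axis=0))

    return pos_term[with_pos].mean() + neg_term.mean()

# Toy usage: 8-sample batch, 4 classes, 16-dimensional embeddings.
emb = np.random.randn(8, 16)
lab = np.random.randint(0, 4, size=8)
prox = np.random.randn(4, 16)
print(proxy_anchor_loss(emb, lab, prox))
```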
Augment Your Batch: Improving Generalization Through Instance Repetition
Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, and Daniel Soudry
Habana-Labs, ETH Zürich, Technion
Methods derived from gradient descent are the optimization workhorses of Deep Learning. As datasets grew in size, mini-batches became the preferred method for estimating gradients, allowing more descent steps to be performed faster and enabling parallelization. Larger batches reduce the noise in gradient estimation, but have been observed to hurt generalization when the training regime is not carefully tuned. This work by Hoffer and colleagues introduces batch augmentation, that is, replicating instances of samples within the same batch with different data augmentations. They show that this simple process acts as a regularizer and an accelerator, improving both generalization and performance scaling for a fixed budget of optimization steps. It would be interesting to compare (and potentially combine) this approach with Chatterjee’s winsorization approach from ICML 2020, which also regularizes mini-batches, albeit in a very different way.
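Mechanically, batch augmentation amounts to repeating each sample M times within a batch and pushing each copy through an independent random augmentation, so the gradient averages over several views of the same instances. A minimal sketch of that collation step is below; the `augment` callable and M = 4 are placeholders, not the paper’s settings.

```python
import numpy as np

def augmented_batch(images, labels, augment, m=4):
    """Batch augmentation: each of the B samples appears m times in the batch,
    each copy with an independently sampled augmentation.
    images: (B, H, W, C) array; augment: callable image -> image."""
    aug_images = np.stack([augment(img) for img in images for _ in range(m)])
    aug_labels = np.repeat(labels, m)
    return aug_images, aug_labels

# Toy usage with a trivial "augmentation" (random horizontal flip).
def random_flip(img):
    return img[:, ::-1] if np.random.rand() < 0.5 else img

imgs = np.random.rand(32, 28, 28, 3)
labs = np.random.randint(0, 10, size=32)
big_imgs, big_labs = augmented_batch(imgs, labs, random_flip)
print(big_imgs.shape, big_labs.shape)   # (128, 28, 28, 3) (128,)
```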
Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks
Saurabh Singh and Shankar Krishnan
Google Research
Batch Normalization (BN) has become a common component in deep neural networks. It accelerates training, enables training deeper networks, and can have a regularizing effect. Because BN uses mini-batch statistics, its performance degrades when training with small mini-batches. Alternatives to BN, such as Batch Renormalization and Group Normalization (GN), have been introduced; however, they either cannot match the performance of BN for large batches or have drawbacks of their own. Singh and Krishnan propose the Filter Response Normalization (FRN) layer, a novel combination of a normalization and an activation function that can be used as a replacement for other normalizations and activations. Because this new layer operates independently on each activation channel of each batch element, it removes BN’s dependence on the rest of the mini-batch. Experiments with different architectures and datasets show not only that FRN outperforms BN and GN on all batch sizes, but also that its performance is stable across batch sizes.
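A minimal forward-pass sketch of the FRN layer plus its thresholded linear unit (TLU), as described in the paper: each channel of each sample is divided by the root mean of its squared activations over the spatial dimensions (no mean subtraction, no batch statistics), followed by a learned affine transform and max(y, tau). The epsilon handling and parameter shapes below are our simplification.

```python
import numpy as np

def frn_tlu(x, gamma, beta, tau, eps=1e-6):
    """Filter Response Normalization + Thresholded Linear Unit.
    x: (N, H, W, C); gamma, beta, tau: (C,) learned per-channel parameters.
    No batch statistics are used, so behaviour is identical at any batch size."""
    nu2 = np.mean(np.square(x), axis=(1, 2), keepdims=True)   # (N, 1, 1, C)
    x_hat = x / np.sqrt(nu2 + eps)                             # normalize
    y = gamma * x_hat + beta                                   # affine transform
    return np.maximum(y, tau)                                  # TLU activation

# Toy usage: a 2-image batch of 8x8 feature maps with 16 channels.
x = np.random.randn(2, 8, 8, 16)
gamma, beta = np.ones(16), np.zeros(16)
tau = np.full(16, -1.0)
print(frn_tlu(x, gamma, beta, tau).shape)   # (2, 8, 8, 16)
```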
Computer Vision
Probabilistic Regression for Visual Tracking
Martin Danelljan, Luc Van Gool, and Radu Timofte
ETH Zürich
Visual tracking can be formulated as regressing the state of a target in each frame of a video, where the state is represented as a bounding box around the target. The majority of successful methods for this task take the approach of learning a value that indicates the “confidence” that the target is at a given location in the image. Confidence-based regression has the advantage of representing uncertainties; however, the values themselves have no clear interpretation. In this work, Danelljan et al. propose a formulation for learning to predict the conditional probability density of a target state y given an image x. Unlike confidence values, the probability density allows the computation of absolute probabilities. Their experiments on seven benchmark datasets show that the proposed formulation improves tracker performance.
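As we read the paper, the density is obtained by exponentiating and normalizing the network’s score function, p(y|x) ∝ exp(s(y, x)), and training minimizes the KL divergence to a label distribution (e.g. a Gaussian centered on the annotation) that models label noise. The snippet below illustrates that normalization and KL objective on a discretized one-dimensional grid of candidate states; it is our own simplification, not the authors’ training code.

```python
import numpy as np

def scores_to_density(scores, cell_size):
    """Turn raw scores s(y, x) over a discrete grid of candidate states into an
    approximate conditional density p(y|x) = exp(s) / Z, where the normalizer Z
    is estimated by summing over the grid (a stand-in for the integral)."""
    exp_s = np.exp(scores - scores.max())        # numerical stability
    z = exp_s.sum() * cell_size                  # approximate normalizer
    return exp_s / z

def kl_to_label(pred_density, label_density, cell_size, eps=1e-12):
    """KL divergence from the label distribution to the predicted density."""
    p, q = label_density, pred_density
    return np.sum(p * (np.log(p + eps) - np.log(q + eps))) * cell_size

# Toy usage on a 1-D grid of candidate states.
grid = np.linspace(-3, 3, 601)
cell = grid[1] - grid[0]
scores = -0.5 * (grid - 0.4) ** 2                          # pretend network scores
pred = scores_to_density(scores, cell)
label = np.exp(-0.5 * ((grid - 0.5) / 0.2) ** 2)           # Gaussian label noise model
label /= label.sum() * cell
print(kl_to_label(pred, label, cell))
```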
High-Frequency Component Helps Explain the Generalization of Convolutional Neural Networks
Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P. Xing
Carnegie Mellon University
While Deep Learning, in the form of convolutional neural networks (CNNs), has become a powerful tool for computer vision, the networks’ unintuitive capacity to generalize and their vulnerability to adversarial examples remain active research topics. Along with Ilyas and colleagues, the authors hypothesize that these phenomena arise from correlations between the high-frequency components of images, which humans perceive as noise, and their “semantic” low-frequency components. They show that when an image is separated into its low- and high-frequency components, a CNN can successfully predict the class of an object from the high-frequency components but fails to do so from the low-frequency components. This in turn explains how an adversarial example can be perceived as belonging to one class by a human and to a completely different one by a CNN. With the image frequency spectrum as a tool, Wang and colleagues offer hypotheses to explain several generalization behaviors of CNNs and propose defense methods that can help improve their adversarial robustness.
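The low/high-frequency decomposition used in these experiments is easy to reproduce with a radial mask in Fourier space; the cutoff radius below is an arbitrary placeholder, and the function is our own sketch rather than the authors’ preprocessing code.

```python
import numpy as np

def split_frequencies(image, radius):
    """Split a grayscale image into low- and high-frequency components using a
    radial mask in the Fourier domain -- the kind of decomposition used to probe
    which frequencies a CNN actually relies on."""
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    low_mask = dist <= radius

    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(f * ~low_mask)).real
    return low, high

# Toy usage: a 64x64 image split at radius 8; low + high reconstructs it.
img = np.random.rand(64, 64)
low, high = split_frequencies(img, radius=8)
print(np.allclose(low + high, img))   # True (up to numerical precision)
```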
Hyperbolic Image Embeddings
Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Ustinova, Ivan Oseledets, and Victor Lempitsky
Skoltech, Russian Academy of Sciences, Samsung AI, and Yandex
Many computer vision tasks rely on learned high-dimensional embeddings. Embedding learning aims to define a high-dimensional space where images are placed close together when they are semantically similar, and far apart when they are not. A successfully trained embedding enables the use of simple distance metrics to assess similarity between images. The operations at the end of the deep networks used to learn the embedding imply a particular geometry for the embedding space, which is usually Euclidean or spherical. In this work, Khrulkov et al. argue that hyperbolic spaces can be better suited for learning image embeddings. They add hyperbolic network layers, proposed by Ganea and colleagues, to the end of several computer vision networks, and present experiments on image classification, one-shot and few-shot learning, and person re-identification. The results show that hyperbolic geometry embeddings can deliver performance improvements.
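The key ingredient is swapping the Euclidean (or cosine) distance for the geodesic distance of the Poincaré ball when comparing embeddings. A minimal version of that metric is below; the clipping that keeps points strictly inside the unit ball is our own choice of implementation detail.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-5):
    """Geodesic distance between points u, v inside the unit Poincare ball --
    the hyperbolic metric used in place of Euclidean or cosine distance.
    Inputs are clipped to stay strictly inside the ball."""
    def clip(x):
        norm = np.linalg.norm(x)
        max_norm = 1.0 - eps
        return x * (max_norm / norm) if norm >= max_norm else x

    u, v = clip(u), clip(v)
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

# Toy usage: points near the boundary are much "farther" apart than near 0,
# which gives hyperbolic embeddings their tree-like, hierarchical capacity.
a, b = np.array([0.05, 0.0]), np.array([0.0, 0.05])
c, d = np.array([0.95, 0.0]), np.array([0.0, 0.95])
print(poincare_distance(a, b), poincare_distance(c, d))
```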
Learning in the Frequency Domain
Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, and Fengbo Ren
DAMO Academy, Arizona State University
The majority of deep neural networks for computer vision operate in the spatial domain and with fixed input sizes. For practical applications, images are usually large and have to be downsampled to the size accepted by the network. Smaller sizes mean faster training and inference at the expense of information loss in the input and corresponding performance degradation. In this work, Xu and colleagues take a page out of the digital signal processing book and propose to reshape the high-resolution images in the frequency domain, leveraging the discrete cosine transform (DCT), rather than resizing them in the spatial domain. The reshaped DCT coefficients can be fed as input to traditional CNN architectures with little modification. Their experiments in image classification, object detection, and instance segmentation show that learning in the frequency domain outperforms its spatial counterpart with an equal or smaller input data size. Additionally, their methodology includes a learning-based dynamic channel selection method that identifies trivial frequency components for static removal during inference. Experiments show that up to 87.5% of the frequency channels can be discarded using the proposed channel selection method with limited accuracy degradation.
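A rough sketch of the preprocessing as we understand it: apply a blockwise (JPEG-style, 8×8) DCT to each channel of the high-resolution image and regroup each frequency position into its own low-resolution channel, so a 448×448 plane becomes a 56×56×64 tensor. The block size and the single-channel simplification below are our assumptions, not necessarily the authors’ exact pipeline.

```python
import numpy as np
from scipy.fftpack import dct

def blockwise_dct_channels(image, block=8):
    """Rearrange a single-channel image into DCT-coefficient channels: every
    8x8 block is transformed, and each of the 64 frequency positions becomes
    its own channel of an (H/8, W/8, 64) tensor -- a frequency-domain input
    that a CNN can consume in place of a downsampled spatial image."""
    h, w = image.shape
    assert h % block == 0 and w % block == 0
    blocks = image.reshape(h // block, block, w // block, block)
    blocks = blocks.transpose(0, 2, 1, 3)                  # (H/8, W/8, 8, 8)
    coeffs = dct(dct(blocks, axis=2, norm='ortho'), axis=3, norm='ortho')
    return coeffs.reshape(h // block, w // block, block * block)

# Toy usage: a 448x448 plane becomes a 56x56x64 frequency-domain input.
img = np.random.rand(448, 448)
print(blockwise_dct_channels(img).shape)   # (56, 56, 64)
```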
End-to-End Learning of Visual Representations From Uncurated Instructional Videos
Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman
Inria, DeepMind, Czech Technical University, University of Oxford
Doing supervised learning on videos can be daunting because of the difficulty of procuring large volumes of labeled data. To circumvent this problem, Miech and colleagues show how to learn video representations without manual supervision using the recently released HowTo100M dataset of narrated videos. Given a short video and the corresponding textual narration, their goal is to learn a joint embedding space where the similarity between the narration and video embeddings is high when the text and visual content are semantically similar and low otherwise. The similarity score between an embedded video and an embedded narration is their dot product. The learning problem is made more challenging by frequent misalignment between the video and its textual descriptions. To successfully learn the joint embedding despite the noise, they formulate a new loss, the Multiple Instance Learning Contrastive Loss, whose objective is to maximize the ratio of the summed scores of a set of positive candidate pairs to the summed scores over both the positive candidates and all negative samples. Positive samples are pairings of a video and its adjacent text descriptions, while negative samples are pairings of a video and the text descriptions of other videos. Experiments show that representations learned with the proposed loss yield better results on downstream tasks when compared to representations learned with other self-supervised and fully supervised baselines.
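The loss itself is compact: for each clip, the scores of several candidate positive narrations are summed inside the numerator of a softmax-like ratio, rather than committing to a single alignment. The numpy forward pass below is a simplified single-batch version; the shapes, the temperature-free dot-product scores, and the `positive_mask` convention reflect our reading of the paper rather than the released code.

```python
import numpy as np

def mil_nce_loss(video_emb, text_emb, positive_mask):
    """Multiple Instance Learning contrastive loss. video_emb: (B, D) clip
    embeddings; text_emb: (B, K, D) with K candidate narrations per clip;
    positive_mask: (B, B, K) boolean, True where narration k of clip j is a
    positive for clip i (typically the block-diagonal i == j entries)."""
    # Dot-product scores between every clip and every candidate narration.
    scores = np.einsum('id,jkd->ijk', video_emb, text_emb)
    exp_scores = np.exp(scores - scores.max())                 # stability

    pos = np.sum(exp_scores * positive_mask, axis=(1, 2))      # (B,)
    total = np.sum(exp_scores, axis=(1, 2))                    # positives + negatives
    return -np.mean(np.log(pos / total + 1e-12))

# Toy usage: 4 clips, 3 candidate narrations each, 32-dimensional embeddings.
B, K, D = 4, 3, 32
v, t = np.random.randn(B, D), np.random.randn(B, K, D)
mask = np.zeros((B, B, K), dtype=bool)
mask[np.arange(B), np.arange(B), :] = True                     # own narrations are positives
print(mil_nce_loss(v, t, mask))
```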