Pushing the Frontiers of Computer Vision – Insights from CVPR 2018

Tassilo Klein
SAP AI Research
Jul 17, 2018

The 2018 Conference on Computer Vision and Pattern Recognition (CVPR) took place June 18–22 in Salt Lake City, Utah. As the premier and highly competitive conference in computer vision, CVPR provides a platform for a diverse group of academics, researchers, technologists, industrial giants, and high-tech start-ups to showcase the field’s latest innovations.

CVPR showed significant growth this year, making it the largest edition to date with more than 6,000 attendees. Known for its diligent, high-quality review process, the conference received 3,309 paper submissions, of which only 979 were accepted. Additionally, it hosted 21 tutorials, 48 workshops, and the annual doctoral consortium, along with an industrial exhibition featuring around 150 companies.

The conference sparked numerous stimulating discussions and showcased a wide range of novel papers and presentations. Machine learning in particular was at the forefront of CVPR this year, accounting for 24% of the total with 233 papers on the topic. Research on object recognition and scene understanding also featured prominently, with 202 papers.

As one of CVPR’s official sponsors, the SAP Leonardo Machine Learning Research team contributed to the discussion with our recent research project, which focuses on multimodality as an effective approach to addressing the shortcomings of deep learning models that combine vision and natural language.

Our paper “Cross-modal Hallucination for Few-shot Fine-grained Recognition” was part of the Workshop on Fine-Grained Visual Categorization. The paper proposes a multimodal approach that addresses the lack of sufficient data for model training: a two-phase training process that uses images together with text descriptions to learn better visual classifiers. Moreover, our research partners from the University of Pittsburgh presented their work “Deep Ordinal Regression Network for Monocular Depth Estimation” and “An Efficient and Provable Approach for Mixture Proportion Estimation Using Linear Independence Assumption”.

We’ve put together a summary of the conference’s main trends and highlights, along with our own selection of what we deem must-read papers.

Check out our full conference report, including more details and paper highlights.

A Glance at the Main Trends and Highlights

- Multimodality: Bridging the Gap between Visual and Natural Language

Multimodality was one of the most noticeable trends at this year’s CVPR, particularly in vision-and-language models such as Visual Question Answering (VQA) and Visual Dialog (VisDial) systems. These models are still being put through various testbeds, and several shortcomings have been highlighted. One such shortcoming is the lack of an integrative multimodal approach that would improve interpretability and perception and ensure that the systems learn to generalize.

  • Visual Question Answering (VQA): In this task, the system is given an image and a natural language question about its content and has to produce a natural language answer to the image–question pair. Answers can be given as multiple choice, e.g. the system receives two to four options and must determine which one is most likely correct, or as fill-in-the-blank, where the system has to generate an appropriate word for a given blank position. A minimal model sketch follows this list.
  • Visual Dialog (VisDial): The system engages in a meaningful dialog about visual content with humans in conversational language. More precisely, given an image, a dialog history, and a follow-up question about the image, the system has to answer the question in the context of what is displayed.
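
To make the VQA setup concrete, here is a minimal sketch of the common classification-style baseline: encode the image and the question separately, fuse the two representations, and score a fixed answer vocabulary. All module names and sizes are illustrative assumptions, not taken from any particular CVPR paper.

```python
import torch
import torch.nn as nn

class TinyVQA(nn.Module):
    """Minimal VQA baseline: CNN image features + LSTM question
    encoding, fused and classified over a fixed answer vocabulary."""
    def __init__(self, vocab_size=10000, num_answers=1000, dim=512):
        super().__init__()
        self.cnn = nn.Sequential(                 # stand-in image encoder
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, dim),
        )
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, dim, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, image, question_tokens):
        img = self.cnn(image)                     # (B, dim)
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                 # (B, dim) question code
        fused = img * q                           # simple elementwise fusion
        return self.classifier(fused)             # logits over answers

model = TinyVQA()
logits = model(torch.randn(2, 3, 64, 64),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 1000])
```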

Take a look at our recent blog post and planned ECCV workshop for more details on this topic.

- Synthetic Data, Self-Supervision and the Future of AI in the Medical Field

Another topic that is gaining attention is the use of sophisticated synthetic data from environments that mimic the real world with high fidelity, combined with domain adaptation to real data, which renders large-scale data curation unnecessary. Similarly, self-supervision is gaining momentum. In self-supervised learning, the training labels are derived directly from the input data, so no manual annotation is required. One example is solving jigsaw puzzles: an image is cut into pieces that are shuffled, and the neural network has to learn which parts belong together. Another is exploiting the practically unlimited supply of color video: frames are converted to grayscale, and the network is given the task of recoloring them.
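
As a concrete illustration of the colorization pretext task, here is a minimal sketch in PyTorch. The architecture, sizes, and training loop are illustrative assumptions, not from a specific paper; the key point is that the supervision signal (the original color frame) comes for free from the data itself.

```python
import torch
import torch.nn as nn

class Recolorizer(nn.Module):
    """Toy colorization network: grayscale in, RGB out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),   # predict RGB channels
        )
    def forward(self, gray):
        return self.net(gray)

def grayscale(rgb):
    # standard luminance weights; keeps a channel dimension of 1
    w = torch.tensor([0.299, 0.587, 0.114], device=rgb.device)
    return (rgb * w[None, :, None, None]).sum(1, keepdim=True)

model = Recolorizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.rand(8, 3, 64, 64)          # stand-in for video frames
# the "label" is the original color frame: no manual annotation needed
loss = nn.functional.mse_loss(model(grayscale(frames)), frames)
opt.zero_grad(); loss.backward(); opt.step()
```
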
Computer vision papers concerned with the medical field still constitute a small niche. However, their number is increasing as the topic continues to gain traction. Research in this area covers the multimodality of patient images and text reports, as well as segmentation.

- “Good Citizen of CVPR”: Skills & Ethics of the CVPR Community

A great initiative this year was the panel ‘Good Citizen of CVPR’, which focused on establishing a CVPR community culture and code of ethics. The panel included a variety of sessions on research, writing and presentation skills, as well as topics such as representation, inclusiveness and building up a community based on mentorship and leadership.

Our Selection of Interesting Papers

Taskonomy: Disentangling Task Transfer Learning (Best Paper Award)

Amir Zamir, Alexander Sax, William Shen, Leonidas Guibas, Jitendra Malik, Silvio Savarese (2018)

The paper proposes a fully computational approach to building a “taxonomic map” that captures the relationships and transfer-learning dependencies between different vision tasks, so that task transfer can be carried out more efficiently. Redundancies identified across tasks can be exploited for new tasks by simply reusing existing networks in conjunction with feature-transfer functions. As a result, the amount of labeled data required can be reduced dramatically: a few iterations of fine-tuning may be enough to reach a high level of accuracy.
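
As a rough illustration of this reuse pattern (a sketch under assumed names and sizes, not the Taskonomy code), one can freeze an encoder trained on a source task and fit only a small transfer head for the target task:

```python
import torch
import torch.nn as nn

source_encoder = nn.Sequential(             # pretend this is pretrained
    nn.Conv2d(3, 64, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
for p in source_encoder.parameters():
    p.requires_grad = False                 # reuse, do not retrain

transfer_head = nn.Linear(64, 10)           # small readout for target task
opt = torch.optim.Adam(transfer_head.parameters(), lr=1e-3)

images = torch.randn(16, 3, 64, 64)
labels = torch.randint(0, 10, (16,))
with torch.no_grad():
    feats = source_encoder(images)          # cheap reused features
loss = nn.functional.cross_entropy(transfer_head(feats), labels)
opt.zero_grad(); loss.backward(); opt.step()
```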

Empirical Study of the Topology and Geometry of Deep Networks

Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, Pascal Frossard, Stefano Soatto (2018)

This paper studies the topology of the classification regions created by deep networks, as well as their associated decision boundaries. This is of particular interest because, compared to other central properties of deep networks such as generalization, this area has received little attention. The authors show that state-of-the-art deep nets learn connected classification regions. Furthermore, it is intriguing that the decision boundary in the vicinity of natural data points is flat along most directions, while a few curved directions are shared across data points. These shared directions are where deep networks are most vulnerable to adversarial perturbations. In addition, the curvature asymmetry around real data points can be used to distinguish adversarially perturbed samples from original ones. Finally, this purely geometric approach is a distinctive way of making deep image classifiers more robust to perturbations.
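
One way to build intuition for the flatness claim is to probe how far one can move from an input along random directions before the predicted class flips. This is a minimal illustrative probe, not the authors’ methodology:

```python
import torch
import torch.nn as nn

def distance_to_boundary(model, x, direction, max_step=5.0, n=50):
    """Walk from x along a unit direction; return the first distance
    at which the predicted class changes (inf if it never does)."""
    base = model(x).argmax(1)
    for t in torch.linspace(0.0, max_step, n):
        if model(x + t * direction).argmax(1) != base:
            return t.item()                 # class flipped at distance t
    return float("inf")                     # locally flat along this probe

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy net
x = torch.randn(1, 3, 32, 32)
for _ in range(5):
    d = torch.randn_like(x)
    d = d / d.norm()                        # random unit direction
    print(distance_to_boundary(model, x, d))
```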

Learning by Asking Questions

Ishan Misra, Ross Girshick, Rob Fergus, Martial Hebert, Abhinav Gupta, Laurens van der Maaten (2017)

Standard VQA models passively rely on large static datasets, unlike the interactive nature of human learning, which is more sample-efficient and less redundant. The paper fills this gap by introducing a more interactive VQA setup called “learning-by-asking” (LBA) that mimics natural learning. In this setup, an agent can learn more quickly and efficiently by asking an oracle questions about a given image. LBA questions are not observed at training time; the agent must instead learn to “self-evaluate” its knowledge and ask “good”, relevant questions. Since the number of oracle requests is constrained by a budget, the learner must ask questions that maximize the learning signal from each image–question pair sent to the oracle.
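
A budgeted ask-the-oracle loop in this spirit might look as follows. This is a hedged sketch using prediction entropy as a stand-in for the learning signal, not the paper’s implementation:

```python
import torch

def entropy(logits):
    p = logits.softmax(-1)
    return -(p * p.clamp_min(1e-8).log()).sum(-1)

def lba_round(agent, oracle, images, candidate_questions, budget):
    """Spend a limited oracle budget on the questions the agent is
    most uncertain about, collecting (image, question, answer) data."""
    picked = []
    for img in images:
        if budget <= 0:
            break
        scores = torch.stack([entropy(agent(img, q))
                              for q in candidate_questions])
        q = candidate_questions[scores.argmax().item()]  # most uncertain
        budget -= 1                          # each oracle call costs 1
        answer = oracle(img, q)              # ground-truth supervision
        picked.append((img, q, answer))      # train the agent on these
    return picked

# toy stand-ins so the sketch runs end-to-end
agent = lambda img, q: torch.randn(10)       # fake answer logits
oracle = lambda img, q: torch.randint(0, 10, ())
data = lba_round(agent, oracle,
                 images=[torch.randn(3, 64, 64) for _ in range(4)],
                 candidate_questions=list(range(5)), budget=3)
print(len(data))  # 3: budget-limited
```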

Guide Me: Interacting with Deep Networks

Christian Rupprecht, Iro Laina, Nassir Navab, Gregory D. Hager, Federico Tombari (2018)

The paper proposes an original approach to improving the performance of a pre-trained convolutional neural network (CNN) by attaching an additional module to the network, called a “spatio-semantic guide”. This guide enables an interactive dialogue between a human user and the CNN, translating the user’s feedback into actual changes in the network’s activations. Because user feedback is incorporated on the fly, the network can revise its predictions on the spot without any additional training of its parameters; in this respect, the approach shares some similarity with reasoning under partial evidence. An intuitive mode of interaction is via text queries sent by the user to the network, which aim to refine an initial estimate on a specific task by guiding the class labels in a certain direction. The novelty of this approach is the ability to keep improving a trained CNN quickly and at low cost.
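
A minimal sketch of this kind of guiding, assuming a FiLM-style scale-and-shift modulation of intermediate activations driven by an embedded text query; the modulation form, names, and sizes are illustrative assumptions, not necessarily the paper’s exact mechanism:

```python
import torch
import torch.nn as nn

class Guide(nn.Module):
    """Predict per-channel scale/shift from a text-feedback embedding
    and apply it to intermediate CNN activations."""
    def __init__(self, text_dim=64, channels=32):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, channels)
        self.to_shift = nn.Linear(text_dim, channels)
    def forward(self, feats, text_emb):
        # feats: (B, C, H, W); text_emb: (B, text_dim)
        s = self.to_scale(text_emb)[:, :, None, None]
        t = self.to_shift(text_emb)[:, :, None, None]
        return feats * (1 + s) + t           # modulated activations

backbone = nn.Conv2d(3, 32, 3, padding=1)    # stand-in pre-trained layer
head = nn.Conv2d(32, 5, 1)                   # stand-in prediction head
guide = Guide()

image = torch.randn(1, 3, 64, 64)
text_emb = torch.randn(1, 64)                # embedded user feedback
out = head(guide(backbone(image), text_emb)) # guided prediction
print(out.shape)  # torch.Size([1, 5, 64, 64])
```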

Deep Image Prior

Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky (2018)

The paper proposes using a decoder network itself as the prior for imaging tasks. Interestingly enough, the authors show that a generator network is sufficient to capture a large amount of low-level image statistics prior to any learning. Specifically, in this approach the neural network is interpreted as a parametrization of the image. Fitting the weights to a single visually degraded image turns out to be enough to obtain a network (image representation) rich enough to serve as a generic tool for tasks such as denoising and image restoration, with the prior acting as a regularizing indicator function, i.e. the indicator function of images that can be produced from a random noise vector by a deep convolutional net of a certain architecture. The authors also use the approach to investigate the information content retained at different levels of a network by producing so-called natural pre-images, i.e. images that map to the same latent representation. Intriguingly, using the deep image prior as a regularizer, the pre-image obtained even from very deep layers still captures a large amount of information.
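
A minimal denoising sketch in this spirit, with illustrative sizes rather than the authors’ architecture: fit a network that maps a fixed noise input to the single degraded image and stop early, so the network’s structure, not the noise, dominates the output.

```python
import torch
import torch.nn as nn

# The network parametrizes the image: its weights are fit to one
# degraded image only, and its convolutional structure acts as the prior.
net = nn.Sequential(
    nn.Conv2d(8, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
)
z = torch.randn(1, 8, 64, 64)               # fixed random input code
noisy = torch.rand(1, 3, 64, 64)            # stand-in degraded image
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(300):                     # stopping early is essential:
    loss = nn.functional.mse_loss(net(z), noisy)
    opt.zero_grad(); loss.backward(); opt.step()

restored = net(z).detach()                  # output before it overfits noise
```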

Bridging the Gap between Theory and Application

In addition to the wide range of academic and technical research presented at the conference, the industrial exhibition also saw substantial growth this year. Alongside the research, companies showcased their newest industrial innovations, from self-driving cars and robotics to a plethora of other solutions employing machine learning, 3D vision, virtual reality, video analytics, and more. With its continuous growth and its success in bridging the gap between theory and application, CVPR continues to push the frontiers of computer vision.
