Deep Vision & Language Integration

Moin Nabi (ML Research Berlin)

SAP AI Research · Jun 29, 2018

We live in a multimodal world, so it comes as no surprise that the human brain is tailored for the integration of multi-sensory input: our knowledge is built primarily from the various types of input that we receive and process in parallel. One of the long-standing dreams of AI is to teach computers to see and understand visual concepts and, as a next step, to enable them to express this understanding in a sentence. This dream was further nurtured by the emergence of deep learning, which led to a growing interest within the AI community in combining information from vision and language. Recent work has focused on bridging visual and natural language understanding through grounding language in perception.

Several tasks, such as Image Captioning, Visual Question Answering and Visual Dialog, have been put forward as primary testbeds.

  • Image Captioning (IC) is a family of problems in which the goal is to generate a caption for a given image that is both semantically and syntactically correct and properly describes the content of the image.
  • In Visual Question Answering (VQA), the system is given an image and a natural language question about its content and has to produce a natural language answer to the image-question pair. Answers can be provided as multiple choice, where the system is given 2-4 options and has to determine which is most likely correct, or as fill-in-the-blank, where the system has to generate an appropriate word for a given blank position. (A minimal model sketch for this setting follows Fig. 1 below.)
  • In Visual Dialog (VisDial), the system has to hold a meaningful dialog about visual content with humans in conversational language. More precisely, given an image, a dialog history and a follow-up question about the image, the system has to answer the question about the content displayed.
Fig. 1. Vision and Language Tasks: i) Image Captioning: Given an image, the model has to generate a caption describing its content. ii) Visual Question Answering: Given an image and a question about the image content, the model is asked to answer the question. iii) Visual Dialog: Given an image, a dialog history and a follow-up question, the system has to answer questions about the content.
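
To make the VQA setting above concrete, here is a minimal sketch of a common modeling pattern: encode the image with a pretrained CNN, encode the question with a recurrent network, fuse the two representations, and classify over a fixed answer vocabulary. The class names, dimensions and the elementwise fusion are illustrative assumptions, not the architecture of any particular published model.

```python
# Minimal VQA model sketch (illustrative only): fuse pooled image features
# with an encoded question and score a fixed answer vocabulary.
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, img_dim=2048, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)        # word embeddings
        self.question_rnn = nn.LSTM(300, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)        # project CNN features
        self.classifier = nn.Linear(hidden, num_answers)  # answer scores

    def forward(self, img_feats, question_tokens):
        # img_feats: (B, img_dim) pooled CNN features, e.g. from a ResNet
        # question_tokens: (B, T) integer word indices
        _, (h, _) = self.question_rnn(self.embed(question_tokens))
        q = h[-1]                                 # (B, hidden) question summary
        v = torch.relu(self.img_proj(img_feats))  # (B, hidden) image summary
        fused = q * v                             # elementwise fusion
        return self.classifier(fused)             # (B, num_answers)
```

Most of the shortcomings discussed below concern what such a fused representation actually captures, rather than the specific fusion operator chosen here.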

Since the introduction of these tasks, the computer vision community has been impressed by the sheer number of ideas, datasets and models produced in this stimulating area. Despite this success, however, it remains unclear whether the testbeds truly reflect the difficulty of the tasks. In particular, it is not clear whether state-of-the-art vision and language models capture vision and language in a fully integrated fashion. In response, a number of recent works have assessed the strengths and weaknesses of current models, datasets and evaluation metrics. They have performed detailed analyses of the skills required by vision and language problems, allowing meaningful benchmarking of vision and language systems. We discuss the most relevant of these works below.

Challenges in Vision and Language

The last few years have seen an explosion of work illustrating systemic deficiencies in current vision and language systems. The arguments surrounding the performance of these systems fall broadly into the following categories:

  • Dataset Bias: Current VQA models ‘cheat’ by relying primarily on superficial correlations in the training data. This over-reliance on language priors results in poor image grounding (a question-only baseline illustrating this is sketched after this list).
  • Lack of Compositionality: Current VQA models fail to answer questions about unseen compositions of seen concepts. This inability to handle novel settings is due to a lack of compositional reasoning skills in language and vision models.
  • Lack of Fine-grained Understanding: Current vision and language models grasp the gist of the image and text rather than a fine-grained representation of them, meaning that they are unable to identify inconsistencies between the language and fine-grained image details.
  • Attention Inconsistency: Recent studies evaluating the attention maps generated by state-of-the-art vision and language models against human attention show that the attention mechanisms in current models do not seem to look at the same regions as humans do.
  • Lack of Reasoning: To reason about and answer questions on visual data, a diagnostic test that analyzes progress and uncovers shortcomings is vital. However, current VQA models conflate multiple sources of error, making it hard to pinpoint model weaknesses, because the internal reasoning and failure modes of existing vision and language systems are not interpretable.
  • Evaluation Metrics: The evaluation of vision and language tasks is challenging due to the natural ambiguity of language and perception. As a result, many criteria have been proposed. Recent studies suggest that many of these metrics correlate poorly with human judgments or are easily exploitable by perceptually poor models.
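
The dataset-bias point is easy to demonstrate with a “blind” baseline that never looks at the image and simply memorizes the most common answer per question type. The sketch below is an illustrative assumption of how such a baseline can be built; the grouping by first question word and the function names are made up, not taken from any specific benchmark.

```python
# "Blind" (question-only) VQA baseline used to expose language priors:
# if a model that never sees the image scores well, the benchmark is
# rewarding superficial text correlations rather than image grounding.
from collections import Counter, defaultdict

def train_blind_baseline(train_set):
    """train_set: iterable of (question, answer) pairs. The first word of the
    question is used as a crude stand-in for a question-type prior."""
    prior = defaultdict(Counter)
    for question, answer in train_set:
        qtype = question.lower().split()[0]   # e.g. "what", "is", "how"
        prior[qtype][answer] += 1
    # Keep only the most frequent answer per question type.
    return {qtype: counts.most_common(1)[0][0] for qtype, counts in prior.items()}

def predict_blind(baseline, question, default="yes"):
    # Answer without ever looking at the image.
    return baseline.get(question.lower().split()[0], default)
```

If such a baseline scores well on a benchmark, the benchmark rewards language priors rather than genuine image grounding, which is precisely the concern raised above.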

Besides the wealth of literature on the shortcomings of existing vision and language systems, many novel tasks have been proposed to overcome or measure these shortfalls. These novel tasks emphasize different skills that are essential for a truly integrated vision and language system, and they analyze existing tasks, models, datasets and metrics by evaluating them through downstream tasks. For example, FOIL-COCO associates images with both correct and “foil” captions, that is, descriptions of the image that are highly similar to the original ones but contain one single mistake (the “foil word”). State-of-the-art vision and language models fall into the trap of this dataset and perform badly at classifying correct vs. foil captions, detecting the foil word in a sentence and providing its correction, whereas humans achieve near-perfect performance on these tasks (a minimal sketch of how such foils can be constructed follows Fig. 2).

Fig. 2. Novel Vision and Language Tasks: i) Binary Classification: Given an image and a caption, the model is asked to mark whether the caption is correct or wrong. ii) Foil Word Detection: Given an image and a foil caption, the model has to detect the foil word. iii) Foil Word Correction: Given an image, a foil caption and the foil word, the model has to detect the foil and provide its correction. (Image taken from the original paper)
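
As an illustration of the task structure described above, the sketch below builds a foil caption by swapping exactly one word of a correct caption with a semantically close but wrong one. The confusable word pairs and function names are invented for illustration; FOIL-COCO uses its own controlled procedure over MS-COCO object categories.

```python
import random

# Made-up confusable word pairs, for illustration only.
CONFUSABLE = {
    "dog": ["cat", "horse"],
    "bicycle": ["motorcycle", "skateboard"],
    "pizza": ["sandwich", "cake"],
}

def make_foil(caption, seed=0):
    """Return (foil_caption, foil_word) or None if no word can be swapped."""
    rng = random.Random(seed)
    words = caption.split()
    for i, word in enumerate(words):
        if word.lower() in CONFUSABLE:
            foil_word = rng.choice(CONFUSABLE[word.lower()])
            foiled = words[:i] + [foil_word] + words[i + 1:]
            return " ".join(foiled), foil_word
    return None

# Example: make_foil("a dog rides on a bicycle") yields a caption in which
# "dog" is replaced by "cat" or "horse". The three tasks in Fig. 2 are then to
# classify that caption as a foil, detect the swapped word, and correct it
# back to "dog".
```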

Workshop on Shortcomings in Vision and Language (SiVL)

Bearing these research areas in mind, it becomes clear that action is needed to overcome the described challenges. We have therefore joined forces with researchers from the University of Amsterdam, the University of Edinburgh, the University of Trento, the Georgia Institute of Technology, the Rochester Institute of Technology and Facebook AI Research to organize a workshop that provides a venue for discussing the shortcomings in this line of research.

To present new work, exchange ideas, and build connections, we are organizing a workshop on “Shortcomings in Vision and Language (SiVL)” at the European Conference on Computer Vision (ECCV) 2018. This first-of-its-kind workshop brings together experts at the intersection of vision and language to discuss modern approaches, tasks, datasets, and evaluation metrics for significant problems in image and video captioning, visual question answering and visual dialog. By highlighting common areas of concern across these problems, the aim is to facilitate the discussion of novel research directions and to redirect the focus of the community towards the high-level challenges that affect them broadly.

The workshop’s lineup of invited speakers consists of experts across diverse disciplines of vision and language. We call for papers that emphasize an analysis of the strengths and weaknesses of current models, datasets, and evaluation metrics, as well as novel tasks to overcome or measure current shortfalls. In addition to the invited talks, we invite the submission of relevant papers to be presented as posters at the workshop, with selected submissions recognized through spotlight talks. Lastly, we will also host the organizers of the Visual Dialog Challenge 2018, who will announce the results of their challenge.

Papers and abstracts exploring shortcomings in current vision and language models can be submitted via the CMT website, and accepted papers will be published in the ECCV 2018 proceedings.
