Neuromation researchers attended ACL 2019, the Annual Meeting of the Association for Computational Linguistics, the world’s leading conference in the field of natural language processing. With this second part of our ACL in Review series (see the first part here), I continue the experiment of writing draft notes as the ACL sections progress.
This time, it’s the Monday evening session called “Vision, Robotics, Multimodal, Grounding and Speech”; this means that in this section, we get some nice pictures along with the text. Again, I will provide ACL Anthology links for the papers, and all images in this post are taken from the corresponding papers unless specified otherwise. The paper I want to discuss in detail was not the first in its section, but I still decided to keep the order from the conference to make it as authentic as possible.
On to the papers!
Visually Grounded Neural Syntax Acquisition
How do we understand syntax? When we were children, we had to derive it from data, from the language stream directed at us. But what really helped was that language was often paired with imagery: by hearing sentences like “A cat is sleeping outside”, “A cat is staring at you”, or “There’s a cat playing with the ball” and matching them with what we saw, we could extract the notion of “a cat”.
Haoyue Shi et al. (ACL Anthology) ask how to implement this technique for deep learning models: can we generate a linguistically plausible structure for a text given a set of parallel image-text data (say, the MS COCO dataset)? They use the notion of “concreteness”: concrete spans in the parse tree such as “a cat” are more likely to correspond to objects in an image. This notion is captured by a part of the network that estimates concreteness from the interrelation between captions and images. The entire network learns a joint embedding space that unites images and constituents in the same vector space with a hinge-based triplet loss; abstractness and concreteness are defined in this same embedding space. In general, the structure looks like this:
With this approach, the authors get a model that jointly learns parse trees and visually grounded textual representations. They show state-of-the-art parsing results with much less data than state-of-the-art text-only models require.
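The hinge-based triplet loss mentioned above can be sketched in a toy NumPy version; note that the cosine scoring, embedding size, and margin value here are my own illustrative assumptions, not the paper’s exact setup:

```python
import numpy as np

def hinge_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-based triplet loss on cosine similarities: push the matching
    image-constituent pair (anchor, positive) to score at least `margin`
    higher than a mismatched pair (anchor, negative)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, margin - cos(anchor, positive) + cos(anchor, negative))

# Toy 4-d embeddings: the "concrete" constituent points the same way
# as its image, the mismatched one is orthogonal.
img = np.array([1.0, 0.0, 0.0, 0.0])
good = np.array([0.9, 0.1, 0.0, 0.0])  # matching constituent, e.g. "a cat"
bad = np.array([0.0, 1.0, 0.0, 0.0])   # mismatched constituent

assert hinge_triplet_loss(img, good, bad) == 0.0  # already margin-separated
assert hinge_triplet_loss(img, bad, good) > 0.0   # wrong pairing is penalized
```

Training on many such triplets is what pulls concrete spans close to the image regions they describe, while abstract spans end up far from any image.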
Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation
This work by Google researchers Vihan Jain et al. (ACL Anthology) deals with the rapidly growing field of vision-and-language navigation (VLN): how can we give agents instructions in natural language and have them plan their actions, navigate, and respond to changes in their visual field? A characteristic example here is the Room-to-Room (R2R) dataset, which contains images of real indoor environments where the agent is asked to follow instructions such as “Make a left down at the narrow hall beside the office and walk straight to the exit door. Go out the door and wait.” On a map this might look something like this:
The authors move from R2R to R4R, where instructions are more detailed, the paths are longer, and the agent is supposed to follow specific navigation instructions rather than just find the shortest path from point to point. For instance, the agent that found the red path in the right part of the image above would be penalized if the actual instruction was to go along the blue path; the agent using the orange path in the image does a better job even though it fails to actually reach the target.
The models, all based on the reinforced cross-modal matching (RCM) model by Wang et al., now come in two variations with different rewards: goal-oriented agents just want to reach the goal, while fidelity-oriented agents have an objective function that rewards following the reference path. Unsurprisingly, the latter do a better job on R4R. Generally, the work argues that path fidelity is a better objective if our goal is to understand natural language instructions in their entirety rather than just extract the endpoint.
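The contrast between the two reward types can be illustrated with a toy sketch. The node-sequence paths and the plain coverage score below are my simplifications for illustration; the paper’s actual fidelity metric (Coverage weighted by Length Score) is more involved:

```python
def goal_reward(agent_path, reference_path):
    """Goal-oriented: all that matters is ending at the target node."""
    return 1.0 if agent_path[-1] == reference_path[-1] else 0.0

def fidelity_reward(agent_path, reference_path):
    """Fidelity-oriented (simplified): fraction of reference-path nodes
    the agent actually visits, so following the described route pays off."""
    visited = set(agent_path)
    return sum(n in visited for n in reference_path) / len(reference_path)

reference = [0, 1, 2, 3, 4]  # the route described by the instruction
shortcut = [0, 5, 4]         # reaches the goal but ignores the route
faithful = [0, 1, 2, 3]      # follows the route but stops one node short

assert goal_reward(shortcut, reference) == 1.0
assert fidelity_reward(shortcut, reference) == 0.4  # only nodes 0 and 4
assert goal_reward(faithful, reference) == 0.0
assert fidelity_reward(faithful, reference) == 0.8  # nodes 0-3 covered
```

This mirrors the figure discussed above: the shortcut agent maximizes the goal-oriented reward while ignoring the instruction, and the faithful agent is rewarded for staying on the path even though it misses the endpoint.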
Expressing Visual Relationships via Language
And here comes our highlight for the section on visually grounded NLP. In this work, Adobe researchers Hao Tan et al. (ACL Anthology) move from the already classic problem of image captioning, i.e., describing with natural language what an image contains, to image editing, i.e., describing what we want to do with the image. We want a model that can get a request such as “Add a sword and a cloak to the squirrel” and do something like this:
So the first question is how to collect a dataset of this kind. The supervised dataset should consist of triples: an original image, an editing request, and the modified image. The authors crawled a collaborative image editing community called Zhopped (note: I will be hugely surprised if there are no Russian speakers behind this website) and Reddit; specifically, the subreddit r/PhotoshopRequest, where you can ask people to help you with image editing. This yielded the pairs of original and edited images. Although Reddit and Zhopped both contain the users’ original editing requests, these are usually very noisy and often conversational, so the authors opted to rewrite all the requests manually through crowdsourcing.
This procedure yielded the image editing dataset. The authors also used the Spot-the-Diff dataset (Jhamtani and Berg-Kirkpatrick, 2018), which focuses on finding changes between two images. The problem is now to generate text from a pair of images, like this:
The third dataset with image-image-text triples is the NLVR2 dataset (Suhr et al., 2018) that emphasizes the relationship between the two images. Given two images and a statement, you are supposed to classify whether the statement is true or false; for the purposes of this paper, the authors simply used the correct statements and converted this into a captioning problem for a pair of images:
Now that we have the data, what about the models? To be specific, let’s concentrate on the task of generating a sentence that describes the relationship between a pair of images. The paper uses four different models, with a natural succession between them. Let’s look at the flowchart and then discuss it:
This is quite a lot to parse, but actually this is a careful build-up of well-known ideas in the field. The first model (a) is an adaptation of the encoder-decoder model with attention, very similar to the ones used by Xu et al. and Jhamtani and Berg-Kirkpatrick. It constructs features from input images, concatenates them, and then uses this as context to predict the next word with a recurrent architecture.
The basic model, however, does not even differentiate between the two input images. To fix this, model (b) moves on to multi-head attention, an idea made very popular in NLP by the Transformer and its follow-up models. In model (b), attention is applied sequentially, so that when the model attends to the target image it already has context from the source image available and immediately knows where to look for differences.
Models (c) and (d) introduce the concept of relational attention: they compute relational scores between the source and target images in both directions (as you can see, there are two attention modules there). In the static model (c), the scores are then compressed into two feature sequences, possibly losing some information along the way, while the dynamic model (d) computes them during decoding and has access to the full scores.
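A minimal sketch of what relational scores between two feature sequences might look like; the scaled dot-product scoring here is my assumption for illustration, not the paper’s exact parameterization:

```python
import numpy as np

def relational_scores(src_feats, tgt_feats):
    """For every (source, target) feature pair, compute an attention score;
    a softmax over the target axis gives, for each source feature, a
    distribution over the target features it relates to."""
    d = src_feats.shape[-1]
    logits = src_feats @ tgt_feats.T / np.sqrt(d)           # (n_src, n_tgt)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 16))  # 5 source-image region features
tgt = rng.normal(size=(7, 16))  # 7 target-image region features
scores = relational_scores(src, tgt)
assert scores.shape == (5, 7)
assert np.allclose(scores.sum(axis=1), 1.0)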
Naturally, this progression means that quality metrics improve as we move from model (a) to model (d). Here are some sample results from the paper, both positive and negative:
As you can see, sometimes state-of-the-art models are actually pretty good at understanding what is going on with the images, but sometimes they are lost and definitely don’t understand what they’re talking about.
Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video
Traditional video grounding is the problem of localizing a spatial region in certain video frames that corresponds to a specific part of a natural language query (say, find the part of the “spatio-temporal tube”, i.e., the video tensor, that corresponds to a coffee cup). This, however, requires dense fine-grained regional annotations in the video, which are very hard to obtain, so this work by Zhenfang Chen et al. (ACL Anthology) considers weakly supervised video grounding. Moreover, they move to the general problem of localizing a spatio-temporal tube that corresponds to a given sentence as a whole, not to a specific noun. They call this problem weakly-supervised spatio-temporally grounding sentence in video (WSSTG); like this:
To solve the problem, the authors use a pipeline with a standard Faster R-CNN object detector to generate bounding box proposals (Instance Generator below) and an “Attentive Interactor” module that unites RNN-produced representations of the text and the proposed regions; the whole thing is trained with a ranking loss and a diversity loss under multiple instance learning. Here is the whole pipeline:
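Under weak supervision, only video-sentence pairs are labeled, not the individual tubes; the ranking part of the training signal can be sketched roughly like this (the max-pooling over tube scores and the margin value are my simplifications for illustration):

```python
def mil_ranking_loss(pos_tube_scores, neg_tube_scores, margin=0.5):
    """Multiple-instance ranking sketch: score every proposed tube against
    the sentence, take the max as the video-level score, and push the
    matching video above a mismatched one by `margin`."""
    pos = max(pos_tube_scores)  # best tube in the matching video
    neg = max(neg_tube_scores)  # best tube in a mismatched video
    return max(0.0, margin - pos + neg)

# The right tube in the matching video scores high -> no loss.
assert mil_ranking_loss([0.2, 0.9], [0.1, 0.3]) == 0.0
# If a mismatched video scores higher, the hinge kicks in.
assert abs(mil_ranking_loss([0.4], [0.6]) - 0.7) < 1e-9
```

The trick is that the loss never says which tube is correct; the max simply encourages at least one tube in the matching video to align well with the sentence, which is exactly the weak supervision setting.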
The authors also collect and present a new dataset designed for this problem, with videos in which target regions are annotated with sentences. There is nothing too sensational in this approach, but, as often happens with modern deep learning models, it is a big surprise that it actually works! The resulting model can correctly understand quite complicated queries. Here is a sample comparison from the paper:
The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue
And the last paper of the day was by Janosch Haber et al. (ACL Anthology). This dataset concentrates not on captioning but on visually-grounded dialogue, i.e., conversations. The authors argue that one important reason why dialogue is hard is that the participants rely on shared knowledge and shared linguistic experience built up during the conversation, and current dialogue models still cannot really capture this “common ground”. So their proposal is to create a dataset where the grounding is not only conversational but also visual. This is implemented as a game: two participants see six photos each, and they need to find out through dialogue which of the three highlighted photos they have in common (some of the photos are the same and some are different). They do this in natural dialogue, and the visual domain is controlled so that the images are sufficiently similar to require an in-depth description from the participants. Here is an example:
Moreover, the game consists of five rounds (hence the “Page 1 of 5” in the top left), and in subsequent rounds some images can reappear, which motivates the participants to build mutually understood references to images. This large-scale crowdsourced data collection therefore not only gives the authors a good training dataset but also lets them draw conclusions about how people talk about images. Especially interesting are the findings on how the conversation changes over the five rounds: as the game progresses, utterances become much shorter and the fraction of nouns and content words increases significantly, but these words also begin to repeat themselves a lot, so fewer new nouns are introduced in later rounds. This is exactly the “common ground” that is hard to capture for a conversational model.
Then the authors present two baseline models for visual grounding: one with no dialogue history and one that receives (in processed form) the references from previous rounds of the conversation. Naturally, the latter model is more successful in later stages of the game; in the example below, both models do just fine on the left, but only the history-based model can manage the example on the right (and no wonder!):
But both models are still far from perfect, and, of course, the authors hope that this dataset will serve as a common ground (pardon the pun) for further research in the field.
With this, we finish the “Vision, Robotics, Multimodal, Grounding and Speech” section. The media often bombards us with stories about sensational AI achievements. Usually the journalists are not trying to lie to us, but it’s often hard to tell whether they are showing a cherry-picked best-case example or a solution ready to go to production. So it was very enlightening to see what the state of the art in these things really is. For most models we’ve seen today, my conclusion is: sometimes they work, and you can find really cool examples if you look, but very often they still get lost. On the other hand, a lot of this research sounds very promising for real-world applications as well. We should stay tuned to this research, but it’s clear that truly deep understanding of real-world images and a genuine ability to put images into words (or vice versa) are still quite far away.
Chief Research Officer, Neuromation