Localized Narratives: The Latest and Greatest in Image Captioning By Google

Joyce Varghese · Published in The Startup · Sep 21, 2020

A couple of months back, the research team at Google announced Localized Narratives, a brand new way to develop datasets for learning tasks involving vision, tracing, and speech. As you read on, you’ll see how this new annotation protocol works and how it opens up new research opportunities in the world of machine learning and AI.

Localized Narratives is a new annotation protocol for dataset generation developed by Google. It aims to provide highly accurate and rich datasets that, once generated, can cater to over 15 use cases.

How is the dataset made?

Human annotators are asked to talk about an image while hovering their mouse over it, moving the pointer over each region of the image as they talk about that region.

Localized narrative example - Image by Jordi Pont-Tuset

For example, here the voice and the transcription describe the dried grass while the annotator is talking about the grass. The annotator then moves on to the woman and describes her clothing and the objects she holds in her hands. Lastly, the sky and the woman’s hair are described. The mapping can be seen from how the color of the mouse trace corresponds to the caption as each region is being described.

Once the annotator describes the image as a voice clip, they are asked to transcribe it word for word. While this part may seem redundant at first, it is what keeps transcription errors out of the dataset; the step could be automated once automatic speech recognition gets good enough. The manual transcription is accurate, but it carries no timestamps mapping it to the mouse traces. To address this, the researchers perform a sequence-to-sequence alignment between the automatic and manual transcriptions, which yields captions that are both accurate and temporally synchronized. The resulting dataset therefore contains, for each image, the audio describing it, the transcription of that audio, and the mouse traces, all in sync.
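To make the synchronization concrete, here is a minimal Python sketch of what a single aligned annotation could look like and how a word can be grounded to the trace points recorded while it was spoken. The field names (image_id, caption, timed_caption, traces) are modeled on the released JSON Lines format, but treat the exact schema as an assumption and verify it against the official files.

```python
# Field names are modeled on the released JSON Lines format; the exact
# schema is an assumption here, so check the official files.
annotation = {
    "image_id": "1234",
    "caption": "In this image I can see dried grass, a woman, ...",
    # Each spoken word carries the time window in which it was uttered.
    "timed_caption": [
        {"utterance": "dried", "start_time": 1.2, "end_time": 1.5},
        {"utterance": "grass", "start_time": 1.5, "end_time": 1.9},
    ],
    # Mouse traces: segments of points with normalized coordinates and times.
    "traces": [
        [{"x": 0.21, "y": 0.74, "t": 1.25},
         {"x": 0.25, "y": 0.71, "t": 1.60}],
    ],
}

def points_for_word(word_entry, traces):
    """Collect the trace points recorded while this word was being spoken."""
    return [p for segment in traces for p in segment
            if word_entry["start_time"] <= p["t"] <= word_entry["end_time"]]

# "grass" grounds to the trace points hovered while it was spoken.
print(points_for_word(annotation["timed_caption"][1], annotation["traces"]))
```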

What makes it so special?

Localized narratives use a combination of text, speech, and mouse traces to describe an image. What makes this form of annotation special is that every word the narrator speaks is mapped to a region of the image. Unlike many other captioning methods, nouns are not the only focus: verbs, prepositions, and so on are given equal importance and have an image region associated with them. The whole process is based on how people usually describe things, by pointing and explaining, so it comes naturally to the annotators. While the transcription adds some length to the process, the resulting data is rich in explanation and error-free, so the ratio of time spent to data collected is very favorable.


Where is it useful?

While the primary use case may be image captioning, localized narratives serve a wide variety of tasks. They provide four synchronized modalities: the image, the text, the recording, and the grounding (the mouse trace). This opens the way to a large number of use cases, obtained by combining these four modalities in different ways. For example, with the image as input and the text as output, the dataset is ideal for image captioning or paragraph generation. Used the other way around, with the text as input and the image as output, it can train a text-to-image generator. In this way, the researchers identified 15 popular tasks where this massive dataset can be put to use.
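To make the idea of combining modalities concrete, here is a tiny sketch that lists a few of the tasks mentioned in this post as input/output pairs; the task names and pairings simply restate the prose above.

```python
# Tasks mentioned in this post, expressed as input/output combinations of
# the four synchronized modalities.
MODALITIES = {"image", "text", "speech", "trace"}

TASKS = {
    "image captioning":            ({"image"}, "text"),
    "paragraph generation":        ({"image"}, "text"),
    "controlled image captioning": ({"image", "trace"}, "text"),
    "text-to-image generation":    ({"text"}, "image"),
}

for task, (inputs, output) in TASKS.items():
    assert inputs <= MODALITIES and output in MODALITIES
    print(f"{task}: {sorted(inputs)} -> {output}")
```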

Tasks where localized narratives can be used

The results so far

The researchers performed some of the aforementioned tasks to see how the new dataset performed.

Controlled Image Captioning

Given both an image and a mouse trace, the goal is to produce a caption that matches the trace, that is, one that describes the image regions covered by the trace, in the order of the trace. The grounding and the image are given as inputs, and the text caption is the expected output.
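As a rough illustration of how the grounding could be fed to a model, the sketch below turns an ordered mouse trace into an ordered sequence of image grid cells that a captioning decoder could attend to. This is a simplified encoding of my own, assuming normalized trace coordinates, not the exact representation used in the paper.

```python
def trace_to_regions(trace_points, grid_size=8):
    """Map an ordered mouse trace onto an ordered sequence of image grid cells.

    trace_points are (x, y) pairs in [0, 1] normalized coordinates, in the
    order they were drawn. The de-duplicated sequence of visited cells is one
    simple way to tell a captioning decoder which regions to describe, and in
    what order.
    """
    cells = []
    for x, y in trace_points:
        cell = (min(int(y * grid_size), grid_size - 1),   # row
                min(int(x * grid_size), grid_size - 1))   # column
        if not cells or cells[-1] != cell:                 # drop repeats
            cells.append(cell)
    return cells

# Example: a trace sweeping from the lower-left grass up toward the sky.
trace = [(0.10, 0.90), (0.20, 0.85), (0.50, 0.40), (0.60, 0.10)]
print(trace_to_regions(trace))  # ordered grid cells the caption should follow
```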

Standard image captioning versus Controlled image captioning using localized narratives - Image by Jordi Pont-Tuset

Anyone reading the captions can see how the captioning improves when mouse traces are provided. Some key observations:

  1. Since the mouse trace focuses on a smaller region inside the image, the captions have much more detail. Many more features are described in the last two captions than in the first. Conditioning on the mouse trace helps cover the image more completely.
  2. The mouse trace yields a richer caption in the order the user intends. From the second and third images above, we can see that different traces over the same image produce different captions. Mouse traces make the caption more specific and give the user a say in what the generated caption will look like.

While the resulting captions are not necessarily better, they are certainly more complete and in line with how the user wants the image described.

Image generation

Incremental image generation - Image by Jordi Pont-Tuset

Image generation takes a segmentation map, a label map that specifies where each object is supposed to go, and generates an image from it. The researchers demonstrated how localized narratives help with this task using a state-of-the-art pre-trained model, and showed how they can make the interface much more user-friendly. An incremental generation workflow can be built, as shown above.

Here the user specifies only a boat first, followed by water, a person, an umbrella, and a mountain. Notice how the water reflects the mountain and the boat, and how the boat opens up when the person is added.
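A toy sketch of the incremental idea, assuming a hypothetical label set and simple rectangular regions derived from the mouse trace: each time the user names an object, a region of the segmentation map is painted with that object's class, and the updated map would be re-rendered by the pretrained generator (not shown here).

```python
import numpy as np

# Hypothetical label ids; a real segmentation-to-image generator defines
# its own label set.
CLASS_IDS = {"background": 0, "water": 1, "boat": 2, "person": 3,
             "umbrella": 4, "mountain": 5}

def add_object(seg_map, label, box):
    """Paint a rectangular region of the label map for a newly named object.

    box = (x0, y0, x1, y1) in pixels, e.g. derived from the extent of the
    mouse trace recorded while the object was being described.
    """
    x0, y0, x1, y1 = box
    seg_map[y0:y1, x0:x1] = CLASS_IDS[label]
    return seg_map

seg_map = np.zeros((256, 256), dtype=np.uint8)             # empty canvas
seg_map = add_object(seg_map, "boat",   (60, 120, 200, 180))
seg_map = add_object(seg_map, "water",  (0, 180, 256, 256))
seg_map = add_object(seg_map, "person", (110, 80, 150, 130))
# After every step the updated map would be handed to the pretrained
# generator to re-render the scene, giving the incremental effect above.
```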

Image generation tasks are limited to nouns for now. With localized narratives, grounding is provided for verbs and adjectives as well. This opens up new research possibilities and could go a long way toward helping image generation in the future.

These are simply two use-cases where localized narratives helped with popular machine learning tasks. As mentioned earlier, much better models for many different tasks will hopefully be developed using localized narratives.

How can I use localized narratives in my project?

Google has already annotated 849k images with localized narratives. Localized narratives for popular image datasets like COCO, Flickr30k, ADE20k, and a part of the Open Images dataset have already been made available. If your project uses any of these datasets, you can find the localized narratives for them here.
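Once downloaded, the annotation files are JSON Lines, one annotation per line, so reading them takes only a few lines of Python. The file name below is only a placeholder for whichever shard you download, and the field names are assumptions based on the released format, so verify them against your files.

```python
import json

def load_localized_narratives(path):
    """Yield one annotation dict per line of a Localized Narratives .jsonl file."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# Placeholder file name; use whichever annotation file you downloaded.
for ann in load_localized_narratives("coco_train_localized_narratives.jsonl"):
    print(ann["image_id"], ann["caption"][:80])
    break  # just peek at the first annotation
```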

Project idea 😋

Google has annotated all these datasets and made them available for public use. However, if your data does not fall into the image categories of these popular datasets, there is currently no way to generate localized-narratives datasets for task-specific use. This presents an opportunity to create an annotation application that helps an annotator generate such datasets. I’m hoping someone reading this goes ahead and builds one. If you do, let me know, and be sure to make it open source.

If you are interested in reading more about localized narratives, the research paper is available here. A video by the author describing how it works is available here. All images in this post were sourced from the research paper, and all rights belong to their respective owners.

Peace ✌️.
