Large-scale Image Captioning

nocaps: novel object captioning at scale

elvis
DAIR.AI
5 min read · Jan 2, 2019


Image captioning is the task of generating natural language descriptions of visual content, using datasets consisting of image-caption pairs. The figure below illustrates the task and the datasets used in this type of work, along with some examples. For instance, the second image on the left shows a child sitting on a couch, which can also be inferred from the accompanying caption shown in the example.
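To make the data format concrete, here is a minimal sketch of how COCO-style image-caption pairs can be loaded and grouped per image. The field names follow the public COCO captions JSON; the file path and the printed example are just placeholders.

```python
import json
from collections import defaultdict

# Placeholder path to a COCO-style captions file (e.g. captions_train2017.json).
with open("annotations/captions_train2017.json") as f:
    coco = json.load(f)

# Group caption annotations by image id: each image ends up with
# several human-written reference sentences.
captions_per_image = defaultdict(list)
for ann in coco["annotations"]:
    captions_per_image[ann["image_id"]].append(ann["caption"])

image_id, captions = next(iter(captions_per_image.items()))
print(image_id, captions[:2])  # e.g. 139 ['A child sitting on a couch.', ...]
```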

On the left-hand side, we have image-caption examples obtained from COCO, a very popular object-captioning dataset. nocaps (shown on the right) is the benchmark dataset proposed in this paper and includes three different settings: in-domain (only COCO classes), near-domain (both COCO and novel classes), and out-of-domain (only novel classes). These settings are explained later; for now, note that the proposed dataset, nocaps, aims to complement current image captioning datasets such as COCO rather than replace them.
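To make these settings concrete, here is a toy sketch of how an image could be assigned to one of the three settings based on the object classes it contains. The class lists below are tiny placeholders, not the real COCO/Open Images class lists.

```python
# Toy placeholders; the real splits use COCO's 80 classes and the
# Open Images classes that COCO does not cover.
COCO_CLASSES = {"person", "dog", "umbrella", "couch"}
NOVEL_CLASSES = {"dolphin", "rifle", "centipede"}

def domain_of(image_classes):
    """Bucket an image by the object classes it contains."""
    has_coco = any(c in COCO_CLASSES for c in image_classes)
    has_novel = any(c in NOVEL_CLASSES for c in image_classes)
    if has_coco and not has_novel:
        return "in-domain"      # only COCO classes
    if has_coco and has_novel:
        return "near-domain"    # a mix of COCO and novel classes
    return "out-of-domain"      # only novel classes

print(domain_of({"dog", "umbrella"}))    # in-domain
print(domain_of({"person", "dolphin"}))  # near-domain
print(domain_of({"centipede"}))          # out-of-domain
```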

The challenge with current image captioning models is that they generalize poorly to images in the wild. This happens because most models are trained on a small set of visual concepts compared to what a human may encounter in everyday life. Take the COCO dataset, for example: models trained on it can describe images containing dogs and umbrellas, but not dolphins. In order to build more robust real-world applications, such as an assistant for people with impaired vision, these limitations need to be addressed. Specifically, a large-scale set of object classes needs to be supported to generalize better on the image captioning task. The proposed work supports 500+ novel classes, a huge improvement over the 80 classes found in COCO. This paper aims to develop image captioning models that learn visual concepts from alternative data sources, such as object detection datasets. One such large-scale object detection dataset is Open Images V4.

The training dataset for the benchmark consists of a combination of the COCO and Open Images V4 training sets. Keep in mind that no extra image-caption pairs are provided beyond those found in COCO, since the Open Images V4 training portion only consists of images annotated with bounding boxes. The validation and test sets consist of images from the Open Images object detection dataset. Overall, the authors propose a benchmark with 10 reference captions per image and many more visual concepts than are contained in COCO. In addition, 600 classes are incorporated via the object detection dataset, significantly more than the 80 object classes in COCO. Each selected image was captioned by 11 AMT workers via the caption collection interfaces shown in the figure below. Note that priming refers to the technique where workers are given a small guide (in this case, object labels) to help with annotating images of rare objects.

In summary, compared to COCO captions, the proposed benchmark, nocaps, has greater visual diversity, more object classes per image, and longer and more diverse captions (with a larger vocabulary). See the paper for more information on how the dataset and benchmark were prepared.

The benchmark system uses COCO's paired image-caption data to learn to generate syntactically correct captions, while leveraging the Open Images object detection dataset to learn more visual concepts. In essence, COCO is the only source of image-caption supervision used for training, while the captions in the nocaps validation and test sets are used only for evaluation.
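Schematically, this means the model sees two very different kinds of supervision during training. The sketch below is purely illustrative (hypothetical loaders and record layouts), not the authors' actual pipeline.

```python
# Two supervision sources: COCO supplies paired captions, Open Images supplies
# only boxes and class labels, so it teaches visual concepts but not sentences.

def coco_caption_pairs():
    # image paired with a human caption -> supervises the language decoder
    yield {"image": "coco_000001.jpg", "caption": "A child sitting on a couch."}

def open_images_detections():
    # image paired only with bounding boxes and labels (600 classes)
    # -> supervises the object detector / visual features, no captions
    yield {"image": "oi_000042.jpg",
           "boxes": [[10, 20, 200, 180]],
           "labels": ["dolphin"]}

for example in coco_caption_pairs():
    pass  # placeholder: the captioning model is trained on these pairs

for example in open_images_detections():
    pass  # placeholder: the object detector is trained on these labels
```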

One of the aims of the nocaps benchmark is to increase the difficulty of the image captioning task by increasing the diversity of captions and images. The authors note that model performance, measured with automatic evaluation metrics, remains well below the human baseline; the hope is that the benchmark will improve the interpretation of results and yield more insights.
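For context, automatic captioning metrics such as CIDEr score a generated caption against all reference captions available for each image. Below is a minimal sketch, assuming the pycocoevalcap package is installed; the captions are toy examples, not nocaps data.

```python
from pycocoevalcap.cider.cider import Cider

# References (nocaps provides 10 per image; 2 shown here) and model outputs,
# keyed by image id. Captions are pre-tokenized, lowercase strings.
gts = {
    "img1": ["a child sitting on a couch", "a young kid sits on a sofa"],
    "img2": ["a dog holding an umbrella", "a dog under a red umbrella"],
}
res = {
    "img1": ["a child on a couch"],    # each hypothesis list holds one caption
    "img2": ["a dog with an umbrella"],
}

score, per_image = Cider().compute_score(gts, res)
print("corpus CIDEr:", score)
print("per-image CIDEr:", per_image)
```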

The authors investigate two popular methods for object captioning on their benchmark: Neural Baby Talk (NBT) and Up-Down, each with and without constrained beam search (CBS). A Faster R-CNN model, trained on the Visual Genome and/or Open Images datasets, is used to extract image feature representations. As a reminder, with COCO it is very common to use object detection features trained on Visual Genome, since its images are sourced from COCO. Specifically, VG features refer to the use of Visual Genome alone, and VGOI refers to the combination of the Visual Genome and Open Images datasets. (Learn more about the experimental setup in the paper.)
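For readers unfamiliar with CBS: it performs standard beam search but forces the generated caption to include given constraint words (e.g., object labels predicted by the detector), typically by tracking constraint satisfaction with a finite-state machine over beams. The toy sketch below only illustrates the plain beam search bookkeeping, with a hypothetical hard-coded scoring function standing in for a trained decoder.

```python
def beam_search(next_word_scores, beam_size=3, max_len=10, eos="<eos>"):
    """Toy beam search: keep the beam_size highest-scoring partial captions."""
    beams = [([], 0.0)]  # (partial caption, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for words, logp in beams:
            if words and words[-1] == eos:
                candidates.append((words, logp))  # finished caption, keep as-is
                continue
            for word, wlogp in next_word_scores(words).items():
                candidates.append((words + [word], logp + wlogp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

def toy_decoder(words):
    # Hypothetical, hard-coded "decoder" returning next-word log-probabilities;
    # a real captioner would compute these from image features and the prefix.
    if not words:
        return {"a": -0.1, "the": -0.3}
    if words[-1] in ("a", "the"):
        return {"dolphin": -0.4, "child": -0.6}
    return {"swimming": -0.7, "<eos>": -0.2}

for words, score in beam_search(toy_decoder):
    print(" ".join(words), round(score, 2))
```

CBS would additionally prune or re-rank beams so that only captions containing the required constraint words can be emitted as final outputs.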

The experimental results are reported in the table above. We can observe that the Up-Down model with VG features alone (row 1) performs better than with VGOI features, perhaps indicating that the classes in Open Images are much sparser, which increases the complexity of the task. The Neural Baby Talk (NBT) results are also lower than those of the Up-Down model. However, both methods are outperformed by the human baseline, particularly on the nocaps validation set. You can find a more detailed discussion of the results in the paper.

Finally, below are a few image examples from nocaps along with the captions generated by each model variant.

The in-domain model (trained only on COCO) fails to identify novel objects such as gun/rifle and insect/centipede due to the limited set of visual concepts seen during training, as explained earlier. Near-domain means the image contains object classes from both COCO and Open Images, while out-of-domain means it contains no COCO classes. For both near-domain and out-of-domain images, the captions are somewhat better but still need improvement. Overall, the benchmark models trained with the nocaps setup improve marginally over a strong baseline but fall well short of the human baseline, which means there is still plenty of room for improvement on image captioning tasks.

Reference

Harsh Agrawal, Karan Desai, Xinlei Chen, Rishabh Jain, Dhruv Batra, Devi Parikh, Stefan Lee, Peter Anderson. nocaps: novel object captioning at scale.
