Image Caption Generation with Visual Attention

Shahab Ahmad Khan
Published in The Startup · Dec 20, 2020

Header image source: Lu et al. (2017)

Introduction

Image captioning involves automatically generating a natural language description of the objects present in an image and their relationships with the environment. It requires a semantic understanding of the image, so that the generated sentence is both linguistically plausible and semantically truthful.

Dataset

The dataset used in this project was Flickr8k. It is realistic and relatively small, so it is possible to download it and build models on your own workstation using a CPU. You can find more on the dataset in the 2013 paper “Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics”.

The authors describe the dataset as follows:

We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
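As a rough sketch of getting started with the data (assuming the standard Flickr8k.token.txt layout, where each line is an image name, a “#” caption index, a tab, and then the caption), the captions can be loaded like this:

```python
from collections import defaultdict

def load_captions(token_file="Flickr8k.token.txt"):
    """Return a dict mapping image file name -> list of its five captions."""
    captions = defaultdict(list)
    with open(token_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_id, caption = line.split("\t", 1)
            image_name = image_id.split("#")[0]  # drop the "#0".."#4" suffix
            captions[image_name].append(caption.lower())
    return captions

captions = load_captions()
print(len(captions))  # roughly 8,000 images, five captions each
```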

Classic Image Captioning Model

This architecture was proposed by Vinyals et al. in their CVPR 2015 paper “Show and Tell”. It is an encoder-decoder architecture in which a Convolutional Neural Network pre-trained on the ImageNet dataset is used as the encoder to produce a fixed-length vector representation of the image. An LSTM is then used as the decoder to generate the caption one word at a time.

Architecture used by Vinyals et al. in their paper “Show and Tell”
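As a minimal Keras sketch of this encoder-decoder idea (not the exact model from this post; vocab_size, max_len and the layer sizes below are illustrative assumptions), a CNN pre-trained on ImageNet feeds a fixed-length image vector into an LSTM decoder:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Illustrative sizes, not the exact hyper-parameters used in this post.
vocab_size, max_len, embedding_dim, units = 8000, 34, 256, 512

# Encoder: a CNN pre-trained on ImageNet gives a fixed-length image vector.
cnn = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")
cnn.trainable = False

image_input = layers.Input(shape=(299, 299, 3))
image_vec = layers.Dense(units, activation="relu")(cnn(image_input))

# Decoder: an LSTM generates the caption word by word, conditioned on the
# image vector through its initial state.
caption_input = layers.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, embedding_dim, mask_zero=True)(caption_input)
x = layers.LSTM(units)(x, initial_state=[image_vec, image_vec])
next_word = layers.Dense(vocab_size, activation="softmax")(x)

model = Model([image_input, caption_input], next_word)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```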

Loss Function

The model uses the sum of the negative log-likelihoods of the correct word at each step as the loss function, i.e. the cross-entropy loss:

L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t)
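A small TensorFlow sketch of this loss (assuming a decoder that emits logits at every timestep, with word id 0 reserved for padding):

```python
import tensorflow as tf

scce = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")

def caption_loss(real, pred):
    # real: (batch, seq_len) integer word ids, 0 = padding
    # pred: (batch, seq_len, vocab_size) unnormalized logits
    mask = tf.cast(tf.not_equal(real, 0), tf.float32)
    per_word = scce(real, pred)              # -log p(correct word) at each step
    return tf.reduce_sum(per_word * mask, axis=-1)  # sum over the caption
```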

Issues with the Classical Model

This model uses the entire representation of the image to condition the generation of each word, and hence cannot focus on different parts of the image when generating different words.

Attention for image captioning

Visual attention for image captioning was first introduced by Xu et al. (2015) in their paper “Show, Attend and Tell”. The work takes inspiration from applications of attention in other sequence-modelling and image recognition problems.

Source: the “Show, Attend and Tell” paper

The model looks at the “relevant” part of these images to generate the underlined words.

The model focuses on the correct object at each step

Model Architecture

Same architecture as before, but with an attention layer added

The architecture is similar to that of the classical model, but with a new attention layer. The attention layer looks at the feature map produced by the CNN encoder and decides which parts of it are relevant for the decoder at each step.

The decoder utilizing context vector generated by the attention mechanism

At a particular timestep, the decoder GRU considers the hidden state from the previous timestep, the context vector generated by the attention mechanism, and the output word from the previous step. It then combines them to update its hidden state.
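A minimal TensorFlow sketch of this soft (Bahdanau-style) attention and the GRU decoder step that consumes its context vector (layer names and sizes are illustrative, not the exact code behind these results):

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Soft attention over the spatial locations of the CNN feature map."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the feature map
        self.W2 = tf.keras.layers.Dense(units)  # projects the decoder state
        self.V = tf.keras.layers.Dense(1)       # one score per location

    def call(self, features, hidden):
        # features: (batch, num_locations, feature_dim); hidden: (batch, units)
        scores = self.V(tf.nn.tanh(
            self.W1(features) + self.W2(tf.expand_dims(hidden, 1))))
        weights = tf.nn.softmax(scores, axis=1)              # where to look
        context = tf.reduce_sum(weights * features, axis=1)  # weighted summary
        return context, weights

class Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(units)

    def call(self, prev_word, features, hidden):
        # prev_word: (batch, 1) id of the previously generated word
        context, weights = self.attention(features, hidden)
        x = self.embedding(prev_word)                             # (batch, 1, emb)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)   # attach context
        output, state = self.gru(x, initial_state=hidden)
        return self.fc(output), state, weights
```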

Beam Search

Beam search keeps the top n candidate captions (the beam) at each step while searching for the best caption, unlike greedy search, which only keeps the single most probable word. Beam search can increase the overall probability of the generated sentence and hence is widely used.
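A generic beam-search sketch (the step function below is an assumed helper that returns the top next-word candidates and their log-probabilities for a partial caption; it is not part of the original code):

```python
import heapq

def beam_search(step, start_id, end_id, beam_width=3, max_len=30):
    """step(seq) -> iterable of (word_id, log_prob) extensions of `seq`."""
    beams = [(0.0, [start_id])]            # (log-probability, partial caption)
    completed = []
    for _ in range(max_len):
        candidates = []
        for log_p, seq in beams:
            for word_id, word_log_p in step(seq):
                candidates.append((log_p + word_log_p, seq + [word_id]))
        # keep only the beam_width most probable partial captions
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        completed += [b for b in beams if b[1][-1] == end_id]
        beams = [b for b in beams if b[1][-1] != end_id]
        if not beams:
            break
    # length-normalize so longer captions are not unfairly penalized
    return max(completed + beams, key=lambda c: c[0] / len(c[1]))
```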

Evaluation Metrics

BLEU

Bilingual Evaluation Understudy, a score ranging from 0 to 1. It is a precision-based metric that compares the generated sentence against the reference sentences using n-gram overlap, where n is typically 1, 2, 3 or 4 (BLEU-1 to BLEU-4).
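For illustration, BLEU-1 to BLEU-4 can be computed with NLTK roughly as follows (the captions here are made-up examples, not from the dataset):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "runs", "through", "the", "grass"],
              ["a", "brown", "dog", "is", "running", "outside"]]
candidate = ["a", "dog", "is", "running", "on", "the", "grass"]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))   # uniform weights over 1..n-grams
    score = sentence_bleu(references, candidate,
                          weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```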

GLEU

GLEU focuses on grammatical correctness.

ROUGE

It evaluates a generated sentence against a reference sentence by computing precision, recall and F-measure in different ways. Whereas the previous metrics rely on n-gram overlap, ROUGE-L considers the longest common subsequence (LCS) of words between the two sentences and computes precision, recall and F-score from it.
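As a sketch, ROUGE-L can be computed from the LCS like this (the beta value that weights recall over precision is an assumption, following the common captioning evaluation setup):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    """Precision, recall and F-measure based on the LCS of two token lists."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f_score = (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
    return precision, recall, f_score
```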

METEOR

METEOR explicitly aligns the generated caption with the ground-truth captions on a word-to-word basis.

Results

Loss curves for different CNN models when used as Encoder

The best-performing CNN encoder was InceptionV3, so it was used for further analysis.

Loss curve during training

The curves show that the model has fitted the training data well and also generalizes well to unseen data.

Results with Classical Approach

Classical Approach results

Mod-1 refers to the model without attention, using greedy search to get the best caption.

Mod-2 refers to the model without attention, using beam search to get the best caption.

Results with Attention-based approach

Evaluation of models with attention

Attention-based models performed very well, as seen from the increased scores on all metrics. Beam search also proved to be a better search strategy than greedy search.

References

  1. Vinyals O., Toshev A., Bengio S., et al.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, MA, 07–12 June, pp. 3156–3164 (2015)
  2. Xu K., Ba J., Kiros R., et al.: Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, pp. 2048–2057 (2015)
