Image Caption Generation with Visual Attention
Introduction
Image captioning involves automatically generating a natural language description of the objects present in an image and of their relationships with the environment. It requires a semantic understanding of the image, so that the generated sentence is both linguistically plausible and semantically truthful.
Dataset
The dataset used in this project was Flickr8k. It is realistic and relatively small, so it is possible to download it and build models on a workstation using only a CPU. You can find more about the dataset in the 2013 paper “Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics”.
The authors describe the dataset as follows:
We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
Classic Image Captioning Model
This architecture was proposed by Vinyals et al. in their CVPR 2015 paper “Show and Tell”. It is an encoder-decoder architecture: a Convolutional Neural Network pre-trained on the ImageNet dataset is used as the encoder to produce a fixed-length vector representation of the image, and an LSTM is used as the decoder to generate the caption one word at a time.
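As a rough illustration, the sketch below wires a pre-trained CNN encoder to an LSTM decoder in TensorFlow/Keras. The vocabulary size, layer sizes, and the way the image embedding initializes the LSTM state are illustrative assumptions, not the exact training code from this project:

```python
import tensorflow as tf

# Encoder: a CNN pre-trained on ImageNet (InceptionV3 here), with the classification
# head removed, producing a fixed-length image embedding.
cnn = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet", pooling="avg")
cnn.trainable = False

# Hypothetical sizes for illustration only.
vocab_size, embed_dim, lstm_units = 5000, 256, 512

# Images are assumed to be resized to 299x299 and preprocessed for InceptionV3.
image_input = tf.keras.Input(shape=(299, 299, 3))
image_embedding = tf.keras.layers.Dense(embed_dim, activation="relu")(cnn(image_input))

# Decoder: an LSTM whose initial state is derived from the image embedding;
# it predicts the next word at every position of the caption.
caption_input = tf.keras.Input(shape=(None,), dtype="int32")
word_embeddings = tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(caption_input)
init_h = tf.keras.layers.Dense(lstm_units)(image_embedding)
init_c = tf.keras.layers.Dense(lstm_units)(image_embedding)
lstm_out = tf.keras.layers.LSTM(lstm_units, return_sequences=True)(
    word_embeddings, initial_state=[init_h, init_c]
)
next_word_logits = tf.keras.layers.Dense(vocab_size)(lstm_out)

model = tf.keras.Model([image_input, caption_input], next_word_logits)
```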
Loss Function
The model uses the sum of the negative log-likelihoods of the correct word at each step as the loss function.
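Concretely, for an image I and a caption S = (S_1, …, S_N), the training objective from the Show and Tell paper is the negative log-likelihood of the caption:

$$ L(I, S) = -\sum_{t=1}^{N} \log p(S_t \mid I, S_1, \dots, S_{t-1}) $$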
Issues with the Classical Model
This model uses the entire representation of the image to condition the generation of each word, and hence cannot focus on different parts of the image when generating different words.
Attention for Image Captioning
Visual attention for image captioning was first introduced by Xu et al. (2015) in their paper “Show, Attend and Tell”. The work takes inspiration from the application of attention to other sequence modeling and image recognition problems.
The model looks at the “relevant” part of the image when generating each word.
Model Architecture
The architecture is similar to that of the classical model, but with a new attention layer. The attention mechanism looks at the feature map generated by the CNN and decides which parts are relevant to the decoder at each step.
At a particular time step, the decoder GRU takes the hidden state from the previous time step, the context vector produced by the attention mechanism, and the output of the previous step, and combines them to update its hidden state.
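A minimal sketch of this soft (Bahdanau-style) attention step in TensorFlow/Keras; the layer sizes and names are illustrative assumptions rather than the exact implementation used here:

```python
import tensorflow as tf

class SoftAttention(tf.keras.layers.Layer):
    """Additive (Bahdanau-style) attention over CNN feature-map locations."""
    def __init__(self, units):
        super().__init__()
        self.W_feat = tf.keras.layers.Dense(units)    # projects CNN features
        self.W_hidden = tf.keras.layers.Dense(units)  # projects decoder hidden state
        self.V = tf.keras.layers.Dense(1)             # scores each spatial location

    def call(self, features, hidden):
        # features: (batch, num_locations, feature_dim) from the CNN feature map
        # hidden:   (batch, units) previous decoder hidden state
        hidden = tf.expand_dims(hidden, 1)                                    # (batch, 1, units)
        scores = self.V(tf.nn.tanh(self.W_feat(features) + self.W_hidden(hidden)))
        weights = tf.nn.softmax(scores, axis=1)                               # attention over locations
        context = tf.reduce_sum(weights * features, axis=1)                   # (batch, feature_dim)
        return context, weights
```

The returned context vector is what the decoder combines with the previous hidden state and previous word to produce the next word.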
Beam Search
Beam search keeps the top n candidate captions at each step while searching for the best caption, unlike greedy search, which only keeps the single best word. Beam search can increase the overall probability of the generated sentence and is therefore widely used.
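A simple sketch of the idea, assuming a hypothetical next_word_log_probs(sequence) function that returns log-probabilities for the next word given a partial caption:

```python
import heapq

def beam_search(next_word_log_probs, start_token, end_token, beam_width=3, max_len=20):
    # Each beam entry is (cumulative log-probability, token sequence).
    beams = [(0.0, [start_token])]
    for _ in range(max_len):
        candidates = []
        for log_p, seq in beams:
            if seq[-1] == end_token:        # finished captions are carried over unchanged
                candidates.append((log_p, seq))
                continue
            for word, word_log_p in next_word_log_probs(seq).items():
                candidates.append((log_p + word_log_p, seq + [word]))
        # Keep only the beam_width highest-scoring partial captions.
        beams = heapq.nlargest(beam_width, candidates, key=lambda x: x[0])
    return max(beams, key=lambda x: x[0])[1]
```

Greedy search is the special case beam_width = 1.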
Evaluation Metrics
BLEU
The Bilingual Evaluation Understudy score ranges from 0 to 1. It is a precision-based metric that compares n-gram overlap between the generated and reference sentences, where n indicates the n-gram order (1, 2, 3, 4).
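For example, cumulative BLEU-1 to BLEU-4 can be computed with NLTK; the captions below are made-up examples:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "runs", "on", "the", "beach"]]        # tokenized reference captions
candidate = ["a", "dog", "is", "running", "on", "the", "beach"]  # tokenized generated caption

smooth = SmoothingFunction().method1  # avoids zero scores when an n-gram order has no matches
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights => cumulative BLEU-n
    print(f"BLEU-{n}:", sentence_bleu(references, candidate, weights=weights, smoothing_function=smooth))
```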
GLEU
GLEU is a BLEU-like metric that focuses on grammatical correctness.
ROUGE
ROUGE evaluates a generated sentence against a reference sentence and computes precision, recall, and F-measure in different ways. While the previous metrics work with n-grams, ROUGE-L considers the longest common subsequence (LCS) of words between the two sentences and computes precision, recall, and F-score from it.
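One way to compute ROUGE-L, assuming the rouge-score package from Google Research is installed (the example captions are again made up):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score("a dog runs on the beach",        # reference caption
                      "a dog is running on the beach")  # generated caption
print(scores["rougeL"])  # Score(precision=..., recall=..., fmeasure=...)
```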
METEOR
METEOR explicitly aligns the generated caption with the ground-truth captions on a word-by-word basis.
Results
The best-performing CNN encoder was InceptionV3, so it was used for further analysis.
The training and validation loss curves show that the model has fit well and also performs well on unseen data.
Results with Classical Approach
Mod-1 refers to the model without attention, using greedy search to get the best caption.
Mod-2 refers to the model without attention, using beam search to get the best caption.
Results with Attention-based approach
Attention-based models performed very well, as seen from the increased scores on all metrics. Beam search also proved to be a better search strategy than greedy search.
References
- Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and Tell: A Neural Image Caption Generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, pp. 3156–3164 (2015)
- Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In: Proceedings of the International Conference on Machine Learning (ICML) (2015)