Multi-Modal Methods: Image Captioning (From Translation to Attention)

Recent Intersections Between Computer Vision and Natural Language Processing (Part Two)

This is the second instalment of our latest publication series looking at some of the intersections between Computer Vision (CV) and Natural Language Processing (NLP). Readers are encouraged to view the piece through our website for the best experience:

Part One: Visual Speech Recognition (Lip Reading)

Part Two: Image Captioning (From Translation to Attention)

Part Three: Image Captioning (Reinforcement Learning and Beyond)

Feedback and comments are welcomed, either through Medium or directly to

If you enjoy our work, then please feel free to follow, share and clap for our team. Thanks for reading!

The M Tank

Part Two: Image Captioning (From Translation to Attention)

Figure 11: Some image captioning examples

Note from source: Visualisation of generated captions and image attention maps on the COCO dataset. Different colours show a correspondence between attended regions and underlined words, i.e. where the network focuses its attention in generating the prediction. 
Note: Cropped from the full image which provides more failure examples. For instance, the birds in the bottom left are miscounted as two instead of three.
Source: Lu et al. (2017)[51]

Introduction to Image Captioning

Suppose that we asked you to caption an image; that is, to describe the image using a sentence. This, when done by computers, is the goal of image captioning research: to train a network to accurately describe an input image by outputting a natural language sentence.

It goes without saying that the task of describing any image sits on a continuum of difficulty. Some images, such as a picture of a dog, an empty beach, or a bowl of fruit, may be on the easier end of the spectrum. Describing images of complex scenes which require specific contextual understanding (and doing this well, not just passably) proves to be a much greater captioning challenge. Providing contextual information to networks has long been both a sticking point and a clear goal for researchers to strive for.

Image captioning is interesting to us because it concerns what we understand about perception with respect to machines. The problem setting requires both an understanding of what features (or pixel context) represent which objects, and the creation of a semantic construction “grounded” to those objects.

When we speak of grounding we refer to our ability to abstract away from specifics, and instead understand what that object/scene represents on a common level. For example, we may speak to you about a dog, but all of us picture a different dog in our minds, and yet we can ground our conversation in what is common to a dog and progress forward. Establishing this grounding for machines is known as the language grounding problem.

These ideas also move in step with the explainability of results. If language grounding is achieved, then the network can tell us how a decision was reached. In image captioning a network is required not only to classify objects, but also to describe objects (including people and things) and their relations in a given image. Hence, as we shall see, attention mechanisms and reinforcement learning are at the forefront of the latest advances, and their success may one day reduce some of the decision-process opacity that harms other areas of artificial intelligence research.

We thought that the reader may benefit from a description of image captioning applications, of which there are several. Largely, image captioning may benefit the area of retrieval, by allowing us to sort and request pictorial or image-based content in new ways. There are also likely plenty of opportunities to improve quality of life for the visually-impaired with annotations, real-time or otherwise. However, we’re of the opinion that image captioning is far more than the sum of its immediate applications.

Mapping the space between images and language, in our estimation, may resonate with some deeper vein of progress which, once unearthed, could potentially lead to contextually-sophisticated machines. And, as we’ve noted before, providing contextual knowledge to machines may well be one of the key pillars that eventually support AI’s ability to understand and reason about the world like humans do.

Image captioning in a nutshell: To build networks capable of perceiving contextual subtleties in images, to relate observations to both the scene and the real world, and to output succinct and accurate image descriptions; all tasks that we as people can do almost effortlessly.

Image captioning (circa 2014)

Image captioning research has been around for a number of years, but the efficacy of techniques was limited, and they generally weren’t robust enough to handle the real world, largely due to the limits of heuristics or approximations for word-object relationships[52][53][54]. However, in 2014 a number of high-profile AI labs began to release new approaches leveraging deep learning to improve performance.

The first paper, to the best of our knowledge, to apply neural networks to the image captioning problem was Kiros et al. (2014a)[55], who proposed a multi-layer perceptron (MLP) that uses a group of word representation vectors biased by features from the image, meaning the image itself conditioned the linguistic output. The timeline of this, and other advancements from research labs, was so condensed that looking back it seems like a veritable explosion of interest. These new approaches generally:

Feed the image into a Convolutional Neural Network (CNN) for encoding, and run this encoding into a decoder Recurrent Neural Network (RNN) to generate an output sentence. The network backpropagates based on the error of the output sentence compared with the ground truth sentence calculated by a loss function like cross entropy/maximum likelihood. Finally, one can use a sentence similarity evaluation metric to evaluate the algorithm.
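
As a minimal sketch of the training signal described above, the snippet below computes the cross-entropy (negative log-likelihood) of a ground-truth caption under per-step softmax outputs. The tiny vocabulary, the logits and the function names are illustrative assumptions, not taken from any of the cited papers; in a real system the logits would come from the decoder RNN.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def caption_cross_entropy(step_logits, target_ids):
    """Sum of -log p(target word) over the time steps of one caption."""
    loss = 0.0
    for logits, target in zip(step_logits, target_ids):
        probs = softmax(logits)
        loss -= math.log(probs[target])
    return loss

# Hypothetical 4-word vocabulary and decoder scores for a 3-word caption.
vocab = ["a", "dog", "runs", "<end>"]
step_logits = [
    [2.0, 0.1, 0.1, 0.1],  # step 1: decoder favours "a"
    [0.1, 2.0, 0.5, 0.1],  # step 2: decoder favours "dog"
    [0.1, 0.2, 2.0, 0.3],  # step 3: decoder favours "runs"
]
good_caption = [0, 1, 2]  # "a dog runs"
bad_caption = [2, 0, 1]   # "runs a dog"

print(caption_cross_entropy(step_logits, good_caption))
print(caption_cross_entropy(step_logits, bad_caption))
```

The caption that agrees with the decoder's preferred words receives the lower loss; backpropagation would adjust the network to lower the loss of the ground truth caption.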

One such evaluation metric is the Bilingual Evaluation Understudy algorithm, or BLEU score. The BLEU score comes from work in machine translation, which is where image captioning takes much of its inspiration; as well as from image ranking/retrieval and action recognition. Understanding the basic BLEU score is quite intuitive.

A set of high-quality human translations are obtained for a given piece of text, and the machine’s translation is compared against these human baselines, section by section at an n-gram level[56]. Typically an output score of ‘1’ means the output matches the human translations perfectly, and a ‘0’ means that the output sentence is completely unrelated to the ground truth. The most representative metrics within Machine Translation and Image Captioning include BLEU 1-4 (n-gram with n=1-4), CIDEr[57], ROUGE_L[58] and METEOR[59]. These approaches are quite similar in that they measure syntactic similarities between two pieces of text, while each evaluation metric is designed to be correlated to some extent with human judgement.

In image captioning, however, translations are replaced with image descriptions or captions, but BLEU scores are still calculated by comparing output against human-annotated reference captions. Hence, network-generated captions are compared against a basket of human-written captions to evaluate performance.
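
To make the n-gram comparison concrete, here is a small sketch of the clipped (modified) n-gram precision that sits at the core of BLEU. It is a simplification: the full BLEU score also combines several n-gram orders with a geometric mean and applies a brevity penalty, both omitted here. The example captions and function names are our own.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Fraction of candidate n-grams found in the references, with clipping."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the most times it appears in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as a reference uses it.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Hypothetical human-written reference captions for one image.
references = [
    "a dog runs on the beach".split(),
    "a dog is running along the sand".split(),
]
plausible = "a cat runs on the beach".split()
degenerate = "the the the dog".split()

print(modified_precision(plausible, references, 1))   # 5 of 6 unigrams match
print(modified_precision(plausible, references, 2))   # 3 of 5 bigrams match
print(modified_precision(degenerate, references, 1))  # clipping limits "the"
```

Clipping is what stops the degenerate caption from scoring well simply by repeating a common word.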

In the past we’ve noted the huge effect of new datasets on research fields in AI. The arrival of the Common Objects in Context (COCO)[60] dataset in 2014 ushered in one such shift in image captioning. COCO enabled data-intensive deep neural networks to learn the mapping from images to sentences. Given a comparatively large dataset of images with multiple human-labelled descriptions of said images, coupled with new, clever architectures capable of handling image input and language output, it now became possible to train deep neural networks for end-to-end image captioning via techniques like backpropagation.

Translation of images to descriptions

In machine translation it is quite common to use ‘sequence-to-sequence’ models[61]. These models work by generating a representation through an RNN, based on an input sequence, and then feeding that output representation to a second RNN which generates another sequence. This mechanism has been particularly effective with chatbots, enabling them to process the representation of the input query and generate a coherent answer related to the input sequence (sentence).

Figure 12: Sequence-to-sequence model

Note: This is an example of a sequence-to-sequence model used to generate automatic replies to messages. The same is true for seq2seq models used for machine translation e.g. the right hand side could possibly output the translation (in another language like Japanese) of the input words (left hand side). The encoder LSTM (in green) is “unrolled” through time, i.e. every time step is represented as a different block that generates an encoded representation (sometimes controversially known as a “thought vector”) as an output. After processing each word in the input sentence, the final outputted encoding/hidden state can be used to set the initial parameters of the decoder LSTM (in blue). The illustration shows how a word is generated at every time step. The complete set of generated words is the output sequence (or sentence) of the network. 
Source: Britz (2016)[62]

CNNs can encode abstract features from images. These can then be used for classification, object detection, segmentation, and a litany of other tasks[63]. Returning to the notion of contemporaneous successes in 2014, Vinyals et al. (2014)[64] successfully used a sequence-to-sequence model in which the typical encoder LSTM[65] was replaced by a CNN. In their paper titled “Show and Tell: A Neural Image Caption Generator”, the CNN takes an input image and generates the feature representation, which is then fed to the decoder LSTM for generating the output sentence (see fig. 13).

Figure 13: CNN encoder to LSTM decoder

Note: The image is encoded into a context vector by a CNN which can then be passed to an RNN decoder. Different neural blocks of computation are combined in new ways to handle different tasks.
Source: The M Tank

A few more specifics on how the sentence is generated: at every step of the RNN, the probability distribution of the next word is output using a softmax. A slightly naive approach would be to take the word with the highest probability at each step. However, beam search represents a better approach for sentence construction. By keeping several partial sentences in play and extending each with different candidate words, beam search constructs a whole sentence without relying too heavily on any individual word which the RNN may generate at any specific time step. Beam search, therefore, can rank many different sentences according to their collective, or holistic, probability.

Figure 14: Beam search example

Source: Geeky is Awesome (2016)[66]

For example, a higher-probability sentence might be produced overall by choosing, at the first word prediction step, a word with a lower probability than the top-ranked word. A deeper explanation of beam search for sentence generation, i.e. related to the decoder portion of our example above, may be found here[67].
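
The difference between greedy decoding and beam search can be sketched with a toy example. The lookup table below stands in for the decoder RNN and its probabilities are invented purely for illustration; a beam width of 1 is equivalent to the greedy approach.

```python
import math

# Hypothetical toy "decoder": maps a prefix to next-word probabilities.
TOY_MODEL = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.5, "cat": 0.5},
    ("the",): {"dog": 0.9, "cat": 0.1},
}

def next_word_probs(prefix):
    # Any prefix not in the table simply ends the sentence.
    return TOY_MODEL.get(tuple(prefix), {"<end>": 1.0})

def beam_search(beam_width, max_len=5):
    """Return the most probable finished sentence and its probability."""
    beams = [([], 0.0)]  # (words so far, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            for word, p in next_word_probs(words).items():
                item = (words + [word], score + math.log(p))
                if word == "<end>":
                    finished.append(item)
                else:
                    candidates.append(item)
        if not candidates:
            break
        # Keep only the best `beam_width` partial sentences.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    best = max(finished or beams, key=lambda item: item[1])
    return best[0], math.exp(best[1])

print(beam_search(beam_width=1))  # greedy: commits to "a" and ends near 0.30
print(beam_search(beam_width=2))  # keeps "the" alive and ends near 0.36
```

With a width of 2 the search keeps the initially less likely word "the" and finds the sentence with the higher overall probability, which is exactly the situation described above.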

Further contemporaneous work

Around the time Show and Tell appeared, a similar, but distinct, approach was presented by Donahue et al. (2014): Long-term Recurrent Convolutional Networks for Visual Recognition and Description[68]. Instead of just using an LSTM for encoding a vector, as is typically done in sequence-to-sequence models, the feature representation is outputted by a CNN, in this case VGGNet[69], and presented to the decoder LSTM. This work was also successfully applied to video captioning, a natural extension of image captioning.

The main contribution of this work was not only this new connection setting between the CNN encoder and LSTM decoder[70], but an extensive set of experiments which stacked LSTMs to try different connection patterns. The team also assess beam search against their own random sampling method, as well as using a CNN trained on ImageNet or further fine-tuning the pre-trained network to the specific dataset used[71].

Delving deeper into the multimodal approach

From Captions to Visual Concepts and Back, by Fang et al. (2014)[72], is useful to explain multi-modality of the 2014 breakthroughs. Although distinct from the approaches of Vinyals et al. (2014)[73] and Donahue et al. (2014)[74], the paper represents an effective combination of some of these ideas[75]. For readers, the working flow of the captioning process may bring a new appreciation of the modularity of these approaches.

Figure 15: Creating captions from visual concepts

Source: Fang et al. (2014)

(I) Detect words

To begin, it is possible to squeeze more information, which is easier to interpret, out of the CNN. Looking closer at how humans would complete the task, they would notice the important objects, parts and semantics of an image and relate them within the global context of the image, all before attempting to put words into a coherent sentence. Similarly, instead of “just” using the encoded vector representation of the image, we can achieve better results by combining information contained in several regions of the image.

Using a word detection CNN, which generates bounding boxes similar to what an object detection CNN does, the different regions in the image may receive scores for many individual objects, scenes or characteristics which correspond to words in a predefined dictionary (which includes about 1000 words).

(II) Generate sentences

Next, the likelihood of the matched image descriptors (detections) is analysed according to a statistically predefined language model. E.g. if a region of the image is classified as “horse”, this information can be used as a prior to give a higher likelihood to the action of “running” over “talking” for the image captioning output. This, combined with beam search, produces a set of output sentences that are re-ranked with a Deep Multimodal Similarity Model (DMSM)[76].

(III) Re-ranking sentences

This is where the multimodal independence comes into play. The DMSM uses two independent networks: a CNN for retrieving a vector representation of the image (VGG) and a second CNN dedicated to encoding sentences. The image encoding network is based on the trained object detector from the previous section, with the addition of a set of fully connected layers to be trained for this re-ranking task. The second CNN is designed to extract a vector representation out of a given natural language sentence, which is the same size as the vector generated by the image encoding CNN. This effectively enables the mapping of language and images to the same feature space.

Since the image and encoded sentence are both represented as vectors with the same size, both networks are trained to maximise the cosine similarity between the image and the ground truth captions for the given image, as well as to reduce the similarity with a set of irrelevant captions provided.

During the inference phase, the set of output sentences generated from the language model with beam search are re-ranked with the DMSM networks and compared against each other. The caption with highest cosine similarity is selected as the final prediction.
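
The re-ranking step can be sketched as follows. The three-dimensional vectors below are invented stand-ins for the outputs of the image and sentence networks (real embeddings would be far larger), and the function names are our own.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rerank(image_vec, caption_vecs):
    """Sort candidate captions by similarity to the image, best first."""
    scored = [(cosine_similarity(image_vec, vec), text)
              for text, vec in caption_vecs.items()]
    return sorted(scored, reverse=True)

# Hypothetical embeddings in the shared image-text space.
image_vec = [0.9, 0.1, 0.3]
caption_vecs = {
    "a horse running in a field": [0.8, 0.2, 0.4],
    "a horse talking in a field": [0.1, 0.9, 0.2],
    "a field": [0.4, 0.4, 0.4],
}

for score, text in rerank(image_vec, caption_vecs):
    print(round(score, 3), text)
```

The first entry of the sorted list plays the role of the final prediction.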

Dense captioning and the lead up to attention mechanisms (circa 2015)

Considerable improvements in bounding box detectors, such as RCNN, as well as the success of BiRNNs [77] in translation, produced another approach theoretically similar to the DMSM for sentence evaluation presented before. Namely, that one can make use of two independent networks, one for text and one for image regions, that create a representation within the same image-text space. An example of such an approach is seen in the work of Karpathy and Fei-Fei (2015)[78].

Deep Visual-Semantic Alignments for Generating Image Descriptions[79], which utilises the aforementioned CNN + RNN approach for caption generation, is perhaps most responsible for popularising image captioning in the media. A large proportion of articles on image captioning tend to borrow from their excellent captioned image examples.

But more impressive than capturing the public’s attention with their research, were the strides made by Johnson, Karpathy and Fei-Fei later that year — in DenseCap: Fully Convolutional Localization Networks for Dense Captioning[80].

Figure 16: Dense captioning & labelling

Note from source: We address the Dense Captioning task (bottom right) by generating dense, rich annotations with a single forward pass.
Source: Johnson et al. (2015)[81]

We noted earlier that running a CNN into an RNN allowed the image features, and therefore their information, to be output in natural language terms. Additionally, the improvements of RCNN inspired DenseCap to use a region proposal network to create an end-to-end model for captioning, with a forward computation time reduced from 50s to 0.2s using Faster-RCNN[82].

With these technical improvements, Johnson et al. (2015) asked the question, why are we describing an image with a single caption, when we can use the diversity of the captions in each region of interest to generate multiple captions with better descriptions than an individual image caption provides?

The authors introduce a variation to the image captioning task called dense captioning where the model describes individual parts of the image (denoted by bounding boxes). This approach produces results that may be more relevant, and accurate, when contrasted with captioning an entire image with a single sentence.

Put simply, the technique resembles object detection, but instead of outputting one word, it outputs a sentence for each bounding box in a given image. Their model can also be repurposed for image retrieval, e.g. “find me an image that has a cat riding a skateboard”. In this way we see that the connection between image retrieval and image captioning is a natural and close one.

Figure 17: Dense captioning in action

Note from source: Example captions generated and localized by our model on test images. We render the top few most confident predictions. 
Source: Johnson et al. (2015)

We’ve seen improvements in information flow to the RNN, and the use of multiple bounding boxes and captions. However, if we placed ourselves in the position of captioner, how would we decide on the appropriate caption(s)? What would you ultimately deem important, or disregard, in captioning an image? What would you pay attention to?

Enter “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention” by Xu et al. (2015) [83] — the first paper, to our knowledge, that introduced the concept of attention into image captioning. The work takes inspiration from attention’s application in other sequence and image recognition problems. Building on seminal work from Kiros et al. (2014a; 2014b)[84][85], which incorporated the first neural networks into image captioning approaches, the impressive research team of Xu et al. (2015)[86] implement hard and soft attention for the first time in image captioning.

Attention as a technique, in this context, refers to the ability to weight regions of the image differently. Broadly, it can be understood as a tool to direct the allocation of available processing resources towards the most informative parts of the input signal. Rather than summing up the image as a whole, with attention the network can add more weight to the ‘salient’ parts of the image. Additionally, for each outputted word the network can recompute its attention to focus on a different part of the image.

There are multiple ways to implement attention, but Xu et al. (2015) divide the image into a grid of regions after the CNN feature extraction, and produce one feature vector for each. These features are used in different ways for soft and hard attention:

  • In the soft attention variant, each region’s feature vector receives a weight (can be interpreted as the probability of focusing at that particular location) at every time step of the decoding RNN which signifies the relative importance of that region in order to generate the next word. The MLP (followed by a softmax), which is used to calculate these weights, is a deterministic part of the computational graph and therefore can be trained end-to-end as a part of the whole system using backpropagation as usual.
  • With hard attention only a single region is sampled from the feature vectors at every time step to generate the output word (using probabilities calculated similarly as mentioned before). This prevents the network training by backpropagation due to the stochasticity of sampling.
Training is instead completed using the final loss/reward (obtained from the sampled trajectory of chosen regions) as an approximation of the expected reward to be obtained from the MLP which, most importantly, can then be used to calculate the gradients. The same MLP is again used to calculate these probabilities[87]. The idea of sampling an attention trajectory as an estimation was taken from a Reinforcement Learning algorithm called REINFORCE[88]. The next part of this publication will deal with Reinforcement Learning applied to image captioning in different ways and with greater detail.
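
The two variants can be contrasted in a short sketch. Here the region scores are given as fixed, hypothetical numbers, whereas in the paper they are produced by the MLP; the feature vectors and function names are likewise our own. The last function shows the general score-function (REINFORCE-style) gradient of the log-probability of a sampled region with respect to the scores, scaled by a reward; it is an illustration of the idea rather than the authors' exact training procedure.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(region_features, scores):
    """Deterministic: weighted average of all region feature vectors."""
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, region_features))
               for d in range(dim)]
    return weights, context

def hard_attention(region_features, scores, rng):
    """Stochastic: sample exactly one region according to the weights."""
    weights = softmax(scores)
    index = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return index, region_features[index]

def reinforce_score_gradient(scores, sampled_index, reward):
    """reward * d log p(sampled region) / d scores, for a softmax policy."""
    probs = softmax(scores)
    return [reward * ((1.0 if i == sampled_index else 0.0) - p)
            for i, p in enumerate(probs)]

# Three hypothetical regions with 2-dimensional features.
region_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, 0.0]

weights, context = soft_attention(region_features, scores)
index, picked = hard_attention(region_features, scores, random.Random(0))
grad = reinforce_score_gradient(scores, index, reward=1.0)
print(weights, context)
print(index, picked, grad)
```

Soft attention blends every region into the context vector, so gradients flow through the weights directly; hard attention commits to a single region, which is why a sampled estimate of the gradient is needed instead.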

Figure 18: Attention in action

Note from source: Examples of attending to the correct object (white indicates the attended regions, underlines indicated the corresponding word). 
Source: Xu et al. 2015

Incorporating attention allows the decoder to focus on specific parts of the input representation for each of the outputted words. This means that, in converting aspects of the image to captions, the network can choose where and when to focus in relation to specific words outputted during sentence generation. Such techniques not only improve network performance, but also aid interpretability; we have a better understanding of how the network determined its answer. As we shall see, attention mechanisms have grown in popularity since their inception.

Attention variants and interpretability

Attention and its variants come in many forms: semantic, spatial and multi-layer attention, as well as hard, soft, bottom-up, top-down, adaptive, visual, text-guided, and so on. We feel that attention, while a newer technique for handling multi-modal problems, has the potential to be somewhat revolutionary.

Such techniques not only allow neural networks to tackle previously insurmountable problems, but also aid network interpretability; a key area of interest as AI permeates our societies. For those wishing to know more about attention, beyond the limited areas that we touch upon, there is an excellent Distill article from Olah and Carter (2016)[89] available from here, and another by Denny Britz (2016)[90] available here.

Attention can enable our inspection and debugging of networks. It can provide functional insights, i.e. which parts of the image the network is ‘looking at’. Each form of attention, as we’ll see, has its own unique characteristics.

  • Image Captioning with Semantic Attention (You et al., 2016)[91]
    You et al. (2016) note that traditional approaches to image captioning are either ‘top-down, moving from a gist of an image which is converted to words, or bottom-up, which generate words describing various aspects of an image and then combine them’[92]. However, their contribution is the introduction of a novel algorithm that combines both of the aforementioned approaches, and learns to selectively attend. This is achieved through a model of semantic attention, which combines semantic concepts and the feature representation of the image/encoding.
Semantic attention refers to the technique of focusing on semantically important concepts, i.e. objects or actions which are integral to constructing an accurate image caption. In spatial attention the focus is placed on regions of interest; but semantic attention relates attention to the keywords used in the caption as it’s generated.

There are several important differences, by the authors’ own admission, between their use of semantic attention and previous use-cases in image captioning. Comparing this work to Xu et al. (2015)[93], their attention algorithm learns to attend to the specific word concepts found within an image rather than words defined from specific spatial locations. It is important to note that some concepts or words may not be directly related to a specific region, e.g. the word “exciting” which may encompass the entire image. This is the case even with concepts that are not directly seen in the image, and can be expanded by ‘leveraging external image data for training additional visual concepts as well as external text data for learning semantics between words’[94].

Figure 19: Semantic attention framework

Note from source: Top — an overview of the proposed framework. Given an image, we use a convolutional neural network to extract a top-down visual feature and at the same time detect visual concepts (regions, objects, attributes, etc.). We employ a semantic attention model to combine the visual feature with visual concepts in a recurrent neural network that generates the image caption. 
Bottom — We show the changes of the attention weights for several candidate concepts with respect to the recurrent neural network iterations.
Source: You et al. (2016)[95]
  • Next we introduce the concept of adaptive attention from Lu et al. (2017) [96]. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning produced new benchmarks for state-of-the-art on both the COCO and Flickr30K datasets. Instead of forcing visual attention to be active for each generated word, Lu et al. (2017) reason that certain words in a sentence do not relate to the image, such as ‘the’, ‘of’, etc. Through the use of a visual sentinel the model learns when to use attention. Adaptive attention may also vary the amount of attention supplied for each word.

The visual sentinel is a latent representation of what the decoder already knows. As an extension of a spatial attention model, it determines whether the model must attend to the image to predict the next word. We mentioned that words like ‘a’, ‘it’ and ‘of’ may be seen as not worth attending to; but words like ‘ball’, ‘man’ and ‘giraffe’ are not only worth attending at a point in time (sentinel), but also in a particular part of the image (spatial).

“At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel. The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation”[97].
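
One way to picture the sentinel is as an extra candidate that competes with the image regions for attention. The sketch below follows that general shape: a softmax over the region scores plus one sentinel score yields a gate, which then mixes the sentinel vector with the spatial context vector. All scores and vectors are hypothetical values chosen for illustration, and the code is our simplification rather than the authors' implementation.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_context(region_features, region_scores, sentinel, sentinel_score):
    """Mix the visual sentinel with the spatially attended image context."""
    # The sentinel competes with the image regions in a single softmax.
    weights = softmax(region_scores + [sentinel_score])
    beta = weights[-1]  # share of attention given to the sentinel
    # Spatial context: attention normalised over the image regions only.
    spatial = softmax(region_scores)
    dim = len(sentinel)
    context = [sum(a * feat[d] for a, feat in zip(spatial, region_features))
               for d in range(dim)]
    mixed = [beta * sentinel[d] + (1.0 - beta) * context[d] for d in range(dim)]
    return beta, mixed

region_features = [[1.0, 0.0], [0.0, 1.0]]
region_scores = [1.0, 0.5]
sentinel = [0.5, 0.5]

# A function word such as "of": the sentinel score dominates.
beta_of, _ = adaptive_context(region_features, region_scores, sentinel, 4.0)
# A visual word such as "giraffe": the image regions dominate.
beta_giraffe, _ = adaptive_context(region_features, region_scores, sentinel, -2.0)
print(beta_of, beta_giraffe)
```

A gate near 1 means the next word is generated mostly from what the decoder already knows, while a gate near 0 means the model is looking at the image.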

Figure 20: Visualisation of caption generation

Note from source: Visualization of generated captions, visual grounding probabilities of each generated word, and corresponding spatial attention maps produced by our model. 
Note: Knowing where and when, and how much to look. The probabilities graphed change depending on the ‘importance’ of the word, i.e. the attention that should be given to an image section when generating the sentence. 
Source: Lu et al. (2017)[98]
  • Another interesting piece is SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning from Chen et al. (2017) [99]. The authors go all out and make use of spatial, semantic, multilayer and multi-channel attention in their CNN architecture, while also gently admonishing the use of traditional spatial attention mechanisms.
Attention is usually applied spatially to the final layer outputted by the encoder CNN, treating all channels the same to calculate where the attention should focus on, i.e. the usual attention model generates output sentences by only attending to specific spatial areas within the final convolutional layer.
“It is worth noting that each CNN filter performs as a pattern detector, and each channel of a feature map in CNN is a response activation of the corresponding convolutional filter. Therefore, applying an attention mechanism in channelwise manner can be viewed as a process of selecting semantic attributes”[100].
‘CNN features are naturally spatial, channel-wise and multi-layer’ and the authors take full advantage of this natural design, applying attention to multiple layers within the CNN and to the individual channels within each layer.
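
As a rough sketch of what weighting channels as well as locations looks like, the snippet below rescales a small feature map first per channel and then per spatial location. It is a simplification under stated assumptions: both sets of scores are fixed, hypothetical numbers, whereas in the paper the attention weights are computed by the network, and the function names are our own.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def channel_then_spatial(feature_map, channel_scores, spatial_scores):
    """Rescale a (channels x locations) feature map by both kinds of weights."""
    beta = softmax(channel_scores)   # one weight per channel ("what")
    alpha = softmax(spatial_scores)  # one weight per location ("where")
    return [[b * a * value for a, value in zip(alpha, channel)]
            for b, channel in zip(beta, feature_map)]

# Hypothetical feature map: 2 channels, 3 flattened spatial locations.
feature_map = [[1.0, 1.0, 1.0],
               [1.0, 1.0, 1.0]]
channel_scores = [1.0, 0.0]       # favour channel 0
spatial_scores = [0.0, 2.0, 0.0]  # favour location 1

attended = channel_then_spatial(feature_map, channel_scores, spatial_scores)
for row in attended:
    print(row)
```

With an all-ones feature map the output simply shows the combined weights: the largest value sits at channel 0, location 1.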

Their approach was applied to the usual datasets of Flickr8k, Flickr30k and COCO, and a thorough analysis of the different attention variants was undertaken. The authors note improvements in metrics both through combinations of attention variants, or with a single type, e.g. Spatial vs Channel vs Spatial + Channel. They also vary how many final layers the network should attend to (1–3), and extend this to different feature extractors, e.g. a VGG network (with attended layers being chosen from the “conv5_4, conv5_3 or conv5_2” convolution layers) or a ResNet.

The TencentVision team are leading the COCO captioning leaderboard at present. [101] According to the leaderboard, their entry description reads “multi-attention and RL”. When contrasted with the original paper, one must conclude that an approach which incorporates Reinforcement Learning techniques constitutes a variation on the original approach. However, we could not find a publication detailing these additions as of yet[102].
  • In 2017, Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering from Anderson et al. (2018)[103] proposed a more natural method for attention, inspired by neuroscience. In a paper that truly explores the difference between bottom-up and top-down attention, the authors examine attention in depth and present a method to efficiently fuse information from both types.
“In the human visual system, attention can be focused volitionally by top-down signals determined by the current task (e.g., looking for something), and automatically by bottom-up signals associated with unexpected, novel or salient stimuli”[104].

Following from this definition, bottom-up attention is applied to a set of specific spatial locations which are generated by an object detection CNN. These salient spatial regions are typically defined by a grid on the image, but here they compute bottom-up attention over all bounding boxes where the detection network finds a region of interest. Specifically, each region of interest is weighted differently by a scaling/alpha factor and these are summed into a new vector which is passed into the language model LSTM[105].

On the other hand, top-down attention uses an LSTM with visual information [106], as well as task-specific context input, to generate its own weighted value of these features. The previously generated word, the hidden state from the language model LSTM, and the image features averaged across all objects are used to generate the top-down attention output.
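
The three inputs named above can be assembled as in the sketch below. The vectors are tiny, hypothetical stand-ins, the number of detected boxes is arbitrary, and the function names are our own; the snippet only illustrates the mean-pooling and concatenation, not the LSTM itself.

```python
def mean_pool(region_features):
    """Average the feature vectors of all detected regions."""
    count = len(region_features)
    dim = len(region_features[0])
    return [sum(feat[d] for feat in region_features) / count for d in range(dim)]

def attention_lstm_input(prev_language_hidden, region_features, prev_word_embedding):
    """Concatenate the three signals fed to the top-down attention LSTM."""
    return prev_language_hidden + mean_pool(region_features) + prev_word_embedding

# Hypothetical values: 3 detected boxes with 2-dimensional features.
region_features = [[1.0, 3.0], [3.0, 5.0], [2.0, 4.0]]
prev_language_hidden = [0.1, 0.2]
prev_word_embedding = [0.5]

print(attention_lstm_input(prev_language_hidden, region_features, prev_word_embedding))
```

Because the pooled vector has a fixed size, the same input layout works regardless of how many regions the detector proposes for a given image.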

Using the same attention methodology, Anderson et al. (2018) managed to make strides in two different tasks, i.e. both image captioning and VQA [107]. Their approach is currently second on the COCO captioning leaderboard [108], achieving SOTA scores on the MSCOCO test server with CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively[109].

“Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge”[110].

Although attention and its variants represent quite a large body of impressive work, finally we turn our own attention, limited by space, to our two favourite pieces of research to date:

  • MAT: A Multimodal Attentive Translator for Image Captioning from Liu et al. (2017)[111]. Liu et al. (2017) decide to pass the input image as a sequence of detected objects to the RNN for sentence generation, as opposed to the favoured approach of having the whole image encoded by a CNN into a fixed-size representation. They also introduce a sequential attention layer which takes all encoded hidden states in to consideration when generating each word.
“To represent the image in a sequential way, we extract the object’s features in the image and arrange them in a order using convolutional neural networks. To further leverage the visual information from the encoded objects, a sequential attention layer is introduced to selectively attend to the objects that are related to generate corresponding words in the sentences”[112].
  • Text-guided Attention Model for Image Captioning, from Mun et al. (2016)[113], proposes a model which “combines visual attention with a guidance of associated text language”, i.e. during training they use the training caption to help guide the model to attend to the correct visual regions. At test time, their model can also use the top candidate sentences to guide attention. This method seems to deal well with cluttered scenes.
“To the best of our knowledge, the proposed method is the first work for image captioning that combines visual attention with a guidance of associated text language”[114].
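The sequence-of-objects idea behind MAT can be sketched in a few lines of Python. This is a rough illustration under our own assumptions, not the paper's model: we substitute a plain tanh RNN for the LSTM, use dot-product scoring for the attention step, and the function names, sizes and random object features are all hypothetical.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rnn_encode(objects, W_x, W_h):
    """Run a plain tanh RNN over the object sequence, keeping every hidden state."""
    h = [0.0] * len(W_h)
    states = []
    for x in objects:
        h = [math.tanh(dot(wx, x) + dot(wh, h)) for wx, wh in zip(W_x, W_h)]
        states.append(h)
    return states


def attend_over_states(states, decoder_state):
    """Weight all encoder states by similarity to the decoder state.

    Dot-product scoring is an assumption made for this sketch.
    """
    scores = [dot(s, decoder_state) for s in states]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * s[j] for w, s in zip(weights, states))
               for j in range(len(decoder_state))]
    return weights, context


rng = random.Random(1)
N_OBJECTS, FEAT, HID = 4, 6, 5  # hypothetical sizes

# Stand-ins for per-object CNN features, already arranged in some order.
objects = [[rng.uniform(-1, 1) for _ in range(FEAT)] for _ in range(N_OBJECTS)]
W_x = [[rng.uniform(-0.5, 0.5) for _ in range(FEAT)] for _ in range(HID)]
W_h = [[rng.uniform(-0.5, 0.5) for _ in range(HID)] for _ in range(HID)]

encoder_states = rnn_encode(objects, W_x, W_h)
decoder_state = encoder_states[-1]  # here the decoder simply starts from the last encoder state
weights, context = attend_over_states(encoder_states, decoder_state)

print("attention over objects:", [round(w, 3) for w in weights])
```

The point of the sketch is the contrast with a fixed-size image encoding: every object contributes its own hidden state, and the decoder can re-weight all of them each time it generates a word.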


  1. [51] Lu et al. (2017). Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. [Online] arXiv: 1612.01887. Available: arXiv:1612.01887v2
  2. [52] Kiros et al. (2014a): “Unlike many of the existing methods, our approach can generate sentence descriptions for images without the use of templates, structured prediction, and/or syntactic trees.”
  3. [53] Farhadi et al. (2010). Every Picture Tells a Story: Generating Sentences from Images. In: Daniilidis K., Maragos P., Paragios N. (eds) Computer Vision — ECCV 2010. ECCV 2010. Lecture Notes in Computer Science, vol 6314. Springer, Berlin, Heidelberg. Available:
  4. [54] Kulkarni et al. (2013). BabyTalk: Understanding and Generating Simple Image Descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, No. 12, December. Available:
  5. [55] Kiros et al. (2014a). Multimodal Neural Language Models. Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):595–603. Available:
  6. [56] N-grams refer to the breakdown of a sequence of textual data into consecutive groups of symbols (e.g. words or letters). For example, in word bigrams (n=2), the sentence “A man riding a horse” breaks down to “A man”, “man riding”, “riding a”, etc. These can then be used by specific evaluation metrics which give higher scores for output sentences that have more words in the same order as the ground truth.
  7. [57] Vedantam et al. (2014). CIDEr: Consensus-based Image Description Evaluation. [Online] arXiv: 1411.5726. Available: arXiv:1411.5726v2 (2015 version).
  8. [58] ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation — Lin, C.Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL 2004, Barcelona, Spain. Available:
  9. [59] Banerjee, S., Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. [Online] Language Technologies Institute, Carnegie Mellon University ( Available:
  10. [60] (2018) COCO: Common Objects in Context. [Website] Available:
  11. [61] Sutskever et al. (2014). Sequence to Sequence Learning with Neural Networks. [Online] arXiv: 1409.3215. Available: arXiv:1409.3215v3
  12. [62] Britz, D. (2016). Deep Learning for Chatbots, Part 1 — Introduction. [Blog] WildML ( Available:
  13. [63] For shameless self-promotion, see previous report: “A Year in Computer Vision”. Available:
  14. [64] Vinyals et al. (2014). Show and Tell: A Neural Image Caption Generator. [Online] arXiv: 1411.4555. Available: arXiv:1411.4555v2
  15. [65] LSTM (Long Short-Term Memory): a type of Recurrent Neural Network (RNN).
  16. [66] Geeky is Awesome. (2016). Using beam search to generate the most probable sentence. [Blog] Geeky is Awesome ( Available:
  17. [67] ibid
  18. [68] Donahue et al. (2014). Long-term Recurrent Convolutional Networks for Visual Recognition and Description. [Online] arXiv: 1411.4389. Available: arXiv:1411.4389v4 (2016 version)
  19. [69] Visual Geometry Group Network (VGGNet), a type of neural network named after the research group who created it.
  20. [70] This type of architecture was also tested by Vinyals et al. (2014) in their paper without success.
  21. [71] Fine-tuning is a common practice which consists of training on a richer/larger dataset first (i.e. pretraining), and then retraining on the target dataset, since the generality of the features learned on the first dataset can often be used and transferred to some extent to the target dataset.
  22. [72] Fang et al. (2014). From Captions to Visual Concepts and Back. [Online] arXiv: 1411.4952. Available: arXiv:1411.4952v3 (2015 version)
  23. [73] Vinyals et al. (2014). Show and Tell: A Neural Image Caption Generator. [Online] arXiv: 1411.4555. Available: arXiv:1411.4555v2 (2015 version).
  24. [74] Donahue et al. (2014). Long-term Recurrent Convolutional Networks for Visual Recognition and Description. [Online] arXiv: 1411.4389. Available: arXiv:1411.4389v4 (2016 version)
  25. [75] Interestingly, although the authors hail primarily from Microsoft Research, contributions were made by researchers in Facebook AI Research (FAIR) and Google as well.
  26. [76] Fang et al. (2014). From Captions to Visual Concepts and Back. [Online] arXiv: 1411.4952. Available: (2015 version).
  27. [77] For more information see Part One: Visual Speech Recognition (Lip Reading). Medium:
  28. [78] Karpathy, A., Fei-Fei, L. (2015). Deep Visual-Semantic Alignments for Generating Image Descriptions. [Online] Stanford Computer Science Department ( Available: For additional code, etc., see project page:
  29. [79] ibid
  30. [80] Johnson, J., Karpathy, A., Fei-Fei, L. (2015). DenseCap: Fully Convolutional Localization Networks for Dense Captioning. [Online] arXiv: 1511.07571. Available: arXiv:1511.07571v1
  31. [81] ibid
  32. [82] CNN computation only according to: Ren et al. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. [Online] arXiv: 1506.01497. Available: arXiv:1506.01497v3 (2016 version).
  33. [83] Xu et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. [Online] arXiv: 1502.03044. Available: arXiv:1502.03044v3 (2016 version)
  34. [84] Kiros et al. (2014). Multimodal Neural Language Models. Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):595–603. Available:
  35. [85] Kiros et al. (2014). Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. [Online] arXiv: 1411.2539. Available: arXiv:1411.2539v1
  36. [86] The authors from Kiros et al. (2014) are present in the author list for Xu et al. (2015).
  37. [87] A Multinoulli distribution is parameterised by the MLP and therefore, it can be sampled from. Using Monte-Carlo sampling, the final result from the sampling operation is that 1 region gets the entire weight and the rest get a weight of 0.
  38. [88] Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. In: Machine Learning, 8, pp. 229–256. Available:
  39. [89] Olah, C., Carter, S. (2016). Attention and Augmented Recurrent Neural Networks. [Online] Distill ( Available:
  40. [90] Britz, D. (2016). Attention and Memory in Deep Learning and NLP. [Blog] WildML ( Available:
  41. [91] You et al. (2016). Image Captioning with Semantic Attention. [Online] arXiv: 1603.03925. Available: arXiv:1603.03925v1
  42. [92] ibid pg. 1
  43. [93] Xu et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. [Online] arXiv: 1502.03044. Available: arXiv:1502.03044v3 (2016 version)
  44. [94] ibid pg. 2
  45. [95] ibid
  46. [96] Lu et al. (2016). Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. [Online] arXiv: 1612.01887. Available: arXiv:1612.01887v2 (2017 version).
  47. [97] ibid pg. 1
  48. [98] ibid
  49. [99] Chen et al. (2016). SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning. [Online] arXiv: 1611.05594. Available: arXiv:1611.05594v2
  50. [100] ibid
  51. [101] COCO. (2018). Captioning Leaderboard. [Website] Common Objects in Context ( Available:
  52. [102] The approach used is possibly somewhat analogous to the approaches featured in part three of this publication, where we will dive into the details of how RL is used increasingly within Image Captioning.
  53. [103] peteanderson80 (GitHub). (2018). Up-Down-Captioner. [Online] Automatic Image Captioning Model by PeteAnderson80 ( Available: See publication: Anderson et al. (2017). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. [Online] arXiv: 1707.07998. Available: arXiv:1707.07998v3 (2018 version)
  54. [104] ibid
  55. [105] Specifically, they use a pretrained ResNet-101 with a Faster RCNN model to output these regions of interest. In general, this paper’s approach is related to the previously mentioned papers that used bottom-up information. Usually bottom-up is completed and described by generating words (visual concepts, attributes) from the image which can then be combined into sentences using language models. However, the attribute and class information is implicitly contained within the Faster RCNN model for this paper.
  56. [106] An averaged version of all input vectors generated from the object detector CNN is also used for bottom-up attention.
  57. [107] Visual Question Answering is also a multi-modal type of problem.
  58. [108] See name “panderson@MSR/ACRV”, one behind the TencentVision team seen previously.
  59. [109] One interesting point to mention from the paper is that they first train the system using cross-entropy loss (XE) as usual, but then fine-tune the network by directly optimising the non-differentiable CIDEr metric using an algorithm similar to REINFORCE from the field of Reinforcement Learning (RL). We will dive into the details of how RL is used increasingly within Image Captioning in the next part of this publication.
  60. [110] Anderson et al. (2017). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. [Online] arXiv: 1707.07998. Available: arXiv:1707.07998v3 (2018 version)
  61. [111] Liu et al. (2017). MAT: A Multimodal Attentive Translator for Image Captioning. [Online] arXiv: 1702.05658. Available: arXiv:1702.05658v3
  62. [112] ibid pg. 1
  63. [113] Mun et al. (2016). Text-guided Attention Model for Image Captioning. [Online] arXiv: 1612.03557. Available: arXiv:1612.03557v1
  64. [114] ibid pg. 1