Image Captioning: A Retrospect on Our Latest Study

The amount of data generated each day is tremendous. With more data comes more information. By extracting the precious information hidden in the piles of big data and employing new paradigms, researchers can now solve quite challenging problems such as describing what is out there in a scene.

Recently, this particular problem has garnered a lot of attention from both the computer vision and natural language processing communities. It sits at the intersection of two active research areas, and its challenges seem to have caught the attention of many researchers.

But what makes this problem uniquely interesting? The answer lies in its goal, which probably explains why it is so important to Artificial Intelligence in general. The basic goal in image captioning is to describe a scene the way a human would describe it. Although we are not there yet, with the recent progress we can say that we are not too far from it either.

It’s a challenging problem, one needs to admit. To start with, there are several things to consider if you want to pass the Turing test. For example, people do not always mention all of the image contents when describing it; they tend to mention the things that are important to them. Moreover, people often provide quite different descriptions for the same scene, as they do not always agree on what is important about it. These are just a few things to keep in mind. But the real challenge is the quality of the descriptions the algorithms produce, considering that we want to convince people that these descriptions come from humans, not machines. That is not as easy as it may sound: humans are quite good at spotting things that are non-human, in other words robotic. See for yourself, and try to guess which caption was authored by a human and which by a machine.

  1. There are one cow and one sky. The golden cow is by the blue sky.
  2. A young highlander cow stands in a pasture.

The above image was taken from the UIUC Pascal Sentence dataset. The first caption was generated by BabyTalk of Berg et al. (CVPR, 2011); the second, however, was written by a human.

Image captioning studies fall into two broad categories. In the first category the main idea is to retrieve a relevant description from a set of descriptions. In the second category the main idea is to generate a description from scratch. We recently did a study on the former, which we presented at ACL 2015.

In our study we proposed a query expansion scheme to improve the quality of retrieval-based image captioning approaches. The main idea is to synthesize a new query from the captions of visually similar images and to re-rank the candidate captions against this synthesized query.

Now let’s repeat the same test and try to find which caption is provided by a human and which caption by a machine.

  1. A construction crew in orange vests working near train tracks
  2. A man in an orange vest and yellow hard hat looks on as a yellow vehicle lays track

It is getting harder, right? The main reason is the approach we used here. Actually, both captions were written by humans, but the first one was chosen by our algorithm from among several candidate captions. Our algorithm believes that this caption describes the scene. Apparently, it does.

It is hard to make a fair comparison between generative and retrieval models, as they do quite different things. Generative models aim to create a caption from scratch, whereas retrieval models aim to find the most suitable caption for the scene. Each has its own challenges, which I believe would be off topic for this short blog post. We follow a retrieval-based approach in our image captioning study, which is why the above description was selected from among several image captions.

At this point, you might wonder how we selected this description for that image. You should check the original paper for the details, but to keep this blog post concise, I will try to explain how we did that in very simple terms :)

Retrieving Visually Similar Images

Let’s rephrase our problem here once more. We want to describe a scene, and we want to do it well. In order to come up with a good description for that scene, we first need to understand it visually. This is at least what humans do when they want to describe a scene: our vision system starts working unconsciously and determines the contents of the scene. For a machine, it is not that easy. But we can at least follow a simple strategy to find out the contents of an image without relying heavily on parsing them. Here we assume that visually similar images share similar contents and attributes, such as illumination, weather, objects, etc.

To represent images, we use the output of a deep learning model: a 16-layer VGG network trained on ImageNet, from which we take the 4096-dimensional fc7 activations. We compute the Euclidean distance between each image in our dataset and the query scene, and then apply an adaptive neighborhood selection to find the top N images that are visually similar to the query scene. Note that N changes with the density of the candidates. At this point we have a set of images that look very much like the scene we want to describe.

Visual retrieval and adaptive neighborhood selection. The algorithm determines the number of candidates to be selected adaptively based on the density as seen in A and B
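To make the retrieval step a bit more concrete, here is a minimal sketch in Python with NumPy. It is not our actual implementation: the function name `retrieve_similar` and the parameters `n_max` and `ratio` are hypothetical, the toy features are 2-dimensional instead of 4096-dimensional, and the ratio-based cut-off is only a stand-in for the adaptive neighborhood selection, whose exact rule is described in the paper.

```python
import numpy as np

def retrieve_similar(query_feat, dataset_feats, n_max=5, ratio=1.2):
    """Return (indices, distances) of visually similar images, nearest first.

    query_feat:    (d,) feature vector of the query image.
    dataset_feats: (m, d) feature vectors of the captioned images.
    n_max, ratio:  hypothetical knobs; a candidate is kept while its distance
                   is at most `ratio` times the distance of the nearest image.
    """
    # Euclidean distance from the query to every image in the dataset.
    dists = np.linalg.norm(dataset_feats - query_feat, axis=1)
    # Consider at most n_max nearest candidates.
    order = np.argsort(dists)[:n_max]
    # Adaptive cut-off: dense neighborhoods keep more candidates, sparse ones fewer.
    threshold = ratio * dists[order[0]]
    keep = order[dists[order] <= threshold]
    return keep, dists[keep]

# Toy 2-d "features": three images close to the query, two far away.
feats = np.array([[0.0, 1.0], [0.0, 1.1], [0.1, 1.0], [3.0, 3.0], [5.0, 0.0]])
query = np.array([0.0, 0.0])
idx, d = retrieve_similar(query, feats, n_max=4, ratio=1.2)
print(idx, d)
```

With this rule the number of returned neighbors depends on how tightly the candidates cluster around the query, which is the behavior the figure illustrates.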

Close enough, but we still have some work to do. Now that we have found the images that are visually similar to our query image, the problem reduces to what to do with those images. The short answer is that we no longer deal with images from here on. Instead, we shift our focus to the textual domain and use the captions of the visually similar images, as we assume that they contain the information we need to describe the query scene.

Distributed Representation via Word Vectors

Word vectors are very popular these days, and their popularity rests on solid ground. They can capture the meaning of words in a high-dimensional space where each word is associated with a real-valued vector.

Vectors encoding gender relation between words on the left, singular/plural relation between two words on the right. (Mikolov et al., 2013)

What makes word vectors so popular and powerful is their simplicity and accuracy. With simple arithmetic it is possible to extract syntactic and semantic regularities between words, such as singular/plural, gender, tense, etc. In other words, we can make analogies by using vector differences and sums.

W(‘uncle’) − W(‘man’) + W(‘woman’) ≃ W(‘aunt’)

W(‘king’) − W(‘man’) + W(‘woman’) ≃ W(‘queen’)

In the above example, a nearest neighbor search around the result of the vector operation “king − man + woman” yields “queen”. Replacing “king” with “uncle” results in “aunt” (Mikolov et al., 2013). This is very powerful, especially if you want to encode or decode such regularities in a language.
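The analogy search can be sketched in a few lines of Python. The vectors below are hand-made 3-dimensional toys (roughly: royalty, gender, family) chosen so that the arithmetic works out exactly; real embeddings are learned from text and only satisfy these relations approximately. The dictionary `W` and the function `analogy` are illustrative names, not part of any library.

```python
import numpy as np

# Hand-made toy vectors: (royalty, gender, family). Not learned embeddings.
W = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "uncle": np.array([0.0,  1.0, 1.0]),
    "aunt":  np.array([0.0, -1.0, 1.0]),
}

def analogy(a, b, c, vectors):
    """Return the word nearest (by cosine) to vectors[a] - vectors[b] + vectors[c]."""
    target = vectors[a] - vectors[b] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):  # skip the query words themselves
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

print(analogy("king", "man", "woman", W))   # queen
print(analogy("uncle", "man", "woman", W))  # aunt
```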

We used 500-dimensional word embeddings to represent words and employed simple vector arithmetic to synthesize a new query from the captions of visually similar images. Essentially, what we do here is shift our focus from the visual domain to the language domain and perform a query expansion based on distributed representations.

Representing words with word vectors is quite simple, as it works like a lookup table where each word maps to a point in a high-dimensional space. But how do we represent captions? We have mentioned that we can apply algebraic operations to vectors and use the difference or sum vectors to encode some regularities. Following this rationale, we assume that a caption can be represented as the sum of its constituents. Therefore, we simply sum the vectors of the words in a caption. In our case, each image had 5 different captions, so we then averaged these caption vectors to obtain a single vector per image.
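As a rough illustration of the sum-then-average idea, consider the sketch below. The helper names `caption_vector` and `image_vector` are made up for this post, and two details are assumptions of the sketch rather than facts about our pipeline: captions are tokenized by lowercasing and splitting on whitespace, and words without a vector are skipped. The toy embeddings are 2-dimensional instead of 500-dimensional.

```python
import numpy as np

def caption_vector(caption, word_vectors, dim):
    """Sum the vectors of the words in a caption (unknown words are skipped)."""
    vec = np.zeros(dim)
    for token in caption.lower().split():  # naive whitespace tokenization
        if token in word_vectors:
            vec += word_vectors[token]
    return vec

def image_vector(captions, word_vectors, dim):
    """Average the caption vectors that belong to one image."""
    return np.mean([caption_vector(c, word_vectors, dim) for c in captions], axis=0)

# Toy 2-d embeddings; real ones would be loaded from a trained model.
wv = {
    "cow":     np.array([1.0, 0.0]),
    "pasture": np.array([0.0, 1.0]),
    "stands":  np.array([0.5, 0.5]),
}
captions = ["A cow stands", "cow in a pasture"]
print(image_vector(captions, wv, dim=2))  # [1.25 0.75]
```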

At this step, we have a vector which we presume to represent an image. Since we have plenty of captions in our pool, we need to draw from that pool the caption that best describes the query image. Here we make another assumption: the query image can be represented by a query vector. But we don’t have such a query vector, so we synthesize one by averaging the caption vectors in the pool. The resulting vector represents the query image, but in the language domain. In other words, we expand the original image query by changing modalities and synthesizing a new query vector in the language domain.

A system overview of the proposed query expansion approach for image captioning

Now that we have a query vector, we can do something cool with it: we re-rank all the caption vectors in the pool against this synthesized query vector. The re-ranking scheme computes the cosine distance between each caption vector and the query vector and ranks the captions by that distance. Finally, we return the top caption as the output of our algorithm for describing the query image.
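Query synthesis and re-ranking fit in one short function. The sketch below follows the simplified description in this post: the query is a plain, unweighted mean of the candidate caption vectors, and candidates are ordered by cosine distance to it. The name `rerank_captions` is hypothetical, the toy vectors are 2-dimensional, and all vectors are assumed to be non-zero.

```python
import numpy as np

def rerank_captions(caption_vecs):
    """Rank candidate captions by cosine distance to a synthesized query.

    caption_vecs: (k, d) array, one non-zero row per candidate caption.
    Returns (order, distances), best caption first.
    """
    # Synthesize the query in the language domain: mean of the pool.
    query = caption_vecs.mean(axis=0)
    # Cosine distance = 1 - cosine similarity.
    norms = np.linalg.norm(caption_vecs, axis=1) * np.linalg.norm(query)
    cos_dist = 1.0 - (caption_vecs @ query) / norms
    order = np.argsort(cos_dist)
    return order, cos_dist[order]

# Toy pool: two captions that roughly agree and one outlier.
pool = np.array([[1.0, 0.0], [1.0, 0.2], [0.0, 1.0]])
order, dist = rerank_captions(pool)
print(order)  # [1 0 2]
```

The caption at `order[0]` is the one returned as the description. In this toy pool it is the caption closest to the consensus of the pool, while the outlier ends up last.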


Choosing which dataset to work with is a very important part of a study. There are a few established datasets that have been used in various vision and NLP tasks, such as the UIUC Pascal Sentence, Flickr8K, Flickr30K, MS COCO, and SBU Captioned Photo datasets, which have 1K, 8K, 30K, 300K, and 1M captioned images, respectively. Each dataset has been criticized on different grounds, such as size, noise, object distribution, etc. In recent studies, the MS COCO dataset has been used as the standard captioned image dataset, as it is quite large and has diverse object categories with well-structured, tokenized captions as well as object segmentations.

In order to provide a benchmark for future studies, we used the Flickr8K, Flickr30K, and MS COCO datasets and conducted our evaluation on each of them separately.


So, what makes an image description good? How can we determine the quality of our descriptions? There are two common approaches in the literature. The first is to evaluate the produced descriptions against the ground-truth descriptions using machine translation and summarization metrics such as BLEU, TER, METEOR, ROUGE, CIDEr, etc. The second is to design human evaluation tasks that judge aspects of the descriptions such as relevance and descriptiveness. Both approaches have advantages and disadvantages of their own. For example, it is easier to evaluate descriptions with automated metrics, but these metrics often have their quirks (see Elliott and Keller (ACL, 2014) for an extended review). Human evaluation tasks are good for determining the overall description quality, but they are hard to replicate, costly, and often subject to bias arising from the design of the tasks.

In our evaluation task we used smoothed BLEU, METEOR and CIDEr and compared our approach with some of the related works in the literature as well as the human baseline. Moreover, we conducted a human evaluation task with a setting similar to the one described in (Kuznetsova et al., 2012; Mason and Charniak, 2014).
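For readers who have not met these metrics before, here is a small, self-contained sketch of a sentence-level smoothed BLEU. It is only meant to show the mechanics (clipped n-gram precisions, a brevity penalty, and smoothing so that short captions do not score zero). The add-one smoothing for n ≥ 2 used here is one common variant and is an assumption of this sketch; it is not necessarily the variant used in our evaluation, and the function names are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with add-one smoothing on n-gram precisions (n >= 2)."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matches = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if n > 1:  # add-one smoothing keeps higher-order precisions non-zero
            matches, total = matches + 1, total + 1
        if matches == 0 or total == 0:  # no unigram overlap or empty candidate
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty against the reference closest in length to the candidate.
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

retrieved = "A construction crew in orange vests working near train tracks"
ground_truth = ["A man in an orange vest and yellow hard hat looks on as a yellow vehicle lays track"]
score = smoothed_bleu(retrieved, ground_truth)
print(round(score, 3))
```

Note how two perfectly reasonable human captions for the same scene share only a handful of words, which is one of the quirks of n-gram based metrics mentioned above.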

A few results obtained with our method along with ground-truth captions.

The human evaluation task suggested that our results are quite relevant to the query scenes, as shown above for a few scenes, and the machine translation metrics supported these findings. Still, we believe there are a few things we can do to improve our results, such as using other pooling strategies like Fisher Vectors and incorporating word order. Hopefully, we will address these in our upcoming studies.

Final Remarks

Before we presented our work at ACL 2015, we submitted our results to the MS COCO 2015 Captioning Challenge. We noticed that the competing methods were performing very well on the automatic machine translation metrics. Most methods were even performing better than humans in terms of these metrics. I must note that we missed the human judgment and could not take part in the original challenge; we submitted our results afterwards.

Human judgment scores showed that there is still quite a big margin between the best-performing methods and the ground-truth image captions. This simply tells us that there is still room for progress in both machine translation metrics and image captioning methods. We’ll probably see quite interesting studies trying to fill this gap in the very near future.

This is joint work by Semih Yagcioglu (1), Erkut Erdem (1), Aykut Erdem (1), and Ruket Cakici (2).

1: Hacettepe University Computer Vision Lab (HUCVL)

2: Middle East Technical University Department of Computer Engineering
