Deep Neural Networks & Image Captioning

Part 2: Using attention networks to improve the quality of image captions

Laura Mitchell
Mar 7, 2019

In my previous blog post, I introduced the concept of combining a CNN with an LSTM to generate a caption for an image, and talked about how we are using this at Badoo to help our users find love. This time, we will explore how incorporating attention networks can help us improve and enrich the captions that our model generates.

Why attention networks?

To improve the quality of our captions, we often need the decoder to look at different parts of an image when generating each new word in the sequence. Incorporating an attention mechanism essentially allows the model to focus on the area relevant to the word it will generate next. For example, while attempting to output the word ‘dog’ from the image below, we want the model to focus on the pixels where the dog is shown.

We can place more importance on the relevant pixels by taking a weighted average of the image features rather than a simple average in which every pixel counts equally. Each weight can be thought of as the probability of focusing on a particular location in the image.
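To make the difference between the two averages concrete, here is a minimal sketch in NumPy. The feature vectors and the weights are made up for illustration; in a real model the features would come from the CNN and the weights from the attention network.

```python
import numpy as np

# Toy "image": 4 locations, each described by a 3-dimensional feature vector.
# These values are invented purely for illustration.
features = np.array([
    [1.0, 0.0, 0.0],   # background
    [0.0, 1.0, 0.0],   # background
    [0.0, 0.0, 1.0],   # dog
    [0.0, 0.0, 1.0],   # dog
])

# Simple average: every location contributes equally.
uniform_context = features.mean(axis=0)      # roughly [0.25, 0.25, 0.5]

# Weighted average: hypothetical attention weights that favour the "dog" locations.
weights = np.array([0.05, 0.05, 0.45, 0.45])
weighted_context = weights @ features        # roughly [0.05, 0.05, 0.9]

print(uniform_context)
print(weighted_context)
```

The weighted result is dominated by the ‘dog’ features, which is exactly what we want the decoder to see when it is about to output the word ‘dog’.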

Incorporating the attention network

The attention network computes the weights of the pixels at each timestep. It considers the sequence of words that have been generated thus far and determines what should be described next. The weighted representation of the image can be considered with the previous word at each timestep to generate the next word. In the example below, it is the “memory” of the LSTM that can help it to learn that it is logical to write that the man is ‘holding a dog’ after ‘a man’.
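The per-timestep loop described above can be sketched roughly as follows. This is a toy illustration under several assumptions, not our production model: the weights are random and untrained (so the output words are meaningless), a plain tanh recurrent cell stands in for the LSTM, and a simple dot product stands in for the attention network's scoring function. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<start>", "a", "man", "holding", "dog", "<end>"]
p, d, h = 6, 8, 8   # locations, feature size, hidden size (d == h so the dot product works)

features = rng.normal(size=(p, d))              # stand-in for CNN features
embed = rng.normal(size=(len(vocab), d))        # word embeddings
W_h = rng.normal(size=(h, h)) * 0.1             # recurrent weights
W_x = rng.normal(size=(2 * d, h)) * 0.1         # input weights (word + image context)
W_out = rng.normal(size=(h, len(vocab)))        # hidden state -> vocabulary scores

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode(max_len=5):
    hidden = np.zeros(h)    # the "memory" of what has been generated so far
    word = 0                # index of "<start>"
    words, all_weights = [], []
    for _ in range(max_len):
        # 1. Recompute the attention weights from the current hidden state.
        weights = softmax(features @ hidden)
        # 2. Build the weighted representation of the image.
        context = weights @ features
        # 3. Combine it with the previous word and update the hidden state.
        x = np.concatenate([embed[word], context])
        hidden = np.tanh(hidden @ W_h + x @ W_x)
        # 4. Pick the next word (greedy choice).
        word = int(np.argmax(hidden @ W_out))
        words.append(vocab[word])
        all_weights.append(weights)
        if vocab[word] == "<end>":
            break
    return words, np.array(all_weights)

words, attention_maps = decode()
print(words)
print(attention_maps.shape)   # one row of weights per generated word
```

The point of the sketch is the data flow: the weights are recomputed at every step from the hidden state, so the model can look at a different part of the image for each word.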

Soft vs. hard attention

In ‘soft’ attention, the weights of all p pixels add up to one at each timestep t.
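Written out, this constraint could look like the following. The symbol names are assumptions chosen for illustration: α for the weight given to pixel i at timestep t, a for the feature vector at pixel i, and z for the resulting weighted representation of the image.

```latex
\sum_{i=1}^{p} \alpha_{t,i} = 1, \qquad \alpha_{t,i} \ge 0,
\qquad\text{and}\qquad
z_t = \sum_{i=1}^{p} \alpha_{t,i}\, a_i
```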

The weights are calculated by a multi-layer perceptron followed by a softmax. Because soft attention is deterministic, the model can be trained end-to-end as part of the whole system using backpropagation. By contrast, hard attention is stochastic, which makes it unsuitable for backpropagation training in the same way: instead of using all of the hidden states as input for decoding, it samples one at every timestep to generate the output word.
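As a rough sketch of the difference, the snippet below scores each location with a small one-hidden-layer perceptron, applies a softmax, and then builds the image representation in both the soft and the hard way. The parameter names (`W_f`, `W_h`, `v`), the sizes and the random values are all assumptions for illustration; a trained model would learn these parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(features, hidden, W_f, W_h, v):
    """Score every location with a small MLP, then normalise with softmax."""
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # shape (p,)
    return softmax(scores)

def soft_context(features, weights):
    """Soft attention: deterministic weighted average over all locations."""
    return weights @ features

def hard_context(features, weights, rng):
    """Hard attention: sample a single location according to the weights."""
    idx = rng.choice(len(weights), p=weights)
    return features[idx], idx

rng = np.random.default_rng(0)
p, d, h, k = 5, 4, 3, 6                  # locations, feature, hidden, MLP sizes
features = rng.normal(size=(p, d))       # stand-in for CNN features
hidden = rng.normal(size=h)              # stand-in for the decoder state
W_f = rng.normal(size=(d, k))
W_h = rng.normal(size=(h, k))
v = rng.normal(size=k)

weights = attention_weights(features, hidden, W_f, W_h, v)
z_soft = soft_context(features, weights)
z_hard, idx = hard_context(features, weights, rng)
print(weights.sum(), z_soft.shape, idx)
```

Every operation in the soft path is differentiable, so gradients can flow through it. The sampling step in the hard path is not, which is why its gradient has to be estimated instead.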

There are pros and cons to both soft and hard attention, but soft attention tends to be preferred because its gradient can be computed directly rather than estimated through a stochastic process. I recommend reading the paper ‘Show, Attend and Tell: Neural Image Caption Generation with Visual Attention’ for further information on these differences.

Multimodal attentive translators

Another interesting approach involving attention is to use a multimodal attentive translator as covered in papers such as ‘MAT: A Multimodal Attentive Translator for Image Captioning’.

This is when the input image is passed to the LSTM as a sequence of detected objects, as opposed to an encoding of the whole image. A sequential attention layer is also introduced, which takes all of the encoded hidden states into consideration when generating each word.

Dense captioning using bounding boxes

A variation of image captioning with attention is dense captioning, which makes use of bounding boxes. A bounding box can be thought of as an imaginary box drawn around a region of interest in the image, as is done in object detection.

Dense captioning involves describing what is in each of these bounding boxes. As a result, it allows us to get rich descriptions of what is in multiple parts of an image. This approach could also be used for image retrieval/semantic image searching.
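The idea of one description per bounding box can be illustrated with a deliberately simplified sketch. It crops each box and passes the crop to a captioning function, whereas DenseCap itself localises and describes regions jointly within a single network. The helpers `crop`, `dense_captions` and `toy_caption` are hypothetical, and `toy_caption` is only a placeholder for a real captioning model.

```python
import numpy as np

def crop(image, box):
    """Cut out a region; box is (x0, y0, x1, y1) in pixel coordinates."""
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1]

def dense_captions(image, boxes, caption_fn):
    """Return one (box, caption) pair per bounding box."""
    return [(box, caption_fn(crop(image, box))) for box in boxes]

def toy_caption(region):
    # Placeholder for a real captioning model: describes brightness only.
    return "bright region" if region.mean() > 0.5 else "dark region"

image = np.zeros((4, 4))
image[0:2, 0:2] = 1.0                     # a bright patch in the top-left corner
boxes = [(0, 0, 2, 2), (2, 2, 4, 4)]      # hypothetical detected regions

results = dense_captions(image, boxes, toy_caption)
for box, caption in results:
    print(box, caption)
```

Keeping the captions alongside their boxes is also what makes the retrieval use case possible: a text query can be matched against the region captions to find the images, and the parts of them, that contain what is being searched for.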

Source: ‘DenseCap: Fully Convolutional Localization Networks for Dense Captioning’

Edge cases

Predictions such as this one show that our model still has a bit of a way to go before it performs as well as a human…

Photo effects and Snapchat filters

We have also noticed that the model has a tendency to get confused when there are photo effects such as Snapchat filters included in the image. If you have come across some similar results and have some solutions for dealing with such cases, please do help us out by posting below! :-)

Edge cases aside, we have seen some pleasing results overall with regard to the quality of the sentences generated by our model, which have in turn allowed us to give our users a helping hand in finding love on Badoo.


So now you know a bit more about how Deep Neural Networks, with the incorporation of attention mechanisms, can generate accurate captions for images. This advancement in technology allows us to deliver insights to our business, and it also has many other real-world applications, such as semantic image search.

I hope this article has motivated you to discover more about deep learning architectures and has given you some inspiration as to where they could drive innovation in your workplace.

If you’ve enjoyed reading this and are keen to work on similarly interesting projects, we are looking to expand our Data Science team at Badoo, so please do get in touch! 🙂

If you have any suggestions/feedback, feel free to comment below.
