Deep Neural Networks & Image Captioning

Part 1: Helping our users find love on Badoo

Laura Mitchell
Feb 28, 2019 · 5 min read

Badoo is the largest dating network in the world, with millions of users across 190 countries who upload over 10 million photos per day to our platform. These images provide us with a rich data set from which we can derive a wealth of insights.

Our Data Science team use image captioning to describe what is in these images. Image captioning is essentially the process of generating a textual description of a picture’s content.

With these descriptions we are delivering insights to the business, which in turn helps us to improve the user experience. For example, if we are able to identify that two users are both keen tennis players, we can give them a helping hand in finding each other on Badoo. Ultimately, in this way we’re helping people across the world find love!

In this blog I will explain how the Badoo Data Science team use Deep Neural Networks to create these image captions.

CNN and LSTM architecture

The Deep Neural Network model we have in place is inspired by the ‘Show and Tell: A Neural Image Caption Generator’ paper. It uses a combination of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network.

In its simplest terms, the combination of a CNN and an LSTM takes an image as input and outputs a description of what that image depicts. The CNN acts as an encoder and the LSTM as a decoder.

Role of the CNN

CNNs (or ConvNets) have proven very effective in image classification and object detection tasks. Many other algorithms ignore the spatial relationships between pixels, whereas a CNN explicitly exploits adjacent-pixel information, convolving and downsampling the image into a compact representation. For the purposes of image captioning, we can take a CNN that has been pre-trained to identify objects in images. Popular pre-trained models for such tasks are VGG-16, ResNet-50 and InceptionV3.
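To make "convolve and downsample" concrete, here is a toy NumPy sketch (not our production model, which uses a pre-trained network): a small kernel slides over the image combining adjacent pixels, and max-pooling then shrinks the resulting feature map.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (strictly cross-correlation, as in most DL libraries)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Downsample by taking the max over non-overlapping size x size windows."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[1.0, -1.0]])   # responds to horizontal intensity changes
features = conv2d(image, edge_kernel)   # shape (6, 5)
pooled = max_pool(features)             # shape (3, 2)
```

A real CNN stacks many such convolution and pooling layers, with learned kernels, until the image is reduced to a single feature vector.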

It is the CNN that creates the feature vector from the image, which is called an embedding. The embedding from the CNN layer is then fed as an input into the LSTM. In other words, the encoded image is transformed to create the initial hidden state for the LSTM decoder.
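A minimal sketch of that hand-off, with hypothetical dimensions (a ResNet-50-style encoder yields a 2048-d feature vector; assume a 512-d hidden state for the decoder):

```python
import numpy as np

rng = np.random.default_rng(0)

feature_dim, hidden_dim = 2048, 512
image_embedding = rng.standard_normal(feature_dim)   # the CNN's output for one image

# A learned linear projection (random here, trained in practice) maps the
# embedding into the decoder's space to give the LSTM its initial hidden state.
W_init = rng.standard_normal((hidden_dim, feature_dim)) * 0.01
h0 = np.tanh(W_init @ image_embedding)               # initial hidden state, shape (512,)
```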

Role of the LSTM

LSTMs are a type of Recurrent Neural Network (RNN) and, unlike many other deep learning architectures, RNNs can form a much deeper understanding of a sequence because they keep a memory of the previous elements. They are able to consider a past sequence of events in order to make a prediction or classification about the next event. For image captioning, the LSTM takes the extracted features from the CNN and produces a sequence of words describing what is in the image.

At each output timestep the LSTM generates a new word in the sequence.
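The word-by-word generation loop can be sketched with a toy stand-in for the decoder (the transition scores below are invented for illustration; in the real model they come from the LSTM's softmax at each timestep):

```python
# Toy "decoder": maps the previous word to a score for each candidate next word.
next_word_scores = {
    "<start>": {"a": 0.9, "the": 0.1},
    "a":       {"man": 0.7, "dog": 0.3},
    "man":     {"playing": 0.8, "<end>": 0.2},
    "playing": {"tennis": 0.9, "<end>": 0.1},
    "tennis":  {"<end>": 1.0},
}

def greedy_decode(max_len=10):
    """Emit one word per timestep, always picking the highest-scoring word."""
    word, caption = "<start>", []
    for _ in range(max_len):
        word = max(next_word_scores[word], key=next_word_scores[word].get)
        if word == "<end>":
            break
        caption.append(word)
    return " ".join(caption)

caption = greedy_decode()   # "a man playing tennis"
```

Generation stops when the special end-of-sequence token is emitted (or a maximum length is reached).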

Beam search

LSTMs can use the beam search method to construct sentences. A linear layer transforms the decoder’s output into a score for each word in the vocabulary. A greedy approach would simply choose the highest-scoring word at each step. However, this is suboptimal: if the first word is wrong, the rest of the sequence hinges on it. Beam search, on the other hand, is not greedy in the sense that it doesn’t have to commit to a single word at any point in the sequence until it has finished decoding.

At each iteration, the previous state of the LSTM is used to generate a softmax vector over the vocabulary. The k most probable words are kept and used in the next inference step.

More specifically, at the first decode step the top k candidate words are considered. For each of these k words, the decoder then generates the top k second words, and only the k best partial sequences are retained. This is repeated at each timestep until the sequences terminate, and the sequence with the best overall score is chosen.
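The procedure above can be sketched in plain Python. This is a simplified illustration, not our production decoder: the toy transition table stands in for the LSTM's softmax output, and sequences are scored by summed log-probabilities.

```python
import math

def beam_search(step_scores, k=2, max_len=10):
    """Keep the k highest-scoring partial sequences at each timestep.

    `step_scores(seq)` returns a dict {word: probability} for the next word.
    A sequence's score is the sum of the log-probabilities of its words.
    """
    beams = [(["<start>"], 0.0)]           # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for word, p in step_scores(seq).items():
                candidates.append((seq + [word], score + math.log(p)))
        # keep the k best; move completed sequences aside
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:k]:
            (finished if seq[-1] == "<end>" else beams).append((seq, score))
        if not beams:
            break
    best_seq, _ = max(finished + beams, key=lambda c: c[1])
    return best_seq[1:-1] if best_seq[-1] == "<end>" else best_seq[1:]

# Toy transition table standing in for the LSTM's predictions.
table = {
    ("<start>",): {"a": 0.6, "two": 0.4},
    ("<start>", "a"): {"cat": 0.3, "dog": 0.3, "<end>": 0.4},
    ("<start>", "a", "cat"): {"<end>": 1.0},
    ("<start>", "a", "dog"): {"<end>": 1.0},
    ("<start>", "two"): {"dogs": 0.95, "<end>": 0.05},
    ("<start>", "two", "dogs"): {"<end>": 1.0},
}
caption = beam_search(lambda seq: table[tuple(seq)], k=2)
```

Here greedy decoding would commit to "a" (probability 0.6) and then stop, whereas beam search keeps "two" alive and discovers the higher-probability caption "two dogs".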

Datasets and training

There are many open source datasets containing images and their captions that can be used for training your model. Popular choices for such tasks are the MS-COCO and Flickr 8k datasets.

During training, the network back-propagates the error and updates its weights based on the gradient of the loss function.
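For captioning, the loss at each timestep is typically the cross-entropy between the predicted word distribution and the ground-truth word. One reason this pairing is so common: the gradient of softmax-plus-cross-entropy with respect to the logits has the simple closed form "probabilities minus one-hot target", sketched below on a toy three-word vocabulary.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_grad(logits, target_index):
    """Gradient of cross-entropy loss w.r.t. the logits: softmax(logits) - one_hot."""
    grad = softmax(logits)
    grad[target_index] -= 1.0
    return grad

logits = np.array([2.0, 1.0, 0.1])  # scores for a 3-word toy vocabulary
grad = cross_entropy_grad(logits, target_index=0)
# grad sums to zero and is negative only at the target word,
# pushing the target's score up and the others down.
```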

Model evaluation

To evaluate how similar the predicted sentence is to the ground truth, an evaluation metric such as the Bilingual Evaluation Understudy (BLEU) score can be used. The fundamental concept of the BLEU score is that it measures the correspondence between the model’s output and a human-written reference caption.
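A simplified BLEU-1 sketch conveys the idea: count how many of the candidate's words (clipped by their frequency in the reference) also appear in the reference, and penalise captions shorter than the reference. Note the full metric combines clipped n-gram precisions up to n = 4; libraries such as NLTK provide a complete implementation.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Simplified BLEU-1: clipped unigram precision times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # clip each word's count by how often it appears in the reference
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand)
    # penalise candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("a man playing tennis", "a man is playing tennis")
```

A perfect match scores 1.0; the example above scores slightly lower because the candidate is one word shorter than the reference.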


In this blog, I gave an overview of how we have combined a CNN and an LSTM to generate textual descriptions of images, and in turn how we are helping our users connect with others with similar interests. Keep an eye out for part two of this article, which covers how the incorporation of attention networks can enhance the accuracy of our descriptions.

If cool data science projects like this one interest you, we are hiring, so get in touch and join the Badoo family! 🙂

Any suggestions/feedback are welcome, so please feel free to comment below.
