Image Captioning in Python with Keras

Bhavesh Wadhwani
Published in The Startup · Sep 20, 2019

An end-to-end approach to image captioning, from data collection through model building to making predictions with the trained model.

Figure: a simple representation of the image captioning process using deep learning (source: www.packtpub.com)

1. Introduction

Caption generation is a challenging artificial intelligence problem where a textual description must be generated for a given photograph.

It requires both methods from computer vision to understand the content of the image and a language model from the field of natural language processing to turn the understanding of the image into words in the right order.

Deep learning methods have demonstrated state-of-the-art results on caption generation problems. What is most impressive about these methods is that a single end-to-end model can be defined to predict a caption given a photo, instead of requiring sophisticated data preparation or a pipeline of specifically designed models.

In this tutorial, you will discover how to develop a photo captioning deep learning model from scratch.

2. Prerequisites

This post assumes familiarity with basic deep learning concepts such as Convolutional Neural Networks, Transfer Learning, Recurrent Neural Networks, Gradient Descent, feed-forward and back-propagation, text processing, Python syntax and data structures, and the Keras library.

3. Data Collection

There are many open source datasets available for this problem, like Flickr 8k (containing 8k images), Flickr 30k (containing 30k images), MS COCO (containing 180k images), etc.

For the purpose of this case study, I have used the Flickr 8k dataset, which you can download by filling in this form provided by the University of Illinois at Urbana-Champaign. Training a model on a large number of images may also not be feasible on a system that is not a fairly high-end PC or laptop.

This dataset contains 8000 images each with 5 captions.

These images are distributed as follows:

  • Training Set — 6000 images
  • Dev Set — 1000 images
  • Test Set — 1000 images

4. Understanding the data

If you have downloaded the data from the link you will also get some text files related to the images. One of the files is “Flickr8k.token.txt” which contains the name of each image along with its 5 captions. We can read this file as follows:
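Here is a minimal sketch of that read, assuming the file sits in the working directory:

```python
# Read the whole captions file into memory.
# Each line looks like: "1000268201_693b08cb0e.jpg#0<TAB>A child in a pink dress ..."
def load_doc(filename):
    with open(filename, 'r') as f:
        return f.read()

doc = load_doc('Flickr8k.token.txt')  # assumed path
print(doc[:300])                      # peek at the first few entries
```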

Output:

Sample filename: ‘1000268201_693b08cb0e.jpg’

Sample Caption: A child in a pink dress is climbing up a set of stairs in an entryway .

Note: each line of the output pairs name_of_file.jpg with one of its 5 captions, as shown above.

The file will look like this, with a filename and 5 captions per image.

5. Loading data

In this section we will extract the data and store it in the form we need.

  1. Extracting captions and storing them in a dictionary:
Code for loading the descriptions into a dictionary with the filename as key and the list of captions as value
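A minimal sketch of that step, assuming `doc` holds the contents of Flickr8k.token.txt as read above:

```python
# Parse the token file into a dictionary: {image_id: [caption1, ..., caption5]}.
def load_descriptions(doc):
    mapping = dict()
    for line in doc.split('\n'):
        tokens = line.split()
        if len(tokens) < 2:
            continue
        image_id, image_desc = tokens[0], tokens[1:]
        image_id = image_id.split('.')[0]       # drop ".jpg#0", ".jpg#1", ...
        mapping.setdefault(image_id, []).append(' '.join(image_desc))
    return mapping

descriptions = load_descriptions(doc)
print('Loaded: %d images' % len(descriptions))
```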

2. Clean Descriptions:

Code for cleaning descriptions
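A sketch of the usual cleaning steps; the exact choices (lowercasing, stripping punctuation, dropping one-character and non-alphabetic tokens) are assumptions you can adapt:

```python
import string

# Clean each caption in place: lowercase, remove punctuation,
# drop one-character words and tokens that are not purely alphabetic.
def clean_descriptions(descriptions):
    table = str.maketrans('', '', string.punctuation)
    for key, desc_list in descriptions.items():
        for i, desc in enumerate(desc_list):
            words = desc.lower().split()
            words = [w.translate(table) for w in words]
            words = [w for w in words if len(w) > 1 and w.isalpha()]
            desc_list[i] = ' '.join(words)

clean_descriptions(descriptions)
```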

Now we can save descriptions for later use…

Code for saving descriptions into text file
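A sketch of the save step, writing one "image_id caption" pair per line (the output filename is an assumption):

```python
# Persist the cleaned descriptions so they can be reloaded without re-parsing.
def save_descriptions(descriptions, filename):
    lines = []
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + ' ' + desc)
    with open(filename, 'w') as f:
        f.write('\n'.join(lines))

save_descriptions(descriptions, 'descriptions.txt')  # assumed filename
```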

6. Preparing sequences for training:

In this section we will create the sequences of captions that will be used while training the model. We will add <startseq> at the start and <endseq> at the end of every caption so that the algorithm can learn where a caption starts and ends.

Now we will create, for every image, the image features, input sequences and output words used as training samples.

Here is how the data will look once it is turned into sequences.
Code for creating the image features, input sequences and output words for an image
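Below is a sketch of this step. It assumes a Keras `tokenizer` has already been fit on the (startseq/endseq-wrapped) captions, and that `max_length`, `vocab_size` and the extracted image feature `photo` are available:

```python
from numpy import array
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# Turn every caption of one image into (image_feature, input_sequence, output_word)
# training samples: each prefix of the caption is used to predict the next word.
def create_sequences(tokenizer, max_length, desc_list, photo, vocab_size):
    X1, X2, y = [], [], []
    for desc in desc_list:
        seq = tokenizer.texts_to_sequences([desc])[0]
        for i in range(1, len(seq)):
            in_seq, out_seq = seq[:i], seq[i]
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            X1.append(photo)
            X2.append(in_seq)
            y.append(out_seq)
    return array(X1), array(X2), array(y)
```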

7. Image feature extraction:

In this section we will extract features from the images using transfer learning.

We need to convert every image into a fixed-size vector which can then be fed as input to the neural network. There are two ways to do this:

  1. Train a network from scratch.
  2. Use a pre-trained model and weights trained on a larger, similar dataset. This is called transfer learning.

For this case study, we choose the second approach. This is an optimization that will make training our models faster and consume less memory. We choose VGG as our model for transfer learning with "imagenet" weights, and we will remove the last layer from the loaded model, as that layer is only used to predict a classification for a photo.

We are not interested in classifying images, but we are interested in the internal representation of the photo right before a classification is made. These are the “features” that the model has extracted from the photo.

Keras also provides tools for reshaping the loaded photo into the preferred size for the model (e.g. 3 channel 224 x 224 pixel image).

Code for extracting input features for each image and saving them into a pickle file for later use
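Here is a sketch of the feature extraction, assuming VGG16 with "imagenet" weights and the images in a "Flickr8k_Dataset" directory (both are assumptions; swap in your own paths or VGG variant):

```python
from os import listdir
from pickle import dump
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Model

# Extract a 4096-d feature vector for every image using VGG16 without its
# final classification layer.
def extract_features(directory):
    model = VGG16(weights='imagenet')
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)  # keep fc2
    features = dict()
    for name in listdir(directory):
        image = load_img(directory + '/' + name, target_size=(224, 224))
        image = img_to_array(image)
        image = image.reshape((1,) + image.shape)   # add batch dimension
        image = preprocess_input(image)
        features[name.split('.')[0]] = model.predict(image, verbose=0)
    return features

features = extract_features('Flickr8k_Dataset')  # assumed directory
dump(features, open('features.pkl', 'wb'))       # assumed filename
```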

Now our data is ready for modelling. So why waste more time? Let's get our hands dirty and build some really cool models.

8. Model preparation:

But what would be the best model for our purpose?

We have images and captions, and we need to use both kinds of features efficiently so that the model benefits from each. We can use the Inject and Merge architectures for the encoder-decoder model; for more details, I suggest reading the papers cited below.

1. Inject Model

The inject model combines the encoded form of the image with each word from the text description generated so-far.

The approach uses the recurrent neural network as a text generation model that uses a sequence of both image and word information as input in order to generate the next word in the sequence.

In these ‘inject’ architectures, the image vector (usually derived from the activation values of a hidden layer in a convolutional neural network) is injected into the RNN, for example by treating the image vector on a par with a ‘word’ and including it as part of the caption prefix.

Where to put the Image in an Image Caption Generator, 2017.

Inject Architecture for Encoder-Decoder Model
Taken from “What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?”.

2. Merge Model

The merge model combines both the encoded form of the image input with the encoded form of the text description generated so far.

The combination of these two encoded inputs is then used by a very simple decoder model to generate the next word in the sequence.

The approach uses the recurrent neural network only to encode the text generated so far.

In the case of ‘merge’ architectures, the image is left out of the RNN sub-network, such that the RNN handles only the caption prefix, that is, handles only purely linguistic information. After the prefix has been vectorised, the image vector is then merged with the prefix vector in a separate ‘multi-modal layer’ which comes after the RNN sub-network.

Where to put the Image in an Image Caption Generator, 2017.

Merge Architecture for Encoder-Decoder Model
Taken from “What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?”.

What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?, 2017.

We will use the merge architecture for this case study, as it has been shown to work better in such cases.

Schematic of the Merge Model For Image Captioning

Now you might be wondering WHY this works better. For that, read the paragraph below!

Here is one more paper, "Where to put the Image in an Image Caption Generator?", which I suggest you read to get a clear idea of why we are choosing this type of architecture.

For a clearer understanding, here is an image from "Where to put the Image in an Image Caption Generator?":

Recursive Framing of the Caption Generation Model
Taken from “Where to put the Image in an Image Caption Generator.”

Now, let's define a model for our purpose.

Code for defining a model as per our requirement
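Here is a sketch of a merge-style model in Keras. The 4096-d image input (matching VGG's fc2 features), the 256-unit layer sizes and the dropout rates are assumptions you can tune:

```python
from keras.models import Model
from keras.layers import Input, Dense, Dropout, Embedding, add, CuDNNLSTM

# Merge architecture: encode the image and the caption prefix separately,
# add the two encodings, then predict the next word.
def define_model(vocab_size, max_length):
    # image feature branch
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # caption sequence branch (CuDNNLSTM needs a GPU and does not support masking;
    # a plain LSTM layer works as a drop-in replacement on CPU)
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = CuDNNLSTM(256)(se2)
    # decoder
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

model = define_model(vocab_size, max_length)
model.summary()
```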

Here we have used CuDNNLSTM, which helps us train our model faster. You can learn more about this layer in the Keras documentation.

Outline for our model layers:

Model summary

This is how our model will look:

Model outline with input and output shapes

So far everything seems fine, but if you are following this tutorial and doing similar things on a personal laptop with less than 32 GB of RAM, you will likely hit a memory error. Our training data is quite large, and building it all at once means fitting all of it in RAM. So how can we move forward?

Do we have any option? Do you know of one?

Yes!! You guessed it correctly: we can feed the data to the model in parts.

But how do we achieve this in code? Is that troubling you?

No worries, I've got you covered!!

Below is the code for the generator function through which we will achieve this.
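Here is a sketch of such a generator, reusing the create_sequences() helper from above; train_descriptions, train_features, tokenizer, max_length and vocab_size are assumed to have been prepared earlier:

```python
# Yield one image's worth of training samples at a time, so the full
# dataset never has to sit in RAM.
def data_generator(descriptions, photos, tokenizer, max_length, vocab_size):
    while True:  # loop forever; Keras stops after steps_per_epoch batches
        for key, desc_list in descriptions.items():
            photo = photos[key][0]
            in_img, in_seq, out_word = create_sequences(
                tokenizer, max_length, desc_list, photo, vocab_size)
            yield [in_img, in_seq], out_word

generator = data_generator(train_descriptions, train_features,
                           tokenizer, max_length, vocab_size)
model.fit_generator(generator, epochs=10,
                    steps_per_epoch=len(train_descriptions), verbose=1)
```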

Training….

Epoch 10/10 — — Training completed

TADA!! We have done it!

Now what? We need to check how our model is performing.

But before that, we need to decide on what basis to judge our model's performance.

One metric that can certainly help us here is BLEU.

BLEU stands for Bilingual Evaluation Understudy.

It is an algorithm that has been used for evaluating the quality of machine-translated text. We can use BLEU to check the quality of our generated captions.

  • BLEU is language independent.
  • It is easy to understand.
  • It is easy to compute.
  • Scores lie in [0, 1]; the higher the score, the better the quality of the caption.

How to calculate BLEU score?

Predicted caption: "the weather is good"

References:

  1. the sky is clear
  2. the weather is extremely good

First, convert the predicted caption and the references into unigrams/bigrams and count the overlapping n-grams.

BLEU calculation
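For a quick sanity check, you can compute BLEU for the toy example above with NLTK's sentence_bleu (NLTK is an extra dependency I am assuming here):

```python
from nltk.translate.bleu_score import sentence_bleu

references = [['the', 'sky', 'is', 'clear'],
              ['the', 'weather', 'is', 'extremely', 'good']]
candidate = ['the', 'weather', 'is', 'good']

print(sentence_bleu(references, candidate, weights=(1, 0, 0, 0)))      # unigram BLEU
print(sentence_bleu(references, candidate, weights=(0.5, 0.5, 0, 0)))  # unigram + bigram
```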

Great, we are now ready to evaluate our model.

Code for Model Evaluation
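Below is a sketch of the evaluation: a greedy decoder that generates a caption word by word, plus corpus-level BLEU over the test descriptions. The helper names (word_for_id, generate_desc, evaluate_model) are my own:

```python
from numpy import argmax
from nltk.translate.bleu_score import corpus_bleu
from keras.preprocessing.sequence import pad_sequences

def word_for_id(integer, tokenizer):
    # invert the tokenizer's word index
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def generate_desc(model, tokenizer, photo, max_length):
    # greedy decoding: repeatedly predict the most likely next word
    in_text = 'startseq'
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = argmax(model.predict([photo, seq], verbose=0))
        word = word_for_id(yhat, tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'endseq':
            break
    return in_text

def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = [], []
    for key, desc_list in descriptions.items():
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        actual.append([d.split() for d in desc_list])
        predicted.append(yhat.split())
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
```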

Now it's time to show our real results: the output of what we have done so far!!

This is the generated caption for our example image:

Note: the output will contain <startseq> and <endseq>, since we added them to the input sequences.

startseq dog is running through the water endseq → This is what the raw output looks like; we can simply skip the first and last word when displaying it.
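If you want to hide those markers when displaying results, a tiny trim like this (my own addition) works:

```python
# Drop the startseq/endseq markers before showing the caption to a user.
caption = 'startseq dog is running through the water endseq'
print(' '.join(caption.split()[1:-1]))   # -> dog is running through the water
```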

Here are a few samples where the results are not so accurate.

No beach here!!
This one is OK, but it can be improved a lot more.

Future Scope:

This is just a first-cut model; we can improve the results by training on more data and by tuning the models further. Please try it and share your results with me!! I would love to hear about them.

Also, please let me know if you think anything above can be improved.

Here is a list of sources I learned these things from. They were really important; without them, this would not have been easy.

References:

1. Applied AI Course, for detailed video explanations of RNNs and CNNs.

2. Machine Learning Mastery. Thanks Jason for your blogs. They have really helped me to grow.

3. Andrej Karpathy. He has a very good lecture on RNNs and their uses. Link.

You can connect with me on LinkedIn.

LinkedIn : https://www.linkedin.com/in/bhaveshwadhwani/

Github : https://github.com/bhaveshwadhwani

Check my portfolio website here.

Thanks for reading this far, and stay happy and motivated!

See you in my next post!!!
