Inside AI

Image Captioning With Flickr8k Dataset & BLEU

Raman Shinde
7 min read · May 1, 2019

Table of Contents:

  1. Introduction
  2. Why Flickr8k dataset
  3. Let’s understand the data
  4. EDA…
  5. How to featurize images..?
  6. Caption Preprocessing…
  7. Sequential Data preparation
  8. BLEU…?
  9. Inference
  10. Conclusion
  11. References

1. Introduction:

RNNs have become very powerful, especially for sequential data modelling. Andrej Karpathy has explained the use of RNNs very nicely in his blog post The Unreasonable Effectiveness of Recurrent Neural Networks. There are basically four types of RNNs.

Types of RNNs.

Image captioning is an application of a one-to-many RNN: for a given input image, the model predicts a caption based on the vocabulary of the training data. We are considering the Flickr8k dataset for this case study. The official site for the data is not working, but thanks to Jason Brownlee, you can download the dataset from here.

2. Why Flickr8k dataset…?

  1. It is small in size, so the model can be trained easily on low-end laptops/desktops.
  2. The data is properly labelled: for each image, 5 captions are provided.
  3. The dataset is available for free.

3. Let’s understand the data…

Data pre-processing and cleaning is an important part of the whole model building process. Understanding the data helps us to build more accurate models.

After extracting the zip files, you will find the folders below:

  1. Flickr8k_Dataset: Contains a total of 8092 images in JPEG format, with different shapes and sizes. Of these, 6000 are used for training, 1000 for testing and 1000 for development.
  2. Flickr8k_text: Contains text files describing the train and test sets. Flickr8k.token.txt contains 5 captions for each image, i.e. a total of 40460 captions.

4. EDA…

We have mainly two types of data.

  1. Images
  2. Captions (Text)

The size of the training vocabulary is 7371. The top 10 most frequent words are

('a', 46784),('in', 14094),('the', 13509),('on', 8007),('is', 7196),
('and', 6678),('dog', 6160),('with', 5763),('man', 5383), ('of', 4974)

Words that occur very rarely do not carry much information, so we only consider words with a frequency of more than 10.
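
A minimal sketch of this filtering step; all_train_captions is assumed to be a flat list holding every training caption as a string:

from collections import Counter

# all_train_captions: assumed flat list of all training caption strings
word_counts = Counter(w for caption in all_train_captions for w in caption.split())
print(word_counts.most_common(10))                      # top 10 most frequent words

vocab = [w for w, c in word_counts.items() if c > 10]   # keep words seen more than 10 times
print('Vocabulary size after filtering:', len(vocab))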

Distribution of word count.

The distribution of top 50 words is as follows…

Here is the distribution of the least frequent words that we removed earlier.

Distribution of the least frequent words, with a frequency of 10 or less

Average caption length per image in the training set

The mean, standard deviation and percentiles are as follows:

count    6000.000000
mean       10.815467
std         2.057137
min         4.200000
25%         9.400000
50%        10.600000
75%        12.200000
max        19.200000

The maximum sequence length found is 37.

5. How to Featurize images….?

Keras already provides models pre-trained on the standard ImageNet dataset. ImageNet is a standard dataset used for classification: it contains more than 14 million images, grouped into a little more than 21 thousand classes.

We will be using InceptionV3 by Google. The research paper can be found here.

Why Inception…?

  1. It has a smaller weight file, i.e. approx. 96 MB.
  2. It is faster to train.

We will remove the softmax layer from Inception, as we want to use it as a feature extractor. For a given input image, Inception gives us a 2048-dimensional feature vector. The code is as follows.
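
A minimal sketch of this step with Keras; encode_image, train_image_paths and the pickle filename are illustrative names rather than the exact ones from the original notebook:

import pickle
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

base = InceptionV3(weights='imagenet')
# drop the final softmax layer; the 'avg_pool' output is the 2048-d feature vector
feature_extractor = Model(inputs=base.input, outputs=base.layers[-2].output)

def encode_image(img_path):
    img = image.load_img(img_path, target_size=(299, 299))   # resize to (299, 299)
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return feature_extractor.predict(x).reshape(2048)

# train_image_paths: assumed list of paths to the 6000 training images
train_image_extracted = {p: encode_image(p) for p in train_image_paths}
pickle.dump(train_image_extracted, open('train_image_extracted.pkl', 'wb'))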

For every training image, we resize it to (299, 299) and then pass it to Inception for feature extraction. Remember to save the train_image_extracted dictionary; it will save a lot of time if you are fine-tuning the model.

6. Caption Preprocessing...

Each image in the dataset is provided with 5 captions. For example, the image 1000268201_693b08cb0e.jpg has the captions

['A child in a pink dress is climbing up a set of stairs in an entry way .',
'A girl going into a wooden building .',
'A little girl climbing into a wooden playhouse .',
'A little girl climbing the stairs to her playhouse .',
'A little girl in a pink dress going into a wooden cabin .']

Captions are read from the Flickr8k.token.txt file and stored in a dictionary k:v, where k = image id and v = [list of captions].

There are 5 captions for each image, and we preprocess and encode each of them in the format below:

“startseq “ + caption + “ endseq”

The reason behind startseq and endseq is:

startseq : Acts as the first word when the feature-extracted image vector is fed to the decoder. It kick-starts the caption generation process.

endseq : Tells the decoder when to stop. We stop predicting words as soon as endseq appears, or once we have predicted the maximum number of words, whichever comes first.
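
A minimal sketch of this preprocessing, assuming the caption file sits in a Flickr8k_text/ folder; the cleaning here (lowercasing and stripping punctuation) is a common choice, not necessarily the exact steps used in the notebook:

import string

captions = {}   # image id -> list of 'startseq ... endseq' captions
with open('Flickr8k_text/Flickr8k.token.txt') as f:
    for line in f:
        if not line.strip():
            continue
        image_part, caption = line.strip().split('\t')
        image_id = image_part.split('#')[0]               # drop the '#0'..'#4' suffix
        caption = caption.lower().translate(str.maketrans('', '', string.punctuation))
        captions.setdefault(image_id, []).append('startseq ' + caption.strip() + ' endseq')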

7. Sequential Data preparation….

This is the tricky part: how do we convert this data into sequential form?

For understanding purposes, we will consider the above image, in which a bunch of people are swimming in water, and its corresponding sequences.

First, feed the image to Inception and get the feature-extracted 2048-dimensional vector.

transfer learning

caption: startseq a bunch of people swimming in water endseq.

We will form the sequences as below.

Image_vector + ['startseq']
Image_vector + ['startseq', 'a']
Image_vector + ['startseq', 'a', 'bunch']
Image_vector + ['startseq', 'a', 'bunch', 'of']
Image_vector + ['startseq', 'a', 'bunch', 'of', 'people']
Image_vector + ['startseq', 'a', 'bunch', 'of', 'people','swimming']
...

Convert each sequence to numerical form with the help of the vocabulary. Please refer to this for a detailed diagrammatic explanation. Below is the source code for dataset preparation.
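
A minimal sketch of this expansion step; create_sequences, word_to_index, max_length and vocab_size are illustrative names:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def create_sequences(image_vector, caption, word_to_index, max_length, vocab_size):
    # expand one caption into (image, partial sequence) -> next-word training rows
    X_img, X_seq, y = [], [], []
    seq = [word_to_index[w] for w in caption.split() if w in word_to_index]
    for i in range(1, len(seq)):
        in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]         # padded partial caption
        out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]  # one-hot next word
        X_img.append(image_vector)
        X_seq.append(in_seq)
        y.append(out_word)
    return np.array(X_img), np.array(X_seq), np.array(y)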

The problem….?

There is a problem if we fit all the data points to the model at once, since we have about 40k captions in total.

The maximum caption length is 37, so each caption is encoded into a sequence of length 37.

Let's assume that, on average, we need about 10 rows to encode one caption.

And each word in a sequence will be embedded as a 200-dimensional GloVe vector.

So, considering 1 byte for each number, our final data matrix will occupy

40k x 10 x ((37 x 200) + 2048) ===> approx. 3.7 GB at least.

Solution:

Rather than loading the whole dataset at once, use a data generator to produce the data in batches.
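
A minimal generator sketch that reuses the create_sequences helper sketched above; batch_size here counts images rather than rows:

def data_generator(captions, image_features, word_to_index, max_length,
                   vocab_size, batch_size):
    # captions: image id -> list of captions; image_features: image id -> 2048-d vector
    while True:
        X_img, X_seq, y, n = [], [], [], 0
        for image_id, caption_list in captions.items():
            for caption in caption_list:
                xi, xs, yo = create_sequences(image_features[image_id], caption,
                                              word_to_index, max_length, vocab_size)
                X_img.extend(xi)
                X_seq.extend(xs)
                y.extend(yo)
            n += 1
            if n == batch_size:
                yield [np.array(X_img), np.array(X_seq)], np.array(y)
                X_img, X_seq, y, n = [], [], [], 0

The generator can then be passed to model.fit_generator (or model.fit in newer Keras versions) with a suitable steps_per_epoch, so only one batch is held in memory at a time.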

8. BLEU…?

BLEU stands for Bilingual Evaluation Understudy.

It is an algorithm that has been used for evaluating the quality of machine-translated text. We can use BLEU to check the quality of our generated captions.

  • BLEU is language independent.
  • It is easy to understand.
  • It is easy to compute.
  • It lies in [0, 1]; the higher the score, the better the quality of the caption.

How to calculate BLEU score…?

predicted caption= “the weather is good”

references:

  1. the sky is clear
  2. the weather is extremely good

First, convert the predicted caption and the references into unigrams/bigrams.

BLEU tells us how good our predicted caption is compared to the provided 5 reference captions.
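
With NLTK, the example above can be scored roughly like this; the weights argument selects which n-gram precisions contribute:

from nltk.translate.bleu_score import sentence_bleu

references = [['the', 'sky', 'is', 'clear'],
              ['the', 'weather', 'is', 'extremely', 'good']]
candidate = ['the', 'weather', 'is', 'good']

bleu1 = sentence_bleu(references, candidate, weights=(1, 0, 0, 0))      # unigram only
bleu2 = sentence_bleu(references, candidate, weights=(0.5, 0.5, 0, 0))  # cumulative 2-gram
print(bleu1, bleu2)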

9. Inference…

For a given feature-extracted test image, with startseq as the input to the model, we get a probability distribution over all the words in the vocabulary. The word corresponding to the index with the maximum probability is the predicted word.

We stop predicting when the word “endseq” appears.
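
A minimal greedy-decoding sketch; model, word_to_index, index_to_word and max_length are assumed to come from the training step:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_caption(image_vector, model, word_to_index, index_to_word, max_length):
    caption = 'startseq'
    for _ in range(max_length):
        seq = [word_to_index[w] for w in caption.split() if w in word_to_index]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([np.array([image_vector]), seq], verbose=0)
        word = index_to_word[int(np.argmax(yhat))]        # pick the most probable word
        if word == 'endseq':
            break
        caption += ' ' + word
    return caption.replace('startseq', '').strip()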

Here are some examples from my ipynb notebook...

  • The prediction is very good. It describes the image correctly. Observe the BLEU value.
  • The child is predicted correctly, but the model is treating the adults as children. The BLEU value drops here, as the prediction is not analogous to the references.
  • Not a good one. BLEU is very small, close to zero, except for the unigram score.
  • Nooo… not a good one.
  • Nice try, but not an accurate caption. Observe the high unigram value.
  • A woman predicted as a little boy.

10. Conclusion:

  1. The performance of the model can be improved by training it on a larger dataset and hyperparameter tuning.
  2. From the above, we can observe that unigram BLEU scores favour short predictions.
  3. Prediction is good if all the BLEU scores are high.

That’s it from my end…!

For the whole code, please check the IPython notebook here.

End.

11. References:

  1. https://towardsdatascience.com/intuitive-understanding-of-attention-mechanism-in-deep-learning-6c9482aecf4f?%24Ga=true
  2. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
  3. https://fairyonice.github.io/Develop_an_image_captioning_deep_learning_model_using_Flickr_8K_data.html
  4. http://cs231n.github.io/transfer-learning/
