Keras Tutorial: Content Based Image Retrieval Using a Convolutional Denoising Autoencoder

Adil Baaj
Published in Sicara's blog · 6 min read · Sep 14, 2017

Read the original article on Sicara’s blog here.

Content based image retrieval (CBIR) systems make it possible to find images similar to a query image within an image dataset. The most famous CBIR system is the search-by-image feature of Google Search. This article uses the Keras deep learning framework to perform image retrieval on the MNIST dataset.

Our CBIR system will be based on a convolutional denoising autoencoder, which belongs to a class of unsupervised deep learning algorithms.

Content based image retrieval

To explain what content based image retrieval (CBIR) is, I am going to quote this research paper.

There are two [image retrieval] frameworks: text-based and content-based. The text-based approach can be tracked back to 1970s. In such systems, the images are manually annotated by text descriptors, which are then used by a database management system to perform image retrieval. There are two disadvantages with this approach. The first is that a considerable level of human labour is required for manual annotation. The second is the annotation inaccuracy due to the subjectivity of human perception. To overcome the above disadvantages in text-based retrieval system, content-based image retrieval (CBIR) was introduced in the early 1980s. In CBIR, images are indexed by their visual content, such as color, texture, shapes.

Basically, we first extract features from an image database and store them. Then we compute the features associated with a query image. Finally, we retrieve the images with the closest features.
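To make this pipeline concrete, here is a minimal sketch (the function extract_features and the variable names are hypothetical placeholders, not code from this article):

import numpy as np

def retrieve_closest(query_image, database_images, extract_features, n_results=10):
    # 1. Extract and store the features of the whole image database
    database_features = np.array([extract_features(img) for img in database_images])
    # 2. Compute the features of the query image
    query_features = extract_features(query_image)
    # 3. Retrieve the images whose features are closest to the query
    distances = np.linalg.norm(database_features - query_features, axis=1)
    return np.argsort(distances)[:n_results]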

Feature extraction for content based image retrieval

The key point about content based image retrieval is the feature extraction. The features correspond to the way we represent an image at a high level. How do we describe the colours of an image? Its texture? The shapes on it? The features we extract should also allow efficient retrieval of the images. This is especially true if we have a large image database.

There are many ways to extract these features.

One way is to use what we call hand-crafted features. Examples are: a histogram of colours to describe colours, and a histogram of oriented gradients (HOG) to describe shapes.

Other descriptors like SIFT and SURF have proven to be robust for image retrieval applications.

Another possibility is to use deep learning algorithms. In this research paper, the authors demonstrate that convolutional neural networks (CNN) trained for classification purposes can be used to extract a ‘neural code’ for images. These neural codes are the features used to describe images. They also demonstrate that this approach performs as well as state-of-the-art approaches on many datasets. The problem with this approach is that we first need labelled data to train the neural network. The labelling task can be costly and time consuming. Another way to generate these ‘neural codes’ for our image retrieval task is to use an unsupervised deep learning algorithm. This is where the denoising autoencoder comes in.

Denoising autoencoder

A denoising autoencoder is a feed-forward neural network that learns to denoise images. By doing so, the neural network learns interesting features from the images used to train it. It can then be used to extract features from images similar to the training set.

If you are not familiar with autoencoders, I highly recommend first browsing these three sources:

Denoising autoencoder for content based image retrieval

We use the convolutional denoising autoencoder algorithm provided in the Keras tutorial.

Training the model
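Here is a sketch of the model and training code, closely following the convolutional denoising autoencoder from the Keras blog tutorial, with the two small changes discussed just below (naming the encoder layer and saving the trained model):

from keras.datasets import mnist
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model
import numpy as np

# Load MNIST, normalise to [0, 1] and add a channel dimension
(x_train, _), (x_test, _) = mnist.load_data()
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = np.reshape(x_train, (len(x_train), 28, 28, 1))
x_test = np.reshape(x_test, (len(x_test), 28, 28, 1))

# Corrupt the images with Gaussian noise
noise_factor = 0.5
x_train_noisy = np.clip(x_train + noise_factor * np.random.normal(size=x_train.shape), 0., 1.)
x_test_noisy = np.clip(x_test + noise_factor * np.random.normal(size=x_test.shape), 0., 1.)

# Encoder
input_img = Input(shape=(28, 28, 1))
x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same', name='encoder')(x)

# Decoder
x = Conv2D(32, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')

# Train on noisy inputs with the clean images as targets (20 epochs instead of 100)
autoencoder.fit(x_train_noisy, x_train,
                epochs=20,
                batch_size=128,
                shuffle=True,
                validation_data=(x_test_noisy, x_test))

autoencoder.save('autoencoder.h5')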

For general explanations of the above lines of code, please refer to the Keras tutorial.

Notice that there are small differences compared to the tutorial. The first difference is this line:

encoded = MaxPooling2D((2, 2), padding='same', name='encoder')(x)

We give the encoder layer a name so that we can access it later.

We also saved the learned model by adding:

autoencoder.save('autoencoder.h5')

This will enable us to load it later in order to test it.

Finally, we reduced the number of epochs from 100 to 20 in order to save time :).

Denoising an image

Let’s use our trained model to denoise an input test image.

First we regenerate the noisy data and load the previously trained autoencoder.

Then we call the following function, which denoises the first noisy test image and plots it:
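A minimal sketch of this step could look like the following (the function name plot_denoised_image is an assumption, and x_test_noisy is the regenerated noisy test set):

import matplotlib.pyplot as plt
from keras.models import load_model

# Load the autoencoder trained in the previous section
autoencoder = load_model('autoencoder.h5')

def plot_denoised_image(noisy_image):
    # The model expects a batch, so keep a leading dimension
    denoised_image = autoencoder.predict(noisy_image.reshape(1, 28, 28, 1))
    for title, image in (('Noisy image', noisy_image), ('Denoised image', denoised_image)):
        plt.figure()
        plt.title(title)
        plt.imshow(image.reshape(28, 28), cmap='gray')
    plt.show()

plot_denoised_image(x_test_noisy[0])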

The result is:

[Figure: the noisy input image and the denoised image produced by the autoencoder]

Computing the features of the training dataset

Our image database is the MNIST training dataset.

Our goal is to provide a query image and find the closest MNIST images.

First, we compute the features of the training dataset and the query images:
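A sketch of this step, assuming the MNIST training images form the database and the test images are used as queries, could be:

from keras.models import Model

# Sub-model that outputs the activations of the named encoder layer
encoder = Model(inputs=autoencoder.input,
                outputs=autoencoder.get_layer('encoder').output)

# Features of the image database (the MNIST training set), flattened to vectors
train_features = encoder.predict(x_train)
train_features = train_features.reshape(len(x_train), -1)

# Features of the query images (here, the MNIST test set)
query_features = encoder.predict(x_test)
query_features = query_features.reshape(len(x_test), -1)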

Scoring function

Before scoring our model we need to understand the scoring function we will use.

To assess the model, we use the scikit-learn function label_ranking_average_precision_score. This function takes two arrays as input: first an array of zeros and ones, and second an array of relevance scores.

In our case, we compute the relevance scores from the distances between the features of the query image and the features of the database images. The lower the distance, the higher the relevance score should be.

We construct the first array following this rule: for each image in the database, if the image has the same label as the query image, we append a ‘1’ to the array; otherwise we append a ‘0’.

This scoring function returns a maximum score of 1 if the closest images have the same label as the query image. If there are images with a different label that are closer to the query image, the score decreases.

To get a feel for what it does, let’s compute the value of this scoring function on some examples.

Suppose we have a query image with label ‘7’ and four images in our database with the following labels: ‘7’, ‘7’, ‘1’, ‘0’. The first two images of the database are relevant to the query image, and the last two are not. The first array that we pass to the scoring function should therefore be [1, 1, 0, 0]. For each image in our image database we will compute a relevance score:
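For illustration, here is what such a call could look like, with hypothetical relevance scores:

from sklearn.metrics import label_ranking_average_precision_score

# Images 1 and 2 share the query label '7'; images 3 and 4 do not
relevant = [[1, 1, 0, 0]]

# If the two '7's get the highest relevance scores, the score is perfect
print(label_ranking_average_precision_score(relevant, [[0.9, 0.8, 0.3, 0.1]]))  # 1.0

# If the '1' ranks above the second '7', the score drops
print(label_ranking_average_precision_score(relevant, [[0.9, 0.3, 0.8, 0.1]]))  # ~0.83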

Scoring the model

For each query image feature, we compute the Euclidean distance to the features of the training dataset images. The smaller the distance, the higher the relevance score should be. Then we apply the scoring function label_ranking_average_precision_score to our results.
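A sketch of this scoring step, assuming train_labels holds the MNIST training labels and using 1 / (1 + distance) as the relevance score, could be:

import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

def score_query(query_feature, query_label, train_features, train_labels, n_results=10):
    # Euclidean distance between the query feature and every database feature
    distances = np.linalg.norm(train_features - query_feature, axis=1)
    # Keep the n closest database images
    closest = np.argsort(distances)[:n_results]
    # 1 if a retrieved image has the same label as the query, 0 otherwise
    relevant = (train_labels[closest] == query_label).astype(int)
    # The smaller the distance, the higher the relevance score
    relevance_scores = 1.0 / (1.0 + distances[closest])
    return label_ranking_average_precision_score([relevant], [relevance_scores])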

Results

The y axis corresponds to the score computed with the label ranking average precision scoring function. The x axis corresponds to the number n of first results assessed.

To better understand this graph, let me give an example. Suppose we have a database of 3 images with labels 7, 7, 1, and suppose the query image has label 7. If our algorithm returns the results in the order 7, 1, 7: first we score only the first retrieved image [7], and the scoring function returns 1. Then we assess the first two retrieved images [7, 1]: the scoring function still returns 1. Then we assess the first three results [7, 1, 7]: the score decreases and is now equal to 0.83, and so on (see the sketch below).
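This behaviour can be checked directly with the scoring function (the relevance scores below are arbitrary values that reproduce the order 7, 1, 7):

from sklearn.metrics import label_ranking_average_precision_score

# First result only: [7]
print(label_ranking_average_precision_score([[1]], [[3]]))              # 1.0
# First two results: [7, 1]
print(label_ranking_average_precision_score([[1, 0]], [[3, 2]]))        # 1.0
# First three results: [7, 1, 7]
print(label_ranking_average_precision_score([[1, 0, 1]], [[3, 2, 1]]))  # ~0.83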

Overall the more retrieved images we assess the worse the score is.

Example:

[Figure: the query image and the first 10 retrieved images]

You can find the full code here.

Conclusion

We tested an image retrieval deep learning algorithm on a basic dataset. Our convolutional denoising autoencoder performs well when considering the first retrieved images. But we tested it on fairly similar images: we did not have to deal with colour, scaling and rotation issues.

Want to go further?

To learn more about autoencoders for CBIR, you can read this research paper by Alex Krizhevsky and Geoffrey Hinton.

If you liked this article please share it or leave your feedback below.

Want more articles like this one? Don’t forget to hit the Follow button!
