Demystifying Neural Networks: Similar Image Search with AutoEncoder

Feb 3, 2024

This article is part of the series Demystifying Neural Networks.

Introduction

In the realm of machine learning, the quest to develop systems capable of identifying, categorizing, and differentiating between images has led to remarkable innovations. Among these, the use of neural networks to facilitate similar image search stands out as a particularly fascinating application. This blog post delves into the mechanics behind using autoencoders, a specific type of neural network, for similar image search tasks, providing insights into their structure, operation, and effectiveness.

Understanding Autoencoders

At its core, an autoencoder is a neural network designed to learn an efficient representation (encoding) of input data, typically for dimensionality reduction, by passing the data through a bottleneck architecture. The network consists of two main components: an encoder and a decoder. The encoder compresses the input into a latent-space representation, and the decoder reconstructs the input from this representation as accurately as possible. Because the training target is the input itself, autoencoders learn this compact representation without labeled data.

Encoder: The Compression Master

The encoder part of an autoencoder takes high-dimensional input data (like images) and compresses it into a lower-dimensional latent space. This process involves learning the most salient features of the data necessary for reconstruction, effectively filtering out noise and redundancy.

Decoder: The Reconstruction Master

The decoder, on the other hand, takes this compressed data and attempts to reconstruct the original input. The fidelity of the reconstruction serves as a measure of the autoencoder’s performance: a lower reconstruction error indicates that the network has captured the critical structure of the input data.
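
Reconstruction fidelity is usually expressed as a per-image error. The snippet below is a minimal NumPy sketch of one common choice, mean squared error, assuming pixel values scaled to [0, 1]. The function name `reconstruction_mse` and the toy arrays are illustrative and are not part of the implementation later in this post.

```python
import numpy as np

def reconstruction_mse(original, reconstructed):
    """Mean squared error per image between inputs and their reconstructions."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    # Average over every axis except the first (batch) axis.
    axes = tuple(range(1, original.ndim))
    return np.mean((original - reconstructed) ** 2, axis=axes)

# Toy batch of two 2x2 "images": the first is reconstructed perfectly,
# the second is off by 0.5 in every pixel.
x = np.array([[[0.0, 1.0], [1.0, 0.0]],
              [[0.5, 0.5], [0.5, 0.5]]])
x_hat = np.array([[[0.0, 1.0], [1.0, 0.0]],
                  [[0.0, 0.0], [0.0, 0.0]]])

errors = reconstruction_mse(x, x_hat)  # one error value per image
print(errors)
```

An error of 0 means a perfect reconstruction; the second toy image has an error of 0.25 (0.5 squared in every pixel).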

Autoencoders in Similar Image Search

Applying autoencoders to similar image search involves training the network to generate a compact and informative representation of each image. Once trained, the encoder part can transform any image into its encoded form, which can then be used to measure similarity between images. This similarity is often calculated using Euclidean distance, where smaller distances correspond to more similar images, or cosine similarity, where larger values do.
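
The two measures behave differently, which a small example makes concrete. This is a sketch with hand-picked 2-D vectors standing in for embeddings. The helper names are illustrative.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance: smaller means more similar."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Cosine of the angle between vectors: larger means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])  # same direction as a, different length
c = np.array([0.0, 1.0])  # orthogonal to a

d_ab = euclidean_distance(a, b)  # 1.0
d_ac = euclidean_distance(a, c)  # sqrt(2)
s_ab = cosine_similarity(a, b)   # 1.0: identical direction
s_ac = cosine_similarity(a, c)   # 0.0: orthogonal
```

Cosine similarity ignores vector length, so `a` and `b` count as identical, while Euclidean distance still separates them. When a distance is needed, cosine distance is commonly defined as one minus the cosine similarity.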

Step-by-Step Approach

1. Training the Autoencoder: The first step involves training the autoencoder on a dataset of images. During this phase, the network learns to compress and reconstruct the images, honing its ability to capture their essential features.

2. Generating Embeddings: Once trained, the encoder is used to convert images into their latent-space representations or embeddings. These embeddings, being of lower dimensionality than the original images, are easier to work with and compare.

3. Finding Similar Images: To find images similar to a given query image, the query is first encoded into its embedding. The distances between this embedding and those of other images in the dataset are then calculated. Images with the closest embeddings are deemed the most similar to the query.
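
The third step can be sketched in a few lines of NumPy. The example assumes the embeddings are already computed and stored as rows of a 2-D array. The function name `top_k_similar` and the toy embeddings are hypothetical and chosen only for illustration.

```python
import numpy as np

def top_k_similar(embeddings, query_index, k=3):
    """Return indices of the k embeddings closest to the query, excluding the query."""
    # Euclidean distance from the query embedding to every row.
    distances = np.linalg.norm(embeddings - embeddings[query_index], axis=1)
    # Exclude the query explicitly instead of assuming it sorts first.
    distances[query_index] = np.inf
    return np.argsort(distances)[:k]

# Five toy 2-D embeddings forming two loose clusters.
embeddings = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [0.0, 0.3],
    [4.9, 5.2],
])

neighbors = top_k_similar(embeddings, query_index=0, k=2)
print(neighbors)  # indices 1 and 3, the other members of the first cluster
```

Setting the query's own distance to infinity keeps it out of the results even when the dataset contains exact duplicates of the query.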

Enhancing Performance with CNNs

While traditional autoencoders are effective, their performance in image-related tasks can be significantly improved by incorporating convolutional neural networks (CNNs). CNNs excel at handling spatial data like images, thanks to their ability to capture local patterns through convolutional filters. By replacing dense layers with convolutional layers in the encoder and decoder, we obtain a CNN-based autoencoder that is more adept at understanding and reconstructing images, thus providing a more accurate and nuanced basis for similarity search.

CNN-based Autoencoder Architecture

Encoder: Uses convolutional and pooling layers to downsample the image, focusing on extracting spatial hierarchies of features.
Decoder: Employs convolutional and upsampling layers to reconstruct the image from the encoded representation, gradually regaining spatial resolution.
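
To see how downsampling and upsampling change spatial resolution, the sketch below traces the image width through such a network in plain Python. It assumes 28x28 inputs, 3x3 stride-1 convolutions, 2x2 pooling and upsampling, and Keras-style 'same' and 'valid' padding, mirroring the layer sequence used in the implementation later in this post. The helper functions are illustrative and are not Keras APIs.

```python
import math

def conv_out(size, kernel=3, padding="same"):
    """Spatial size after a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after 'same'-padded pooling with stride equal to the pool size."""
    return math.ceil(size / pool)

def upsample_out(size, factor=2):
    """Spatial size after nearest-neighbour upsampling."""
    return size * factor

# Encoder: three (convolution, pooling) stages.
size = 28
encoder_sizes = [size]
for _ in range(3):
    size = pool_out(conv_out(size))
    encoder_sizes.append(size)

# Decoder: three (convolution, upsampling) stages; the third convolution
# uses 'valid' padding so the output lands on 28 instead of 32.
decoder_sizes = [size]
for padding in ("same", "same", "valid"):
    size = upsample_out(conv_out(size, padding=padding))
    decoder_sizes.append(size)

print(encoder_sizes)  # [28, 14, 7, 4]
print(decoder_sizes)  # [4, 8, 16, 28]
```

Because 7 is odd, 'same'-padded pooling rounds it up to 4, and doubling 4 three times would overshoot to 32. Trimming 16 to 14 with a 'valid' convolution before the last upsampling is one way to get back to 28.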

Implementing a Similar Image Search

A practical implementation involves several steps, from preprocessing the dataset (e.g., MNIST) and defining the autoencoder model to training the network and using the trained encoder for image search. The process concludes with visualizing the original and similar images, showcasing the autoencoder’s ability to identify and retrieve images with similar content.

The code is available in this colab notebook:

import numpy as np
import matplotlib.pyplot as plt
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model
from keras.datasets import mnist
from keras.optimizers import Adam
from scipy.spatial.distance import cdist

def load_and_preprocess_data():
    """Load and preprocess the MNIST dataset."""
    (x_train, _), (x_test, _) = mnist.load_data()
    x_train = np.expand_dims(x_train, axis=-1).astype('float32') / 255.
    x_test = np.expand_dims(x_test, axis=-1).astype('float32') / 255.
    return x_train, x_test

def build_autoencoder():
    """Build and compile the CNN-based autoencoder."""
    # Input layer: accepts images of shape 28x28x1 (MNIST images)
    input_img = Input(shape=(28, 28, 1))
    
    # Encoder
    # Convolutional layer with 32 filters, each 3x3, using 'relu' activation. 'same' padding ensures output size matches input size.
    x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
    # Max pooling layer to halve the spatial dimensions (28x28 -> 14x14), reducing computation and adding some tolerance to small shifts.
    x = MaxPooling2D((2, 2), padding='same')(x)
    # Another convolutional layer with 16 filters to further extract features from the image.
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
    # Reducing spatial dimensions again to further compress the representation.
    x = MaxPooling2D((2, 2), padding='same')(x)
    # Final convolutional layer in the encoder with 8 filters, focusing on the most abstract features of the image.
    x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
    # Last max pooling layer in the encoder to achieve the final compressed representation.
    encoded = MaxPooling2D((2, 2), padding='same')(x)
    
    # Decoder
    # Convolutional layer with 8 filters, starting the process of decoding the compressed representation.
    x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
    # Upsampling layer to start expanding the spatial dimensions back to the original size.
    x = UpSampling2D((2, 2))(x)
    # Convolutional layer with 16 filters to further refine the decoded features.
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
    # Upsampling again to get closer to the original image size.
    x = UpSampling2D((2, 2))(x)
    # Convolutional layer with 32 filters. It deliberately uses the default 'valid' padding,
    # which shrinks 16x16 to 14x14 so that the final upsampling yields 28x28.
    x = Conv2D(32, (3, 3), activation='relu')(x)
    # Final upsampling to match the original image dimensions.
    x = UpSampling2D((2, 2))(x)
    # Output layer to reconstruct the original image. Uses 'sigmoid' activation to output pixel values between 0 and 1.
    decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
    
    # Compiling the autoencoder model with Adam optimizer and binary cross-entropy loss.
    autoencoder = Model(input_img, decoded)
    autoencoder.compile(optimizer=Adam(), loss='binary_crossentropy')
    return autoencoder


def train_autoencoder(autoencoder, x_train, x_test):
    """Train the autoencoder."""
    autoencoder.fit(x_train, x_train, epochs=10, batch_size=128, shuffle=True, validation_data=(x_test, x_test))

def generate_embeddings(encoder, x_test):
    """Generate embeddings for the test set."""
    return encoder.predict(x_test)

def find_similar_images(embeddings, selected_indices):
    """Find and return indices of similar images based on embeddings."""
    similar_images_indices = []
    for index in selected_indices:
        distances = cdist(embeddings[index:index+1], embeddings, 'euclidean')
        closest_indices = np.argsort(distances)[0][1:4]  # Exclude self
        similar_images_indices.append(closest_indices)
    return similar_images_indices

def display_similar_images(x_test, selected_indices, similar_images_indices):
    """Visualize the original and similar images."""
    plt.figure(figsize=(10, 7))
    for i, (index, sim_indices) in enumerate(zip(selected_indices, similar_images_indices)):
        ax = plt.subplot(3, 4, i * 4 + 1)
        plt.imshow(x_test[index].reshape(28, 28))
        plt.title("Original")
        plt.gray()
        ax.axis('off')
        
        for j, sim_index in enumerate(sim_indices):
            ax = plt.subplot(3, 4, i * 4 + j + 2)
            plt.imshow(x_test[sim_index].reshape(28, 28))
            plt.title(f"Similar {j+1}")
            plt.gray()
            ax.axis('off')
    plt.tight_layout()
    plt.show()

# Main workflow
x_train, x_test = load_and_preprocess_data()
autoencoder = build_autoencoder()
train_autoencoder(autoencoder, x_train, x_test)

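# The last seven layers form the decoder, so layers[-8] is the bottleneck
# (the encoder's final MaxPooling2D, output shape 4x4x8).
encoder = Model(autoencoder.input, autoencoder.layers[-8].output)
encoded_imgs = generate_embeddings(encoder, x_test)
# Flatten each 4x4x8 feature map into a 128-dimensional embedding vector.
encoded_imgs_flatten = encoded_imgs.reshape(len(x_test), -1)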

np.random.seed(0)
selected_indices = np.random.choice(x_test.shape[0], 3, replace=False)
similar_images_indices = find_similar_images(encoded_imgs_flatten, selected_indices)
display_similar_images(x_test, selected_indices, similar_images_indices)

Conclusion

Autoencoders, especially when enhanced with CNN architectures, offer a powerful tool for similar image search tasks. By learning to compress images into meaningful representations, they enable efficient and effective similarity measurements. This capability has vast applications, from organizing photo libraries to improving the accuracy of search engines. As we continue to demystify and harness the potential of neural networks, the horizon of possibilities keeps expanding, promising even more innovative solutions to complex challenges.

Through this exploration, we’ve seen how a combination of theoretical knowledge and practical application can illuminate the path to understanding and leveraging the power of neural networks for tasks as nuanced and diverse as similar image search. Whether you’re a seasoned practitioner or a curious novice, the journey through the landscape of neural networks is one of endless learning and discovery.