Exploring Siamese Networks for Image Similarity using Contrastive Loss

Srikaran
9 min read · Aug 9, 2023

Introduction

In computer vision, accurately measuring image similarity is a crucial task with a wide range of real-world applications. From image search engines to face recognition and content-based recommendation systems, the ability to compare images and find similar ones is of great importance. Siamese networks, coupled with contrastive loss, provide a powerful framework for learning image similarity in a data-driven manner. In this blog post, we will look at how Siamese networks work, introduce the concept of contrastive loss, and explore how the two work together to create an effective image similarity model.

What are Siamese Neural Networks?

Siamese neural networks are a class of neural network architectures designed to compare and measure similarity between pairs of input samples. The term “Siamese” comes from the idea that the network architecture consists of twin neural networks, which are identical in structure and share the same set of weights. Each network processes one input sample from the pair, and their outputs are compared to determine the similarity or dissimilarity between the two inputs.

The main motivation behind Siamese networks is to learn a meaningful representation of input samples that captures their essential features for similarity comparison. These networks excel in tasks where labeled examples per class are limited or hard to obtain: they only need to know whether a pair of inputs is similar or dissimilar, not an explicit class label for every sample.

The architecture of a Siamese network typically consists of three main components: the shared network, the similarity metric, and the contrastive loss function.

  1. Shared Network: The shared network is the core component of the Siamese architecture. It is responsible for extracting meaningful feature representations from the input samples. It consists of layers such as convolutional or fully connected layers that process the input data and produce fixed-length embedding vectors. Because the twin networks share the same weights, the model extracts similar features for similar inputs, enabling effective comparison.
  2. Similarity Metric: Once the inputs are processed by the shared network, a similarity metric is used to compare the generated embeddings and measure the similarity or dissimilarity between the two inputs. The choice of similarity metric depends on the specific task and the nature of the input data. Common similarity metrics include Euclidean distance, cosine similarity, or correlation coefficient. The similarity metric quantifies the distance or correlation between the embeddings and provides a measure of similarity between the input samples.
  3. Contrastive Loss Function: To train the Siamese network, a contrastive loss function is employed. The contrastive loss function encourages the network to produce similar embeddings for similar inputs and dissimilar embeddings for dissimilar inputs. It penalizes the model when the distance between a similar pair is large, and when the distance between a dissimilar pair falls below a margin. The exact formulation of the contrastive loss function depends on the chosen similarity metric and the desired margin between similar and dissimilar pairs.

During training, the Siamese network learns to optimize its parameters to minimize the contrastive loss and produce discriminative embeddings that effectively capture the similarity structure of the input data.
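To make the similarity metric step concrete, here is a minimal pure-Python sketch of two of the metrics mentioned above. The toy vectors `emb_a`, `emb_b`, and `emb_c` are made-up stand-ins for embeddings produced by the shared network, and the helper names are illustrative only.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two embedding vectors (0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings standing in for the outputs of the shared network (assumed values).
emb_a = [1.0, 0.0]
emb_b = [0.0, 1.0]
emb_c = [2.0, 0.0]

print("distance a-b:", euclidean_distance(emb_a, emb_b))
print("cosine a-c:", cosine_similarity(emb_a, emb_c))
```

Note that the two metrics can disagree: `emb_a` and `emb_c` point in the same direction, so their cosine similarity is maximal even though their Euclidean distance is not zero.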

Contrastive Loss Function

Contrastive loss is a loss function commonly used in Siamese networks for learning similarity or dissimilarity between pairs of input samples. It aims to optimize the network’s parameters in such a way that similar inputs have embeddings that are closer together in the feature space, while dissimilar inputs are pushed further apart. By minimizing the contrastive loss, the network learns to generate embeddings that effectively capture the similarity structure of the input data.

To understand the contrastive loss function in detail, let’s break it down into its key components and steps:

  1. Input Pairs: The contrastive loss function operates on pairs of input samples. Each pair is either a positive pair (two similar samples) or a negative pair (two dissimilar samples). These pairs are typically generated during the training process.
  2. Embeddings: The Siamese network processes each input sample through a shared network, generating embedding vectors for both samples in the pair. These embeddings are fixed-length representations that capture the essential features of the input samples.
  3. Distance Metric: A distance metric, such as Euclidean distance or cosine distance (one minus cosine similarity), is used to measure the dissimilarity between the generated embeddings. The choice of distance metric depends on the nature of the input data and the specific requirements of the task.
  4. Contrastive Loss Calculation: The contrastive loss function computes the loss for each pair of embeddings, encouraging similar pairs to have a smaller distance and dissimilar pairs to have a larger distance. The general formula for contrastive loss is as follows:

     L = (1 - y) * D² + y * max(0, m - D)²

Where:

  • L: Contrastive loss for the pair.
  • D: Distance or dissimilarity between the embeddings.
  • y: Label indicating whether the pair is similar (0 for similar, 1 for dissimilar).
  • m: Margin parameter that defines the threshold for dissimilarity.

The term (1 - y) * D² applies to similar pairs (y = 0) and grows with the squared distance between their embeddings, so any separation is penalized and the network is encouraged to pull them together. The term y * max(0, m - D)² applies to dissimilar pairs (y = 1) and is non-zero only when their distance falls below the margin m, pushing the network to move them at least m apart.

5. Aggregating the Loss: To obtain the overall contrastive loss for the entire batch of input pairs, the individual losses are usually averaged or summed across all the pairs. The choice of aggregation method depends on the specific training objective and optimization strategy.
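The formula and the aggregation step can be checked by hand with a small pure-Python sketch. The distances, labels, and the margin of 1.0 below are made-up values chosen for illustration, and averaging is just one of the two aggregation options mentioned above.

```python
def contrastive_loss(distance, label, margin=1.0):
    """Per-pair loss: label 0 = similar pair, label 1 = dissimilar pair."""
    similar_term = (1 - label) * distance ** 2
    dissimilar_term = label * max(0.0, margin - distance) ** 2
    return similar_term + dissimilar_term


def batch_contrastive_loss(distances, labels, margin=1.0):
    """Average the per-pair losses over a batch of pairs."""
    losses = [contrastive_loss(d, y, margin) for d, y in zip(distances, labels)]
    return sum(losses) / len(losses)


# Illustrative batch: one similar pair and two dissimilar pairs.
distances = [0.5, 0.4, 1.5]
labels = [0, 1, 1]

for d, y in zip(distances, labels):
    print(f"D={d}, y={y}, loss={contrastive_loss(d, y):.4f}")
print(f"batch mean: {batch_contrastive_loss(distances, labels):.4f}")
```

With a margin of 1.0, the similar pair at distance 0.5 contributes 0.5² = 0.25, the dissimilar pair at distance 0.4 contributes (1 - 0.4)² = 0.36, and the dissimilar pair at distance 1.5 contributes nothing because it is already beyond the margin. The batch mean is therefore about 0.203.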

By minimizing the contrastive loss through gradient-based optimization methods, such as backpropagation and stochastic gradient descent, the Siamese network learns to produce discriminative embeddings that effectively capture the similarity or dissimilarity structure of the input data.

The contrastive loss function plays a crucial role in training Siamese networks, enabling them to learn meaningful representations that can be used for various tasks such as image similarity, face verification, and text similarity. The specific formulation and parameters of the contrastive loss function can be adjusted based on the characteristics of the data and the requirements of the task at hand.

Siamese Neural Networks in PyTorch

Let's get started with coding then!

1. Dataset Creation

We use the Stanford Dogs dataset, available at http://vision.stanford.edu/aditya86/ImageNetDogs/

Once the dog images are downloaded from the link above, the folder looks like this:

# Directory structure of the dog images after downloading from the repo:
# root_dir
# ├── Japanese_spaniel
# │   ├── Japanese_spaniel.jpg
# │   ├── Japanese_spaniel2.jpg
# │   └── …
# └── Shih-Tzu
#     ├── Shih-Tzu1.jpg
#     ├── Shih-Tzu2.jpg
#     └── …

We pick 3 pairs of similar images and 3 pairs of dissimilar images to fine-tune the model. To keep negative sampling simple, for a given anchor image, an image of any breed other than the anchor's own breed is treated as a negative.

Disclaimer: “similar images” means images from the same dog breed, which form a positive pair; “dissimilar images” means images from different dog breeds, which form a negative pair.

Code Explanation

In the code above, line 46: 6 images are randomly picked from each dog breed folder.

In the code above, line 47: the picked images are moved into a ‘tmp’ folder and renamed “similar_images”, since they come from the same dog breed folder.

In the code above, line 55: once this is complete, they are moved into the “similar_all_images” folder.

In the code above, lines 56 and 57: similarly, to get dissimilar image pairs, 3 images are picked from two different dog breed folders.

The same flow is then repeated to build the dissimilar image pairs and move them to the “dissimilar_all_images” folder.
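The pair selection described above can be sketched as follows. This is not the original script: it builds the pairs in memory instead of moving files between folders, and the function name `sample_pairs`, its parameters, and the `breed_to_images` mapping are hypothetical. It assumes every breed has at least `2 * pairs_per_breed` images.

```python
import random


def sample_pairs(breed_to_images, pairs_per_breed=3, seed=0):
    """Return (image_a, image_b, label) tuples; label 0 = similar, 1 = dissimilar."""
    rng = random.Random(seed)
    breeds = sorted(breed_to_images)
    pairs = []
    for breed in breeds:
        # Similar pairs: pick 2 * pairs_per_breed distinct images of one breed
        # and pair them up two by two.
        picked = rng.sample(breed_to_images[breed], 2 * pairs_per_breed)
        for i in range(0, len(picked), 2):
            pairs.append((picked[i], picked[i + 1], 0))
        # Dissimilar pairs: anchor from this breed, partner from any other breed.
        anchors = rng.sample(breed_to_images[breed], pairs_per_breed)
        for anchor in anchors:
            other_breed = rng.choice([b for b in breeds if b != breed])
            partner = rng.choice(breed_to_images[other_breed])
            pairs.append((anchor, partner, 1))
    return pairs


# Illustrative file names only; real paths come from the downloaded dataset.
example = {
    "Japanese_spaniel": [f"Japanese_spaniel/{i}.jpg" for i in range(6)],
    "Shih-Tzu": [f"Shih-Tzu/{i}.jpg" for i in range(6)],
}
for pair in sample_pairs(example)[:4]:
    print(pair)
```

The labels follow the convention used in the contrastive loss section: 0 for a similar pair and 1 for a dissimilar pair.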

Once all this is done, we can move on to dataset object creation.

In the code above, lines 8 to 10: the images are pre-processed, which includes resizing them to 256. We use a batch size of 32; this can be changed depending on your available compute and GPU memory.

Our network is called SiameseNetwork, and we can see that it looks almost identical to a standard CNN. The only noticeable difference is that we have two forward functions (forward_once and forward). Why is that?

We mentioned that we pass two images through the same network. The forward_once function, called inside the forward function, takes an image as input and passes it through the network. The output for the first image is stored in output1 and the output for the second image is stored in output2, as we can see in the forward function. In this way, we manage to input two images and get two outputs from our model.
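As a rough illustration of this structure, here is a minimal sketch of a network with forward_once and forward. The layer sizes, the 128-dimensional embedding, and the use of adaptive pooling are assumptions made for this example and are not taken from the original model.

```python
import torch
import torch.nn as nn


class SiameseNetwork(nn.Module):
    """Sketch of a Siamese network: one set of weights, applied to both inputs."""

    def __init__(self, embedding_dim=128):
        super().__init__()
        # Convolutional feature extractor (layer sizes are illustrative).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed 4x4 output for any input size
        )
        # Fully connected head that produces the fixed-length embedding.
        self.fc = nn.Sequential(
            nn.Linear(32 * 4 * 4, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, embedding_dim),
        )

    def forward_once(self, x):
        """Embed a single batch of images."""
        features = self.cnn(x)
        features = torch.flatten(features, start_dim=1)
        return self.fc(features)

    def forward(self, input1, input2):
        """Embed both images with the same weights and return both embeddings."""
        output1 = self.forward_once(input1)
        output2 = self.forward_once(input2)
        return output1, output2
```

Because forward calls forward_once twice on the same module, there is only one set of parameters, which is what makes the two branches "twins".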

We have seen what the loss function should look like, so now let’s code it. We create a class called ContrastiveLoss and, as in the model class, give it a forward function.
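A minimal sketch of such a class, following the formula from the contrastive loss section, could look like this. The default margin of 2.0 and the choice to average over the batch are assumptions. The label is expected to have shape (batch, 1), with 0 for similar and 1 for dissimilar pairs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveLoss(nn.Module):
    """L = (1 - y) * D^2 + y * max(0, m - D)^2, averaged over the batch."""

    def __init__(self, margin=2.0):
        super().__init__()
        self.margin = margin

    def forward(self, output1, output2, label):
        # Euclidean distance between the two embeddings, shape (batch, 1).
        distance = F.pairwise_distance(output1, output2, keepdim=True)
        # Similar pairs (label 0) are penalized by their squared distance.
        similar_term = (1 - label) * distance.pow(2)
        # Dissimilar pairs (label 1) are penalized only inside the margin.
        dissimilar_term = label * torch.clamp(self.margin - distance, min=0.0).pow(2)
        return torch.mean(similar_term + dissimilar_term)
```

Note that F.pairwise_distance adds a small epsilon for numerical stability, so the distance between two identical embeddings is very close to, but not exactly, zero.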

Following the flow described earlier, we can now create the training loop. We iterate 100 times and extract the two images as well as the label. We zero the gradients and pass our two images into the network, and the network outputs two vectors. The two vectors and the label are then fed into the criterion (loss function) that we defined. We backpropagate and optimize. To see how our model is performing on the training set, we print the loss every 10 batches.
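The loop described above can be sketched as a small function. The name `train_siamese` and its parameters are hypothetical; it assumes `model` is a PyTorch module returning two embeddings, `criterion` takes (output1, output2, label), `loader` yields (image, image, label) batches, and `optimizer` is any PyTorch optimizer.

```python
def train_siamese(model, criterion, loader, optimizer, num_epochs=100, log_every=10):
    """Train a two-input model and return the losses that were logged."""
    logged_losses = []
    step = 0
    model.train()
    for epoch in range(num_epochs):
        for img0, img1, label in loader:
            optimizer.zero_grad()                  # reset gradients from the last step
            output1, output2 = model(img0, img1)   # two embeddings, shared weights
            loss = criterion(output1, output2, label)
            loss.backward()                        # backpropagate
            optimizer.step()                       # update the shared weights
            if step % log_every == 0:              # print the loss every log_every batches
                print(f"epoch {epoch} step {step} loss {loss.item():.4f}")
                logged_losses.append(loss.item())
            step += 1
    return logged_losses
```

Returning the logged losses makes it easy to plot the training curve afterwards instead of reading the values from the console.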

We can now analyze the results. The first thing we can see is that the loss started around 1.6 and ended at a number pretty close to 1.

It would be interesting to see the model in action. Now comes the part where we test our model on images it didn’t see before. As we have done before, we create a Siamese Network Dataset using our custom dataset class, but now we point it to the test folder.

As the next steps, we extract the first image from the first batch and iterate 5 times to extract the 5 images in the next 5 batches because we set that each batch contains one image. Then, combining the two images horizontally, using torch.cat(), we get a pretty clear visualization of which image is compared to which.

We pass the two images into the model and obtain two vectors, which are then passed into the F.pairwise_distance() function; this calculates the Euclidean distance between the two vectors. We can use this distance as a metric of how dissimilar the two images are.
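The inference step can be sketched as below. Both helper names, `dissimilarity` and `side_by_side`, are hypothetical; the sketch assumes the model returns two embeddings and that single images are (channels, height, width) tensors of equal height.

```python
import torch
import torch.nn.functional as F


def dissimilarity(model, img_a, img_b):
    """Euclidean distance between the embeddings of two image batches."""
    model.eval()
    with torch.no_grad():  # no gradients are needed at test time
        output1, output2 = model(img_a, img_b)
        return F.pairwise_distance(output1, output2)


def side_by_side(img_a, img_b):
    """Concatenate two (C, H, W) images along the width for display."""
    return torch.cat((img_a, img_b), dim=2)
```

A smaller returned distance means the model considers the two images more similar; what counts as "small" depends on the margin used during training.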

Summary

Siamese networks, in combination with contrastive loss, provide a robust and effective framework for learning image similarity. By training on pairs of similar and dissimilar images, these networks can learn to extract discriminative embeddings that capture the essential visual features. The contrastive loss function further enhances the model’s ability to accurately measure image similarity by optimizing the embedding space. With the advancements in deep learning and computer vision, Siamese networks offer great potential in various domains, including image search, face verification, and recommendation systems. By leveraging these techniques, we can unlock exciting possibilities for content-based image retrieval, visual understanding, and intelligent decision-making in the visual domain.
