Siamese Neural Networks for One-shot Image Recognition: Paper Analysis

Today we will analyse the paper “Siamese Neural Networks for One-shot Image Recognition” by Koch et al. Let's get started.

What is a Siamese Neural Network?

A Siamese neural network is a special type of neural network built from two identical sub-networks. First we pass an image x1 through a sequence of convolutional, pooling and fully connected layers and end up with a feature vector f(x1) (see Fig 1).

Then we pass a second image x2 through the same sequence of layers, with the same weights, to get another feature vector f(x2). Now we compute d, the distance between the feature vectors f(x1) and f(x2).

If d is small, the two images are likely to show the same thing; if d is large, they are likely to be different.
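To make the idea concrete, here is a minimal Python sketch of the comparison step. The feature vectors, the choice of Euclidean distance and the threshold of 1.0 are all made up for illustration; in a real network f(x1) and f(x2) would come from the trained layers.

```python
import math

def l2_distance(f1, f2):
    """Euclidean distance between two equal-length feature vectors."""
    if len(f1) != len(f2):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

def same_class(f1, f2, threshold=1.0):
    """Call the pair 'same' when d falls below a hypothetical threshold."""
    return l2_distance(f1, f2) < threshold

# Made-up feature vectors standing in for f(x1), f(x2), f(x3).
f_x1 = [0.2, 0.9, 0.4]
f_x2 = [0.25, 0.85, 0.4]  # close to f_x1
f_x3 = [0.9, 0.1, 0.8]    # far from f_x1

print(same_class(f_x1, f_x2))
print(same_class(f_x1, f_x3))
```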

Fig 1: A Siamese Neural Network for Image Recognition

In this paper, one-shot image recognition is used to classify handwritten characters.

One-shot Image Recognition
People may ask why the authors used a one-shot approach when there are other state-of-the-art models such as CNNs and Hierarchical Bayesian Program Learning. The main reason is lack of data. State-of-the-art machine learning algorithms work very well when there is a huge amount of data but can fail miserably when data is scarce.

In this setting the model must make the correct prediction given only one example of each class. In this paper, however, the authors train on more than one example per class, though still far fewer than state-of-the-art algorithms usually require.

Previous Work

One-shot learning is a part of machine learning that has received relatively little attention from the ML community over the years. Earlier, Li Fei-Fei et al. developed a variational Bayesian framework for one-shot image classification. More recently, Lake et al. approached the same problem with Hierarchical Bayesian Program Learning (HBPL), in which the algorithm determines a structural explanation for the observed pixels.


Dataset Used

The Omniglot dataset is used to build the model. It contains 50 alphabets, ranging from well-established international languages to lesser-known local dialects. It is split into a 40-alphabet background set and a 10-alphabet evaluation set. The background set is used for tuning hyperparameters and learning feature mappings, while the evaluation set is used only to measure one-shot classification performance.

Fig 2- Example of a 20-way one-shot classification task using Omniglot dataset


In this model, a Siamese neural network consists of twin networks that take two distinct inputs and are joined by an energy function at the top. The parameters of the twin networks are tied to each other. This weight tying ensures that two similar images are mapped to nearby locations in feature space, because each network computes the same function.
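A small sketch of what weight tying buys us. Here a single linear layer with ReLU stands in for a full twin network, and the parameter values are invented; the point is only that both inputs go through the same function with the same weights, so the comparison is symmetric and an image is always at distance zero from itself.

```python
def embed(x, weights, bias):
    """One shared linear layer followed by ReLU; a stand-in for a twin network."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def l1_distance(h1, h2):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

# One set of (made-up) parameters, used by both twins.
W = [[0.5, -0.2], [0.1, 0.3]]
b = [0.0, 0.1]

x1, x2 = [1.0, 2.0], [2.0, 1.0]
d12 = l1_distance(embed(x1, W, b), embed(x2, W, b))
d21 = l1_distance(embed(x2, W, b), embed(x1, W, b))
d11 = l1_distance(embed(x1, W, b), embed(x1, W, b))

print(d12 == d21)  # swapping the inputs does not change the distance
print(d11)         # an input compared with itself
```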

The model is a Siamese convolutional neural network with L layers, each with N_l units, where h_{1,l} represents the hidden vector in layer l of the first twin and h_{2,l} the same for the second twin. It uses ReLU units in the first L-2 layers and sigmoidal units in the remaining layers.

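It uses a sequence of convolutional layers, each of which uses a single channel with filters of varying size and a fixed stride of 1. The number of convolutional filters is specified as a multiple of 16 to optimize performance (multiples of 16 were observed to improve the performance of the model). The network then applies a ReLU activation to the output feature maps, followed by max pooling with a stride of 2. The kth filter map in each layer takes the following form:

$$a^{(k)}_{1,m} = \text{max-pool}\big(\max(0,\; W^{(k)}_{l-1,l} \star h_{1,(l-1)} + b_l),\; 2\big)$$
$$a^{(k)}_{2,m} = \text{max-pool}\big(\max(0,\; W^{(k)}_{l-1,l} \star h_{2,(l-1)} + b_l),\; 2\big)$$

where W_{l-1,l} is a 3-dimensional tensor representing the feature maps for layer l, \star is the valid convolution operation, and h_{1,(l-1)} and h_{2,(l-1)} are the hidden vectors of the two twins at the previous layer.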
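The convolution, ReLU and max-pooling step described above can be sketched for a single channel in plain Python. This is only a toy: the image, kernel and bias are invented, and like most deep learning libraries it computes a cross-correlation (no kernel flip) with no padding.

```python
def valid_conv2d(image, kernel, bias=0.0):
    """'Valid' 2-D cross-correlation with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw)) + bias
             for j in range(out_w)]
            for i in range(out_h)]

def relu(fmap):
    """Element-wise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; window and stride both equal `size`."""
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

# Toy 5x5 image with pixel value i + j, and a made-up 2x2 kernel.
image = [[i + j for j in range(5)] for i in range(5)]
kernel = [[1, 0], [0, -1]]

# 5x5 -> conv 2x2 -> 4x4 -> ReLU -> pool 2 -> 2x2
feature_map = max_pool(relu(valid_conv2d(image, kernel, bias=3.0)), 2)
print(feature_map)
```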

Fig 3- Deep Convolutional Network for One-shot Image Recognition


The authors used several techniques to help the model train well.

Loss function- The loss function used in this model is a regularized cross-entropy, which is commonly used when training neural networks.
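As a rough sketch of such a loss, here is binary cross-entropy plus an L2 penalty in plain Python. The single regularization strength lam, the eps clamp and the example numbers are my own simplifications for illustration, not values from the paper.

```python
import math

def regularized_cross_entropy(p, y, weights, lam=0.01, eps=1e-12):
    """Binary cross-entropy for one pair plus an L2 penalty on the weights.

    p: predicted probability that the pair is the same class
    y: 1 for a same-class pair, 0 otherwise
    lam: hypothetical regularization strength
    """
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    l2 = lam * sum(w * w for w in weights)
    return bce + l2

weights = [0.5, -0.3, 0.1]
good = regularized_cross_entropy(0.9, 1, weights)  # confident and right
bad = regularized_cross_entropy(0.1, 1, weights)   # confident and wrong
print(good < bad)
```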

Weight initialization- In the convolutional layers, the weights are initialized from a normal distribution with mean zero and standard deviation 0.01, and the biases from a normal distribution with mean 0.5 and standard deviation 0.01. In the fully connected layers, the biases are initialized in the same way as in the convolutional layers, but the weights are drawn from a much wider normal distribution with mean zero and standard deviation 0.2.
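A minimal sketch of the convolutional-layer initialization using only Python's random module. The helper name init_normal, the seed and the number of samples are my own choices; the same helper could be reused for other layers with different parameters.

```python
import random

def init_normal(n, mean, std, rng):
    """Draw n values from a normal distribution with the given mean and std."""
    return [rng.gauss(mean, std) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Convolutional layer: weights ~ N(0, 0.01), biases ~ N(0.5, 0.01)
conv_weights = init_normal(10000, 0.0, 0.01, rng)
conv_biases = init_normal(10000, 0.5, 0.01, rng)

mean_w = sum(conv_weights) / len(conv_weights)
mean_b = sum(conv_biases) / len(conv_biases)
print(round(mean_w, 3), round(mean_b, 3))
```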

Optimization- The model is trained with momentum-based mini-batch gradient descent, using a mini-batch size of 128 and a layer-wise learning rate η_j.

Learning schedule- The model uses a different learning rate for each layer, but all learning rates are decayed uniformly by 1 percent per epoch. The momentum starts at 0.5 in each layer and increases linearly until it reaches the value µ_j for that layer. The model was trained for 200 epochs, while monitoring the one-shot validation error on a set of 320 one-shot learning tasks generated randomly from the validation set.
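The schedule can be written down in a few lines. In this sketch the initial learning rate, the final momentum and especially the length of the momentum ramp (ramp_epochs) are placeholder values that I picked for illustration.

```python
def learning_rate(initial_lr, epoch, decay=0.01):
    """Learning rate after `epoch` epochs of 1 percent decay per epoch."""
    return initial_lr * (1.0 - decay) ** epoch

def momentum(epoch, final_momentum, start=0.5, ramp_epochs=50):
    """Linear ramp from `start` to `final_momentum`.

    ramp_epochs is a hypothetical ramp length, not a value from the paper.
    """
    if epoch >= ramp_epochs:
        return final_momentum
    return start + (final_momentum - start) * epoch / ramp_epochs

for epoch in (0, 1, 25, 50, 199):
    print(epoch, learning_rate(0.1, epoch), momentum(epoch, final_momentum=0.9))
```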

Hyperparameter optimization- The authors used the beta version of Whetlab, a Bayesian optimization framework, to perform hyperparameter selection. The size of the convolutional filters varied from 3×3 to 20×20, while the number of convolutional filters varied from 16 to 256 in multiples of 16. Fully connected layers ranged from 128 to 4096 units.

The authors also augmented the training set with small affine distortions.


One-shot learning performance was evaluated with a 20-way within-alphabet classification task: an alphabet is first chosen from those reserved for the evaluation set, and then twenty characters are taken from it uniformly at random.
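Here is one way such a task could be assembled, as a sketch. The data layout (a dict from character name to a list of samples) and the rule of drawing two different samples per character, one for the support set and one as the test image, are my assumptions to keep the example small.

```python
import random

def make_one_shot_task(alphabet, n_way=20, rng=random):
    """Build an n-way one-shot task from a single alphabet.

    alphabet: dict mapping character name -> list of samples (at least 2 each).
    Returns (support, queries), each a list of (character, sample) pairs.
    """
    chars = rng.sample(sorted(alphabet), n_way)
    support, queries = [], []
    for c in chars:
        s, q = rng.sample(alphabet[c], 2)  # two distinct samples of the character
        support.append((c, s))
        queries.append((c, q))
    return support, queries

# Toy alphabet: 30 characters with 4 (fake) drawings each.
toy_alphabet = {f"char{i:02d}": [f"char{i:02d}_drawing{k}" for k in range(4)]
                for i in range(30)}
support, queries = make_one_shot_task(toy_alphabet, n_way=20, rng=random.Random(0))
print(len(support), len(queries))
```

Each test image in queries is then compared against all twenty support images, and the prediction counts as correct if the best match has the same character label.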

Fig 4- Comparing best one-shot accuracy from each type of network

The results in Fig 4 show that the convolutional Siamese net performs very well and is only a few percentage points away from human performance (a small avoidable bias).

My insights

Koch et al. used a convolutional Siamese network to classify pairs of images from the Omniglot dataset, with each twin being a convolutional neural net (CNN). The twin network had the following architecture: convolution with 64 @ 10×10 filters, ReLU -> max pooling -> convolution with 128 @ 7×7 filters, ReLU -> max pooling -> convolution with 128 @ 4×4 filters, ReLU -> max pooling -> convolution with 256 @ 4×4 filters. This was followed by a fully connected layer with sigmoid activation, then the L1 Siamese distance between the two twins, and finally a fully connected output layer with sigmoid activation.
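It helps to trace how the feature map shrinks through this stack. The sketch below assumes a 105×105 input image (the size used for Omniglot in the paper, not stated above), stride-1 convolutions without padding, and 2×2 max pooling.

```python
def conv_out(size, kernel, stride=1):
    """Output width of a convolution with no padding."""
    return (size - kernel) // stride + 1

def pool_out(size, window=2):
    """Output width of non-overlapping max pooling."""
    return size // window

size = 105  # assumed input width and height
layers = [("conv", 10), ("pool", 2), ("conv", 7), ("pool", 2),
          ("conv", 4), ("pool", 2), ("conv", 4)]

trace = []
for kind, k in layers:
    size = conv_out(size, k) if kind == "conv" else pool_out(size, k)
    trace.append(size)

flattened = 256 * size * size  # 256 filters in the last convolution
print(trace)      # [96, 48, 42, 21, 18, 9, 6]
print(flattened)  # 9216
```

So under these assumptions each twin hands a 9216-dimensional vector to the fully connected layer.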

Fig 5- Architecture Diagram by Hastley

The output is squashed into the range [0, 1]. The target is t = 1 for pairs of images from the same class and t = 0 for pairs from different classes. The final layer is effectively a logistic regression with a sigmoid activation. The loss function is cross-entropy, with an L2 weight decay term to improve generalization.

The Siamese net then takes a test image and outputs the image from the support set that it considers most similar.

Note- It uses argmax rather than argmin. With a distance such as the L2 norm, a higher value means the images are more different, so we would pick the minimum. This model takes a different approach: it outputs p(x1∘x2), the probability that the two images belong to the same class, so we pick the maximum.
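The last two steps (the L1 distance between the twins and the sigmoid output unit) can be sketched as follows. The weights alpha, the bias term and the feature vectors are invented for illustration; a trained network would learn alpha from data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def similarity(h1, h2, alpha, bias=0.0):
    """Weighted L1 distance between twin features, squashed into [0, 1].

    alpha and bias are hypothetical learned parameters of the output unit.
    """
    z = sum(a * abs(u - v) for a, u, v in zip(alpha, h1, h2)) + bias
    return sigmoid(z)

alpha = [-2.0, -2.0, -2.0]  # negative weights: larger distance, lower score
bias = 3.0

p_same = similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0], alpha, bias)
p_diff = similarity([1.0, 0.0, 1.0], [0.0, 1.0, 0.0], alpha, bias)
print(p_same > p_diff)
```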
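A tiny sketch of the argmax step. Images are replaced by single numbers and toy_similarity is a made-up stand-in for the trained network's output, so only the selection logic is meaningful here.

```python
def classify(test_image, support_set, similarity_fn):
    """Return the label of the support example with the highest similarity."""
    best_label, _ = max(support_set,
                        key=lambda item: similarity_fn(test_image, item[1]))
    return best_label

def toy_similarity(a, b):
    """Stand-in score in (0, 1]: higher means more similar."""
    return 1.0 / (1.0 + abs(a - b))

# (label, "image") pairs; one example per class.
support_set = [("a", 0.1), ("b", 0.5), ("c", 0.9)]
prediction = classify(0.55, support_set, toy_similarity)
print(prediction)  # b
```

If the network returned a distance instead of a similarity, the only change would be to replace max with min.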


I have tried my best to present this paper in a simpler way than the original. I will be implementing it in the near future. Thanks for reading. I would love to hear your feedback or answer any questions you have.