Fig. Image by Girija Shankar Behera

Face Recognition using Siamese Networks

Do you have a small dataset? Don’t worry, this one is for you!

Girija Shankar Behera
Jan 20, 2021


A Facial Recognition System is a technology that can locate a human face in an image or a video and determine whose face it is. In recent years, Face Recognition has proven very useful for user identity verification, in many cases replacing older authentication mechanisms such as passwords or One Time Passwords. Over the last decade or so, face verification has spread rapidly through the smartphone industry, alongside apps like Snapchat and Instagram that put interesting filters on faces. Ongoing research in this field continues to open up a wide range of applications, such as surveillance systems and law enforcement.

This article explores a popular algorithm behind Face Recognition systems: Siamese Networks. It assumes that the reader has a basic understanding of Machine Learning and Deep Learning and some experience building Neural Networks with TensorFlow. Here is the list of topics that we are going to cover.

  • What is Face Recognition? & How is it different from a Face Detection System?
  • A little background on the term Siamese
  • Understanding of Siamese Networks
  • Olivetti Dataset for the faces, Generating Image Pairs for Training & Testing
  • Building a TensorFlow model for Siamese Networks, training and testing the model
  • Where and When to use the Siamese Networks?

What is Face Recognition? & How is it different from Face Detection?

The question for me was, Is there a difference between the two? Or are they the same thing?

Well, Face Detection is finding out that there’s a human face in a photo, whereas Face Recognition goes one step further and also identifies whose face it is. So, a Face Recognition system needs Face Detection to run first and return all the faces found in the photo, and then it runs its recognition algorithm on them to work out who is in the photo.

Fig. Face Detection vs Face Recognition

In ML terms, for a Face Recognition system to work, we first need a Face Detection system such as a Haar Cascade to locate all the faces in a photo. The faces are cropped out using the detected coordinates, and the Face Recognition algorithm then runs on these cropped faces.

A little background on the term Siamese

The term originally comes from the conjoined twin brothers Chang and Eng Bunker (May 11, 1811 to January 17, 1874), who were the first such pair to become known internationally. It is used for twins who are physically connected to each other at the chest, the abdomen or the pelvis. The two brothers were originally from Thailand, formerly known as Siam, hence the name.

The Neural Network we are going to see in this article also consists of a pair of identical networks, so it takes its name from the Siamese twins.

Understanding of Siamese Networks

Just as the Siamese twins are physically connected, a Siamese Network consists of a pair of Neural Networks that are identical to each other, also known as Sister Networks.


Unlike a conventional CNN, the Siamese Network does not classify images into categories or labels; it only computes the distance between two given images. If the images have the same label, the network should learn its parameters, i.e. the weights and the biases, so that it produces a small distance between them. If they have different labels, the distance should be larger.

Fig. Architecture of a Siamese Network.

As the diagram shows, the two networks in the pair are identical. The Siamese Network works as follows.

  • To train a Siamese Network, a pair of images is picked from the dataset, and each image is processed by one of the two networks. (In the next few sections, we will see how to generate pairs of images from the dataset.)
  • The networks have the same structure, hence the same operations are performed on both images.
  • Each network ends with Fully Connected Layers, the last of which has 128 nodes. The output of this layer is the final feature vector that the network produces for an image. It’s called the Embedding Layer Representation. So the two images in a pair produce two Embedding Layer Representations, one per image.
  • The Network then computes the Euclidean distance between the two embeddings. If the images are of the same person, the embeddings are expected to be very similar, hence the distance should be small. If the images are of different people, the distance is expected to be larger.
  • A Sigmoid Function is applied to the distance value to bring it into the 0–1 range.
  • A loss function is applied to the sigmoid output, and the resulting error is used to update the weights and the biases of the network. I’m using Binary Cross Entropy as the loss function in this article. Both networks receive exactly the same updates to their weights and biases.
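
The last three steps can be written compactly. The notation is introduced here only for illustration: f_θ is the shared network, x_1 and x_2 are the two images, and y is the pair label (assumed to be 1 for the same person and 0 otherwise). The learnable scale w and bias b inside the sigmoid are an assumption rather than something stated above; a bare sigmoid applied to a non-negative distance could only produce values between 0.5 and 1.

```latex
\begin{aligned}
d(x_1, x_2) &= \lVert f_\theta(x_1) - f_\theta(x_2) \rVert_2
             = \sqrt{\sum_{i=1}^{128} \bigl(f_\theta(x_1)_i - f_\theta(x_2)_i\bigr)^2} \\
\hat{y} &= \sigma(w\,d + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
\mathcal{L}(y, \hat{y}) &= -\bigl[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\bigr]
\end{aligned}
```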

This process repeats for all the image pairs generated from the dataset. I have implemented the approach described above using TensorFlow APIs to build the network architecture.

Model Implementation:

  • The network takes images of shape 64x64.
  • It then has three blocks of Conv-Pool-Dropout layers (as shown in the image above).
  • The network ends with the 128-node Fully Connected Embedding Layer Representation.
  • Although the architecture applies the same structure to two different images, this can be achieved with a single instance of the network. Parameter updates also become simpler this way, as the weights and the biases live in that one instance only.
  • Two images are provided to the network, and it produces an embedding, i.e. a feature vector, for each of them. Hence the network also acts as a Feature Extractor.
  • The Euclidean distance is calculated as the square root of the sum of the squared differences between the two embeddings. The Lambda API from TensorFlow Layers is used for this purpose. The distance value is then mapped to the 0–1 range using a Sigmoid.

Binary Cross Entropy is used as the loss function for the model.
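
A minimal sketch of such a model in TensorFlow Keras, following the points listed above. The filter counts, kernel size, dropout rate, the global average pooling before the embedding layer, the Adam optimizer and the use of a `Dense(1, activation="sigmoid")` layer to apply the sigmoid are all assumptions made for illustration; the function names are hypothetical as well.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model


def build_embedding_network(input_shape=(64, 64, 1), embedding_dim=128):
    """Shared feature extractor: three Conv-Pool-Dropout blocks, then a 128-d embedding."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (64, 128, 256):  # assumed filter counts
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
        x = layers.Dropout(0.3)(x)  # assumed dropout rate
    x = layers.GlobalAveragePooling2D()(x)  # assumed way of flattening the feature maps
    outputs = layers.Dense(embedding_dim)(x)
    return Model(inputs, outputs, name="embedding_network")


def euclidean_distance(vectors):
    """Square root of the sum of squared differences between two embeddings."""
    a, b = vectors
    sum_squared = tf.reduce_sum(tf.square(a - b), axis=1, keepdims=True)
    # Clamp to a small positive value so the square root has a finite gradient at 0.
    return tf.sqrt(tf.maximum(sum_squared, 1e-7))


def build_siamese_model(input_shape=(64, 64, 1)):
    # A single extractor instance is applied to both inputs, so the weights are shared.
    extractor = build_embedding_network(input_shape)
    image_a = layers.Input(shape=input_shape)
    image_b = layers.Input(shape=input_shape)
    distance = layers.Lambda(euclidean_distance)([extractor(image_a), extractor(image_b)])
    # Assumed: a one-unit sigmoid layer maps the distance to a 0-1 similarity score.
    score = layers.Dense(1, activation="sigmoid")(distance)
    model = Model(inputs=[image_a, image_b], outputs=score)
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
```

Because both inputs pass through the same `extractor` object, there is only one set of weights to update, which matches the single-instance point in the list above.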

Olivetti Dataset for the faces

Before diving further into how we generate image pairs, let’s look at the dataset we are using. For my Face Recognition system, I’m using the Olivetti faces dataset fetched through the sklearn datasets API. It has a total of 400 face images of 40 people, with 10 images per person.

Fig. Top 2 rows have a set of 10 faces of the same person. Bottom 2 rows have faces of 10 different people shown. Pictures taken from Olivetti Faces from sklearn.datasets API. Image by Girija Behera

The labels of the images are person ids. The Olivetti faces dataset has the following features.

  • All the images are cropped tightly to the face; even the ears have been cut out.
  • The images are grayscale, and the contrast and the brightness appear to have been adjusted.
  • Each person has 10 images, each one with possibly a different facial expression.

Point to note: This is the format in which the images are used for training. For testing, I wanted to use my own images, which are not in the dataset. To be able to do that, I need to convert my images to the same format before using them for testing. More on this later in the article.

Generating Image Pairs for Training

Unlike with a regular CNN, here we don’t feed the network one image at a time; instead, we generate pairs of images from the dataset.

  • An image can be paired up with another image of the same label making a positive pair, or with another image of a different label making a negative pair.
  • The pair generation starts by collecting the indices for each label.
  • Then it iterates over the images in the dataset, and pairs up each image with a random image of the same label as a positive pair, and with a random image of any other label as a negative pair.
Fig. Sample Image Pairs generated for Siamese Network Training

These are some sample image pairs generated with the method above. The method generates two image pairs for each image in the dataset, i.e. a positive pair and a negative pair. Hence, with 400 images, a total of 800 image pairs are generated and used for model training.
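
The pairing procedure can be sketched with NumPy as follows. The function name, the `seed` parameter and the convention of labelling positive pairs 1 and negative pairs 0 are assumptions for illustration.

```python
import numpy as np


def generate_train_image_pairs(images, labels, seed=0):
    """Return (pairs, pair_labels): one positive and one negative pair per image."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    # Collect the indices of the images belonging to each label.
    label_to_indices = {label: np.flatnonzero(labels == label) for label in np.unique(labels)}

    pair_images, pair_labels = [], []
    for index, image in enumerate(images):
        label = labels[index]

        # Positive pair: a random image with the same label
        # (in this simple sketch it may be the image itself).
        positive_index = rng.choice(label_to_indices[label])
        pair_images.append((image, images[positive_index]))
        pair_labels.append(1)

        # Negative pair: a random image with any other label.
        negative_index = rng.choice(np.flatnonzero(labels != label))
        pair_images.append((image, images[negative_index]))
        pair_labels.append(0)

    return np.array(pair_images), np.array(pair_labels)
```

With scikit-learn, the inputs could come from `fetch_olivetti_faces()`, whose `images` array has shape (400, 64, 64) and whose `target` array holds the 40 person ids; the function would then return 800 pairs in an array of shape (800, 2, 64, 64).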

Model Training

So we are done with the model structure and the training image generation part. Now, we start with the actual training using the generated image pairs.

The model is trained on batches of 64 image pairs, and the training runs for 100 iterations.

Model Performance

Fig. Training Loss and Accuracy Plot

The above plot shows the history of the accuracy and the loss of the model during training and validation. I ran the training for 100 iterations. Looking at the graph, the validation loss decreases steadily up to about 80 iterations and then does not change much. Similarly, there is no noticeable improvement in the validation accuracy after the first few iterations. The final model looks reasonably good. Let’s see how it performs at prediction time.

Generating Image Pairs for Testing

To test the model’s performance, we need to perform the following steps:

  • The test image is paired up with one random image of each person in the dataset.
  • To do this, I again collect the indices for each of the labels (as earlier, during training image pair generation).
  • Then, for each label, a random image is fetched and paired up with the test image.
  • The test set therefore consists of 40 image pairs, one for each of the 40 people in the dataset.

The Siamese Model predicts the similarity score for each test pair.
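
A sketch of the test pair generation described above, again with NumPy. The function name and the `seed` parameter are hypothetical.

```python
import numpy as np


def generate_test_image_pairs(images, labels, test_image, seed=0):
    """Pair the test image with one random image of every person in the dataset."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    pair_images, pair_ids = [], []
    for label in np.unique(labels):
        # Pick one random image of this person and pair it with the test image.
        index = rng.choice(np.flatnonzero(labels == label))
        pair_images.append((test_image, images[index]))
        pair_ids.append(label)

    return np.array(pair_images), np.array(pair_ids)
```

For a dataset with 40 people this yields 40 pairs. The two halves of each pair, `pairs[:, 0]` and `pairs[:, 1]`, are what a two-input model would receive at prediction time.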

Below is the prediction result for a test image taken randomly from the same dataset. The number in each of these plots represents the Similarity Score between the two images. The eighth and the tenth images, i.e. the second one in both the 4th and the 5th rows, have higher similarity scores than the others.

Fig. Prediction result

Where and When to use a Siamese Model?

Well, this was one of the most important parts of learning about Siamese Networks. They are mostly used when we don’t have a huge number of images for training. A Deep Learning Neural Network for Classification generally performs well only when the number of images in the dataset is huge.

For example, suppose we are building a Face Recognition System for an office with only 100 employees, and the face repository has at most 10 face images for each of them. That is far too few images to train a Classification Neural Network. A Siamese Network comes in as a handy replacement here. It does not learn the classes; it only finds the distance between images. The number of images need not be high for a Siamese Network.

Let’s say tomorrow a new employee joins, and we’ve got only a single image of them. With the Siamese Model in place, we don’t even need to retrain the model. We can add that single image to our face repository and the model works the same as before. It is able to compute the distance between the new employee and the others. A classification model, on the other hand, would have to be retrained to learn the features of the new employee before it could classify them.
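
To make the point concrete, here is a small sketch of a face repository. The names `score_against_repository` and `toy_similarity` are hypothetical, and `toy_similarity` is only a stand-in for the trained Siamese model so that the example runs on its own.

```python
import numpy as np


def score_against_repository(query_image, repository, similarity_fn):
    """Score the query image against every reference image in the repository."""
    return {
        name: float(similarity_fn(query_image, reference))
        for name, reference in repository.items()
    }


def toy_similarity(image_a, image_b):
    # Stand-in for the trained model: 1.0 for identical images, lower as they differ.
    return float(np.exp(-np.linalg.norm(image_a - image_b)))


# Enrolling a new employee is just one more entry; nothing is retrained.
repository = {"alice": np.zeros((4, 4)), "bob": np.ones((4, 4))}
repository["carol"] = np.full((4, 4), 5.0)
scores = score_against_repository(np.full((4, 4), 5.0), repository, toy_similarity)
```

The scoring function itself never changes when the repository grows, whereas a classifier with a fixed number of output classes would need a new output and further training.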

Generating Test Images:

To demonstrate what’s said above, this time I’m going to test with one of my own images, which was not in the original dataset. I’ll add only this one image to the dataset. And as mentioned, the above model should still work the same.

Fig. Test image transition. 1. new image selected for testing, 2. haar cascade applied to get the face coordinate from the image, 3. crop the face from the image using the coordinates, 4. convert the face to grayscale

The above image shows the transition of the test image taken. It was initially a colored image, on which we applied Haar Cascade to get the face coordinates. These coordinates are then used to crop the face part from the whole image, and then we converted it to grayscale. This is then added to the training dataset.
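
The crop and grayscale steps can be sketched with NumPy alone. This sketch assumes the face coordinates `(x, y, w, h)` have already been produced by a detector such as a Haar Cascade, that the input is an 8-bit RGB image, and that the target is a 64x64 image with values in the 0–1 range, like the Olivetti faces loaded through sklearn. The function name, the luma weights and the nearest-neighbour resize are illustrative choices.

```python
import numpy as np


def crop_to_olivetti_format(image_rgb, face_box, size=64):
    """Crop a detected face, convert it to grayscale and resize it to size x size."""
    x, y, w, h = face_box  # assumed to come from a face detector
    face = image_rgb[y:y + h, x:x + w].astype(np.float32)
    height, width = face.shape[:2]

    # Weighted sum of the R, G and B channels (standard luma weights).
    gray = face @ np.array([0.299, 0.587, 0.114], dtype=np.float32)

    # Nearest-neighbour resize: sample `size` evenly spaced rows and columns.
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    resized = gray[rows][:, cols]

    # Scale 8-bit intensities to the 0-1 range.
    return resized / 255.0
```

A library resize with interpolation would give smoother results; nearest-neighbour sampling is used here only to keep the sketch free of extra dependencies.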

Note that this is the only image I’m adding to the training dataset. Adding a new person to the dataset also adds a new label. A regular CNN would need a new model to be trained on the new image data for classification. But with a Siamese Network, we don’t need to retrain the model. It is able to find the distance between two unseen images as well.

Fig. A visual representation of how a Face Recognition system works

The left image is mine again, taken for testing, and it went through the same transition as the one before. From the GIF above, it seems that out of the 41 faces in the current dataset (including my image), the Face Recognition system finds positive matches with 4–5 of them. I have used a threshold of 0.5, so a pair counts as a positive match only if its similarity score is greater than or equal to the threshold.
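
The thresholding rule can be sketched as below. The function name is hypothetical, and `scores` stands for the similarity scores predicted for the test pairs.

```python
import numpy as np


def positive_matches(scores, labels, threshold=0.5):
    """Return (label, score) for every pair scoring at least `threshold`, best first."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels)

    keep = scores >= threshold          # greater than or equal to the threshold
    order = np.argsort(-scores[keep])   # sort the kept scores in descending order
    return list(zip(labels[keep][order].tolist(), scores[keep][order].tolist()))


matches = positive_matches([0.2, 0.5, 0.9, 0.49], [0, 1, 2, 3])
# matches == [(2, 0.9), (1, 0.5)]
```

Raising the threshold trades fewer false matches for more missed ones, so 0.5 is a starting point rather than a fixed rule.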


That’s it for this article. We saw how a Face Recognition System works. We started with some background on the term Siamese and an understanding of what a Siamese Network looks like. Then we jumped directly into the code for creating a Siamese Network, generating image pairs, and finally training and testing the model. Finally, we saw when a Siamese Network can be useful and tried it out in a demo.

Working on this was a lot of fun. If anything needs improvement, please feel free to let me know in the comments. Thank you.


