An Introduction to Siamese Neural Networks

Hugh Ding
10 min read · Apr 28, 2023

Have you ever wondered how your phone is able to unlock with just a glance? Or how social media platforms can suggest friends you haven’t seen in years? It’s all thanks to facial recognition technology. From detecting emotions to preventing crime, facial recognition has the power to transform our world, but it’s up to us to ensure it’s used responsibly.

Have you ever wanted to create your own facial verification lock? Well, I have: I wanted to set one up in my room as a lock, like the super safes or top-secret rooms in movies.

Using a Siamese neural network (SNN), I built a simple facial verification system, the first AI I ever made!

What are Siamese Neural Networks?

A Siamese neural network is a type of neural network architecture that consists of two or more identical subnetworks that share the same set of weights and parameters. The term “Siamese” comes from the famous conjoined twins Chang and Eng Bunker, who were born in Siam (modern-day Thailand) and were physically joined while remaining two distinct individuals. Similarly, in Siamese neural networks, the subnetworks receive separate input data but share the same weights, allowing them to learn and extract features from the inputs in a coordinated manner.

The main purpose of Siamese neural networks is to measure the similarity or distance between two input samples. This is achieved by passing the input samples through the two subnetworks and comparing the output features using a distance metric, such as Euclidean distance or cosine similarity. The output of the Siamese network is a scalar value that represents the similarity or dissimilarity between the two input samples.
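
To make the comparison concrete, here is a minimal sketch of both metrics applied to two embedding vectors; the vectors a and b are made-up stand-ins for the outputs of the twin subnetworks:

import numpy as np

# Two made-up embedding vectors standing in for the subnetwork outputs
a = np.array([0.2, 0.7, 0.1])
b = np.array([0.3, 0.6, 0.2])

# Euclidean distance: 0 for identical embeddings, grows with dissimilarity
euclidean = np.linalg.norm(a - b)

# Cosine similarity: 1 for identical directions, -1 for opposite ones
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, cosine)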

Why SNNs?

In recent years, there has been a significant increase in the use of deep learning techniques for applications such as computer vision, natural language processing, and speech recognition. Among these techniques, Siamese neural networks have gained much attention for their ability to solve problems that require similarity or distance measurement, such as face recognition, signature verification, and recommendation systems. This article provides an overview of Siamese neural networks, including their architecture, training, and applications.

The Architecture of Siamese Neural Networks

Siamese Network for Signature Verification, Image created by Sean Benhur

The architecture of a Siamese neural network consists of two or more identical subnetworks that share the same set of weights and parameters. Each subnetwork takes an input sample and passes it through a series of layers, such as convolutional layers, pooling layers, and fully connected layers, to extract high-level features from the input data. The output features from the two subnetworks are then passed to a distance metric layer that computes the similarity or distance between the two input samples.

The distance metric layer can take different forms, depending on the application. For example, in face recognition, the distance metric layer can be a Euclidean distance layer that calculates the distance between the two feature vectors representing the input faces. In signature verification, the distance metric layer can be a cosine similarity layer that measures the similarity between the two feature vectors representing the input signatures.

Training Siamese Neural Networks

Feature extraction of an input image

Training a Siamese neural network involves two main steps: feature extraction and similarity/distance metric learning. During the feature extraction step, the subnetworks are trained to extract discriminative features from the input data using a supervised or unsupervised learning approach. This is achieved by minimising a loss function that measures how far the extracted features are from the desired ones.

In the similarity/distance metric learning step, the Siamese neural network is trained to measure the similarity or distance between the two input samples using a distance metric layer. This can be achieved by minimising a loss function that measures the difference between the predicted similarity/distance and the true similarity/distance.

There are different loss functions that can be used for training Siamese neural networks, such as contrastive loss, triplet loss, and quadruplet loss. The contrastive loss is used when pairs are labelled as similar or dissimilar: it pulls the feature vectors of similar pairs together and pushes dissimilar pairs apart by at least a margin. The triplet loss (and its quadruplet extension) instead works on groups of samples, such as an anchor, a positive, and a negative: it encourages the anchor to end up closer to the positive sample than to the negative one.
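
To make this concrete, here is a minimal sketch of the contrastive loss written in TensorFlow; the margin value and the variable names are illustrative, not taken from my project:

import tensorflow as tf

def contrastive_loss(y_true, distance, margin=1.0):
    # y_true: 1 for similar pairs, 0 for dissimilar pairs
    # distance: Euclidean distance between the two embeddings
    # Similar pairs are pulled together (large distances are penalised)
    positive_term = y_true * tf.square(distance)
    # Dissimilar pairs are pushed at least `margin` apart
    negative_term = (1.0 - y_true) * tf.square(tf.maximum(margin - distance, 0.0))
    return tf.reduce_mean(positive_term + negative_term)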

Applications of Siamese Neural Networks

Siamese neural networks have been successfully applied in various applications, such as:

Face recognition:

In face recognition, Siamese neural networks are used to compare and match faces in images or videos. The input images are passed through two identical subnetworks that extract the features of the faces. The output features are then compared using a distance metric, such as Euclidean distance or cosine similarity, to determine the similarity between the two faces. This approach has been used in security systems and surveillance cameras to identify suspects or detect unusual behaviour.

Signature verification:

In signature verification, Siamese neural networks are used to compare and match signatures in documents or transactions. The input signatures are passed through two identical subnetworks that extract the features of the signatures. The output features are then compared using a distance metric, such as cosine similarity or dynamic time warping, to determine the similarity between the two signatures. This approach has been used in banking systems and legal documents to prevent fraud and verify the authenticity of signatures.

Text matching:

In text matching, Siamese neural networks are used to compare and match texts, such as sentences or paragraphs, based on their semantic similarity. The input texts are passed through two identical subnetworks that extract the semantic features of the texts. The output features are then compared using a distance metric, such as cosine similarity or Manhattan distance, to determine the similarity between the two texts. This approach has been used in information retrieval systems and recommendation systems to suggest similar texts or products based on user preferences.

Image retrieval:

In image retrieval, Siamese neural networks are used to compare and match images based on their content similarity. The input images are passed through two identical subnetworks that extract the visual features of the images. The output features are then compared using a distance metric, such as cosine similarity or Euclidean distance, to determine the similarity between the two images. This approach has been used in search engines and e-commerce platforms to suggest similar images or products based on user preferences.

Want to try creating one yourself?

Here’s my project: a facial verification system that takes a live image from my Mac’s camera and compares it against example images, using a Siamese neural network to measure the similarity between them.

Here are the main components of the project:

Collecting anchor and positive data

import os
import uuid

import cv2

# Folders for anchor and positive images (created earlier in the notebook)
ANC_PATH = os.path.join('data', 'anchor')
POS_PATH = os.path.join('data', 'positive')

# Establish a connection to the webcam
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()

    # Cut the frame down to 250x250px, (y, x)
    frame = frame[200:200+250, 600:600+250, :]

    # Read the keyboard once per frame so no key press is lost
    key = cv2.waitKey(1) & 0xFF

    # Collect an anchor image when 'a' is pressed
    if key == ord('a'):
        # Create a unique file path inside the anchor folder
        imgname = os.path.join(ANC_PATH, f'{uuid.uuid1()}.jpg')
        # Write out the anchor image
        cv2.imwrite(imgname, frame)

    # Collect a positive image when 'p' is pressed
    if key == ord('p'):
        # Create a unique file path inside the positive folder
        imgname = os.path.join(POS_PATH, f'{uuid.uuid1()}.jpg')
        # Write out the positive image
        cv2.imwrite(imgname, frame)

    # Show the cropped frame back on screen
    cv2.imshow('Image Collection', frame)

    # Break gracefully when 'q' is pressed
    if key == ord('q'):
        break

# Release the webcam and close the window; this also helps
# forcefully stop cv2 in case the window freezes
cap.release()
cv2.destroyAllWindows()

This code establishes a connection to the webcam using OpenCV (cv2), crops each captured frame to 250x250 pixels, and shows it back on screen. Pressing ‘a’ saves the current frame as an anchor image and pressing ‘p’ saves it as a positive image, gradually building up the dataset used for training.

Make embeddings

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense

# Builds the network that turns an input image into an embedding vector
def make_embedding():
    # Input layer for a 100x100 RGB image
    inp = Input(shape=(100,100,3), name='input_image')

    # First block: 64 filters of size 10x10 scan the image,
    # then max pooling ('same' padding, stride 2) halves the spatial size
    c1 = Conv2D(64, (10,10), activation='relu')(inp)
    m1 = MaxPooling2D(64, (2,2), padding='same')(c1)

    # Second block
    c2 = Conv2D(128, (7,7), activation='relu')(m1)
    m2 = MaxPooling2D(64, (2,2), padding='same')(c2)

    # Third block
    c3 = Conv2D(128, (4,4), activation='relu')(m2)
    m3 = MaxPooling2D(64, (2,2), padding='same')(c3)

    # Final embedding block: one more convolution, then flatten
    # into a single 4096-dimensional embedding vector
    c4 = Conv2D(256, (4,4), activation='relu')(m3)
    f1 = Flatten()(c4)
    d1 = Dense(4096, activation='sigmoid')(f1)

    return Model(inputs=[inp], outputs=[d1], name='embedding')

This code defines a function named make_embedding() which creates a convolutional neural network (CNN) model that takes an image as input and returns an embedding vector as output. The embedding is a compact representation of the input image that can be used for various tasks such as image classification, object detection, and face recognition.

The function starts by defining an input layer using Input() with the shape of the input image, which is 100x100 pixels with 3 colour channels (RGB).

The input image is then passed through three blocks, each consisting of one convolutional layer followed by a max-pooling layer. The number of filters grows from 64 in the first block to 128 in the second and third blocks, so that deeper layers can capture increasingly abstract features.

After the third block, the output is fed into a final embedding block, which has one convolutional layer, followed by a flatten layer, and finally a dense layer with 4096 neurons and a sigmoid activation function. This dense layer outputs the embedding vector, which is the final output of the model.

The convolutional stack defined in this code closely resembles the one used in Koch et al.’s work on Siamese networks for one-shot image recognition. Instead of ending in a softmax classification layer, the model outputs a dense embedding vector, which can be used as a feature vector to compare images or to perform tasks such as clustering and classification.

In summary, this code defines a CNN model that takes an input image, passes it through several convolutional and max-pooling layers, and outputs an embedding vector. This embedding can be used as a compact representation of the input image for various computer vision tasks.
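
One detail worth noting: the frames collected earlier are 250x250 pixels, but the embedding network expects 100x100 inputs, so a preprocessing step sits in between. Here is a minimal sketch of that step; the function name preprocess and the exact scaling choices are my own simplification rather than a verbatim excerpt from the project:

import tensorflow as tf

def preprocess(file_path):
    # Read and decode the image from disk
    byte_img = tf.io.read_file(file_path)
    img = tf.io.decode_jpeg(byte_img)
    # Resize to the 100x100 input expected by the embedding network
    img = tf.image.resize(img, (100, 100))
    # Scale pixel values from [0, 255] to [0, 1]
    img = img / 255.0
    return img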

# Instantiate the embedding network once so both inputs share its weights
embedding = make_embedding()

def make_siamese_model():

    # Anchor image input into the network
    input_image = Input(name='input_img', shape=(100,100,3))

    # Validation image input into the network
    validation_image = Input(name='validation_img', shape=(100,100,3))

    # Combine the Siamese distance components; L1Dist is a small
    # custom layer (sketched below) that computes the L1 distance
    siamese_layer = L1Dist()
    siamese_layer._name = 'distance'
    distances = siamese_layer(embedding(input_image), embedding(validation_image))

    # Classification layer: final output, 1 (match) or 0 (no match)
    classifier = Dense(1, activation='sigmoid')(distances)

    # Tie all the layers together into one model
    return Model(inputs=[input_image, validation_image], outputs=classifier, name='SiameseNetwork')

The make_siamese_model function creates a Siamese model with two inputs: input_img and validation_img. These inputs are passed through the embedding network defined in the previous code block. The two resulting embeddings are then passed through an L1Dist layer, which computes the Manhattan (L1) distance between them. The resulting distance is passed through a Dense layer with sigmoid activation to classify the images as similar (1) or dissimilar (0).
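
L1Dist is not a built-in Keras layer; it is a small custom layer defined earlier in the project. A minimal sketch of how such a layer can be written:

import tensorflow as tf
from tensorflow.keras.layers import Layer

class L1Dist(Layer):
    # Custom layer computing the element-wise L1 (Manhattan) distance
    # between two embedding vectors
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, input_embedding, validation_embedding):
        return tf.math.abs(input_embedding - validation_embedding)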

# Decorator to compile the function into a callable TensorFlow graph,
# turning the whole training step into a graph for more efficient training
@tf.function
def train_step(batch):

    # Record operations for automatic differentiation
    with tf.GradientTape() as tape:

        # Get the anchor and positive/negative images (features)
        X = batch[:2]

        # Get the label
        y = batch[2]

        # Forward pass
        # NOTE: training=True matters, as some layers only activate
        # in training mode
        ypred = siamese_model(X, training=True)

        # Calculate the loss from the true labels and predictions
        loss = binary_cross_loss(y, ypred)

    # Calculate gradients of the loss w.r.t. the trainable variables
    grad = tape.gradient(loss, siamese_model.trainable_variables)

    # Calculate updated weights and apply them to the Siamese model
    opt.apply_gradients(zip(grad, siamese_model.trainable_variables))

    return loss

The train_step function defines a single training step for the Siamese model. It takes a batch of data as input, consisting of pairs of images and labels indicating whether each pair is similar or dissimilar. The function uses automatic differentiation to calculate the gradients of the loss with respect to the trainable variables in the Siamese model, and applies these gradients to update the model weights.
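
The names siamese_model, binary_cross_loss, opt, and checkpoint used above are defined earlier in the notebook. Here is a minimal sketch of that setup; the learning rate and checkpoint directory are illustrative choices, not necessarily the ones from the original project:

import os
import tensorflow as tf

# Instantiate the network defined above
siamese_model = make_siamese_model()

# Binary cross-entropy: the labels are 1 (same person) or 0 (different)
binary_cross_loss = tf.losses.BinaryCrossentropy()

# Adam optimiser with an illustrative learning rate
opt = tf.keras.optimizers.Adam(1e-4)

# Checkpointing so long training runs can be resumed
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt')
checkpoint = tf.train.Checkpoint(opt=opt, siamese_model=siamese_model)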

# Import metric calculations
from tensorflow.keras.metrics import Precision, Recall

def train(data, EPOCHS):
    # Loop through the epochs
    for epoch in range(1, EPOCHS+1):
        print('\n Epoch {}/{}'.format(epoch, EPOCHS))
        progbar = tf.keras.utils.Progbar(len(data))

        # Create the metric objects
        r = Recall()
        p = Precision()

        # Loop through each batch
        for idx, batch in enumerate(data):
            # Run a training step
            loss = train_step(batch)
            yhat = siamese_model.predict(batch[:2])
            r.update_state(batch[2], yhat)
            p.update_state(batch[2], yhat)
            progbar.update(idx+1)
        print(loss.numpy(), r.result().numpy(), p.result().numpy())

        # Save a checkpoint every 10 epochs
        if epoch % 10 == 0:
            checkpoint.save(file_prefix=checkpoint_prefix)

The train function trains the siamese model for a specified number of epochs, looping through each batch of data and calling the train_step function to update the model weights. During training, the function also calculates the precision and recall metrics for each batch, and prints the loss, precision, and recall for each epoch. Additionally, the function saves a checkpoint of the model every 10 epochs.
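
Putting it all together, training is kicked off with a batched tf.data.Dataset that yields (anchor images, comparison images, labels) triples. The dataset construction is omitted here, and the epoch count below is an illustrative choice:

# train_data: a batched tf.data.Dataset of (anchor, comparison, label) batches,
# built by pairing anchors with positives (label 1) and negatives (label 0)
EPOCHS = 50  # illustrative; tune to your data and patience
train(train_data, EPOCHS)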

After training, I exported the Siamese model and loaded it in Visual Studio Code, where I built a UI with Kivy to run the SNN, comparing live camera input against a smaller set of verification images than the one used in training.
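
At verification time, the idea is to compare one freshly captured frame against every stored verification image and count how many comparisons clear a threshold. Here is a minimal sketch under assumed folder names (application_data/input_image and application_data/verification_images); the thresholds are illustrative, and it reuses the preprocess helper sketched earlier:

import os
import numpy as np

def verify(model, detection_threshold=0.5, verification_threshold=0.5):
    results = []
    # The freshly captured frame, written to a fixed location by the UI
    input_img = preprocess(os.path.join('application_data', 'input_image', 'input_image.jpg'))
    for image in os.listdir(os.path.join('application_data', 'verification_images')):
        validation_img = preprocess(os.path.join('application_data', 'verification_images', image))
        # The Siamese model takes a pair of image batches as input
        result = model.predict([np.expand_dims(input_img, 0), np.expand_dims(validation_img, 0)])
        results.append(result)

    # detection: how many comparisons scored above the detection threshold
    detection = np.sum(np.array(results) > detection_threshold)
    # verified: true if a large enough share of comparisons passed
    verified = detection / len(results) > verification_threshold
    return results, verified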

Problems

Building AI models on a Mac has its limitations, such as a weak CPU that makes the machine heat up after about three minutes of training. Worry not: I used Google Colab, a notebook environment similar to Jupyter Notebook that runs your code on cloud CPUs/GPUs, making it unnecessary to train locally. As mentioned before, all I had to do was download the end product, the trained model file, and put it in VS Code to run it normally.

For a full run-through of the code, refer to https://github.com/nicknochnack/FaceRecognition/blob/main/Facial%20Verification%20with%20a%20Siamese%20Network%20-%20Final.ipynb, which I used as a guide/tutorial during my journey to creating my first AI.

I hope you will start, or continue, your own journey in AI/ML and make a change!
