Building a Siamese Neural Network for Face Verification: A Comprehensive Guide

Issam Jebnouni
6 min read · Aug 18, 2023


Introduction

Measuring the similarity between two data points is important in many applications. Consider a facial recognition system: a camera captures an individual's face, and the system then decides whether to grant access to a building by comparing that face against a database of known faces.

Computer vision is central to this kind of technology, since we are comparing features extracted from faces captured by a camera.

Building such a system with a conventional classifier is impractical: the model would need to be retrained every time a person is added to or removed from the database. A more effective approach is the Siamese neural network, which learns to compare two faces rather than to classify them.

Project Repository:

Background

This project builds on several key concepts: object detection, feature extraction, TensorFlow's Functional API, MTCNN, and the Siamese neural network architecture.

  • Object detection: An essential computer vision task that involves identifying and locating objects within images or videos, with applications ranging from autonomous vehicles to surveillance systems.
  • Feature extraction: The process of extracting pertinent information from raw images, creating compact, representative feature vectors for training image classifiers.
  • TensorFlow Functional API: A flexible approach to building intricate machine learning models in TensorFlow. It enables multi-input, multi-output, and shared-layer architectures, offering greater flexibility than the Sequential API. You create your own input layers and then define the model by calling each layer on the output of the previous one, like a function, hence the name (see the sketch after this list).
  • MTCNN (Multi-task Cascaded Convolutional Networks): A sophisticated face detection algorithm. MTCNN employs a three-stage cascade of neural networks to detect faces in images and identify facial landmarks in the process.
MTCNN three-stage face detection algorithm
  • Siamese neural network: Named after Siamese twins, this neural architecture is tailored for comparing the likeness or disparity between two input samples. Primarily used for facial recognition and similarity-based tasks, it features twin subnetworks sharing weights, generating embeddings for input into a specialized loss function. This loss function minimizes the distance between similar inputs and maximizes the distance between dissimilar ones.
Siamese Neural Network Architecture
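As a minimal illustration of the Functional API style described above (a hypothetical example, not part of the project code), note how a single Dense layer is shared across two inputs by calling it like a function:

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model

# Two inputs sharing one Dense layer: the kind of wiring the Sequential API cannot express
input_a = Input(shape=(128,))
input_b = Input(shape=(128,))
shared = Dense(64, activation='relu')  # the same weights process both inputs
merged = Concatenate()([shared(input_a), shared(input_b)])
output = Dense(1, activation='sigmoid')(merged)
model = Model(inputs=[input_a, input_b], outputs=output)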

Implementation

  • Generating Anchor and Positive Images:

The anchor images act as a reference point for comparison. I gathered them, alongside positive images of my face, using OpenCV and a webcam; a sketch of such a capture loop is shown below.
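A minimal sketch of a webcam capture loop, assuming the data/anchor and data/positive folders used later in the article (the key bindings and uuid naming are illustrative, not the project's exact script):

import os
import uuid
import cv2

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    cv2.imshow('Image Collection', frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord('a'):    # save the current frame as an anchor image
        cv2.imwrite(os.path.join('data', 'anchor', f'{uuid.uuid1()}.jpg'), frame)
    elif key == ord('p'):  # save the current frame as a positive image
        cv2.imwrite(os.path.join('data', 'positive', f'{uuid.uuid1()}.jpg'), frame)
    elif key == ord('q'):  # quit
        break
cap.release()
cv2.destroyAllWindows()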

  • Using LFW Dataset for Negative Samples:

Labeled Faces in the Wild (LFW) is a database of face photographs designed for studying the problem of unconstrained face recognition, created and maintained by researchers at the University of Massachusetts, Amherst. It contains 13,233 images of 5,749 people, collected from the web and detected and centered by the Viola-Jones face detector; 1,680 of the people pictured have two or more distinct photos in the dataset.
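One way to fetch the dataset programmatically (a sketch; the URL is LFW's canonical UMass location and should be verified on the project page before use):

import tensorflow as tf

# Downloads and extracts the LFW archive into the Keras cache directory
path = tf.keras.utils.get_file(
    'lfw.tgz',
    origin='http://vis-www.cs.umass.edu/lfw/lfw.tgz',
    extract=True,
)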

  • Data augmentation:

To each image I consecutively applied a random brightness adjustment, a random left/right flip, and a random JPEG quality deterioration, generating nine augmented images per original so that the dataset size grows by a factor of ten.

import numpy as np
import tensorflow as tf

def data_aug(img):
    data = []
    for i in range(9):
        # Apply a small random brightness shift, a random horizontal flip,
        # and a random JPEG quality degradation
        img = tf.image.stateless_random_brightness(img, max_delta=0.02, seed=(1, 2))
        img = tf.image.stateless_random_flip_left_right(img, seed=(np.random.randint(100), np.random.randint(100)))
        img = tf.image.stateless_random_jpeg_quality(img, min_jpeg_quality=90, max_jpeg_quality=100, seed=(np.random.randint(100), np.random.randint(100)))
        data.append(img)

    return data
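A sketch of how data_aug could be run over a folder of collected images (the folder layout is assumed from the paths used later; the output file naming is illustrative):

import os
import cv2

src_folder = os.path.join('data', 'anchor')
for file_name in os.listdir(src_folder):
    img = cv2.imread(os.path.join(src_folder, file_name))
    for i, aug_img in enumerate(data_aug(img)):
        out_name = f'{os.path.splitext(file_name)[0]}_aug{i}.jpg'
        cv2.imwrite(os.path.join(src_folder, out_name), aug_img.numpy())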
  • Face extraction and saving:
import os
import cv2
from mtcnn import MTCNN

def save_face(img_path, dest_folder):

    img = cv2.imread(img_path)
    detector = MTCNN()
    faces = detector.detect_faces(img)

    # Fetch the (x, y) coordinates and the width/height of the first detected face
    x1, y1, w, h = faces[0]['box']
    x1, y1 = abs(x1), abs(y1)
    x2 = abs(x1 + w)
    y2 = abs(y1 + h)

    # Crop the face out of the image, resize it, and save it
    store_face = img[y1:y2, x1:x2]
    store_face = cv2.resize(store_face, (224, 224))
    cv2.imwrite(os.path.join(dest_folder, os.path.basename(img_path)), store_face)
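A sketch of how save_face might be applied over each folder to build the faces/ subdirectories that the dataset code below expects (note that save_face assumes at least one face is detected per image):

import os

for split in ['anchor', 'positive', 'negative']:
    src = os.path.join('data', split)
    dest = os.path.join(src, 'faces')
    os.makedirs(dest, exist_ok=True)
    for file_name in os.listdir(src):
        if file_name.endswith('.jpg'):
            save_face(os.path.join(src, file_name), dest)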
  • Creating the Dataset:
# Setup paths
POS_PATH = os.path.join('data', 'positive')
NEG_PATH = os.path.join('data', 'negative')
ANC_PATH = os.path.join('data', 'anchor')

# Get image directories
anchor = tf.data.Dataset.list_files(ANC_PATH + '/faces/*.jpg').take(2900)
positive = tf.data.Dataset.list_files(POS_PATH + '/faces/*.jpg').take(2900)
negative = tf.data.Dataset.list_files(NEG_PATH + '/faces/*.jpg').take(2900)

# Create labeled dataset: (anchor, positive) pairs are labeled 1, (anchor, negative) pairs 0
positives = tf.data.Dataset.zip((anchor, positive, tf.data.Dataset.from_tensor_slices(tf.ones(len(anchor)))))
negatives = tf.data.Dataset.zip((anchor, negative, tf.data.Dataset.from_tensor_slices(tf.zeros(len(anchor)))))
data = positives.concatenate(negatives)

def preprocess(file_path):
    # Images were already cropped and resized to 224x224 by save_face
    byte_img = tf.io.read_file(file_path)
    img = tf.io.decode_jpeg(byte_img)
    return img

def preprocess_twin(input_img, validation_img, label):
    return (preprocess(input_img), preprocess(validation_img), label)

# Build dataloader pipeline
data = data.map(preprocess_twin)
data = data.cache()
data = data.shuffle(buffer_size=10000)

# Training partition: first 70% of the shuffled data
train_data = data.take(round(len(data) * .7))
train_data = train_data.batch(16)
train_data = train_data.prefetch(8)

# Testing partition: remaining 30%
test_data = data.skip(round(len(data) * .7))
test_data = test_data.take(round(len(data) * .3))
test_data = test_data.batch(16)
test_data = test_data.prefetch(8)
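To sanity-check the pipeline, one batch can be pulled and inspected (an illustrative check, not from the original article):

sample = train_data.as_numpy_iterator().next()
print(sample[0].shape, sample[1].shape, sample[2])  # anchor batch, paired-image batch, labels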
  • Building the Embedding Layer:
from tensorflow.keras.layers import Input, Dense, Layer
from tensorflow.keras.models import Model
from keras_vggface.vggface import VGGFace

def make_embedding():
    inp = Input(shape=(224, 224, 3), name='input_image')
    # Pretrained VGGFace (ResNet-50 backbone) with global average pooling as the feature extractor
    base_model = VGGFace(model='resnet50', include_top=False, input_shape=(224, 224, 3), pooling='avg')
    d = base_model(inp)
    return Model(inputs=[inp], outputs=[d], name='embedding')
Embedding Layer Summary
  • Constructing the Distance Layer:
class L1Dist(Layer):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    # Element-wise absolute difference between the two embeddings
    def call(self, input_embedding, validation_embedding):
        return tf.math.abs(input_embedding - validation_embedding)
  • Building the Siamese Network:
# Instantiate the shared embedding model defined above
embedding = make_embedding()

def make_siamese_model():

    # Anchor image input in the network
    input_image = Input(name='input_img', shape=(224, 224, 3))

    # Validation image in the network
    validation_image = Input(name='validation_img', shape=(224, 224, 3))

    # Combine siamese distance components: the same embedding model processes both inputs
    siamese_layer = L1Dist()
    siamese_layer._name = 'distance'
    distances = siamese_layer(embedding(input_image), embedding(validation_image))
    d = Dense(256)(distances)

    # Classification layer
    classifier = Dense(1, activation='sigmoid')(d)

    return Model(inputs=[input_image, validation_image], outputs=classifier, name='SiameseNetwork')
Siamese Model Summary
  • Defining Custom Training Loop:
from tensorflow.keras.metrics import Precision, Recall

# Instantiate the full model defined above
siamese_model = make_siamese_model()

# The classifier already ends in a sigmoid, so the loss receives probabilities, not logits
binary_cross_loss = tf.losses.BinaryCrossentropy()
opt = tf.keras.optimizers.Adam(1e-4)  # 0.0001

@tf.function
def train_step(batch):
    with tf.GradientTape() as tape:
        # Get anchor and positive/negative image
        X = batch[:2]
        # Get label
        y = batch[2]

        # Forward pass
        yhat = siamese_model(X, training=True)

        # Calculate loss
        loss = binary_cross_loss(y, yhat)

    # Calculate gradients
    grad = tape.gradient(loss, siamese_model.trainable_variables)

    # Calculate updated weights and apply to siamese model
    opt.apply_gradients(zip(grad, siamese_model.trainable_variables))

    return loss

def train(data, EPOCHS):
    # Loop through epochs
    for epoch in range(1, EPOCHS + 1):
        print('\n Epoch {}/{}'.format(epoch, EPOCHS))
        progbar = tf.keras.utils.Progbar(len(data))

        # Create metric objects
        r = Recall()
        p = Precision()

        # Loop through each batch
        for idx, batch in enumerate(data):
            # Run train step here
            loss = train_step(batch)
            yhat = siamese_model.predict(batch[:2])
            r.update_state(batch[2], yhat)
            p.update_state(batch[2], yhat)
            progbar.update(idx + 1)
        print(f'Epoch loss: {loss.numpy()} | Epoch recall: {r.result().numpy()} | Epoch precision: {p.result().numpy()}')
  • Training and Saving The Model:
EPOCHS = 10
train(train_data, EPOCHS)
siamese_model.save('siamesemodel.h5')
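Since the network contains a custom layer, reloading the saved model later requires passing it as a custom object (standard Keras behavior):

from tensorflow.keras.models import load_model

siamese_model = load_model('siamesemodel.h5', custom_objects={'L1Dist': L1Dist})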
  • Evaluating The Model:

This model achieved 0.995 average test recall and 1.0 average test precision.
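The evaluation code is not shown in the article; a sketch of how these averages could be computed over the test partition built earlier:

r = Recall()
p = Precision()
for test_input, test_val, y_true in test_data.as_numpy_iterator():
    yhat = siamese_model.predict([test_input, test_val])
    r.update_state(y_true, yhat)
    p.update_state(y_true, yhat)
print(f'Test recall: {r.result().numpy()} | Test precision: {p.result().numpy()}')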

  • Real-Time Testing:
def verify(model, detection_threshold, verification_threshold):
    results = []
    inp_img = 'application_data/input_image/input_image.jpg'
    verif_folder = 'application_data/verification_images/faces'
    for image in os.listdir(verif_folder):
        input_img = preprocess(inp_img)
        validation_img = preprocess(os.path.join(verif_folder, image))
        result = model.predict(list(np.expand_dims([input_img, validation_img], axis=1)))
        results.append(result)

    # Detection: number of pairs scoring above the detection threshold
    detection = np.sum(np.array(results) > detection_threshold)

    # Verification: proportion of positive detections over all verification images
    verification = detection / len(os.listdir(verif_folder))
    verified = verification > verification_threshold
    return results, verified

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    cv2.imshow('Verification', frame)

    # Verification trigger
    if cv2.waitKey(10) & 0xFF == ord('v'):
        cv2.imwrite('application_data/input_image/input_image.jpg', frame)
        save_face('application_data/input_image/input_image.jpg', 'application_data/input_image')
        results, verified = verify(siamese_model, 0.8, 0.8)
        print(verified)

    if cv2.waitKey(10) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()

Results and Analysis

Similar projects that I drew inspiration from used the whole captured frame and extracted embeddings from it with a hand-built convolutional network. I tried that approach at first, and the results were not satisfying: changing your clothes or the background would affect the predictions. So I used the MTCNN algorithm to detect faces, cropped them, and constructed my dataset from faces only. I also chose a powerful pretrained model trained specifically for face recognition, VGGFace, to extract more meaningful features from the faces. Doing so boosted the performance of the Siamese model considerably. The final model achieved 99.5% average test recall and 100% average test precision on the whole test set.

Summary

This article delved into the realm of facial recognition using Siamese neural networks.

Our journey led us to build a Siamese network, specifically designed for comparing similarities. With twin subnetworks generating embeddings and a tailored loss function, it excelled in facial recognition.

By crafting a training dataset, incorporating data augmentation, and utilizing VGGFace for feature extraction, we strengthened the model's accuracy. The Siamese architecture, coupled with a custom training loop, yielded remarkable average test recall and precision: 99.5% and 100% respectively.

Real-time testing demonstrated the model’s practical utility. In conclusion, this article outlined a powerful approach to building efficient facial recognition systems with broad applications.
