One-shot learning explained using FaceNet

Dhanoop Karunakaran
Intro to Artificial Intelligence
5 min read · Sep 27, 2018

One-shot learning

Nowadays, state-of-the-art computer vision algorithms use deep learning. Standard deep learning classification requires a huge dataset to predict with good accuracy. Take building face recognition for an organisation as an example. With the normal deep learning method, the model has to be trained on a huge number of labelled images of the employees, over a large number of epochs. This method may not be suitable because every time a new employee joins, the model needs to be retrained. Another approach is to train the model on fewer images of the employees, so that it can then be used for new employees without retraining. This approach is called one-shot learning.

In other words, one-shot learning is an object categorization problem in computer vision. Whereas most machine-learning-based object categorization algorithms require training on hundreds or thousands of images in very large datasets, one-shot learning aims to learn information about object categories from one, or only a few, training images.

Siamese network [Source]

One-shot learning can be implemented using a Siamese network. This network consists of two identical CNNs that share the same weights and accept two different images. A normal CNN uses softmax to get the classification, but here the output of the fully connected layer is regarded as a 128-dimensional encoding of the input image. The first network outputs the encoding of the first input image, and the second network outputs the encoding of its input image. These encodings are good representations of the input images.
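
As a minimal sketch (the layer sizes are illustrative assumptions, not the architecture from any particular paper), the shared-weight structure can be written in PyTorch as a single embedding network applied to both images, which is equivalent to two towers with tied weights:

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """A small CNN that maps an image to a 128-D encoding.
    Layer sizes are illustrative, not from any particular paper."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * 40 * 40, 128)  # assumes 160x160 RGB input

    def forward(self, x):
        x = self.conv(x)
        return self.fc(x.flatten(1))

class SiameseNet(nn.Module):
    """Applies the same embedding network (shared weights) to both inputs."""
    def __init__(self):
        super().__init__()
        self.embed = EmbeddingNet()

    def forward(self, img1, img2):
        return self.embed(img1), self.embed(img2)
```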

These networks are optimised based on a loss computed from their outputs. The loss is calculated using a distance metric, as shown above. The distance between the encodings will be small if the images are similar and large when they are not, and training pushes the network towards this behaviour.
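
The figure above does not fix a specific loss; one common choice for training Siamese networks is the contrastive loss, sketched here with an assumed margin of 1.0:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(enc1, enc2, label, margin=1.0):
    """label is a float tensor: 1 for a matching pair, 0 for a
    non-matching pair. Pulls matching encodings together and pushes
    non-matching encodings at least `margin` apart."""
    dist = F.pairwise_distance(enc1, enc2)
    loss = label * dist.pow(2) + (1 - label) * F.relu(margin - dist).pow(2)
    return loss.mean()
```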

FaceNet

Currently, state-of-the-art face recognition systems use one-shot learning. I have come across FaceNet, which is the backbone of many open-source face recognition systems like OpenFace.

FaceNet was introduced in 2015 by Google researchers. It transforms a face into a point in 128-dimensional Euclidean space, similar to a word embedding. Once the FaceNet model has been trained with triplet loss on different classes of faces to capture the similarities and differences between them, the 128-dimensional embedding it returns can be used to cluster faces effectively. Once such a vector space (embedding) is created, tasks such as face recognition, verification, and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors. In this space, distances are small for similar faces and large for dissimilar faces.
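
For example, once the embeddings exist, clustering faces by identity reduces to running any standard clustering algorithm on 128-D vectors. The snippet below uses scikit-learn's KMeans on stand-in data purely as an illustration; the choice of algorithm and cluster count are assumptions, not what the paper used:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data: in practice these would be FaceNet outputs, one row per face.
embeddings = np.random.rand(100, 128)

# Group faces by identity; n_clusters=10 is an assumption about the data.
kmeans = KMeans(n_clusters=10, n_init=10).fit(embeddings)
print(kmeans.labels_)  # cluster id per face
```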

Every point in three-dimensional Euclidean space is determined by three coordinates [Source: Wikipedia]
FaceNet overall architecture [Source: from the paper]

The paper describes a network consisting of a batch input layer and a deep CNN, followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training.
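
The L2 normalization step scales every embedding to unit length, so comparisons depend only on the direction of the vector. In PyTorch this is a one-liner:

```python
import torch
import torch.nn.functional as F

raw = torch.randn(4, 128)                 # a batch of raw CNN outputs
embedding = F.normalize(raw, p=2, dim=1)  # L2-normalize each row
print(embedding.norm(dim=1))              # all ones: unit-length embeddings
```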

Triplet loss training

The triplet loss minimises the distance between an anchor and a positive, both of which have the same identity, and maximises the distance between the anchor and a negative of a different identity.
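
In the paper this is the hinge ||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 + alpha clipped at zero, where f is the embedding function and alpha is a margin hyperparameter. A minimal PyTorch sketch (the paper reports using a margin of 0.2):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Squared-distance triplet loss: pull the positive towards the
    anchor, push the negative at least `alpha` further away."""
    pos_dist = (anchor - positive).pow(2).sum(dim=1)
    neg_dist = (anchor - negative).pow(2).sum(dim=1)
    return F.relu(pos_dist - neg_dist + alpha).mean()
```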

According to the paper, the authors experimented with different types of architectures and explored their trade-offs in more detail in the experimental section.

Architectures they tested [Source: from the paper]

Their practical differences lie in the number of parameters and FLOPS. The best model may differ depending on the application.

Formula for finding the Euclidean distance between points p and q [Source: Wikipedia]

Once the FaceNet model is trained, we can create the embedding for a face by feeding it into the model. In order to compare two images, create the embedding for both by feeding each through the model separately. Then we can use the above formula to find the distance, which will be a low value for similar faces and a high value for different faces.
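
Concretely, the distance is d(p, q) = sqrt(sum_i (q_i - p_i)^2). A sketch of the comparison is below; the embed() function is a stand-in for a trained FaceNet model, and the 1.1 threshold is an illustrative assumption, not a value from the paper:

```python
import numpy as np

def euclidean_distance(p, q):
    """d(p, q) = sqrt(sum_i (q_i - p_i)^2) -- the formula above."""
    return np.sqrt(np.sum((np.asarray(q) - np.asarray(p)) ** 2))

# Stand-in for a trained FaceNet model; in practice this would return
# the 128-D embedding of the face in the given image.
def embed(image_path):
    return np.random.rand(128)

dist = euclidean_distance(embed("face1.jpg"), embed("face2.jpg"))
same_person = dist < 1.1  # threshold value is an illustrative assumption
print(dist, same_person)
```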

One-shot learning using FaceNet

If you think about it, the comparison we made between two images follows the Siamese network approach explained above. So we can say that this is a one-shot learning way of comparing two faces. Finally, we can conclude that, using the FaceNet model, we achieve one-shot learning.

I have created an example of face comparison using FaceNet and MTCNN to demonstrate one-shot learning. MTCNN is used to detect and align faces, whereas FaceNet is used to create the embeddings for the faces.
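
A rough sketch of that pipeline, using the mtcnn Python package for detection; the crop-and-resize step and the facenet_embed() call are assumptions standing in for the alignment and embedding code in the repos referenced below:

```python
import cv2
from mtcnn import MTCNN

detector = MTCNN()

def face_embedding(path):
    """Detect the first face in an image, crop it, and embed it."""
    image = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    faces = detector.detect_faces(image)  # list of {'box': [x, y, w, h], ...}
    x, y, w, h = faces[0]['box']
    face = cv2.resize(image[y:y + h, x:x + w], (160, 160))
    return facenet_embed(face)  # hypothetical trained FaceNet model call
```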

If you would like to see code in action, visit the Github repo.

If you like my write up, follow me on Github, Linkedin, and/or Medium profile.

References

  1. FaceNet: A Unified Embedding for Face Recognition and Clustering
  2. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks
  3. One-shot Learning with Memory-Augmented Neural Networks
  4. Matching Networks for One Shot Learning
  5. https://github.com/davidsandberg/facenet (I have used this repo’s code and simplified it for the purpose)
  6. https://github.com/wangbm/MTCNN-Tensorflow (I have used this repo’s code and simplified it for this purpose)
