Train FaceNet with triplet loss for real-time face recognition in Keras

Last year I completed Coursera's Deep Learning Specialization. In its courses I learned various state-of-the-art deep learning techniques. One of the greatest things about this specialization is its assignments: through them I got to see how advanced deep learning models are implemented. One such exciting assignment was `Face Recognition for the Happy House`. In that assignment I implemented the triplet loss function and a verify function, and then we just had to run those functions on a pre-trained FaceNet model to see that they work. There was no detail or code on how to train the model. So, a few months after completing the course, I decided to write a program to train the FaceNet model on our own dataset, and then use the trained model to recognize faces through a webcam.


In this post I will describe how to use the program that I have built. I will explain it in several parts, and I will reference my GitHub code throughout.

Note: This is not a state-of-the-art face recognition system. It requires some hyperparameter tuning to get acceptable results.

Code : https://github.com/sainimohit23/FaceNet-Real-Time-face-recognition

Dependencies:

  • Keras
  • OpenCV
  • TensorFlow

FaceNet and Triplet Loss:

FaceNet is a one-shot model that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

~ From FaceNet paper

FaceNet is a pre-trained CNN that embeds an input face image into a 128-dimensional vector encoding. It is trained on many images of the faces of different people.
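To make this concrete, producing an encoding looks roughly like the snippet below. This assumes `model` is a loaded FaceNet Keras model and `face` is a face crop already resized and normalized to the model's expected input (the exact input shape depends on the FaceNet variant you load):

```python
import numpy as np

face_batch = np.expand_dims(face, axis=0)  # add a batch dimension
embedding = model.predict(face_batch)[0]   # one 128-dimensional encoding
print(embedding.shape)                     # (128,)
```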

Although the model is pre-trained, it can still struggle to output usable encodings for unseen data. We want the model to generate encodings such that the distance between encodings of images of the same person is small, while the distance between encodings of different people is large.

To achieve this goal on our own images we will train the FaceNet model with the triplet loss function. The triplet loss function takes the face encodings of three images: anchor, positive and negative. Here the anchor and positive are images of the same person, whereas the negative is an image of a different person.

Formula of triplet loss:

$$\mathcal{L}(A, P, N) = \max\left( \lVert f(A) - f(P) \rVert^2 - \lVert f(A) - f(N) \rVert^2 + \alpha,\; 0 \right)$$

where $f$ is the embedding function, $A$, $P$ and $N$ are the anchor, positive and negative images, and $\alpha$ is the margin enforced between positive and negative pairs.
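In Keras this can be written as a custom loss function. The sketch below is a direct implementation of the formula above; the (batch, 3, 128) layout of `y_pred` is an assumption and may differ from how train_triplet.py arranges the encodings:

```python
import tensorflow as tf

def triplet_loss(y_true, y_pred, alpha=0.2):
    # y_pred is assumed to hold the anchor, positive and negative
    # encodings stacked together with shape (batch, 3, 128).
    anchor, positive, negative = y_pred[:, 0], y_pred[:, 1], y_pred[:, 2]
    # Squared L2 distances for the anchor-positive and anchor-negative pairs.
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # Hinge at the margin alpha, then sum over the batch.
    basic_loss = pos_dist - neg_dist + alpha
    return tf.reduce_sum(tf.maximum(basic_loss, 0.0))
```

The margin α forces the negative to sit at least α farther from the anchor than the positive does; triplets that already satisfy this contribute zero loss.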

Dataset:

Create a folder named images; this is where our training data will go. Inside it, create a separate folder for each person, and place each person's images in their respective folder. For example:

images
├── jack
│   ├── IMG1
│   ├── IMG2
│   └── ...
└── ben
    ├── IMG1
    ├── IMG2
    └── ...

The images may have different shapes, orientations, backgrounds, etc. To train the model we want all images to have the same size and to contain faces only. To prepare such training data we will use a face detection algorithm called Multi-task Cascaded Convolutional Networks (MTCNN). Use the script named align_dataset_mtcnn.py to align the faces. This code is taken from facenet.

python align_dataset_mtcnn.py SOURCE_PATH TARGET_PATH

In our case the source path is ./images/, and for the target I have used a folder named ./cropped/ in my code. The result is a folder with the same per-person structure, containing aligned, tightly cropped face images of uniform size.

Training:

For training I recommend that you create the anchor, positive and negative triplets manually. Watch this video to get an idea of how to create good triplets.

Since I am lazy af, I instead wrote a generator that produces random triplets. The file is named generator_utils.py.
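For reference, here is a minimal sketch of what such a random-triplet generator can look like. This is not the code from generator_utils.py; the image size, preprocessing and dummy labels are assumptions for illustration:

```python
import os
import random

import cv2
import numpy as np

def random_triplet_generator(data_dir, batch_size=16, image_size=160):
    """Yield random (anchor, positive, negative) batches from a directory
    that contains one sub-folder of face crops per person (e.g. ./cropped/)."""
    people = {
        person: [os.path.join(data_dir, person, f)
                 for f in os.listdir(os.path.join(data_dir, person))]
        for person in os.listdir(data_dir)
    }
    # Only people with at least two images can supply an anchor/positive pair;
    # we also assume the dataset contains at least two such people.
    valid = [p for p, imgs in people.items() if len(imgs) >= 2]

    def load(path):
        img = cv2.imread(path)
        img = cv2.resize(img, (image_size, image_size))
        return img.astype(np.float32) / 255.0  # assumed preprocessing

    while True:
        anchors, positives, negatives = [], [], []
        for _ in range(batch_size):
            person = random.choice(valid)
            a_path, p_path = random.sample(people[person], 2)
            other = random.choice([p for p in valid if p != person])
            n_path = random.choice(people[other])
            anchors.append(load(a_path))
            positives.append(load(p_path))
            negatives.append(load(n_path))
        # Dummy labels: the triplet loss ignores y_true.
        yield ([np.stack(anchors), np.stack(positives), np.stack(negatives)],
               np.zeros(batch_size))
```

Random triplets are easy to generate, but many of them will already satisfy the margin and contribute nothing to the loss, which is why hand-picked (or semi-hard mined) triplets usually train faster.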

Now run train_triplet.py to train the model. I have included two callbacks: the first is for early stopping, and the second is for TensorBoard, to track the training process. After training completes, the model is saved automatically.
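The callback wiring looks roughly like this. It is a hypothetical sketch, not the exact contents of train_triplet.py: the monitored quantity, patience, log directory and output filename are all assumptions, and `model` and `triplet_generator` are assumed to be defined as above:

```python
from keras.callbacks import EarlyStopping, TensorBoard

callbacks = [
    EarlyStopping(monitor='loss', patience=3),  # stop once the loss plateaus
    TensorBoard(log_dir='./logs'),              # view with: tensorboard --logdir ./logs
]

model.fit_generator(triplet_generator,
                    steps_per_epoch=100,
                    epochs=10,
                    callbacks=callbacks)
model.save('trained_facenet.h5')  # hypothetical output path
```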

Using the trained model with a webcam for real-time face recognition:

Run webcamFaceRecoMulti.py to recognize faces using a webcam. The pipeline I have built is very simple: I use OpenCV to read the webcam feed, and each frame is then passed through OpenCV's Haar cascade face detector, which finds all the faces and provides the coordinates of their bounding boxes.
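On its own, the detection step looks roughly like this (a minimal sketch using the frontal-face cascade that ships with OpenCV; webcamFaceRecoMulti.py layers the recognition logic on top of it):

```python
import cv2

# Load the Haar cascade bundled with the opencv-python package.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

cap = cv2.VideoCapture(0)  # default webcam
while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    for (x, y, w, h) in faces:
        # Each (x, y, w, h) box is a face crop to feed to FaceNet.
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow('faces', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
```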

The detected faces are then passed through our trained FaceNet model, which gives us a face encoding for each one. Each encoding is compared with the encodings of the people in our database, and verification is done using the verify(...) function.
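As a sketch of that comparison step (the actual verify(...) in the repo may use a different signature and threshold), nearest-neighbor matching against a database of known encodings looks like this:

```python
import numpy as np

def verify(encoding, database, threshold=0.7):
    """Return the closest known identity if its L2 distance to `encoding`
    is below `threshold`, else 'unknown'. The threshold value is an
    assumption and needs tuning for your model."""
    best_name, best_dist = None, float('inf')
    for name, db_encoding in database.items():
        dist = np.linalg.norm(encoding - db_encoding)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return (best_name, best_dist) if best_dist < threshold else ('unknown', best_dist)
```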

Figure: the real-time recognition pipeline.