We test FaceNet’s performance on a set of unconventional faces and show how to stitch it with a memory module.
Disclaimer: this blog post and the whole project was done “for fun” — it doesn’t really serve any purpose other than sharing something we think is a cool project. From the Reader’s perspective, there are two main merits coming from this post:
1. Seeing how FaceNet can be used to recognize people (or, rather, animated characters) in a one-shot manner;
2. Learning how to stitch FaceNet with another “building block”, namely: with the memory module proposed by Kaiser et al.
We’ve tried to keep this blog post accessible to the casual Reader, which means that we’re not going through the implementation of the models, the memory module, or any other (overly) technical details. We hoped to capture the Reader’s attention and, if that succeeded, provide them with all the implementation details needed to re-create the experiments. So keep that in mind, and check out the implementation if you find this project interesting.
OK, so FaceNet is incredibly good at face recognition of real human beings. That’s great and all, but is it capable of recognizing faces across domains: CGI, painted, drawn, etc.? We asked ourselves this critical question, and the short answer is:
Imagine that you’re organizing a venue in a VR world (similar to that in “Ready Player One” or in another book adaptation: “The Congress”) where each participant chooses an avatar. Let’s say that, for whatever reason, you would like to register, and then recognize the participants based on their animated faces using some standard face recognition model.
This might be tricky since most face recognition models are trained on actual (human) faces, so even if the task of recognizing avatars is easy for us (humans), it might be hard for a model. The source of the problem is the difference between the domain the model was trained on and the new, animated domain in which we would like to use the model. For more on this topic we refer the Reader to this blog post, but for the purpose of this project it’s only important to note that it’s hard to know beforehand how a face recognition model will perform on animated faces.
But as it turned out, we didn’t have to do any domain adaptation: FaceNet performs pretty well without any additional work.
FaceNet is our face recognition model of choice. As mentioned, it performs very well on real faces (reaching an accuracy of 0.99650 on the Labeled Faces in the Wild dataset).
FaceNet builds on the Inception ResNet v1 architecture and was trained on the CASIA-WebFace and VGGFace2 datasets. It’s worth noting that FaceNet’s weights are optimized using the triplet loss function, so that it learns to embed facial images into a 128-dimensional sphere. Images whose embeddings are close in this space are expected to correspond to similar faces. If you want to learn more, we refer you to the paper and if you really, really want to learn more, check out its TensorFlow implementation.
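To make the geometry concrete, here is a minimal NumPy sketch of the triplet loss idea. The `embed` stand-in and the margin `alpha` are our own illustrative choices, not FaceNet’s actual training code:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss on unit-norm embeddings: push the positive (same
    identity) closer to the anchor than the negative, by a margin."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(pos_dist - neg_dist + alpha, 0.0)

def embed(x):
    """Stand-in for FaceNet: project a vector onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

a = embed(np.array([1.0, 0.0, 0.0]))
p = embed(np.array([0.9, 0.1, 0.0]))  # similar "face" -> already separated
n = embed(np.array([0.0, 1.0, 0.0]))  # dissimilar "face"
loss = triplet_loss(a, p, n)          # zero: the margin is satisfied
```

Minimizing this loss over many (anchor, positive, negative) triplets is what makes same-identity embeddings cluster together on the sphere.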
We won’t go into details concerning FaceNet’s architecture, but for the purpose of the discussion below, it’s important to highlight one particular nuance of the Inception architecture. Namely, it has a train phase flag (`phase_train`), which controls whether Dropout and BatchNorm are turned ON (`phase_train=True`) or OFF (`phase_train=False`). Dropout is not that big of a deal, but for those of you who haven’t yet fallen into the pitfall of re-using a model with BatchNorm, here’s a good reference. We’ll see that in our case it’s better to turn the train phase OFF.
The memory module
We’ll use the memory module proposed in the Learning to Remember Rare Events paper by Kaiser et al. Roughly, the memory is composed of three elements: 1) a matrix with rows corresponding to embeddings of observations, 2) a vector storing the labels of the observations stored in this matrix, and 3) a vector of ages of those observations.
Point 1) is not entirely true: a single row may also result from combining two (or more) similar observations, which is done auto-magically by the memory module. This will be relevant later on, when we argue that one of the merits of the memory module is that it requires storing fewer embeddings, and thus less memory.
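In code, the three components could be laid out roughly like this (a NumPy sketch with hypothetical names; the real module from Kaiser et al. adds more bookkeeping):

```python
import numpy as np

np.random.seed(0)
memory_size, key_dim = 32, 128  # 32 slots, FaceNet's 128-d embeddings

# 1) keys: one row per (possibly merged) embedding, kept on the unit sphere
keys = np.random.randn(memory_size, key_dim)
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

# 2) labels of the observations stored in the rows of `keys`
values = np.zeros(memory_size, dtype=np.int64)

# 3) ages of those observations, used to pick which slot to overwrite
ages = np.zeros(memory_size, dtype=np.int64)
```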
Methods and Data
Our dataset is tiny, comprising 12 characters, each with 8 distinct images (drawn, painted, etc.), which yields 96 images. Here they are:
You might have noticed that the 12 characters are grouped into 6 pairs of similar faces, i.e.:
* Geralt vs. Vesemir (two witchers, as they were “portrayed” in The Witcher 3);
* Malczewski vs. Witkacy (two influential Polish artists);
* Heroine vs. hero of the Avatar movie (we didn’t really remember their names, so we named them ourselves);
* Gollum vs. Smeagol (from the movie adaptation of the Lord of the Rings);
* Durotan vs. Hulk (from the Warcraft movie, and the Marvel Cinematic Universe);
* Thade vs. Cesar (from two versions of the Planet of the Apes; Thade from Tim Burton’s adaptation and Cesar from the reboot series).
Note that Gollum and Smeagol are actually alter egos of the same character, so we expected the model to struggle with these two. On the other hand, humans are pretty much able to discern between the two just by looking at their expressions, so why not put FaceNet to the test?
Another note: Witkacy was represented by 4 painted/drawn pictures and 4 photographs. We expected this to be a problem, but as we’ll see in a minute, FaceNet was resilient to this obstacle.
We used the `20170512-110547` model. We also tested a newer model (`20180408-102900`) but got better results with the older one.
FaceNet was used to produce embeddings for the 96 images, and if we take the dot product between these embeddings, we can summarize the similarity between the images with a matrix (row and column indices are grouped by character):
Because the embeddings are normalized, the dot product is equivalent to the cosine similarity. The dark blocks along the diagonal represent high similarities between faces coming from the same character. But notice also that Gollum and Smeagol together yield an almost homogenous, dark block, which is in line with what we anticipated: these two will pose a challenge to the model.
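The matrix above can be computed in a couple of lines; in this sketch, random vectors stand in for the actual FaceNet embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for FaceNet's output: 96 embeddings, L2-normalized
embeddings = rng.normal(size=(96, 128))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# For unit vectors, the dot product IS the cosine similarity
similarity = embeddings @ embeddings.T
```

Each entry is the cosine similarity between two images; the diagonal is all ones, since every image is maximally similar to itself.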
In an upcoming blog post we’ll go through the implementation of the memory module, but here we’ll only mention several of its most important features. First: there is a “bookkeeping” aspect of the memory which is not driven by gradient propagation, but rather works in an “insert-update-delete” fashion. Second: the gradient of the loss function is propagated to the model providing the embedding (FaceNet, in our case). And lastly: the prediction is done by finding in the memory matrix the best matching observation (in a 1-nearest-neighbor fashion, using the cosine distance), and returning its corresponding label.
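The prediction step can be sketched as follows (a toy two-slot memory; the names are ours, not the paper’s):

```python
import numpy as np

def memory_predict(query, keys, values):
    """Return the label of the best-matching memory row (1-nearest
    neighbor under cosine similarity; rows and query are unit vectors)."""
    similarities = keys @ query          # cosine similarity per memory row
    return values[int(np.argmax(similarities))]

# Tiny example: two slots with labels 0 ("geralt") and 1 ("vesemir")
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
values = np.array([0, 1])
query = np.array([0.8, 0.6])             # unit norm: 0.64 + 0.36 = 1
label = memory_predict(query, keys, values)  # closest to slot 0
```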
K-shot learning and evaluation
We iteratively trained and evaluated the model. In the first iteration, the model was trained on 12 images, one face per character. Next, the model was evaluated on the remainder of the images. In the second iteration, the model was fed another batch of 12 images, then evaluated on the rest, and so on, and so forth. Actually, there’s also a 0-th iteration, where the model starts off completely blind, without seeing any of the characters, which, as expected, results in an accuracy of about 1/12 = 8.33%.
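The train/eval alternation boils down to index bookkeeping, roughly like this (assuming the images are ordered character by character, 8 per character; the 0-th, “blind” iteration is omitted):

```python
def k_shot_splits(n_chars=12, n_shots=8):
    """For each iteration, yield (train, test) indices: one new image
    per character goes to training, all not-yet-seen images to testing."""
    for shot in range(n_shots - 1):
        train = [c * n_shots + shot for c in range(n_chars)]
        test = [i for i in range(n_chars * n_shots)
                if i % n_shots > shot]  # images not fed to the model yet
        yield train, test

splits = list(k_shot_splits())  # 7 train/eval rounds for 12 x 8 images
```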
We will compare two models, both based on the embeddings produced by FaceNet. The first one will utilize the memory module and will be called “FaceNet + memory”. The second one will be simpler: it will store all embeddings and their corresponding labels, and produce a prediction in a 1-nearest-neighbor fashion. We will call this the “FaceNet alone” model.
The “FaceNet alone” model may at first seem remarkably similar to the “FaceNet + memory” model. But it differs in two respects: 1) this model won’t modify the embedding model (i.e. FaceNet), and 2) because it stores all previous embeddings, it requires more, well, memory. The memory module uses a fixed-size matrix to store the embeddings, and our default number of rows was 32 (less than the number of images). Note that, because the “FaceNet + memory” model had a cap on its (RAM) memory, its task was a bit harder. Let’s see how that influenced the performance.
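A minimal sketch of the “FaceNet alone” baseline (our own toy implementation, operating on precomputed unit-norm embeddings):

```python
import numpy as np

class FaceNetAlone:
    """Store every embedding + label; predict by 1-nearest neighbor
    (cosine similarity on unit-norm embeddings). Unlike the memory
    module, storage grows with every observation seen."""
    def __init__(self):
        self.embeddings = []
        self.labels = []

    def fit(self, embeddings, labels):
        self.embeddings.extend(embeddings)
        self.labels.extend(labels)

    def predict(self, query):
        sims = np.array(self.embeddings) @ query
        return self.labels[int(np.argmax(sims))]

model = FaceNetAlone()
model.fit([np.array([1.0, 0.0]), np.array([0.0, 1.0])],
          ["geralt", "vesemir"])
pred = model.predict(np.array([0.9, np.sqrt(1 - 0.81)]))  # nearer "geralt"
```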
Finally, let’s see the k-shot accuracies for the “FaceNet alone” and the “FaceNet + memory” models:
That’s what we initially saw. But we didn’t expect the memory module to be that much worse, so we asked ourselves: what are we doing wrong?
And it’s not that we had a bug in the code, but rather that we forgot to pay attention when applying transfer learning with a pre-trained model that uses BatchNorm. Once we turned the train phase OFF, the results were as follows:
Consequently, FaceNet’s weights are updated without the correction for internal covariate shift (partially amended by BatchNorm), and without the ensembling provided by Dropout. Is this OK? Can it be done better? We are open to suggestions so if you have one — please, leave a comment.
To see what kind of mistakes the models made, we compiled two confusion matrices:
The rows of the confusion matrix represent actual labels, and the columns denote predictions. By default, the 0-shot prediction of the “FaceNet alone” model is a uniform random choice between the labels of all characters. On the other hand, the 0-shot prediction given by “FaceNet + memory” is “geralt” (class 0) for every image, which corresponds to a black column in the confusion matrix.
For details regarding the implementation, the default values of hyper-parameters, and everything else, we refer the Reader to the accompanying repo.
We will discuss the “FaceNet alone” solution first, and then move on to “FaceNet + memory”.
This approach is doing a pretty good job, with 83% accuracy right off the bat, after seeing each character just once. It even reaches 100% accuracy, which means that it was able to distinguish (among others) between Gollum and Smeagol.
From the evolution of the confusion matrix we see that the model struggles with discerning between Gollum and Smeagol, but also has problems with another pair: Hulk and Durotan. Still, that’s pretty much it; the rest of the character pairs are no match for this simple 1-nearest-neighbor approach based on FaceNet’s embeddings.
FaceNet + memory
Obviously, we would really like to sell this method like this:
but the Reader will benefit more from a self-critical approach.
The accuracy is lower than that of “FaceNet alone” but bear in mind that there’s a bit of variability in the results. If we were really fussy, we would repeat the experiment many times, bootstrapping (or rather: shuffling) the images for each character. Also, we might have spent more time tweaking the hyper-parameters. But we left that as an exercise for the Reader.
The evolution of the confusion matrix indicates that, again, Gollum and Smeagol are a problem, but more surprisingly the model mistakes Durotan (an orc, mind you) for Vesemir (a witcher). Yet, in contrast to “FaceNet alone”, starting with the 3rd shot there are no misclassifications between Durotan and Hulk. This is weird because, from what we saw in the similarity matrix, Durotan has relatively low similarity with Vesemir (at least in comparison with Hulk). We therefore suspect that this is an artifact of this particular ordering of the input images.
Now, what can be said for sure is that “FaceNet + memory” required less memory than the “FaceNet alone” model. Roughly three times less memory. But if each character occurred more frequently, the gain would have been even higher. Going a bit further in that direction: the “FaceNet + memory” is prepared for a life-long learning path, batteries included, whereas the “FaceNet alone” would require unlimited memory.
And as the learning path of “FaceNet + memory” progresses, the model adapts to the domain in which it’s being applied. Our dataset is too small to make use of this property, but it might come in handy in a more demanding environment.
There are, however, two limitations of the memory module that we wish to address in the future:
1. The prediction is made based only on the closest observation, which might increase the variance of the model (in the sense of the bias/variance tradeoff);
2. During training, a row in the memory matrix may be updated by combining two embeddings: the one already in the memory and another, coming from a new but similar observation. This “combining” boils down to adding the two embeddings. However, this way we lose information about the distribution of the embeddings.
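For illustration, the merge described in point 2 amounts to summing and re-normalizing; a sketch of the update from the Kaiser et al. paper:

```python
import numpy as np

def merge_key(stored_key, query):
    """Kaiser et al.'s update on a correct match: add the new embedding
    to the stored one and project back onto the unit sphere. Only the
    normalized sum survives -- the spread of the merged observations
    is lost, which is the limitation noted above."""
    merged = stored_key + query
    return merged / np.linalg.norm(merged)

k = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
new_k = merge_key(k, q)  # points halfway between the two, unit norm
```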
We evaluated two approaches to k-shot face detection based on embeddings acquired with the FaceNet model. The first one was based on a memory module proposed by Kaiser et al., the second was a simple 1-nearest-neighbor approach.
Our dataset was composed of 96 facial images of 12 (mainly) CGI characters from movies, video games, and art — domains which FaceNet was not originally intended for.
Both models reach an accuracy of 100% after “seeing” each character 7 times, although we stressed that this result may be noisy. Still, it shows that the embedding learned by FaceNet is robust and that it can be transferred across domains, e.g. for face recognition based on portraits drawn by police sketch artists.
If you would like to learn more about the memory module, make sure to check out the paper and the implementation. We’re also working on a blog post in which we go through the implementation using the eager execution in TensorFlow, so stay tuned!
- Kaiser, Ł., Nachum, O., Roy, A., Bengio, S. (2017). Learning to Remember Rare Events. https://arxiv.org/abs/1703.03129
If you enjoyed this post, please hit the clap button below and follow our publication for more interesting articles about ML & AI.