GANimation — Facial Animation

Asha Vishwanathan · Published in Analytics Vidhya · Sep 3, 2020

GANimation was an interesting paper that came out in 2018, in which the authors successfully changed facial expressions by controlling the facial muscle movements that define a human expression.

Check out the images from the paper, where they demonstrate some examples.

What are Action Units?

Ekman and Friesen developed the Facial Action Coding System (FACS) for describing facial expressions. It breaks facial expressions down into individual components of muscle movement, called Action Units (AUs). For example, fear is given by a combination of Inner Brow Raiser (AU1), Outer Brow Raiser (AU2), Brow Lowerer (AU4), Upper Lid Raiser (AU5), Lid Tightener (AU7), Lip Stretcher (AU20) and Jaw Drop (AU26).

Image Courtesy: https://inc.ucsd.edu/mplab/grants/project1/research/face-detection.html

Generating the Action Units

We can generate facial Action Units using projects like the OpenFace toolkit. OpenFace takes in face images and produces the Action Unit vector. It uses MTCNN (Multi-task Cascaded Convolutional Networks), a state-of-the-art face detector, to detect and crop the faces, and linear-kernel Support Vector Regression to estimate the AU intensities.

Tadas Baltrusaitis has very generously open sourced the OpenFace toolkit implementation and provides a convenient Docker image to generate the Action Units for each face.

A sample AU vector generated for a face looks like this:

[0.3 , 0.19, 0. , 0.02, 0. , 1.73, 0.56, 0.96, 0. , 0. , 0.03, 0. , 0.63, 0. , 0.75, 2.11, 0. ]

A value of 0 indicates an absence of that AU, while a positive number indicates the magnitude of activation.
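As a quick illustration, here is one way to pull such a vector out of the CSV that OpenFace writes: a minimal sketch assuming the standard OpenFace output format, where intensity columns are named AU01_r, AU02_r, ...; the file path is a placeholder.

```python
import pandas as pd

# Read the CSV that OpenFace writes for a processed image (placeholder path).
df = pd.read_csv("openface_output.csv")
df.columns = df.columns.str.strip()       # older OpenFace versions pad column names with spaces

# Intensity columns end in _r; presence/absence columns end in _c.
au_cols = [c for c in df.columns if c.startswith("AU") and c.endswith("_r")]
au_vector = df.loc[0, au_cols].to_numpy() # 17-dimensional AU intensity vector for the first row

print(au_cols)
print(au_vector)
```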

GANimation Approach

GANimation is a model for synthetic facial animation that works by controlling these Action Units. It is a GAN architecture in which the model is conditioned on a one-dimensional vector indicating the presence/absence and magnitude of each Action Unit. What is great is that training requires only images paired with their Action Unit vectors.

Given an input image Iy(r) with Action Unit vector y(r), the idea is to learn a mapping M that translates it into an output image Iy(g) with Action Unit vector y(g).

There are two components to the Generator: G1, which transforms an image Iy(r) to Iy(g), and G2, which transforms Iy(g) back to a predicted Iy(r). The Discriminator has a D_I, which evaluates an image on its photorealism, and a D_y, which estimates the Action Unit vector y(g)_predicted from the generated image Iy(g).

Image Courtesy: GANimation paper

The training is done on triplets { Iy(r), y(r), y(g) }, where the target Action Unit vector y(g) is randomly generated. This makes the generation of training data very simple. Given a set of face images, all that is required is a close crop of each face and its Action Units.

Generator

I(o): input image of dimensions H x W x 3 with Action Unit vector y(o)
y(o): Action Unit encoding of the input/original expression (1-d vector of size N)
y(f): Action Unit encoding of the output/desired expression (1-d vector of size N)
I(f): output image conditioned on y(f)

The image is concatenated with the Action Unit vector y(o) to get an input of size H x W x (3 + N), where y(o) of size N is expanded into N feature maps of size H x W.
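A minimal PyTorch sketch of this conditioning step (tensor and function names are my own):

```python
import torch

def condition_on_aus(image, au_vector):
    """Concatenate an image (B, 3, H, W) with an AU vector (B, N)
    expanded to (B, N, H, W), giving a (B, 3 + N, H, W) generator input."""
    B, _, H, W = image.shape
    au_maps = au_vector.view(B, -1, 1, 1).expand(-1, -1, H, W)
    return torch.cat([image, au_maps], dim=1)

x = condition_on_aus(torch.randn(4, 3, 128, 128), torch.rand(4, 17))
print(x.shape)  # torch.Size([4, 20, 128, 128])
```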

The Generator regresses two outputs: an attention mask A (H x W) and an RGB color transformation C (H x W x 3).

The attention mask A dictates how much of each pixel in C contributes to the output image I(f).

I(f) = (1 − A) · C + A · I(o)

If Aij is 1, the corresponding pixel of I(o) is copied directly to the output image, whereas when Aij is 0, the color transformation C is used to generate that pixel of the output image.
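In code, the composition is just a pixel-wise blend between the color regression C and the original image, weighted by the attention mask (a sketch, with the shapes defined above):

```python
import torch

def compose(color, attention, original):
    """Blend the color regression C (B, 3, H, W) with the input image I_o (B, 3, H, W)
    using the single-channel attention mask A (B, 1, H, W)."""
    return (1.0 - attention) * color + attention * original

I_f = compose(torch.rand(1, 3, 128, 128), torch.rand(1, 1, 128, 128), torch.rand(1, 3, 128, 128))
```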

The Generator architecture is adapted from the paper “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks” (CycleGAN), which uses the two-Generator setup, with some slight modifications.

The network contains three convolutions, several residual blocks, two fractionally-strided convolutions with stride 1/2, and one convolution that maps features to RGB. It uses 6 blocks for 128 × 128 images and 9 blocks for 256 × 256 and higher-resolution training images. It also uses instance normalization instead of batch normalization.
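Here is a compact PyTorch sketch written from that description (layer choices and names are my own simplification, not the authors' released code): an encoder of three convolutions, a stack of residual blocks, two transposed convolutions for upsampling, and the two output heads for the color map C and the attention mask A.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    def __init__(self, n_aus=17, base=64, n_res=6):
        super().__init__()
        layers = [nn.Conv2d(3 + n_aus, base, 7, padding=3), nn.InstanceNorm2d(base), nn.ReLU(inplace=True)]
        # two strided convolutions for downsampling
        layers += [nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.InstanceNorm2d(base * 4), nn.ReLU(inplace=True)]
        layers += [ResBlock(base * 4) for _ in range(n_res)]
        # two fractionally-strided (transposed) convolutions for upsampling
        layers += [nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1), nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True)]
        layers += [nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.InstanceNorm2d(base), nn.ReLU(inplace=True)]
        self.backbone = nn.Sequential(*layers)
        self.color_head = nn.Sequential(nn.Conv2d(base, 3, 7, padding=3), nn.Tanh())    # C: H x W x 3
        self.attn_head = nn.Sequential(nn.Conv2d(base, 1, 7, padding=3), nn.Sigmoid())  # A: H x W

    def forward(self, x):
        feats = self.backbone(x)
        return self.color_head(feats), self.attn_head(feats)
```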

Discriminator

The Discriminator has two components: D_I and D_y.

D_I resembles a PatchGAN. It only penalizes structure at the scale of image patches: it tries to classify whether each N x N patch is real or fake, is run convolutionally across the image, and averages the responses to get the final output.

D_y is a regression head which estimates the AU Activations.
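The two heads can share one convolutional trunk. A rough PyTorch sketch follows (layer sizes and names are my own choices, not the released code): the patch head produces per-patch real/fake scores, while the AU head regresses the Action Unit vector.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, n_aus=17, img_size=128, base=64, n_layers=6):
        super().__init__()
        layers, ch = [], 3
        for i in range(n_layers):
            layers += [nn.Conv2d(ch, base * 2 ** i, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True)]
            ch = base * 2 ** i
        self.trunk = nn.Sequential(*layers)
        self.patch_head = nn.Conv2d(ch, 1, 3, padding=1)                # D_I: per-patch real/fake scores
        self.au_head = nn.Conv2d(ch, n_aus, img_size // 2 ** n_layers)  # D_y: AU regression over the remaining map

    def forward(self, x):
        feats = self.trunk(x)
        return self.patch_head(feats), self.au_head(feats).view(x.size(0), -1)
```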

Loss Function

Let's look at the loss function. It is a combination of four losses.

1. Image adversarial loss: This is the adversarial loss using the Earth Mover (Wasserstein) distance, based on WGAN-GP. The original GAN used a loss derived from the Jensen-Shannon divergence, which is not continuous everywhere and can saturate. Instead, WGAN-GP uses the Wasserstein loss, whose critic outputs scores that are not bounded between 0 and 1.

Critic objective: D(x) − D(G(x))
Generator objective: D(G(x))

where D(x) is the critic score for a real image and D(G(x)) is the critic score for a generated image. The critic tries to maximize its objective by maximizing the difference, while the generator tries to maximize its own objective. In the actual implementation the signs are reversed and both terms are minimized.

2. Attention loss: This is a mix of two kinds of regularization. The first is a Total Variation regularization, which enforces pixel-to-pixel smoothing of the mask. The second is an L2 regularization to keep the attention weights from saturating: if the attention weights saturate towards 1 in the attention equation above, the output image simply mirrors the input and the generator has no effect.
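A small sketch of these two regularizers on the attention mask A (shape B x 1 x H x W); the relative weighting and the exact norm here are my own hyper-parameter choices, not values from the paper:

```python
import torch

def attention_loss(A, lambda_tv=1e-5):
    # Total variation: penalize differences between neighbouring pixels for a smooth mask.
    tv = ((A[:, :, 1:, :] - A[:, :, :-1, :]) ** 2).sum() + \
         ((A[:, :, :, 1:] - A[:, :, :, :-1]) ** 2).sum()
    # L2 term: keep the mask from saturating towards 1 everywhere.
    l2 = (A ** 2).mean()
    return lambda_tv * tv + l2
```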

3. Conditional expression loss: This is the Action Unit loss. For a given target AU vector y(f), we expect the Generator to produce an image with the corresponding changed expression. One component of this loss is the regression loss between the AUs predicted on the generated image and the target AU vector y(f). The other component is the regression loss between the AUs predicted on the original image and its AU vector y(o).
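In code this is simply a regression penalty on the two AU predictions (mean squared error here, as one reasonable choice):

```python
import torch
import torch.nn.functional as F

def expression_loss(au_pred_fake, y_f, au_pred_real, y_o):
    # AU regression on the generated image vs. the desired AU vector y(f),
    # plus AU regression on the original image vs. its own AU vector y(o).
    return F.mse_loss(au_pred_fake, y_f) + F.mse_loss(au_pred_real, y_o)
```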

4. Identity loss: We have two Generator passes, one which transforms an image I(o) to I(f), and the other which transforms I(f) back to I(o)_predicted. This makes it possible to measure the information lost in the round trip, which is also called a cycle consistency loss. It is the loss between I(o) and I(o)_predicted.

The final loss function is a combination of all 4 losses.

Loss = Image Adversarial loss + λy * Conditional Expression Loss + λa * (Attention loss with Generator 1 + Attention loss with Generator 2) + λidt * Identity Loss
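Assembled in code, with the loss helpers sketched earlier and the lambda weights left as hyper-parameters:

```python
def total_generator_loss(adv, expr, attn_g1, attn_g2, idt, lambda_y, lambda_a, lambda_idt):
    # Weighted sum of the four loss components described above;
    # the lambda values are hyper-parameters to tune.
    return adv + lambda_y * expr + lambda_a * (attn_g1 + attn_g2) + lambda_idt * idt
```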

The losses are finally framed as the typical GAN minimax problem: the Generator minimizes its loss while the Discriminator maximizes its own. In the classic formulation, the Discriminator seeks to maximize the probability of assigning the correct label to both real samples and samples from the Generator:

Maximize log D(x) + log(1 − D(G(z)))

where x is a real sample and G(z) is a generated sample.

Evaluating GANimation on Custom Images

To evaluate it on a custom image set, curate a set of target face images plus a test image, and generate their Action Units. A close crop of all face images is required; I used MTCNN for this (see the sketch below). Given the test image, its Action Unit vector, and the Action Unit vector extracted from a target image in the custom set, the model produces the final output with the changed expression.
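Here is a minimal sketch of the cropping step using the facenet-pytorch MTCNN implementation (one possible choice of package and parameters, with placeholder file paths):

```python
from PIL import Image
from facenet_pytorch import MTCNN

# MTCNN face detector; image_size controls the side of the square crop.
mtcnn = MTCNN(image_size=128, margin=20)

img = Image.open("test_face.jpg")                          # placeholder input path
cropped = mtcnn(img, save_path="test_face_cropped.jpg")    # tensor crop, also written to disk
```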

I tried it on a kind of surly photo of mine to generate the changed expressions :) The model takes the test image and changes it to replicate the expression of the target image. Here the first image in each row is the provided test image and the last image in the row is the target image. The change in expression can be seen as the magnitude of the target Action Units is increased incrementally.

Hope you liked the article! Do let me know in the comments.

References:

1. GANimation: Anatomically-aware Facial Animation from a Single Image. Albert Pumarola, Antonio Agudo, Aleix M. Martinez, Alberto Sanfeliu, Francesc Moreno-Noguer. https://arxiv.org/pdf/1807.09251.pdf

2. Facial Action Coding System. https://www.paulekman.com/facial-action-coding-system/

3. Generative Adversarial Nets. https://arxiv.org/pdf/1406.2661.pdf

4. Wasserstein GAN. Arjovsky, M., Chintala, S., Bottou, L. https://arxiv.org/pdf/1701.07875.pdf

5. Improved Training of Wasserstein GANs. https://arxiv.org/pdf/1704.00028.pdf

6. OpenFace 2.0: Facial Behavior Analysis Toolkit.
http://multicomp.cs.cmu.edu/wp-content/uploads/2018/11/OpenFace.pdf
https://github.com/TadasBaltrusaitis/OpenFace

7. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. https://arxiv.org/abs/1703.10593

8. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. J. Johnson, A. Alahi, L. Fei-Fei. https://arxiv.org/pdf/1603.08155.pdf

9. Image-to-Image Translation with Conditional Adversarial Networks. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A. https://arxiv.org/pdf/1611.07004.pdf

10. https://github.com/albertpumarola/GANimation

11. https://github.com/donydchen/ganimation_replicate
