Samsung AI Makes the Mona Lisa ‘Speak’

Synced · Published in SyncedReview · 4 min read · May 23, 2019

Imagine the lips forming the Mona Lisa’s famous smile parting as she begins “speaking” to you. This is not sci-fi fantasy or 3D face animation; it’s an effect achieved by researchers from the Samsung AI Center and the Skolkovo Institute of Science and Technology, who used adversarial learning to generate a photorealistic talking head model.

AI techniques have already been used to generate realistic videos of people such as former US President Barack Obama and movie star Scarlett Johansson, enabled in large part by the abundance of visual data available on these individuals. The new research, however, shows it is also possible to generate realistic content when source images are scarce. The researchers applied their Few-Shot Adversarial Learning technique to one of the most widely recognized faces in history, known through a single image: Lisa Gherardini, the subject of Leonardo da Vinci’s classic 16th century portrait.

The new Few-Shot Adversarial Learning method is trained on existing talking head datasets. The model extracts facial landmarks from the video sequences in these datasets, maps those landmarks to photorealistic frames of the target person (for example the Mona Lisa), and then combines the frames to synthesize a video or GIF in which the target appears to speak.
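To make the pipeline concrete, here is a minimal inference sketch of the landmarks-to-frames-to-GIF flow described above. The embedder, generator, landmark extractor and rasterizer are hypothetical callables standing in for the paper’s trained networks and tooling; this is an illustration of the idea, not the authors’ code.

```python
import imageio
import numpy as np
import torch


def animate(target_images, driving_frames, embedder, generator,
            extract_landmarks, rasterize, out_path="talking_head.gif"):
    """Render a talking-head GIF of the target identity (sketch).

    All models and helpers are passed in as callables because they are
    placeholders for the paper's trained networks and landmark tooling.
    """
    # 1. Embed the (few) available images of the target person and average
    #    them into a single identity vector.
    with torch.no_grad():
        embedding = embedder(target_images).mean(dim=0, keepdim=True)

    frames = []
    for frame in driving_frames:
        # 2. Landmarks from the driving frame define the pose/expression;
        #    the generator renders the *target* face in that pose.
        landmark_image = rasterize(extract_landmarks(frame))
        with torch.no_grad():
            synthesized = generator(landmark_image, embedding)
        # Convert a CHW tensor in [-1, 1] to an HWC uint8 image.
        img = (synthesized.squeeze(0).permute(1, 2, 0).cpu().numpy() + 1) * 127.5
        frames.append(img.clip(0, 255).astype(np.uint8))

    # 3. Stitch the synthesized frames into a GIF.
    imageio.mimsave(out_path, frames)
```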

Talking head image synthesis

Even one-shot adversarial learning is possible, as shown in the Mona Lisa experiment and the images below. Of course, more training frames still produce higher realism.

Comparison of multi-shot training results

This few-shot learning superpower, however, does not come easily: extensive pretraining (meta-learning) on a large corpus of talking head videos is required.
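The sketch below illustrates the few-shot adaptation stage that follows this meta-learning pretraining: given only K frames of a new person, the identity embedding is averaged over those frames and the generator and discriminator are briefly fine-tuned on them. The modules E, G and D are hypothetical stand-ins for the pretrained networks, and a simple L1 reconstruction term stands in for the paper’s perceptual loss.

```python
import torch


def few_shot_finetune(E, G, D, frames, landmark_images, steps=40, lr=5e-5):
    """Adapt meta-learned networks to a new person from K frames (sketch)."""
    with torch.no_grad():
        # Average per-frame embeddings into one identity embedding.
        e_new = E(frames, landmark_images).mean(dim=0, keepdim=True)

    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    l1 = torch.nn.L1Loss()

    for _ in range(steps):
        idx = torch.randint(len(frames), (1,)).item()
        x, y = frames[idx:idx + 1], landmark_images[idx:idx + 1]

        # Generator step: reconstruct the frame and fool the discriminator.
        x_hat = G(y, e_new)
        loss_g = l1(x_hat, x) - D(x_hat, y).mean()
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()

        # Discriminator step: hinge loss on real vs. generated frames.
        score_real = D(x, y)
        score_fake = D(x_hat.detach(), y)
        loss_d = (torch.relu(1 - score_real).mean()
                  + torch.relu(1 + score_fake).mean())
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

    return G, e_new
```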

Meta-learning process

As illustrated above, meta-learning begins with an embedder network that maps head images to embedding vectors. These embeddings are then used to predict the generator’s adaptive parameters. The generator, with its parameters updated, maps input face landmarks to output frames through a set of convolutional layers. Finally, an objective function combining perceptual and adversarial losses (the latter implemented via a conditional projection discriminator) compares the resulting image with the ground truth image.
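The following rough sketch shows how such an objective could be assembled for one meta-learning example. The Embedder E, Generator G and Discriminator D are hypothetical stand-ins; the perceptual term approximates the paper’s content loss with a slice of torchvision’s VGG19, and the discriminator is conditioned directly on the embedding rather than on the paper’s learned per-person projection vectors (its matching loss is omitted).

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Frozen VGG19 feature extractor for the perceptual (content) loss.
# (ImageNet normalization of the inputs is omitted for brevity.)
_vgg = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)


def perceptual_loss(x_hat, x):
    """L1 distance between VGG19 activations of generated and real frames."""
    return F.l1_loss(_vgg(x_hat), _vgg(x))


def meta_learning_losses(E, G, D, x_src, y_src, x_tgt, y_tgt):
    """One training example: source frames embed the person,
    a held-out target frame supervises the generator."""
    # Embedder: average embeddings of the K source frames (with landmarks).
    e_hat = E(x_src, y_src).mean(dim=0, keepdim=True)

    # Generator: render the target pose (landmarks y_tgt) for this identity.
    x_hat = G(y_tgt, e_hat)

    # Generator objective: content (perceptual) + adversarial hinge terms.
    loss_g = perceptual_loss(x_hat, x_tgt) - D(x_hat, y_tgt, e_hat).mean()

    # Discriminator objective: hinge loss on real vs. generated frames,
    # with generator/embedder outputs detached.
    score_real = D(x_tgt, y_tgt, e_hat.detach())
    score_fake = D(x_hat.detach(), y_tgt, e_hat.detach())
    loss_d = (F.relu(1 - score_real).mean()
              + F.relu(1 + score_fake).mean())
    return loss_g, loss_d
```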

Two talking head video datasets (VoxCeleb1 and VoxCeleb2) were used to evaluate the model. Quantitative comparisons of different methods under several few-shot settings, along with the corresponding generated results for both datasets, are shown below.

Quantitative comparison of different methods
Source and ground truth images and the generated results comparing methods on the VoxCeleb1 dataset
Source and ground truth images and generated results comparing methods on the VoxCeleb2 dataset

For more details, see the paper Few-Shot Adversarial Learning of Realistic Neural Talking Head Models on arXiv.

Author: Hecate He | Editor: Michael Sarazen, Tony Peng

2018 Fortune Global 500 Public Company AI Adaptivity Report is out!
Purchase a Kindle-formatted report on Amazon.
Apply for Insight Partner Program to get a complimentary full PDF report.

Follow us on Twitter @Synced_Global for daily AI news!

We know you don’t want to miss any stories. Subscribe to our popular Synced Global AI Weekly to get weekly AI updates.
