Wav2Lip: A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild
This paper proposes Wav2Lip, a lip-sync model trained against an expert discriminator adapted from SyncNet, which outperforms all prior speaker-independent approaches to the task of video-audio lip-syncing.
The authors note that, while prior approaches typically fail to generalize when presented with video of speakers not present in the training set, Wav2Lip is capable of producing accurate lip movements with a variety of speakers.
They then summarize the primary aims of the paper:
- Identify why prior approaches fail to generalize to a variety of speakers.
- Resolve these issues by incorporating a powerful lip-sync discriminator.
- Propose new benchmarks and metrics for evaluating the performance of approaches to the lip-syncing task.
Introduction
The authors' first point regards the recent boom in the consumption of video and audio content. Alongside this growth, there is an increasing need for audio-video translation across languages in order to make such content accessible to a greater portion of the public. This creates significant motivation for applying machine learning to tasks such as automated lip-syncing of unconstrained video-audio content.
Unfortunately, earlier approaches commonly failed to generalize across speaker identities, performing well only on the small set of speakers that made up their training data.
Such approaches cannot meet the rigorous requirements of the practical application described above, where a suitable model must accurately sync videos of a wide variety of speakers.
To satisfy these demands, speaker-independent approaches have arisen, trained on thousands of speaker identities. However, even the speaker-independent models of prior publications fall short of the authors' expectations: while they can generate accurate lip-syncing for individual static images, they are not suited to dynamic content. For applications such as the translation of television series and films, an approach must generalize to the varying lip shapes of speakers in different unconstrained videos.
The authors cite that a human viewer can detect a lip-sync error of as little as 0.05–0.1 seconds, implying a broad challenge with a very fine margin for error.
The section concludes with a brief summary of the authors' contributions:
- They propose Wav2Lip, which significantly outperforms prior approaches.
- They introduce a new set of benchmarks/metrics for evaluating the performance of models in this task.
- They release their own dataset to evaluate the performance of their approach when presented with unseen video-audio content sampled from the wild.
- Wav2Lip is the first speaker-independent approach which frequently matches the accuracy of real synced videos; according to human evaluation, their approach is preferred to existing methods approximately 90% of the time.
- They improve the FID score (lower is better) for generating synchronous video frames for dubbed videos from 12.87 (LipGAN) to 11.84 (Wav2Lip + GAN), and raise the average user preference from 2.35% (LipGAN) to 60.2% (Wav2Lip + GAN).
Review of Existing Literature
(I'll limit this section to the authors' reasons for mentioning these papers, and leave out the in-depth details of the approaches themselves.)
The authors identify several gaps between prior approaches and the requirements of a fully practical, real-world system:
- Some methods require a large amount of training data.
- Models are limited in the extent of the vocabulary they learn.
- Training on datasets with a limited vocabulary impedes the ability of prior approaches to learn the wide variety of phoneme-viseme mappings.
They then argue that prior approaches commonly fail to generate accurate lip-syncing when presented with unseen video content from the wild for two reasons:
- Pixel-level Reconstruction Loss is a Weak Judge of Lip-sync: the loss functions used in prior works inadequately penalize inaccurate lip-sync generation, since the lip region occupies only a small fraction of each frame and therefore contributes little to a frame-wide reconstruction loss.
- A Weak Lip-sync Discriminator: The discriminator in the LipGAN model architecture only has a 56% accuracy at detecting off-sync video-audio content, while the discriminator of Wav2Lip is 91% accurate at distinguishing in-sync content from off-sync content on the same test set.
A Lip-sync Expert Is All You Need
Finally, the authors propose their approach, which takes both of the above issues in prior works into consideration:
- Use a pre-trained lip-sync discriminator that is already accurate in detecting out-of-sync video-audio content in raw, unconstrained samples.
- Adapt the previously existing SyncNet model for this task. (I won't go into depth about this; rather, I'll focus on the Wav2Lip architecture.)
Overview of the Wav2Lip Model Architecture
Terminology
The authors use the following terms to refer to the various sections of their network, which I will continue to use following this section:
- Random reference segment: A randomly sampled segment of consecutive frames of the speaker, giving the network visual context about that particular speaker's identity.
- Identity encoder: Encodes the concatenation of the ground truth frames and a random reference segment, providing visual context for the network to adapt appropriately to any particular speaker.
- Speech encoder: Encodes the audio data (self-explanatory).
- Face decoder: Decodes the concatenated feature vectors into a series of reconstructed frames.
Methodology
At a high level, Wav2Lip takes as input a mel-spectrogram representation of a particular audio segment alongside a concatenation of the corresponding ground-truth frames (with the bottom half masked) and a random reference segment whose speaker matches that of the ground-truth segment. It reduces these inputs via convolutional layers to form one feature vector for the audio and one for the frames. It then concatenates these feature representations and projects the result onto a segment of reconstructed frames through a series of transposed convolutional layers, with residual skip connections between layers of the identity encoder and the face decoder.
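To make the data flow concrete, below is a minimal PyTorch sketch of this encoder–fusion–decoder structure. The layer counts, channel sizes, and the 96×96 face / 80-mel input shapes are illustrative assumptions, and the sketch operates on a single frame rather than a window of frames; only the overall pattern (identity encoder, speech encoder, fusion at the bottleneck, face decoder with skip connections) follows the description above.

```python
import torch
import torch.nn as nn

class Wav2LipGeneratorSketch(nn.Module):
    """Illustrative sketch of the Wav2Lip generator data flow (not the authors' exact network)."""
    def __init__(self):
        super().__init__()
        # Identity encoder: consumes 6 channels (masked ground-truth frame
        # concatenated channel-wise with a reference frame) and downsamples twice.
        self.enc1 = nn.Sequential(nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        # Speech encoder: reduces a mel-spectrogram window to a single embedding.
        self.speech_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Face decoder: transposed convolutions upsample back to image resolution,
        # with a skip connection from the identity encoder concatenated in between.
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(128, 32, 4, stride=2, padding=1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU())
        self.out = nn.Sequential(nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, frames, mel):
        # frames: (B, 6, 96, 96) masked ground truth + reference; mel: (B, 1, 80, 16)
        s1 = self.enc1(frames)                            # (B, 32, 48, 48)
        s2 = self.enc2(s1)                                # (B, 64, 24, 24)
        a = self.speech_encoder(mel)                      # (B, 64, 1, 1)
        a = a.expand(-1, -1, s2.size(2), s2.size(3))      # broadcast over spatial dims
        x = torch.cat([s2, a], dim=1)                     # fuse audio and visual features
        x = self.dec1(x)                                  # (B, 32, 48, 48)
        x = torch.cat([x, s1], dim=1)                     # skip connection from identity encoder
        x = self.dec2(x)                                  # (B, 32, 96, 96)
        return self.out(x)                                # (B, 3, 96, 96) reconstructed frame


model = Wav2LipGeneratorSketch()
frames, mel = torch.rand(2, 6, 96, 96), torch.rand(2, 1, 80, 16)
print(model(frames, mel).shape)  # torch.Size([2, 3, 96, 96])
```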
Wav2Lip attempts to fully reconstruct the ground truth frames from their masked copies. We compute L1 reconstruction loss between the reconstructed frames and the ground truth frames. Then, the reconstructed frames are fed through a pretrained “expert” lip-sync detector, while both the reconstructed frames and ground truth frames are fed through the Visual Quality Discriminator. The Visual Quality Discriminator attempts to distinguish between reconstructed frames and ground truth frames to promote the visual generation quality of the frame generator.
Loss Functions
The Generator
The generator aims to minimize the L1 loss between the reconstructed frames $L_g$ and the ground truth frames $L_G$:
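Following the paper's notation, this can be written as:

$$L_{recon} = \frac{1}{N}\sum_{i=1}^{N} \lVert L_g - L_G \rVert_1$$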
where $N$ denotes the batch size.
The Lip-Sync Discriminator
For lip-syncing, they use cosine similarity with binary cross-entropy loss, computing the probability that a given video segment and audio segment are in sync. More specifically, the similarity is computed between the ReLU-activated video and speech embeddings $v$ and $s$, resulting in one probability per sample that indicates how likely the corresponding pair is to be in sync.
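With $v$ and $s$ the ReLU-activated video and speech embeddings, the sync probability is their cosine similarity (the small constant $\epsilon$ in the denominator guards against division by zero):

$$P_{sync} = \frac{v \cdot s}{\max(\lVert v \rVert_2 \cdot \lVert s \rVert_2,\ \epsilon)}$$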
where the ReLU activation applied may be described as:
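$$\mathrm{ReLU}(x) = \max(0, x)$$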
The full expert discriminator loss is computed by taking the cross-entropy of the distribution $P_{sync}$ as follows:
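$$E_{sync} = \frac{1}{N}\sum_{i=1}^{N} -\log\left(P^{i}_{sync}\right)$$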
The Visual-Quality Discriminator
The Visual-Quality Discriminator is trained to maximize the following loss:
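In standard GAN notation, with $D(x)$ the discriminator's estimate that frame $x$ is a real (ground-truth) frame rather than a generated one, this objective can be written as:

$$L_{disc} = \mathbb{E}_{x \sim L_G}\left[\log D(x)\right] + \mathbb{E}_{x \sim L_g}\left[\log\left(1 - D(x)\right)\right]$$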
where the generator loss $L_{gen}$ is formulated as follows:
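$$L_{gen} = \mathbb{E}_{x \sim L_g}\left[\log\left(1 - D(x)\right)\right]$$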
Accordingly, the generator attempts to minimize the weighted sum of the reconstruction loss, the synchronization loss, and the adversarial loss (recall that we are dealing with two discriminators):
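$$L_{total} = (1 - s_w - s_g)\, L_{recon} + s_w\, E_{sync} + s_g\, L_{gen}$$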
where $s_w$ is a weight controlling the penalty attributed to synchronization, and $s_g$ is the weight on the adversarial loss.
These two disjoint discriminators allow the network to achieve superior synchronization accuracy and visual generation quality.
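Putting the losses together, here is a hypothetical PyTorch sketch of how the generator's full objective might be assembled. The helpers `expert_sync_net` (returning ReLU-activated video and speech embeddings) and `quality_disc` (returning the probability that a frame is real), as well as the specific weight values, are assumptions for illustration; only the loss structure mirrors the equations above.

```python
import torch
import torch.nn.functional as F

def generator_total_loss(generated, ground_truth, mel,
                         expert_sync_net, quality_disc,
                         s_w=0.03, s_g=0.07, eps=1e-8):
    # Pixel-level L1 reconstruction loss between generated and ground-truth frames.
    l_recon = F.l1_loss(generated, ground_truth)

    # Expert sync loss: cosine similarity between ReLU-activated video and speech
    # embeddings is treated as P_sync, penalized with BCE against the "in sync" target.
    v, s = expert_sync_net(generated, mel)                    # (B, D) embeddings each
    p_sync = (v * s).sum(dim=1) / (v.norm(dim=1) * s.norm(dim=1)).clamp(min=eps)
    e_sync = F.binary_cross_entropy(p_sync.clamp(eps, 1 - eps),
                                    torch.ones_like(p_sync))

    # Adversarial term from the visual-quality discriminator (D(x) = prob. of real).
    d_fake = quality_disc(generated)
    l_gen = torch.log((1.0 - d_fake).clamp(min=eps)).mean()

    # Weighted sum minimized by the generator; the weights here are illustrative.
    return (1 - s_w - s_g) * l_recon + s_w * e_sync + s_g * l_gen
```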
Conclusion and Further Reading
That's it for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild". If you'd like to read about the paper in more depth, or about the few things I didn't cover here:
- The proposed metric/evaluation system
- Benchmark comparisons between Wav2Lip and prior models
- Detailed training procedure and hyperparameters used by the authors
- Real-world evaluation of Wav2Lip
you can investigate further by reading the paper: https://arxiv.org/pdf/2008.10010v1.pdf
Model architecture image credit to the authors of “A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild”.