Textless NLP: the future of speech generation

Egor Voron
Published in Product AI
2 min read · Oct 12, 2021

Solution from:
Facebook AI

Goal:

Modern NLP models such as BERT or GPT-3 do an excellent job of generating realistic texts that are sometimes difficult to distinguish from those written by a human. However, these models require very large text datasets for training and therefore cannot be used for languages in which such datasets do not exist. In addition, training such models on colloquial speech (podcasts, audiobooks, etc.) requires automatic speech recognition (ASR) models, which usually demand large computing resources and can introduce a considerable number of errors. The task at hand is to create an NLP model that generates new spoken language directly from an audio recording, without relying on text, while solving the problems described above.

The complexity of creating such a solution lies in:

  1. Colloquial speech, which does not follow the formal syntactic rules that normally simplify automatic text processing.
  2. The need to recognize and preserve pronunciation features, or “prosody.”
  3. The need to choose metrics for evaluating the result.

Solution:

GSLM (Generative Spoken Language Model) is the first high-performance NLP model that works directly with audio signals and requires no text input. The baseline consists of three parts: an encoder (S2u) that translates the audio recording into sequences of “discrete units”; a language model (uLM) that generates new sequences of such units; and a decoder (u2S) that translates the generated sequences back into speech. This model is trained on “raw” audio data.
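
The sketch below illustrates how the three stages can be composed. The class and function names are hypothetical, not the released API, and the toy stand-ins at the bottom exist only so the skeleton runs end to end without pretrained models.

from dataclasses import dataclass
from typing import Callable, List

import numpy as np

# A waveform is a 1-D float array; discrete units are integer IDs.
Waveform = np.ndarray
Units = List[int]

@dataclass
class GSLMPipeline:
    speech_to_units: Callable[[Waveform], Units]    # S2u: encoder + quantizer
    unit_language_model: Callable[[Units], Units]   # uLM: autoregressive continuation
    units_to_speech: Callable[[Units], Waveform]    # u2S: unit-based synthesizer

    def continue_speech(self, prompt: Waveform) -> Waveform:
        """Encode a spoken prompt, generate a continuation in unit space,
        and synthesize the result back to audio."""
        prompt_units = self.speech_to_units(prompt)
        generated_units = self.unit_language_model(prompt_units)
        return self.units_to_speech(generated_units)

# Toy stand-ins so the skeleton runs without any pretrained models.
pipeline = GSLMPipeline(
    speech_to_units=lambda wav: list(np.digitize(wav[::160], np.linspace(-1, 1, 100))),
    unit_language_model=lambda units: units + units[: len(units) // 2],  # dummy continuation
    units_to_speech=lambda units: np.asarray(units, dtype=np.float32) / 100.0,
)
prompt = np.random.uniform(-1, 1, 16000).astype(np.float32)  # 1 s of audio at 16 kHz
print(len(pipeline.continue_speech(prompt)))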

To evaluate the results, the generated audio is transcribed into text using pre-trained ASR models. Two metrics are then computed on the resulting text: PER (phone error rate), the rate of phoneme recognition errors, and AUC (area under the curve), which assesses linguistic quality and diversity.
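
As an illustration, PER reduces to an edit-distance computation over phoneme sequences. The sketch below assumes the reference and ASR-derived phonemes are already available as lists of symbols; the ASR model and phonemizer themselves are outside its scope.

from typing import Sequence

def phone_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Substitutions + insertions + deletions, divided by the reference length."""
    # Classic dynamic-programming Levenshtein edit distance.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_ph in enumerate(hypothesis, start=1):
            cost = 0 if ref_ph == hyp_ph else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / max(len(reference), 1)

# Example: reference vs. an ASR-derived hypothesis with one substituted phoneme.
print(phone_error_rate(["HH", "AH", "L", "OW"], ["HH", "EH", "L", "OW"]))  # 0.25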

Such a model cannot capture the expressive aspects of pronunciation (tone, accent, and prosody), since discrete units, like phonemes, store no information about them. To solve this problem, a vector-quantized variational autoencoder (VQ-VAE) is used: the pitch track is fed into it along with the discrete units described above. To train the final model, which combines generating the content itself with its prosody, a transformer was developed that takes as input the generated discrete units, their durations, and the quantized pitch. Like the baseline, the final model is trained on raw audio from the LibriSpeech audiobook collection.
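
A simplified sketch of how the prosody-aware inputs can be assembled: run-length encode the unit stream into (unit, duration) pairs and attach a quantized pitch value per unit. In the actual model the pitch codebook is learned by the VQ-VAE; the uniform binning below is only an illustrative stand-in.

from itertools import groupby
from typing import List, Tuple

import numpy as np

def encode_prosody(units: List[int], pitch_hz: np.ndarray, n_bins: int = 32
                   ) -> List[Tuple[int, int, int]]:
    """Return (unit_id, duration_in_frames, pitch_bin) triplets."""
    triplets = []
    frame = 0
    for unit, run in groupby(units):
        duration = len(list(run))
        # Mean pitch over the frames covered by this unit (0 Hz = unvoiced).
        segment = pitch_hz[frame:frame + duration]
        mean_f0 = float(segment.mean()) if len(segment) else 0.0
        # Uniform bins up to 400 Hz stand in for the learned VQ-VAE codebook.
        pitch_bin = int(np.clip(mean_f0 / 400.0 * n_bins, 0, n_bins - 1))
        triplets.append((unit, duration, pitch_bin))
        frame += duration
    return triplets

# Example: eight frames of repeated units with a synthetic pitch contour.
units = [5, 5, 5, 12, 12, 7, 7, 7]
pitch = np.array([120, 125, 130, 180, 185, 0, 0, 0], dtype=np.float32)
print(encode_prosody(units, pitch))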

Technologies used:

Programming languages: Python

Encoders: CPC, wav2vec 2.0, HuBERT

Decoder: Tacotron 2
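
For instance, frame-level HuBERT features can be clustered into discrete units roughly as follows. This assumes the Hugging Face transformers checkpoint "facebook/hubert-base-ls960"; the official release ships its own fairseq encoder and a pretrained k-means quantizer, and fitting k-means on a single random waveform here is purely for demonstration.

import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import HubertModel

# Assumed public checkpoint; the GSLM release itself uses fairseq components.
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

# Four seconds of placeholder 16 kHz audio; replace with a real waveform.
waveform = torch.from_numpy(
    np.random.uniform(-1, 1, 4 * 16000).astype(np.float32)
).unsqueeze(0)  # shape: (batch, samples)

with torch.no_grad():
    features = model(waveform).last_hidden_state.squeeze(0).numpy()  # (frames, 768)

# Cluster frame-level features into pseudo-phonetic units (50 clusters here,
# fit on this single utterance only for illustration; the paper trains the
# quantizer on a large corpus).
units = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(features)
print(units[:20])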
