Voice Synthesis for in-the-Wild Speakers via a Phonological Loop

In this article, I walk you through the design of the neural network and explain the attention mechanism it uses, which I have not found explained clearly elsewhere.

VoiceLoop [1] is a neural text-to-speech (TTS) system that is simpler than other state-of-the-art solutions, both in the number of learned parameters and in the architecture of the neural network. The authors study the task of mimicking a person’s voice based on samples captured in the wild, such as public speeches with substantial background noise.
The system can model multiple speakers and learn from multi-speaker data. The authors also introduce a novel ‘shifting buffer’ memory approach.

The Architecture of the model

The model consists of three feed-forward networks: the attention network Na, the buffer update network Nu, and the output network No. Each has only one hidden layer.

1. Encoding of the input sentence

The sentence must first be converted into its phonetic representation of length L using the CMU pronouncing dictionary. Each phoneme is then mapped (using a lookup table) onto an encoding, which is essentially an embedding vector of size dp that is trained along with the rest of the neural network. This results in an encoding matrix E of size dp × L, where dp is the size of the encoding and L is the sequence length.
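The lookup described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `TOY_DICT` is a hypothetical two-word stand-in for the CMU pronouncing dictionary, the embedding values are random rather than trained, and the names `build_embedding_table` and `encode` are mine, not the paper's.

```python
import random

# Hypothetical toy pronunciation table; the real system uses the much larger
# CMU pronouncing dictionary.
TOY_DICT = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}


def build_embedding_table(phonemes, d_p, seed=0):
    """One d_p-dimensional vector per phoneme (random here, trained in practice)."""
    rng = random.Random(seed)
    return {p: [rng.uniform(-1.0, 1.0) for _ in range(d_p)] for p in sorted(phonemes)}


def encode(sentence, table):
    """Return the phoneme sequence and the encoding matrix E (d_p rows, L columns)."""
    phonemes = [p for word in sentence.lower().split() for p in TOY_DICT[word]]
    vectors = [table[p] for p in phonemes]        # L vectors of size d_p
    E = [list(row) for row in zip(*vectors)]      # transpose to d_p x L
    return phonemes, E


d_p = 3
all_phonemes = {p for ps in TOY_DICT.values() for p in ps}
table = build_embedding_table(all_phonemes, d_p)
phonemes, E = encode("hello world", table)
print(phonemes)
print(len(E), "x", len(E[0]))  # d_p x L
```

Note that the phoneme "L" appears twice in the sentence and both occurrences map to the same column values, which is exactly what a lookup table implies.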

Encoding of Speaker ID

The speaker ID is also mapped (using a lookup table) onto a speaker embedding Z, which is likewise trained by the neural network.

2. Computing the attention

The attention is computed using Graves's Gaussian mixture model (GMM) based monotonic attention mechanism.

(Sidetrack) What is a GMM?
A GMM of n components is a weighted mixture of n Gaussian distributions, each with its own mean and variance. The mixture can have several modes (i.e. more than one “peak” in the distribution of the data), so it can model data that a single Gaussian distribution cannot fit. [2][3]
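To make the "more than one peak" idea concrete, here is a minimal sketch of a one-dimensional GMM density. The weights, means, and variances are arbitrary illustrative values, not anything taken from VoiceLoop.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a single Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def gmm_pdf(x, weights, means, variances):
    """Weighted sum of the component densities; the weights should sum to 1."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


# Two equally weighted components centred at -2 and +2 (illustrative values).
weights, means, variances = [0.5, 0.5], [-2.0, 2.0], [0.5, 0.5]
for x in (-2.0, 0.0, 2.0):
    print(x, round(gmm_pdf(x, weights, means, variances), 4))
```

The density is higher at each component mean than at the midpoint between them, which is the two-peak shape a single Gaussian cannot produce.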

At each timestep t, Na takes the buffer from the previous timestep as input and outputs the GMM priors (γt), the log-variances (βt), and the shifts of the means (κt). Each of these is a vector of dimension c, which is the number of components of the GMM.

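We apply a softmax to the priors γt so that they sum to one, and later use them as the mixture weights when summing the components to obtain the attention. We then increase the means of the GMM by a positive shift:

    μt = μt−1 + exp(κt)

and the variances are computed as

    σt² = exp(βt)

Because exp(κt) is always positive, the means can only move forward along the sequence, which is what makes the attention monotonic. For each GMM component 1 ≤ i ≤ c and each point along the input sequence 1 ≤ j ≤ L, we then compute:

    φ[i, j] = γt[i] / √(2π σt²[i]) · exp(−(j − μt[i])² / (2 σt²[i]))

which is the equation for the normal distribution, times its weight obtained from the softmax on the priors. The attention weights αt at time t are computed for each location in the sequence by summing along all c components:

    αt[j] = φ[1, j] + φ[2, j] + … + φ[c, j]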
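The attention step can be sketched in plain Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the mean update is taken to be the previous mean plus exp of the shift, the three input vectors stand in for the outputs of Na, and `context_vector` assumes the context is the attention-weighted sum of the columns of E. All numeric values are made up.

```python
import math


def softmax(xs):
    """Numerically stable softmax."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gmm_attention(gamma_raw, beta, kappa, mu_prev, seq_len):
    """One attention step; returns the weights over positions 1..seq_len and the new means."""
    gamma = softmax(gamma_raw)                                 # mixture weights
    mu = [m + math.exp(k) for m, k in zip(mu_prev, kappa)]     # assumed update: means only move forward
    var = [math.exp(b) for b in beta]                          # variances from log-variances
    alpha = []
    for j in range(1, seq_len + 1):
        # Sum over components of weight * normal density evaluated at position j.
        alpha.append(sum(
            g / math.sqrt(2.0 * math.pi * v) * math.exp(-((j - m) ** 2) / (2.0 * v))
            for g, m, v in zip(gamma, mu, var)
        ))
    return alpha, mu


def context_vector(E, alpha):
    """Assumed definition: attention-weighted sum of the columns of E (d_p x L)."""
    return [sum(e * a for e, a in zip(row, alpha)) for row in E]


E = [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 2.0, 3.0, 4.0]]     # toy 2 x 5 encoding
mu = [0.0, 0.0]
for t in range(3):
    # In the real model these three vectors come from Na at every timestep.
    alpha, mu = gmm_attention([0.2, -0.1], [-1.0, 0.0], [0.0, -0.5], mu, seq_len=5)
    print(t, [round(m, 3) for m in mu], [round(a, 3) for a in alpha])
context = context_vector(E, alpha)
print([round(c, 3) for c in context])
```

Running the loop shows the means, and with them the bulk of the attention, drifting from the start of the sequence towards the end as t grows.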

3. Updating the buffer

At each timestep, a new buffer frame is generated using Nu, which takes as input the buffer from the previous timestep, the context vector calculated using the attention (the attention-weighted sum of the columns of E), and the previous output. The new frame is added to the buffer in a FIFO (first in, first out) manner: it enters at one end and the oldest frame is dropped from the other. We achieve speaker dependence by adding a projection of the speaker embedding Z computed earlier to the new buffer frame.
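A rough sketch of the buffer shift, assuming the buffer is stored as a list of k frames with the newest first. `toy_N_u` is a hypothetical stand-in for the trained network Nu, `F_u` is an assumed name for the speaker projection matrix, and all sizes and values are made up for illustration.

```python
def matvec(M, v):
    """Matrix-vector product on plain lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def update_buffer(buffer, context, prev_output, z, N_u, F_u):
    """buffer: list of k frames (newest first), each a list of d floats."""
    flat_buffer = [value for frame in buffer for value in frame]
    u = N_u(flat_buffer + context + prev_output)       # new frame of size d
    u = [a + b for a, b in zip(u, matvec(F_u, z))]     # add the speaker projection
    return [u] + buffer[:-1]                           # FIFO: push new frame, drop the oldest


def toy_N_u(x):
    """Hypothetical stand-in for the trained network Nu (returns d = 2 values)."""
    total = sum(x)
    return [total, -total]


d, k = 2, 3
buffer = [[0.0] * d for _ in range(k)]
F_u = [[1.0, 0.0], [0.0, 1.0]]     # assumed speaker projection (identity here)
z = [0.5, 0.25]                    # toy speaker embedding
new_buffer = update_buffer(buffer, [1.0, 2.0], [0.5], z, toy_N_u, F_u)
print(new_buffer)
```

The buffer keeps a fixed size k, so every call discards exactly one old frame.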

4. Generate Output

The output is generated using No, which takes the entire buffer as input. As with the buffer update, a projection of the speaker embedding Z is added to the result.
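As a sketch of this step, the snippet below passes the flattened buffer through a one-hidden-layer network and adds a speaker projection. The ReLU activation, the random weights, the layer sizes, and the names `F_o` and `generate_output` are assumptions made for illustration; a trained model would have learned all of these parameters.

```python
import random


def linear(W, b, x):
    """Affine map W x + b on plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def one_hidden_layer_net(x, W1, b1, W2, b2):
    """Feed-forward network with a single hidden layer (ReLU assumed)."""
    hidden = [max(0.0, h) for h in linear(W1, b1, x)]
    return linear(W2, b2, hidden)


def generate_output(buffer, z, params, F_o):
    """Run No on the flattened buffer, then add the projection of the speaker embedding."""
    flat_buffer = [value for frame in buffer for value in frame]
    out = one_hidden_layer_net(flat_buffer, *params)
    projection = linear(F_o, [0.0] * len(out), z)
    return [o + p for o, p in zip(out, projection)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, k, hidden_size, d_o, d_z = 2, 3, 4, 5, 2       # toy sizes
params = (rand_matrix(hidden_size, d * k, rng), [0.0] * hidden_size,
          rand_matrix(d_o, hidden_size, rng), [0.0] * d_o)
F_o = rand_matrix(d_o, d_z, rng)                   # assumed speaker projection for the output
buffer = [[0.1 * (i + j) for j in range(d)] for i in range(k)]
frame = generate_output(buffer, [0.5, 0.25], params, F_o)
print(len(frame), [round(v, 3) for v in frame])
```

With the same buffer, changing only the speaker embedding changes the output frame, which is how a single network can speak in several voices.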

Output of the network

The outputs of the network are vocoder features, which produce sound when fed into a vocoder. The paper uses the WORLD vocoder.

Acknowledgement and References

I thank the authors, Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya Nachmani, and the FAIR team for such an excellent contribution towards a solution for TTS.


[1] https://brilliant.org/wiki/gaussian-mixture-model/

