Voice Synthesis for in-the-Wild Speakers via a Phonological Loop
In this article, I walk you through the design of the neural network and explain the attention mechanism it uses, which I have not found explained clearly elsewhere.
VoiceLoop is a neural text-to-speech (TTS) system that is simpler than other state-of-the-art solutions, both in the number of learned parameters and in the architecture of the network. Its authors study the task of mimicking a person’s voice based on samples captured in the wild, such as public speeches with substantial background noise.
The model handles multiple speakers and learns from multi-speaker data. The authors also introduce a novel ‘shifting buffer’ memory approach.
The Architecture of the model
The model consists of three feed-forward networks: the attention network Na, the buffer update network Nu, and the output network No. Each has only one hidden layer.
1. Encoding of the input sentence
The sentence must first be converted into its phonetic representation of length l using the CMU pronouncing dictionary. Each phoneme is then mapped (using a lookup table) onto an encoding, which is essentially an embedding vector of size dp that is trained by the neural network. This results in an encoding matrix E of size dp × l, where dp is the size of the encoding and l is the sequence length.
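The encoding step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: `TOY_DICT` is a hypothetical two-word stand-in for the CMU pronouncing dictionary, and the lookup table is filled with random values where the real model would hold learned embeddings.

```python
import random

# Hypothetical toy pronunciation table; the real system uses the CMU pronouncing dictionary.
TOY_DICT = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def to_phonemes(sentence):
    """Concatenate the phoneme sequences of every word in the sentence."""
    phonemes = []
    for word in sentence.lower().split():
        phonemes.extend(TOY_DICT[word])
    return phonemes

def make_lookup_table(symbols, d_p, seed=0):
    """One d_p-dimensional vector per phoneme (random here, learned during training)."""
    rng = random.Random(seed)
    return {s: [rng.gauss(0.0, 1.0) for _ in range(d_p)] for s in sorted(symbols)}

def encode(sentence, table):
    """Return E as d_p rows of length l, i.e. one column per phoneme."""
    columns = [table[p] for p in to_phonemes(sentence)]
    return [list(row) for row in zip(*columns)]

d_p = 4
symbols = {p for seq in TOY_DICT.values() for p in seq}
table = make_lookup_table(symbols, d_p)
E = encode("hello world", table)
print(len(E), len(E[0]))  # d_p = 4 rows, l = 8 columns
```

Because the mapping is a pure lookup, the same phoneme always yields the same column of E: here the two "L" phonemes (columns 3 and 7) are identical.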
Encoding of Speaker ID
The speaker ID is also mapped (using a lookup table) onto a speaker embedding Z, which is likewise trained by the neural network.
2. Computing the attention
The attention is computed using Graves's Gaussian mixture model (GMM) based monotonic attention mechanism.
(Sidetrack) What is a GMM?
A GMM of n components is a weighted mixture of n Gaussian distributions, each with its own mean and variance. Such a mixture can have several modes, so it can model data with more than one “peak” in its distribution, which a single Gaussian cannot fit.
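To make the idea of multiple modes concrete, here is a small sketch that evaluates the density of a two-component GMM. The weights, means, and variances are arbitrary values chosen for illustration.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a single Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def gmm_pdf(x, weights, means, variances):
    """Density of the mixture: a weighted sum of the component densities."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

weights = [0.5, 0.5]      # mixing weights, must sum to 1
means = [-2.0, 2.0]
variances = [0.5, 0.5]

for x in (-2.0, 0.0, 2.0):
    print(x, gmm_pdf(x, weights, means, variances))
```

The density is high near both means and low at the midpoint between them, which is the two-peaked shape a single Gaussian cannot reproduce.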
At each timestep, Na takes the buffer from the previous timestep as input and outputs the GMM priors (γ), the log variances (β), and the shifts of the means (κ). Each of these is a vector of dimension c, which is equal to the number of components of the GMM.
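We apply a softmax to the priors and use the result as the weights in the summation that yields the attention:

$$\hat{\gamma}_t = \mathrm{softmax}(\gamma_t)$$

We then increase the means of the GMM. The shift is taken as exp(κt), as in Graves's formulation, so it is always positive and the attention can only move forward along the sequence:

$$\mu_t = \mu_{t-1} + \exp(\kappa_t)$$

and the variances are computed as σt² = exp(βt). For each GMM component 1 ≤ i ≤ c and each point along the input sequence 1 ≤ j ≤ l, we then compute:

$$\phi_t[i, j] = \frac{\hat{\gamma}_t[i]}{\sqrt{2\pi\,\sigma_t^2[i]}} \exp\left(-\frac{(j - \mu_t[i])^2}{2\,\sigma_t^2[i]}\right)$$

which is the equation for the normal distribution, times its weight obtained from the softmax on the prior. The attention weights αt at time t are computed for each location in the sequence by summing along all c components:

$$\alpha_t[j] = \sum_{i=1}^{c} \phi_t[i, j]$$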
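One attention step can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the mean shift is taken as exp(κ), following Graves's formulation, and the context vector is assumed to be the attention-weighted sum of the columns of E. In the real model, γ, β, and κ come from Na; here they are fixed toy values.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gmm_attention(gamma_raw, beta, kappa, mu_prev, seq_len):
    """One attention step; gamma_raw, beta, kappa, mu_prev are lists of length c."""
    gamma = softmax(gamma_raw)                                # mixture weights
    mu = [m + math.exp(k) for m, k in zip(mu_prev, kappa)]    # means only move forward
    var = [math.exp(b) for b in beta]                         # variances
    alpha = []
    for j in range(1, seq_len + 1):                           # positions 1..l
        alpha.append(sum(
            g / math.sqrt(2.0 * math.pi * v) * math.exp(-((j - m) ** 2) / (2.0 * v))
            for g, m, v in zip(gamma, mu, var)))
    return alpha, mu

def context_vector(E, alpha):
    """Attention-weighted sum of the columns of E (d_p rows, l columns)."""
    return [sum(e * a for e, a in zip(row, alpha)) for row in E]

# Toy run: c = 2 components, l = 5 positions, constant network outputs.
mu = [0.0, 0.0]
gamma_raw, beta, kappa = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
E = [[1.0, 2.0, 3.0, 4.0, 5.0],
     [5.0, 4.0, 3.0, 2.0, 1.0]]
peaks = []
for t in range(3):
    alpha, mu = gmm_attention(gamma_raw, beta, kappa, mu, seq_len=5)
    peaks.append(alpha.index(max(alpha)) + 1)  # position with the largest weight
    context = context_vector(E, alpha)
print(peaks)  # → [1, 2, 3]
```

With κ = 0 every mean advances by exp(0) = 1 per step, so the peak of the attention moves one position to the right at each timestep and never goes backwards.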
3. Updating the buffer
At each timestep, a new frame is generated using Nu, which takes as input the buffer from the previous timestep, the context vector calculated using the attention, and the previous output. The new frame is added to the buffer in a FIFO (first in, first out) manner: it enters at the front and the oldest frame is discarded. We achieve speaker dependence by adding a projection of the speaker embedding Z computed earlier to the new buffer frame.
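The shifting buffer can be sketched with a fixed-length `deque`. Everything numeric here is an assumption for illustration: the dimensions are toy values, the weights are random rather than trained, the hidden layer uses ReLU, the inputs to Nu are simply concatenated, and `F_u` is a hypothetical name for the speaker projection matrix.

```python
import random
from collections import deque

rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def one_hidden_layer(W1, W2, x):
    """Feed-forward network with a single hidden layer (ReLU assumed)."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

d, k = 3, 4                # frame size and number of frames in the buffer (toy values)
d_p, d_o, d_z = 2, 2, 2    # context, output, and speaker-embedding sizes (toy values)
n_hidden = 5

W1 = random_matrix(n_hidden, d * k + d_p + d_o)  # stand-in for the weights of Nu
W2 = random_matrix(d, n_hidden)
F_u = random_matrix(d, d_z)                      # hypothetical speaker projection

def update_buffer(buffer, context, prev_output, z):
    """Compute the new frame and push it in; the oldest frame falls out."""
    flat = [v for frame in buffer for v in frame]
    u = one_hidden_layer(W1, W2, flat + context + prev_output)
    u = [a + b for a, b in zip(u, matvec(F_u, z))]  # add the speaker projection
    buffer.appendleft(u)  # deque(maxlen=k) discards the frame at the other end
    return buffer

buffer = deque([[0.0] * d for _ in range(k)], maxlen=k)
update_buffer(buffer, context=[0.1, 0.2], prev_output=[0.0, 0.0], z=[1.0, -1.0])
print(len(buffer), buffer[0])
```

The buffer length stays at k after every update, so the model always sees a fixed-size window of its most recent frames.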
4. Generate Output
The output is generated using No, which takes the entire buffer as input. As in the buffer update, a projection of the speaker embedding Z is added.
Output of the network
The outputs of the network are vocoder features, which produce sound when fed into a vocoder. The paper uses the WORLD vocoder.
Acknowledgement and References
I thank the authors Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya Nachmani, and the FAIR team, for such an excellent contribution towards a solution for TTS.