This semester, as part of my complementary school work, I worked on the Text-To-Speech (TTS) problem for a few months at an AI startup in Munich (Luminovo.ai). Here, I will talk about my part of the work: the implementation and experiments on Angela Merkel's speech. I got results up to a certain level in the given time, and this post is meant to help future experimenters speed things up and avoid some hurdles. But first, I wish to give some background on the topic.
Generally, text-to-speech involves two steps: analysing the words to extract linguistic features, and synthesizing speech audio from these features. There are traditional approaches such as concatenative methods, where you stitch together audio snippets for the corresponding words, or parametric methods (e.g. hidden Markov models), where a domain expert builds a model of the complex linguistic-auditory features and iteratively trains it. However, these approaches are inflexible, and sound unnatural and robotic.
The state-of-the-art AI approaches have proved successful in capturing the linguistic details and producing a smooth, natural human sound (e.g. DeepVoice3 or Tacotron2). With the help of holy deep learning, it is possible to extract the language into an intermediate representation and synthesize the audio from that representation. For the former step, a sequence-to-sequence model with attention can be used to map the words to speech information such as pitch, accent, and modulation. These speech details can easily be represented in a power spectrum, and the mel-spectrogram is a sensible choice because it resolves frequency in a way better matched to the human auditory system.
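As a quick illustration of why the mel scale suits human hearing (a toy sketch, not part of the original experiments), the standard mel formula places equally spaced mel points more densely at low frequencies, where our ears resolve pitch better:

```python
import numpy as np

def hz_to_mel(f):
    # standard HTK-style mel formula
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # inverse of hz_to_mel
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# 10 points equally spaced on the mel scale between 0 and 8 kHz
mels = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 10)
freqs = mel_to_hz(mels)
gaps = np.diff(freqs)   # the spacing in Hz grows with frequency
```

The strictly increasing gaps show that a mel filterbank spends more of its resolution on the low-frequency range, which is where most speech energy lives.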
Eventually, these details are used to generate the raw audio. Since the WaveNet vocoder has proved more successful at this job than previous ones, we decided to use it. One interesting thing is that the seq-to-seq model and the vocoder can be trained independently; I worked on the WaveNet.
Initially, after reading the paper, I tried to implement the code from scratch. I strongly suggest you go through the paper, though I will try to describe the basics here in simple words for laymen.
Audio data is kept as a sequence of numerical values sampled from the sound at some frequency (for example, typical CD audio has 44K samples per second). As you can see in the figure above, the audio waves are represented by these sample values. Hence, compared to image data, audio is 1D instead of 2D, is bigger in size, and its values are highly dependent on each other along the time dimension. Therefore, it is harder to learn the dependencies with simple neural networks such as those we use in image classification models.
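To make this concrete, here is a minimal sketch (a synthetic tone, not real speech) of what one second of CD-quality audio looks like as data:

```python
import numpy as np

sr = 44100                       # CD-quality sampling rate (samples per second)
t = np.arange(sr) / sr           # one second of time stamps
audio = 0.5 * np.sin(2 * np.pi * 440.0 * t)   # a 440 Hz tone, values in [-0.5, 0.5]

# audio is a plain 1D array: 44100 highly time-correlated values per second
```

Even this one-second clip already contains 44,100 strongly correlated values, which hints at why modeling a few seconds of raw audio is so demanding.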
WaveNet proposes autoregressive learning with the help of convolutional networks and a few tricks. Basically, a convolution window slides over the audio data and at each step tries to predict the next sample value, which it has not yet seen. In other words, it builds a network that learns the causal relationships between consecutive timesteps. [see below]
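The core building block is the causal dilated convolution. A minimal numpy sketch (kernel size 2, function name hypothetical) shows the key property: the output at time t depends only on the current and past samples, never on future ones:

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1D causal convolution with kernel size 2:
    y[t] = w[0] * x[t - dilation] + w[1] * x[t],
    with left zero-padding so y has the same length as x."""
    pad = np.concatenate([np.zeros(dilation), x])
    return w[0] * pad[:-dilation] + w[1] * pad[dilation:]

x = np.arange(8, dtype=float)
y = causal_dilated_conv(x, w=np.array([1.0, 1.0]), dilation=2)
# y[t] = x[t-2] + x[t]; e.g. y[0] = 0 + x[0] = 0 and y[3] = x[1] + x[3] = 4
```

Stacking such layers with growing dilations lets the receptive field grow exponentially while each layer stays cheap.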
Typically, speech audio has a sampling rate of 22K or 16K. For a few seconds of speech, that means more than 100K values in a single example, which is enormous for the network to consume. Hence, we need to restrict the input size, preferably to around 8K samples.
In the end, the values are predicted over Q channels (e.g. Q=256 or 65536) and compared to the original audio compressed to Q distinct values. For that, mu-law quantization can be used: it maps the values into the range [0, Q-1]. The loss can then be computed either with cross-entropy or with a discretized logistic mixture.
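A minimal numpy sketch of mu-law companding for Q=256 (function names hypothetical; the formula is the standard mu-law from the WaveNet paper):

```python
import numpy as np

def mulaw_encode(x, Q=256):
    """Map x in [-1, 1] to integers in [0, Q-1] via mu-law companding."""
    mu = Q - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # still in [-1, 1]
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)           # quantize to [0, mu]

def mulaw_decode(q, Q=256):
    """Approximate inverse: integers in [0, Q-1] back to [-1, 1]."""
    mu = Q - 1
    y = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 5)
q = mulaw_encode(x)     # -1 -> 0, 0 -> 128, 1 -> 255
```

The companding step spends more of the 256 levels near zero amplitude, where the ear is most sensitive, so the round-trip error stays small for quiet samples.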
Local conditioning on Mel-Spectrograms
During training, we can condition the predictions step by step on the spectrogram frames. Since these frames are 2D and have far fewer timesteps, we can either simply replicate the values or use upsampling layers to match the length of the audio. Then we condition by adding the dilated features of the spectrogram to those of the audio.
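The replication variant is one line of numpy. A sketch (the hop length and frame count are made-up placeholders, not values from my experiments):

```python
import numpy as np

hop_length = 256                 # assumed audio samples per spectrogram frame
mel = np.random.randn(80, 32)    # 80 mel bins x 32 frames of dummy conditioning

# simplest upsampling: repeat every frame hop_length times along the time axis,
# so the conditioning sequence is as long as the raw audio it describes
upsampled = np.repeat(mel, hop_length, axis=1)   # shape (80, 32 * 256)
```

Learned upsampling layers (e.g. transposed convolutions) replace this hard repetition with a smoother interpolation, at the cost of extra parameters.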
- LJSpeech has 24 hours of English speech and text from a single female speaker. It is very clear, and the silences should be trimmed.
- M-AILABS announced their huge speech dataset earlier this year. They have thousands of speech recordings in many different languages. I used the Angela Merkel data from the German female section, which spans 12 hours of her public speeches and interviews.
- I went with PyTorch for my implementation, tracked training with TensorBoard, used gcloud Tesla K80 GPUs, connected to server ports via ‘ssh -NfL’, and heavily used JupyterLab during development. [life-saver kit]
- There are many parameters to keep track of: number of layers, number of stacks of those layers, skip and residual channel sizes, dilation base (2 or 3)… Since I lacked the computational power to search over these exhaustively, I chose the generally accepted values from previous GitHub implementations: 4 stacks of 6 layers, skip and residual channels of 256, and a dilation base of 3.
- The filter and gate neurons can either be fed by separate convolutional layers, or share one dilated convolution whose output is split. The former took much longer to overfit the model, so I went with the latter.
- It is possible to feed the network with scalars lying in [-1, 1], or to discretize them with mu-law and convert them to one-hot vectors. The latter proved to train faster.
- The existing experiments quantize the data to either 8 bits or 16 bits. The 8-bit model trains much faster but captures the audio only up to a certain level, whereas the 16-bit model slowly learns the dependencies better.
- I fed the model 8K-sample windows selected at random timesteps from random audio files, with a batch size of 1 or 2. [depends on your memory]
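For intuition, the receptive field implied by the configuration above (4 stacks of 6 layers, dilation base 3) can be computed directly. This back-of-the-envelope sketch assumes the usual kernel size of 2, which the post does not state explicitly:

```python
# Receptive field of stacked dilated causal convolutions (kernel size 2 assumed).
kernel_size = 2
stacks, layers, base = 4, 6, 3
dilations = [base ** i for i in range(layers)] * stacks   # 1, 3, ..., 243, repeated 4x
receptive_field = 1 + (kernel_size - 1) * sum(dilations)
# 1 + 4 * (1 + 3 + 9 + 27 + 81 + 243) = 1457 samples seen per predicted sample
```

So each predicted sample can depend on roughly the previous 1.5K samples, which is well under the 8K training window.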
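The split variant of the filter/gate computation mentioned above can be sketched in a few lines of numpy (names hypothetical): one dilated convolution produces 2*C channels, which are halved into a filter and a gate before the tanh × sigmoid gated activation:

```python
import numpy as np

def gated_activation(z):
    """Split one conv output of 2*C channels into filter and gate halves:
    out = tanh(z_f) * sigmoid(z_g), the WaveNet gated unit."""
    z_f, z_g = np.split(z, 2, axis=0)
    return np.tanh(z_f) * (1.0 / (1.0 + np.exp(-z_g)))

z = np.zeros((512, 100))       # 2 * 256 channels, 100 timesteps of dummy features
out = gated_activation(z)      # shape (256, 100); all zeros here since tanh(0) = 0
```

In PyTorch the same split is typically done with `torch.chunk` on the channel dimension, so only one convolution is run per layer.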
I tried the English data with q=8 bits, 2 stacks of 6 layers, 128 dimensions for the rest of the layers, and cross-entropy loss. I wanted to observe the results under memory constraints, and trained three models for ~11 days. The blue curve is without conditioning, and the red one is with mel-conditioning. Over the steps shown, blue has the lower loss, though much further into training we would expect red to go lower, since the conditioning adds information that helps the model learn better.
Unfortunately, due to the small layer dimensions in these hyperparameters, the resulting audio did not sound very satisfying.
1. “even when the woodcuts are very rude indeed”
2. “All the misdemeanants, whatever their offense, were lodged in this chapel ward.”
Angela Merkel Speech
1. “In diesem Jahr mit Sicherheit auch gerade das Thema Syrien.”
2. “Und am Rande werden natürlich auch außenpolitische Themen diskutiert.”
3. “Und deshalb sind Kenntnisse über Computer, gegebenenfalls auch über Computersprachen, über die Nutzung digitaler Medien, aber auch die Nutzung der eigenen Persönlichkeitsrechte was gebe ich preis,”
While implementing the model, I was influenced by several GitHub repositories [1, 2, 3, 4], though I decided to go my own way. Hence, most of my time went into understanding the details of the paper, checking existing implementations, and trying to come up with my own approach for different parts. I lost some time to bugs [causality, tensor calculations, transposes and wrong indexing] and wrong choices [using scalar inputs, more layers with fewer stacks, longer input lengths, sliding the window sample by sample during training]. Given the natural complexity of the problem, the results never came to sound natural.
Finally, the quality depends on your memory, time and money constraints. One could go further, trying many different parameters and training for 2–3 weeks, and it would converge and sound better. For that, the importance of a huge dataset of clear speech should not be underestimated. Also, there is still a research gap in performance across different languages due to the lack of good datasets. For example, comparing evaluations between languages would be interesting: a blindfold test could be performed with bilingual speakers and the opinion scores compared.
In the future, I would be interested to see results for music audio as well. Especially by learning on top of MIDI symbolic representations, I believe it will be possible to synthesize raw music from sheet music in the style of a particular musician or composer. Moreover, there is another direction of research on creating embeddings of different pitches and timbre features of instruments; NSynth is an autoencoder built from dilated encoder layers and a WaveNet decoder. Though the results are promising, it is computationally expensive and does not create a melodic line on its own. It would be interesting to figure out a model that learns embeddings in a continual manner, similar to MusicVAE but generating raw audio.
Special thanks to Luminovo.AI for their support!