Tacotron2 voice synthesis model explanation & experiments


Authors: Edward J. Yoon, Ellie Kang

Abstract: The natural-sounding conversation between a human and a machine demonstrated by Google Duplex at Google I/O '18 is considered by some to have already passed the Turing test. Multi-speaker speech recognition still has a long way to go, but deep-learning-based voice synthesis, which now exceeds the quality people expect from common speech synthesis (the voices produced by traditional concatenative synthesizers), seems to be getting closer to everyday use. In this article, we share our experiments, based mainly on Google's Tacotron 2 and DeepMind's WaveNet papers.

Mel Spectrogram

In Tacotron-2 and related work, the term mel spectrogram appears constantly. Waveform samples are converted with the STFT and stored in a matrix; more precisely, a one-dimensional speech signal becomes a two-dimensional representation. It is easy to think of it as the voice being converted into a photo-like picture.

Compressing this spectrogram along the mel curve, which reflects the characteristics of the human cochlea, yields a mel spectrogram.

* STFT (Short-Time Fourier Transform) = the result of repeatedly extracting a spectrum over short time windows

* spectrogram = time-frequency distribution graph
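
As a concrete illustration, here is a minimal sketch of how a waveform becomes a mel spectrogram, assuming the librosa library (the post itself does not name one); the file name and the STFT parameters (1024-point FFT, 300-sample hop, i.e. 12.5 ms at 24 kHz) are illustrative placeholders.

```python
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=24000)        # "sample.wav" is a placeholder path

# STFT: extract a spectrum repeatedly over short windows, keep the magnitudes
magnitude = np.abs(librosa.stft(y, n_fft=1024, hop_length=300))   # 12.5 ms hop at 24 kHz

# Warp the linear-frequency axis onto 80 mel bands (the "mel curve")
mel = librosa.feature.melspectrogram(S=magnitude**2, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel)                   # log compression

print(mel_db.shape)                                 # (80, number_of_frames): the 2-D "picture"
```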

A Review of Tacotron-2 Architecture

[Figure: Architecture of Tacotron-2]

The model architecture of Tacotron-2 is divided into two major parts as you can see above.

1) Spectrogram prediction network: converts character sequences to mel spectrograms (a simplified PyTorch sketch of the encoder stage follows the pipeline below)

ㅇ ㅏ ㄴ ㄴ ㅕ ㅇ ㅎ ㅏ ㅅ ㅔ 요 → character embedding

→ 3 convolution layers → bi-directional LSTM (512 neurons) → encoded features

→ attention unit

→ LSTM layer (2 uni-directional layers with 1024 neurons) → linear transform → predicted spectrogram frame

→ PostNet (5 convolutional layers) → enhanced prediction

… and finally → modified WaveNet
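
To make the encoder half of this pipeline concrete, here is a highly simplified PyTorch sketch of the character embedding → 3 convolution layers → bi-directional LSTM stage. It is our illustration rather than the official implementation, and the vocabulary size and embedding dimension are placeholders.

```python
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Character embedding -> 3 conv layers -> bi-directional LSTM (2 x 256 = 512 outputs)."""
    def __init__(self, num_symbols=80, embed_dim=512):    # num_symbols is a placeholder
        super().__init__()
        self.embedding = nn.Embedding(num_symbols, embed_dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(embed_dim, embed_dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(embed_dim),
                nn.ReLU(),
            )
            for _ in range(3)
        ])
        self.lstm = nn.LSTM(embed_dim, 256, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                      # char_ids: (batch, text_length)
        x = self.embedding(char_ids).transpose(1, 2)  # (batch, embed_dim, text_length)
        for conv in self.convs:
            x = conv(x)
        encoded, _ = self.lstm(x.transpose(1, 2))     # (batch, text_length, 512)
        return encoded                                # consumed by the attention unit

features = EncoderSketch()(torch.randint(0, 80, (1, 12)))   # 12 jamo tokens -> (1, 12, 512)
```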

2) Modified WaveNet: converts the mel spectrogram to speech

This can be summarized as follows.

Tacotron-made mel spectrogram + WaveNet vocoder - Griffin-Lim algorithm = Tacotron 2

That is, it maps a text sequence to an audio sequence (80-dimensional mel spectrogram frames every 12.5 ms → a 24 kHz waveform).

* Griffin-Lim algorithm = first used in Tacotron 1; an algorithm that estimates the phase information discarded by the STFT when the waveform is converted to a spectrogram
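
For reference, phase reconstruction with Griffin-Lim can be sketched as follows; librosa is an illustrative choice of library, and the file name and STFT settings are placeholders.

```python
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=24000)
magnitude = np.abs(librosa.stft(y, n_fft=1024, hop_length=300))  # phase is discarded here

# Griffin-Lim iteratively estimates a phase consistent with these magnitudes and then
# inverts the STFT -- the role the WaveNet vocoder takes over in Tacotron 2.
y_hat = librosa.griffinlim(magnitude, n_iter=32, hop_length=300)
```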

Changes in Speech Technology

RNN, LSTM → Tacotron (spectrogram + Griffin-Lim) → Tacotron 2 (mel spectrogram + WaveNet vocoder)

CNN → WaveNet → Parallel WaveNet + DCTTS + Deep Voice 3 → FloWaveNet

The History Behind WaveNet's Slow Inference

WaveNet is a CNN model that moves away from the sequential modeling of earlier RNNs and LSTMs; the goal was to introduce parallelism. Each time step's operations can be processed in parallel during training, but because of the masked, dilated causal convolutions the model remains autoregressive, so samples must still be generated sequentially. As a result, training was fast but inference was slow.
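
A minimal PyTorch sketch of a dilated causal convolution may make the trade-off clearer: the whole training sequence is convolved in one parallel pass, but generation must still proceed sample by sample because every new output depends on previously generated outputs. The channel count, kernel size, and stack depth below are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Dilated causal convolution: the output at time t only sees inputs up to time t."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation       # pad the past, never the future
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))

# Training: all 1000 time steps go through the dilated stack in one parallel pass.
stack = nn.Sequential(*[CausalConv1d(16, dilation=2 ** i) for i in range(8)])
y = stack(torch.randn(1, 16, 1000))                        # same length out: (1, 16, 1000)

# Inference is different: sample t+1 must be fed back in before sample t+2 can be
# computed, so an autoregressive model like WaveNet generates one sample at a time.
```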

To solve this problem, Parallel WaveNet constructed two training pipelines, a teacher (a standard autoregressive WaveNet) and a student with an IAF (inverse autoregressive flow) structure.

However, the two training pipelines slowed training down. To compensate, the recent FloWaveNet model, which replaces the two pipelines with a single one, has appeared.

Other technologies include Google Cloud's TTS infrastructure and Baidu's Deep Voice. Google Cloud's WaveNet-based, optimized TTS infrastructure is reported to be 1000x faster than the original WaveNet, but the underlying technology is not publicly available. Alternatively, there is the WaveNet-based Deep Voice open-source technology, which Baidu presents under the heading of real-time TTS. [2]

In short, WaveNet has had the problem that inference takes too long. Parallel WaveNet and FloWaveNet, which set out to solve it, are still evolving. The field is looking for new networks that can catch both rabbits at once, fast training and fast inference, and many people are actively researching them.

Korean Hangul Processing

In the case of Hangul, one syllable decomposes into consonants and a vowel, so we suspected its pronunciation rules might be more complicated than those of other languages. For example, we had a concern like this: since the same consonant 'ㅁ' can appear either as the initial sound of a syllable or as its final sound (batchim), would the model still be able to learn if everything were decomposed and embedded like the English alphabet?

We are not linguists, so we cannot claim to be precise, but Hangul syllables are composed of at least one consonant and one vowel. For example, 'ㄱ' (consonant) + 'ㅏ' (vowel) + 'ㄴ' (consonant) = 간, and 'ㅈ' (consonant) + 'ㅣ' (vowel) = 지. In our experiments we confirmed that Hangul pronunciation rules can indeed be learned in all such cases from the decomposed jamo sequence.
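
For illustration, here is a minimal Python sketch of that jamo decomposition, using the Unicode arithmetic for precomposed Hangul syllables; our actual preprocessing code may differ in detail.

```python
# 19 initial consonants, 21 vowels, 27 final consonants (plus "no final")
CHOSEONG  = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def decompose(text):
    """Split each precomposed Hangul syllable (U+AC00..U+D7A3) into its jamo."""
    jamo = []
    for ch in text:
        offset = ord(ch) - 0xAC00
        if 0 <= offset < 19 * 21 * 28:
            jamo.append(CHOSEONG[offset // (21 * 28)])
            jamo.append(JUNGSEONG[(offset % (21 * 28)) // 28])
            if offset % 28:                      # the final consonant is optional
                jamo.append(JONGSEONG[offset % 28])
        else:
            jamo.append(ch)                      # pass non-Hangul characters through
    return jamo

print(decompose("안녕하세요"))
# ['ㅇ', 'ㅏ', 'ㄴ', 'ㄴ', 'ㅕ', 'ㅇ', 'ㅎ', 'ㅏ', 'ㅅ', 'ㅔ', 'ㅇ', 'ㅛ']
```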

Finishing…

So far, we have taken a look at deep-learning-based speech synthesis and speech recognition. Korean audio samples and their comparisons can be found in [3]. We close this post hoping that our experiments and experience will be helpful to someone.
