Tacotron 2: A High-Quality Speech Synthesis Model Using AI for Waveform Conversion

David Cochard
Published in axinc-ai
Dec 26, 2023

This article introduces a high-quality speech synthesis model that performs waveform conversion using AI. With Tacotron 2, it is possible to make AI speak any given text. Moreover, by using models trained by ax Inc., it also supports Japanese.

Overview

Tacotron 2 is a speech synthesis model developed by Google and implemented by NVIDIA. Since the training code for this model is publicly available, it can be retrained to support additional languages.

Architecture

Tacotron is an Encoder-Decoder model that generates a Mel-spectrogram from text. In Tacotron 2, this is further enhanced by applying the WaveGlow model to convert the Mel-spectrogram into a waveform, resulting in even higher quality audio.

Tacotron 2 architecture (Source: https://arxiv.org/pdf/1712.05884.pdf)

Tokenizer

Text is converted into tokens using a tokenizer called text_to_sequence.

For the tokenizer’s input, ASCII text is used for English. For Japanese, text representing phonemes is used, produced by the OpenJTalk g2p (grapheme-to-phoneme) system.

strs[1] = pyopenjtalk.g2p(strs[1], kana=False)
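As a quick illustration (assuming pyopenjtalk is installed), the g2p call turns Japanese text into a space-separated phoneme string, which is then fed to the tokenizer:

import pyopenjtalk

# Convert Japanese text into a space-separated phoneme string
phonemes = pyopenjtalk.g2p("こんにちは", kana=False)
print(phonemes)  # k o N n i ch i w a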

The text is formatted using a cleaner. The basic cleaner converts uppercase letters to lowercase and normalizes all spaces to single spaces. The cleaner for English input goes further, converting numbers into English words and abbreviations like “Dr.” into their expanded forms.
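As a rough sketch of what the basic cleaner does (simplified for illustration; the repository’s text/cleaners.py also handles Unicode transliteration, and the English cleaner adds number and abbreviation expansion):

import re

_whitespace_re = re.compile(r'\s+')

def basic_clean(text):
    # Lowercase and collapse runs of whitespace into a single space
    text = text.lower()
    text = re.sub(_whitespace_re, ' ', text)
    return text

print(basic_clean("Hello   WORLD"))  # "hello world"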

The symbol set is defined as follows: alphabetic characters are converted directly into their symbol values, while CMU-style ARPAbet phonetic symbols, if present in the input, are assigned dedicated symbol numbers.

_pad        = '_'
_punctuation = '!\'(),.:;? '
_special = '-'
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters):
_arpabet = ['@' + s for s in cmudict.valid_symbols]

# Export all symbols:
symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet

Here is an example of the tokens generated when converting plain alphabetic text.

Input text : Hello world.
Output tokens : [45 42 49 49 52 11 60 52 55 49 41 7]
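As a sanity check, the token IDs above can be reproduced with a stripped-down lookup over the symbol list defined earlier (the real text_to_sequence also handles curly-brace ARPAbet sequences and the full cleaner pipeline):

# Rebuild the non-ARPAbet part of the symbol table from the definitions above
_pad = '_'
_punctuation = '!\'(),.:;? '
_special = '-'
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters)
_symbol_to_id = {s: i for i, s in enumerate(symbols)}

def simple_text_to_sequence(text):
    # Apply the basic cleaner (lowercase) and map each character to its symbol ID
    return [_symbol_to_id[c] for c in text.lower() if c in _symbol_to_id]

print(simple_text_to_sequence("Hello world."))
# [45, 42, 49, 49, 52, 11, 60, 52, 55, 49, 41, 7]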

When the batch size is 1, the tokens are used as they are. When the batch size is greater than 1, the tokens are zero-padded to the length of the longest sequence in the batch, as sketched below.
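A minimal sketch of the zero-padding step (the helper name and use of NumPy are assumptions for illustration):

import numpy as np

def pad_token_batch(sequences):
    # Zero-pad every token sequence to the length of the longest one
    max_len = max(len(seq) for seq in sequences)
    batch = np.zeros((len(sequences), max_len), dtype=np.int64)
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = seq
    return batch

print(pad_token_batch([[45, 42, 49], [60, 52, 55, 49, 41]]))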

Encoder

In the Encoder, tokens are converted into embeddings.

Decoder

In the Decoder, the embeddings and the result of the previous inference step are used as inputs to generate the Mel-spectrogram one frame at a time. Each decoding step produces a frame of size (80, 1), and the frames are concatenated to form a Mel-spectrogram of size (80, n_frames). The decoding process is terminated when the value of gate_output falls below 0.6.
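The inference loop can be sketched roughly as follows (decoder_step here is a hypothetical callable wrapping one run of the decoder network; the real implementation also tracks attention states and enforces a maximum number of steps):

import numpy as np

GATE_THRESHOLD = 0.6
MAX_DECODER_STEPS = 1000

def decode(decoder_step, memory):
    # memory: encoder embeddings for the input tokens
    frames = []
    prev_frame = np.zeros((80, 1), dtype=np.float32)
    state = None
    for _ in range(MAX_DECODER_STEPS):
        mel_frame, gate_output, state = decoder_step(prev_frame, memory, state)
        frames.append(mel_frame)          # one (80, 1) column per step
        if gate_output < GATE_THRESHOLD:  # stop condition described above
            break
        prev_frame = mel_frame
    return np.concatenate(frames, axis=1)  # (80, n_frames)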

When the batch size is greater than 1, a mask is used to ignore the padded regions: each token position is marked with a True/False value indicating whether it is valid. In this context, False means valid and True means invalid (padded).
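A minimal sketch of building such a mask from the sequence lengths (helper name is illustrative):

import numpy as np

def make_pad_mask(lengths, max_len):
    # False = valid token, True = padded position to be ignored
    positions = np.arange(max_len)
    return positions[None, :] >= np.asarray(lengths)[:, None]

print(make_pad_mask([3, 5], 5))
# [[False False False  True  True]
#  [False False False False False]]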

After decoding of the Mel-spectrogram is complete, it is refined (denoised) by PostNet, which predicts a residual that is added to the decoder output.

Conversion from Mel-Spectrogram to PCM Waveform

Since a Mel-spectrogram is a power spectrum and does not include phase information, the phase must be predicted in order to convert it into a PCM waveform.

Traditionally, the phase was predicted using the Griffin-Lim algorithm, an iterative method. However, the phase predicted by Griffin-Lim tends to be noisy, so neural vocoders have recently been used instead.
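For reference, a Griffin-Lim inversion can be sketched with librosa (this assumes a linear-power Mel-spectrogram; Tacotron 2 actually outputs log-scale values, which would need to be exponentiated first, and the file name and STFT parameters shown are only placeholders):

import numpy as np
import librosa

# Hypothetical (80, n_frames) Mel-spectrogram, converted back to a linear power scale
mel = np.exp(np.load("mel.npy"))

# Griffin-Lim iteratively estimates the missing phase while inverting the mel filterbank
wav = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256, win_length=1024, n_iter=60
)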

In Tacotron 2, a neural vocoder called WaveGlow is adopted; it predicts the phase and performs the PCM conversion in a single pass. Using WaveGlow yields PCM waveforms with less noise.

Training

Training can be done using TensorFlow and PyTorch. However, it depends on various libraries including TensorFlow 1.15 and NVIDIA’s Apex, making version compatibility complex. For the training performed at ax Inc., we set up the environment on Windows with Python 3.6.8 and the following package versions.

Package              Version
-------------------- -----------
apex                 0.1
gast                 0.2.2
h5py                 3.1.0
Keras-Applications   1.0.8
Keras-Preprocessing  1.1.2
matplotlib           2.1.0
numba                0.48.0
numpy                1.17.0
onnxruntime          1.10.0
protobuf             3.19.6
scikit-learn         0.24.2
scipy                1.0.0
soundfile            0.12.1
tensorboard          1.15.0
tensorflow           1.15.2
tensorflow-estimator 1.15.1
torch                1.7.0+cu110
torchvision          0.8.1+cu110

During training, pairs of text and audio files are used. If there are few available audio files, pre-training with a large dataset and then applying transfer learning can be used to improve sound quality.

Here is an example of a line from a dataset used for training in Japanese. The audio file path comes first, followed by a pipe character and the phonetic representation of the text.

../datasets/tsukuyomi/meian/VOICEACTRESS100_001.wav|mata,toojinoyooni,godaimyooootoyobareru,shuyoonamyoooonochuuoonihaisarerukotomoooi.
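A minimal sketch for reading such a filelist (assuming one "wav_path|phoneme_text" pair per line; the path passed in below is a placeholder):

def load_filelist(path):
    # Each line: "<wav path>|<phoneme text>"
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            wav_path, text = line.rstrip("\n").split("|", 1)
            pairs.append((wav_path, text))
    return pairs

for wav_path, text in load_filelist("filelists/train.txt"):
    print(wav_path, text[:40])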

The text portion can be defined with any phonemes, so including accent marks in the training can enable the model to reflect accents in the synthesized speech.

It should be noted that if only Tacotron 2 is retrained, it can replicate certain voice timbres to some extent, but the voice tends to sound robotic. This seems to be caused by a misalignment in the waveform’s phase. Retraining WaveGlow as well can help mitigate this issue.

Conversion of the Trained Model to ONNX

The trained model can be converted to ONNX using NVIDIA’s samples. Since the Tacotron 2 model cannot be exported directly due to LSTM-related errors, NVIDIA’s samples split the model into several parts so that it can be exported with PyTorch.

cd DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2/
mkdir output
python tensorrt/convert_tacotron22onnx.py --tacotron2 ../../../../models/tacotron2_statedict.pt -o ../../../../onnx/nvidia
python tensorrt/convert_waveglow2onnx.py --waveglow ../../../../models/nvidia_waveglow256pyt_fp16 --config-file config.json --wn-channels 256 -o output/

The converted ONNX model is divided into four parts: the encoder, which obtains embeddings from text; the decoder, which is executed repeatedly to generate the Mel-spectrogram; PostNet, which denoises the Mel-spectrogram; and finally WaveGlow, which produces the audio waveform.
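As a quick check, the exported parts can be inspected with onnxruntime (the file names below are assumptions; use whatever the conversion scripts actually wrote to the output directory):

import onnxruntime as ort

# Assumed output file names for the four exported parts
for name in ["encoder.onnx", "decoder_iter.onnx", "postnet.onnx", "waveglow.onnx"]:
    sess = ort.InferenceSession(name)
    print(name, [(i.name, i.shape) for i in sess.get_inputs()])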

It should be noted that the WaveGlow included in Tacotron 2 is newer than the WaveGlow used by the TensorRT converter. Therefore, if WaveGlow is retrained, the checkpoint cannot be loaded as is.

A converter that supports the newer version of WaveGlow is available at the link below.

Usage With ailia SDK

The ailia MODELS repository contains NVIDIA’s official English speech synthesis model as well as the Japanese speech synthesis model that we independently trained.

Here is a command example for the English version.

python3 tacotron2.py -m nvidia -i "Hello world"

Below is an example of inference in Japanese. The Japanese speech synthesis model was trained on the Tsukuyomi-chan corpus, so it is necessary to adhere to the Tsukuyomi-chan terms of use (in Japanese only).

python3 tacotron2.py -m tsukuyomi -i "こんにちは"

To use Japanese, pyopenjtalk is required for phoneme conversion.

# macOS, Linux
pip3 install pyopenjtalk
# Windows
pip3 install pyopenjtalk-prebuilt

Introduction to ailia AI Voice

We are currently developing ailia AI Voice, a library designed to simplify the use of AI voice synthesis.

To use Tacotron 2 in Japanese, pyopenjtalk is necessary, which makes it difficult to run on iOS or Android.

With ailia AI Voice, the plan is to provide a library for iOS and Android that includes pyopenjtalk, enabling the use of speech synthesis on mobile devices.

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.
