[2020] Speech Generation 0: Vocoder and RNN- and CNN-based Speech Waveform Generative Models

Yi-Chiao Wu (吳宜樵)
Aug 1, 2020 · 9 min read


In this article, I will go through the basic background of speech generation and the development of recent RNN- and CNN-based speech generative models. If you are interested, you can also access the video version or the Mandarin (中文) version.

Speech Synthesis

Speech synthesis (SS) is a technique to generate specific speech according to given inputs such as text (text-to-speech, TTS). The core of SS is the controllability of speech components, and the fundamental technique is called a vocoder [H. Dudley, 1939]. Conventional vocoders such as STRAIGHT [H. Kawahara+, 1999] encode speech into ad-hoc acoustic features and then decode speech from these features. However, the many ad-hoc assumptions these vocoders impose on speech modeling cause marked quality degradation. Recently, many neural vocoders have been proposed to model the speech waveform directly, without most of these ad-hoc assumptions.

In this article, I will first introduce the background of speech generation and then go through the developments of recent recurrent neural network (RNN)- and convolutional neural network (CNN)-based generative models.

Index Terms — Speech synthesis, vocoder, RNN-based speech generation, CNN-based speech generation

Speech Signal

As we know, speech is a sequential signal with a very high temporal resolution. The sampling rate (Fs) of a speech signal is usually 16 kHz or higher, so we generally have to model more than 16,000 sample points per second. As a result, directly modeling the speech waveform is challenging.

Speech signal
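As a rough sense of scale (the sampling rate and durations below are chosen arbitrarily), a few lines of Python are enough to see how many points must be modeled:

```python
# Back-of-the-envelope sample counts for raw waveform modeling.
fs = 16000            # 16 kHz sampling rate, a common lower bound for speech
for duration_sec in (1, 10, 60):
    num_samples = fs * duration_sec
    print(f"{duration_sec:3d} s -> {num_samples:,} samples to model")
# 1 s -> 16,000 samples; 60 s -> 960,000 samples, each of which a generative
# model has to produce (sequentially, in the autoregressive case).
```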

Vocoder and Source-filter Model

The basic technique for tackling speech generation is the vocoder (voice coder). It consists of an encoder, which encodes speech into acoustic features such as spectral and prosodic features, and a decoder, which synthesizes the speech waveform from these acoustic features.

Vocoder

One of the most general formulations of speech generation is the source-filter model [R. McAulay+, 1986]. Specifically, speech generation is formulated as the convolution of an excitation signal with a spectral filter. Since speech includes voiced and unvoiced components, the voiced and unvoiced excitations are modeled separately. The voiced excitation, generated by the vocal fold movements, is a quasi-periodic signal with clear harmonic components. The unvoiced excitation is a white-noise-like signal produced without vocal fold movements. The spectral filter is a time-variant filter modeling the vocal tract resonances. Finally, a high-pass filter is usually applied to model the lip radiation.

Source-filter model
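As a rough illustration, the numpy/scipy sketch below builds a toy source-filter synthesizer. The constant F0, the voiced/unvoiced mixing weights, and the all-pole coefficients are all made up for illustration; a real vocoder estimates them frame by frame from speech.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
f0 = 120.0          # assumed constant pitch for this toy example (Hz)
n = fs              # one second of samples

# Voiced excitation: a quasi-periodic impulse train at the pitch period.
period = int(round(fs / f0))
voiced = np.zeros(n)
voiced[::period] = 1.0

# Unvoiced excitation: white noise.
unvoiced = 0.1 * np.random.randn(n)

# Toy voiced/unvoiced mixing (real vocoders estimate this per band and frame).
excitation = 0.8 * voiced + 0.2 * unvoiced

# Spectral (vocal tract) filter: an arbitrary all-pole filter standing in for
# per-frame LPC / spectral-envelope coefficients estimated from real speech.
a = [1.0, -1.3, 0.7]
speech = lfilter([1.0], a, excitation)
```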

One of the most popular source-filter vocoders is the mixed excitation vocoder, such as STRAIGHT and WORLD [M. Morise+, 2016]. In the analysis (encoding) stage, the mixed excitation vocoder first extracts the fundamental frequency (F0) of the input speech and then extracts the aperiodicity (ap) and spectral envelope (sp) based on the extracted F0. In the synthesis (decoding) stage, the vocoder first generates the mixed excitation signal from the F0, ap, and white noise. The mixed excitation signal is then filtered by the spectral filter to generate the synthetic speech. However, phase information and some temporal details are discarded during the analysis-synthesis process, which causes significant speech quality degradation.

Source-filter vocoder
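For a concrete example, the WORLD analysis-synthesis pipeline is available through the pyworld package. A minimal sketch follows (the file names are placeholders, and WORLD expects a mono float64 waveform):

```python
import numpy as np
import soundfile as sf
import pyworld as pw

# Load a mono utterance; WORLD expects a float64 waveform.
x, fs = sf.read("speech.wav")       # "speech.wav" is a placeholder path
x = np.ascontiguousarray(x, dtype=np.float64)

# Analysis (encoding): F0, spectral envelope (sp), and aperiodicity (ap).
f0, t = pw.dio(x, fs)               # raw F0 estimation
f0 = pw.stonemask(x, f0, t, fs)     # F0 refinement
sp = pw.cheaptrick(x, f0, t, fs)    # spectral envelope
ap = pw.d4c(x, f0, t, fs)           # aperiodicity

# Synthesis (decoding): mixed excitation + spectral filtering inside WORLD.
y = pw.synthesize(f0, sp, ap, fs)
sf.write("resynthesized.wav", y, fs)
```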

RNN-based Speech Generation

Recently, instead of relying on conventional vocoders, many neural speech generative models have been proposed. When modeling a sequential signal like speech, the straightforward choice is a recurrent neural network (RNN). However, since speech has a very high temporal resolution, an RNN suffers from vanishing/exploding gradients when modeling such long-term dependencies. Moreover, the sample-by-sample generation mechanism of an RNN usually makes generation very time-consuming.

RNN
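The following toy PyTorch sketch (untrained weights, arbitrary hidden size) makes the cost concrete: a plain autoregressive RNN needs one sequential forward call per sample, i.e. 16,000 calls for a single second of 16 kHz speech.

```python
import torch
import torch.nn as nn

hidden_size = 256
rnn = nn.GRUCell(input_size=1, hidden_size=hidden_size)   # toy single-cell RNN
to_sample = nn.Linear(hidden_size, 1)                      # maps hidden state to a sample

h = torch.zeros(1, hidden_size)
x = torch.zeros(1, 1)                                      # previous sample (start from silence)
samples = []

# One forward call per sample: 16,000 sequential steps for one second at 16 kHz.
with torch.no_grad():
    for _ in range(16000):
        h = rnn(x, h)
        x = torch.tanh(to_sample(h))
        samples.append(x)

waveform = torch.cat(samples, dim=1)                       # shape (1, 16000)
```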

SampleRNN

To tackle this weakness of RNNs in modeling long sequences, SampleRNN [S. Mehri+, 2016] has been proposed. The dependency among speech samples is formulated as a conditional probability: the distribution of each sample is conditioned on all previous samples.

SampleRNN adopts a hierarchical structure with multiple RNN layers, each operating at a different temporal resolution. The network can therefore model long-term dependencies well by capturing the hierarchical information of speech signals. However, the multi-layer RNNs and the autoregressive (AR) mechanism still make generation very slow.

SampleRNN
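A heavily simplified sketch of the hierarchical idea, assuming just two tiers (a frame-level GRU over blocks of 16 samples that conditions a frame predictor); the actual SampleRNN uses more tiers, an autoregressive sample-level module, and a categorical output layer.

```python
import torch
import torch.nn as nn

frame_size = 16                                  # samples per coarse step (assumed)
frame_rnn = nn.GRU(input_size=frame_size, hidden_size=256, batch_first=True)
sample_net = nn.Sequential(                      # fine tier, simplified: one whole frame at once
    nn.Linear(256 + frame_size, 256), nn.ReLU(), nn.Linear(256, frame_size)
)

x = torch.randn(1, 1600)                         # toy waveform: 100 frames of 16 samples
frames = x.view(1, -1, frame_size)               # (1, 100, 16)

# Coarse tier: one GRU step per frame, i.e. a much lower temporal resolution,
# so long-term structure is carried by far fewer recurrent steps.
cond, _ = frame_rnn(frames)                      # (1, 100, 256) conditioning vectors

# Fine tier: predict the next frame of samples from the conditioning vector and
# the current frame (teacher forcing; the real model is AR sample by sample).
pred_next = sample_net(torch.cat([cond, frames], dim=-1))   # (1, 100, 16)
```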

WaveRNN and LPCNet

To achieve real-time speech generation, WaveRNN [N. Kalchbrenner+, 2018] and LPCNet [J.-M. Valin+, 2018] have been proposed. WaveRNN is conditioned not only on previous speech samples but also on acoustic features h.

Since the acoustic features provide very strong prior information, a single-layer gated recurrent unit (GRU) with specific hardware-optimized designs achieves real-time generation. Moreover, the authors of LPCNet incorporate the source-filter model into WaveRNN so that the network only has to generate the residual (source) signal of the LPC prediction. Since the residual signal is almost speaker-independent, the burden of modeling speaker identity and spectral information has been eased. Therefore, a very compact network is enough for LPCNet. In conclusion, the main challenge for recent RNN-based speech generation models is how to develop a very compact model for real-time speech generation.
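To illustrate what is left for the LPCNet network to model, the sketch below computes an LPC prediction and its residual for a single frame with librosa. The file path is a placeholder, and LPCNet actually derives its per-frame LPC coefficients from the cepstral input features rather than from the waveform directly.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

# Load a short utterance (placeholder path) and keep one analysis frame for simplicity.
y, fs = librosa.load("speech.wav", sr=16000)
frame = y[8000:8000 + 320]                       # one 20 ms frame at 16 kHz

order = 16
a = librosa.lpc(frame, order=order)              # all-pole coefficients, a[0] == 1

# LPC prediction of each sample from its past, and the residual (excitation)
# signal that the neural network actually has to model.
prediction = lfilter([0.0] + list(-a[1:]), [1.0], frame)
residual = frame - prediction
```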

CNN-based Speech Generation

On the other hand, many CNN-based speech generative models have also been proposed. Since a CNN has a fixed geometric structure without any recurrent mechanism, the number of previous speech samples it conditions on is fixed. This fixed-length segment is called the receptive field. To model the very long-term dependencies of speech, a very long receptive field is required, which can be achieved with a large kernel size or deep CNNs. However, the computation and memory requirements also increase markedly. Therefore, the main challenge for CNN-based models is to increase the receptive field length efficiently.
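The trade-off is easy to quantify: each 1-D convolution layer adds (kernel size - 1) × dilation samples to the receptive field, so a plain stack grows linearly with depth while a dilated stack can grow exponentially. A small sketch with arbitrary depth and kernel size:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of 1-D convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Plain CNN: 10 layers, kernel size 3, no dilation -> only 21 samples.
print(receptive_field(3, [1] * 10))

# Dilated CNN: same depth and kernel size, dilations 1, 2, 4, ..., 512 -> 2047 samples.
print(receptive_field(3, [2 ** i for i in range(10)]))
```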

WaveNet

The first CNN-based model to achieve state-of-the-art performance is WaveNet [A. Oord+, 2016]. WaveNet adopts dilated convolutional neural networks (DCNNs) [F. Yu+, 2016] to enlarge the receptive field efficiently. Stacking DCNNs with different dilation sizes (the lengths of the skips) efficiently increases the receptive field length and captures the hierarchical information of the speech signal. However, WaveNet still needs 30 DCNN layers to achieve good speech quality. The huge network and the AR mechanism make its generation very slow; it usually takes more than one minute to generate one second of speech.

Dilated CNN
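A heavily simplified PyTorch sketch of such a causal, dilated convolution stack: it keeps only the residual connections and three cycles of dilations 1, 2, ..., 512 (matching the 30 layers mentioned above), and omits the gated activations, skip connections, and categorical output of the real WaveNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedBlock(nn.Module):
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left-pad so no future samples leak in
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):
        y = self.conv(F.pad(x, (self.pad, 0)))
        return x + torch.tanh(y)                         # residual connection (simplified)

# Three cycles of dilations 1, 2, ..., 512 give 30 layers; with kernel size 2
# the total receptive field spans roughly 3,000 samples.
dilations = [2 ** i for i in range(10)] * 3
net = nn.Sequential(*[CausalDilatedBlock(64, 2, d) for d in dilations])

x = torch.randn(1, 64, 16000)                            # (batch, channels, samples)
out = net(x)                                             # same length as the input
```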

Flow-based SS

As a result, many non-AR CNN-based models have been proposed to take advantage of the parallel generation enabled by the non-AR mechanism and achieve real-time generation. For example, flow-based models such as the inverse autoregressive flow-based Parallel WaveNet [A. Oord+, 2017] and ClariNet [W. Ping+, 2019], and the Glow-based WaveGlow [R. Prenger+, 2019] and FloWaveNet [S. Kim+, 2019], adopt invertible networks conditioned on acoustic features to transform speech into white noise in the training stage and recover the speech signal from a white noise input in the testing stage. However, the invertibility requirement usually results in difficult training and a huge network.

Flow-based speech generation
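The invertibility idea can be sketched with a single affine coupling layer: half of the samples predict a scale and shift for the other half, so the mapping can be run exactly in both directions. This is only a toy; real models stack many such layers, add invertible channel mixing, and condition the coupling network on acoustic features such as mel-spectrograms.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible layer: half of the samples predict a scale/shift for the other half."""
    def __init__(self, half_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(half_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * half_dim))

    def forward(self, x):                      # training direction: speech -> noise-like z
        xa, xb = x.chunk(2, dim=-1)
        log_s, b = self.net(xa).chunk(2, dim=-1)
        # log_s.sum() would enter the maximum-likelihood loss as the log-determinant.
        return torch.cat([xa, xb * torch.exp(log_s) + b], dim=-1), log_s

    def inverse(self, z):                      # generation direction: z -> speech
        za, zb = z.chunk(2, dim=-1)
        log_s, b = self.net(za).chunk(2, dim=-1)
        return torch.cat([za, (zb - b) * torch.exp(-log_s)], dim=-1)

layer = AffineCoupling(half_dim=128)
x = torch.randn(4, 256)                        # toy "speech" segments of 256 samples
z, log_s = layer(x)
x_rec = layer.inverse(z)                       # recovers x up to numerical error
```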

GAN-based SS

Furthermore, GAN-based non-AR models such as Parallel WaveGAN (PWG) [R. Yamamoto+, 2020] and MelGAN [K. Kumar+, 2019] have also been proposed; they adopt GANs to attain very compact generative models with high-fidelity speech generation. However, the stability of GAN training is challenging. PWG adopts a multi-short-time-Fourier-transform (multi-STFT) loss module to keep GAN training stable. The generator of PWG is a non-AR WaveNet whose inputs are white noise and upsampled acoustic features, which have the same temporal length as the target speech waveform. The generator is trained with both the GAN loss and multi-STFT losses computed with different FFT lengths, frame lengths, and hop sizes. The multi-STFT losses make the generator capture the hierarchical information of speech signals and stabilize the training. Instead of multi-STFT losses, MelGAN adopts multiple discriminators operating at different temporal resolutions to make its generator capture hierarchical information. In conclusion, capturing the hierarchical information of speech signals is a key point for keeping GAN training stable.

GAN-based speech generation
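A condensed PyTorch sketch of the multi-resolution STFT loss idea, with a spectral-convergence term and a log-magnitude term summed over three arbitrary STFT settings; the exact resolutions and weighting in the PWG paper may differ.

```python
import torch

def stft_mag(x, fft_size, hop, win_len):
    window = torch.hann_window(win_len, device=x.device)
    spec = torch.stft(x, fft_size, hop_length=hop, win_length=win_len,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_stft_loss(pred, target,
                    resolutions=((1024, 256, 1024), (2048, 512, 2048), (512, 128, 512))):
    """Sum of spectral-convergence and log-magnitude losses over several STFT settings."""
    loss = 0.0
    for fft_size, hop, win_len in resolutions:
        p = stft_mag(pred, fft_size, hop, win_len)
        t = stft_mag(target, fft_size, hop, win_len)
        sc = torch.norm(t - p) / torch.norm(t)      # spectral convergence
        mag = (t.log() - p.log()).abs().mean()      # log-magnitude distance
        loss = loss + sc + mag
    return loss / len(resolutions)

pred = torch.randn(2, 16000)      # generator output (toy)
target = torch.randn(2, 16000)    # ground-truth waveform (toy)
print(multi_stft_loss(pred, target))
```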

Vocoder Summarization

To summarize, vocoder techniques fall into two main categories: source-filter vocoders and unified vocoders. A source-filter vocoder includes an excitation generation module and a resonance filtering module; the inputs are acoustic features and the output is a speech waveform. In conventional vocoders, both modules are parametric, as in the AR-based LPC and MGC vocoders or the non-AR-based STRAIGHT and WORLD. Because of the powerful modeling capability of neural networks (NNs), we can replace the excitation module with an NN, as in the AR-based LPCNet and the non-AR-based GlotGAN [B. Bollepalli+, 2017] and GELP [L. Juvela+, 2019], or replace the filtering module with an NN, as in the neural source-filter (NSF) model [X. Wang+, 2019].

Comparison of source-filter vocoders

On the other hand, a unified vocoder directly models the speech waveform with a single NN. The inputs can be acoustic or linguistic features, and the output is again a speech waveform. Although unified vocoders such as the AR-based WaveNet and WaveRNN or the non-AR-based Parallel WaveNet and ClariNet achieve high-fidelity speech generation, their purely data-driven nature with very limited prior knowledge makes them lack speech controllability, which is an essential feature of a vocoder. Therefore, we proposed a quasi-periodic (QP) structure to dynamically adapt the network structure according to the instantaneous pitch. Both the AR QPNet and the non-AR QPPWG vocoders attain better pitch controllability.

Comparison of unified vocoders
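As a teaser of the pitch-adaptive idea, the sketch below derives a per-frame dilation size from the instantaneous F0. The dense factor and the fallback F0 for unvoiced frames are illustrative assumptions; the precise formulation is covered in the next article.

```python
import numpy as np

def pitch_dependent_dilations(f0, fs=16000, dense_factor=4):
    """Per-frame dilation sizes derived from the instantaneous pitch (sketch only)."""
    f0 = np.where(f0 > 0, f0, 100.0)            # fall back to a nominal F0 for unvoiced frames
    return np.maximum(1, np.round(fs / (f0 * dense_factor))).astype(int)

f0_track = np.array([0.0, 100.0, 200.0, 400.0])  # toy F0 contour in Hz (0 = unvoiced)
print(pitch_dependent_dilations(f0_track))        # longer pitch period -> larger dilation
```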

Conclusion

In conclusion, the pro of RNN-based models is that the recurrent structure theoretically enables the network to model correlations of arbitrary length. However, the con is that the model complexity must be very limited due to the real-time generation requirement. As a result, current research on RNN-based speech generation focuses on modeling only the minimum essential information with a very limited model size, as in LPCNet.

On the other hand, the pro of CNN-based models is that we can take advantage of the parallelized computation of CNNs for real-time generation even if the model is very complicated. However, the con is that the fixed geometric structure limits the memory capacity of these CNN-based models. As a result, current research on CNN-based speech generation focuses first on non-AR modeling algorithms such as flow-based and GAN-based models, and then on adaptive networks with a free form of sampling grid, such as our QPPWG.

If you are interested in the advanced CNN-based vocoders with QP structure, please refer to the next article ([2020] Speech Generation 1: Quasi-Periodic Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network).
