[2020] Speech Generation 1: Quasi-Periodic Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

Yi-Chiao Wu (吳宜樵)
8 min read · Aug 2, 2020

This article introduces our proposed QPNet and QPPWG, which are built on a pitch-dependent dilated convolution neural network (PDCNN) and a quasi-periodic (QP) structure. If you are interested, you can also access the video version or the Mandarin (中文) version.

Categories of Vocoders

In the last article ([2020] Speech Generation 0: Vocoder and RNN- and CNN-based Speech Waveform Generative Models), we summarized vocoder techniques into two main categories. The first one is the source-filter vocoder, which includes an excitation generation module and a resonance filtering module; the inputs are acoustic features, and the output is a speech waveform. The second one is the unified vocoder, which directly models the speech waveform with a single neural network; the inputs can be acoustic or linguistic features, and the output is also a speech waveform.

Although these neural vocoders achieve high-fidelity speech generation, their data-driven nature, with very limited prior knowledge of speech, leaves them with little acoustic controllability, such as pitch controllability. To tackle this problem, we propose an adaptive network that introduces prior pitch information into the network to improve pitch controllability. In this article, the proposed pitch-dependent dilated convolution neural network (PDCNN) and quasi-periodic (QP) structure are presented.

Index Terms — Neural vocoder, WaveNet, parallel WaveGAN, pitch-dependent dilated convolution, quasi-periodic structure

Problem of Unified Vocoder

As we know, speech is a quasi-periodic signal, which includes periodic and aperiodic components. The periodic component has a long-term correlation, and the aperiodic component has a short-term correlation.

Periodic and aperiodic signals

Inefficient speech modeling

As a result, modeling speech with a fixed, unified network without any prior knowledge of audio periodicity is inefficient. For example, as shown in the following figure, the fixed receptive field length and sampling sparsity make the network oversample the periodic signal; that is, the receptive field includes many redundant samples.

Oversampling of the periodic signal

Limited pitch controllability

Furthermore, because the unified model has no prior knowledge of pitch, unified vocoders do not explicitly model the periodic components. Therefore, it is difficult for these vocoders to generate speech with an accurate pitch when conditioned on unseen acoustic features, such as F0 values outside the observed F0 range of the training data or unseen spectral-F0 pairs.

Quasi-Periodic Waveform Generative Model

PDCNN and QP Structure

The main drawback of CNN/dilated CNN (DCNN) is the fixed geometric structure. Specifically, a convolution with a size-two kernel can be formulated as

y_t^(o) = W^(c) · y_t^(i) + W^(p) · y_{t−d}^(i),

where y denotes a feature map, (i) denotes input, and (o) denotes output. W^(c) and W^(p) are trainable weight matrices for the current and previous samples, respectively, t is the time index, and d is the dilation size. For CNN, d is set to one. For DCNN, d is pre-defined and time-invariant. However, for our proposed PDCNN, the dilation size d′ is pitch-dependent and time-variant. That is, the original dilation size d is multiplied by a time-variant, pitch-dependent scale E to get the dilation size d′ = E × d of PDCNN.

Pitch-dependent dilated convolution neural network
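The scale above can be sketched in a few lines. Following the article, E shrinks as the instantaneous F0 rises; the exact names (`sample_rate`, `dense_factor`) and the rounding to an integer offset are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the pitch-dependent dilation d' = E * d (names assumed).
def pitch_scale(f0_hz, sample_rate=16000, dense_factor=4):
    """Time-variant scale E for a voiced sample with instantaneous F0.

    dense_factor controls how many convolution taps fall inside one
    pitch cycle; a larger value gives a denser sampling grid.
    """
    return sample_rate / (f0_hz * dense_factor)

def pdcnn_dilation(d, f0_hz, sample_rate=16000, dense_factor=4):
    """Pitch-dependent dilation d' = E * d, rounded to a sample offset."""
    return max(1, round(d * pitch_scale(f0_hz, sample_rate, dense_factor)))

# A lower pitch (longer period) yields a larger gap between kernel taps,
# so the receptive field stretches together with the pitch cycle.
print(pdcnn_dilation(1, 100))  # -> 40 samples at 16 kHz
print(pdcnn_dilation(1, 400))  # -> 10 samples at 16 kHz
```

Because E is recomputed per sample from the instantaneous F0, the dilation follows the pitch contour over time rather than staying fixed as in a DCNN.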

For the implementation of PDCNN, since directly modifying the CNN module to dynamically change the dilation size is difficult and not hardware-friendly, we separate the CNN kernel and dynamically index the input feature map, which is equivalent to changing the dilation size of each convolution computation. Specifically, the result of a 2×1 CNN is equivalent to the summation of the results of two 1×1 CNNs, namely CNN^(c) for the current samples and CNN^(p) for the past samples. Therefore, we can index the input feature map of the 1×1 CNN^(p) to dynamically change the dilation size of each sample based on the instantaneous F0.

Implementation trick of PDCNN: CNN kernel separation and feature map indexing
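The kernel-separation trick can be sketched in NumPy. Shapes and names here are illustrative, not the authors' code: the "past" branch gathers its input at a per-sample pitch-dependent offset d′(t) instead of a fixed dilation, and out-of-range indexes are treated as causal zero-padding.

```python
import numpy as np

def pdcnn_layer(x, w_current, w_past, dilations):
    """x: (T, C_in); w_*: (C_out, C_in); dilations: per-sample int offsets."""
    T = x.shape[0]
    current = x @ w_current.T                      # 1x1 conv on current samples
    t_idx = np.arange(T) - np.asarray(dilations)   # time-variant past indexes
    valid = (t_idx >= 0)[:, None]                  # causal zero-padding mask
    past_in = np.where(valid, x[np.maximum(t_idx, 0)], 0.0)
    past = past_in @ w_past.T                      # 1x1 conv on past samples
    return current + past

# With identity 1x1 kernels and a constant dilation of 2, each output is
# simply x[t] + x[t-2] (zero before the signal starts).
x = np.arange(6.0).reshape(6, 1)
w = np.ones((1, 1))
y = pdcnn_layer(x, w, w, [2] * 6)
print(y.ravel())  # -> [0. 1. 2. 4. 6. 8.]
```

Passing a different offset per time step (computed from the instantaneous F0) is what makes the effective dilation time-variant without touching the convolution module itself.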

With the pitch-dependent dilation size, each sample has a specific effective receptive field length corresponding to its pitch. For example, as shown in the following figure, although signals (a) and (b) have different frequencies, their effective receptive fields still contain the same number of cycles because the lengths of the convolution gaps change accordingly. In other words, the sparsity of the CNN sampling grid varies with the pitch.

Pitch-dependent receptive field
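A quick arithmetic check of the "same number of cycles" claim, using the assumed scale E = sample_rate / (f0 × dense_factor): the gap between two kernel taps then spans d′ / (sample_rate / f0) = d / dense_factor pitch cycles, regardless of F0.

```python
# Worked check: the convolution gap covers a pitch-invariant number of cycles.
def cycles_per_gap(d, f0, sample_rate=16000, dense_factor=4):
    d_prime = d * sample_rate / (f0 * dense_factor)  # pitch-dependent dilation
    samples_per_cycle = sample_rate / f0             # pitch period in samples
    return d_prime / samples_per_cycle

print(cycles_per_gap(2, 100))  # -> 0.5 cycles
print(cycles_per_gap(2, 400))  # -> 0.5 cycles: pitch-invariant coverage
```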

Since speech has periodic and aperiodic components, we propose a QP structure to simultaneously model them. The QP structure is composed of cascaded fixed and adaptive blocks. The fixed blocks with fixed network structures adopt DCNNs to model the short-term correlations of aperiodic components. The adaptive blocks adopt PDCNNs to model the long-term correlations of periodic components, and its network architecture is dynamically changed according to the instantaneous F0.
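The cascade can be pictured as two dilation schedules. This is a hypothetical sketch (block counts and the exponential pattern are illustrative only): fixed blocks keep the usual time-invariant dilations for short-term aperiodic structure, while adaptive blocks scale theirs by the pitch-dependent factor E for long-term periodic structure.

```python
# Illustrative QP dilation schedule: fixed (DCNN) vs. adaptive (PDCNN) blocks.
def qp_dilation_schedule(n_fixed, n_adaptive, f0, sample_rate=16000, dense_factor=4):
    fixed = [2 ** i for i in range(n_fixed)]            # e.g. 1, 2, 4, 8 (DCNN)
    scale = sample_rate / (f0 * dense_factor)           # E; per-sample in practice
    adaptive = [max(1, round((2 ** i) * scale)) for i in range(n_adaptive)]
    return fixed, adaptive

print(qp_dilation_schedule(4, 4, 200))
# -> ([1, 2, 4, 8], [20, 40, 80, 160]) at 16 kHz
```

In the real models E varies sample by sample with the F0 contour, so the adaptive dilations change over time, while the fixed blocks' dilations never do.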

QPNet

First, we apply the QP structure to the WaveNet vocoder [A. Tamamori+, 2017]. The main difference from WaveNet is that the proposed QPNet replaces several fixed blocks of WaveNet with the proposed adaptive blocks to improve pitch controllability and modeling efficiency. With the QP structure, QPNet achieves similar speech quality and higher pitch accuracy than WaveNet while its model size is half that of WaveNet. However, even with the 50% size reduction, the network is still large, and together with the autoregressive (AR) mechanism, its generation remains far from real-time.

QPNet architecture (The box in the lower-right corner is the zoom-in version of the residual blocks)

QPPWG

As a result, we apply the QP structure to Parallel WaveGAN (PWG) [R. Yamamoto+, 2020], which is a compact, non-AR generative model. The proposed QPPWG inherits the discriminator and the multi-resolution STFT loss of PWG, and the main improvement is applying the QP structure to the PWG generator. Even though the model size of PWG is only 3% of WaveNet's, QPPWG further reduces the model size by 30% while achieving similar speech quality and higher pitch controllability.

QPPWG architecture (The box in the lower-right corner is the zoom-in version of the residual blocks)

On the other hand, compared with QPNet, the main differences of QPPWG are its non-AR mechanism, Gaussian noise input, raw waveform output, and a model size of only 5% of QPNet's. According to our objective and subjective evaluations, the QP structure is effective in both WaveNet- and PWG-like models: it gives these models better pitch controllability with a smaller model size while keeping similar speech quality.

Discussion

Understanding of QP Structure

Since PWG and QPPWG directly output raw waveform samples, we can easily dissect the models to understand their internal generation mechanisms. From the visualized cumulative intermediate outputs, we find that PWG gradually generates the harmonic and non-harmonic components together. In contrast, the QPPWG model with an adaptive-to-fixed order first generates the harmonic components and then the non-harmonic components. That is, the first 10 adaptive blocks focus on modeling pitch-related components, and the last 10 fixed blocks focus on modeling spectral-related components.

In contrast to the QPPWG model with an adaptive-to-fixed order, the QPPWG model with a fixed-to-adaptive order first generates the non-harmonic components and then the harmonic components. The visualized results confirm our assumption that the adaptive blocks model the pitch-related components with long-term dependencies, while the fixed blocks model the spectral-related components with short-term dependencies.

Visualized intermediate outputs of PWG
Visualized intermediate outputs of QPPWG (adaptive → fixed)
Visualized intermediate outputs of QPPWG (fixed → adaptive)

QPPWG and NSF

Although QPPWG is a unified vocoder, its cascaded network structure is very similar to the source-filter model: the adaptive blocks are analogous to the excitation generation, and the fixed blocks are analogous to the spectral filtering. Compared with the neural source-filter (NSF [X. Wang+, 2019]) model, which also adopts a non-AR DCNN architecture, the main difference is that the excitation generation of QPPWG is performed by a neural network.

Comparison between QPPWG and NSF

PDCNN and Deformable CNN

The idea of a dynamically updated attention mechanism that lets a network know "where to look" at each time step is not new. Deformable CNN [J. Dai+, 2017] is an example, which achieves marked improvements in object detection. Specifically, for a CNN, the sampling grid is fixed, so the coverage of a single kernel is limited. The straightforward way to enlarge the coverage while keeping the computation cost the same is to increase the dilation size, as in DCNN. However, the fixed offsets of the sampling grid are inefficient, and the enlarged coverage may contain many undesired parts. Therefore, the authors of deformable CNN proposed learnable, time-variant sampling offsets to make the network focus on the desired areas.

Comparison among CNN, DCNN, and deformable CNN

The idea is very similar to our proposed PDCNN, which changes the indexes of the input feature map. The main difference is that deformable CNN adopts a neural network to predict the indexes, while the indexes of PDCNN are parametrically determined by the input pitch and the sampling rate. Therefore, PDCNN can be regarded as a special case of deformable CNN with prior knowledge of pitch. Moreover, deformable CNN indexes the input feature map once in each layer, but the proposed PDCNN changes the dilation size for each kernel sampling, which is why we need the kernel separation.

Comparison between PDCNN and deformable CNN

Conclusion

For our proposed modules and models, PDCNN is very simple and can be easily integrated into any CNN-based model. The QP structure makes a pitch-dependent adaptive network available, and its source-filter-like architecture is more tractable and interpretable. Compared with the original WaveNet and PWG, the proposed QPNet and QPPWG respectively achieve better pitch controllability and similar speech quality with smaller models.

References

Demo

https://bigpon.github.io/QuasiPeriodicWaveNet_demo/

https://bigpon.github.io/QuasiPeriodicParallelWaveGAN_demo/

Source Code

https://github.com/bigpon/QPNet

https://github.com/bigpon/QPPWG

Paper

https://arxiv.org/abs/2007.05663

https://arxiv.org/abs/2007.12955
