GOING BEYOND TACOTRON

Deepanshi
OffNote Labs
Sep 18, 2020

Speech synthesis is the process of generating speech from input text. Despite decades of research on the subject, the task is still pretty challenging and offers a vast world of opportunities and experiments. The first complete text-to-speech system was built in 1968. Various Speech Synthesis models are currently being employed in spoken dialog systems for machine-human communication. Recently, there has been a flurry of research on speech synthesis using neural models.

In this article, we will discuss a prominent neural text-to-speech (TTS) model, Tacotron, and talk about tweaking its architecture to generate speech even for unseen target speakers.

Tacotron

Tacotron is a seq2seq model with attention, which includes an encoder, an attention-based decoder, and a post-processing net. At a high level, the model takes characters as input and produces spectrogram frames, which are then converted to waveforms. Predecessors of this model, like DeepVoice, trained each block separately, which can lead to errors accumulating across components. Char2Wav, another predecessor, predicts vocoder parameters rather than raw spectrograms. Tacotron, in contrast, is an end-to-end TTS system: it does not require hand-engineered features, adapts to new data easily, and is more robust, since it can be trained from scratch with random initialization.

Now, let’s discuss the architecture of the Tacotron model. It consists of an encoder, an attention-based decoder, and a post-processing network that acts as the vocoder.

Architecture

Fig1: Tacotron Architecture

Encoder: The encoder is composed of a Prenet and a CBHG block, and it extracts a sequential representation of the text. The input is a character sequence, with each character represented as a one-hot vector. The Prenet is a fully connected neural network with dropout, which helps convergence and improves generalization.
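As a rough illustration, here is a minimal PyTorch sketch of such a pre-net. The layer sizes (256 and 128 units) and the 0.5 dropout rate follow the paper’s hyperparameters, but treat the exact values and the module name as assumptions:

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    def __init__(self, in_dim=256, sizes=(256, 128), dropout=0.5):
        super().__init__()
        dims = [in_dim] + list(sizes)
        self.layers = nn.ModuleList(
            [nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])]
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Dropout is applied after every ReLU, which is what helps
        # convergence and generalization in the original model.
        for linear in self.layers:
            x = self.dropout(torch.relu(linear(x)))
        return x
```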

The encoder receives the input sequentially rather than the entire sentence at once, yet the meaning of a word often depends on the rest of the sentence. For example, “bank” can mean a riverbank or a financial institution; other words in the sentence, such as “river” or “money”, help identify the intended sense. The CBHG block captures this contextual information.

The CBHG block consists of a bank of 1-D convolutional filters, highway networks, and a bidirectional gated recurrent unit (GRU) recurrent neural network (RNN). The input is first fed into the bank of 1-D convolutional filters of varying widths. Their outputs are stacked together and max-pooled along time. The resulting sequence is passed through a few fixed-width 1-D convolutions, whose output is added to the original input sequence via a residual connection. All convolutional layers use batch normalization. The output is then fed into a multi-layer highway network, and finally a bidirectional GRU RNN is stacked on top to extract sequential features from both forward and backward context. Being recurrent, the GRU takes the outputs of previous time steps into account when building this context.

Fig2: CBHG block
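Below is a simplified PyTorch sketch of a CBHG-style block, assuming a 128-dimensional input, a bank of K = 8 convolutions, and a 4-layer highway network; the exact sizes and pooling settings are illustrative assumptions rather than the paper’s full configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)
        self.T = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))                 # transform gate
        return t * torch.relu(self.H(x)) + (1 - t) * x

class CBHG(nn.Module):
    def __init__(self, dim=128, K=8):
        super().__init__()
        # Bank of 1-D convolutions with kernel sizes 1..K
        self.bank = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2) for k in range(1, K + 1)]
        )
        self.bank_bn = nn.BatchNorm1d(K * dim)
        # Fixed-width projection convolutions
        self.proj1 = nn.Conv1d(K * dim, dim, kernel_size=3, padding=1)
        self.proj2 = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm1d(dim), nn.BatchNorm1d(dim)
        self.highways = nn.ModuleList([Highway(dim) for _ in range(4)])
        self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, x):                            # x: (batch, time, dim)
        residual = x
        x = x.transpose(1, 2)                        # (batch, dim, time) for Conv1d
        T = x.size(-1)
        # Convolution bank: stack outputs along the channel axis
        x = torch.cat([conv(x)[:, :, :T] for conv in self.bank], dim=1)
        x = F.relu(self.bank_bn(x))
        x = F.max_pool1d(x, kernel_size=2, stride=1, padding=1)[:, :, :T]
        # Fixed-width projections plus residual connection
        x = F.relu(self.bn1(self.proj1(x)))
        x = self.bn2(self.proj2(x)).transpose(1, 2) + residual
        for hw in self.highways:
            x = hw(x)
        out, _ = self.gru(x)                         # forward + backward context
        return out                                   # (batch, time, 2 * dim)
```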

Decoder: The decoder is an attention-based recurrent network that predicts Mel-spectrogram frames. Its sub-blocks are a pre-net, an attention RNN, and a decoder RNN. Adjacent Mel-spectrogram frames are usually very similar, so the pre-net is fed the last Mel frame from the previous time step; this carries information across steps and gives continuity to the generated speech. Likewise, the RNNs are not fed only the output of the preceding block: the pre-net output is concatenated with the attention context vector and fed into the attention RNN, and the attention RNN output, again concatenated with the context vector, is fed into the decoder RNN. The decoder RNN generates r Mel-spectrogram frames per decoding step.
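A minimal, hypothetical sketch of a single decoder step in PyTorch is shown below. Single-layer GRU cells, the 80-band Mel target, and r = 2 are simplifying assumptions; the paper uses stacked GRUs with residual connections:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, n_mels=80, prenet_dim=128, rnn_dim=256, r=2):
        super().__init__()
        self.n_mels, self.r = n_mels, r
        # Pre-net over the last predicted Mel frame
        self.prenet = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
        )
        self.attention_rnn = nn.GRUCell(prenet_dim + rnn_dim, rnn_dim)
        self.decoder_rnn = nn.GRUCell(rnn_dim + rnn_dim, rnn_dim)
        self.mel_proj = nn.Linear(rnn_dim, n_mels * r)   # r frames per step

    def forward(self, prev_mel, context, attn_hidden, dec_hidden):
        # prev_mel: last Mel frame of the previous step, (batch, n_mels)
        # context: attention context over encoder outputs, (batch, rnn_dim)
        x = self.prenet(prev_mel)
        attn_hidden = self.attention_rnn(torch.cat([x, context], dim=-1), attn_hidden)
        dec_hidden = self.decoder_rnn(torch.cat([attn_hidden, context], dim=-1), dec_hidden)
        frames = self.mel_proj(dec_hidden).view(-1, self.r, self.n_mels)
        return frames, attn_hidden, dec_hidden
```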

Post-Processing: The post-processing net uses another CBHG block to convert the predicted Mel spectrogram into a linear-scale spectrogram, from which the Griffin-Lim algorithm reconstructs the waveform.
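For instance, given a predicted linear-scale magnitude spectrogram, the waveform can be reconstructed with librosa’s Griffin-Lim implementation; the FFT/hop sizes and iteration count below are illustrative assumptions:

```python
import numpy as np
import librosa

def linear_spec_to_audio(magnitude_spec, n_iter=60, hop_length=256, win_length=1024):
    # magnitude_spec: (1 + n_fft // 2, frames) linear-scale magnitude spectrogram
    return librosa.griffinlim(
        magnitude_spec,
        n_iter=n_iter,          # number of phase-estimation iterations
        hop_length=hop_length,
        win_length=win_length,
    )

# Example: reconstruct audio from a random "spectrogram" (placeholder data)
spec = np.abs(np.random.randn(513, 200)).astype(np.float32)
audio = linear_spec_to_audio(spec)
```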

Comparing Tacotron with a vanilla seq2seq model shows that the vanilla model produces speech with poor naturalness, while the CBHG module reduces overfitting and generalizes well to long and complex phrases. Replacing the CBHG encoder with a 2-layer residual GRU encoder also results in noisier output.

The mean opinion score (MOS) is a measure of the overall quality of the generated speech. The MOS for Tacotron is 3.82. The main issue with Tacotron is that the speech sounds rather robotic, which is mainly caused by the Griffin-Lim synthesizer.

Tacotron 2 was developed to address these shortcomings of the Tacotron model.

Tacotron 2

Tacotron 2 combines the best of two approaches: a sequence-to-sequence, Tacotron-style model that generates Mel spectrograms, followed by a modified WaveNet vocoder. The original Tacotron model uses Bahdanau attention, a purely content-based attention method. Because it does not consider location information, it has to learn the text-to-speech alignment from content alone, which is hard. Tacotron 2 instead uses location-sensitive attention, which also takes the attention weights from previous decoder steps into account, encouraging the alignment to move forward monotonically during speech generation.
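A minimal PyTorch sketch of location-sensitive attention is shown below. The layer sizes (128-dim attention space, 32 location filters, kernel size 31) follow the Tacotron 2 paper, but using only the cumulative attention weights as the location feature is a simplification, and the module name is hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    def __init__(self, query_dim=1024, enc_dim=512, attn_dim=128,
                 n_filters=32, kernel_size=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(enc_dim, attn_dim, bias=False)
        self.location_conv = nn.Conv1d(1, n_filters, kernel_size,
                                       padding=kernel_size // 2, bias=False)
        self.location_layer = nn.Linear(n_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, cum_attn_weights):
        # query: decoder state (batch, query_dim)
        # memory: encoder outputs (batch, time, enc_dim)
        # cum_attn_weights: accumulated attention weights so far (batch, time)
        loc = self.location_conv(cum_attn_weights.unsqueeze(1))   # (batch, filters, time)
        loc = self.location_layer(loc.transpose(1, 2))            # (batch, time, attn_dim)
        energies = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1)                  # content term (query)
            + self.memory_layer(memory)                           # content term (keys)
            + loc                                                 # location term
        )).squeeze(-1)                                            # (batch, time)
        attn_weights = F.softmax(energies, dim=-1)
        context = torch.bmm(attn_weights.unsqueeze(1), memory).squeeze(1)
        return context, attn_weights
```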

Let’s briefly go over the architecture details of tacotron2.

Architecture

The network is composed of an encoder and a decoder with attention. The encoder takes in the character sequence and converts it into a feature representation. The input characters are represented by a 512-dimensional character embedding, which is passed through a stack of 3 convolutional layers that help model longer-term context. The output of the final convolutional layer is fed to a single bidirectional LSTM, producing the encoded features. The encoder output is consumed by an attention network, which summarizes it into a fixed-length context vector at each decoder step.
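As a rough sketch, the encoder described above might look like this in PyTorch (512-dim embeddings, 3 convolutions with kernel size 5, one bidirectional LSTM with 256 units per direction); treat the dropout rate and exact sizes as assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tacotron2Encoder(nn.Module):
    def __init__(self, n_symbols, emb_dim=512, kernel_size=5, n_convs=3):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, emb_dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(emb_dim, emb_dim, kernel_size, padding=kernel_size // 2),
                nn.BatchNorm1d(emb_dim),
            )
            for _ in range(n_convs)
        ])
        # Bidirectional LSTM with 256 units per direction -> 512-dim outputs
        self.lstm = nn.LSTM(emb_dim, emb_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                        # (batch, time) of character ids
        x = self.embedding(char_ids).transpose(1, 2)    # (batch, emb_dim, time)
        for conv in self.convs:
            x = F.dropout(F.relu(conv(x)), p=0.5, training=self.training)
        x = x.transpose(1, 2)                           # back to (batch, time, emb_dim)
        outputs, _ = self.lstm(x)                       # encoded features for attention
        return outputs
```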

Fig3: Tacotron2 Architecture

The decoder is an autoregressive recurrent neural network that predicts the Mel spectrogram from the encoded sequence, one frame at a time. The prediction from the previous time step is passed through 2 fully connected layers (the pre-net). The pre-net output and the attention context vector are concatenated and passed through a stack of 2 unidirectional LSTM layers. The concatenation of the LSTM output and the attention context is projected through a linear transform to predict the target spectrogram frame. Finally, the predicted Mel spectrogram is fed to a 5-layer convolutional post-net, which predicts a residual that is added to the prediction to improve the overall reconstruction.
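The residual post-net can be sketched as follows in PyTorch; the 512 channels and kernel size 5 follow the paper, while the rest is an illustrative simplification:

```python
import torch
import torch.nn as nn

class Postnet(nn.Module):
    def __init__(self, n_mels=80, channels=512, kernel_size=5, n_layers=5):
        super().__init__()
        layers = []
        in_ch = n_mels
        for i in range(n_layers):
            out_ch = n_mels if i == n_layers - 1 else channels
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
                       nn.BatchNorm1d(out_ch)]
            if i < n_layers - 1:
                layers.append(nn.Tanh())        # tanh on all but the final layer
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, mel):                     # mel: (batch, n_mels, time)
        return mel + self.net(mel)              # add the predicted residual back
```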

Tacotron 2 uses mean squared error (MSE), or L2 loss, instead of the mean absolute error (MAE), or L1 loss, used in Tacotron. The MOS for Tacotron 2 is 4.526. One issue with both Tacotron models is that they cannot produce speech for different speakers; in other words, we cannot pass a speaker’s characteristics as input to the model.
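For reference, the two losses over predicted and target Mel frames can be computed like this in PyTorch (the tensor shapes are placeholders):

```python
import torch
import torch.nn as nn

pred = torch.randn(16, 80, 100)        # (batch, n_mels, frames), placeholder data
target = torch.randn(16, 80, 100)

l1_loss = nn.L1Loss()(pred, target)    # MAE / L1, used in Tacotron
l2_loss = nn.MSELoss()(pred, target)   # MSE / L2, used in Tacotron 2
```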

Voice cloning

Voice cloning refers to generating speech in the voice of a given new speaker. Both TTS and voice cloning models generate speech as output, but plain TTS models do not pay much attention to speaker characteristics.

The conventional approach to voice cloning, training the model on whatever data is available, faces several roadblocks. Training a model to generate speech for different speakers requires hours of clean, noise-free data per speaker, and the resulting model still cannot produce speech for unseen target speakers.

To address these issues, voice cloning systems rely on techniques such as speaker adaptation and speaker encoding.

  • Speaker adaptation fine-tunes a pre-trained multi-speaker generative model on the available data for the new speaker. This is the most common way to clone a voice, but the model still cannot generate speech for unknown targets, and the quality of speech produced from such limited data is questionable.
  • Speaker encoding trains a separate model to generate a speaker embedding, which is then fed to a multi-speaker generative model. The idea is to encode the speaker’s characteristics, such as pitch, speech rate, and accent, in a vector and condition the synthesis on it. A pre-trained multi-speaker model can be tweaked to produce speaker embeddings, which can serve as ground truth for training the speaker encoder.

Speaker Adaptation with Tacotron model

Tacotron can generate speech with good MOS and naturalness but fails to generate speech for multiple speakers. This is because the model processes linguistic and speaker features together rather than separately. Changing the model to incorporate speaker embeddings can turn it into a complete text-to-speech system for multiple speakers. The Tacotron encoder-decoder extracts a text embedding and generates Mel frames with common, non-speaker-specific characteristics; without a speech encoder, it is hard to extract speaker-specific features. Adding a speech encoder that models a given speaker’s characteristics solves this problem.

One way to extract speaker characteristics is through x-vectors. X-vectors are compressed embeddings taken from a deep neural network-based speaker classification model (a and b in Fig4). A speaker classification model takes a speech sample and predicts the probability of the sample belonging to each of the known speakers; the compressed vector from one of the pre-final layers is extracted as the x-vector.

Fig4: X-vector extraction method
Fig5: Speaker adaptation with Tacotron model
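A hypothetical sketch of this extraction in PyTorch is shown below; the frame-level layers, statistics pooling, and layer sizes are assumptions loosely based on the x-vector recipe rather than an exact reproduction:

```python
import torch
import torch.nn as nn

class SpeakerClassifier(nn.Module):
    def __init__(self, n_feats=40, emb_dim=512, n_speakers=1000):
        super().__init__()
        # Frame-level layers (a simplification of the TDNN layers in x-vectors)
        self.frame_net = nn.Sequential(
            nn.Conv1d(n_feats, 512, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Segment-level layer after statistics pooling (mean + std over time)
        self.segment_net = nn.Linear(2 * 512, emb_dim)
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, feats):                    # feats: (batch, n_feats, frames)
        h = self.frame_net(feats)
        stats = torch.cat([h.mean(dim=-1), h.std(dim=-1)], dim=-1)
        embedding = self.segment_net(stats)      # pre-final layer -> "x-vector"
        logits = self.classifier(embedding)      # per-speaker scores
        return logits, embedding

# At cloning time we discard the classifier head and keep only the embedding
model = SpeakerClassifier()
feats = torch.randn(1, 40, 300)                  # placeholder acoustic features
_, xvector = model(feats)                        # (1, 512) speaker embedding
```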

Concatenating (or adding) the outputs of the speech and text encoders and feeding the result to the decoder lets the model generate speech for unseen target speakers. The decoder architecture has to change slightly so that it only generates Mel spectrograms and does not have to learn speaker characteristics itself.
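A hypothetical sketch of this conditioning step in PyTorch, assuming outputs from a text encoder and a speaker encoder like the ones discussed above (the tensors here are placeholders), broadcasts the speaker embedding across time and concatenates it with the encoder outputs:

```python
import torch

def condition_on_speaker(encoder_outputs, speaker_embedding):
    # encoder_outputs: (batch, time, enc_dim) from the text encoder
    # speaker_embedding: (batch, spk_dim) from the speech/speaker encoder
    time_steps = encoder_outputs.size(1)
    speaker = speaker_embedding.unsqueeze(1).expand(-1, time_steps, -1)
    # The decoder then attends over these speaker-aware encoder features
    return torch.cat([encoder_outputs, speaker], dim=-1)   # (batch, time, enc_dim + spk_dim)

enc_out = torch.randn(1, 120, 256)      # placeholder text-encoder outputs
spk_emb = torch.randn(1, 512)           # placeholder x-vector
decoder_memory = condition_on_speaker(enc_out, spk_emb)
```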

There are several recent developments in speech synthesis and voice cloning. We discuss two models briefly.

  • Nautilus model: This paper talks about modeling the speech and speaker features with separate encoder and decoder blocks for speech and speaker characteristics.
  • MultiSpeech model: This paper discusses a transformer-based model with architectural changes aimed at better text-to-speech alignment. Its voice cloning results outperformed those of the other TTS models, and it can generate speech for unseen targets with high naturalness.

A number of startups are trying to develop natural-sounding speech synthesizers. Replica, a recent startup, is developing a new generation of speech synthesis technology and offers its services to developers and companies through an API. The company promises studio-like quality for the cloned speech. An interesting opportunity it offers is the ability for individuals and artists to license their own voices. A large number of podcasts have already been made using the Replica API.

The evolution of speech synthesis relies heavily on machines generating and processing features to produce the best possible results without human assistance. Today, speech synthesis is a widely appreciated technology, applied primarily in assistive tools for people with disabilities. New synthesis techniques under development in speech research laboratories will play a key role in future human-machine interaction.
