Introduction to Speech and Signal Processing | Lifelike Speech Synthesis

Basic Knowledge about Speech Processing and Tacotron2

Prim Wong
Super AI Engineer
Aug 31, 2021

Text-to-Speech (TTS) models built with machine learning are an emerging technology and have become a preferred tool in many services.

Benefits of Text to Speech

Accessibility is Essential

Text to Speech enables natural, lifelike conversation that follows the way humans speak. It can be applied across a wide range of industries that aim to enhance customer experience and expand into global markets. Text to Speech makes services more user-friendly and efficient, which can save time and money. It also supports consistent branding across all touchpoints and improves accessibility for the growing number of elderly users and for people with literacy difficulties.

Signal Processing and Wave Basics

Before synthesizing speech, let’s look at the signal preprocessing stage: it is crucial to understand our dataset before training on it.

We will mainly cover three topics:

  1. The waveforms and frequency
  2. Fourier Transform
  3. Spectrogram and Mel-Spectrogram

Waveforms

A waveform is a signal plotted with amplitude (loudness) on the y-axis and time on the x-axis. Raw audio is one kind of waveform.
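
To make the amplitude-versus-time idea concrete, here is a minimal sketch that builds a synthetic waveform with NumPy. The helper name `make_sine_wave`, the 22,050 Hz sample rate, and the 440 Hz tone are illustrative assumptions, not values from the original project.

```python
import numpy as np

def make_sine_wave(freq_hz, duration_s, sample_rate=22050, amplitude=0.5):
    """Return (time_axis, samples) for a pure tone (hypothetical helper)."""
    n_samples = int(round(duration_s * sample_rate))
    t = np.arange(n_samples) / sample_rate           # x-axis: time in seconds
    y = amplitude * np.sin(2 * np.pi * freq_hz * t)  # y-axis: amplitude
    return t, y

# One second of a 440 Hz tone; each sample is one amplitude value in time.
t, y = make_sine_wave(freq_hz=440.0, duration_s=1.0)
print("number of samples:", len(y))
print("peak amplitude:", float(np.abs(y).max()))
```

Real recordings are loaded from audio files rather than generated, but they have the same shape: one amplitude value per time step.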

Waveform | Credit

Frequency

Fourier Transform

We decompose the original sound wave into its frequency components in order to extract more features and gain insight into the sound; the signal is also much easier to process in this form.

An audio signal can have a single channel (mono) or multiple channels (for example, stereo). To extract the frequency properties of the sound, we use the Fourier transform.
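
As a rough illustration of the decomposition, the sketch below uses NumPy's FFT on a toy mono signal made of two tones and recovers their frequencies. The sample rate and tone frequencies are assumptions chosen for the example.

```python
import numpy as np

sample_rate = 8000                        # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate  # one second of audio

# Toy mono signal: a 440 Hz tone plus a quieter 1000 Hz tone.
tone_mix = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1000 * t)

spectrum = np.fft.rfft(tone_mix)                           # complex frequency components
freqs = np.fft.rfftfreq(len(tone_mix), d=1 / sample_rate)  # frequency of each bin in Hz
magnitude = np.abs(spectrum)                               # strength of each frequency

# The two strongest bins correspond to the two tones in the mix.
top_two = np.sort(freqs[np.argsort(magnitude)[-2:]])
print("dominant frequencies (Hz):", top_two)
```

A single Fourier transform over the whole clip says which frequencies are present but not when they occur, which is what motivates the spectrogram.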

Spectrograms

"A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time." (Wikipedia)
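
A spectrogram can be sketched as a short-time Fourier transform: slice the signal into overlapping windows and take the Fourier transform of each one. The function below is a simplified version written for this article, with an assumed window size and hop length; libraries such as librosa offer more complete implementations.

```python
import numpy as np

def spectrogram(samples, n_fft=512, hop_length=128):
    """Magnitude spectrogram via a simple short-time Fourier transform.

    Returns an array of shape (n_fft // 2 + 1, n_frames):
    rows are frequency bins, columns are time frames.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(samples) - n_fft) // hop_length
    frames = np.stack([
        samples[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8000
t = np.arange(2 * sample_rate) / sample_rate
# Toy signal whose pitch changes halfway through: 500 Hz, then 1500 Hz.
samples = np.where(t < 1.0,
                   np.sin(2 * np.pi * 500 * t),
                   np.sin(2 * np.pi * 1500 * t))

spec = spectrogram(samples)
bin_freqs = np.fft.rfftfreq(512, d=1 / sample_rate)
first_peak = bin_freqs[spec[:, 0].argmax()]   # strongest frequency in the first frame
last_peak = bin_freqs[spec[:, -1].argmax()]   # strongest frequency in the last frame
print(spec.shape, first_peak, last_peak)
```

Unlike a single Fourier transform, the spectrogram shows the change in pitch over time: the strongest frequency differs between the first and last frames.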

The mel-spectrogram builds on the idea of how our ears work: we multiply the spectrogram by a mel filterbank and usually take the log of the result to get the mel-spectrogram.

mel spectrogram

The journey of the wave:

  1. Raw audio
  2. Spectrogram (apply the short-time Fourier transform to the raw audio)
  3. Mel-Spectrogram (map the frequency axis onto the mel scale, a roughly logarithmic scale that mimics how human ears perceive frequency)
Journey of the Wave
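
The three steps of the journey can be sketched end to end in NumPy. This is an illustrative implementation under stated assumptions: it uses the common formula mel = 2595 · log10(1 + f / 700), triangular filters, and example values for `n_fft`, `hop_length`, and `n_mels`. It is not the exact preprocessing of any particular Tacotron 2 codebase.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # n_mels + 2 edge points, evenly spaced on the mel scale.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_spectrogram(samples, sample_rate, n_fft=1024, hop_length=256, n_mels=80):
    # Step 2: spectrogram (short-time Fourier transform of the raw audio).
    window = np.hanning(n_fft)
    n_frames = 1 + (len(samples) - n_fft) // hop_length
    frames = np.stack([
        samples[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (n_frames, n_fft // 2 + 1)
    # Step 3: apply the mel filterbank, then take the log.
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-10).T                            # (n_mels, n_frames)

# Step 1: raw audio (here a synthetic one-second 440 Hz tone).
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
raw_audio = 0.5 * np.sin(2 * np.pi * 440.0 * t)

log_mel = log_mel_spectrogram(raw_audio, sample_rate)
print(log_mel.shape)
```

The small constant added before the log only avoids taking the log of zero; real pipelines differ in how they clip or normalize this value.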

One Full History of TTS

History of Thai TTS

Tacotron2

Tacotron 2 synthesizes very natural-sounding speech directly from text using machine learning, without requiring hand-specified acoustic features.

Tacotron 2 is the kind of neural network model that researchers around the globe have aimed for: it produces natural, human-like speech without sophisticated linguistic and acoustic features as input. To train Tacotron 2, the only data needed are the audio recordings and their script (transcript) files; at synthesis time, the input is just text.

Input of Tacotron

Why we chose Tacotron: the ease of data preparation

ALL WE NEED to train the model
is the AUDIO WAVE and the SCRIPT!
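
To show how light this data preparation can be, here is a small sketch that pairs audio files with their scripts. It assumes a pipe-separated text file with one `audio_path|transcript` pair per line, a layout used by several open-source TTS datasets and training scripts. The article does not say which format this project used, so the file layout, the `Utterance` class, and the example paths are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    audio_path: str  # path to the recorded audio wave
    text: str        # the script (transcript) spoken in that recording

def parse_filelist(lines, sep="|"):
    """Turn 'audio_path|transcript' lines into Utterance pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        audio_path, text = line.split(sep, 1)
        pairs.append(Utterance(audio_path.strip(), text.strip()))
    return pairs

# Hypothetical example entries; a real project would read these from a file.
example_lines = [
    "wavs/utt_0001.wav|Hello, how are you?",
    "",
    "wavs/utt_0002.wav|Text to speech is fun.",
]
pairs = parse_filelist(example_lines)
for p in pairs:
    print(p.audio_path, "->", p.text)
```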

Tacotron 2 Architecture

Tacotron 2

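Tacotron 2 is split into two main parts:

  1. Spectrogram Prediction
    The spectrogram prediction network is a sequence-to-sequence (seq2seq) encoder-decoder model with attention that predicts mel-spectrogram frames from the input text. The target mel-spectrograms used for training are computed from the audio with the Short-Time Fourier Transform (STFT).
  2. WaveNet vocoder
    A vocoder (voice encoder) analyzes and synthesizes speech; here, a WaveNet-based vocoder generates the audio waveform from the predicted mel-spectrogram. History of Vocoder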
Tacotron 2 Architecture

Conclusion

Signal Waves

We covered three topics about signal waves:

  1. The waveforms and frequency
  2. Fourier Transform
  3. Spectrogram and Mel-Spectrogram

One Full History of Text To Speech

TTS started from traditional methods and was then enhanced by machine-learning approaches to speech synthesis such as WaveNet. Today, we have Tacotron 2, with exceptional performance and easy data preparation.

Lifelike Speech Synthesis | Thai Text To Speech with Tacotron2
