What is speech synthesis technology? And what are the categories of speech synthesis?

artificial intelligence
5 min read · Oct 20, 2022


1. What is speech synthesis technology?

Speech recognition and speech synthesis are the two key technologies for realizing human-computer speech communication and for building spoken-language systems that can both listen and speak. Together they give computers the ability to talk and to understand human speech, and they became an important competitive market for the information industry in the 1990s. Compared with speech recognition, speech synthesis is relatively mature, and it is one of the technologies in this field most likely to see breakthroughs and industrialization in the near term.

Speech synthesis, or computer-generated speech, can be approached in two ways. The first is for the machine to reproduce pre-stored speech signals, much like an ordinary tape recorder but using digital storage. Simply stitching together pre-stored monosyllables or phrases also counts as machine speech, but such "one word at a time" output is hard to accept because of its mechanical flavor. If, however, enough speech units are stored in advance and appropriate techniques are used to select and splice the required units at synthesis time, highly natural speech can be produced. This is the waveform-splicing (concatenative) method of speech synthesis. To save storage capacity, the speech signals can be compressed before they are stored in the machine.

The second approach uses digital signal processing to model the human vocal process: a source that simulates the state of the glottis excites a time-varying digital filter that characterizes the resonant properties of the vocal tract. The source is either a periodic pulse sequence, representing vocal-fold vibration for voiced sounds, or a random-noise sequence, representing unvoiced sounds produced without vocal-fold vibration. Adjusting the parameters of the filter is equivalent to changing the shape of the oral cavity and vocal tract, and thereby controls which sound is produced, while adjusting the period or intensity of the excitation pulse sequence changes the pitch and stress of the synthesized speech. As long as the excitation source and the filter parameters are controlled correctly, this model can flexibly synthesize a wide variety of utterances; it is therefore called the parameter-synthesis method. Depending on the structure of the time-varying filter, it takes the form of LPC synthesis or formant synthesis.
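The source-filter idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production synthesizer: the 8 kHz sample rate, the 120 Hz pitch, and the three formant frequencies and bandwidths (roughly /a/-like values) are all assumptions chosen for the example, and the vocal tract is modeled as a cascade of standard second-order digital resonators.

```python
import math

SAMPLE_RATE = 8000  # Hz, assumed for this sketch

def resonator(x, freq, bw, fs=SAMPLE_RATE):
    """Second-order digital resonator modeling one vocal-tract formant.

    Difference equation: y[n] = a*x[n] + b*y[n-1] + c*y[n-2],
    with coefficients derived from the formant frequency and bandwidth.
    """
    r = math.exp(-math.pi * bw / fs)
    c = -r * r
    b = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to unity
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = a * s + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(f0=120.0,
                     formants=((700, 130), (1220, 70), (2600, 160)),
                     duration=0.3, fs=SAMPLE_RATE):
    """Voiced excitation (impulse train at pitch f0) passed through a
    cascade of formant resonators; changing f0 changes the pitch,
    changing the formants changes which vowel is heard."""
    n = int(duration * fs)
    period = int(fs / f0)
    excitation = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    signal = excitation
    for freq, bw in formants:
        signal = resonator(signal, freq, bw, fs)
    return signal

samples = synthesize_vowel()
```

Replacing the impulse-train excitation with random noise would model an unvoiced sound, exactly the source switch described above.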

Speech synthesis has been studied for more than two hundred years, but modern speech synthesis of real practical value developed only with the advance of computer technology and digital signal processing. Its main goal is to enable computers to produce continuous speech with high intelligibility and high naturalness.

2. Parameter synthesis

In the development of speech synthesis technology, early research relied mainly on parameter synthesis; Holmes's parallel formant synthesizer and Klatt's cascade (serial) formant synthesizer are particularly worth mentioning. With careful tuning of the parameters, both can synthesize very natural speech. The most representative text-to-speech system of this kind was DEC's DECtalk (1987). After years of research and practice, however, it became clear that although a formant synthesizer can produce many realistic utterances, accurate formant parameters are difficult to extract, and the overall quality of the synthesized speech fell short of the practical requirements of text-to-speech systems. Since the late 1980s, speech synthesis has made new progress. In particular, the pitch-synchronous overlap-add (PSOLA) method (1990) greatly improved the timbre and naturalness of speech synthesized by time-domain waveform splicing. In the early 1990s, text-to-speech systems based on PSOLA were successfully developed for French, German, English, Japanese and other languages, with higher naturalness than earlier systems based on LPC or formant synthesis. Because a PSOLA-based synthesizer is simple and easy to implement in real time, it has great commercial prospects.
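The core of time-domain PSOLA can be sketched as follows. This is a deliberately simplified illustration that assumes a constant, known pitch period (a real system first detects pitch marks from the signal): two-period Hann-windowed grains are extracted at each analysis pitch mark and then overlap-added at marks spaced closer together or farther apart, which raises or lowers the pitch without changing the formants.

```python
import math

def hann(n):
    """Hann window of length n, used to taper each analysis grain."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def psola_pitch_shift(signal, period, factor):
    """Simplified TD-PSOLA: constant pitch period assumed.

    Extracts a two-period windowed grain around every analysis pitch
    mark, then overlap-adds the grains at synthesis marks spaced
    period/factor samples apart (factor > 1 raises the pitch).
    """
    grain_len = 2 * period
    window = hann(grain_len)
    # analysis pitch marks, one per period, away from the signal edges
    marks = range(period, len(signal) - period, period)
    grains = []
    for m in marks:
        seg = signal[m - period : m + period]
        grains.append([s * w for s, w in zip(seg, window)])
    # synthesis marks are re-spaced to change the fundamental frequency
    new_period = int(round(period / factor))
    out = [0.0] * (len(grains) * new_period + grain_len)
    for k, grain in enumerate(grains):
        start = k * new_period
        for i, v in enumerate(grain):
            out[start + i] += v
    return out
```

Because each grain keeps the original waveform shape around a pitch mark, the spectral envelope (and so the timbre) is largely preserved while the repetition rate, and hence the perceived pitch, changes.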

3. The three levels of speech synthesis by speech function

According to the different levels of human speech function, speech synthesis can be divided into three levels: text-to-speech (Text-to-Speech) synthesis, concept-to-speech (Concept-to-Speech) synthesis, and intention-to-speech (Intention-to-Speech) synthesis.

These three levels reflect different processes by which the human brain forms the content of speech, and they involve its higher neural activity. To synthesize high-quality speech, a system must rely not only on various rules (semantic, lexical and phonological) but also on a good understanding of the text itself, which raises the problem of natural language understanding. From this point of view, a text-to-speech system can actually be regarded as an artificial intelligence system. Text-to-speech conversion first turns a text sequence into a phoneme sequence and then generates a speech waveform with a speech synthesizer. The first step is linguistic processing: word division, word-to-sound conversion, and an effective set of prosody-control rules. The second step requires advanced speech synthesis technology that can produce a high-quality speech stream in real time on demand. In general, therefore, a text-to-speech system needs a complex set of procedures for converting text sequences into phoneme sequences, which means it must apply not only digital-signal-processing techniques but also a large amount of linguistic knowledge. The speech synthesizer itself remains the most basic component; it serves as the artificial mouth, and no speech synthesis system, including a text-to-speech system, can do without it.
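The first stage of the pipeline above, turning text into a phoneme sequence, can be sketched with a dictionary lookup. The two-entry lexicon and the ARPAbet-style phoneme symbols here are invented for illustration only; a real front end uses a large pronunciation dictionary plus letter-to-sound rules for words it does not know.

```python
# Hypothetical miniature lexicon mapping words to phoneme lists
# (ARPAbet-style symbols, chosen for this example only).
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "synthesis": ["S", "IH", "N", "TH", "AH", "S", "AH", "S"],
}

def text_to_phonemes(text):
    """TTS front end, stage one: word division, then word-to-sound lookup."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")          # crude punctuation handling
        if word in LEXICON:
            phonemes.extend(LEXICON[word])
        else:
            phonemes.append("<UNK>")       # a real system would fall back
                                           # to letter-to-sound rules here
    return phonemes
```

The resulting phoneme sequence, together with prosody-control information (duration, pitch contour, stress), is what the second stage, the speech synthesizer, turns into a waveform.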

For more information on speech synthesis, please check: What are the techniques and methods of speech synthesis?
