What are the techniques and methods of speech synthesis?

artificial intelligence
5 min read · Oct 9, 2022


Speech synthesis has been a focus of research worldwide in recent years. With the development of computer science and the networked society, speech information service systems have come into wide use, and these systems depend on the support of speech synthesis technology.

1. Introduction

Speech synthesis technology lets people take in information simply by listening. In practice, it converts text generated by a computer or entered from an external device into a speech signal according to predefined speech-processing rules, so that the contents of text files, mobile phone text messages, Word documents, and other textual information can be read aloud smoothly by the computer. This technology for converting text into speech is called text-to-speech conversion, abbreviated TTS.

2. Research on speech synthesis techniques and methods

The main approaches to speech synthesis are the recording editing method, waveform synthesis, parameter synthesis, and rule synthesis. Other methods exist, but we will not cover them here.

2.1 Recording editing method

In this method, a person's voice is recorded through some medium, and the recorded segments are then spliced together into the desired utterance. Its drawback is that the audio undergoes no compression or other processing in the computer; it is stored and output directly, which requires a large amount of memory.
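The splicing step can be illustrated with a minimal sketch. Assume (this is an illustrative assumption, not from the article) that each recorded phrase is stored uncompressed as a 16-bit NumPy array at a common sample rate; synthesis is then just concatenation with short pauses:

```python
import numpy as np

# Hypothetical library of pre-recorded clips (silent placeholders here);
# in a real system each entry would be a recorded phrase at a fixed rate.
SAMPLE_RATE = 16000
clips = {
    "the_time_is": np.zeros(SAMPLE_RATE, dtype=np.int16),       # 1.0 s
    "three":       np.zeros(SAMPLE_RATE // 2, dtype=np.int16),  # 0.5 s
    "oclock":      np.zeros(SAMPLE_RATE, dtype=np.int16),       # 1.0 s
}

def splice(phrase_keys, pause_ms=50):
    """Join recorded clips in order, inserting a short silence between them."""
    pause = np.zeros(SAMPLE_RATE * pause_ms // 1000, dtype=np.int16)
    parts = []
    for i, key in enumerate(phrase_keys):
        if i > 0:
            parts.append(pause)
        parts.append(clips[key])
    return np.concatenate(parts)

out = splice(["the_time_is", "three", "oclock"])
```

Because every clip is kept as raw samples, memory grows linearly with the vocabulary, which is exactly the drawback noted above.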

2.2 Waveform synthesis

The first variant is waveform editing and concatenation, which most special-purpose speech synthesizers currently adopt. Its principle is to select natural-speech unit waveforms from a speech database, edit and assemble them, and output the result; in short, waveform editing technology is applied to speech synthesis. This approach is quite common, with examples including automatic time-announcement devices and bus voice announcements.

The second variant is waveform coding synthesis, in which the pronunciation waveforms are either stored directly or first processed with waveform coding and compression technology; the stored data are then decoded and combined to output speech at playback. This is similar to the waveform codecs used in speech coding. However, the method still needs further technical improvement, and such a synthesizer is essentially just a device for speech storage and playback.
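As a concrete example of the waveform-coding variant, here is a sketch of mu-law companding in the style of ITU-T G.711 (a classic 8-bit waveform code for telephone speech). The choice of mu-law is an assumption for illustration; the article does not name a specific codec:

```python
import numpy as np

MU = 255.0  # mu-law parameter used in G.711-style 8-bit coding

def mulaw_encode(x):
    """Compress samples in [-1, 1] to 8-bit codes (store these, not raw PCM)."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1) / 2 * 255).astype(np.uint8)

def mulaw_decode(codes):
    """Expand stored 8-bit codes back to samples in [-1, 1] at playback time."""
    y = codes.astype(np.float64) / 255 * 2 - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

x = np.linspace(-1.0, 1.0, 101)          # stand-in for a speech waveform
roundtrip = mulaw_decode(mulaw_encode(x))
```

Storing one byte per sample instead of, say, 16-bit PCM halves the memory cost while keeping the decoded waveform close to the original, which is the trade-off this variant exploits.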

2.3 Parameter synthesis

Early research relied mainly on parameter synthesis, which is more complex. Here we introduce several variants: articulatory (vocal organ) parameter synthesis, formant synthesis, LPC synthesis, and so on.

In articulatory parameter synthesis, the speech sound wave must be computed. How is it obtained? We first define the relevant parameters of the lips, tongue, and vocal cords, then estimate the cross-sectional area function of the vocal tract from these parameters, and from that derive the sound wave. This method has an appealing property: it directly simulates the human pronunciation process and can produce speech close to the human voice. However, there is at present no effective means of determining these parameters accurately, because each person's physiological process of pronunciation is quite complex. It is generally agreed that articulatory parameter synthesis is not yet mature and will take some time to leave the laboratory.

Formant synthesis treats the human vocal tract as a resonant cavity whose resonance characteristics determine the spectral features of the emitted speech signal. It simulates the source-filter model of production, and these resonances are called formants. By modifying the formant synthesis parameters we can obtain speech with different characteristics, and highly intelligible synthetic speech can be produced at relatively low cost, provided the formant parameters are set reasonably.

Later, synthesis systems based on acoustic parameters appeared. Among them, LPC (linear predictive coding) samples the speech waveform in frames of 10-25 ms. The parameters vary from frame to frame, but within a frame the system is treated as linear and time-invariant, and the parameters of each frame are stored in memory. To obtain them, the pitch period, the voiced/unvoiced decision, and a set of least-squares prediction coefficients are extracted from the original speech in each frame, and speech is synthesized from these parameters at playback. In the LPC method the obtained parameters are encoded with 3-7 bits and can be interpolated automatically, so the synthesized speech sounds smooth and pleasant.
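The LPC analysis-synthesis loop described above can be sketched as follows. This is a simplified illustration: the coefficients are found by solving the autocorrelation normal equations (the least-squares step) with a plain linear solve rather than the usual Levinson-Durbin recursion, the excitation is a bare impulse train, and all signal values are synthetic:

```python
import numpy as np

def lpc_coeffs(frame, order=10):
    """Estimate LPC coefficients for one frame via the autocorrelation
    (least-squares) method, solving the normal equations directly."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return a  # predictor: s[n] is approximated by sum_k a[k] * s[n-1-k]

def lpc_synthesize(a, excitation):
    """Run an excitation signal through the all-pole LPC filter."""
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        past = sum(a[k] * out[n - 1 - k]
                   for k in range(len(a)) if n - 1 - k >= 0)
        out[n] = excitation[n] + past
    return out

# Analyze one 20 ms frame (320 samples at 16 kHz) of a voiced-like signal.
sr, f0 = 16000, 100.0
t = np.arange(320) / sr
rng = np.random.default_rng(0)
frame = (np.sin(2 * np.pi * f0 * t)
         + 0.3 * np.sin(2 * np.pi * 3 * f0 * t)
         + 0.01 * rng.standard_normal(320))  # tiny noise keeps R well-conditioned
a = lpc_coeffs(frame, order=10)

# Resynthesize with an impulse-train excitation at the same pitch period.
excitation = np.zeros(320)
excitation[::sr // int(f0)] = 1.0  # one pulse per 160-sample pitch period
speech = lpc_synthesize(a, excitation)
```

Per frame, only the coefficients, pitch period, voicing flag, and gain need to be stored, which is why LPC achieves the very low bit rates mentioned above.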

2.4 Rule synthesis

Since the end of the 1980s, the pitch-synchronous overlap-add (PSOLA) method has been developed, greatly improving the timbre and naturalness of speech synthesized by time-domain waveform concatenation. Synthesizers based on PSOLA have a simple structure and are easy to implement in real time. The method marked substantial progress in speech synthesis research, caused a sensation in the scientific community, and has broad commercial value.

The principle of PSOLA is to make the prosodic features of each concatenation unit meet the requirements of its context while preserving the main segmental features of the original pronunciation. The algorithm adjusts the prosodic features of the splicing unit, such as fundamental frequency, duration, and intensity, and then splices the waveform fragments to obtain clear and natural speech.

As demands on the naturalness and sound quality of synthesis grew, a method with good sound quality, strong adaptability in duration and tone, and flexible adjustment of prosodic parameters was called for, and a speech synthesis method based on the LMA vocal tract model was proposed. Technically, this newer method overcomes PSOLA's weaknesses in handling coarticulation and its limited ability to adjust prosodic parameters, and it achieves higher synthetic sound quality than PSOLA, solving problems that the PSOLA algorithm finds difficult.

To sum up, there are many approaches to computer speech synthesis. Researchers have compared them across both software and hardware, and found that one can choose the synthesis method that best suits a given application according to its purpose.
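The core PSOLA operation can be sketched in a few lines. This is a toy time-domain version under simplifying assumptions: pitch marks are given and evenly spaced, the signal is a pure tone standing in for a voiced segment, and only pitch modification (not duration) is shown:

```python
import numpy as np

def td_psola(signal, pitch_marks, pitch_scale=1.0):
    """Time-domain PSOLA sketch: extract a two-period Hann-windowed grain
    around each pitch mark, then overlap-add the grains at a new spacing
    (period / pitch_scale) to raise or lower the perceived pitch."""
    period = int(np.mean(np.diff(pitch_marks)))    # original pitch period
    new_period = int(round(period / pitch_scale))  # target grain spacing
    out = np.zeros(len(signal) + 2 * period)
    pos = period
    for m in pitch_marks:
        lo, hi = m - period, m + period
        if lo < 0 or hi > len(signal):
            continue  # skip grains that would run off the signal
        grain = signal[lo:hi] * np.hanning(2 * period)
        out[pos - period:pos + period] += grain    # overlap-add at new spacing
        pos += new_period
    return out[:len(signal)]

# Demo: a 100 Hz "voiced" tone at 16 kHz with pitch marks every 160 samples,
# shifted up by a factor of 1.25 (toward ~125 Hz).
sr = 16000
x = np.sin(2 * np.pi * 100 * np.arange(sr) / sr)
marks = np.arange(160, sr - 160, 160)
y = td_psola(x, marks, pitch_scale=1.25)
```

Because each grain keeps the local waveform shape (the segmental features) while only its placement changes, the spectral envelope is largely preserved, which is why PSOLA sounds natural for moderate prosody changes.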

For more information, please check: https://en.speechocean.com/Cy/549.html
