The wide applications of TTS and its structure

artificial intelligence
4 min read · Aug 15, 2022

1. What is TTS?

TTS, which stands for Text To Speech, is the part of human-machine dialogue that enables machines to speak. TTS technology converts text files in real time, with conversion times as short as a second. TTS speech synthesis covers the Level 1 and Level 2 characters of the national standard Chinese character set. It provides an English interface, automatically distinguishes Chinese from English, and supports reading mixed Chinese and English text. All voices use authentic Mandarin as the standard pronunciation, achieving fast synthesis of 120–150 Chinese characters per second and a reading speed of 3–4 Chinese characters per second, so that users hear clear, pleasant sound quality with coherent, smooth intonation. A small number of MP3 players also have a TTS function.
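To put the two quoted speeds in perspective, here is a quick calculation. It assumes, as the figures suggest, that both rates are per second; the 600-character passage is a hypothetical example. The point is that synthesis runs far faster than playback, which is what makes real-time conversion possible.

```python
# Rates quoted in the text, assumed here to be characters per second.
SYNTH_RATE = (120, 150)   # characters synthesized per second (min, max)
READ_RATE = (3, 4)        # characters spoken per second during playback (min, max)


def speedup_range(synth, read):
    """Return the (min, max) ratio of synthesis speed to playback speed."""
    # Worst case: slowest synthesis vs fastest reading; best case: the reverse.
    return synth[0] / read[1], synth[1] / read[0]


def timings(num_chars, synth_rate, read_rate):
    """Seconds needed to synthesize, and to play back, num_chars characters."""
    return num_chars / synth_rate, num_chars / read_rate


if __name__ == "__main__":
    lo, hi = speedup_range(SYNTH_RATE, READ_RATE)
    print(f"synthesis is {lo:.0f}x to {hi:.0f}x faster than playback")
    # Hypothetical 600-character passage at the slowest quoted rates.
    synth_s, play_s = timings(600, SYNTH_RATE[0], READ_RATE[0])
    print(f"600 chars: {synth_s:.1f} s to synthesize, {play_s:.0f} s to play")
```

Under these assumptions synthesis is roughly 30 to 50 times faster than playback: the 600-character passage takes 5 s to synthesize but 200 s to speak.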

TTS is a type of speech synthesis application that converts files stored on a computer, such as documents or web pages, into natural speech output. TTS not only helps people with visual impairments read information on a computer, but also increases the readability of text documents. TTS applications include voice-driven e-mail and voice-sensitive systems, and TTS is often used together with speech recognition programs.

Text-to-speech conversion is widely used, for example in e-mail reading and voice prompts for IVR systems. IVR systems are widely used in many industries, such as telecommunications and transportation. The key technology in TTS is speech synthesis. Early TTS systems generally used dedicated chips, such as Texas Instruments’ TMS50C10/TMS50C57 and Philips’ PH84H36, but these were mainly used in household appliances and children’s toys.

2. The wide application of TTS

TTS for microcomputer-based applications is generally implemented purely in software and mainly consists of the following parts:

● Text analysis: Performs linguistic analysis of the input text sentence by sentence (lexical, syntactic, and semantic analysis) to determine the low-level structure of each sentence and the phoneme composition of each word. This includes sentence breaking, word segmentation, and the handling of polyphonic characters, numbers, abbreviations, etc.

● Speech synthesis: Retrieves the single words or phrases corresponding to the processed text from the speech synthesis library and converts the linguistic description into speech waveforms.

● Prosody processing: Synthetic speech quality refers to the quality of the speech output by a speech synthesis system. It is generally evaluated subjectively in terms of clarity (or intelligibility), naturalness, and coherence. Clarity is the percentage of meaningful words that are heard correctly. Naturalness measures whether the sound quality of the synthesized speech is close to that of human speech and whether the intonation of the synthesized words is natural. Coherence measures whether the synthesized speech flows smoothly.
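To make the text analysis and speech synthesis steps above more concrete, here is a toy, English-only sketch. It is not how any real engine works: the abbreviation table, the digit-by-digit reading of numbers, and the tone produced by placeholder_unit (standing in for recorded units from a speech synthesis library) are all hypothetical choices for illustration. It also includes the clarity measure described above, simplified to a position-by-position comparison of the words a listener reports.

```python
import math
import re

SAMPLE_RATE = 8000      # 8 kHz, a common telephony sampling rate
UNIT_SAMPLES = 400      # 50 ms per placeholder unit
WORD_GAP = 160          # 20 ms of silence between words
SENTENCE_GAP = 1600     # 200 ms of silence between sentences

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
# Hypothetical abbreviation table used during text analysis.
ABBREVIATIONS = {"tts": "text to speech", "ivr": "interactive voice response"}


def analyze(text):
    """Text analysis: break sentences, cut words, expand digits and abbreviations."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    result = []
    for sentence in sentences:
        words = []
        for token in re.findall(r"[a-z]+|\d", sentence.lower()):
            if token.isdigit():
                words.append(DIGITS[int(token)])  # read numbers digit by digit
            else:
                words.extend(ABBREVIATIONS.get(token, token).split())
        result.append(words)
    return result


def placeholder_unit(word):
    """Stand-in for a recorded unit from the synthesis library: a short tone."""
    freq = 200 + 20 * len(word)  # arbitrary pitch, for illustration only
    return [math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(UNIT_SAMPLES)]


def synthesize(sentences, library=None):
    """Concatenate unit waveforms, inserting silence between words and sentences."""
    library = library or {}
    out = []
    for words in sentences:
        for word in words:
            # Use a stored unit if the library has one, else the placeholder tone.
            out.extend(library.get(word) or placeholder_unit(word))
            out.extend([0.0] * WORD_GAP)
        out.extend([0.0] * SENTENCE_GAP)
    return out


def clarity(reference, heard):
    """Percentage of reference words a listener reported correctly, position by position."""
    if not reference:
        return 0.0
    correct = sum(r == h for r, h in zip(reference, heard))
    return 100.0 * correct / len(reference)
```

For example, analyze("Call 42 now. TTS works!") returns [['call', 'four', 'two', 'now'], ['text', 'to', 'speech', 'works']]: sentence breaking, word cutting, and number and abbreviation handling all happen before any audio is produced. A real engine would add prosody (pitch, duration, and energy contours) at the synthesis stage instead of fixed gaps.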

Algorithms that synthesize high-quality speech are extremely complex and therefore place heavy demands on the machine. Their complexity determines how many TTS channels a microcomputer can run concurrently.
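The link between algorithm complexity and channel capacity can be estimated roughly. A minimal sketch, assuming each channel must synthesize at least as fast as its audio plays and that an engine's cost is expressed as a real-time factor (RTF: CPU seconds spent per second of audio produced): with C cores and a target CPU utilization u, about floor(C × u / RTF) channels can run at once. The core count, RTF values, and utilization below are hypothetical.

```python
import math


def max_channels(cores, rtf, utilization=0.75):
    """Estimate how many TTS channels can run concurrently in real time.

    cores:       number of CPU cores available for synthesis
    rtf:         real-time factor, CPU seconds per second of audio (lower is cheaper)
    utilization: fraction of CPU budgeted for TTS, leaving headroom for other work
    """
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    # Each real-time channel consumes `rtf` cores on average.
    return math.floor(cores * utilization / rtf)


if __name__ == "__main__":
    print("lightweight engine:", max_channels(8, 0.25), "channels")
    print("high-quality engine:", max_channels(8, 1.5), "channels")
```

With 8 cores at 75% utilization, a lightweight engine with an RTF of 0.25 supports about 24 channels, while a heavier, higher-quality engine with an RTF of 1.5 supports only about 4.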

In addition to TTS software, many vendors also offer hardware products. These include the Quick Link Pen from WizCom Technologies of Israel, a pen-like device that can scan and read text; the Road Runner from Ostrich Software, a handheld device that can read ASCII text; and the American DTS, also a handheld device that can read ASCII text. There is also DEC’s DecTalk TTS, available both as an external hardware device that can replace the sound card and as an internal software version that works with the PC’s own sound card.

3. The basic structure of TTS in CTI applications

A typical CTI (computer telephony integration) application includes an IVR (Interactive Voice Response) system. The IVR system is an important part of a call center: callers enter information with the keys of a touch-tone phone and receive pre-recorded digital or synthesized voice information from the system. An IVR with a TTS function speeds up service and reduces service costs, allowing the IVR to serve callers 24 hours a day, 7 days a week.
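A touch-tone keypad sends each key as a pair of standard DTMF (dual-tone multi-frequency) tones, one from a low "row" group and one from a high "column" group. One common way for software to recognize these keypresses is the Goertzel algorithm, which measures signal power at just the eight DTMF frequencies. The sketch below is illustrative and not how any particular voice board works: it assumes an 8 kHz sampling rate and a 205-sample analysis block, and it omits the signal-level, twist, and duration checks a real detector needs.

```python
import math

SAMPLE_RATE = 8000
ROW_FREQS = [697, 770, 852, 941]        # Hz, DTMF low (row) group
COL_FREQS = [1209, 1336, 1477, 1633]    # Hz, DTMF high (column) group
KEYS = ["123A", "456B", "789C", "*0#D"]


def dtmf_tone(key, n=205):
    """Generate n samples of the DTMF tone for one key (sum of row and column sines)."""
    for r, row in enumerate(KEYS):
        if len(key) == 1 and key in row:
            f1, f2 = ROW_FREQS[r], COL_FREQS[row.index(key)]
            return [math.sin(2 * math.pi * f1 * i / SAMPLE_RATE) +
                    math.sin(2 * math.pi * f2 * i / SAMPLE_RATE) for i in range(n)]
    raise ValueError(f"not a DTMF key: {key!r}")


def goertzel_power(samples, freq):
    """Squared magnitude of the signal's spectrum at `freq`, via the Goertzel recurrence."""
    coeff = 2 * math.cos(2 * math.pi * freq / SAMPLE_RATE)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def detect_key(samples):
    """Pick the strongest row and column frequencies and map them to a key."""
    row = max(range(4), key=lambda i: goertzel_power(samples, ROW_FREQS[i]))
    col = max(range(4), key=lambda j: goertzel_power(samples, COL_FREQS[j]))
    return KEYS[row][col]
```

Goertzel is used here instead of a full FFT because only eight frequencies matter, so eight cheap recurrences per block are enough.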

Most common IVR systems consist of voice boards inserted into a standard IPC (industrial PC) platform and support technologies such as Chinese speech synthesis (TTS). A typical telephone service process that includes TTS can be divided into the following steps:

● The IVR system answers the call and obtains information such as the user’s keystrokes.

● The IVR requests the relevant data from the database server based on the user’s keystrokes.

● The database server returns the text data to the IVR.

● The IVR sends the text to be synthesized to the TTS server through its TCP communication interface.

● The TTS server synthesizes the text and sends the resulting voice data, in segments, back to the IVR server through the TCP communication interface.

● The IVR server assembles the speech segments into individual voice files.

● The IVR plays the corresponding voice files to the telephone user.

Most general public-network IVR access uses an IPC with voice boards, and the synthesized voice data is sent to the IVR over the LAN. This structure is only suitable for simple applications.
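The IVR-to-TTS exchange described above can be sketched as a toy client and server running on one machine. Everything here is an assumption made for illustration: the length-prefixed framing, the 10-character segments, the empty frame marking the end of speech, the DATABASE dictionary standing in for the database server, and the silent placeholder audio. Real TTS servers and voice boards define their own protocols and audio formats.

```python
import io
import socket
import struct
import threading
import wave

SAMPLE_RATE = 8000
BYTES_PER_CHAR = 400   # hypothetical: 200 samples of 16-bit audio per character
SEGMENT_CHARS = 10     # hypothetical: the server returns audio in 10-character segments

# Stand-in for the database server: keystroke -> text to be read out.
DATABASE = {"1": "Your account balance is 42 dollars.",
            "2": "Our office opens at 9 am."}


def send_frame(sock, payload):
    """Send one frame: 4-byte big-endian length, then the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, since TCP may deliver data in smaller pieces."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed early")
        buf += chunk
    return buf


def recv_frame(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def tts_server(listener):
    """Toy TTS server: serves one request, replies with audio segments, then an empty frame."""
    conn, _ = listener.accept()
    with conn:
        text = recv_frame(conn).decode("utf-8")
        for i in range(0, len(text), SEGMENT_CHARS):
            chunk = text[i:i + SEGMENT_CHARS]
            send_frame(conn, b"\x00" * (BYTES_PER_CHAR * len(chunk)))  # silence as placeholder
        send_frame(conn, b"")  # end-of-speech marker


def ivr_handle_call(keystroke, tts_addr):
    """IVR side: look up text, send it to the TTS server, assemble segments into one WAV file."""
    text = DATABASE.get(keystroke, "Sorry, that option is not available.")
    with socket.create_connection(tts_addr) as sock:
        send_frame(sock, text.encode("utf-8"))
        segments = []
        while True:
            segment = recv_frame(sock)
            if not segment:
                break
            segments.append(segment)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(b"".join(segments))
    return buf.getvalue(), len(segments)


if __name__ == "__main__":
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))    # port 0: let the OS pick a free port
    listener.listen(1)
    server = threading.Thread(target=tts_server, args=(listener,))
    server.start()
    wav_bytes, n_segments = ivr_handle_call("1", listener.getsockname())
    server.join()
    listener.close()
    print(n_segments, "segments,", len(wav_bytes), "bytes of WAV")
```

Because TCP is a byte stream with no message boundaries, the sketch prefixes every segment with its length and uses an empty frame to tell the IVR that synthesis is complete; the IVR then joins the segments into a single file it can play to the caller.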

For more information, please check: https://en.speechocean.com/Cy/500.html
