ICARUS Solution
6 min read · Dec 5, 2017

How is speech recognition software changing the way we communicate?

“To sell suckers, one uses deceit and offers glamour,” wrote John Pierce of Bell Labs in 1969 as he discussed speech recognition. It was his warning against mad inventors and untrustworthy engineers who weren’t taking a scientific approach.

The mobile phone trend of the past 15 years has been the slow decline of calls, first in favor of SMS and then of messaging. Since then it has been all about text over talk, but the creeping use of more and more emojis as shorthand is quickly transforming into the use of speech recognition as a way to save time on text input. We may not be chatting to each other as much as we used to, but the new trend is set: we’re talking to our phones again.

Speech recognition is the ability of devices to respond to spoken commands. It enables hands-free control of various devices and equipment, provides input to automatic translation, and creates print-ready dictation.

Sundar Pichai, chief executive of Google, claims that 20 percent of Google searches on smartphones are now entered by voice. Sending messages, creating appointments, getting directions and updating social media can all now be done using the spoken word, and with ever-increasing accuracy.

HOW IT WORKS

Before any machine can interpret speech, a microphone must translate the vibrations of a person’s voice into a wavelike electrical signal. This signal in turn is converted by the system’s hardware (for instance, a computer’s sound card) into a digital signal. It is the digital signal that a speech recognition program analyzes in order to recognize separate phonemes, the basic building blocks of speech. The phonemes are then recombined into words. However, many words sound alike. In order to select the appropriate word, the program must rely on context. Many programs establish context through trigram analysis, a method based on a database of frequent three-word clusters, in which each pair of words is assigned a probability of being followed by a given third word. For example, if a speaker says “who am,” the next word will be recognized as the pronoun “I” rather than the similar-sounding but less likely “eye.” Nevertheless, human intervention is sometimes needed to correct errors.
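To make trigram analysis concrete, here is a minimal sketch in Python. The probabilities and word pairs are invented for illustration, not drawn from any real corpus; a production recognizer would estimate them from enormous volumes of text.

```python
# Invented trigram probabilities: P(third word | first two words).
trigram_probs = {
    ("who", "am"): {"I": 0.92, "eye": 0.01},
    ("to", "the"): {"store": 0.10, "moon": 0.02},
}

def pick_word(w1, w2, candidates):
    """Among acoustically similar candidates, pick the word the
    trigram model says is most likely to follow (w1, w2)."""
    probs = trigram_probs.get((w1, w2), {})
    return max(candidates, key=lambda w: probs.get(w, 0.0))

# The recognizer hears something between "I" and "eye" after "who am":
print(pick_word("who", "am", ["I", "eye"]))  # -> I
```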

Programs for recognizing a few isolated words, such as telephone voice navigation systems, work for almost every user. Continuous speech programs, such as dictation programs, on the other hand, must be trained to recognize an individual’s speech patterns; training involves the user reading aloud samples of text. With the growing power of personal computers and mobile devices, the accuracy of speech recognition has improved markedly. Error rates have been reduced to about 5 percent in vocabularies containing tens of thousands of words. Even greater accuracy is reached in limited vocabularies for specialized applications such as dictation of radiological diagnoses.
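Error rates like the 5 percent figure above are conventionally reported as word error rate: the number of substituted, deleted, and inserted words in the recognizer’s output, divided by the number of words actually spoken. A small sketch of the standard edit-distance computation:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: edit distance between the reference and
    hypothesis word sequences, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("who am I", "who am eye"))  # -> 0.333...
```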

Speech Recognition Applications

Among the earliest applications for speech recognition were automated telephone systems and medical dictation software. It is frequently used for dictation, for querying databases, and for giving commands to computer-based systems, especially in professions that rely on specialized vocabularies.

It also enables personal assistants in vehicles and smartphones, such as Apple’s Siri. From virtual assistants like Siri, Alexa and OK Google to apps including Dragon Anywhere, Swype, SwiftKey and Baidu’s new TalkType, speech-activated functions and speech-to-text services are growing.

Speech recognition applications may be classified into three categories: dictation systems, navigational or transactional systems, and multimedia indexing systems. Each category has a different tolerance for speech recognition errors, and advances in technology are bringing us steadily closer to the goal of any individual being able to speak naturally to a computer on any topic and be understood accurately.

Dictation Applications

Dictation applications are those in which the words spoken by a user are transcribed directly into written text. They are used to create text such as personal letters, business correspondence, or e-mail messages. Usually, the user has to be very explicit, specifying all punctuation and capitalization in the dictation. Dictation applications often combine mouse and keyboard input with spoken input. Using speech to create text can still be a challenging experience, since users have a hard time getting used to the process of dictating. The best results are achieved when the user speaks clearly, enunciates each syllable properly, and has organized the content mentally before starting. As the user speaks, the text appears on the screen and is available for correction. Correction can take place either with traditional methods such as a mouse and keyboard, or with speech.
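As a rough illustration of what specifying punctuation and capitalization explicitly looks like, here is a toy post-processor for a dictated word stream. The command words are invented for the example and are not taken from any particular dictation product.

```python
# Spoken command -> (punctuation mark, whether it ends a sentence).
COMMANDS = {
    "comma": (",", False),
    "period": (".", True),
    "question mark": ("?", True),
}

def render_dictation(tokens):
    """Turn a stream of dictated words and explicit punctuation
    commands into formatted text."""
    out, cap = [], True  # capitalize the first word of each sentence
    i = 0
    while i < len(tokens):
        # Try two-word commands first ("question mark"), then one-word.
        for n in (2, 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in COMMANDS and out:
                mark, ends_sentence = COMMANDS[phrase]
                out[-1] += mark
                cap = ends_sentence
                i += n
                break
        else:
            word = tokens[i]
            out.append(word.capitalize() if cap else word)
            cap = False
            i += 1
    return " ".join(out)

print(render_dictation("dear john comma how are you question mark".split()))
# -> "Dear john, how are you?"
```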

Transactional Applications

Speech is used in transactional applications to navigate around the application or to conduct a transaction. For example, speech can be used to purchase stock, reserve an airline itinerary, or transfer bank account balances. It can also be used to follow links on the web or move from application to application on one’s desktop. Most often, though not exclusively, this category of speech applications involves a telephone: the user speaks into a phone, the signal is interpreted by a computer, and an appropriate response is produced. A custom, application-specific vocabulary is usually used, which means the system can only hear the words in that vocabulary, and the user, in turn, can only say what the system can hear. These applications require careful attention to what the system says to the user, since these prompts are the only way to cue the user as to which words will lead to a successful outcome.
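The constrained-vocabulary idea can be sketched in a few lines. The commands and prompt wording below are invented for a hypothetical telephone banking menu; a real system would match against a grammar rather than exact strings.

```python
# Invented application-specific vocabulary for a phone banking menu.
COMMANDS = {
    "check balance": "BALANCE_INQUIRY",
    "transfer funds": "TRANSFER",
    "speak to an agent": "AGENT",
}

def handle_utterance(recognized_text):
    """Map a recognized phrase onto a transaction, or reprompt.
    The system literally cannot act on words outside COMMANDS."""
    action = COMMANDS.get(recognized_text.lower().strip())
    if action is None:
        # The prompt must cue the caller toward the legal vocabulary.
        return "Sorry, you can say: " + ", ".join(COMMANDS) + "."
    return f"Starting transaction: {action}"

print(handle_utterance("Transfer funds"))   # -> Starting transaction: TRANSFER
print(handle_utterance("pay my mortgage"))  # -> reprompt listing valid phrases
```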

Multimedia Indexing Applications

In multimedia indexing applications, speech is used to transcribe words from an audio file into text; the audio may be part of a video. Information retrieval techniques are then applied to the transcript to create an index with time offsets into the audio. This enables a user to search a collection of audio/video documents using text keywords. Retrieval of unstructured multimedia documents is a challenge; retrieval using keyword search based on speech recognition is a big step toward addressing it. It is important to have realistic expectations about retrieval performance when speech recognition is used. The user interface design is typically guided by the “search the speech, browse the video” metaphor, where the primary search interface is textual keywords and browsing of the video relies on video segmentation techniques. In general, it has been observed that the accuracy of the top-ranking search results matters more than finding every relevant match in the audio, so speech indexing systems often bias their ranking to reflect this. Since the user does not directly interact with the indexing system using speech input, standard search engine user interfaces apply seamlessly to speech indexing interfaces.
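A minimal sketch of the indexing step, assuming the recognizer emits (time offset, word) pairs; the transcript data here is invented:

```python
from collections import defaultdict

# A toy transcript: (start time in seconds, word) pairs, as a
# recognizer might emit for an audio or video file.
transcript = [(12.4, "speech"), (12.9, "recognition"),
              (13.5, "enables"), (47.0, "keyword"), (47.6, "search")]

# Build an inverted index from each word to its time offsets.
index = defaultdict(list)
for t, word in transcript:
    index[word.lower()].append(t)

def search(keyword):
    """Return the time offsets where the keyword was spoken, so the
    player can jump straight to those points in the video."""
    return index.get(keyword.lower(), [])

print(search("keyword"))  # -> [47.0]
```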

Conclusion

Speech recognition technology has progressed to the point where it is practical to consider speech input in applications. Speech recognition is also gaining acceptance as a means of creating searchable text from audio streams. Dictation applications have the highest accuracy requirements and must be designed for efficient error correction. Transactional applications are more tolerant of speech errors but require careful design of the constrained vocabulary and cueing of the user. Multimedia indexing applications are also tolerant of speech errors, since the search algorithm can be adapted to meet the requirements of the application.