Automatic Speech Recognition Technology

suraj mishra
Alexa Developers SRM
8 min read · Sep 26, 2020

Speech is the primary mode of communication among human beings. The prevalent means of input to computers, on the other hand, is a keyboard or a mouse. It would be nice if computers could listen to human speech and carry out our commands.

Automatic Speech Recognition, or ASR, is a technology that allows us humans to converse with machines such as mobile phones, televisions, personal computers, etc. It makes life easier for us in many ways, especially in this era of emerging AI.

You may be wondering what ASR actually means, and you have probably guessed part of the answer yourself. As the name suggests, ASR is a technology that recognises our voice, talks to us like an AI bot, and is built from a lot of complex code. If that is roughly what you thought, you are not far off.

What does ASR actually refer to?

Automatic Speech Recognition (ASR) is the process of deriving the transcription (word sequence) of an utterance, given the speech waveform. Speech understanding goes one step further, and gleans the meaning of the utterance in order to carry out the speaker’s command.

In simple words, ASR systems enable a physically handicapped person to command and control a machine. Even ordinary users would often prefer a voice interface over a keyboard or mouse, and the advantage is even more obvious with small handheld devices. The dictation machine is a well-known application of ASR.

Types of ASR based on the nature of the input speech

ASR systems come in many types, depending on their use and on the constraints imposed on the nature of the input speech.

What does this actually mean?

It means that ASR systems can be classified in several ways according to their use and environment. For example, systems differ in the number of speakers they support, in the nature of the utterances they accept, and so on.

Number of speakers: A system is said to be speaker independent if it can recognise the speech of any and every speaker; such a system has learnt the characteristics of a large number of speakers. A speaker dependent system, in contrast, requires a large amount of one user’s speech data for training, and it does not recognise others’ speech well.

Nature of the utterance: An Isolated Word Recognition system requires the user to utter words with a clear pause between them. A Connected Word Recognition system can recognise words, drawn from a small set, spoken without the need for a pause between words. Continuous Speech Recognition systems, on the other hand, recognise sentences spoken continuously. A Spontaneous Speech Recognition system can additionally handle disfluencies such as “ah” and “um”, false starts, and the grammatical errors present in conversational speech.

Vocabulary size: An ASR system that can recognise a small number of words (say, 10 digits) is called a small vocabulary system. Medium vocabulary systems can recognise a few hundred words. Large and Very Large ASR systems are trained with several thousand and several tens of thousands of words, respectively.

Spectral bandwidth: The bandwidth of a telephone/mobile channel is limited to 300–3400 Hz, so it attenuates frequency components outside this passband. Such speech is called narrowband speech. In contrast, normal speech that does not pass through such a channel is called wideband speech; it contains a wider spectrum limited only by the sampling frequency. As a result, the recognition accuracy of ASR systems trained with wideband speech is better. Moreover, an ASR system trained with narrowband speech performs poorly on wideband speech, and vice versa.
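To make the bandwidth distinction concrete, here is a minimal sketch (assuming Python with NumPy and SciPy installed) that simulates narrowband, telephone-quality speech by band-pass filtering a wideband signal to the 300–3400 Hz range. The sample rate and test signal are invented for illustration; a real system would filter actual recordings.

```python
# Simulate narrowband (telephone-quality) speech by band-pass filtering
# a wideband signal to the 300-3400 Hz telephone passband.
import numpy as np
from scipy.signal import butter, lfilter

fs = 16000                                   # 16 kHz wideband sampling rate
t = np.arange(0, 1.0, 1.0 / fs)
# Synthetic "speech": one component inside the passband would survive,
# these two (200 Hz and 5000 Hz) lie outside it.
wideband = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 5000 * t)

# 4th-order Butterworth band-pass over the telephone passband
b, a = butter(4, [300, 3400], btype="band", fs=fs)
narrowband = lfilter(b, a, wideband)

# The out-of-band components are strongly attenuated in `narrowband`,
# which is why a model trained on one bandwidth degrades on the other.
print(narrowband[:5])
```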

The most advanced of currently developed ASR technologies revolve around what is called Natural Language Processing, or NLP for short. This variant of ASR comes closest to allowing real conversation between people and machine intelligence. Although it still has a long way to go before reaching its apex, we are already seeing remarkable results in intelligent smartphone interfaces like Siri on the iPhone and in other systems used in business and advanced technology contexts.

However, even these NLP programs, despite an “accuracy” of roughly 96 to 99%, can only achieve such results under ideal conditions, in which the questions directed at them are of a simple yes-or-no type or have only a limited number of possible response options based on selected keywords (more on this shortly).

Now that we’ve covered the wonderful future prospects of ASR technology, let’s take a look at how these systems work today, as we’re already using them.

A Basic Primer on How Automatic Speech Recognition Works

The basic sequence of events that lets any Automatic Speech Recognition software, regardless of its sophistication, pick up and break down your words for analysis and response goes as follows (a small code sketch follows the list):

1. You speak to the software via an audio feed.

2. The device you’re speaking to creates a wave file of your words.

3. The wave file is cleaned by removing background noise and normalizing volume.

4. The resulting filtered waveform is then broken down into what are called phonemes. (Phonemes are the basic building-block sounds of language and words. English has 44 of them, consisting of sound blocks such as “wh”, “th”, “ka” and “t”.)

5. Each phoneme is like a link in a chain: by analyzing the links in sequence, starting from the first phoneme, the ASR software uses statistical probability analysis to deduce whole words and, from there, complete sentences.

6. Your ASR, now having “understood” your words, can respond to you in a meaningful way.
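As a rough illustration of steps 2–4, here is a minimal Python sketch that reads a recording, normalizes its volume, and slices it into short overlapping frames, the raw material from which an acoustic model later estimates phoneme probabilities. The file name, frame sizes, and audio format (16-bit mono WAV) are assumptions made purely for the example.

```python
# Illustrative only: read a recording, normalize its volume, and split it
# into short frames for later acoustic (phoneme) scoring.
import wave
import numpy as np

def load_and_normalize(path):
    """Read a 16-bit mono WAV file and peak-normalize its volume."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        samples = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float32)
    samples /= np.max(np.abs(samples)) + 1e-9   # volume normalization
    return rate, samples

def split_into_frames(samples, rate, frame_ms=25, hop_ms=10):
    """Slice the waveform into overlapping 25 ms frames, hopped every 10 ms."""
    frame_len = int(rate * frame_ms / 1000)
    hop_len = int(rate * hop_ms / 1000)
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop_len)]
    return np.stack(frames)

rate, audio = load_and_normalize("command.wav")   # hypothetical recording
frames = split_into_frames(audio, rate)
print(f"{len(frames)} frames ready for phoneme scoring")
```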

Some Key Examples of Automatic Speech Recognition Variants

The two main types of Automatic Speech Recognition software variants are directed dialogue conversations and natural language conversations (the same thing as the Natural Language Processing we mentioned above).

Directed Dialogue conversations are the much simpler version of ASR at work. They consist of machine interfaces that tell you verbally to respond with a specific word from a limited list of choices, and they form their response to your narrowly defined request from that choice. Automated telephone banking and other customer-service interfaces commonly use directed dialogue ASR software.
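As a toy sketch of the idea, assuming a hypothetical telephone-banking menu, a directed dialogue system only has to match the recognised word against a fixed list of options:

```python
# A hypothetical directed-dialogue menu: the system only accepts words
# from a fixed list and rejects everything else.
MENU = {
    "balance": "Fetching your account balance...",
    "transfer": "Starting a funds transfer...",
    "support": "Connecting you to an agent...",
}

def directed_dialogue(recognised_word: str) -> str:
    word = recognised_word.strip().lower()
    if word in MENU:
        return MENU[word]
    return "Sorry, please say 'balance', 'transfer', or 'support'."

print(directed_dialogue("balance"))
print(directed_dialogue("weather"))   # outside the allowed list
```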

Natural Language Conversations (the NLP we covered in the introduction) are the much more sophisticated variant of ASR. Instead of heavily limited menus of words you may use, they try to simulate real conversation by allowing you to use an open-ended chat format with them. The Alexa interface developed by Amazon is a highly advanced example of this system.

Automatic speech recognition (ASR) is technology that converts spoken words into text. In short, it’s the first step in enabling voice technologies like Amazon Alexa to respond when we ask, “Alexa, what’s it like outside?”

With ASR, voice technology can detect spoken sounds and recognize them as words. ASR is the cornerstone of the entire voice experience, allowing computers to finally understand us through our most natural form of communication: speech.

Let’s look at what Amazon’s Alexa is and how important ASR is for Alexa to function properly, according to the expectations of both the company and its users:

Fig: An Amazon device using ASR.

The photo above shows a device created by Amazon that makes full use of ASR technology to receive instructions from the user and perform the corresponding actions.

Fig: Working of Alexa using ASR technology.

Moving back to the main topic: ASR.

How Does Natural Language Processing Work?

Given that NLP is the future direction of ASR technology, it is far more important than directed dialogue in the development of speech recognition systems.

The way it works is designed to loosely simulate how humans themselves comprehend speech and respond accordingly.

The typical vocabulary of an NLP ASR system consists of 60,000 or more words. That means over 216 trillion possible word combinations (60,000 × 60,000 × 60,000) if you say just three words in a sequence to it!

Obviously then, it would be grossly impractical for an NLP ASR system to scan its entire vocabulary for each word and process them individually. Instead, what the natural language system is designed to do is react to a much smaller list of selected “tagged” keywords that give context to longer requests.

Thus, using these contextual clues, the system can much more quickly narrow down exactly what you’re saying to it and find out which words are being used so that it can adequately respond.

For example, if you say phrases like “weather forecast”, “check my balance” and “I’d like to pay my bills”, the tagged keywords the NLP system focuses on might be “forecast”, “balance” and “bills”. It would then use these words to find the context of the other words you used and not commit errors like confusing “weather” with “whether”.
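A toy sketch of this “tagged keyword” idea, with a hand-written keyword-to-intent table invented for illustration (real NLP systems learn these associations statistically rather than using a lookup table), might look like this:

```python
# Hypothetical tagged keywords mapped to the intents they signal.
TAGGED_KEYWORDS = {
    "forecast": "weather_query",
    "balance":  "account_balance",
    "bills":    "bill_payment",
}

def detect_intent(utterance: str) -> str:
    """Return the first tagged keyword's intent, giving context to the rest."""
    for word in utterance.lower().split():
        if word in TAGGED_KEYWORDS:
            return TAGGED_KEYWORDS[word]
    return "unknown"

for phrase in ["weather forecast", "check my balance", "I'd like to pay my bills"]:
    print(phrase, "->", detect_intent(phrase))
```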

The Tuning Test: How ASR is made to “Learn” from Humans

The training of ASR systems, be they NLP or directed dialogue systems, works on two main mechanisms. The first and simpler of these is called Human “Tuning” and the second, much more advanced variant is called “Active Learning”.

Human Tuning: This is a relatively simple means of ASR training. It involves human programmers going through the conversation logs of a given ASR interface and looking for commonly used words that the system encountered but does not have in its pre-programmed vocabulary. Those words are then added to the software so that it can expand its comprehension of speech.
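A minimal sketch of that tuning step, assuming a hypothetical vocabulary and a few made-up log lines, would simply count out-of-vocabulary words so a human reviewer can decide which ones to add:

```python
# Scan conversation logs for frequent out-of-vocabulary words (illustrative data).
from collections import Counter

vocabulary = {"alexa", "weather", "forecast", "balance", "bills", "pay"}

def find_oov_candidates(log_lines, top_n=10):
    """Count words that appear in logs but are missing from the vocabulary."""
    counts = Counter()
    for line in log_lines:
        for word in line.lower().split():
            if word.isalpha() and word not in vocabulary:
                counts[word] += 1
    return counts.most_common(top_n)

logs = ["alexa play some lofi music", "what's the weather forecast", "play lofi again"]
print(find_oov_candidates(logs))   # e.g. [('play', 2), ('lofi', 2), ...]
```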

Active Learning: Active learning is the much more sophisticated variant of ASR training and is particularly being tried with NLP versions of speech recognition technology. With active learning, the software itself is programmed to autonomously learn, retain, and adopt new words, constantly expanding its vocabulary as it is exposed to new ways of speaking and saying things.

This, at least in theory, allows the software to pick up on the more specific speech habits of particular users so that it can communicate better with them.

So for example, if a given human user keeps negating the autocorrect on a specific word, the NLP software eventually learns to recognize that particular person’s different use of that word as the “correct” version.
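A toy illustration of that behaviour, with a rejection threshold and data invented purely for the example: once a user has negated the autocorrect on a word enough times, the system starts treating the user’s form as correct.

```python
# Illustrative only: learn a per-user "correct" form from repeated
# rejections of the system's autocorrect.
from collections import defaultdict

class UserLexicon:
    def __init__(self, reject_threshold=3):
        self.rejections = defaultdict(int)
        self.accepted = set()
        self.reject_threshold = reject_threshold

    def record_rejection(self, word):
        """User negated the autocorrect for this word."""
        self.rejections[word] += 1
        if self.rejections[word] >= self.reject_threshold:
            self.accepted.add(word)   # treat the user's form as correct

    def is_accepted(self, word):
        return word in self.accepted

lex = UserLexicon()
for _ in range(3):
    lex.record_rejection("gonna")
print(lex.is_accepted("gonna"))   # True: this user's usage is now "correct"
```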

Fig: An infographic about Automatic Speech Recognition (ASR)
