Speech Recognition 101

A brief introduction to automatic speech recognition concepts and how to apply them

EnableData · CodeX · 5 min read · Sep 1, 2022


Tools that use voice, such as Siri, Alexa, and Google Voice, have been changing how people interact with the devices in their cars, homes, and workplaces. Consequently, this rise also expands the possibilities for analyzing user data.

When people talk to these devices and issue commands, data is collected, and the possibilities are endless: estimating the age, gender, and accent of the person speaking, or inferring their mood from their voice or the words they use.

There is also the possibility of improving the models to better understand all types of voices and diction. This video is a clear example of the kind of improvement that can be made.

What I'm trying to express here is that this type of data is extremely valuable and can help improve the user experience: by understanding users better, and even by answering them in a way that makes them more comfortable.

But to do all of that, it is essential first to understand how speech recognition algorithms handle audio data (feature extraction), how they work, and how to apply these techniques to a real-world problem.

This is an introductory article on a large and complex subject.

How speech recognition algorithms handle audio data (feature extraction)

First of all, it is important to understand the concept of phonemes. The Cambridge Dictionary defines a phoneme as:

"one of the smallest units of speech that make one word different from another word"

It is the smallest unit of speech. To understand speech, algorithms have to "slice" the audio wave so that each slice can be matched to its written form. The image below sketches a waveform being sliced, representing how computers and their algorithms interpret this type of data.

Representation of phonemes [Image by author]

When someone speaks to Siri, saying, for example, "What is the weather going to be tomorrow?", the algorithm first has to transform that wave into something it can work with, such as written text (which other algorithms then use to "understand" and respond to the request, but let's not dive deep into that here).
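To make the feature extraction step concrete, here is a minimal sketch in Python. It assumes the librosa library is installed and that a recording called speech.wav exists; both are assumptions for illustration, not part of the original article.

```python
import librosa

# Load a speech recording (the file name is just an example)
signal, sample_rate = librosa.load("speech.wav", sr=16000)

# Slice the waveform into short overlapping frames and extract MFCCs,
# a common feature representation used by speech recognition systems
mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)

print(mfccs.shape)  # (13 coefficients, number of frames)
```

Each column of the resulting matrix describes one short slice of the audio, which is roughly the level at which phonemes are matched.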

How speech recognition algorithms work

Having established the concept of phonemes, it is important to understand how an algorithm decides which written form to tie to the set of wave slices that make up the speech. There are two main types of algorithms: one based on a set of rules (derived from language rules) and one based on neural networks.

In the first type, based on a set of rules, the feature extraction ("slicing the audio") is followed by an acoustic model that maps the audio features to phonemes, checking the statistical probability of one phoneme following the previous one in the language. Then, using the acoustic and language models, a decoder searches for the sequence of words that best matches the input features, again weighing the statistical probabilities given by the language model. The output is a sequence of words that matches (or tries to match) the input speech. The image below represents this pipeline.

These algorithms need an acoustic model and a language model that have to be built beforehand according to the language's grammar and rules.

Statistical speech recognition model [Image by author]
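To make the decoding step more concrete, here is a toy sketch in Python (not from the original article) that combines made-up acoustic and bigram language-model log-probabilities to pick the most likely word sequence. All numbers are invented for illustration; a real system uses trained models and a proper search algorithm such as Viterbi decoding.

```python
import math

# Acoustic model (toy): log-probability that the audio slices match each word
acoustic_scores = {
    "weather": -2.3,
    "whether": -2.1,
    "tomorrow": -1.0,
}

# Language model (toy): log-probability of a word given the previous word
language_scores = {
    ("the", "weather"): -1.2,
    ("the", "whether"): -6.5,
    ("weather", "tomorrow"): -1.8,
    ("whether", "tomorrow"): -4.0,
}

def score(sequence):
    """Combine acoustic and language-model log-probabilities for a candidate."""
    total = 0.0
    for prev, word in zip(sequence, sequence[1:]):
        total += acoustic_scores.get(word, 0.0)
        total += language_scores.get((prev, word), math.log(1e-6))
    return total

candidates = [
    ["the", "weather", "tomorrow"],
    ["the", "whether", "tomorrow"],
]
best = max(candidates, key=score)
print(best)  # the language model favours "weather" after "the"
```

Even though "weather" and "whether" sound almost identical (similar acoustic scores), the language model makes one sequence far more likely than the other, which is exactly the role it plays in the decoder.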

The second type of algorithm is a neural network. These algorithms are fed a huge amount of data, and the network essentially learns, statistically, the probability of a waveform corresponding to a written form. In this case, the more data the merrier. Nowadays, this is the type of algorithm that gets the best results.
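As one illustration of this neural approach, the sketch below transcribes an audio file with a publicly available pretrained wav2vec 2.0 checkpoint through the Hugging Face transformers library. The library, the English-language checkpoint, and the file name are assumptions made here for illustration; they are not part of the comparison described later in this article.

```python
from transformers import pipeline

# End-to-end neural speech recognition with a pretrained checkpoint
# (downloads the model on first run; any compatible checkpoint could be used)
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

result = asr("speech.wav")
print(result["text"])
```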

How to apply speech recognition techniques

As you might suppose, these algorithms can become very complex to implement and use in real-life applications. Building an algorithm from scratch is quite interesting, but there are tools available on the market that can meet business needs (and some of them are free). Below, I briefly describe these tools based on a comparison I previously ran using audio samples in Portuguese, with pricing as of that time (March 2022).

  • Wit.ai: a free natural language processing service provided by Meta. The tool has relatively high quality and is free of charge, even for businesses. The average time to transcribe 10 seconds of audio with this tool is 2.31 seconds.
  • Amazon Transcribe: a paid service provided by AWS in its cloud. This tool also has relatively high quality, but it is expensive (USD 6.42 per transcribed hour). The average time to transcribe 10 seconds of audio with this tool is 17.56 seconds.
  • Azure Speech Service: a paid service provided by Azure. It is a low-priced cloud service, at USD 1.13 per transcribed hour. Azure also provides a budget for first-timers on its platform, so the tool can be tested free of charge for some time. To transcribe 10 seconds of audio, it takes about 2.14 seconds.
  • Google Cloud Speech API: a paid service provided by Google Cloud Platform. This tool has great quality, but it is the most expensive (USD 6.57 per transcribed hour). This cloud also offers a free tier to first-time users, making it possible to test the service free of charge. To transcribe 10 seconds of audio, it takes about 2.25 seconds.
  • Vosk API: an open-source Python library that ships with some statistical models (which you can also replace). The quality is quite low (it depends a lot on the model; I used the Portuguese one they make available). The average time to transcribe 10 seconds of audio was 0.59 seconds, but since it runs locally, that depends heavily on the hardware. A minimal usage sketch follows this list.

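Here is a minimal sketch of how the Vosk API can be used, assuming the vosk package is installed, a model (for example the Portuguese one) has been downloaded and unpacked to a local folder, and the input is a 16-bit mono WAV file. The paths are placeholders.

```python
import json
import wave

from vosk import Model, KaldiRecognizer

# Path to an unpacked Vosk model folder (placeholder)
model = Model("path/to/vosk-model-small-pt")

wf = wave.open("speech.wav", "rb")
rec = KaldiRecognizer(model, wf.getframerate())

# Feed the audio to the recognizer in chunks
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

# FinalResult returns a JSON string with the transcription
print(json.loads(rec.FinalResult())["text"])
```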
Furthermore, each of these tools/services has its own Python library, but there is also a library that centralizes them all: SpeechRecognition. I found it quicker to learn, but less documented than the official libraries.
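As a sketch of how SpeechRecognition wraps several of these engines behind one interface: the file name and the Wit.ai token below are placeholders, and the Google and Wit.ai calls require network access (and, for Wit.ai, a valid key).

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

# Load a local audio file (the name is just an example)
with sr.AudioFile("speech.wav") as source:
    audio = recognizer.record(source)

# The same AudioData object can be sent to different engines
print(recognizer.recognize_google(audio, language="pt-BR"))
# print(recognizer.recognize_wit(audio, key="YOUR_WIT_AI_TOKEN"))
```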

In the right context, all of these tools are great.

In this article, I covered some basic concepts for understanding speech recognition algorithms, the main types of models, and how they can be applied. I also mentioned some tools, along with their advantages, as a means of applying speech recognition techniques.

Soon, I will write about my experience applying these tools and will describe how to measure the success of an algorithm.

I’m Aline, the author of this article. Find me here and here!


I write about data engineering, data analytics, data science and infrastructure. Find me on YouTube: https://www.youtube.com/@EnableData