2017: An Odyssey into Voice Recognition

Martin Vetterli · Published in Digital Stories · Sep 24, 2018

Machines that talk have been a staple of science fiction since the beginnings of the genre, but only today are we starting to actually be “understood” by our cell phones.

Photo by John Reign Abarintos on Unsplash

Recently I watched Stanley Kubrick’s movie “2001: A Space Odyssey” again. It was shot in the heyday of the space race and will soon celebrate its 50th anniversary. The movie is an optimistic projection of technology: HAL 9000, an all-powerful computer, has no keyboard, and all exchanges with the crew take the form of conversations. Yet only recently did applications like Apple’s Siri start to put something resembling HAL 9000 into our pockets. How do you teach a machine to recognise words? And why did it take so long?

As is often the case when designing a machine that mimics a human capability, the first step is to distance ourselves from what we know and to understand “computationally” how human speech works. Speech is a sequence of basic sound units produced by the vocal tract, and spoken words are composed of successive sound units (much as written words are sequences of letters). To understand a spoken word, we must therefore identify the underlying sound units. This is a tricky process: some basic sounds, such as vowels, must be analysed on the basis of their pitch (like musical notes), while consonants are recognised by looking at how the sound changes over time. In humans, this step is performed by the inner ear; your cell phone performs it the moment you talk to Siri.
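To make this concrete, here is a minimal sketch in Python of that analysis step (an illustration, not Apple’s actual code): the recording is cut into short overlapping frames, and a frequency spectrum is computed for each frame. Steady pitch peaks hint at vowels; rapid frame-to-frame changes hint at consonants. The frame sizes and the synthetic test signal are illustrative assumptions, though 25 ms frames are a common choice in speech processing.

```python
import numpy as np

def frame_spectra(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Cut a speech signal into short overlapping frames and compute
    the magnitude spectrum of each frame. Stable pitch peaks suggest
    vowels; fast frame-to-frame changes suggest consonants."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hanning(frame_len)  # taper frame edges to reduce leakage
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))
    return np.array(spectra)

# Toy input: a pure 440 Hz tone standing in for a steady vowel.
sr = 16000
t = np.arange(sr) / sr                       # one second of samples
spectra = frame_spectra(np.sin(2 * np.pi * 440 * t), sr)
print(spectra.shape)                         # (number of frames, frequency bins)
```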

In fact, the phone does not commit to a single sound unit: it computes a list of likely candidates, and this list is then sent over your internet connection to a large computer server at Apple. And that’s where the interesting part begins.
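A toy illustration of that idea (the phoneme symbols and scores below are invented): instead of keeping only the single best guess for each slice of sound, the recogniser keeps several scored candidates and lets later processing decide.

```python
# Hypothetical scores for one short slice of sound; the phoneme
# symbols and probabilities are invented for illustration.
frame_scores = {"iy": 0.41, "ih": 0.35, "eh": 0.18, "ae": 0.06}

# Committing to the single best unit would discard useful information ...
best = max(frame_scores, key=frame_scores.get)

# ... so the recogniser keeps the top candidates for the server to weigh.
top_candidates = sorted(frame_scores.items(), key=lambda kv: -kv[1])[:3]

print(best)            # iy
print(top_candidates)  # [('iy', 0.41), ('ih', 0.35), ('eh', 0.18)]
```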

To illustrate what happens next, let’s use written words and letters instead of sounds. Consider the two words “TIME” and “World”. You will have no trouble reading them, even though a closer look reveals that the capital “I” in “TIME” and the lowercase “l” in “World” are identical glyphs! In the context of the surrounding letters, however, you can easily identify which letter is meant (after all, “Worid” simply makes no sense). The two possibilities are weighed against our prior knowledge of the language, and the words are correctly identified. By a similar procedure, words are joined into sentences and finally into meaning, always based on prior knowledge. Thus a sophisticated language model helps to make sense of certain constructs while trashing others, at the level of letters, words, sentences and meaning.
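Here is a toy version of such a prior, a minimal sketch rather than anything Siri actually uses: a letter-pair (bigram) model built from a tiny invented word list, which rates “World” as plausible and “Worid” as impossible.

```python
from collections import Counter

# Tiny invented training corpus; a real language model is built
# from enormous amounts of text.
corpus = ["world", "word", "work", "lord", "cord", "windy", "solid"]

# Count letter pairs (bigrams), with ^ and $ marking word boundaries.
bigrams = Counter()
for w in corpus:
    padded = f"^{w}$"
    bigrams.update(zip(padded, padded[1:]))

def plausibility(word):
    """Multiply the counts of every letter pair in the word;
    a single unseen pair drives the score to zero."""
    padded = f"^{word.lower()}$"
    score = 1
    for pair in zip(padded, padded[1:]):
        score *= bigrams[pair]
    return score

print(plausibility("World"))  # 300: every letter pair was seen before
print(plausibility("Worid"))  # 0: the pair 'ri' never occurs in the corpus
```

A real system works with probabilities rather than raw counts, and at the level of whole words and sentences rather than seven toy words, but the principle is the same: constructs never seen before are trashed, and plausible ones win.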

So why did it take so long to develop these machines? An enormous amount of data is needed to construct the language models, and only recently has gathering it become feasible. Large and fast computers are needed too, which is why you need an internet connection when you talk to Siri: the real “understanding” takes place remotely. We may thus finally be reaching the level of HAL 9000 in the famous movie, just a bit after 2001.
