How Some Vibrations in the Air Become a Request to Your Smart Devices

If you’re as discontent as I am when I don’t know how something works, you may at times have found yourself wondering about the inner workings of your various virtual assistants. Alexa, Siri, and countless others have become household names, and we need only call on them to learn the average lifespan of a frog, why people learning a second language have accents, or whatever passing thought crosses our minds at that moment. It can seem like virtual magic, but of course, we know what’s really behind it: ones and zeroes!

More specifically, the process can be broken down into a few distinct steps (perhaps more than you might think). Let’s use Alexa, the virtual interface for Amazon’s Echo devices, as an example. When you ask: “Alexa,” — she’ll light up at this, as she always does, and you’ll realize that if she’s able to listen for her wake word, she’s technically listening all the time. But it’s probably not as creepy as it sounds, and you brush it off — “why do people usually have an accent when they speak a second language?”, the following happens:

  1. Your device picks up the sound waves of your voice.
  2. These sound waves are analyzed and broken down into individual speech sounds.
  3. Those speech sounds are pieced together to determine which words you said.
  4. Those words are formed into a sentence as a request.
  5. This request is sent out to various servers.
  6. A response is received.
  7. Alexa reads this response out to you.
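To make the flow concrete, here is a minimal Python sketch of those seven steps. Every function name here is hypothetical and the "models" are stubbed out; a real assistant runs trained acoustic, language, and dialogue models, mostly in the cloud. Only the shape of the hand-off from step to step is the point.

```python
def capture_audio(utterance):
    """Step 1: the microphone samples sound waves.
    Stubbed: we carry the raw text along instead of real audio."""
    return {"waveform": utterance}

def recognize_phonemes(signal):
    """Steps 2-3: an acoustic model would segment the waveform and
    classify each slice as a phoneme. Faked here for illustration."""
    signal["phonemes"] = ["(phoneme)", "(phoneme)", "..."]
    return signal

def assemble_request(signal):
    """Step 4: phonemes are decoded into words forming the request.
    Stubbed: we just reuse the text we smuggled through."""
    signal["request"] = signal["waveform"]
    return signal

def fetch_answer(signal):
    """Steps 5-6: the request goes out to servers, which answer."""
    return "Here is what I found about: " + signal["request"]

def alexa(utterance):
    """Step 7: the response is spoken (here, returned) to the user."""
    return fetch_answer(assemble_request(recognize_phonemes(capture_audio(utterance))))
```

Calling `alexa("average lifespan of a frog")` walks the whole chain and hands back a response string, just as the list above describes.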

Now, there is certainly a lot going on here, and while we could sum it up as easily as that, the linguist in me wants to dig down deeper, and specifically for this blog today, I want to talk about these first two steps. It’s easy to take for granted that things you say can be understood because as humans, we’re exceptionally good at understanding and parsing out speech sounds. Machines and programs, at least to start with, are not. They have to be taught what to look for, and they don’t hear “Why do people usually have an accent when they speak a second language?”; they are presented with something more akin to the following:

The waveform for “Why do people usually have an accent when they speak a second language?” (Source)

Now if you can read this, congratulations! You’re most likely a machine! Most of us can’t (at least not without a good cheat sheet and some time), but much as our human brains do, your device ‘hears’ the sound waves and records them, then analyzes the recording to parse it out. To a layman this seems like a difficult task, and honestly, sometimes it is! You’ve surely experienced your share of misunderstandings from your various devices, because it IS difficult. Differences between sounds can be slight; people have accents and dialects and different tones and timbres of voice, to say nothing of background noise; and what if it’s a day when your allergies are acting up and you’re all stuffy?
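For a sense of what the machine is actually handed, "recording the sound wave" amounts to storing a long list of amplitude samples, thousands per second. This toy example synthesizes a pure tone instead of capturing a microphone, and the 16 kHz rate is just a common choice for speech audio, not anything device-specific:

```python
import math

SAMPLE_RATE = 16000  # samples per second; a typical rate for speech audio

def sample_tone(freq_hz, duration_s):
    """Return the amplitude samples of a pure sine tone: the same
    kind of number stream a microphone recording produces."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n)]

samples = sample_tone(440.0, 0.01)  # 10 ms of a 440 Hz tone
print(len(samples))   # 160 samples for just one hundredth of a second
print(samples[0])     # 0.0 -- the amplitude at time zero
```

Ten milliseconds of sound is already 160 numbers; a short spoken question is tens of thousands. That stream of amplitudes is the "unreadable" squiggle in the waveform image above.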

…You get the point.

What’s really happening under the hood is that — slight though the distinctions may be — these different sounds can be classified into categories: little buckets that we call phonemes.

Differences in sound waves for different, but similar, sounds

Linguistics tangent time! Phonemes are meaningfully distinct sounds in a language, such as the /p/ sound in ‘pin’ vs. the /b/ sound in ‘bin’. Not all different sounds are phonemes, however: the /p/ sound in ‘pin’ is actually [pʰ], whereas the /p/ in ‘spin’ is not (put your hand in front of your mouth and you may feel the aspiration, the little puff of air, that is expelled when you say ‘pin’ but not ‘spin’). We can’t easily tell the difference off-hand in English, because the two sounds belong to the same phoneme. That’s not always the case! In Hindi, Korean, and many other languages, those two sounds are completely distinct.

There are around 44 recognized phonemes in English (the exact count varies by dialect and analysis), so all your device really has to do is match each segment of that sound wave to whichever phoneme “bucket” fits best.
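Here is a toy illustration of that bucketing. The two "feature" numbers per phoneme are invented purely for this example (a real recognizer classifies spectral features with a trained model, over all of the buckets rather than three of them), but "nearest match wins" is the spirit of it:

```python
# Hypothetical feature centroids for three phoneme "buckets".
# Real systems learn these from data; the values here are made up.
PHONEME_CENTROIDS = {
    "p": (0.9, 0.1),
    "b": (0.7, 0.4),
    "s": (0.1, 0.9),
}

def classify_segment(features):
    """Assign a sound segment to the phoneme bucket whose centroid
    is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))
    return min(PHONEME_CENTROIDS, key=lambda ph: dist(PHONEME_CENTROIDS[ph]))

print(classify_segment((0.85, 0.15)))  # p
print(classify_segment((0.15, 0.85)))  # s
```

The slight differences between sounds become slight differences in distance, and the best-fitting bucket wins; that is also why a stuffy nose or noisy room can nudge a segment into the wrong bucket.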

Then those phoneme blocks are strung together, and you’re looking at a sequence of sounds (not quite letters) that is then parsed into words and a full sentence. But we’ll talk about that later! Next time you ask Alexa or Siri for some help, take a moment to appreciate all their hard work!
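As a tiny preview of that next step, here is a hypothetical greedy lookup against a two-entry pronunciation dictionary. Real systems consult huge lexicons and use a language model to rank whole candidate sentences, but even this sketch shows the idea of turning phoneme runs back into words:

```python
# A made-up, two-word pronunciation dictionary for illustration only.
LEXICON = {
    ("w", "aɪ"): "why",
    ("d", "u"): "do",
}

def phonemes_to_words(phonemes):
    """Greedily match the longest known phoneme run at each position."""
    words, i = [], 0
    while i < len(phonemes):
        for j in range(len(phonemes), i, -1):
            key = tuple(phonemes[i:j])
            if key in LEXICON:
                words.append(LEXICON[key])
                i = j
                break
        else:
            i += 1  # skip a phoneme no dictionary entry starts with
    return words

print(phonemes_to_words(["w", "aɪ", "d", "u"]))  # ['why', 'do']
```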