Google Assistant

Farras Faddila
3 min read · Sep 21, 2021

Hey, Google! Play me a Yoasobi song!

If we weren’t restricted from roaming outside freely due to this pandemic, I would be spending my time with the Google Smart Speaker in the secretary room of my student organization on campus, chilling with my friends while asking the smart speaker to play us some music. This smart speaker is essentially a speaker with a ‘brain’ planted inside. Hence, it is smart. That’s why I can ask it to do things, like play music or tell me today’s weather forecast. The ‘brain’ here is called Google Assistant, and it is an implementation of something you have definitely heard of: Machine Learning.

Smart Speaker

Machine Learning’s uses and capabilities are expanding at a remarkable pace. Something we couldn’t possibly imagine one year ago might be achievable today by utilizing some kind of learning algorithm. I am always amazed by the new technologies this field keeps producing, including this product, Google Assistant, which emerges from a branch of Machine Learning called NLP (Natural Language Processing). The core idea of NLP is to translate what we are saying from our natural language (which is sometimes ambiguous) into a language the machine can understand. The machine must also have some kind of memory that stores the context of the conversation; otherwise, it cannot converse with us naturally.
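To make this concrete, here is a toy sketch (definitely not Google’s actual implementation, and a real assistant would use learned models rather than keyword rules) of the two ideas above: mapping an utterance to a machine-readable intent, and keeping a small memory of context so a bare follow-up question still makes sense. All names here are hypothetical.

```python
def parse_intent(utterance, context):
    """Map an utterance to a structured intent, using a context dict
    to resolve ambiguous follow-ups like 'And tomorrow?'."""
    text = utterance.lower()
    if "play" in text and "song" in text:
        context["last_intent"] = "play_music"
        return {"intent": "play_music", "query": text.split("play", 1)[1].strip()}
    if "weather" in text:
        context["last_intent"] = "weather"
        return {"intent": "weather", "when": "today"}
    if "tomorrow" in text and context.get("last_intent") == "weather":
        # The context memory is what lets a bare follow-up be understood.
        return {"intent": "weather", "when": "tomorrow"}
    return {"intent": "unknown"}

context = {}
print(parse_intent("Hey, play me a Yoasobi song!", context)["intent"])  # play_music
print(parse_intent("What's the weather like?", context)["intent"])      # weather
print(parse_intent("And tomorrow?", context)["when"])                   # tomorrow
```

Without the `context` dict, the third utterance would be unintelligible on its own, which is exactly why the machine needs that conversational memory.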

An example of a conversation with Google Assistant. The source is from Google itself though, and as a wise old man once said, never trust the ads until you’ve seen the product for yourself.

There is one additional step before the translation task: the device must convert our voice into text. Voice, as we all know, is actually a wave with properties such as amplitude and frequency. Just like a standard voice recorder, the speaker first converts our voice from an analog wave into a digital one. This digital wave is then converted into tokens of words (which hopefully match what we said to the speaker), again by using a learning algorithm. This branch of Machine Learning is called Speech Processing, and it has a different purpose from the NLP we discussed before. A Speech Processing model might not need a memory for context, but it must learn how a series of waves is translated into a word, or perhaps how a fragment of a wave is translated into a phoneme.
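The analog-to-digital step can be sketched in a few lines. This is a minimal illustration with an assumed 16 kHz sample rate (a common choice for speech) and a pure sine tone standing in for a voice: we “digitize” the wave by sampling it, then slice the samples into short frames, the kind of fragments a speech model might map to phonemes.

```python
import math

SAMPLE_RATE = 16000   # samples per second; a common rate for speech audio
FRAME_SIZE = 400      # 25 ms per frame at 16 kHz

def sample_wave(freq_hz, duration_s, amplitude=1.0):
    """Digitize an analog sine wave: amplitude * sin(2*pi*f*t) at discrete times."""
    n = int(SAMPLE_RATE * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]

def to_frames(samples, frame_size=FRAME_SIZE):
    """Split the digital signal into fixed-size frames for a speech model."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

signal = sample_wave(freq_hz=440, duration_s=1.0)  # one second of a 440 Hz tone
frames = to_frames(signal)
print(len(signal), len(frames))  # 16000 samples -> 40 frames
```

A real pipeline would extract richer features from each frame (spectrograms, for instance) before the learned model sees them, but the sample-then-frame structure is the same.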

So, a smart speaker is basically a combination of two Machine Learning models. Imagine having it answer us quickly, barely one second after we ask. Besides the two processes above, it might also need to fetch data from a calendar (if we are asking for a schedule), the web (if we are asking for news), or even call an external API. This is a complex process completed in a short amount of time, and I think it’s really cool to have experienced this kind of cutting-edge technology.
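Putting the pieces together, the whole flow might look roughly like this. Everything here is a stubbed, hypothetical sketch: in reality the first two stages are the learned Speech Processing and NLP models, and the last stage calls real services.

```python
def speech_to_text(audio):
    """Stage 1, Speech Processing model: waveform -> text (stubbed here)."""
    return "play me a yoasobi song"

def understand(text):
    """Stage 2, NLP model: text -> structured intent (stubbed here)."""
    return {"intent": "play_music", "query": "yoasobi"}

def fulfill(intent):
    """Stage 3: dispatch to an external source -- calendar, web, music API."""
    handlers = {
        "play_music": lambda q: f"Playing {q} on the speaker",
        "weather": lambda q: "Fetching today's forecast",
    }
    handler = handlers.get(intent["intent"], lambda q: "Sorry, I didn't get that")
    return handler(intent.get("query"))

def assistant_pipeline(audio):
    text = speech_to_text(audio)   # model 1: Speech Processing
    intent = understand(text)      # model 2: NLP
    return fulfill(intent)         # external data / API call

print(assistant_pipeline(audio=None))  # Playing yoasobi on the speaker
```

That the real thing runs all three stages, including network round-trips, in about a second is what makes it feel like magic.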

The music stops playing.

Okay, Google! Thank you for playing music for me. Hey, what are you… are you blushing? oh f-
