Hey Siri, who are you? — Product Case Study of Apple’s AI Voice Assistant

Ola Bąk
10 min read · Oct 12, 2022

If you have ever used Apple products, then you have probably had a chance to ‘meet’ Siri. But have you ever wondered how Siri actually works? How is it able to ‘understand’ what you are saying and then respond accordingly? Let’s find out! 🔍

What are we going to talk about?

🤖 Who is Siri?

🖐🏼 What can Siri do?

🦴 Siri’s X-ray: underlying mechanisms of Apple’s voice assistant

🛏 How does Siri ‘wake up’?

🧠 How does Siri ‘understand’?

🗣 How does Siri generate responses?

🌓 Is Siri a ‘good guy’ or a ‘bad guy’? — Siri’s bright and dark sides

What (or rather who) is Siri?

  • Siri was originally developed by Siri Inc., a spin-off of SRI International, and released in 2010 as a standalone app; Apple acquired the company that same year and integrated Siri into iOS in 2011 with the premiere of the iPhone 4S.
  • Siri was the first major AI-powered voice assistant popularized on a large scale that was capable of interpreting human speech, generating responses, and performing multiple tasks.
  • Siri’s main ‘rivals’ include Alexa (created by Amazon), Cortana (created by Microsoft), and Google Assistant (quite obviously, created by Google).
  • Siri can be interacted with through Apple devices such as iPhone, iPad, MacBook, Apple Watch, or HomePod.
  • By default, Siri is assigned a female voice. And although it is difficult to ascribe a specific gender to a computer-generated agent, the female-sounding default causes most people to refer to Siri as a ‘she’ (even though it is possible to switch Siri’s voice to a male one).

What can Siri do?

With Siri, users can carry out actions such as asking questions, sending messages, calling, asking for directions, setting alarms, playing music, or even controlling home appliances (such as air-conditioners, doorbells, locks, lights, sprinklers, and many more).

Siri can also communicate in an astonishing 21 languages (as of 2022).

In addition, Siri adapts to the user’s behavior over time and yields increasingly personalized results. This personalization is enabled by a system that analyzes the user’s general habits, frequently used keywords, and language choices.

Siri’s X-ray: underlying mechanisms of the voice assistant

To understand how Siri works, it is critical to outline a couple of key terms behind its mechanisms. The key enabling technologies that lie at the ‘heart’ of this AI-powered conversational agent are:

  • Natural Language Processing (NLP): an area of artificial intelligence that focuses on the interaction between humans and machines through language¹.
  • Speech Recognition: technology that analyzes the grammar, syntax, structure, and composition of voice signals in order to recognize and process human speech².

These technologies enable Siri to learn, understand, and produce human language owing to Deep Learning techniques that imitate the way humans gain certain types of knowledge through learning from large amounts of data.

How does Siri ‘wake up’?

Siri can be activated either by holding a button on an Apple device or hands-free through the verbal command “Hey Siri”. Waking Siri with spoken words alone is possible thanks to a speech recognizer that is continually listening for these two words. The voice detector leverages a deep neural network to convert the acoustic pattern of the user’s voice into a numerical representation, which is then analyzed and assigned a confidence score that determines whether Siri should activate. Below we can see a ‘voiceprint’ of the phrase “Hey Siri what…” between the blue vertical lines — when Siri ‘hears’ a phrase that ‘draws’ a similar pattern in the voice recognition system, it starts calculating whether it should ‘wake up’.

The acoustic pattern of the phrase “Hey Siri what…” between the blue vertical lines (source: Apple).

If the score calculated from the voiceprint is not high enough to wake Siri outright but comes close to the activation threshold, the system enters a high-sensitivity mode and ‘listens more attentively’ for a while — this ‘second chance mechanism’ prevents Siri from failing to activate when the user actually intended it to³. Additionally, Siri was trained to activate not only within arm’s reach but also from across the room, which is why Apple prepared it to work in various conditions by training it in different environments such as the kitchen (close up and far away), the car, the bedroom, and restaurants, and even with different accents⁴.
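
To make the ‘second chance mechanism’ more tangible, here is a minimal Python sketch of a two-threshold wake-word detector. The thresholds, the time window, and the scoring interface are all invented for illustration — this is not Apple’s actual code or its actual values:

```python
# Toy sketch of a two-threshold wake-word detector with a
# 'second chance' mechanism. All numbers are illustrative.

MAIN_THRESHOLD = 0.85    # score above this -> activate immediately
SECOND_CHANCE = 0.60     # score above this -> near miss, listen harder
SENSITIVE_WINDOW = 5.0   # seconds of lowered threshold after a near miss

class WakeWordDetector:
    def __init__(self):
        self.sensitive_until = 0.0  # end of the high-sensitivity window

    def on_score(self, score: float, now: float) -> bool:
        """Return True if the assistant should wake up."""
        threshold = MAIN_THRESHOLD
        if now < self.sensitive_until:
            threshold = SECOND_CHANCE  # 'listening more attentively'
        if score >= threshold:
            return True
        if score >= SECOND_CHANCE:
            # Near miss: lower the bar briefly so a repeated
            # "Hey Siri" is not ignored.
            self.sensitive_until = now + SENSITIVE_WINDOW
        return False

detector = WakeWordDetector()
print(detector.on_score(0.70, now=0.0))  # False, but enters sensitive mode
print(detector.on_score(0.70, now=2.0))  # True, second chance applied
```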

In addition, Siri leverages the ‘federated learning’ technique, which helps ensure that only the owner of the device can activate Siri. This privacy-preserving method trains the algorithm solely on locally available data that never leaves the owner’s device, and sends only the resulting model updates back to central servers to perfect the ‘master model’, which makes the assistant continually better at recognizing the correct speaker⁵.
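
Here is a hedged sketch of the federated idea: plain federated averaging of a toy linear model with NumPy. The model, the data, and the training step are illustrative assumptions; Apple’s actual pipeline is far more involved:

```python
import numpy as np

# Toy federated averaging: each device trains on local data that
# never leaves it, and only updated model weights are sent back.

def local_update(global_weights, local_x, local_y, lr=0.1):
    """One gradient step of linear regression on on-device data."""
    pred = local_x @ global_weights
    grad = local_x.T @ (pred - local_y) / len(local_y)
    return global_weights - lr * grad

rng = np.random.default_rng(0)
global_w = np.zeros(3)
devices = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]

for _ in range(10):
    # Each device computes an update locally; raw data stays put.
    updates = [local_update(global_w, x, y) for x, y in devices]
    # The server only ever sees model weights, which it averages
    # into the improved 'master model'.
    global_w = np.mean(updates, axis=0)

print(global_w)
```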

How does Siri ‘understand’?

After activation, Siri proceeds to ‘listen’ to the user’s spoken query. Siri’s ability to ‘understand’ rests on its speech recognition mechanism. During this process, the words uttered by the user are converted into speech patterns and broken down into segments, the segments are mapped to syllables, and lastly each syllable is matched against particular wave patterns, which enables Siri to decode what the user has said⁶.
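
As a toy illustration of that matching step, the sketch below assigns invented 2-D ‘feature vectors’ of audio segments to the nearest known sound unit. Real recognizers use learned acoustic models over spectral features, so treat these templates purely as stand-ins:

```python
import numpy as np

# Toy matching of acoustic segments to sound units. The 2-D
# 'templates' and segment features below are invented.
PHONEME_TEMPLATES = {
    "HH": np.array([0.9, 0.1]),
    "EY": np.array([0.2, 0.8]),
    "S":  np.array([0.7, 0.7]),
}

def nearest_phoneme(frame: np.ndarray) -> str:
    """Assign a segment's features to the closest known sound unit."""
    return min(PHONEME_TEMPLATES,
               key=lambda p: np.linalg.norm(frame - PHONEME_TEMPLATES[p]))

segments = [np.array([0.85, 0.15]), np.array([0.25, 0.75])]
print([nearest_phoneme(s) for s in segments])  # ['HH', 'EY']
```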

As comprehending the meanings of words is essential for inferring the logical sequence of consecutive words, Siri’s understanding relies on the concept of word embeddings — a technique that assigns each word a vector of values, where each value measures the degree to which the term is linked to a specific context⁸. This concept can be illustrated with the word embeddings graph below, which shows how words with similar meanings tend to occupy similar areas of the space:

Example word embedding plot showing how words with similar meanings tend to occupy similar areas of the graph.
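
The sketch below makes the idea concrete with invented 2-D vectors: words used in similar contexts end up with high cosine similarity, while unrelated words do not. None of these numbers come from Siri:

```python
import numpy as np

# Toy word embeddings: the 2-D vectors are invented purely to
# show how 'similar meaning' becomes 'similar vectors'.
EMBEDDINGS = {
    "umbrella": np.array([0.9, 0.8]),
    "rain":     np.array([0.85, 0.9]),
    "alarm":    np.array([-0.7, 0.2]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(EMBEDDINGS["umbrella"], EMBEDDINGS["rain"]))   # high
print(cosine_similarity(EMBEDDINGS["umbrella"], EMBEDDINGS["alarm"]))  # low
```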

Siri’s powerful comprehension is enabled by its capability to interpret words in context, owing to an enormous database containing millions of text patterns and keywords⁶. This allows Siri not only to comprehend different accents and tones, but also to understand particular concepts without the user having to name them explicitly. As an example, if we ask “Hey Siri, should I take an umbrella with me?”, Siri will still recognize the concept of ‘weather’ without being asked specifically whether it will rain. This mechanism works through Siri’s capacity to search for patterns and correlated words.
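
One hedged way to picture this pattern-and-correlated-words mechanism is a toy intent matcher that maps a query to a concept via keyword associations, even when the concept itself is never named. The associations below are invented:

```python
# Toy intent matcher: correlated keywords point to a concept even
# when the query never names it. The keyword sets are invented.
INTENT_KEYWORDS = {
    "weather": {"umbrella", "rain", "sunny", "forecast", "jacket"},
    "timer":   {"alarm", "wake", "remind", "minutes"},
}

def infer_intent(query: str) -> str:
    words = set(query.lower().replace("?", "").replace(",", "").split())
    # Pick the intent whose keyword set overlaps the query most.
    return max(INTENT_KEYWORDS,
               key=lambda intent: len(words & INTENT_KEYWORDS[intent]))

print(infer_intent("Hey Siri, should I take an umbrella with me?"))  # weather
```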

It is also worth emphasizing that Siri’s ‘understanding’ cannot be interpreted in the same way as the traditional definition of human comprehension. For example, when we ask Siri to add a “liter of books” to a grocery list, it will perform this task without hesitation, which attests to Siri’s lack of an ontological understanding of solid and liquid objects and demonstrates that we are still quite far from achieving Artificial General Intelligence capable of performing intellectual tasks as complex as those humans perform⁷.

How does Siri generate responses?

Two modules are vital for generating a response: the encoder (which takes in the request and processes it into a numerical representation), and the decoder (which uses the data from the encoding process to generate the response).
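
A conceptual sketch of this encoder/decoder split might look as follows. The vocabulary vectors and canned responses are invented, and real systems use learned neural networks at both ends:

```python
import numpy as np

# Conceptual encoder/decoder sketch: the 'encoder' folds the request
# into one vector; the 'decoder' turns that vector into a response.
VOCAB = {"weather": np.array([1.0, 0.0]), "alarm": np.array([0.0, 1.0])}
RESPONSES = {
    "weather": "It looks like rain today.",
    "alarm": "Your alarm is set.",
}

def encode(words):
    """Average the word vectors into one request representation
    (assumes at least one word is in the toy vocabulary)."""
    return np.mean([VOCAB[w] for w in words if w in VOCAB], axis=0)

def decode(request_vector):
    """Map the encoded request to the closest known response."""
    best = max(VOCAB, key=lambda w: float(request_vector @ VOCAB[w]))
    return RESPONSES[best]

print(decode(encode(["set", "an", "alarm"])))  # Your alarm is set.
```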

Siri’s speech generation builds on a preceding process of capturing 10–20 hours of voice recordings from a professional speaker in a studio. It is also important to note that the recordings contain varied material, ranging from manuals to jokes, in order to cover the whole spectrum of vocal intonations. Response generation is then enacted through text-to-speech synthesis, which slices this pre-recorded speech into basic units and rearranges them to create new sentences, as presented in the image below³:

Visualization of the speech generation process based on slicing and rearranging pre-recorded audio.
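
In code, the concatenative idea can be sketched like this, with strings standing in for pre-recorded audio clips (real systems splice actual waveforms and smooth the joins):

```python
# Toy unit-selection synthesis: pre-recorded snippets are sliced
# into units and rearranged into new sentences. The inventory of
# 'units' below is invented; strings stand in for audio clips.
RECORDED_UNITS = {
    "it": "[audio:it]", "is": "[audio:is]",
    "sunny": "[audio:sunny]", "today": "[audio:today]",
}

def synthesize(text: str) -> str:
    """Concatenate the stored unit for each word of the response."""
    return " + ".join(RECORDED_UNITS[w] for w in text.lower().split())

print(synthesize("It is sunny today"))
# [audio:it] + [audio:is] + [audio:sunny] + [audio:today]
```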

In response to the user’s inquiry, Siri can also perform different tasks, which can be classified as local (within the scope of Apple’s ecosystem, such as setting an alarm) and global (beyond this scope, such as checking the weather in Tokyo). Each query is sent over the Internet to a central computing system, where it is analyzed, processed, and met with a proper response.
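
A toy classifier for this local/global split might look as follows; the task sets are assumptions chosen to mirror the examples above:

```python
# Toy classifier for the local/global task split. The task sets
# are illustrative assumptions, not Siri's actual taxonomy.
LOCAL_TASKS = {"set_alarm", "send_message", "play_music"}      # Apple ecosystem
GLOBAL_TASKS = {"weather_lookup", "web_search", "directions"}  # external scope

def classify_task(task: str) -> str:
    if task in LOCAL_TASKS:
        return "local: within Apple's ecosystem"
    if task in GLOBAL_TASKS:
        return "global: requires external services"
    return "unknown task"

print(classify_task("set_alarm"))        # local: within Apple's ecosystem
print(classify_task("weather_lookup"))   # global: requires external services
```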

Is Siri a ‘good guy’ or a ‘bad guy’?

According to Moore’s law, computational power doubles roughly every two years, which has led some to predict that over time AI agents like Siri will outperform humans in all (or most) tasks⁷. As the level of advancement and the popularity of voice-based intelligent assistants grow rapidly among consumers and businesses, it becomes increasingly important to consider the societal implications of this phenomenon — both positive and negative. And since communication between humans and machines constitutes a cultural process rather than a mere exchange of information, it is important to investigate aspects such as the context in which human-assistant communication takes place, the potential biases embedded in these systems, and the impact this technology can exert on individuals.

How is Siri contributing to social good?

There are many aspects attesting to Siri’s positive impact on society. Firstly, Siri can assist individuals in performing multiple tasks, which makes it a valuable tool for increasing efficiency in people’s daily lives. Touchless interaction enables users to carry out actions such as finding directions without excessively occupying their attention — which can be especially valuable in situations that require the user’s focus, such as driving a car.

Furthermore, voice-based agents such as Siri can lower the barriers in human-computer interaction for people who, for instance, have limited mobility or vision. For such individuals, the possibility of activating Siri hands-free and without the need to read from a small screen can be extremely beneficial. For users for whom reading and typing constitute an obstacle, such conversational assistants can “bridge the information gap” and help them acquire knowledge through the Internet.

Voice-based conversational agents also have important business implications, as it has been shown that providing users with personalized assistance in real time is intrinsically intertwined with customer satisfaction. Nevertheless, apart from the benefits of incorporating conversational agents into business models, companies also need to take into account the peril of consumers’ lack of trust in AI-powered assistants.

Other ways in which conversational agents like Siri can contribute to social good include their potential to revolutionize language translation, provide companionship, or help the elderly interact with technology.

Siri’s Dark Side(s)

It must not be forgotten that Siri also has its dark sides, which need to be taken into consideration when talking about AI-enabled voice assistants.

Firstly, as mentioned before, for Siri to activate hands-free it must continually listen for the “Hey Siri” phrase, which tends to raise security and privacy concerns among consumers. As Siri collects information about users’ personal lives, it is important that individuals’ data is appropriately secured. In times of rising concern about data privacy (further exacerbated by events such as the Cambridge Analytica scandal), Apple strives to reassure its customers by implementing multiple security measures: allowing only the device’s owner to activate Siri, the previously explained federated learning system, and techniques such as adding noise to raw data before feeding it into machine learning models, which hinders malevolent actors’ ability to decipher the original audio recordings⁹.
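
As a hedged illustration of the noise-adding technique (local differential privacy in spirit), the sketch below perturbs each raw value with Laplace noise before it would leave the device; the scale parameter is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(42)

def privatize(value: float, scale: float = 0.5) -> float:
    """Return the value with Laplace noise added before it
    would be used for machine learning."""
    return value + float(rng.laplace(0.0, scale))

raw_measurements = [0.8, 0.3, 0.5]
noisy = [privatize(v) for v in raw_measurements]
# Individual values are obscured, but aggregates stay useful:
print(sum(noisy) / len(noisy), "vs true mean", sum(raw_measurements) / 3)
```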

Secondly, in accordance with the Computers Are Social Actors (CASA) paradigm, which assumes that people assign human personalities to AI agents and apply stereotypes to them just as in traditional human-human interaction¹⁰, the sheer fact that Siri has a female voice by default also contributes to the formation of bias. As noted in a study on “The gendering of Virtual Personal Assistants”¹¹, assigning a female gender to voice-based agents can cause societal damage, since it contributes to the replication of stereotypical assumptions about women’s roles as submissive and blindly following commands. In the face of this, it can be argued that every decision about the design and functionality of conversational agents like Siri entails an array of societal implications.

The underlying mechanisms behind Siri are so complex that one could write a whole book about them and it would probably still not be enough. Nonetheless, I hope this short overview of Siri’s functionality will help you better understand what is happening ‘behind the scenes’ the next time you interact with an AI-enabled voice assistant!

Thank you for reading! I truly appreciate all feedback regarding the article, so feel free to share your thoughts in the comments below.
And if you liked it, you might also enjoy my other article on the topic of text-based generative AI 🤖

