Can Siri Help Chatbots Become More Human?

Mastering contextual intonation and location-specific colloquialisms: how Siri tries to ‘get language right’

Isabella Mandis
Wikitongues
8 min read · Jun 22, 2020


In October 2011, Apple debuted the Siri application on iOS with the iPhone 4S. Representing the latest developments in mainstream AI technology, Siri was intended to make users’ lives simpler by offering voice-prompted shortcuts, such as placing a call or sending a text, as well as facilitating quick and effective information retrieval. Since then, many other personal assistant applications have emerged, such as Amazon’s Alexa and Google’s Assistant. Furthermore, chatbot features, which rely on the same underlying technologies, have become a regular and effective addition to many websites that seek to streamline customer service. Siri is no longer a novelty; in fact, its competitors have outpaced its technology in terms of functionality and market share. That being said, Siri remains the leader in understanding the variations of human speech patterns and producing an interactive experience that feels less robotic to its users, and more like a natural conversation.

How Does Siri Work?

Siri’s development was made possible thanks to the great boom in artificial intelligence and machine learning techniques that emerged in the early 2000s. The New York Times technology reporter Clive Thompson observed that Yann LeCun’s experiments with neural networks, a key component in Siri’s functioning, attracted a great deal of attention to the field of AI during the 1980s, but “after some small bursts of excitement…neural nets lapsed into their own ‘AI winter.’” The initial problem was that these neural nets, which simulate the way the brain processes thoughts, required processing power that exceeded the limits of the vast majority of computers in existence at the time.

By the early 2000s, however, computing power had grown exponentially, and researchers no longer required supercomputers to work on complex problems such as language modeling, parsing, and semantics. By 2010, the entire groundwork for Siri’s functioning was in place, and the world took notice of these advances in 2011, when IBM’s AI bot Watson defeated two longtime Jeopardy! champions at their own game.

Watson stage replica in Jeopardy! contest, Mountain View, California via Wikimedia Commons, credit https://www.flickr.com/photos/atomictaco/12935316785/

Upon receiving a vocal prompt from a user, the Siri application transcribes the spoken words, which it captures as sound waves, into a textual format. It ‘finds’ the words in the user’s spoken command by detecting the micro-pauses between utterances. From there, it determines which word each of these sound segments represents through algorithmic calculation. In short, the application maps the sounds of consonants and vowels into syllables and words by comparing inputs to a vast store of data and then calculating the probability of a word match. Siri does not consider a word’s sound wave alone; it also intuits grammar and syntax based on the probability and logic of the string of words that are put together to form a statement or command. It is particularly attuned to recognizing command words, such as ‘send,’ ‘write,’ ‘call,’ and ‘find,’ in order to construct the logic of the rest of the statement that accompanies the user’s request. All of these calculations are executed on Apple-operated servers, which return a textual response in milliseconds. Once it completes this process, Siri converts the text into speech, which the user experiences as a spoken reply to their request.
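To make that pipeline concrete, here is a deliberately toy sketch in Python of the steps just described: splitting an utterance on pauses, matching each segment to the most probable word in a reference store, and scanning for a command word to anchor the request. It uses difflib’s SequenceMatcher as a crude stand-in for acoustic scoring, and the segments, vocabulary, and command list are invented for illustration; none of this reflects Apple’s actual models or data.

```python
# A toy sketch (not Apple's pipeline) of the steps described above:
# segment an utterance on micro-pauses, match each segment to the most
# probable word in a reference store, then look for a command word.
# All data here is invented for illustration.
from difflib import SequenceMatcher

# Pretend 'acoustic transcriptions' of each segment, already split on pauses.
segments = ["kall", "mom", "on", "her", "sell", "fone"]

# A tiny stand-in for the vast server-side store of known word forms.
vocabulary = ["call", "tell", "mom", "mon", "on", "in", "her", "here",
              "cell", "sell", "phone", "fun"]

COMMAND_WORDS = {"send", "write", "call", "find"}

def best_match(segment):
    """Return the vocabulary word most 'acoustically' similar to the segment."""
    return max(vocabulary,
               key=lambda word: SequenceMatcher(None, segment, word).ratio())

# Map each sound segment to its most probable word.
words = [best_match(seg) for seg in segments]

# Look for a command word to anchor the logic of the request.
intent = next((w for w in words if w in COMMAND_WORDS), None)

print("transcript:", " ".join(words))   # "call mom on her sell phone"
print("detected command:", intent)      # "call"
```

Note that sound-alike matching cannot tell ‘sell’ from ‘cell’; resolving that kind of ambiguity is the job of the language modeling discussed next.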

The two main technologies used in Siri’s operation are voice recognition and natural language processing (NLP). The challenges to voice recognition are obvious: human beings have an incredibly broad and complex range of ways of pronouncing and phrasing even the most seemingly basic statement or command. To address this complication, Siri relies on Apple’s servers, which contain an enormous data sample to account for variations in pronunciation and vocabulary. The NLP algorithms, which are executed on the server side, work with decades’ worth of data and calculations stored in both public databases and proprietary compilations of language in order to compute and produce meaningful interaction. Over the last few years, the entire architecture on the server side has been reworked to speed up responses and broaden the system’s understanding of human accents. As David Pierce reported for WIRED, “A few years ago, the team at Apple, led by Acero, took control of Siri’s back-end and revamped the experience. It’s now based on deep learning and AI, and has improved vastly as a result. Siri’s raw voice recognition rivals all its competitors.”
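The deep learning Pierce describes is far beyond a blog-sized example, but the basic idea that server-side language modeling resolves acoustic ambiguity can be shown with a toy bigram model. Everything below, including the probabilities, is invented for illustration; it is a stand-in for the statistical reasoning, not a description of Apple’s system.

```python
# A minimal sketch of why language modeling matters on the server side:
# given two acoustically similar transcripts, prefer the one whose word
# sequence is more probable. The probabilities are invented toy values.
from math import log

# Toy bigram probabilities P(next word | previous word).
BIGRAMS = {
    ("her", "cell"): 0.02,   ("cell", "phone"): 0.30,
    ("her", "sell"): 0.001,  ("sell", "phone"): 0.0005,
}
DEFAULT = 1e-6  # floor probability for unseen word pairs

def sequence_log_prob(words):
    """Sum of log bigram probabilities over consecutive word pairs."""
    return sum(log(BIGRAMS.get(pair, DEFAULT))
               for pair in zip(words, words[1:]))

candidates = [
    ["call", "mom", "on", "her", "cell", "phone"],
    ["call", "mom", "on", "her", "sell", "phone"],
]

best = max(candidates, key=sequence_log_prob)
print(" ".join(best))   # "call mom on her cell phone"
```

Modern systems replace such hand-written tables with neural networks trained on enormous corpora, but the principle of scoring whole word sequences rather than isolated sounds is the same.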

The Origins of Siri

Siri existed in a rudimentary form before its appearance on Apple’s iPhone 4S in 2011. In fact, Siri began to take shape as early as 2005, and was initially available as a downloadable app through the App Store in 2010. Its original purpose was to serve as a ‘concierge’ for travel and entertainment. Apple acquired Siri Inc., the app’s maker and a spin-off of SRI International’s artificial intelligence research, in 2010, and immediately began incorporating it into the iOS architecture. From there, Siri became proprietary software and over time underwent a complete architectural restructuring of its code, expanding into more languages and becoming a more human-sounding, conversational virtual assistant. As ZDNet reported in 2018, Apple is playing the “long game” with Siri: “Indeed, the company has gone to significant lengths to ensure that the products it does integrate with…understand a relatively broad vocabulary.”

What Makes Siri Different from Other AI Assistants?

As Siri’s competitors have made great strides in terms of knowledge base and connectivity to an ever-growing number of apps, services, and devices, Siri’s developers have remained more focused on having Siri ‘get language right.’ Thanks to deeper integration with cross-platform services and more extensive mining of the data individuals produce on their devices, many other AI chatbots are able to beat Siri at trivia or find more precise results in an internet search. Nevertheless, while Siri’s list of new functionalities has not grown extensively over the course of nearly a decade, users may have noticed that the application’s voice has become significantly less robotic-sounding.

Contextual intonation is a great example of Siri’s ability to emulate human speech: depending on the surrounding syntax, humans might put a completely different emphasis on the syllables of a given word, making the same word sound entirely different. Speakers of American English, for instance, tend to inflect their voices upward when a word marks the end of a question, while they speak with flatter intonation when that same word appears in the middle of a sentence. Whereas earlier chatbots failed to mimic these changes in inflection, tone, and syllable stress, the current version of Siri has learned to sound more human: Siri takes more natural pauses in sentences and elongates the last syllable before introducing a pause, as humans typically do in all of the languages that Siri speaks. Furthermore, the application’s developers are at the forefront of distinguishing the user’s voice and matching accent, local vocabulary, and dialectal specificities to Siri’s operations.
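As a rough illustration of contextual intonation, the hypothetical Python sketch below tags each word with a pitch contour and a duration multiplier depending on whether it ends a question or precedes a pause. The rules and numbers are simplistic placeholders rather than Apple’s prosody model, but they capture the two behaviors described above: rising pitch on question-final words and elongation before a pause.

```python
# A rough, hypothetical illustration of contextual intonation: assign each
# word a pitch contour and duration based on its position and on whether
# the sentence is a question. The values are arbitrary placeholders.

def annotate_prosody(sentence):
    is_question = sentence.strip().endswith("?")
    words = sentence.strip(" ?.!").split()
    annotated = []
    for i, word in enumerate(words):
        last = (i == len(words) - 1)
        annotated.append({
            "word": word,
            # Rising pitch on the final word of a question, flat otherwise.
            "pitch": "rising" if (last and is_question) else "flat",
            # Elongate the word before a pause, as human speakers tend to.
            "duration": 1.4 if last else 1.0,
        })
    return annotated

for token in annotate_prosody("Is it raining in Boston?"):
    print(token)
# The final word "Boston" gets rising pitch and a longer duration; the same
# word in the middle of "Boston is rainy today." would stay flat.
```

A real text-to-speech system works at the level of phonemes and continuous pitch curves, but the contrast between a question-final word and the same word mid-sentence is exactly the kind of distinction this logic gestures at.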

Photo by Rahul Chakraborty on Unsplash

Apple claims that, in its current form, Siri correctly identifies 95 percent of its users’ speech, a higher rate than any of its competitors. Although Google is catching up to Siri’s capabilities, for the moment no other company puts greater emphasis on producing data sets that teach the application about linguistic variability. This effort often takes the form of data compilation and mining: when planning to expand Siri’s reach into a new market with its own local dialect and colloquialisms, Siri’s team will first identify and access any pre-existing databases of speech in that region. They then hire local voice talent and ask them to read material out loud so as to capture the patterns of local speech.
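To picture what that data compilation might look like in practice, here is a purely hypothetical sketch of a record for one scripted recording from local voice talent. Every field name, locale code, and value is invented for illustration; Apple’s internal tooling and data formats are not public.

```python
# A hypothetical record for one scripted recording collected from local
# voice talent when preparing a new market. All fields and values are
# invented for illustration; Apple's actual tooling is not public.
from dataclasses import dataclass, asdict
import json

@dataclass
class SpeechSample:
    speaker_id: str        # anonymized local voice talent
    locale: str            # e.g. "en-IN"
    dialect_region: str    # finer-grained regional label
    prompt_text: str       # the scripted line read aloud
    audio_path: str        # where the recording is stored

sample = SpeechSample(
    speaker_id="spk_0042",
    locale="en-IN",
    dialect_region="Mumbai",
    prompt_text="Set an alarm for half past six.",
    audio_path="recordings/en-IN/spk_0042/0001.wav",
)

print(json.dumps(asdict(sample), indent=2))
```

Recordings catalogued this way could then be used both to test recognition accuracy for a new dialect and to teach the voice output to reproduce its patterns.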

Once Siri has expanded into a given area, Apple continues to collect data from its users to check Siri’s understanding and emulation of speech patterns. Apple’s teams also regularly update Siri’s system with evolving sayings unique to a given location so as to remain relevant to the locality’s vocabulary, rather than simply applying the conventions of other areas that share the same dominant language. Thus, Apple and, more specifically, Siri distinguish themselves from competitors by having their systems conform to the contours of human speech, rather than putting the onus on humans to form phrases that are easily understood by computers.

As of June 2020, Siri operates in 21 languages and supports dialects for 7 of those languages. Google Assistant has surpassed Siri in terms of the number of languages it “understands” (as of 2020, up to 44); however, Google Assistant is a more rudimentary chatbot function available only on phones, and Google Home, the fuller and more robust version of Google’s AI chatbot, supports only 13 languages and offers dialect support for only 4 of them.

Is Siri’s superiority due to the fact that it is more ‘sensitive’ to human concerns that link language to identity? The answer is both yes and no. Apple’s work on Siri is undoubtedly focused on building a better product according to the company’s priorities, namely its profits. While far ahead of its competitors, Apple focuses its multilingual capacities on the markets where its comparatively high-priced products have the greatest potential. Furthermore, Apple’s work on accents and dialects is concerned more with variations within a native speaker’s language than with non-native speakers who address the chatbot with an accent from another language. Without a doubt, chatbots still fall far short in terms of accessibility in our linguistically diverse landscape. Siri may be the most linguistically sophisticated chatbot at the moment, but its use is limited to Apple products, which are available only to the most economically privileged people in the world.

Regardless of company, the fact remains that chatbots are built on research that is done mostly for profit. Which languages are taught and featured is driven by commercial considerations, rather than by cultural matters such as collective memory or community values. The languages that Siri learns are consequently chosen on the basis of market trends, rather than out of interest in accessibility or inclusion. In this way, Siri is yet another example of how AI, in its robotic neutrality, can ultimately reproduce the inequalities of this world. Still, although promoting linguistic diversity may not be the company’s focus, the application’s contribution to linguistics should not be overlooked. One can hope that the research and efforts that Apple has dedicated toward both understanding and producing more human chatbot speech may ultimately provide a foundation for other projects aimed at sustaining languages across the globe.

If you would like to donate to support the work of Wikitongues or if you would like to get to know our work, please visit wikitongues.org. To watch our oral histories, subscribe to our YouTube channel or visit wikitongues.org to submit a video.
