Why digital assistants are only good for car commuters and lonely wolves

Jaroslav Gergic
Dec 2, 2022


Google Assistant, Siri, Alexa, Bixby, and Cortana do not seem to live up to their promise. Why is that?

There was big hype around digital assistants a few years ago, and every major technology company scrambled to introduce its own digital assistant and build an ecosystem around it. Despite huge investments over the years, the most common uses of digital assistants seem to be the modern-day equivalents of (1) a jukebox, (2) a hands-free cooking timer, and (3) a collection of neat party tricks to showcase our smart gadgets to friends.

I believe there are two sets of reasons for this situation, one social and the other technological. But before we dive into those reasons, let’s go back for a history lesson.

Once upon a time…

It is hard to believe it has been twenty years since I worked on multi-modal interaction systems combining GUI and speech as a member of the IBM Research teams in Prague, CZ, and Yorktown Heights, NY. We were developing frameworks for designing and building applications and user experiences that combined GUIs on desktop and mobile devices with speech recognition and speech synthesis.

The idea was to develop an application and its business logic once, using a multi-modal authoring framework, and render it in multiple user interfaces (modalities): a desktop HTML web UI, a PDA UI such as the iPAQ, mobile WAP, or voice (VoiceXML), each separately or several modalities combined at a time. For example, we would build a touchscreen GUI with the ability to search, navigate, and fill in forms using voice.

IBM T. J. Watson Research Center (2002)

Back in the day, we often struggled with speech recognition accuracy and with the fact that most of the available devices were not designed with speech interaction in mind. Audio quality suffered, and recognition did not work well in environments with background noise and other real-world conditions.

The voice-controlled interaction blocks usually did not employ natural language understanding (NLU), but rather relied on declarative grammar formats such as SRGS or JSGF. Still, with enough care for grammar design, we could build apps that demoed really well and made an NLU-like impression on the uninitiated. For example, it was possible to say sentences like:

“I want to fly from New York to San Francisco on Monday morning.”

and the grammar would capture the from, to, and when information in a single turn. It surprises me that, despite the limitations at the time, the applications we developed and prototyped were not that much different from what we experience today.
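To give a flavor of how this worked, here is a minimal JSGF-style sketch. It is a reconstruction for illustration only, not our original IBM grammar; the grammar and rule names are made up:

```
#JSGF V1.0;
grammar flightBooking;

// The recognizer's parse tree tells the application which rule matched
// each word, so a single utterance can fill the from, to, and when slots.
public <request> = [I want to | I would like to] fly
                   from <fromCity> to <toCity> [on <day> [<daypart>]];

<fromCity> = <city>;
<toCity>   = <city>;
<city>     = New York | San Francisco | Boston | Chicago;
<day>      = Monday | Tuesday | Wednesday | Thursday | Friday;
<daypart>  = morning | afternoon | evening;
```

Anything outside the grammar simply failed to be recognized, which is exactly the brittleness that separated these systems from real NLU.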

The most popular use cases today, such as voice-enabled navigation, a voice jukebox, movie search, phone-book search, or a touchscreen-and-voice-powered information kiosk, are the very ones we had the biggest success with twenty years ago. Using voice and speech recognition to search a large dataset, retrieve the top N entries, and then disambiguate the results by tapping on a touchscreen proved to be a very promising interaction pattern back in the day.

Nexus One — Wikipedia (en.wikipedia.org)
The Nexus became available on January 5, 2010, and features the ability to transcribe voice to text, an additional microphone for dynamic noise suppression, and voice guided turn-by-turn navigation to drivers.

…and back to the present future

Many things have changed since then. The biggest improvements came in the areas of speech recognition and speech synthesis. It started with the very first Google phone, the Nexus One, back in 2010, which introduced a secondary microphone (now common on almost all cell phones) to remove background noise and record clearer spoken audio, a precondition for better speech recognition accuracy.

The next big innovation came with vastly improved statistical models for both recognition and synthesis: the ability to process natural spoken language and translate it to text and, vice versa, to take written text and synthesize speech of reasonable quality that does not annoy people out of the gate.

The technical advances above resulted in a wave of digital assistants launching on consumer devices, starting with Apple Siri (2011), followed by Microsoft Cortana (2014), Google Assistant (2016), and Samsung Bixby (2017). At the time, it looked like digital assistants were going to take over the world and become the new way we interact with our digital devices.

Now that the hype is over and we are sliding down the hype curve, let’s take the opportunity to pause for a moment and think about what is preventing digital assistants from living up to their full potential.

Technical limitations

Lack of true “general AI”

While speech recognition, in other words audio-to-text translation, has improved in accuracy and flexibility thanks to advanced statistical natural language models, the tricky part remains semantic understanding, i.e. the actual intelligence of digital assistants and their ability to engage in a dialog with humans. There are too many nuances and context-sensitive aspects of real human conversation that are hard to capture and represent in today’s AI frameworks.

Notice that all common digital assistants are turn-based (sometimes also known as utterance-based), which means the dialog flow is cut into separate turns, as in chess or another board game. We have gotten used to it and expect this behavior from digital assistants, but it is unnatural. Technically, it makes sense to apply a turn-based (request/reply) pattern when running the digital assistant as a service in a cloud datacenter: individual requests are handled by a pool of stateless worker nodes, and all state between turns needs to be explicitly externalized and stored between individual requests.
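As a rough illustration of that request/reply pattern, a stateless turn handler with externalized session state might look something like the Python sketch below. The names and the in-memory store are hypothetical stand-ins, not any vendor’s actual assistant API:

```python
# Minimal sketch of a turn-based (request/reply) assistant worker.
# Everything here is hypothetical and for illustration only.
import json

SESSION_STORE = {}  # stand-in for an external key-value store shared by all workers


def interpret(utterance, state):
    """Placeholder NLU: a real system would run intent and slot models here."""
    return "smalltalk", {}


def render_reply(intent, state):
    return f"(reply for intent '{intent}' with slots {json.dumps(state['slots'])})"


def handle_turn(session_id, utterance):
    """Handle one turn on a stateless worker node."""
    # 1. Rehydrate whatever dialog context survived the previous turn.
    state = SESSION_STORE.get(session_id, {"slots": {}})

    # 2. Interpret the utterance against that context.
    intent, slots = interpret(utterance, state)
    state["slots"].update(slots)

    # 3. Persist the updated state before replying, because this worker
    #    keeps nothing in memory between turns.
    SESSION_STORE[session_id] = state

    return render_reply(intent, state)
```

Because each turn starts from whatever was persisted, the assistant only “remembers” what the developer explicitly chose to externalize, which is part of why dialogs feel chopped into discrete exchanges.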

The turn-based approach is unlike the early dictation systems, such as IBM ViaVoice, which ran on desktops and used an event-driven streaming pattern. There were no explicit turns or interruptions, and the engine would often reach back into the text to correct itself upon learning more context and realizing it got an earlier part of the paragraph wrong. It would be more natural if today’s assistants were also listening and adapting continuously, with full situational awareness.

In the absence of general AI, designing a good user experience for a conversational system is by no means easier than designing a good user experience for, let’s say, a mobile or desktop GUI app. Attention to detail matters in UX, and it is expensive to get the details right because designers need to craft interactions for individual usage domains and use cases. The currently available set of fine-tuned use cases and dialog flows is small, which inherently limits the general utility of digital assistants.

Problem of ecosystems

But there is an even bigger problem than the lack of true general AI to drive intelligent conversation: the ecosystems built around these digital assistants. All major platforms offer APIs for third-party vendors to integrate their apps and gadgets into the assistant ecosystems and thus expand their utility. The problem is the availability and quality of these integrations.

Availability: Most music player apps on cell phones can perform similar actions: play an album, a playlist, a song, or a band radio, skip to the next song, and so on. An intelligent entity (such as most humans) can freely switch between music apps. Not digital assistants. For example, Samsung Bixby supports Spotify and the native Samsung Music app, but cannot play music on Tidal. So while I can use Bixby to change some very obscure system settings I did not even know existed, I cannot ask her to play my favorite Tidal playlist.

Quality: Many device manufacturers develop digital assistant integrations merely as checkbox features, so they can slap Alexa, Google, Siri, or other assistant logos on their packaging and claim their devices are smart. Many times those skills, as Alexa calls them, are half-baked. For example, we have external blinds from Somfy, which come with the companion Tahoma app for smart home control, and it features a Google Assistant integration. But through the assistant I can only fully open or fully close the blinds, or run a pre-configured named scenario. So a typical summer use case, rolling the blinds down but tilting the slats to, let’s say, 40 degrees to block the heat while letting indirect light in, cannot be done by voice. It can only be done in the Tahoma app or with a physical switch. Sure, I can pre-program a scenario on the desktop and make it available to the digital assistant, but that is not something most people, myself included, are willing to spend their time doing.

Hidden roadblocks

The technical limitations described above create hidden roadblocks that hinder the usability and utility of digital assistants. By encouraging us to use freeform natural language, the assistants trick our brains into assuming we are talking to intelligent machines, but then we keep falling into unexpected traps and dead ends when the assistant does not implement a desired workflow at all, or a seemingly common feature is missing from a third-party extension. Therefore, the coolest scenarios are destined for product marketing demos and party tricks; they all need to be carefully memorized and tested before the event to make sure they actually work. Otherwise, you might easily end up like me when pressing the speech button on the steering wheel of my car:

“Increase the temperature to 22 degrees Celsius, please.”

“Sorry, I can’t do that right now.”

Social aspects

Now imagine for a moment that we have overcome all the technical challenges. We have general AI in place that can operate all apps and devices just like we can, so if we can perform a desired action in a mobile app, the assistant can perform it on our behalf; GUI and speech interfaces become equally powerful.

In what social contexts would you not hesitate to speak to your digital assistant and let your surroundings hear what you are saying? Recall how people look at someone talking aloud to their phone on public transport. Or do you want the entire open-plan office to know that you are planning a one-on-one with your boss’s boss on Monday afternoon? The existing forms of communication and interaction, text messaging and GUIs, offer much more privacy in most social contexts. Even when we want to talk over the phone, we usually seek out a separate room.

This leads me to the statement I put in the title of this article. Even if we solve all the technical issues, talking to a digital assistant will feel awkward to most of us unless we can be alone, enjoying a state of audio privacy. Social norms might change with future generations, what do I know, but at this stage of our society’s development it is social norms, not technology, that may become the limiting factor for digital assistant adoption in the near future.
