9 flaws that make voice assistants fundamentally wrong

There’s a lot of buzz around the ‘virtual assistants’ in Google Home and Amazon Echo. I have been studying both devices for a good while now, since I wanted to see how they handle the user experience, and I must say that I was pretty disappointed.

Both Google Home & Amazon Echo are built on some fundamentally flawed designs, detailed below. Note that I am not comparing product specs, speaker technology or aesthetics; I am purely talking about the underlying features of the virtual assistant, and whether the device is designed to truly understand and answer the user’s question. My observations are common to both devices, and I use the term “the device” to refer to either or both of them in this write-up.

Flaw #1: Applying search principles is wrong

The device is built on an assumption that was originally made by our search engines. It answers the same way a search engine answers a query; however, it returns just the first answer. It neither validates the user’s question for completeness nor does any disambiguation on the question itself. For example, a question like “tell me about George Bush” gets a response about the junior Bush, because his search ranking is higher than his father’s (probably no one did the SEO for the senior Bush).

How can the device assume a particular George Bush? Why doesn’t it ask users which George Bush they are referring to? This minimalist approach is just not how a teacher answers a student’s question, or how a parent answers a child’s question.

This is a fundamental design flaw. A Google search on a computer might return 12 results on page 1, and the user gets to pick among the answers with a scroll bar. There is no scroll bar in the voice assistant, and there is no way to know the alternate possibilities!

The device can’t assume that the first answer from the search is the right answer; doing so is totally wrong. There needs to be an interaction before answering the question, or the alternatives need to be communicated along with the current answer.

Hey Google & Amazon, please don’t create a falsified world for the next generation, who might well assume that these devices are answering their questions correctly.
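The kind of interaction I am asking for can be sketched in a few lines. This is only an illustration under stated assumptions, not how either device works: the `Candidate` type, its `score` field and the `margin` threshold are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One possible interpretation of the user's question (hypothetical type)."""
    label: str    # e.g. "George W. Bush, the 43rd president"
    score: float  # assumed ranking score from some search backend


def respond(candidates: list[Candidate], margin: float = 0.2) -> str:
    """Answer only when one candidate clearly wins; otherwise ask the user."""
    if not candidates:
        return "Sorry, I could not find anything for that."
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    top = ranked[0]
    # Close runners-up mean the question is ambiguous, so ask instead of guessing.
    rivals = [c for c in ranked[1:] if top.score - c.score < margin]
    if rivals:
        options = ", or ".join(c.label for c in [top, *rivals])
        return f"Did you mean {options}?"
    return f"Here is what I know about {top.label}."
```

With two closely ranked George Bushes, `respond` returns a “Did you mean …?” question instead of silently picking the higher-ranked one. The threshold is arbitrary; the point is only that the runner-up is not thrown away.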

Flaw #2: Inability to narrow down the context

While the device tries to get into answering mode very quickly, it doesn’t tell you what it is answering. This is a different problem from the previous one, because the context of the query object can vary.

For example, for the question “tell about Gandhi”, the device quickly responds with “Gandhi was released in India on 30 November 1982, and in the United States on 6 December. It was nominated for Academy Awards in eleven categories, winning eight.” It is unfortunate that ‘Gandhi’ is assumed to be a movie title, when Gandhi is more than a person and the name is often treated as a common noun.

Why did my question get answered as if ‘Gandhi’ was all about a movie? I was asking about a person. And the device doesn’t even rephrase the question to tell me how it is interpreting it.

It would have been great to start the response with “Gandhi is a movie and was released in…”. Otherwise, it is a bad idea to assume that the question was about a movie and not even tell the user about it. It is again a fundamental problem, as the underlying technology is just indexing all of Wikipedia, local businesses, restaurants and some top search results, and throwing out the topmost (SEO’ed) answer. Users can’t just go by the answers and be fooled!

Another example of the same flaw: try asking “when did Iron Man release”, and the device responds with when the first sequel of the Iron Man movie was released. Hmm, I didn’t say that I was referring to the movie; I could be thinking about the book “Iron Man”, or the movie’s sequel 1 or sequel 2. Even if it is a movie, why sequel 1 and not sequel 2?
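Stating the interpretation before answering takes very little. Here is a minimal sketch, assuming the assistant already knows which kind of thing it picked; the `Interpretation` type and the `answer_with_context` function are hypothetical names for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Interpretation:
    """How the assistant understood the query (hypothetical type)."""
    name: str  # e.g. "Gandhi"
    kind: str  # e.g. "movie", "person", "book"
    fact: str  # the answer text for this interpretation


def answer_with_context(chosen: Interpretation, other_kinds: list[str]) -> str:
    """Say what is being answered before answering, and mention alternatives."""
    reply = f"{chosen.name}, the {chosen.kind}: {chosen.fact}"
    if other_kinds:
        alternatives = " or the ".join(other_kinds)
        reply += f" I can also tell you about {chosen.name} the {alternatives}."
    return reply
```

For the Gandhi question, the reply would open with “Gandhi, the movie: …” and close by offering the person, so the user at least knows which Gandhi was answered.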

Flaw #3: Inability to articulate words being ignored

I have seen many scenarios where there is simply no way to ask a question and get the intended answer. For example, “what are the nearby pizza joints” returns “Pizza my heart, Big Apple Pizza, Amici’s”, and not “Pizza Hut, Round Table Pizza”. I wasn’t sure why it did not give details of Pizza Hut, which is far closer to my home than the 3 restaurants that it responded with. I tried asking “what are the nearby pizza fast food restaurants” and it responded with “KFC, Subway and McDonald’s”. Really? What just went wrong?

Users can ask very direct questions or use overloaded words. The device can’t just ignore some words and answer the rest. If it ignores words, it should say what is being ignored.

Users give more details either to confirm what they are asking, or because they don’t know how to ask. Clarify with the user how to ask. I could have simply got an answer to the above question by searching for “pizza” in Google Maps, but asking for “pizza” in Google Home takes it for a spin.

In this particular case, I couldn’t figure out why McDonald’s came up in the answers, since it doesn’t sell any pizza product at all, and the user has no way to respond to a “search instead for” prompt.

Flaw #4: Inability to understand comprehensively

The context of the question is never understood comprehensively. For example, I asked the question “how many songs do I have in my music library”. The device responds with “shuffling your music” and then one of the songs starts playing.

Oops, my question was an inquiry about my library, not a request to play a song.

It is very evident that there is a rush to do something for every question, rather than to really understand the question itself. What sort of assumptions do they make about the user while building these products? Is the user an adult or a child; tech-savvy or not; an active or a lazy person? I would like to understand the assumptions about the emotional side of the user that Google and Amazon have built these devices on.

Even in English, depending on the country, people call a “movie” a “cinema”, a “film”, a “picture” or even a “show”. As of now, because there is no communication, this question doesn’t arise, but if in future the device communicates back to the user, it would be interesting to see how its product management has addressed this.

Flaw #5: Disregard for meaning beyond keywords

The device doesn’t go beyond keywords and ignores the complete meaning of the question. For example, try “tell the names of 5 American presidents”, and the device responds with just one president’s name (thankfully, ‘Barack Obama’). The device understands what an American president is, but not 5 of them. A similar example is “tell me 5 jokes”, which returns only 1 joke.

Sorry, but there is more to understanding a question than just some of its keywords. The right answer to this question would have been “I can tell one joke, and not 5. Here is the joke…”

Too bad guys, there is a lot more maturity expected from this device. Instead of looking at my example in isolation, look at the big picture: the device understands only the keywords and nothing beyond them. On a different note, the question “tell the name of Russian president” is not understood :-(
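To make the point concrete, here is a small sketch of honoring the number in the question and being upfront when it cannot be met. It is a toy under loud assumptions: `requested_count` and `tell_jokes` are hypothetical helpers, only digits are recognized (not words like “five”), and the list of jokes is passed in by the caller.

```python
import re


def requested_count(question: str) -> int:
    """Return the first number written in digits in the question, or 1 if none."""
    match = re.search(r"\b(\d+)\b", question)
    return int(match.group(1)) if match else 1


def tell_jokes(question: str, jokes: list[str]) -> str:
    """Honor the requested count, and say so when it cannot be met."""
    wanted = requested_count(question)
    available = jokes[:wanted]
    if len(available) < wanted:
        # Be explicit about the shortfall instead of silently answering less.
        prefix = f"I can tell {len(available)} joke(s), not {wanted}. "
    else:
        prefix = ""
    return prefix + " ".join(available)
```

The number in the question becomes part of what is understood, rather than a keyword that is quietly dropped.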

Flaw #6: Inconsistent messages

I was first intrigued when I got different status messages for questions that were not understood. I am not blaming the device for not understanding the question itself, but I am disappointed by how it handled the responses when it did not understand.

Try asking, multiple times, the same question that is not expected to be understood at all, like “what is John Martin’s phone number”, and you will see that the answers vary randomly: “My apologies, I don’t understand”, “Sorry, I can’t help with that yet”, “Hmm, something went wrong”, “Sorry, I don’t know how to help with that yet”, “Sorry, I am not sure how to help with that yet. I am still learning”, or “Hmm, I wasn’t able to understand the question I heard”.

Too bad, Google & Amazon. Just changing the apologetic messages randomly doesn’t make the user believe in you. Stay consistent.

I would have loved it if the device had responded in this case with “Sorry, personal phone numbers are not supported. Only business listings are supported at this point.”

For any enterprise product, it is almost mandatory to document how each error or exception message should be handled. In the consumer world, companies assume that not having a user guide is fashionable, and they just take customers for a ride. These devices are not simple enough to go without one; users need troubleshooting guidance to learn how to ask questions better.

Don’t confuse the users, and stay consistent.
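Consistency here is mostly a matter of tying each failure reason to exactly one message. A minimal sketch follows; the `Failure` categories and the wording of two of the three messages are my own assumptions for illustration, not anything either vendor documents.

```python
from enum import Enum


class Failure(Enum):
    """Reasons a question could not be answered (hypothetical categories)."""
    NOT_UNDERSTOOD = "not_understood"
    UNSUPPORTED_PERSONAL_DATA = "unsupported_personal_data"
    BACKEND_ERROR = "backend_error"


# Exactly one message per failure reason, so the same failure always sounds the same.
MESSAGES = {
    Failure.NOT_UNDERSTOOD: "Sorry, I did not understand the question.",
    Failure.UNSUPPORTED_PERSONAL_DATA: (
        "Sorry, personal phone numbers are not supported. "
        "Only business listings are supported at this point."
    ),
    Failure.BACKEND_ERROR: "Sorry, something went wrong on my side. Please try again.",
}


def failure_message(reason: Failure) -> str:
    """Look up the single, fixed message for a failure reason."""
    return MESSAGES[reason]
```

Asking for John Martin’s phone number ten times would then produce the same explanation ten times, and that explanation would tell the user what is and is not supported.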

Flaw #7: Not designed for personal data

Every question is assumed to be about Internet data (public data), with no personalization whatsoever. Publicly indexed data such as Wikipedia, movies, local businesses, music albums, weather, or news far outweighs personal data in these devices, and they are just not designed to answer questions about your own personal data. For example, ask “tell me about Jack Reacher” and the device assumes it is a movie.

Oops, what about the Jack Reacher who is my colleague? Even if the device ignores my colleague, what about the 100+ people with this name on LinkedIn?

If all the user is looking for is a voice response to a google.com search box, this device is not worth even 1 cent! It should draw on my personal data to answer my questions.

Flaw #8: Forced actions

Any time I ask Alexa about a movie or a book, like “tell about Harry Potter”, it answers from wiki, and then goes on to either order the book/DVD for me (forcing a purchase on me), or say that it is not in my library.

Don’t keep trying to upsell other services with voice assistants. If all I want is an easier way to order (I am not that lazy), I will use my mobile phone.

Treat the voice assistant as an independent product that the customer has paid money for, and respect the customer. Don’t try to bully or fool the customer.

Flaw #9: Lack of interactivity on capabilities

Can you ask the device about its own features? For example, a question like “what’s your current volume” doesn’t get understood at all, while “increase the volume” or “decrease the volume” gets understood. Similar failures occur with questions like “how many alarms can you set?” or “what music sources do you support?”

Too bad. I am communicating with an audio device and I can’t even calibrate it. I am communicating with an alarm device and I can’t inquire about the currently set alarms.

When the skills or the vocabulary offer only a couple of choices, the device makes it difficult for the user to interact.

In the end, I felt these voice assistants are more like remote controls with lots and lots of buttons, which you can press by voice instead of using a tactile input. You need to remember all the buttons though :-)
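Answering questions about the device’s own state does not need much machinery. Here is a toy sketch; the `Speaker` class, its fields and the 0 to 10 volume scale are all hypothetical, and real devices expose their state differently.

```python
from dataclasses import dataclass, field


@dataclass
class Speaker:
    """Toy device state (hypothetical); real devices expose this differently."""
    volume: int = 5  # assumed scale of 0 to 10
    alarms: list[str] = field(default_factory=list)

    def describe(self, topic: str) -> str:
        """Answer a question about the device's own state."""
        if topic == "volume":
            return f"The volume is {self.volume} out of 10."
        if topic == "alarms":
            if not self.alarms:
                return "No alarms are set."
            return "Alarms are set for " + ", ".join(self.alarms) + "."
        # Even the fallback tells the user what can be asked.
        return f"I cannot report on {topic} yet. I can report on volume and alarms."
```

The same state that “increase the volume” changes is the state that “what’s your current volume” should be able to read back.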

[Disclaimer: These observations were noted at the time of writing this article. I would be glad to be proven wrong in the future.]

Ramesh Panuganty is the Founder & CEO of Drastin, the world’s first conversational analytics platform.