Where are Siri, Alexa & Co. going next?

A brief history of voice assistants + future outlook

Florian Quirin
SEPIA Framework
12 min read · Oct 12, 2018


Since Apple introduced Siri in 2011, voice assistants have been available to basically every smartphone user in the world, and with the release of the Alexa skill store and the Amazon Echo in 2015, developing custom voice-based applications became possible for every programmer and voice enthusiast (in the US). Looking at the numbers for 2017/2018, one can safely say that voice assistants have arrived and are here to stay.

Some even compare the “rise of voice AI” to the adoption of smartphones and expect “voice AI” applications to have the same transformational impact that mobile apps had 10 years ago. This is a pretty bold claim, and to be honest I’m not sure about the exact definition of a “voice AI” application, but let’s take a closer look at the four most famous (Western) competitors to see where they come from, what they have achieved so far and where they might be going next.

Apple

Apple first introduced speech recognition and voice commands to consumers as early as 1993 with the release of PlainTalk for the Mac, but it took another 18 years (and a smartphone revolution) until the technology was mature enough for the next step: Siri. Since the iPhone 4S (iOS 5), Siri has been present on every new iPhone, and I think it’s safe to say that the release was a huge success for Apple, at least from a marketing perspective. Everybody was eager to try it out, and The Big Bang Theory even made a TV episode about it (S05 E14). Besides all the funny conversations people had with Siri, it’s unclear though how many actually used it for everyday tasks. I myself was thrilled by the idea and loved to play around with it, but I used it mainly for navigation (“show me the way to [contact xy]”) and quickly felt limited because there was no way to extend its functionality. This was also the reason why I started my own first voice assistant project, ILA (later succeeded by SEPIA), in 2013.

Apple improved Siri from time to time with new services and other details, but most of the changes were done quietly under the hood until iOS 10 and the introduction of SiriKit in 2016, Apple’s answer to the (apparent) success of the Amazon Alexa skill store for voice-based apps (or maybe it was just overdue). For the first time it was now possible to connect your own app to Siri … provided one of the pre-defined domains (e.g. messages, payments, ride booking etc.) fit your service. SiriKit implementations came without much hype though (the most prominent apps I remember: WhatsApp, Uber and PayPal), probably because demand was actually rather low, but they paved the way for the recent changes in iOS 12: Siri Shortcuts. With Shortcuts, users can define workflows similar to IFTTT and start them with a single Siri voice command (or a button).

Apple’s smart speaker, called HomePod, was released in February 2018 and was the company’s first device to promote voice control as the primary input method for controlling your music or smart home.

Outlook and opinion:

Overall, Apple’s strategy for voice technology seems to be focused on augmenting the different i-Devices and apps. On devices with a small screen (Apple Watch, 2015) or no screen at all (HomePod) voice control is more dominant, while it remains mostly in the background on iPhone, iPad and Mac (by the way, Siri has been on the Mac too since 2016). I like this approach because it does not try to force voice control onto something that just isn’t a perfect fit, but applies it where it really makes sense. There is one issue though that hasn’t improved much over the years and is most obvious in the Siri-focused HomePod: Apple’s speech recognition still produces too many errors, especially in languages other than English, like German. It is hard to quantify, but as a heavy user of Apple’s, Google’s, Microsoft’s and Amazon’s speech recognition (both in German and in English with a German accent) my experience is that Siri fails too often where it should work best, namely music titles and names! Common vocabulary and short voice commands are usually no problem, however, and that might be enough for the new Shortcuts function. Looking at apps like Scriptable that have become possible with Shortcuts, you can only begin to imagine how Siri will change in the next few months, especially because Shortcuts will work on the HomePod and Apple Watch as well. Short voice commands triggering complex actions are probably what we’ll see most from Siri in the future, while new artificial intelligence (AI) and machine learning (ML) features will be implemented independently in iOS.

Amazon

Amazon’s product developers accomplished something quite unique with the Echo in 2015. They took a technology that seemed to be hibernating inside the iPhone (Siri) at that time, removed the display completely, added a large speaker and released it as a stand-alone product … with success. I think there were four key reasons why this worked:

  • Amazon’s main marketing message was: control your music via voice across your living room (which just worked) … oh, and it can do smart home and other fun stuff too if you are into that. This message is still dominant on today’s Echo product page.
  • Developers were involved right from the start; creating your own “skill” (app) was easy and fun to play around with (see the sketch after this list).
  • It came at a time when innovations in the smartphone market were basically just bigger screens.
  • Amazon used its enormous market and marketing power to constantly push it and promote the Echo on the shop’s front page.
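
To illustrate how low the entry barrier was (and still is): a custom skill is essentially a small cloud function that receives intent requests and answers with speech. Here is a minimal sketch using the Node.js ask-sdk-core library; the “HelloIntent” intent is a made-up example you would define in the Alexa developer console:

```typescript
// Minimal custom Alexa skill (Node.js/TypeScript, ask-sdk-core).
// "HelloIntent" is a hypothetical intent defined in the developer console.
import * as Alexa from 'ask-sdk-core';

const HelloIntentHandler: Alexa.RequestHandler = {
  // Decide whether this handler is responsible for the incoming request
  canHandle(handlerInput) {
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
      && Alexa.getIntentName(handlerInput.requestEnvelope) === 'HelloIntent';
  },
  // Build the spoken response
  handle(handlerInput) {
    return handlerInput.responseBuilder
      .speak('Hello from my first skill!')
      .getResponse();
  },
};

// Lambda entry point that Amazon calls for every user utterance
export const handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(HelloIntentHandler)
  .lambda();
```

Everything else (wake word detection, speech recognition, text-to-speech) is handled by Amazon’s cloud, which is exactly why it was so much fun to play around with.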

After the release, Amazon focused on extending developer tools and platform reach by offering Alexa technologies as cloud services. At the beginning of 2016 the Alexa Voice Service (AVS) went online, followed by Amazon Lex and other “AI” cloud offerings later the same year. With AVS, developers can integrate Alexa into basically any hardware that has a microphone and a speaker, and with Lex it’s possible to build custom conversational interfaces and chatbots that work, for example, inside your own app or a chat messenger (Slack etc.) … a connection to Amazon Web Services (AWS) required, of course.
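
To give a rough idea of what such a custom conversational interface looks like in practice, here is a sketch of sending a single chat message to a Lex bot via the AWS SDK for JavaScript; the bot name “OrderFlowers” (one of the official Lex samples) and the alias are placeholders you would configure in the Lex console:

```typescript
// Sending one user message to an Amazon Lex bot (aws-sdk for JavaScript).
import * as AWS from 'aws-sdk';

const lex = new AWS.LexRuntime({ region: 'us-east-1' });

lex.postText({
  botName: 'OrderFlowers', // placeholder: a bot created in the Lex console
  botAlias: 'prod',        // placeholder: the published bot alias
  userId: 'user-1234',     // stable ID so Lex can track the conversation
  inputText: 'I would like to order flowers',
}, (err, data) => {
  if (err) { console.error(err); return; }
  // Lex returns the recognized intent, the filled slots and the next prompt
  console.log(data.intentName, data.dialogState, data.message);
});
```

The same bot logic can then be reused behind a voice interface, a website chat widget or a messenger integration.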

In the following two years Amazon extended the Echo line and created a whole ecosystem of Alexa-enabled devices. Today (almost) every WiFi/Bluetooth speaker can be bought with Alexa integration (and fridges and microwaves and mirrors and …), #VoiceFirst is a valid design pattern and the skill store is flooded with new skills each year.

Outlook and opinion:

From a marketing perspective things worked out great for Amazon, and market expansion was pretty successful, but it’s worth taking a closer look at the skill store story. Alexa skills have a big problem keeping their users engaged. A report by alpine.ai from 2017 claimed that the average skill had a retention rate of only 3%, meaning 97% of the users who started a skill once would not start that skill again the week after. In fact, the only thing used in April 2018 by almost every (active) user was music, followed at a safe distance by news and traffic (every second user), with reminders/alarms (every third) even further behind. I believe there are two reasons for that:

  • Lack of utility and a clear use-case
  • A large gap between user-expectations and user-experience

Many skills you find in the skill store seem to be motivated by the question “how can I be part of the marketing hype?” and not by “can I offer my customers additional value?”, similar to the early days of the app store when companies thought they could make an app that simply resembled their homepage. Some other skills are just developers experimenting with the new technology and practicing for real use-cases. Developing for a device without a screen requires a whole new set of design and best-practice principles. Amazon knows that and has since tried to improve the situation by organizing workshops for skill developers, making the development of skills even easier and introducing new devices with screens and cameras. Especially the latter is interesting, since it is blurring the line between the skill store and a conventional app store (“how is my Echo Show different from a tablet with a Bluetooth box again?”).

The gap between user expectations and user experience is something that has been plaguing conversational AI (e.g. voice assistants) since the dawn of time (probably marked by ELIZA) and peaked in 2017 when it was revealed that Facebook’s chatbots had a 70% failure rate. It is an inherent problem of marketing for AI/ML-based technology, giving the false impression that machines now understand everything and can talk like humans (the most prominent example at the moment: IBM Watson commercials), while in reality we are still struggling with speech recognition accuracy and even the simplest multi-turn dialogs. To be fair, Amazon itself never really pushed the all-powerful-AI image that hard and focused on simple dialogs (or voice commands) for the Echo, but skill developers had to learn it the hard way.

Amazon’s strategy for voice technology is way more aggressive than Apple’s: integrate it into everything possible (hard- and software) to gain market reach! Since every interaction with Alexa requires a connection to Amazon Web Services (AWS) and skills can be hosted conveniently on the AWS cloud, it is a smart way of binding products and developers to Amazon’s own platform, eventually cross-selling additional cloud services and Amazon products. The evolution of the skill store will be interesting to watch, especially with Amazon’s attempts to offer in-skill purchasing and monetize skills directly. I believe we might see a change of Amazon’s #VoiceFirst principle to #VoiceAndScreen, depending on how compelling and diverse future #VoiceFirst skills can still become.

Google

Google was one of the first companies to offer voice search on a large scale with its 2008 iPhone app, followed three years later by “Voice Search on desktop”, a Chrome browser integration. In fact, Google Chrome is still the ONLY browser with a fully working integration of the Web Speech API (the W3C standard for integrating speech recognition into web apps; shame on you, Apple and Microsoft!). The story of voice assistants at Google is a little more confusing. They introduced Google Now in 2012, probably as a kind of answer to Siri, but it was more like a fancy way of doing voice search with no support for conversations (though it worked well as an “assistant”). Google Now eventually evolved into Google Assistant in 2016, marking the company’s first release of a dialog-based voice assistant. In the same year they introduced the Amazon Echo competitor Google Home, and from May 2017 on developers were able to create their own “actions” for Google Assistant (working on phones and speakers). Google Actions are the equivalent of Alexa skills, and there is even an Android version of SiriKit coming, called App Actions … needless to say, everything is now available as a cloud service too.
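
To show why the missing Web Speech API support in other browsers is so annoying: in Chrome, speech recognition for a web app is only a few lines of code. A minimal sketch (the prefixed webkit constructor is the fallback Chrome has used for years):

```typescript
// Minimal Web Speech API usage; currently this only fully works in Chrome,
// other browsers may not expose SpeechRecognition at all.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.lang = 'en-US';         // recognition language
recognition.interimResults = false; // only deliver final results

recognition.onresult = (event: any) => {
  // First alternative of the first final result
  console.log('You said:', event.results[0][0].transcript);
};
recognition.onerror = (event: any) => console.error('Error:', event.error);

recognition.start(); // the browser will ask for microphone permission
```

Behind the scenes Chrome sends the audio to Google’s servers for recognition, so the quality benefits from the same cloud technology as Google’s other voice products.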

Outlook and opinion:

Google seems to be following a “try everything, be everywhere” strategy for voice technology. Their product portfolio basically includes everything Apple and Amazon offer (a highly integrated smartphone app like Siri, smart speakers in every size like the Echo, and a cloud service for third-party integrations and developers) and they are constantly pushing it to the limits (latest example: Google Duplex). Since Google already rules the smartphone market (Android has ~75% global market share) and sold more smart speakers in Q1 2018 than any other vendor (according to some analysts), it’s likely they will dominate the voice assistant market for the next few years. The deep integration into Android is Google’s main advantage over Amazon. In addition, Google does not depend that much on voice-only concepts; they’ve actually just updated their Assistant to show even more buttons and info on the screen because they found out that “half of all interactions with the Assistant [on mobile] include both voice and touch”. Apple only offers Siri integrations for its own i-Devices, which, due to their high price tags, will probably not slow down Google’s expansion considerably.

The interesting question that remains is: will Google add more advertising to their voice assistants, and will that affect user acceptance? (Advertising still makes up 86% of the company’s revenue.) An (unconfirmed) test with voice-based ads in 2017 already led to a lot of resistance, and it might be pretty hard to introduce something like this to the mainstream, leaving Google with the usual screen-based recommendations and the sale of digital goods/subscriptions via Google Actions.

Microsoft

Microsoft has been experimenting with speech recognition since the 90s, and with the release of Windows Vista at the end of 2006 (half a year before the first iPhone was released) the company made voice commands and voice dictation on Windows available to millions of users (with a very funny introduction, but obviously with some success stories as well). Since then it has been part of every new Windows version, but it took until 2014 and the release of Windows Phone 8.1 for the technology to evolve into the full-blown voice assistant Cortana. Rumor has it that the foundations for Cortana were laid in 2009 and may even have some connections to early Siri research at SRI International. Since 2015, Cortana has been available on Windows 10, Android and iOS, and it frequently appears in connection with the HoloLens (Microsoft’s augmented reality goggles). In October 2017, Microsoft together with Harman Kardon released a Cortana-powered smart speaker very similar to the Amazon Echo. With the Microsoft Bot Framework, developers can build “skills” for Cortana that also work as chatbots in e.g. Skype, Slack or Facebook Messenger.
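
As a rough sketch of what such a channel-independent bot looks like with the Bot Framework SDK for Node.js (a simple echo bot; connecting it to Cortana, Skype or Slack happens through channel configuration in Azure, not in code):

```typescript
// Minimal echo bot with the Microsoft Bot Framework SDK (botbuilder).
// The same bot logic can be wired to Cortana, Skype, Slack etc. as
// "channels" in the Azure portal; no channel-specific code needed here.
import { ActivityHandler, TurnContext } from 'botbuilder';

export class EchoBot extends ActivityHandler {
  constructor() {
    super();
    // Called for every incoming user message on any connected channel
    this.onMessage(async (context: TurnContext, next) => {
      await context.sendActivity(`You said: ${context.activity.text}`);
      await next(); // continue with any further registered handlers
    });
  }
}
```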

Outlook and opinion:

All in all, Microsoft is well positioned: they have the technology (their speech recognition is, in my experience, only outperformed by Google’s), they have the cloud services and possibilities for 3rd-party integrations, with Windows 10 and the Xbox they have millions of potential(!) Cortana devices, Cortana as an assistant usually gets the job done, and they’ve proven that they can do smart speakers as well. Yet it looks like they are currently just spectators in the voice assistant game, similar to the early days of app stores. The Harman Kardon smart speaker has remained the only Cortana-enabled 3rd-party hardware so far (to my knowledge), and user reach must have been so low that Microsoft decided to make Cortana available via Amazon Alexa in 2018 (and vice versa). Microsoft made a similar decision in 2015 when they released Cortana for iOS and Android to remain present on smartphones after Windows Mobile was discontinued. It is probably a smart move to make their services available on as many platforms as possible instead of focusing on Windows or their own hardware, since about two thirds of their earnings come from cloud, productivity and business processes (which includes the Office suite, for example). Business-to-business has always been one of the most successful strategies for Microsoft, so it’s not a big surprise that they’ve recently introduced the Cortana Skills Kit for Enterprise to expand their voice assistant and AI offerings further into this field.

Outlook Summary (tl;dr)

While Apple is constantly improving the AI features of iOS and macOS, voice-based Siri interactions will most likely stay in the background as an augmentation of existing apps. Siri Shortcuts will be an interesting feature to observe, but a microwave with Siri integration is probably not something we’ll see soon, unless it is using HomeKit via iPhone or HomePod.

In contrast to Apple, Amazon is focused on expanding its newly (2015) created ecosystem of voice-driven Alexa cloud applications and is trying to integrate voice commands directly into 3rd-party devices like ovens, cars, fridges and many more. Voice alone, as we saw it in the first Echo, might not be enough for the optimal user experience, at least not with the current state of the technology, so it will be interesting to observe how devices and skills shift more and more to voice-with-screen approaches.

Google already rules the mobile market and will probably offer the same features as Apple on smartphones (a voice-augmented OS and apps), maybe even pushing the AI part a bit harder. Since cloud is becoming more important for their business, they will continue to drive 3rd-party integrations to become the default assistant and voice-command interface on as many devices as possible. The craziest AI/voice experiments will probably be coming from Google as well.

Microsoft Cortana has the problem of being the only voice assistant without a dedicated hardware device right now, except maybe Microsoft Surface devices and the Xbox. Desktop PCs usually lack a proper microphone and, besides that, were never really the first choice for voice assistants (neither on Windows 10 nor on macOS); it is just more convenient to use the device in your pocket or an always-on, always-listening smart speaker. Because of that, Microsoft seems to focus on offering their services on all other popular platforms as well and on integrating their AI and voice technology further into the enterprise business (e.g. customer support technology for companies, or Cortana support for Outlook via Amazon Alexa and mobile apps).

Another interesting topic for future voice assistants will be data privacy. Not everybody is comfortable with the idea of having a device at home that is constantly streaming data to the servers of large corporations, especially not when it suddenly broadcasts this data to random people (O_o). Open-source projects like SEPIA (my own project) and Mycroft (the largest community-driven project) are trying to offer alternatives that put control over all data in the hands of the user.

Today no voice assistant can offer a user experience close to a real, meaningful conversation with a human, and this is something developers need to take into account when building voice-driven applications to avoid disappointment. But there are many places where voice can be more convenient and fun to use than other input methods: music control on smart speakers, scheduling appointments or reminders on the smartphone, combining multiple tasks into one voice command, or goal-oriented question-answer systems, to name a few.

In any case, I am curious what comes next :-)


Florian Quirin
SEPIA Framework

Physicist, data scientist, Java and web developer, AI enthusiast, Raspberry Pi hacker and bot (framework) builder.