Over the past few months, it’s been all but impossible to escape the growing buzz around voice technology. From Amazon’s Echo to Google Home to Apple’s HomePod, it seems like everyone’s getting into the game.
SNL even jumped into the conversation by releasing its own parody featuring an “Echo Silver” for the “greatest generation.” But as we look beyond the hype, we’re forced to confront a more fundamental question: what value is voice technology adding?
As a venture investor, my goal is not only to identify transformative technology trends, but also to recognize the applications within them that can lead to big businesses.
Doing so requires having a prepared mind about the true state of a technology, as well as a thesis on opportunities for the greatest value creation. With that in mind, the following article presents an overview of how the space has evolved, and a few areas we’re excited about for investment.
Smartphones Are a Stepping Stone
Voice-enabled technology has been accessible to us for quite some time via our smartphones. Apple introduced Siri in 2011 and the use of smartphone-based voice assistants has been growing steadily ever since.
In 2015, 65% of smartphone owners used voice assistants, up from 30% just two years earlier. Although many of these interactions were fairly simplistic (e.g., “call mom” or “search for nearby restaurants”), there is also evidence that these interactions are becoming more complex.
One analysis from Baidu shows that API calls for text-to-speech services increased by over 20X between 2014 and 2016. This suggests that people aren’t just asking more questions by voice, but also expecting more answers.
As a result, designing voice-enabled apps requires a fundamental rethinking of user experience and search functionality, with a changing emphasis on graphical user interfaces (GUIs). For more on this, see my last article Giving Advertisers a [Literal] Voice.
Hardware Drives Mainstream Awareness
Over the last two years, devices such as the Amazon Echo and Google Home have driven voice-enabled applications further toward the mainstream. Thus far, Amazon has been a clear leader in the market, with approximately 8 million Echo units shipped through the end of 2016 (compared to Google Home at approximately 500,000 units). Newer models of these devices have only continued to fuel excitement.
Amazon’s early success can be attributed to its head start on the technology’s development, and its aggressive marketing push. Even so, as of Q1 2016 ownership of the Echo was only at 5% of US customers, despite having 60% awareness.
This suggests that there is significant room for growth as awareness moves toward conversion. In fact, estimates indicate that the footprint for all voice-enabled devices may climb to 33M units by the end of this year.
Growing Number of Third-Party Skills
The rapidly expanding footprint of voice-enabled devices has led to a growing ecosystem of third-party skills and applications. As of May 2017, there were more than 12,000 “skills” in the Amazon Echo marketplace, compared to nearly 100 integrations for the Google Home.
These figures have grown quickly thanks to both players’ extensive investments in their developer relationships and the democratization of content creation tools.
However, few of the applications on the Amazon Echo or Google Home have much substance: only 30% of Amazon Echo skills have more than one consumer review, and most feature a 3% average retention one week after download.
The most commonly used applications for these devices are generally the simplest, such as setting a timer, playing a song, or controlling in-home IoT devices.
There are signs that these engagement statistics will improve in the near future thanks to two major changes: (1) the creation of payment features that will allow developers to monetize applications; and (2) an improved app discovery process.
As of today, developers have not been able to charge users for downloading or using their voice-enabled apps, which likely disincentivizes developers from creating more robust experiences.
In addition, the discovery process has been cumbersome for users, requiring them to first learn about applications via other mediums and then download apps online. Both Amazon and Google have signaled their intention to address these issues, but the timing and impact remain unclear.
Growing Number of Voice-Enabled Transactions
While monetizing voice applications isn’t yet possible for the broader community of developers, Amazon, a company that has been masterful at encouraging recurring purchase behavior, has already demonstrated that consumers are willing to spend more via voice. Owners of an Echo spend about 10% more and purchase 6% more often on Amazon than they did before they had the device. Revenue from purchases on Amazon made via the Echo will outstrip revenue earned from selling the devices by next year, and the gap will continue to widen from there. By 2020, estimates indicate that Amazon will generate over $7 billion from transactions, on top of the estimated $4 billion from the devices.
This demonstrated purchase behavior is critical for two reasons.
The first is that developers now have a compelling reason to invest in reaching audiences via voice devices.
The second is that a large segment of voice devices, particularly those created by Amazon, will be attached to customer credit cards.
This is a major advantage over chatbots and other messaging apps, which have struggled to gain access to payment details. The value of this integration will likely expand further as Amazon pushes to have its Alexa operating system power a larger number of third-party devices — a trend we’re already seeing.
Speech Recognition is Improving
Another important driver of mainstream adoption has been the rapid advancement in speech recognition.
As of 2016, error rates on speech recognition have fallen to about 5% versus nearly a third in 2012.
Deep learning approaches have been a significant catalyst for these gains, and will likely push accuracy even higher over the next few years.
Despite advances in speech recognition, however, true natural language understanding (NLU) is still a long way off. Voice assistants often fail to comprehend the meaning of our requests, even if they manage to transcribe them perfectly.
As a result, consumers often get frustrated with offerings like Siri, which are advertised as having a very broad set of applications, but actually only deliver on a handful. That handful is the result of extensive training to understand all the possible variations of a request, and the creation of tools to quickly identify specific responses.
Investing in the “Voice-Stack”
To identify opportunities for the biggest value creation in the category, it helps to start by visualizing the underlying tech stack (see below).
From an investment perspective, not all layers are created equal. Thanks to extensive low-cost infrastructure development by Google and Amazon, the bottom rungs have moved quickly towards commoditization.
As a result, investment interest has been pushed higher in the stack, where we’re likely to see far more competition and value creation among startups.
In particular, the two layers that seem most promising are AI Software Tools and Applications. The first represents products aimed at developers, and focused on the creation / deployment of voice-native applications. The second represents services aimed at end-users, across consumer and enterprise verticals.
We expect to see new businesses focus on these two layers, especially as traditional AI software and applications (such as shopping, search and entertainment) have struggled to make the jump toward voice.
Exploring Voice’s Native Advantages
As with any new technology, there will be a flurry of excitement around its possibilities. However, the businesses most likely to succeed will be those that truly understand the native advantages of voice-enabled technology and are able to create the services or tools that push the boundaries of our expectations.
For example, today’s booming rideshare market was only made possible by understanding the mobility and location awareness inherent in smartphones. This, coupled with a seamless payment system, made for a magical product experience. Similar examples will emerge in voice.
The challenge, of course, is actually understanding how voice’s native advantages will manifest in real-world applications. Below are a few that I’m particularly excited about. These are areas I’ll be keeping an eye out for as I meet with entrepreneurs in the space.
Native Advantage #1:
Increased Interaction Speed & Efficiency
Americans type at an average rate of 40 words per minute, but speak at an average of 150. Notwithstanding the manual dexterity of today’s millennials, voice-driven interfaces will be a far faster way to input data than banging away on a keyboard.
Though this may not seem like a significant UX improvement when checking the weather or looking up sports scores, it will be extremely valuable in more complex use cases.
For example, doctors spend an average of one to two hours per day manually entering data into electronic health record (EHR) systems. Much of this valuable time could be recaptured with better dictation software.
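As a rough back-of-the-envelope illustration of the typing and speaking rates cited above, the sketch below compares how long it takes to enter the same text by keyboard and by voice. The 600-word note length is a hypothetical figure chosen for the example, and the calculation ignores time spent correcting recognition errors.

```python
TYPING_WPM = 40     # average typing rate cited in the article
SPEAKING_WPM = 150  # average speaking rate cited in the article


def input_minutes(word_count: int, words_per_minute: float) -> float:
    """Minutes needed to enter word_count words at the given rate."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute


def speedup(typing_wpm: float = TYPING_WPM,
            speaking_wpm: float = SPEAKING_WPM) -> float:
    """How many times faster speaking is than typing."""
    return speaking_wpm / typing_wpm


if __name__ == "__main__":
    note_words = 600  # hypothetical note length, for illustration only
    print(input_minutes(note_words, TYPING_WPM))    # 15.0 minutes typed
    print(input_minutes(note_words, SPEAKING_WPM))  # 4.0 minutes spoken
    print(speedup())                                # 3.75
```

At these rates, voice input is 3.75 times faster, so the gap grows in absolute terms as the amount of text increases.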
Another advantage is reducing the time it takes to navigate to information. Rather than using the embedded menus provided by modern GUIs, voice opens up the possibility for unstructured search.
For example, if you wanted to filter listings on an ecommerce site based on a dimension that isn’t usually indexed (like % of reviews citing defects or recency of launch), a natural language voice interface could interpret that request and organize the results accordingly.
Under current circumstances, ecommerce sites wouldn’t be able to incorporate such changes, as this would risk overcomplicating the user experience for the majority of users.
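To make the “% of reviews citing defects” example concrete, here is a minimal sketch of the filtering step only. It assumes the spoken request has already been interpreted into a numeric threshold, which is the hard part and is not shown. The `Listing` class, the `DEFECT_TERMS` keyword list, and the sample data are all hypothetical, and a real system would need something far more robust than keyword matching.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical keywords treated as signs that a review cites a defect.
DEFECT_TERMS = ("broke", "broken", "defect", "defective", "stopped working")


@dataclass
class Listing:
    name: str
    reviews: List[str] = field(default_factory=list)


def defect_share(listing: Listing) -> float:
    """Fraction of a listing's reviews that mention any defect term."""
    if not listing.reviews:
        return 0.0
    flagged = sum(
        any(term in review.lower() for term in DEFECT_TERMS)
        for review in listing.reviews
    )
    return flagged / len(listing.reviews)


def filter_by_defect_share(listings: List[Listing],
                           max_share: float) -> List[Listing]:
    """Keep listings at or below max_share, lowest defect share first."""
    kept = [item for item in listings if defect_share(item) <= max_share]
    return sorted(kept, key=defect_share)


listings = [
    Listing("Blender A", ["Works great", "Broke after a week", "Love it", "Solid"]),
    Listing("Blender B", ["Defective motor", "Stopped working in a month"]),
    Listing("Blender C", ["Quiet and fast", "Good value"]),
]

# A request like "show me blenders where at most a quarter of reviews
# mention defects" would, once interpreted, reduce to max_share=0.25.
results = filter_by_defect_share(listings, max_share=0.25)
print([item.name for item in results])  # ['Blender C', 'Blender A']
```

The filter is computed on demand from the review text rather than from a pre-built index, which is why it never has to appear as a menu option in the GUI.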
Potential Startup Applications: improving enterprise workflows, personalizing digital experiences, automating data transcription / summarization.
Native Advantage #2:
In some physical environments, such as industrial worksites or behind the wheel of a car, access to screens may be limited. In these cases, voice-driven interfaces don’t just speed up access to information and services, but also enable it or make it safer. More than a third of voice users already cite their vehicle as the primary location for using voice apps.
Several companies have emerged with specialized applications for hands-free environments. Startups like Guardhat and RealWear have incorporated voice technology into form factors purpose-built for industrial settings.
Potential Startup Applications: facilitating communication in industrial settings, managing a distributed workforce, expanding information accessibility, increasing personal productivity.
Native Advantage #3:
Enterprises track and record millions of hours of customer service and sales calls each year. Currently, these records are primarily used to monitor general statistics such as call volumes, resolution times, and survey scores. However, focusing solely on stats over conversational substance risks overlooking critical insights.
By actually listening to these calls, enterprises can discover new customer-driven product recommendations, figure out which product descriptions resonate most, or automatically generate a playbook based on the scripts of top performers.
With voice-driven analytical tools, such insights can be extracted at scale from what would otherwise be considered “dark data.”
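As a toy illustration of extracting insights from call recordings at scale, the sketch below counts how many calls mention each phrase of interest. It assumes the calls have already been transcribed by a speech-to-text service. The transcripts and phrases are invented for the example, and real tools would go well beyond simple phrase matching.

```python
import re
from collections import Counter
from typing import Iterable


def phrase_counts(transcripts: Iterable[str],
                  phrases: Iterable[str]) -> Counter:
    """Count how many transcripts mention each phrase at least once."""
    phrases = list(phrases)
    counts: Counter = Counter()
    for text in transcripts:
        # Lowercase and collapse whitespace so matching is less brittle.
        normalized = re.sub(r"\s+", " ", text.lower())
        for phrase in phrases:
            if phrase.lower() in normalized:
                counts[phrase] += 1
    return counts


# Hypothetical transcripts standing in for transcribed support calls.
transcripts = [
    "I wish the app had a dark mode.",
    "Can you add dark mode? Also the battery life is poor.",
    "My order arrived late.",
]

counts = phrase_counts(transcripts, ["dark mode", "battery life"])
print(counts["dark mode"])     # 2
print(counts["battery life"])  # 1
```

Even a crude tally like this surfaces recurring feature requests that would stay invisible in call-volume or resolution-time statistics.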
As platform players like Amazon begin to eliminate the need for direct engagement between consumers and brands, the few remaining touchpoints will have even greater significance.
Any opportunity to speak to a customer (even an irate one) should be treated as a chance to not only communicate, but also to learn.
Potential Startup Applications: generating business intelligence, enhancing employee training, improving customer service / sales.
Native Advantage #4:
Ambient Computing & Contextual Awareness
Since its inception, Google has had a relentless focus on search speed. Within the engineering group, entire teams were dedicated to shaving nanoseconds off information retrieval times after users clicked the search button.
Eventually, someone realized that one of the biggest remaining opportunities for improvement would come from removing a step — delivering search results before a user even finished typing.
The next wave of search will go beyond delivering faster answers to your questions, and may make asking unnecessary in the first place.
This idea represents the future of ambient computing, where a network of smart devices responds in real time to what’s actually happening in the environment and surfaces information when it’s most relevant.
Such is the ultimate vision for voice devices like the Amazon Echo or Google Home. These devices are intended to operate in the background, but they have permission to listen in at all times and can interject when it’s most helpful.
Right now, that happens around a limited set of pre-programmed activation commands, but over time, those triggers are likely to expand. This will open up new opportunities for developers looking to capitalize on “always-on” access to consumer attention.
Potential Startup Applications: enhancing productivity, training models to understand conversational context, facilitating voice-driven commerce and contextually-relevant advertising.
Although it may feel like we’re still in the early days of voice technology, the ecosystem has matured significantly over the past few years. It has become easier not only to build voice applications, but also to train them to deliver unique value at the right moments.
This year we will see startups begin to hit their stride around sustainable business models and products in the category, with the potential for large venture-backed winners to emerge.
If you’re one of those talented entrepreneurs building something cool in voice, please feel free to reach out. I’d love to talk!