Why enterprise voice AI is about to turn an important corner

Orange Silicon Valley
AI and machine learning
6 min read · Feb 13, 2018
Image credit: aerogondo — stock.adobe.com

By Mark Plakias

(Editor’s note: A version of this post previously appeared on our blog at OrangeSV.com.)

As 2018 begins, it’s time to update enterprise developers’ list of Must Haves. The old paradigm of “Every company needs a mobile strategy” has now evolved into “Every company needs a voice/AI strategy.” Another way to understand this evolution is through the UI’s progression from tapping a touch screen, to swiping, to speaking. Those changes in turn have triggered new demands for conduct and code.

There’s no doubt human behavior is adapting. SoundHound VP of Product Marketing Mike Zagorsek made the point during a January event at Orange Silicon Valley that GenZ will be the “Voice First” generation, and the data and anecdotal evidence back this up. A recent Walker Sands survey showed 22% of 18–25-year-olds have an Alexa or Google Home device — and that rate goes up to 46% for 36–45-year-olds. The growing circuit of conferences focused on voice assistants makes it clear that everyday family technology experiences are exposing younger members of the GenZ cohort to voice interactions at an early age. This means that emotionally encoded relationships (i.e., ones using the human voice) with artificial (and seemingly sentient) agents are becoming widespread in the home in 2018.

According to Dashbot.io — which has built an analytics platform that has already looked at 22 billion conversations, both written (Messenger, Slack, Kik) and spoken (Alexa and Google Home) — the average session time for Alexa is just under five minutes, while Google’s assistant gets an average of two minutes and 38 seconds. Engagement in the home is strong, with 50% of Alexa users reporting multiple sessions per day, although only 25% of surveyed users use more than three apps (and yes, music apps are still №1).

So what about the enterprise? Well, before we go there, let’s take a look back at history — and to do so we don’t need to go further than Orange in 1999. That’s roughly two years after a startup based in Cambridge, Massachusetts, named Wildfire received patent approval for its design for a “Network based knowledgeable assistant.” The assistant was basically a fancy, voice-based voicemail system with the ability to place, take, and connect phone calls. The UI it described included this: “… in response to receiving a summoning command, switching the electronic assistant into a foreground mode … and when in its foreground mode responds to a second set of commands where the second set of commands is larger than the first set of commands.”
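To make that patent language concrete, here is a minimal sketch of the two-mode pattern it describes: a tiny command set available in the background, and a larger one once the assistant has been summoned to the foreground. This is my own illustration, not Wildfire’s implementation, and the command names are hypothetical.

```python
# Illustrative sketch of a Wildfire-style two-mode command pattern:
# a small command set is recognized in the background, and a larger
# set becomes available once the assistant is summoned to the foreground.
# Command names here are hypothetical, not taken from the patent.

BACKGROUND_COMMANDS = {"wildfire"}  # the summoning command
FOREGROUND_COMMANDS = {
    "call", "take a message", "connect me", "what are my messages", "never mind",
}

class Assistant:
    def __init__(self):
        self.foreground = False

    def hear(self, utterance: str) -> str:
        cmd = utterance.strip().lower()
        if not self.foreground:
            if cmd in BACKGROUND_COMMANDS:
                self.foreground = True
                return "I'm here."
            return ""  # ignore everything else while backgrounded
        if cmd == "never mind":
            self.foreground = False
            return "Going back to sleep."
        if cmd in FOREGROUND_COMMANDS:
            return f"OK, handling: {cmd}"
        return "Sorry, I didn't catch that."

assistant = Assistant()
print(assistant.hear("Wildfire"))             # summon -> foreground mode
print(assistant.hear("what are my messages")) # now the larger command set applies
```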

Orange was the first major carrier to deploy Wildfire in its network and went on to buy the company for $143 million (in the process adding one of its co-founders, Rich Miner, to the payroll). At roughly the same time, on the other side of the country, an Apple-inspired startup called General Magic was working with AT&T to develop a similar voice-powered messaging platform, work that would eventually find its way into OnStar for General Motors.

Flash forward to January 2018, when Orange Silicon Valley hosted a fascinating conversation about conversational AI. What made this feel new-yet-familiar was the injection of voice-powered virtual assistants into the core of the modern enterprise: The Meeting.

Three different approaches to the conversations we have at work emerged from the event, titled “We Hear You: New Use Cases for Natural Language Processing” and co-sponsored with the long-standing speech recognition working group AVIOS. The event featured two startups focused on enterprise applications and, arguably, the incumbent with the broadest access to enterprise meetings of any player: Cisco.

Voicera is the next project from two of the co-founders of BlueKai, and Voicera CPO David Weiner can speak from the experience of being absorbed into Oracle’s business culture, which is where BlueKai ended up. Voicera’s ambition is large: to become the “system of record for voice in the Enterprise.” It is telling that it’s 2018 and a startup can assert that “no one owns that asset.”

AISense, another startup focused on transcribing the enterprise conversation, is betting that conferencing bridges are the way in. It has announced a conferencing deal with Zoom Video Communications, and indeed the product is headed toward beta, with integration into the Zoom dashboard so it can be switched on as part of Zoom’s Cloud Recording infrastructure. Voicera also lists Zoom as an integration.

Well, hold on: Conferencing is Cisco’s bread and butter, and it has already made massive investments to bring its Spark messaging platform front and center as its collaboration ecosystem. As part of that investment, last year Cisco acquired MindMeld, one of the leading natural language development platforms, for $125 million and has integrated the asset into the Cisco Cognitive Collaboration Group. Thanks to its embedded base of microphones and cameras in corporate meeting rooms, Cisco has (on paper anyway) the most frontage on converging conversations into enterprise workflows. Because the room is pre-wired, activation is a simple matter of walking in and waking up the Spark voice assistant. This joins an existing ecosystem of chatbots and third-party add-ons in the Cisco Spark Depot app store (for example, Zoom.ai).

AISense probably comes closest to an atomic view of the Conversation as just that: two people conversing. AISense exec Seamus McAteer demonstrates using it for phone calls and one-on-one conversations, not necessarily formal meetings. That said, we already mentioned the company’s deal with Zoom (not to be confused with Zoom.ai) to record conference calls (audio, video, and even text chat). It makes total sense that Bridgewater Associates — founded by hedge fund manager Ray Dalio and famous for recording every meeting and publishing it to its employees — was in on last November’s $10 million Series A round.

But listening in is just the beginning. Weiner describes meetings as the biggest productivity killer in the Enterprise — worse than email itself. Voicera sees meetings not just as words but as collections of meanings: action items, follow-ups, sentiment. Having recruited from Facebook’s Applied AI group, the company is preaching the idea of extracting follow-ups and insights for the team after the meeting. The use cases for mined conversations are potentially rich: summarization, coaching, customer analytics, and hiring interviews are just some of the examples that make the business case for AI.
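To give a flavor of what “collections of meanings” might mean in practice, here is a deliberately naive sketch, my own illustration rather than Voicera’s approach, that mines a transcript for action items using keyword rules; a production system would rely on trained NLP models over speaker-labeled ASR output rather than regexes.

```python
import re

# Hypothetical transcript in (speaker, text) form; real systems would consume
# ASR output with timestamps and speaker labels.
transcript = [
    ("Dana", "I will send the revised budget to finance by Friday."),
    ("Lee",  "Can you follow up with the vendor about pricing?"),
    ("Dana", "Sure. Also, let's schedule a review meeting next week."),
]

# Naive cue patterns for commitments and requests; a trained model would replace these.
ACTION_CUES = re.compile(
    r"\b(i will|i'll|can you|let's|we need to|follow up|schedule)\b", re.IGNORECASE
)

def extract_action_items(lines):
    """Return (speaker, sentence) pairs that look like commitments or requests."""
    items = []
    for speaker, text in lines:
        for sentence in re.split(r"(?<=[.?!])\s+", text):
            if ACTION_CUES.search(sentence):
                items.append((speaker, sentence.strip()))
    return items

for speaker, item in extract_action_items(transcript):
    print(f"{speaker}: {item}")
```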

But we’re still walking, not running. Voicera’s idea of activation is still pretty manual: either a GUI web interface or a voice command is used to wake up the recorder and flag highlights, as opposed to brute-force recording of everything. But Voicera seems to have cracked an important design pattern: getting to the meeting in the first place. By integrating natively with the calendaring system (the persona, Eva, has her own email address, which is added to the meeting invite) and then ingesting the invite and autonomously dialing into the conference bridge, Eva demonstrates that 80% of success is showing up, even if you are an AI.
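As a rough illustration of that “showing up” pattern, again my own sketch and not Voicera’s implementation, an assistant invited by email could scan the iCalendar body of the invite for a dial-in number and access code before calling the bridge. The invite contents, field layout, and regexes below are all assumptions.

```python
import re

# Hypothetical .ics body from a meeting invite; real invites vary widely.
ICS_INVITE = """BEGIN:VEVENT
SUMMARY:Quarterly planning
DTSTART:20180205T170000Z
LOCATION:Dial-in: +1-555-010-1234 Access code: 987654#
DESCRIPTION:Join by phone. Dial-in: +1-555-010-1234\\, access code 987654#
END:VEVENT"""

def extract_bridge_details(ics_text: str):
    """Pull a dial-in number and access code out of an invite, if present."""
    number = re.search(r"Dial-in:\s*([+\d][\d\-\s().]{6,})", ics_text)
    code = re.search(r"[Aa]ccess code:?\s*(\d+)", ics_text)
    return (
        number.group(1).strip() if number else None,
        code.group(1) if code else None,
    )

bridge, access_code = extract_bridge_details(ICS_INVITE)
print(f"Dial {bridge}, then enter {access_code}")  # the assistant would dial out here
```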

Where’s the elephant in the room? You must be thinking of AWS and its Amazon Transcribe ASR-as-a-service, which was announced in November 2017 and is still in preview mode. For all of Amazon’s accomplishments in the home with Alexa on Echo, Transcribe still lacks individual speaker identification in the transcripts it produces. Other examples in the Conversational Enterprise space from big players include SAP’s CoPilot and Oracle’s more constrained Voice app for its CRM platform.
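For reference, submitting audio to Amazon Transcribe is a short exercise with the AWS SDK. The sketch below uses boto3 with made-up bucket, file, and job names; since the service was still in preview at the time of writing, availability and output details may vary by account and region.

```python
import boto3

# Minimal sketch of submitting a recording to Amazon Transcribe via boto3.
# Bucket, file, and job names are made up for illustration.
transcribe = boto3.client("transcribe", region_name="us-east-1")

transcribe.start_transcription_job(
    TranscriptionJobName="all-hands-2018-02-01",
    LanguageCode="en-US",
    MediaFormat="wav",
    Media={"MediaFileUri": "s3://example-meeting-audio/all-hands-2018-02-01.wav"},
)

# Poll for the result; the finished job points at a JSON transcript file.
job = transcribe.get_transcription_job(TranscriptionJobName="all-hands-2018-02-01")
print(job["TranscriptionJob"]["TranscriptionJobStatus"])
```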

In an ideal world, pride of place for enterprise conversational AIs should go to the swiftest and the smartest: speed and accuracy are the most obvious performance measures for both command-driven routines (scheduling a meeting, setting up a conference call) and pure transcription. We all know how painful inaccurate transcriptions look, and that’s why most of the world is still paying something like $0.79 a minute for humans when readable transcripts are necessary. Nevertheless, accurate text retrieval, with the added benefit of playing the audio in sync, is the new enterprise search engine.
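Accuracy in transcription is usually scored as word error rate (WER): the word-level edit distance between the ASR output and a human reference transcript, divided by the length of the reference. A small self-contained sketch with made-up sentences:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as standard Levenshtein distance over words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

# Made-up example: one substitution and one deletion across six reference words.
print(word_error_rate("schedule a call with the vendor",
                      "schedule a call with vendors"))  # ~0.33
```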

We aren’t there yet. But just as the consumerized enterprise transformed software into a service, the march of conversational technologies like natural language understanding and AI will transform meetings into information.

Disclaimer: The views and opinions expressed in this article belong to the author and do not necessarily reflect the position or views of Orange or Orange Silicon Valley.
