Deconstructing AI and the Curious Bell Curve of Voice Computing

Understanding conversational computing, AI, and the definition of over-solving

Published in The Pointy End · 9 min read · Jul 8, 2018

Voice computing has been on the radar for quite some time and in recent years has reached a level of mainstream penetration that renders it almost pedestrian. Nobody is amazed by Siri anymore; if anything, people are more annoyed that she's not better than she is. The steady spread of home assistants such as Google Home and Amazon Alexa, coupled with speech-to-text becoming a standard element of any SDK, has turned voice computing into a broadly unexciting assumption.

There is an obvious distinction between voice computing and what we broadly deem 'AI', which is the crux of what made the announcement of Google Duplex all the more groundbreaking. The real game changer is that Duplex seems to be a technology that can talk to people, as opposed to something that is purely reactive to human commands.

The demo of Duplex shown at Google I/O, as well as commentary from those who have been provided early access to the product, illustrates a conversational agent with an impressive ability to weave through changing conversational contexts, ask for clarification, and defer the conversation to a human when abjectly confused. Impressively, according to Lauren Goode's experience with the product, Duplex was so effective at simulating humanity that when a human did interject and complete the conversation, Goode was mostly oblivious to the fact. Duplex really does seem like one of those defining moments that bring us ever closer to that illusory ideal of 'general AI'.

But if for a moment we take the blinkers off and be skeptical about what something like Duplex means for AI and computing generally, its utility is in fact highly questionable.

To understand that is to understand the entire 'scope' of the AI concept, and in tech there appears to be a notable lack of agreement on what different people mean when using the term 'AI'. For folks in IT, the 'AI' designation is generally used to denote enablers for automation and big data analysis; in consumer technology it tends to be equated with conversational computing; and in other circles, such as deep learning statistics or some realms of engineering, 'AI' is used effectively as a synonym for machine learning. Individually, each of these interpretations is a fragmented way of viewing the concept, but in totality they start to paint quite a holistic picture.

AI Deconstructed

I think a good way of deconstructing AI is by demarcating the tasks of learning and execution, and the relevant sub-elements that sit within each.

Machine Learning is the heartbeat of the AI revolution. The idea of AI (bringing a degree of human-like intelligence to silicon and wires) is a fantasy as old as computers themselves. The reason we talk about it with such fervour now is that the framework of machine learning (computers gaining knowledge through organic learning and neural nets instead of hard-coded, inflexible logic) is getting us tangibly closer to realising that fantasy.

En route to this ideal we need data: the inputs from which machines will actually learn. We need the algorithms that encompass the logic for continuous improvement, helping machines make sense of these 'sensory' inputs. And to run all of it we need the infrastructure, such as server farms (cloud services or on-premises) or local processing units, as well as the learning environments or 'runtimes'. Learning environments define whether we are conducting learning for a car (by capturing relevant driving data for autonomous vehicles), across a web-based service (data mining across email or social accounts), on a local device (such as iPhone Siri Suggestions), or any other specific environment.
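To make the 'learned rather than hard-coded' distinction concrete, here's a deliberately tiny sketch in Python (my own toy illustration, not anything from a production stack): rather than writing the rule y = 2x into the program, we let a single parameter converge towards it from example data.

    # Toy example: learn the rule y = 2x from data instead of hard-coding it.
    # Purely illustrative; real systems use frameworks, GPUs, and far more data.
    data = [(1, 2), (2, 4), (3, 6), (4, 8)]  # (input, expected output) pairs

    w = 0.0              # the single parameter the machine will 'learn'
    learning_rate = 0.01

    for epoch in range(200):
        for x, y in data:
            error = w * x - y                # how wrong the current guess is
            w -= learning_rate * error * x   # nudge w to shrink the error

    print(round(w, 3))  # converges towards 2.0; nobody coded the rule in

Swap the four pairs above for millions of driving frames or email threads, and the loop for a neural net, and you have the same shape of system the big players are running.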

Then, once we get machines to learn, how does that actually manifest? AI execution encompasses a variety of things that AI will be useful for: these four buckets are a pretty good summation of what those things could be.

  • Voice computing: this will enable us to talk to our computers (and computers to proactively talk to us) in a natural vocal manner. The ability to talk to our computers as we do with humans will have tangible impacts on the structure of the technology industry; for example, some realms of technology may become integrated as a result (potentially voice platforms and ecommerce) while others become disaggregated (service platforms and proprietary application development).
  • Natural language understanding: alongside understanding voice, computers will have access to other forms of ‘natural language’ such as images and video. We’re already seeing this applied in things such as learning for autonomous vehicles, and image intelligence in products such as Google Photos, but the potential applications will be bounded only by imagination.
  • Automation: one of the advantages of AI is increasing efficiencies for users and organisations by automating complex tasks. Technology has long existed to automate repetitive tasks; AI, however, will be able to perform activities that have generally required greater cognitive flexibility.
  • Inference: learning and analysis over large datasets will help us unlock insights and inferences that previously weren’t available. This is opening up areas such as scientific and health research.

Beneath all this is the concept of 'general AI', which we are still some distance away from. Reaching 'general AI' would require connecting all of the dots together right across machine learning and execution, and we're still on the path to getting these right individually. Before that happens, there are still plenty of fascinating developments to play out in how the whole industry might be structured and which technologies will gradually emerge as AI matures.

But do we need, or even want, 'general AI'? The undertone to this commentary is that I believe society is broadly over-obsessed with the concept of 'general AI': a machine whose intelligent qualities match or surpass human intelligence. What we really want from technology is things that make our lives better, and historically we've achieved this not by building replicable human simulations, but by building appliances that excel at their core task (like manufacturing robots and washing machines). The merits of a 'general AI' in servicing this goal carry many question marks, and there are well-publicised schools of thought that adamantly believe it will do quite the opposite.

Voice Computing and the definition of ‘Over-Solving’

Voice computing really is a product of this over-obsession with 'general AI': it is such an obvious, material way to demonstrate human likeness. Exactly two years ago I wrote a story discussing how machine intelligence coupled with voice computing might disrupt the make-up of the incumbent mobile application paradigm. The core reasoning was that a Conversational User Interface (CUI) encompasses a wholly different set of priorities and UX best practices than the traditional Graphical User Interface (GUI). Before that, the GUI represented a similar set of structural changes against its own incumbent, the Command Line Interface (CLI), and was largely responsible for creating the huge mobile application industry and the rise of UI/UX design as a discipline in its own right.

But what kind of penetration can we expect for the CUI? And more broadly, what kind of penetration can we expect for voice computing of the type demonstrated by Google Duplex? The diagram below might shed some light.

There is much literature in semantics and anthropology analysing the historical development of all sorts of communication methods. Without wading into the academia, it is fairly well agreed that humans developed language because it was the best way for us to talk, and we built computers with binary logic and information models because that was the best way for them to talk.

It goes without saying that before computers existed there was no room for voice computing, but as computers have developed and matured, a variety of ways for us to communicate with them have emerged: CLI, GUI, and now CUI.

With the CLI being largely deprecated amongst mainstream consumers, the GUI and CUI have become the predominant methods for human to computer interaction, both of which have their circumstantial merits.

  • GUI: suitable for high-involvement, high-precision activities. Imagine having to ask Photoshop every time you wanted to make a detailed adjustment. Mouse and pointer (or finger and touch) remain incredibly efficient input methods.
  • CUI: for low-involvement tasks which are simple to articulate. Things like asking for the weather, sending a message, or broad searches/requests are particularly suited to natural language conversation (see the sketch after this list).
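As a rough illustration of why those low-involvement tasks suit a CUI, here is a minimal keyword-matching sketch in Python (my own toy; real assistants use learned language models, not lookup tables). Simple requests collapse neatly onto a handful of intents:

    # Toy intent matcher: maps simple spoken requests onto canned actions.
    # Real assistants use trained models; this lookup is purely illustrative.
    def handle(utterance: str) -> str:
        text = utterance.lower()
        if "weather" in text:
            return "It's 18 degrees and cloudy."
        if "message" in text or "text" in text:
            return "Okay, who should I send it to?"
        return "Here's what I found on the web."

    print(handle("Hey, what's the weather like today?"))

A request like "move that anchor point three pixels to the left" in Photoshop has no such neat mapping, which is exactly the GUI's home turf.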

But what about computer-to-computer interaction? I would never expect my computer to ask a SQL database in plain English for a piece of data and retrieve it in conversation. Computers do this stuff, and have been doing it for years, in their own languages: in this case SQL, in other cases maybe JSON, or some low-level assembly language or machine code. Computers don't speak English because, within the bounds of binary architecture, English is a horrendously inefficient manner of communicating.
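As a toy illustration of the point, compare the query a program actually issues with the English a human would need. This is a minimal sketch using Python's built-in sqlite3 module; the table and data are entirely made up:

    import sqlite3

    # Hypothetical bookings table, purely for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookings (name TEXT, party_size INTEGER, time TEXT)")
    conn.execute("INSERT INTO bookings VALUES ('Lauren', 4, '19:00')")

    # How a computer asks: terse, unambiguous, machine-parseable.
    row = conn.execute(
        "SELECT time FROM bookings WHERE name = ? AND party_size = ?",
        ("Lauren", 4),
    ).fetchone()
    print(row[0])  # 19:00

    # How a human would ask: "Hi, could you tell me what time the booking
    # under Lauren for four people is?" Every word of that would have to be
    # parsed, disambiguated, and mapped back to exactly the query above.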

The media exploded last year after Facebook's AI allegedly built its own language. The less exciting reality was that Facebook's AI was simply learning and implementing ways to make its conversation more efficient, and suffice it to say, it didn't come up with English.

So here is the really curious thing about Google Duplex and why it exists. Google Duplex exists to make calls for us: it will call and converse convincingly with an answering human. Logically, as the technology penetrates, it should become able to answer calls for us too, which would result in a situation whereby a computer is talking to another computer… in English. Through an amazing technological feat, we've 'solved' a problem we never had; or, more accurately, we've inefficiently re-solved a problem we already solved decades ago when we invented databases and relational information models. Hence the bell curve.

This is the most curious, but oddly brilliant, thing about Duplex. Google Duplex was demoed as a product that can make bookings for us at restaurants and salons. Yes, that is a problem easily solved with a database and an integrated bookings platform (see the sketch below). But if Duplex's ambition is to become 'generally' intelligent, it should scale to become capable of many more things. Then it becomes a question of what is more efficient: one solution that can scale to rule them all, or a more technically efficient bespoke solution for every potential use case? Purely from Google's own vantage point, the former option is obviously superior, and leverages Google's technical moat. John Gruber observed that Duplex appears to be a technology in search of a product, which could well be its defining master stroke.
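For contrast, here is roughly what the 'already solved' version of the restaurant-booking problem looks like: one structured request to a bookings platform. The endpoint and field names below are hypothetical, but platforms of this kind have exposed interfaces like this for years:

    import json
    import urllib.request

    # Hypothetical bookings-platform endpoint and payload, for illustration only.
    payload = {
        "restaurant_id": "sichuan-garden-01",
        "party_size": 4,
        "datetime": "2018-07-14T19:00:00",
        "name": "Lauren",
        "phone": "+1-555-0100",
    }

    req = urllib.request.Request(
        "https://api.example-bookings.com/v1/reservations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # One round trip, no small talk, no hold music:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status)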

But this isn’t just about Google, this is about the whole industry. It is easy to look at Google’s advantage in technical brilliance (which it is arguably unsurpassed) and think competition would be futile. Here though is the very definition of Clayton Christensen’s theory of disruption; pure brilliance is not a necessary pre-cursor to success, rather success is usually a product of something much easier: product-market fit. Brilliance tends to result in ‘over-solving’ for a particular problem, when simpler, cheaper solutions often satisfy solutions*.

Source: Harvard Business Review — https://hbr.org/2015/12/what-is-disruptive-innovation

So the way to compete with Google isn't to build a better generalist conversational AI; it is to fill in all the white space and gain sufficient market penetration with fit-for-purpose solutions, ensuring that a generalist conversational AI doesn't even make sense. Outside of being one of two input methods for humans to communicate with computers, voice computing is wildly inappropriate for almost any other task.

The fervour over Duplex, and voice computing generally, is certainly overblown. Voice is the most superficially impressive of AI's four execution points, but the least substantial.

*Of course, the caveat here is that Google Duplex may entail such tremendous fixed-cost spread as to be more efficient than bespoke industry solutions, even if not technically so. Having said that, I doubt very much that even the most intelligent general AI would surpass the capability of an appliance built for one specific task. Take a robot built specifically to attach a door to a car on a production line: yes, a generalist robot could be asked to perform this task, but it would not be *optimised* to achieve it with the same speed and accuracy. Equally, you could not ask the production-line robot to wash a dish. This is as much an evolutionary conversation as it is a technological one, and I do believe there are natural limitations to how *good* something can be when its purpose is diverse. After all, humans are the most *complete* species known, but we are neither the fastest, the strongest, nor the best adapted to climatic extremes.

