Making the Leap from Speech to Dialogue: The Challenge for Human to Machine Communication

Robots are everywhere and doing virtually everything. We have even begun conversing with them in situations that are beginning to resemble interpersonal communication. Right now these spoken dialogue systems (SDS) tend to be limited to a “command-based” approach, which can be seen in a number of recently introduced commercial implementations, like Apple’s Siri for iOS, Amazon’s Echo/Alexa, and the social robot Jibo.

The command-based approach to SDS design works reasonably well, as it predetermines much of the semantic context, communicative structure, and social variables by keeping conversational interactions within manageable boundaries. Yet the development of more robust SDS will depend not only on advances in engineering but also on better understanding and modeling of the actual mechanisms and operations of human-to-human communicative behaviors.

Unfortunately, the two disciplines that deal with these subjects — engineering and interpersonal communication — have not recognized and/or exploited this interdisciplinary opportunity and challenge. Engineers, for their part, either have tried to reinvent the wheel themselves or have sought advice from research and researchers in other disciplines, like social linguistics or psychology. Communication scholars have often limited their research efforts and findings to human communication. When they have dealt with computers or bots, they have typically considered the mechanism as a medium of human communicative exchange — what is called “computer mediated communication” or CMC.

From the beginning, it is communication — in the form of conversational interpersonal dialogue — that provides AI with its definitive characterization. This is immediately evident in Alan Turing’s “Computing Machinery and Intelligence,” which was first published in the journal ‘Mind’ in 1950. “The idea of the test,” Turing explained in a BBC interview from 1952, “is that the machine has to try and pretend to be a man, by answering questions put to it, and it will only pass if the pretense is reasonably convincing. A considerable proportion of a jury aren’t allowed to see the machine itself. So, the machine is kept in a faraway room and the jury are allowed to ask it questions, which are transmitted through to it”.

According to Turing, if a computer is capable of successfully simulating a human being in communicative exchanges (albeit exchanges that are constrained to the rather artificial situation of typewritten questions and answers) to such an extent that the jury cannot tell whether they are talking with a machine or another human being, then that device would need to be considered intelligent.

Because they derive from Turing’s original proposal, all chatterbots, irrespective of design, inherit two important practical limitations:

  1. The mode of interaction is restricted to a very narrow range of interpersonal behaviors. Chatterbots have been designed as question-answering systems. That is, their social involvement is intentionally limited to situations where human interrogators ask questions and the machine is designed to provide responses.
  2. These Q&A interactions are restricted to typewritten text. For Turing, and the chatterbots that follow his lead, the use of textual interaction is a necessary and deliberate element of the imitation game’s design. The main reason for limiting the interrogation to text form is to level the playing field: “In order that tones of voice may not help the interrogator the answers should be written, or better still, typewritten.”

Recent SDS implementations, especially commercially available products like Siri, Echo/Alexa, and others, are not one technology but an ensemble of several different but related technological innovations:

  • “automatic speech recognition (ASR), to identify what a human says;
  • dialogue management (DM), to determine what that human wants;
  • actions to obtain the information or perform the activity requested; and
  • text-to-speech (TTS) synthesis, to convey that information back to the human in spoken form.”
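
The ensemble listed above can be pictured as a simple pipeline. What follows is only a minimal sketch under loud assumptions: every function name, the `Intent` type, the keyword rules, and the canned replies are hypothetical stand-ins invented for illustration, and the "audio" is just UTF-8 bytes. It does not describe how Siri, Alexa, or any real product is built.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Intent:
    """What the dialogue manager decided the user wants (hypothetical type)."""
    name: str
    argument: str = ""


def recognize_speech(audio: bytes) -> str:
    """ASR stand-in: a real recognizer decodes audio; here the bytes are already text."""
    return audio.decode("utf-8").strip().lower()


def manage_dialogue(transcript: str) -> Intent:
    """DM stand-in: map the transcript to an intent using simple keyword rules."""
    if transcript.startswith("what time"):
        return Intent("tell_time")
    if transcript.startswith("play "):
        return Intent("play_music", transcript[len("play "):])
    return Intent("unknown", transcript)


def perform_action(intent: Intent) -> str:
    """Action stand-in: carry out the request and return a textual reply."""
    handlers: Dict[str, Callable[[Intent], str]] = {
        "tell_time": lambda i: "It is noon.",  # fixed answer, assumption for the demo
        "play_music": lambda i: f"Playing {i.argument}.",
    }
    fallback = lambda i: "Sorry, I did not understand that."
    return handlers.get(intent.name, fallback)(intent)


def synthesize_speech(text: str) -> bytes:
    """TTS stand-in: a real synthesizer produces audio; here text becomes bytes."""
    return text.encode("utf-8")


def respond(audio: bytes) -> bytes:
    """Run one command through the four stages in order."""
    transcript = recognize_speech(audio)
    intent = manage_dialogue(transcript)
    reply = perform_action(intent)
    return synthesize_speech(reply)
```

Calling `respond(b"Play jazz")` passes the request through all four stages. In this sketch, every stage after `recognize_speech` sees only a text transcript, so anything carried by the voice itself is gone before the dialogue manager runs.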

Despite their apparent complexity and technical advancement beyond text-based chatterbots like ELIZA, SDSs are still designed for and operate mainly with text data.

A good deal of conversational interaction is negotiated through nonverbal elements, which can include visual cues, or “body language”. Right now, commercially available SDS applications, like Siri and Echo/Alexa, attend only to what is said. How it is said, and in what particular fashion it is articulated, is not necessarily part of current implementations.

Addressing this gap requires the development of an interface between the fields of engineering and communication studies. Doing so will involve making theory computable so that the insights that have been generated by decades of communication research are not just human readable but are also rendered machine executable. At the same time, engineers will need to learn to recognize and appreciate how this so-called “soft science” can speak to and contribute the data necessary to address many of the open problems in SDS development.

Human-like conversation is generally considered to be a natural, intuitive, robust, and efficient means for interaction. The ability to handle phenomena commonly used in human conversations could ultimately make systems more natural and easier for humans to use, but it also has the potential to make things more complex and confusing.

Modeling human-to-machine (h2m) communication on human-to-human (h2h) communication might be the wrong place to begin, just as modeling “machine intelligence” on human cognition turned out to be a significant impediment to progress in artificial intelligence (AI). Identifying this assumption, however, does not militate against the argument for including interdisciplinary collaboration in SDS development.

Originally published on August 2, 2018.