“Laughing through tears” at the railway station
Chekhov as the Voice Interface
Ding-dong! The train. 503! To. Stockholm! Has a new departure time. 15:37! Ding-dong! The train. 442! To. Gothenburg! Is cancelled. Ding-dong! The train. 334! To. Malmö! Is. Just coming in on. Platform 3! Please note the platform change! Ding-dong!
Disruptions to train services are particularly grim experiences — with “extensive signal failures”, hours of delays and countless cancelled trains — but the synthetic voice carries on regardless, with scarcely a second between announcements.
It becomes very hard to interpret, perhaps even register, what the voice has actually been saying. “That was something about my train — but what was it this time?!?”
The advent of the web, over twenty years ago, introduced an important new idea. Instead of deciding that a word should be bold or italic, as in print media, you provided a semantic description of the word's importance, tagging it with <strong> or with <em> (for emphasis).
In practice, of course, this appeared just as bold or italic in most web browsers. But the fundamental difference was that, for example, a voice interface system could interpret the tags — and pronounce marked words with more, or much more, emphasis than ordinary words — while “reading” a script.
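This pipeline exists today: the W3C's Speech Synthesis Markup Language (SSML) has an <emphasis> element that plays exactly the role <strong> plays in HTML. A minimal sketch (the announcement wording here is invented for illustration):

```xml
<!-- SSML: the synthesizer renders the tagged words with extra stress -->
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis">
  The train <emphasis level="strong">503</emphasis> to Stockholm
  has a new departure time: <emphasis level="moderate">15:37</emphasis>.
</speak>
```

Most modern text-to-speech engines accept input along these lines, though emphasis remains a blunt instrument compared with a human tone of voice.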
That was probably good enough at the time, when what was to be read was typically some document or article, with just a few bold or italic words.
But new technology is capable of far more, and is used for much more complex situations.
One of them being train disruptions.
And there are so many ways in which service can be affected! Some trains may have vanished completely. Others seem to be in dire straits, but then — miraculously! — they enter the station! Others just get more and more delayed: first ten minutes, then ten more minutes, then another fifteen, and so on …
These cases carry quite different contexts and emotional charge. And the emotional charge is an important part of the information. In a noisy environment with a lot of echoes, the synthetic words (that “Univoice”) can be difficult to hear and interpret. Yet with a natural, human speaker, tone of voice helps us interpret and understand what the message actually is.
When all announcements sound the same, we have to rely solely on our intellectual understanding and analysis of the words. And that’s much harder.
What we really need is a completely new mark-up for voice interfaces.
As delays become longer and longer, platform announcements should be tagged <deep compassion> or perhaps <shamefaced apology>. When it looks as if problems are being solved and delays are diminishing, the voice should use <certain optimism>. When a train unexpectedly enters the station, despite previously announced delays, we need the <triumphant after major setbacks> tag. In other situations there may be no option but <restrained despair>. And so on. (And not only for train services, of course.)
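What might such a mark-up look like? A sketch using the essay's own tag names; the syntax is invented for illustration, and no standard vocabulary like this exists today (the W3C's EmotionML and some vendors' proprietary SSML extensions gesture in this direction):

```xml
<!-- Hypothetical emotional mark-up for platform announcements -->
<announcement train="503" situation="delay-growing">
  <shamefaced-apology>
    The train to Stockholm is delayed a further fifteen minutes.
  </shamefaced-apology>
</announcement>

<announcement train="334" situation="arrival-after-disruption">
  <triumphant-after-major-setbacks>
    The train to Malmö is just coming in on Platform 3!
  </triumphant-after-major-setbacks>
</announcement>
```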
We may not be able to manually tag all possible phrases for a system. Perhaps AI will eventually try to interpret the context and adjust tone, sentence melody, speech speed and dynamics (the elements of what we call ‘prosody’).
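The knobs such a system would turn already exist in SSML's <prosody> element, which exposes speaking rate, pitch and volume; what is missing is the layer that chooses their values from context. A hedged sketch of how "restrained despair" might be approximated with today's raw controls:

```xml
<!-- Approximating "restrained despair" with SSML prosody attributes -->
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis">
  <prosody rate="slow" pitch="-3st" volume="soft">
    The train 442 to Gothenburg is cancelled.
  </prosody>
</speak>
```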
But, until then: the old <strong> and <em> tags are already inadequate, remnants of the limited possibilities of representing speech in writing. What voice interfaces need is something more like a playwright's instructions to the actors reading the lines:
IRINA [coldly] Stop now.
KULYGIN [hurt] Dear Masha, why?
MASHA I’ll tell you later [annoyed].
[Laughing through tears]: The train to Moscow will finally depart from Platform 3 …
Jonas Söderström is a UX strategist, writer and keynote speaker from Sweden. He studied Russian during the cold war, and considers “The Cherry Orchard” to be Anton Chekhov’s best play; and possibly the best play ever written.
(This text was originally published in Swedish in 2016, as “De okänsliga utropen, eller: Tjechov som maskininstruktör” [“The insensitive announcements, or: Chekhov as machine instructor”].)