Free speech

Vilde Reichelt
Published in Bakken & Bæck · Oct 1, 2021


As human-machine interactions become increasingly present in our work at BB, so do discussions about Voice User Interfaces (VUIs) and the way we design them. Speech-activated services are constantly becoming more diverse and accessible to more people. Still, it seems impossible to completely mitigate bias against individual differences such as conversational style. As machines depend on the data we feed them, how do we tackle the actual nature of spoken language and train algorithms for truly inclusive voice experiences?

Illustrations: Simon Bailly

You don’t say? — Words are never just words

As a species, we’ve been conversing for half a million years. In the course of that time, we’ve learned not only to speak and to listen, but also to extract the actual meaning behind what’s been said. We evolved to grasp the tiniest emotional, cultural and contextual hints that are codified into our pronunciation, intonation and articulation. The acoustic elements of the speech sounds from our unique voices are just peanuts compared to the variation in our manners of speaking: our choice of words and jargon — using elements such as jokes, figures of speech, telling stories, asking questions, apologising; everything that reflects our personalities.

Human intelligence includes the unique skill of cooperation. That’s what makes it possible for us to have conversations, blabber and share knowledge, despite our physical and psychological differences. We won’t be stopped by a little cold, a speech impediment or other physical factors that make people sound different from us. People speak with all kinds of stuttering, stammering, mispronunciation, hesitation, slurring and self-correction, and we still understand each other. That’s really saying something.

Putting aside the fact that we rely heavily on visual input and body language when we talk, we express individual differences through our accents, dialects, sociolects and idiolects. That includes any “mistakes”, like stumbling, mumbling, recapitulating and interruptions, too. Our speech variants carry personal information about who we are, how and where we grew up, our moral qualities and who we want to be. Our individual conversational styles are formed throughout life. We’re constantly listening for cues and confirmation from the other speaker — making sure that we’re on the same page, so to speak.

Our advanced vocal tracts and the ability to distinguish speech sounds are uniquely human.

Now we’re talking — Mutter to me

It might not come as a surprise that machines struggle to understand us. They need a lot of training to learn what people understand inherently, through our combination of empathy, classification and memory, which lets us read between the lines and helps us identify with different social groups. On a recent BB project, we asked ourselves whether these features can be recognised and classified by a programmed voice user interface (VUI) — what does it need in order to understand humans, and what should it produce in return?

When creating VUIs, we train language models for automatic speech recognition (ASR), and these need large spoken-language corpora of transcribed training and test data. A working language model needs about 200 hours’ worth of audio, collected from utterances read out loud. If you want to make a medical speech recognition system, you need recordings of conversations between doctors and patients. When the speech segments are transcribed, the transcription has to include everything — warts and all. In order to be inclusive and to recognise all types of speaking voices, the model should be balanced across gender, age, accent, speech impediments and so forth.
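
What “balanced” can mean in practice is something you can check before any training starts. Below is a minimal Python sketch that tallies recorded hours per demographic group from corpus metadata; the field names and category labels are invented for illustration and not taken from any particular dataset or toolkit.

```python
from collections import Counter

# Hypothetical corpus metadata: one entry per transcribed recording.
# The field names and category labels are placeholders.
recordings = [
    {"gender": "female", "age_group": "18-29", "accent": "western",  "hours": 1.5},
    {"gender": "male",   "age_group": "60+",   "accent": "northern", "hours": 0.5},
    {"gender": "female", "age_group": "30-59", "accent": "capital",  "hours": 2.0},
    # ...in practice, hundreds of entries adding up to ~200 hours
]

def hours_per_group(recordings, attribute):
    """Sum recorded hours per value of one demographic attribute."""
    totals = Counter()
    for rec in recordings:
        totals[rec[attribute]] += rec["hours"]
    return totals

for attribute in ("gender", "age_group", "accent"):
    print(attribute, dict(hours_per_group(recordings, attribute)))
    # A heavily skewed distribution here means the model will hear some
    # kinds of voices far more often than others.
```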

The reason why every bit of speech, even coughs and pause-filling ums and ehs, has to be described and documented is that it conveys information about the situation and the context: the shared information between the speakers and various minor and implicit details about human behaviour, like mood. That said, most systems are only set up to recognise words that fit into the defined phoneme sets, i.e. the speech sounds listed in a pronunciation lexicon. These gigantic datasets can vary between 100,000 and one million words, but they rarely include all variants. Because of this, a lot of meaningful, non-standard forms and dialectal variants are filtered out and lost in the process.
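
As a toy sketch of that filtering step, assume a tiny hand-written pronunciation lexicon; a real one maps orthographic words to phoneme sequences in the same way, just at a scale of 100,000 entries or more. Both the lexicon and the utterance below are made up.

```python
# Only tokens that exist in the pronunciation lexicon survive, so pause
# fillers and dialectal variants are silently dropped.
pronunciation_lexicon = {
    "could":  "k ʊ d",
    "you":    "j u",
    "repeat": "r ɪ p iː t",
    "that":   "ð æ t",
}

utterance = "ehm could you eh repeat that innit"

kept, dropped = [], []
for token in utterance.split():
    (kept if token in pronunciation_lexicon else dropped).append(token)

print("recognisable:", kept)     # ['could', 'you', 'repeat', 'that']
print("filtered out:", dropped)  # ['ehm', 'eh', 'innit']
```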

“Could you repeat that?” — The invisible bias

Despite amazing advances in natural language understanding (NLU), the underlying technology of voice experiences relies on text strings, and the models need to be trained on spontaneous speech, not well-pronounced, grammatically correct sentences. The moment we encode a spoken utterance into text, we place it in the written standard of that language, making it linguistically biased from the get-go. Language purity is exclusionary, because standardisation means simplifying all the variants to fit what the system can recognise, rather than appreciating the diversity of speech expressions.
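
To make that encoding step concrete, here is a small, entirely made-up example: a verbatim transcript on one side, and the standardised form that typically ends up in the training data on the other. The replacement rules are purely illustrative.

```python
verbatim = "ehm I dunno we was gonna ask the doctor innit"

normalisation_rules = {
    "ehm": "",             # hesitation removed
    "dunno": "don't know", # contraction standardised
    "was": "were",         # dialect grammar "corrected"
    "gonna": "going to",
    "innit": "isn't it",   # tag question standardised
}

standardised_tokens = []
for token in verbatim.split():
    replacement = normalisation_rules.get(token, token)
    if replacement:                      # tokens mapped to "" are dropped
        standardised_tokens.append(replacement)

print(" ".join(standardised_tokens))
# -> "I don't know we were going to ask the doctor isn't it"
# The hesitation and the dialectal forms are gone before the model ever sees them.
```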

Language acquisition in robots is different from the philosophical investigations that people make.

Unlike other forms of bias, setting up a hierarchy of spoken language standards is not necessarily considered bad practice. In many countries, the official tongue is the language variant spoken in the capital. It’s normal, for example, for presenters in public broadcasting to switch away from their regional dialects on TV and radio in order to accommodate a certain arbitrary standard that is assumed to be the “correct” form. People often meet discrimination and harassment in the job market, experiencing that their accent or way of speaking is stigmatised. In order to be acknowledged and respected, or even just understood, they change their natural form of expression.

A similar standardisation is maintained in conversation design for VUIs, and it is often skewed even further as tech companies extract training data from their internal teams. Even though we at BB try to be as diverse as possible, most of us share similar characteristics. We have to be aware that members of the same workplace and social groups often have experiences that are distinct from those of the communities we are designing for. Those who make voice interfaces are often highly educated, with terminology and internal jargon for their own discipline, and when you’re used to hearing the same language varieties as your own, it’s easy to miss the great linguistic diversity in how other people really sound.

The frequencies and range of our individual voices are difficult to capture in complete language models.

Goes without saying — Adapting to the machine

Since our conversations are complex and based on factors like intimacy with the other speaker, our cultural references, confidentiality and the shared context of the conversation, it’s natural to be sceptical about what a speech robot “knows”. Timing plays a big part in this as well. With sound, as opposed to reading text, we don’t usually have any patience after shouting “Hey Siri” or “OK Google”. There’s no leeway for repetition, and we don’t like to have our questions reported back to us.

Effortless speech control is important, especially for those of us with disabilities or limited mobility. That’s why VUIs should enable people to use the same conversational style and natural way of speaking as they would with a human being. As users, we’re constantly aware that they are machines, and we easily spot robotic patterns in the interaction flow. At the same time, we expect a certain level of quality — almost perfection, and that they adapt to us. In short, we lack the same empathic values when we speak to robots. This creates a self-reinforcing loop: the system is trained on unnatural speech, which in turn leads us not to speak to it in a natural way.

To preserve people’s sense of identity, so that they can perform speech acts that sound like their normal manner of talking, we need robots that can adjust to the situation and to the person they “speak” with. If people are not comfortable talking to voice interfaces at all — let alone in their own dialect — the training data for machine learning won’t have the vocabulary coverage it needs to include all types of people. Instead of relying only on our own team’s intra-language and standardised forms, we should cover a wide range of speakers and allow for cultural, ethnic, regional and individual differences — with all the flaws and imperfections. We have to go out there and listen to how real people talk.
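
One way to make this concrete is to measure vocabulary coverage per speaker group before training, rather than after the complaints arrive. A minimal sketch, with invented transcripts and an invented system vocabulary standing in for real field recordings:

```python
# Measure how much of each group's actual speech the system vocabulary covers.
# All word lists and transcripts below are placeholders.
system_vocabulary = {
    "what", "is", "the", "weather", "today", "play", "some", "music", "stop",
}

field_transcripts = {
    "capital speakers":  ["what is the weather today", "play some music", "stop"],
    "regional speakers": ["whit like is the weather the day", "gie us a wee tune"],
}

for group, utterances in field_transcripts.items():
    tokens = [token for utt in utterances for token in utt.split()]
    covered = sum(1 for token in tokens if token in system_vocabulary)
    print(f"{group}: {covered}/{len(tokens)} tokens covered "
          f"({covered / len(tokens):.0%})")
# A large gap between the groups is a sign that the corpus needs more
# recordings of how people outside the standard actually talk.
```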

Relevant books and articles
- Sinduja Rangarajan (2021) Hey Siri — Why Don’t You Understand More People Like Me?
- Chiara Martino (2020) What is Conversational AI? An introduction to conversational interfaces
- Michael Huang (2020) Introduction to Conversation Design
- Sarah Elizabeth Verdon (2020) The impact of linguistic bias upon speech-language pathologists’ attitudes towards non-standard dialects of English
- McLean & Osei-Frimpong (2019) Hey Alexa… Examine the variables influencing the use of Artificial Intelligent In-home Voice Assistants
- Nick Babich (2019) Designing For The Future With Voice Prototypes
- Linda McNair (2019) How interaction design can help people better communicate with each other


Vilde Reichelt
Bakken & Bæck

Linguist and UX writer – it’s all semantics to me.