Breaking Language and Accent Barriers in Voice Technology

Fluent.ai Inc.
Apr 22, 2021 · 4 min read


How Can Voice Technology Innovation Help Address Bias in AI?

In 1998, the New Radicals sang the lyric “You only get what you give”, and while they most probably were not referring to issues of language and accent recognition in voice technology, they hit the cause right on the nose. When building a voice recognition solution, you only get a system as good and well-performing as the data you train it on. From accent rejection to potential racial bias, training data can not only have a huge impact on how the AI behaves, it can also alienate entire groups of people.

In a 2018 research study conducted in collaboration with the Washington Post, findings from 20 cities across the US alone showed that big-name smart speakers had a harder time understanding certain accents. For example, the study found that Google Home was 3% less likely to give an accurate response to people with Southern accents than to those with Western accents. With Alexa, people with Midwestern accents were 2% less likely to be understood than people from the East Coast. Several years after this research study, the issue still persists. Prompted by a survey out of the Life Science Centre in Newcastle, which found that 79% of respondents reported having to suppress their regional accents in order to use voice assistants, the BBC launched its own voice assistant in 2020 specifically geared towards UK regional accents.

These accent biases extend well beyond US borders. When applied to foreign accents and languages, conventional voice-recognition technology is inconsistent at its most basic task: understanding the speaker. The same Washington Post study found that Chinese- and Spanish-accented English were particularly challenging for both Google Home and Amazon Echo. Despite new offerings for accented English on these devices, many users around the world are left wondering: Why can’t my technology understand me?

These issues point to one of the biggest and most enduring biases in voice technology. Beyond geographic and regional considerations, the problem of accent non-recognition extends to race. A recent study tested the ability of five automatic speech recognition systems from Amazon, Apple, Google, IBM, and Microsoft to accurately understand structured interviews conducted with 42 white speakers and 73 Black speakers. On average, all five systems showed nearly twice the error rate for Black speakers as for white speakers.

The issue with the systems used by Google and Amazon lies in their cloud-based transcription process and its reliance on massive amounts of consumer voice data. The two-step approach of the big players in online, cloud-based voice recognition works by first converting the spoken command to text in the cloud, then applying Natural Language Processing before completing the action. While this approach has the advantages of an unlimited vocabulary and internet access, the quality and quantity of data required to train these solutions mean that entire groups can end up marginalized and misunderstood by their devices when that data does not reflect their linguistic realities.
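To make that two-step flow concrete, here is a minimal, purely illustrative sketch of the conventional cloud pipeline described above. The function names (cloud_transcribe, parse_intent, execute_action) and the dummy return values are hypothetical placeholders, not any vendor’s actual API.

```python
# Illustrative sketch of the conventional two-step, cloud-based pipeline.
# All function names and values here are hypothetical stand-ins.

def cloud_transcribe(audio: bytes) -> str:
    # Step 1: raw audio is shipped to a cloud speech-to-text service.
    # A transcription error here (e.g. on an unfamiliar accent) propagates
    # to every later step.
    return "turn on the living room lights"  # dummy transcript

def parse_intent(transcript: str) -> dict:
    # Step 2: Natural Language Processing is applied to the transcript
    # to extract an intent and its parameters.
    return {"intent": "lights_on", "room": "living room"}  # dummy intent

def execute_action(intent: dict) -> None:
    # Only after both cloud round trips does the device actually act.
    print(f"executing: {intent}")

def handle_command(audio: bytes) -> None:
    transcript = cloud_transcribe(audio)  # network round trip to the cloud
    intent = parse_intent(transcript)     # NLP over the (possibly wrong) text
    execute_action(intent)

handle_command(b"\x00\x01")  # placeholder audio bytes
```

Because every later step inherits whatever the transcription step produced, this is exactly where accent bias enters the pipeline.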

Conventional Speech to Text approach to voice AI: a more cumbersome solution with higher latency and larger minimum storage requirements.
Fluent.ai’s patented Speech to Intent solution is fully embedded on device, meaning greater privacy, lower latency, and smaller storage footprints for end-users.

“This is not a natural way of learning language and speech,” says Fluent.ai founder and CTO Vikrant Singh Tomar, explaining that children, for example, do not learn to write before they learn to speak. By removing the dependency on cloud-based speech transcription, models can be trained more easily to support accents and languages in smaller packages than ever before. Offline voice interfaces use a targeted vocabulary that is well suited to low-powered consumer devices that do not need to connect to the internet. Not only does this protect user voice data from potential security risks in the cloud, it also reduces response latency and makes the solution lighter in terms of storage.

Communicating with your devices should feel as seamless as communicating with another human being. Using Fluent.ai’s patented Speech to Intent solutions, devices are able to recognize and understand the acoustics of what is being asked, as opposed to relying on a transcription. This approach allows for smoother and safer interfacing between you and your devices.
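For intuition, the sketch below shows the general idea behind acoustics-to-intent classification: a single small model maps audio features straight to an intent label with no intermediate transcript. The tiny recurrent model, layer sizes, and intent set are illustrative assumptions only, not Fluent.ai’s proprietary architecture.

```python
# A generic, minimal acoustics-to-intent sketch (assumed architecture,
# NOT Fluent.ai's actual model): audio features in, intent label out.
import torch
import torch.nn as nn

INTENTS = ["lights_on", "lights_off", "volume_up", "volume_down"]  # example set

class SpeechToIntent(nn.Module):
    def __init__(self, n_mels: int = 40, hidden: int = 64, n_intents: int = len(INTENTS)):
        super().__init__()
        # A small recurrent encoder over acoustic frames (e.g. log-mel features),
        # sized to fit comfortably on a low-powered embedded device.
        self.encoder = nn.GRU(input_size=n_mels, hidden_size=hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_intents)

    def forward(self, mel_frames: torch.Tensor) -> torch.Tensor:
        # mel_frames: (batch, time, n_mels); no text is ever produced.
        _, last_hidden = self.encoder(mel_frames)
        return self.classifier(last_hidden[-1])  # intent logits

# Usage: one second of dummy 40-dimensional features at 100 frames/second.
model = SpeechToIntent()
logits = model(torch.randn(1, 100, 40))
print(INTENTS[logits.argmax(dim=-1).item()])
```

Because the model never passes through a text bottleneck, it can be trained directly on how a phrase actually sounds across accents and languages, and it is small enough to run entirely on the device.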

Another way to combat bias against natural speech, such as differences in language and accent, is to ensure you have “good” and “clean” data to train solutions. Ideally, the data used to train a voice solution looks like the data the solution will encounter in real-world scenarios. This means training solutions with data drawn from multiple sources that accurately represents the entire demographic of consumers who will use the device. Beyond that, selecting and “cleaning” data for training helps avoid teaching AI inappropriate and potentially offensive behaviours like misogyny or racism.
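As a rough illustration of the kind of check this implies, the sketch below compares a training set’s accent mix against an assumed deployment demographic and flags under-represented groups. The field names, accent labels, and target shares are hypothetical examples, not a description of any particular dataset.

```python
# Hedged sketch: flag accent groups that are under-represented in training
# data relative to an assumed real-world user distribution.
from collections import Counter

def accent_coverage_gap(training_samples, target_share, tolerance=0.05):
    """Return accent groups whose share of the training data falls more than
    `tolerance` below their expected share among real-world users."""
    counts = Counter(sample["accent"] for sample in training_samples)
    total = sum(counts.values())
    gaps = {}
    for accent, expected in target_share.items():
        actual = counts.get(accent, 0) / total if total else 0.0
        if expected - actual > tolerance:
            gaps[accent] = {"expected": expected, "actual": round(actual, 3)}
    return gaps

# Example: a toy training set that heavily over-represents one accent group.
train = ([{"accent": "us_west"}] * 800
         + [{"accent": "us_south"}] * 150
         + [{"accent": "spanish_accented"}] * 50)
target = {"us_west": 0.4, "us_south": 0.3, "spanish_accented": 0.3}
print(accent_coverage_gap(train, target))
# flags 'us_south' and 'spanish_accented' as under-represented
```

A check like this only surfaces the gap; closing it still means collecting or weighting data so the under-represented groups are actually heard by the model.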

As technology companies become increasingly aware of the biases that can inadvertently be built into their AI-enabled devices, more techniques to reduce them will emerge. The ultimate goal of voice-enabled interfaces is to allow users to have a natural conversation with their devices with privacy and efficiency in mind. At Fluent.ai, our patented approach enables offline devices to interact naturally with end users of any accent or language background, allowing everyone to be understood by their technology. With faster, more accurate speech understanding that supports any language and accent, Fluent.ai’s goal is to finally break the barriers to the global adoption of voice user interfaces.
