Mkoutsog
Published in Crayon Data & AI
6 min read · May 30, 2024

Fairness in text-to-speech systems. Towards synthetic gender-ambiguous voices

Mimicking and representing human communication

With the advent of deep neural network technology, the quality of text-to-speech (TTS) systems has improved significantly, leading to more natural-sounding, expressive, human-like speech.

Traditional TTS systems were limited in voice quality, pronunciation, intonation, emphasis, and rhythm. Current open-source solutions such as Tacotron 2, WaveNet, and FastSpeech 2 overcome these barriers: the speech they generate has voice characteristics close to those of the recorded voice used to train the models. It would be safe to say that successful AI solutions are those that try to mimic human behaviour. But what happens if an AI system is asked to represent this human behaviour?

Representing the diversity of human characteristics and behaviour with an AI solution carries the risk of bias. Many of the datasets used to train AI models can spread and reinforce harmful stereotypes. Responsible AI standards have been developed to ensure that datasets and algorithms are unbiased and inclusive and do not discriminate against any particular group.

AI should be designed and trained in a way that reflects the diversity of the world we live in and promotes fairness and equality. To gain insight into the challenges that data scientists encounter when developing a fair AI system, let us examine a recent scenario that we encountered at Crayon.

Responsible AI on speech synthesis solutions

Understanding common stereotypes

Imagine a high-quality AI text-to-speech (TTS) system designed for a transportation system, e.g., train announcements. The TTS system is built using high-quality audio recordings of a single speaker, made in a professional studio. This speaker represents the voice brand of the transportation company. The trained AI model is then able to produce synthetic speech from text, with the proper prosodic characteristics, that sounds as natural as human speech.

The question is: does this system comply with the Responsible AI standards? Maybe not. Let’s see why.

To start with, the audio samples may not be appropriate for building an AI system that operates in noise. The recording was made in a quiet environment, with the speaker probably producing casual read speech (the speaker was instructed to read sentences aloud). Under noisy conditions, people produce another type of speech, called Lombard speech.

Casual speech is difficult to understand in noisy environments, especially for people with hearing impairments or cognitive disabilities, e.g., people who are hard of hearing or have dyslexia. Thus, the AI solution ignores the target environment and reinforces the stereotype that everyone has normal hearing. And there is more.

The voice brand may not represent the diversity of humanity. Why should a company choose between a male and a female voice for the brand? On which criteria will the company make this selection? In noisy conditions, female voices seem to have an intelligibility advantage over male voices, but customers may still complain about the gender that was selected or excluded. Giving the voice brand a specific gender also reinforces the stereotype that gender is binary, excluding people who identify as genderless.

So, what is the best solution for this specific use-case scenario? The best AI solution should:

1. be able to synthesise speech as naturally as possible with the proper voice quality.

2. account for noisy conditions. Thus, the acoustic characteristics of the synthesised speech should be robust to noise.

3. be gender-neutral, that is … well, we first need to define what gender-neutral means!

I will not focus on the first two requirements, mainly because there are many production-ready solutions today that perform high-quality speech synthesis (my favourite platforms are Azure, altered.ai, and others). Also, if you know a little signal processing, you will know how to apply the well-known “intelligibility filters” inspired by Lombard and clear speech, which change speech characteristics to make speech robust in noise.
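To give a feel for what such a filter does, here is a minimal sketch of one ingredient only. Lombard speech is known to carry relatively more energy at higher frequencies than casual speech, and a first-order pre-emphasis filter is a crude way to tilt the spectrum in that direction. The function names, the 0.95 coefficient, and the RMS matching step are my own illustrative assumptions, not a published intelligibility filter.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def pre_emphasis(samples, coeff=0.95):
    """First-order high-pass: y[n] = x[n] - coeff * x[n-1] (coeff is an assumed value)."""
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - coeff * prev)
    return out


def boost_high_frequencies(samples, coeff=0.95):
    """Pre-emphasise, then rescale so the output has the same RMS level as the input."""
    filtered = pre_emphasis(samples, coeff)
    level = rms(filtered)
    if level == 0.0:  # silent input: nothing to rescale
        return filtered
    gain = rms(samples) / level
    return [gain * s for s in filtered]


def tone(freq_hz, sample_rate=16000, duration_s=0.1):
    """Hypothetical helper: a unit-amplitude sine wave used as a test signal."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


# A low tone plus a high tone stands in for a speech frame.
mix = [a + b for a, b in zip(tone(200), tone(3000))]
louder_highs = boost_high_frequencies(mix)
```

The overall level stays the same, but the 200 Hz component is strongly attenuated relative to the 3000 Hz component. Real intelligibility filters are more elaborate than this; treat it as a toy illustration of the idea.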

Gender-ambiguous voices

Let me turn to the gender-ambiguous voice. While there is no clear definition of the gender-ambiguous voice in the literature, a commonly encountered description is “a voice that does not exclude the gender of the speaker”. But if you ask someone whether the voice they heard is male or female, I am pretty sure you will get a direct answer. This is because people tend to map a gender to the voice they hear, depending on their cognitive perception and life experience. Therefore, this description is not a usable definition of a gender-ambiguous voice. Our definition of a gender-ambiguous voice is:

  • “A gender-ambiguous voice is a voice that a single listener, when asked, cannot classify with certainty as male or female.”
  • “A gender-ambiguous voice is a voice that has an equal probability of being classified as male or female across many listeners.”
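The second definition can be turned into a number. As a rough sketch (my own illustrative choice, not necessarily the metric used in our paper), the binary entropy of the listeners' answers is 1.0 when the votes split evenly and 0.0 when everyone agrees. The function names and the "male"/"female" label strings are assumptions.

```python
import math


def male_share(labels):
    """Fraction of listeners who labelled the voice as "male"."""
    return sum(1 for label in labels if label == "male") / len(labels)


def ambiguity_score(labels):
    """Binary entropy in bits: 1.0 for a 50/50 split, 0.0 for full agreement."""
    p = male_share(labels)
    if p in (0.0, 1.0):  # log2(0) is undefined; full agreement means no ambiguity
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Hypothetical listening-test answers for two voices.
even_split = ["male"] * 5 + ["female"] * 5
clear_case = ["male"] * 9 + ["female"] * 1

print(ambiguity_score(even_split))                                 # 1.0
print(ambiguity_score(clear_case) < ambiguity_score(even_split))   # True
```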

The key here is certainty. If you ask someone to tell you whether a voice is male or female, you will get a binary answer. But if you ask “Are you sure? How sure are you? How surprised would you be if I told you that you are actually wrong?”, then you might get “Well, now that you mention it, I am not so sure.”
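One simple way to capture this in a listening test is to record a confidence rating next to each binary answer. The sketch below assumes a hypothetical 0.0 to 1.0 confidence scale and an arbitrary 0.6 threshold; both are illustrative choices, not values from our study.

```python
def uncertain_share(responses, threshold=0.6):
    """Fraction of listeners whose self-reported confidence is below the threshold.

    responses: list of (label, confidence) pairs, with confidence in 0.0..1.0.
    The threshold is an assumed cut-off for "not so sure".
    """
    if not responses:
        raise ValueError("at least one response is required")
    unsure = sum(1 for _label, confidence in responses if confidence < threshold)
    return unsure / len(responses)


# Hypothetical answers: every listener picked a gender, but few were confident.
responses = [("female", 0.9), ("male", 0.4), ("female", 0.5), ("male", 0.3)]
print(uncertain_share(responses))  # 0.75
```

Every listener gave a binary answer, yet three out of four were unsure about it, which is exactly the information a plain male/female question hides.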

How to synthesise a gender-neutral voice

To synthesise a gender-neutral voice we need to understand what makes speech sound gender-neutral. And there are many questions to be answered before we can conclude.

What distinguishes individuals in their speech? What are the acoustic differences between male and female voices? [1, 2]. Is there any research on gender neutrality? Which people consider themselves non-binary? Do non-binary populations produce speech differently from binary populations? [3, 4, 5, 6, 7, 8, 9, 10]. Are there any synthetic gender-ambiguous voices, and if so, how are they generated? [11, 12]. How can we synthetically create a gender-ambiguous voice?

You can dig into the literature and find interesting work that answers the above questions. I have added some references to help you. To keep it short, I will share my conclusion about what makes a voice genderless.

Pitch and timbre are the acoustic speech features that possibly contribute to gender uncertainty in gender-neutral voices. But I believe another characteristic is the main contributor to the gender ambiguity of a voice: the speaking style. Speaking style is a communication tool that has always been there: in the way we talk to our children (infant-directed speech), to our hard-of-hearing mother (shouted speech), to a friend in a restaurant (Lombard speech), or when we give instructions to a tourist who is not a native speaker (clear speech). Although it is always present in speech, it is still sidelined in research on gender-ambiguous speech synthesis. To successfully perform gender-ambiguous transformations, we need to combine acoustic transformations with speaking style.

At Crayon, we were able to create a gender-ambiguous voice to represent people throughout the gender spectrum (binary and non-binary). Our solution is based not only on acoustic transformations but also on our knowledge of speech and of the importance of speaking style in all aspects of speech communication.

Considering the differences between male and female speaking styles, we have successfully created a gender-ambiguous voice agent by morphing synthetic female speech with a male voice. The morphed outcome keeps all the characteristics of the female speaking style while acoustically having the timbre of a male voice, with an ambiguous pitch close to the male-female boundary. This combination of female speaking style and male acoustic features leads to a gender-ambiguous voice.
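For the pitch part only, a small worked calculation shows what "close to the male-female boundary" could mean in practice. This is a sketch under assumptions: the mean F0 values of 120 Hz and 210 Hz are illustrative, and placing the boundary at the geometric midpoint (equally far from both in semitones) is my own choice for the example. It does not cover timbre or the morphing itself.

```python
import math


def semitone_shift(source_f0_hz, target_f0_hz):
    """Pitch shift, in semitones, that moves source_f0_hz to target_f0_hz."""
    return 12.0 * math.log2(target_f0_hz / source_f0_hz)


def midpoint_f0(male_f0_hz, female_f0_hz):
    """Geometric mean of two F0 values: the same number of semitones from each."""
    return math.sqrt(male_f0_hz * female_f0_hz)


# Assumed, illustrative mean F0 values; measure your own speakers in practice.
male_f0, female_f0 = 120.0, 210.0
boundary = midpoint_f0(male_f0, female_f0)

# How far the synthetic female voice would need to be lowered (a negative value).
shift = semitone_shift(female_f0, boundary)
print(round(boundary, 1), round(shift, 2))
```

The resulting shift is the kind of value you would pass to a pitch-shifting tool of your choice; the rest of the recipe (keeping the female speaking style, taking the male timbre) is described in the paper.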

You now have a definition of the gender-ambiguous voice, which will help you evaluate your transformed voices; an evaluation framework that follows the responsible AI guidelines; and the recipe for creating a gender-ambiguous voice. All you have to do is sit back with your coffee and read our scientific article: https://arxiv.org/pdf/2403.07661. You have a variety of open-source Python solutions at your disposal, but for production I recommend commercial tools for high-quality text-to-speech and voice morphing, e.g., Microsoft Azure or altered.ai. Good luck!
