A.I.’s Word of Honor

There’s A Real Lack of Diversity in Automatic Speech-to-Text, And Here’s Why

Your Team at Rev
3 min read · Jun 30, 2020


AUSTIN, Texas — A transcript is an exercise in trust.

At first, it may seem like just a transactional relationship. Somebody says words and somebody (or something) else writes those words down.

But when a reader sits down with that text, there is a belief that whatever was said was captured as close to the letter as possible.

Think of it that way and that’s some serious trust.

Trust that the writer was doing their best, whether that writer is a human or an A.I.

We’re a speech-to-text company. We offer both human transcription and A.I. speech-to-text services. So, the belief in our integrity is absolutely integral to our future as a business. In that way, trust is a currency to us.

And while we love the trust we get from our customers (and think it’s well deserved, if we don’t mind saying so), we don’t think it’s ideal to simply accept it and not do more. We need to publicly note that there is still work ahead of us.

The truth is that speech-to-text, industry-wide, slants white and male.

As A.I. makes recommendations and decisions in our lives, slanted A.I. becomes a very important issue.

That makes a lack of diversity into an ethical problem, especially when it comes to inclusion and equality. And, as a matter of fact, we see it in facial recognition A.I. too.

An A.I. model is only as diverse as the data that goes into training it.

If the training data is not representative of a particular race or gender, the model will not do as well at recognizing their patterns. That’s the way it is.

Let’s zero back in on speech. Because the traditional sources of training data have been audiobooks, public talks, phone calls, YouTube videos, and so on — and that input skews white and male — if you’re not careful in your data selection, the resulting speech models will perform far better for that demographic. And for most of the industry, that’s what’s happened.

At Rev, we’re fortunate that our data is extensive and drawn from a wide variety of audio sources, so it suffers less from this data selection problem. In fact, our speech recognition accuracy for women is actually better than for men.

Still, systems are typically tuned on a representative subset of data from existing users. So, if 80% of your users are male and 20% are female, the test set often reflects that split. When you use that data set to tune your system, you can, without noticing, bias the system towards male speakers, simply because that is what moves the needle on your test set.
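Here’s a minimal sketch of what that looks like in practice. The error counts are made up for illustration (they are not Rev’s numbers), but they show how a single aggregate word error rate on an 80/20 test set can look healthy while one group of speakers is served much worse than the other:

```python
# Illustrative only: how an aggregate metric on an 80/20 test set
# can hide a gap between speaker groups. All numbers are invented.

test_set = {
    # group: (total reference words, total word errors)
    "male":   (80_000, 8_000),   # 80% of the test set, 10% WER
    "female": (20_000, 3_600),   # 20% of the test set, 18% WER
}

total_words = sum(words for words, _ in test_set.values())
total_errors = sum(errors for _, errors in test_set.values())

# The aggregate number (11.6%) sits close to the majority group,
# so tuning that improves male WER slightly while hurting female WER
# can still "move the needle" in the right direction.
print(f"Aggregate WER: {total_errors / total_words:.1%}")
for group, (words, errors) in test_set.items():
    print(f"  {group:>6} WER: {errors / words:.1%}")
```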

The thing about a test suite is that if your system deals with a specific use case, the test suite will represent that population. It most likely won’t contain anything for English speakers from, say… Siberia, because you might not be selling any devices or getting users from that part of the world.

There are two key components to training A.I. in the right way: one, a data selection process that produces diversity of data for both the training and test data sets, and two, reliable metadata so that we can both do this data selection and segment accuracy measurements along these dimensions.
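To make the second component concrete, here is a minimal sketch of segmenting accuracy with metadata. The field names and values are hypothetical (not Rev’s schema), but the idea is the same: if each utterance carries metadata like speaker gender or accent, you can report a word error rate per group instead of one blended number.

```python
# A sketch, assuming each test utterance carries metadata such as
# speaker gender and accent. Field names and values are hypothetical.
from collections import defaultdict

utterances = [
    # (metadata, reference word count, word errors) — illustrative values
    ({"gender": "female", "accent": "us"}, 120, 10),
    ({"gender": "male",   "accent": "us"}, 150, 12),
    ({"gender": "female", "accent": "in"}, 100, 16),
    ({"gender": "male",   "accent": "in"}, 130, 20),
]

def wer_by(dimension):
    """Word error rate segmented along one metadata dimension."""
    totals = defaultdict(lambda: [0, 0])  # value -> [words, errors]
    for meta, words, errors in utterances:
        totals[meta[dimension]][0] += words
        totals[meta[dimension]][1] += errors
    return {value: errors / words for value, (words, errors) in totals.items()}

print(wer_by("gender"))  # per-gender WER
print(wer_by("accent"))  # per-accent WER
```

The same metadata also supports the first component: it lets you sample training and test sets so that each group is actually represented, rather than hoping the raw data happens to be balanced.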

At Rev, this is the kind of thing we’re always thinking about. That’s partly because we know that trust is so vital in our industry. It’s also just the right thing to do.

For more information about us, head over to Rev.com/blog and read all the cool stuff we’re up to.
