Closed (Minded) Captioning

Karen Pan
Inclusify by Design
Mar 14, 2021
[Image: closeup of a microphone with a computer screen in the background. Photo by Magda Ehlers from Pexels]

Historically, speech recognition technology registered only male voices, leaving out female voices (and, consequently, higher-pitched male voices). The flaws in the machine learning algorithms behind speech recognition are no mystery to anyone. I run into them every day when I rewatch my online lectures for school: the video plays, and the autogenerated transcript and captions are so inaccurate they become useless. But it doesn't stop there. The problem also comes up when I watch YouTube videos.

When I'm multitasking (e.g., watching a video while making food), I'll turn on the captions because there's a lot of background noise. For the most part they're accurate, but if the people in the video have an accent, the accuracy tends to drop. Captions are of no use to me if they aren't correct, and I can only imagine how frustrating it must be for people who rely solely on closed captioning. So the question is: how can we make videos more inclusive when the supposed solution, artificial intelligence (AI), is itself exclusive? I will use YouTube's speech recognition technology and closed captioning policies to help answer this question.

A Crashed Crash Course on Machine Learning

Simply put, it’s exactly what it sounds like: learning for machines. Technically speaking, machine learning is a subset of AI based on the idea that computers can learn from data, recognize patterns, and make decisions with minimal human intervention. Two of the most popular learning methods are supervised learning and unsupervised learning.

  • Supervised Learning: an algorithm that uses labeled data and examples to learn which outputs go with which inputs. In other words, the system is trained to produce the correct response/output for a given situation.

Example: Input data could be labeled as “forged” and “authentic,” and the computer would learn the difference between the two.

  • Unsupervised Learning: an algorithm that isn’t provided the correct outputs. Under this method, the system has to figure out the “correct answer” on its own by finding patterns in the data.

Example: Targeted ads are freakishly accurate because patterns in how users interact with websites are identified, analyzed, and fed back into the system.

To put it in layman's terms, supervised versus unsupervised learning is like swimming with floaties versus being thrown into the deep end. The sketch below shows both side by side.
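To make the distinction concrete, here is a minimal sketch in Python using scikit-learn. The data is invented for illustration; the point is only that the supervised classifier is handed the answers, while the clustering algorithm has to find structure on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Toy data: two measurements per banknote (invented for illustration).
X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.2], [0.8, 0.1]])

# Supervised ("floaties"): the algorithm gets the answers with the data.
y = ["authentic", "authentic", "forged", "forged"]
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[0.85, 0.15]]))  # likely ['forged']

# Unsupervised ("the deep end"): no labels; the algorithm has to
# discover the groups on its own.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)  # e.g. [0 0 1 1]: anonymous cluster ids, not named labels
```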

The Best of the Worst

As I mentioned earlier, sometimes I watch YouTube videos with the captions on. The following screenshots are two examples that epitomize the problem I called out at the beginning of this post.

This video is a snippet of a bit from Trevor Noah’s standup at the Apollo. Trevor is a South African comedian whose voice does carry an accent, but not a heavy one. He’s talking about human behavior — more specifically Americans’ behavior — toward traffic lights. In this part, he changes his voice to make fun of Americans, and the automated captions transcribe this:

[Image: captions from a YouTube video with Trevor Noah in the background. Screenshot from "You Obey Traffic Lights?! Trevor Noah | Live at the Apollo | BBC Comedy Greats" on YouTube]

It is supposed to say: “‘What are you doing?’ ‘I’m walking.’ ‘Oh the man is red, can’t go, the man is red.’” The ‘man’ he’s referring to is the one people see at the crosswalk.

The second video is a bit from Jack Whitehall’s standup at the Apollo. Jack is a British comedian whose voice does carry a heavy accent. He’s talking about the difference between groups of girls and groups of guys at the airport when they go on vacation. This is what the closed captioning picks up:

[Image: captions from a YouTube video with Jack Whitehall in the background. Screenshot from "Jack Whitehall on Boys' and Girls' Holidays | Live at the Apollo | BBC Comedy Greats" on YouTube]

This is supposed to say: “‘You see the girls first with their little wheelies, with their little, nice neatly ironed pink hoodies…’” By ‘wheelies’ he means their carry-on luggage.

Like I said, these are quintessential examples of machine learning's inability to learn a voice that doesn't belong to a white American male, whose accent isn't even considered an 'accent.' It's like me cramming for a test and spewing out nonsense on the day of in hopes that it'll pass as "good enough."

How YouTube Handles Closed Captioning

YouTube offers automatic captioning, as shown above. Clearly, there's work to be done, and YouTube acknowledges as much in a note on the YouTube Help page, part of which states: "YouTube is constantly improving its speech recognition technology. However, automatic captions might misrepresent the spoken content due to mispronunciations, accents, dialects, or background noise. You should always review automatic captions and edit any parts that haven't been properly transcribed." It's true that mispronunciations, accents, dialects, and background noise can cause speech recognition to output the wrong text. However, users shouldn't have to go back and proofread the automatic captions. Correcting captions does feed the learning process, but the nuances of speech should be built into the technology before it ships. The whole idea behind machine learning is to minimize human intervention, ideally to the point where people don't need to intervene at all.

There is also an option for content creators to add their own captions, written in the language of their choosing. The drawback is that this option is time consuming: whether the captions are uploaded as a file, transcribed live as the video plays, or typed manually, the brunt of the work falls on the creator. It would be so much easier to let the computer do the work.
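For what it's worth, the caption file itself is simple. Here's a rough sketch of generating a SubRip (.srt) file, one of the formats YouTube accepts for caption uploads; the timestamps and lines below are made up for illustration.

```python
# Minimal sketch: writing captions in the SubRip (.srt) format.
# Each cue is a sequence number, a start/end timestamp, and the text.
captions = [
    ("00:00:01,000", "00:00:03,500", "Hello, and welcome back."),
    ("00:00:03,500", "00:00:07,000", "Today we're talking about captions."),
]

with open("captions.srt", "w", encoding="utf-8") as f:
    for i, (start, end, text) in enumerate(captions, start=1):
        f.write(f"{i}\n{start} --> {end}\n{text}\n\n")
```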

The last option is to turn on "Community Contributions," which is basically a group effort to provide information about, and closed captions for, videos. Viewers can add titles, descriptions, subtitles, and closed captions, and they can submit translations of existing captions. This is by far the most inclusive option. Unfortunately, as of September 28, 2020, YouTube has discontinued it, citing low usage and problems with spam and abuse.

So…What Now?

My two cents: YouTube should bring back community captions. It was their most inclusive closed captioning option, allowing anyone to contribute to the accessibility of videos. Best of all, international users could translate videos into any language of their choosing, expanding a video's reach to a far wider audience. The spam and abuse problems may well explain why usage was low. Ultimately, the efficacy of the tool rests with video owners, since they have to approve the community's input, and it's arguably easier to reject a spammy submission than to fix autogenerated errors by hand. That's no reason to give up on the tool. Machine learning could filter out inappropriate and offensive captions before they're ever submitted. Then again, that would treat a symptom, not the core problem.
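As a sketch of what that filtering could look like (this is not YouTube's actual system; the training examples are invented, and a real filter would be far more sophisticated), a simple supervised text classifier could pre-screen community submissions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training examples: past submissions labeled by moderators.
submissions = [
    "What are you doing? I'm walking.",
    "You see the girls first with their little wheelies",
    "BUY CHEAP SUBSCRIBERS NOW!!! click the link",
    "free giveaway!!! visit my channel",
]
labels = ["ok", "ok", "spam", "spam"]

# Supervised learning: flag likely spam before the owner ever sees it.
flagger = make_pipeline(TfidfVectorizer(), LogisticRegression())
flagger.fit(submissions, labels)
print(flagger.predict(["click here for a free giveaway"]))  # likely ['spam']
```

Flagged submissions could be held for review rather than rejected outright, which keeps the moderation burden low without silencing legitimate contributors.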

The core issue is that machine learning in speech recognition lacks diversity and inclusivity. It hasn’t yet learned how to pick up different accents and dialects. Machine learning aims to teach computers to think and make decisions like humans. Yet computers lack the nuances that make humans human — and that’s a difficult thing to teach. Involving a diverse group of people in the process of teaching speech recognition technology will take accessibility to new heights. If the people behind the tech value inclusivity, then the product will as well. Feeding a system different accents, dialects, and slang will eventually teach it to recognize them. A combination of supervised and unsupervised learning could help accomplish this. It’s like giving the system floaties, then throwing it into the deep end without them.
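As a very rough sketch of that combination, here is what the idea might look like in code. Everything here is hypothetical: the embed function stands in for a real acoustic feature extractor, and the clip names and accent labels are invented.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def embed(audio_clip):
    # Hypothetical stand-in for a real acoustic embedding
    # (in practice: MFCCs or a neural network's feature vector).
    return rng.normal(size=16)

# "Floaties": supervised training on speech labeled by accent,
# so the model hears more than one kind of voice from the start.
labeled_clips = [f"clip_{i}.wav" for i in range(40)]  # placeholder names
accents = ["south_african", "british", "american", "indian"] * 10
X = np.array([embed(c) for c in labeled_clips])
clf = KNeighborsClassifier().fit(X, accents)

# "The deep end": unsupervised clustering of unlabeled voices, so
# accents nobody thought to label can surface as their own groups
# instead of being forced into the categories the model already knows.
unlabeled = np.array([embed(f"new_{i}.wav") for i in range(20)])
groups = KMeans(n_clusters=4, n_init=10).fit_predict(unlabeled)
print(groups)
```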

Last Thoughts

Ironic, isn't it? Technology is both the cause of and the solution to the problem. Technology is supposed to help people connect with each other, to bring us together and help us accomplish the unimaginable. In this case, 'the unimaginable' means creating empathetic AI that can think and act on its own. That can't happen if it's developed on a foundation of exclusivity, no matter how inclusive its intentions are. Machine learning is an innovative form of AI with a lot of potential. If we want to do this right, we need to embed inclusivity in its code. If we can do that, the possibilities are endless.

Further Reading

If you want to learn more about the machine learning mechanics behind speech recognition, read this blog post. It’s thorough, understandable, and nerdy in the best way possible!

The removal of Community Contributions that I mentioned earlier caused an uproar within the YouTube community, so much so that the hashtag #DontRemoveYoutubeCCs took off. Read more about people's concerns in this BBC News article.

Sources

https://support.google.com/youtube/answer/6373554?hl=en

https://support.google.com/youtube/answer/6052538?hl=en-GB

https://support.google.com/youtube/answer/2734796

https://www.bbc.com/news/newsbeat-54074573

https://youtu.be/mU35XlTkLnA

https://youtu.be/OzSlYUBcph4

https://www.sas.com/en_us/insights/analytics/machine-learning.html
