Is public media ready for machine transcription?

Pop Up Archive
Feb 26, 2015 · 6 min read

(A Socratic dialogue)

A conversation between Peter Karman of Pop Up Archive & Andy Kruse of American Public Media on the trade-offs between access and accuracy when it comes to machine transcripts for media.

Preface

For the last year I have immersed myself in the technology of automatic speech recognition (ASR) for transcribing public radio broadcasts. I started with a grant from the Knight Foundation while at American Public Media|Minnesota Public Radio and have continued on at Pop Up Archive.


The conversation

Andy Kruse (American Public Media):

I feel like the [machine-generated] transcripts I skimmed today from Marketplace Tech and The Splendid Table in Audiosear.ch were better than the ones I had seen in the past. More accurate, less embarrassing, and almost like you could read them without the audio and get the gist of what was being said.

A machine-made transcript for The Splendid Table on Audiosear.ch

Peter Karman (Pop Up Archive):

We're not cherry-picking. The technology continues to improve, and your expectations might be changing, too. That’s the thing about machine-generated anything: it is difficult to measure qualitative improvement over time from purely anecdotal evidence. Transcripts can “feel” better or more accurate for lots of little reasons, even if the word error rate (WER, lingo for measurable accuracy) is the same. A WER of 10% is pretty good (roughly 90% accurate), but it all depends on *which* words are accurate, right? If all the proper nouns are wrong, that'll feel different than if “is” and “his” get confused.
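To make the WER point concrete, here is a minimal sketch in Python of how the metric is commonly computed: the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and the machine output, divided by the number of reference words. The function name `word_error_rate` and the sample sentences are made up for illustration; this is not Pop Up Archive's scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented ten-word example sentences (assumptions, not real transcripts).
reference = "peter says his soup is ready in about ten minutes"
small_word_wrong = "peter says is soup is ready in about ten minutes"
name_wrong = "cedar says his soup is ready in about ten minutes"

print(word_error_rate(reference, small_word_wrong))  # 0.1
print(word_error_rate(reference, name_wrong))        # 0.1
```

Both machine outputs score the same 10% WER, yet the one that mangles the name reads as the worse transcript. That is the gap between measured accuracy and how a transcript "feels."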


Andy:

I wish I had a sense of the speed at which this technology gets better. Are we, as people of Earth, just a couple years away from this being usable, or is it more like a decade? What makes it better: Is it incremental human tweaks to existing systems to fine-tune and weed out the junk? Or is it more like some other guy has to develop something from scratch that is fundamentally better and replaces everything we’ve done up until now?

Peter:

You've just summarized in a beautiful way the ambivalence that so many public media people have expressed about what Pop Up Archive is doing. You are very much not alone in your inner conflict.

Here’s my take:

  • No transcript is 100% accurate. Not even the human-generated ones. That’s because accuracy is not the same as perfection. The stutters and false starts and ums and ahs get edited away in a human-generated transcript. Machines don’t do that editing, but neither do they have the ability (yet) to always distinguish ambiguous sounds and glean accurate wording based on context or accents, etc. So there are trade-offs, and a combination is usually the best bet. Lots of our customers edit/clean up the machine-generated transcripts before they publish them.
  • I don’t believe humans want to read entire radio transcripts. They don’t want to follow along while they listen. I’ll bet your web analytics would bear that out. Do pages with transcriptions, alone, get any traffic? I seem to recall Marketplace dropping their human-generated transcription efforts because the results saw so little traffic.
  • Listening to audio is primarily an activity for the visually occupied. You listen while you do something else: drive, wash dishes, cook, type, surf the web. Maybe people want to read the transcript while they listen, but I don’t think so.
  • So know your audience: transcripts are really good for *machines* to read. Google loves transcripts. Transcripts help search engines interpret what the audio contains. That helps people find audio via search engines, so that they can click play. Maybe they’ll read along, but probably not. Transcripts are a discovery tool, and maybe a complement to listening.
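As a rough illustration of transcripts as a discovery tool, the sketch below builds a tiny word index over timestamped transcript segments, so a text search can point a listener to the moment to press play. The segment format, the sample text, and the `build_index` helper are all hypothetical; this is not how Audiosear.ch or any search engine is actually implemented.

```python
from collections import defaultdict

# Hypothetical transcript segments: (start time in seconds, text).
segments = [
    (0.0, "Welcome to the show"),
    (12.5, "Today we talk about personal finance"),
    (47.0, "Finance apps are changing how people save"),
]


def build_index(segments):
    """Map each lowercased word to the start times of segments containing it."""
    index = defaultdict(list)
    for start, text in segments:
        # Use a set so a word repeated within one segment is listed once.
        words = {word.strip(".,!?").lower() for word in text.split()}
        for word in words:
            index[word].append(start)
    return index


index = build_index(segments)
print(index["finance"])  # [12.5, 47.0]
```

Even an imperfect transcript can support this kind of lookup, as long as the words people search for are mostly recognized correctly.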

Andy:

I've noticed as I read along with this audio that your most impressive transcripts happen when the host is reading from a prepared script. Presumably they speak more clearly, they enunciate, and the ends of sentences are easier to spot (they drop their voices, they pause, whatever it is that robots can listen for). The transcript trouble is generally found in natural conversation and extemporaneous speaking, when you get those stutters and meandering sentences that don't finish and people are cutting each other off. A lot of our material is unrehearsed, so some level of inscrutability is justified.

To publish a transcript [for an audio story] — even a perfect one — in print, would be to assume that what works in one medium works exactly as well in another.

It doesn't, so we don’t. I think the value of transcription is as a starting point for an editor. It’s a quicker way to skim a conversation for the parts that are worth writing about. In this way, we're essentially blogging about ourselves, pulling quotes and adding context.

Transcript matches when searching “finance” in Audiosear.ch

Peter:

The red pill? Or the blue one?

