This post was written by Pop Up Archive’s Principal Engineer, Peter Karman (@peterkarman).
For the last year I have immersed myself in the technology of automatic speech recognition (ASR) for transcribing public radio broadcasts. I started with a grant from the Knight Foundation while at American Public Media|Minnesota Public Radio and have continued on at Pop Up Archive.
During this year I have talked with many people in the public media industry about ASR, about its maturity (or lack thereof) and about the role it should play for public media producers and audiences. Among those people is Andy Kruse, digital producer for APM’s arts and cultural programming, which includes The Splendid Table, The Dinner Party Download and the Infinite Guest podcast network.
Andy and I recently traded email about the state of the art in ASR, what it means for public media, and what we’re doing with a new project, Audiosear.ch. What follows is an edited version of our email conversation.
Andy Kruse (American Public Media):
I feel like the [machine-generated] transcripts I skimmed today from Marketplace Tech and The Splendid Table in Audiosear.ch were better than the ones I had seen in the past. More accurate, less embarrassing, and almost like you could read them without the audio and get the gist of what was being said.
I don't know if the ones today seemed better because the technology has improved in the past six months, or these were cherry-picked to be featured, or if nothing is different and I was just in a better mood.
Peter Karman (Pop Up Archive):
We're not cherry-picking. The technology continues to improve, and your expectations might be changing, too. That’s the thing about machine-generated anything: it is difficult to measure qualitative improvement over time from purely anecdotal evidence. Transcripts can “feel” better or more accurate for lots of little reasons, even if the word error rate (WER), lingo for measurable accuracy, is the same. A WER of 10% is pretty good (roughly 90% accurate), but it all depends on *which* words are accurate, right? If all the proper nouns are wrong, that'll feel different than if “is” and “his” get confused.
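To make the WER figure concrete, here is a minimal Python sketch of how the metric is commonly computed: count the word substitutions, deletions, and insertions needed to turn the machine transcript into the reference transcript, then divide by the number of reference words. This is an illustration only, not Pop Up Archive's scoring code; the function name `word_error_rate` and the sample sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up ten-word reference sentence and two machine transcripts,
# each with exactly one wrong word.
reference = "andy says the soup is his favorite dish this week"
wrong_name = "candy says the soup is his favorite dish this week"
wrong_function_word = "andy says the soup his his favorite dish this week"

print(word_error_rate(reference, wrong_name))           # 0.1
print(word_error_rate(reference, wrong_function_word))  # 0.1
```

Both transcripts score the same 10% WER, yet the one that garbles the name reads as the worse transcript. That is the gap between measured accuracy and how a transcript feels.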
Andy Kruse:
I wish I had a sense of the speed at which this technology gets better. Are we, as people of Earth, just a couple years away from this being usable, or is it more like a decade? What makes it better: Is it incremental human tweaks to existing systems to fine-tune and weed out the junk? Or is it more like some other guy has to develop something from scratch that is fundamentally better and replaces everything we’ve done up until now?
Or am I focusing on the wrong thing? Is there enough benefit in the tags and just extracting major concepts that you think we're idiots for not already jumping on this? I’m really interested in the final product being something that can be consumed in a different medium (text only instead of still needing the audio), but maybe I want too much.
There is much conflict in my brain about how to feel.
Peter Karman:
You've just summarized in a beautiful way the ambivalence that so many public media people have expressed about what Pop Up Archive is doing. You are very much not alone in your inner conflict.
Here’s my take:
- No transcript is 100% accurate. Not even the human-generated ones. That’s because accuracy is not the same as perfection. The stutters and false starts and ums and ahs get edited away in a human-generated transcript. Machines don’t do that editing, but neither do they have the ability (yet) to always distinguish ambiguous sounds and glean accurate wording based on context or accents, etc. So there are trade-offs, and a combination is usually the best bet. Lots of our customers edit/clean up the machine-generated transcripts before they publish them.
- I don’t believe humans want to read entire radio transcripts. They don’t want to follow along while they listen. I’ll bet your web analytics would bear that out. Do pages with transcriptions, alone, get any traffic? I seem to recall Marketplace dropping their human-generated transcription efforts because the results saw so little traffic.
- Listening to audio is primarily an activity for the visually occupied. You listen while you do something else: drive, wash dishes, cook, type, surf the web. Maybe people want to read the transcript while they listen, but I don’t think so.
- So know your audience: transcripts are really good for *machines* to read. Google loves transcripts. Transcripts help search engines interpret what the audio contains. That helps people find audio via search engines, so that they can click play. Maybe they’ll read along, but probably not. Transcripts are a discovery tool, and maybe a complement to listening.
So are you focusing on the wrong thing? I tend to think so. I don't know for sure because I don't yet have the data/evidence to support the assertions I’m making. One of the reasons we’re doing Audiosear.ch is to start gathering that data.
Andy Kruse:
I've noticed as I read along with this audio that your most impressive transcripts happen when the host is reading from a prepared script. Presumably they speak more clearly, they enunciate, and the ends of sentences are easier to spot (they drop their voices, they pause, whatever it is that robots can listen for). The transcript trouble is generally found in natural conversation and extemporaneous speaking, when you get those stutters and meandering sentences that don't finish and people are cutting each other off. A lot of our material is unrehearsed, so some level of inscrutability is justified.
To publish a transcript [for an audio story] — even a perfect one — in print, would be to assume that what works in one medium works exactly as well in another.
It doesn't, so we don’t. I think the value of transcription is as a starting point for an editor. It’s a quicker way to skim a conversation for the parts that are worth writing about. In this way, we're essentially blogging about ourselves, pulling quotes and adding context.
But we don't have the human resources to do this for all of our hours of audio. So what’s the best thing we can leave for the world when we publish a new podcast? Is it just that audio player on a page all by itself, or is it that audio player with an imperfect transcript? Isn’t the latter a little more generous? If we can get over the occasional embarrassment, maybe some random person would make something of our raw material. That’s where I see the potential.
So I am a little bit interested in publishing the full-text robotic transcript of a podcast: a little bit for the sake of search, and a little bit more for the sake of making the entirety of the world’s information accessible.
If I thought we could start working with someone now and the accuracy would be five nines (99.999%) by the time we turned it on, that would be reassuring. But people have thought this technology is close for years now. The future always seems annoyingly just out of reach.
Just kidding. There is no audio in the future. We'll just swallow pills.
Peter Karman:
The red pill? Or the blue one?
As an audience member, I would find an imperfect transcript useful if it helped me to more quickly find a particular part of a story to listen to, or to help me locate the story in the first place. So I like framing the debate as useful-vs-not rather than perfect-vs-not.
So for me it comes down to: do we trust the audience enough to recognize the value in imperfection? If there is a risk in publishing something machine-generated, will the audience reward that risk or punish it? I believe they will reward it for its usefulness and overlook its imperfections.
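The "find a particular part of a story" use case can be sketched in a few lines of Python. The sketch assumes the transcript has already been split into timestamped segments. The `Segment` type, the `find_mentions` function, and the sample text are hypothetical names and data for illustration, not part of Audiosear.ch or any Pop Up Archive API.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    """One chunk of a transcript and the time (in seconds) where it starts."""
    start_seconds: float
    text: str


def find_mentions(segments: List[Segment], query: str) -> List[float]:
    """Return the start times of segments that contain every query word."""
    query_words = re.findall(r"[a-z0-9']+", query.lower())
    if not query_words:
        return []
    hits = []
    for segment in segments:
        # Compare on lowercase word tokens so punctuation and case are ignored.
        segment_words = set(re.findall(r"[a-z0-9']+", segment.text.lower()))
        if all(word in segment_words for word in query_words):
            hits.append(segment.start_seconds)
    return hits


# Made-up transcript segments for the example.
segments = [
    Segment(0.0, "Welcome to the show."),
    Segment(12.5, "Today we talk about speech recognition."),
    Segment(47.0, "Speech recognition still stumbles on proper nouns."),
]

print(find_mentions(segments, "speech recognition"))  # [12.5, 47.0]
```

A transcript with a few wrong words still lets a listener jump to the 12.5-second mark, because the match only needs the query words to have been recognized correctly.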
Visit the new audio discovery project from Pop Up Archive: Audiosear.ch