Search Through Sound: Finding Phrases in Audio

Scott Stephenson
Deepgram
4 min read · Jan 7, 2016

In January 2015 my co-founder and I were kicking around the idea of a search engine that would let a person find phrases in a block of audio. We were looking for something that could peer into interviews, podcasts, video lectures, things like that. And if it was done right, you would be able to search through many seasons of a certain TV show and find all the crucial moments, like “You’re fired!”.

We thought, ‘This has to exist, right?’. Surprisingly, no. There wasn’t a company out there that really provided the functionality, certainly not in a way that was useful to us. So we started hacking together a Google-based transcription pipeline to see if we could get a barebones prototype going. In a couple of days it was running: search for something, and most of the time you got it. Huge pat on the back, right?

Speech recognition is hard.

Reality hit us when we noticed a problem. Sometimes the phrase was definitely spoken — you could hear it plain as day in the audio stream — but the search missed it. It turns out this is due to the inaccuracy of automatic speech transcription software.

We went on a quest to get our hands on some top-quality speech recognition badassery. What we were met with was another dose of reality: speech recognition is hard. Dig into the current speech research scene and you’ll find that this is still a very active area.

The big tech companies (Google, Microsoft, Apple, etc.) put enormous effort into getting this sort of thing right. Even then, you generally only get 90% word accuracy, and that’s on very clean, well-recorded speech. On conversational speech of questionable recording quality, say, YouTube videos, the word error rate gets pretty bad (sometimes more than half the words are wrong!).
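For concreteness, here’s what those accuracy numbers mean. Word error rate (WER) counts the word substitutions, deletions, and insertions needed to turn the transcript into a reference transcript, divided by the number of reference words. A minimal sketch in Python, with made-up sentences:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via classic dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        cur = [i]
        for j, hw in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rw != hw)))  # substitution
        prev = cur
    return prev[-1] / len(ref)

# One word wrong out of ten: "90% word accuracy" is a 10% WER.
print(wer("we need to find every phrase spoken in this audio",
          "we need to find every freeze spoken in this audio"))  # 0.1
```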

Can audio search work well?

This got us wondering: can we improve the audio search situation? We landed on something we think is pretty good: search based on how a phrase sounds, not on its precise spelling in text. We were sure this would give better results, but we weren’t sure just how much better.

We dug into the research to see if this technique had been tried in production form. We turned up quite a few papers, most not totally relevant, but one was striking: a Google academic paper on searching through political speeches from 2008. ‘What was their method?’, you might wonder. Plain old text transcription, with no incorporation of the way the audio actually sounded. Bummer, right?

Use the way words sound.

What we were stumbling across is what speech researchers call keyword search. There is an existing method for doing this, acoustic keyword spotting, but it requires reprocessing the audio from scratch for each and every search, which is totally impractical at scale (a rough sketch of why is below). So, yeah, applying this idea is a fairly difficult problem. We didn’t really know just how hard at the time, but we know now (eight months of coding our first search engine, and starting a company along the way, helps beat that into you).
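Here’s the back-of-envelope version. All the numbers are invented for illustration, but the shape of the math is the point: per-search reprocessing multiplies the decode cost by the number of searches, while an index pays it once.

```python
hours_of_audio = 10_000    # the library you want to search (invented)
decode_rate = 1.0          # assume 1 hour of audio per CPU-hour (invented)
searches_per_day = 1_000   # also invented

# Acoustic keyword spotting: every query re-decodes the whole library.
spotting_cost = hours_of_audio / decode_rate * searches_per_day
print(f"spotting: {spotting_cost:,.0f} CPU-hours per day")  # 10,000,000

# Index once, search many: the decode cost is paid a single time,
# and each query afterward is a cheap lookup in the prebuilt index.
index_cost = hours_of_audio / decode_rate
print(f"indexing: {index_cost:,.0f} CPU-hours, once")       # 10,000
```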

We landed on a phonetic representation of the audio. You’ve seen phonetic spellings in a dictionary, right? We built software that lets you upload audio and have the server process it into a giant searchable lattice. With a lattice like this, you can fuzzily scan an entire audio file for your search phrase in a fraction of a second. The improvement over the text-based approach is huge: search recall goes from a tepid 45% to a grin-inducing 90%+. Now we have our secret sauce.
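To make the idea concrete, here’s a toy sketch of phonetic search. This is not our production system: it pretends the recognizer’s output has been flattened to a single best phone sequence (a real lattice keeps many alternative phone paths, with timestamps), and the pronunciation entries are hypothetical ARPAbet-style spellings in the spirit of CMUdict.

```python
# Hypothetical pronunciation entries (ARPAbet-style, a la CMUdict).
PRONUNCIATIONS = {
    "you're": ["Y", "UH", "R"],
    "fired":  ["F", "AY", "ER", "D"],
}

def to_phones(phrase):
    """Spell the query by how it sounds rather than how it's written."""
    return [p for word in phrase.lower().split() for p in PRONUNCIATIONS[word]]

def phone_distance(a, b):
    """Levenshtein distance over phone symbols (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (pa != pb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_find(phone_stream, query, max_errors=1):
    """Slide the query's phones along the stream, tolerating a few errors."""
    q = to_phones(query)
    return [i for i in range(len(phone_stream) - len(q) + 1)
            if phone_distance(phone_stream[i:i + len(q)], q) <= max_errors]

# The recognizer misheard the final /D/ as /T/, so a text transcript
# would never contain the word "fired", but the sounds still line up.
stream = ["AE", "N", "D", "Y", "UH", "R", "F", "AY", "ER", "T"]
print(fuzzy_find(stream, "you're fired"))  # [3]
```

A real implementation searches the full lattice rather than one phone string, and scores candidates by acoustic likelihood instead of a hard error threshold, but the spirit is the same: match sounds, not spellings.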

We’ve got an audio toolbox.

Building a web API isn’t what we were trained to do, but we weren’t strangers to hacking all manner of software contraptions into reality for the particle physics experiments we worked on previously. I spent a lot of time getting the pieces of the processing chain into working order, and Noah took to building a squeaky-clean API and search experience. Now we have an audio toolbox that lets you upload audio or video and search the resulting lattice, all with brutally simple API calls. The API is expanding with custom machine learning efforts to help classify that audio, too. Got a huge pile of call data and you’d love to know what’s inside? We can help with that!
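To give a flavor of the flow (and only a flavor: the endpoint paths, parameter names, and response fields below are invented for illustration, not our documented API), an upload-then-search session might look like this:

```python
import requests

API = "https://api.deepgram.com"  # assumed base URL, for illustration only

# 1. Upload the audio; the server decodes it into a searchable lattice.
#    The "/upload" path and "id" field are hypothetical.
with open("podcast_episode.mp3", "rb") as f:
    upload = requests.post(f"{API}/upload", files={"file": f}).json()

# 2. Search the lattice by how the phrase sounds, not how it's spelled.
results = requests.get(
    f"{API}/search",
    params={"content_id": upload["id"], "query": "you're fired"},
).json()

for match in results["matches"]:
    print(match["start_time"], match["confidence"])
```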

The bulk of the effort was painstakingly completed over the summer, with the latter half taking place at a San Mateo startup accelerator, BoostVC. The accelerator culminates in a demo day where we pitch our product to (hopefully very eager) investors who can help Deepgram flourish as an early-stage startup.

We would love feedback on the demo.

If you want to test out Deepgram’s robot ears, you can start by visiting www.deepgram.com. Drop us a line if you think you’ll be able to discover something new with Deepgram.

Test the demo, give feedback, and get early API access at www.deepgram.com.

Originally published at blog.deepgram.com on January 7, 2016.
