My Summer Project!

Summer’s coming to an end and I recently completed my TWiML internship. If you’re just joining, I’m Malaika Charrington, TWiML’s first summer intern. This summer I worked on a project in which we tested several different machine transcription services, evaluating their features and efficiency to determine the best service for transcribing the podcast. Our ultimate goal is to transcribe each podcast and then tag each transcription by topic, allowing listeners to more easily find podcasts that fit their interests. See my previous posts on the project: Introducing TWiML & AI’s newest Intern and Transcription Project Update.

When I last updated you all, I told you about several changes I made to my project. At that point, my plan was to test 20 different clips against 6 different speech-to-text services using their default settings. We quickly realized that this plan was not ideal, as it wouldn’t allow us to explore the many features and capabilities of the services we were testing. On top of this, since time was somewhat limited, building 6 functional API integrations and then manually comparing all 120 transcriptions to my ground truths was a bit unrealistic. We thus changed the plan again, narrowing our view to a single API that friends in the know suggested might be the best fit for us: the Google Cloud Speech-to-Text service.

In investigating the service, we explored several interesting features, such as Google’s word timestamping feature and their word alternatives feature, which gives different possibilities for words that the service is uncertain about.

With help from my dad and a freelance developer, Luke Reichold, whom we met in our co-working space, we developed a program that transcribed our audio, concatenating the words, labeling the different speakers (although these labels were not very accurate) and punctuating the text. The program ran through each of Google’s audio models and several other API features, and put each transcription in a text file within a designated folder.
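To make the "concatenate, label, and save" step concrete, here is a minimal sketch, not our actual program. It assumes a response shaped like the JSON the service returns (a list of results, each with alternatives that carry a list of words and an optional speakerTag). The sample response, the function names, and the file naming scheme are all hypothetical.

```python
from pathlib import Path

# Hand-made sample shaped like a recognize response; not real API output.
SAMPLE_RESPONSE = {
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "Hello everyone. Thanks for having me.",
                    "words": [
                        {"word": "Hello", "speakerTag": 1},
                        {"word": "everyone.", "speakerTag": 1},
                        {"word": "Thanks", "speakerTag": 2},
                        {"word": "for", "speakerTag": 2},
                        {"word": "having", "speakerTag": 2},
                        {"word": "me.", "speakerTag": 2},
                    ],
                }
            ]
        }
    ]
}


def format_line(speaker, words):
    """Join words into one line, prefixed with a speaker label if known."""
    text = " ".join(words)
    return f"Speaker {speaker}: {text}" if speaker is not None else text


def words_to_text(response):
    """Concatenate the top alternative of each result into labeled lines."""
    lines = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        best = alternatives[0]
        words = best.get("words", [])
        if not words:
            # No word-level detail: fall back to the plain transcript.
            text = best.get("transcript", "").strip()
            if text:
                lines.append(text)
            continue
        current_speaker = None
        current_words = []
        for w in words:
            speaker = w.get("speakerTag")
            # Start a new line whenever the speaker label changes.
            if current_words and speaker != current_speaker:
                lines.append(format_line(current_speaker, current_words))
                current_words = []
            current_speaker = speaker
            current_words.append(w["word"])
        if current_words:
            lines.append(format_line(current_speaker, current_words))
    return "\n".join(lines)


def save_transcript(text, out_dir, clip_name, model):
    """Write one transcript per clip and model into the output folder."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{clip_name}_{model}.txt"
    path.write_text(text + "\n", encoding="utf-8")
    return path
```

Calling words_to_text(SAMPLE_RESPONSE) yields one line per speaker turn, and save_transcript then drops that text into a file such as clip-01_video.txt.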

Model

The Google Speech-to-Text API supports three distinct audio “models” (default, video, and phone call, which I refer to below as the telephone model), each of which makes different assumptions about the way the audio is recorded. To try to find the best one for our situation, we tested all three.
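As a rough illustration of testing each model, the sketch below builds one request body per model without sending anything. The JSON field names follow the service's REST request format as I understand it; the bucket path, the helper name, and the choice of which options to switch on are assumptions for the example.

```python
# Model identifiers as used in the request; "phone_call" is the telephone model.
MODELS = ("default", "video", "phone_call")


def build_recognize_request(gcs_uri, model, language_code="en-US"):
    """Return a request body (a plain dict) for one clip and one model."""
    return {
        "config": {
            "languageCode": language_code,
            "model": model,
            # Ask for per-word timestamps and punctuated text.
            "enableWordTimeOffsets": True,
            "enableAutomaticPunctuation": True,
        },
        "audio": {"uri": gcs_uri},
    }


# Hypothetical clip location; one request per model for the same audio.
requests_by_model = {
    model: build_recognize_request("gs://example-bucket/clip-01.flac", model)
    for model in MODELS
}
```

Keeping everything except the model field identical is what makes the three transcriptions of a clip comparable.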

We transcribed seven audio clips with each of the three audio models. As you can see in the chart below, in all but one instance, Google’s video model was the most accurate, averaging roughly 98% accuracy. The telephone model was the second most accurate, averaging 95% accuracy. The default model turned out to be the least accurate, at 92% accuracy.

The poor performance of the default model validated our decision to more thoroughly test a single transcription service versus comparing multiple services based on their default options.
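For readers who want to score transcriptions themselves, here is one common way to put a number on accuracy: one minus the word error rate, where errors are counted with a word-level edit distance against the ground truth. This is a sketch of that general technique and is an assumption on my part, not a description of how the percentages above were produced. The function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split into words, ignoring punctuation."""
    normalized = text.lower().replace("\u2019", "'")
    return re.findall(r"[a-z0-9']+", normalized)


def word_edit_distance(ref, hyp):
    """Minimum number of word substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Return 1 - WER, floored at 0, for a ground truth and a transcription."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    return max(0.0, 1.0 - word_edit_distance(ref, hyp) / len(ref))
```

For example, word_accuracy("I wonder whether the weather will hold", "I wonder weather the weather will hold") counts one substitution out of seven reference words.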

Input Type

We tested two different file types: uncompressed WAV files, and compressed MP3 files converted into the lossless FLAC format in order to submit them via the Google API. The reason we tested both is that when the WAV files that we originally record are converted to compressed MP3 files for publication, the file loses data, some of which might impact the transcription process. However, the MP3 files are smaller and much easier to deal with, so if we could get good results from them, that would be best.

The Google API surprisingly doesn’t accept MP3 files as input, perhaps as a way to enforce the best practice of transcribing from losslessly encoded audio. So, to understand the difference in transcription performance between the original WAV files and the published MP3 files, we converted the MP3 files to FLAC. Because FLAC is a lossless format, no information is lost or gained in that conversion; the FLAC files carry exactly the audio that survived MP3 compression.
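One way to do the MP3-to-FLAC conversion is with the ffmpeg command-line tool, driven from Python. This is a minimal sketch that assumes ffmpeg is installed; it is not necessarily the tool we used. The paths and function names are hypothetical, and downmixing to mono with -ac 1 is my assumption about what a single-channel request expects.

```python
import subprocess
from pathlib import Path


def flac_command(mp3_path, out_dir):
    """Build the ffmpeg command that converts one MP3 file to FLAC."""
    src = Path(mp3_path)
    dst = Path(out_dir) / (src.stem + ".flac")
    # -y overwrites an existing output file; -ac 1 downmixes to mono.
    return ["ffmpeg", "-y", "-i", str(src), "-ac", "1", str(dst)]


def convert_to_flac(mp3_path, out_dir):
    """Run the conversion and return the path of the new FLAC file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = flac_command(mp3_path, out_dir)
    subprocess.run(cmd, check=True)  # raises if ffmpeg exits with an error
    return Path(cmd[-1])
```

Building the command separately from running it makes it easy to print and inspect before converting a whole folder of clips.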

We found the difference in accuracy between the uncompressed and compressed audio to be quite significant. For the default and video models, the transcriptions from the WAV files were more accurate than those from their MP3-derived FLAC counterparts, whereas for files transcribed through the telephone model, the FLAC files were more accurate.

The graphs below compare transcription accuracy for uncompressed and compressed audio with identical speech content using Google’s three transcription models.

On average, transcriptions of the WAV files we tested were 96% accurate versus 94% for the MP3-derived FLAC files.

Phrase Hints

Though there were a number of times the transcription was completely off, many of the inaccuracies that I encountered were as simple as ‘in’ vs ‘an,’ or ‘weather’ vs ‘whether.’ In many cases, though, the service surprised me with its ability to use context to differentiate between homophones or accurately transcribe somewhat abstract technical language.

One interesting feature that we tried to employ was “phrase hints.” This is essentially a custom dictionary feature that allows the user to input context-specific or less common words so that they’re more easily transcribed by the API.

We had high hopes for this feature, but my attempt to use it to improve the accuracy of the transcriptions was surprisingly unsuccessful. While I only tried this feature for a single sample, for some reason, submitting phrase hints along with our API calls resulted in decreased transcription accuracy for two of the three tested models, despite my hand-picking each word for the custom dictionary from words that the API transcribed incorrectly when run without hints!
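To show roughly what submitting phrase hints looks like, the sketch below picks candidate hint words and attaches them to a request config. I picked my words by hand; the automated picking here (ground-truth words that never appear in the hint-free transcription) is a stand-in for that, and the function names are hypothetical. The speechContexts field with its phrases list is the request field for hints as I understand it.

```python
import re


def tokenize(text):
    """Lowercase and split into words, ignoring punctuation."""
    normalized = text.lower().replace("\u2019", "'")
    return re.findall(r"[a-z0-9']+", normalized)


def candidate_hints(reference, hypothesis, max_hints=50):
    """List ground-truth words that the hint-free transcription never produced."""
    heard = set(tokenize(hypothesis))
    hints = []
    for word in tokenize(reference):
        if word not in heard and word not in hints:
            hints.append(word)
    return hints[:max_hints]


def with_phrase_hints(config, phrases):
    """Return a copy of a request config with the hints attached."""
    updated = dict(config)
    updated["speechContexts"] = [{"phrases": list(phrases)}]
    return updated
```

For instance, if the ground truth says "TensorFlow" and the service heard "tensor flow", candidate_hints returns ["tensorflow"], which can then be passed to with_phrase_hints alongside the model setting. Running each model with and without the returned config is the comparison described above.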

The graph below compares the accuracy of one audio file transcribed using Google’s three models, once using phrase hints (red) and once without (blue).

Next Steps

Unfortunately, the time available ended up being our biggest constraint. If I had more time with the project, I would test more audio clips and test the phrase hint function more thoroughly. On top of this, I would love to see how the Google Speech-to-Text API performs when accents are present. In general, I would continue going into more depth in the research aspect of the project and exploring the API’s abilities and accuracy-enhancing features.

Overall, I’m excited that these exploratory efforts have left the TWiML team in a great spot to expand on the project and further explore generating and using transcripts.

This internship was a fantastic experience, and I had a lot of fun working on this project! I learned a lot of incredibly interesting tricks about coding, using the terminal, developing with APIs, using documentation, and countless other skills that I will continue to hone throughout my college experience. Thanks everyone for accompanying me on this journey! It’s been fun!

Header image: Google