Why the Audio Compression Format Impacts Speech to Text Transcription Accuracy

Marco Noel
IBM Watson Speech Services
5 min read · Dec 21, 2020

Although Watson Speech to Text supports multiple audio file formats, they do not all deliver the same level of accuracy. Not all audio formats are created equal! Accuracy is a top priority for many of our customers, especially when transcribing calls to extract insights, what we often call the “call analytics” use case. In this article, I will explain why audio compression matters and offer some recommendations on which formats to use in specific situations.

To compress or NOT to compress: that is the question…


For speech recognition, there are three key factors that might influence your choice when converting or exporting audio files:

  • File size limit: The Watson Speech to Text service supports audio files up to 100 MB on the synchronous / live stream endpoint (/recognize). For anything larger, up to 1 GB, you have to use the asynchronous endpoint (/recognitions). File size alone determines the endpoint, regardless of audio format (see the sketch after this list).
  • Call duration: This obviously has a direct impact on the audio file size. For typical calls between an agent and a customer, I have not seen anything exceeding 30 minutes, which would not require compression. Of course, if your use case deals with meetings or workshops, recordings can easily exceed 60 minutes.
  • Data volume: Smaller files transfer faster, so at high volumes, compression lets you push more audio through the service in the same amount of time.
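
To make the endpoint choice concrete, here is a minimal sketch using the ibm-watson Python SDK. The API key, service URL, and file path are placeholders, and treat the exact byte threshold as my assumption: check the current documented limits for your plan.

```python
import os
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials: substitute your own service instance values.
stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
stt.set_service_url("YOUR_SERVICE_URL")

SYNC_LIMIT_BYTES = 100 * 1024 * 1024  # 100 MB cap on the /recognize endpoint

def transcribe(path, content_type="audio/wav"):
    """Route an audio file to the sync or async endpoint based on its size."""
    with open(path, "rb") as audio:
        if os.path.getsize(path) <= SYNC_LIMIT_BYTES:
            # Synchronous /recognize: blocks until the transcript is ready.
            return stt.recognize(audio=audio, content_type=content_type).get_result()
        # Asynchronous /recognitions: creates a job you poll with check_job().
        return stt.create_job(audio=audio, content_type=content_type).get_result()
```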

Don’t forget: you get billed for the audio minutes transcribed, not the size of the files you upload.

Compression reduces the size of your audio files, which should get you results faster, so it must be a good thing, right? But what about speech accuracy?

Lossless vs lossy compression: I can’t hear the difference, so why should I care…

One very important factor to always consider for speech accuracy is the integrity of the audio features. When you listen to the audio, you might not hear a difference, but for speech recognition, it is a big deal. The type of compression you choose has a direct impact on speech accuracy.

Let me try to explain the difference between the two types of compression:

Lossless: the audio file is compressed without any impact on audio quality. It’s like a zip archive for audio files: when decompressed, you recover the original exactly. The most common lossless codec is the Free Lossless Audio Codec (FLAC).

Lossy: this type of compression permanently discards data to reduce the file size, at ratios as high as 10-to-1. To the human ear, the audio sounds the same as the uncompressed original, but for Watson STT, recognition accuracy suffers. The decompression process can also introduce unwanted sound artifacts. Common lossy formats include MP3, AAC, OGG Opus, OGG Vorbis, and WMA.

Note: OGG Opus is the logical successor to OGG Vorbis: it offers lower latency and high audio quality at similar or smaller file sizes.
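
You can check the lossless/lossy distinction yourself. The sketch below, which assumes ffmpeg is installed and that call.wav is a hypothetical 16 kHz mono recording, round-trips a WAV file through FLAC and MP3 and compares the decoded samples: the FLAC copy decodes back bit-identical, while the MP3 copy does not.

```python
import subprocess

def decoded_pcm(path):
    """Decode any audio file to raw 16-bit 16 kHz mono PCM via ffmpeg."""
    result = subprocess.run(
        ["ffmpeg", "-i", path, "-f", "s16le", "-ar", "16000", "-ac", "1", "-"],
        capture_output=True, check=True,
    )
    return result.stdout

# Compress the same source with a lossless and a lossy codec.
subprocess.run(["ffmpeg", "-y", "-i", "call.wav", "call.flac"], check=True)
subprocess.run(["ffmpeg", "-y", "-i", "call.wav", "call.mp3"], check=True)

print(decoded_pcm("call.wav") == decoded_pcm("call.flac"))  # True: bit-identical
print(decoded_pcm("call.wav") == decoded_pcm("call.mp3"))   # False: data was discarded
```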

The most common uncompressed, lossless audio format is WAV. The table below shows how many minutes of audio fit into a 100 MB file for different formats. Although a 100 MB WAV file can only hold about 55 minutes of audio, that is usually enough for a customer call. Because the audio features remain fully intact, WAV files are optimal for speech recognition, which means your transcription will be more accurate.

Audio duration for a 100 MB file in different formats
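
The 55-minute WAV figure is easy to reconstruct with back-of-the-envelope arithmetic. Assuming uncompressed 16 kHz, 16-bit, mono PCM (my assumption of a typical broadband export), the bitrate is 256 kbps; the lossy bitrates below are illustrative assumptions, not values from our tests.

```python
def minutes_per_100mb(bitrate_kbps):
    """Minutes of audio that fit in a 100 MB file at a given bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return 100 * 1024 * 1024 / bytes_per_second / 60

print(minutes_per_100mb(256))  # ~55 min of WAV (16 kHz * 16 bits * 1 channel)
print(minutes_per_100mb(128))  # ~109 min at a common 128 kbps MP3 setting
print(minutes_per_100mb(24))   # ~582 min at a 24 kbps OGG Opus setting
```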

We ran some experiments comparing the Word Error Rate (WER) across a set of 15 audio files with different compressions. Here is a summary of our findings (a sketch for reproducing this kind of measurement follows the list):

  • As expected, WAV and FLAC delivered the best WER results — this is our baseline — because the audio remained intact with no feature loss.
  • OGG Opus saw a slight degradation of 2% WER relative to our baseline.
  • MP3 delivered the worst results, with a 10% WER degradation relative to WAV/FLAC.
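
If you want to reproduce this kind of comparison on your own audio, the open-source jiwer package computes WER between a human reference transcript and an STT hypothesis. The transcript file names below are hypothetical.

```python
from jiwer import wer  # pip install jiwer

reference = open("reference_transcript.txt").read()  # human-verified transcript
hyp_wav = open("stt_output_wav.txt").read()          # Watson STT output from the WAV file
hyp_mp3 = open("stt_output_mp3.txt").read()          # Watson STT output from the MP3 file

baseline = wer(reference, hyp_wav)
degraded = wer(reference, hyp_mp3)
print(f"WAV WER: {baseline:.2%}  MP3 WER: {degraded:.2%}")
print(f"Relative degradation: {(degraded - baseline) / baseline:.1%}")
```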

Using the same test sets, we also ran some experiments with the speaker diarization feature. We noticed some degradation in speaker label accuracy with MP3 as well, but not to the same degree as with WER.
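
For reference, diarization is requested with the speaker_labels parameter on the same recognize call. A minimal sketch, reusing the stt client from the earlier example (the model name and file name are placeholder choices):

```python
# Request per-word speaker labels along with the transcript.
with open("call.wav", "rb") as audio:
    result = stt.recognize(
        audio=audio,
        content_type="audio/wav",
        model="en-US_NarrowbandModel",  # pick the model matching your audio
        speaker_labels=True,
    ).get_result()

# Each entry maps a time span to a numeric speaker id.
for label in result.get("speaker_labels", []):
    print(label["speaker"], label["from"], label["to"])
```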

Recommendations


Based on the results above, here are some recommendations on the use of audio compression for speech recognition use cases:

  • Stay uncompressed and lossless: If your maximum call duration is less than 55 minutes (< 100 MB), keep the audio files uncompressed and lossless (WAV format) so the audio quality stays optimal, then use the Watson STT synchronous endpoint.
  • Use STT asynchronous: If your WAV file exceeds 100 MB, use the Watson STT asynchronous endpoint instead of compressing the file, in order to get the best accuracy.
  • Use compressed but lossless: If you absolutely have to compress your audio file, use FLAC (lossless compression). Your file size shrinks while the audio quality stays intact, without losing any features.
  • Use lossy compression… as a last resort: If you need even more compression, select OGG Opus (lossy compression), which showed the least degradation in speech accuracy in our tests (see the conversion sketch below).
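
As a starting point for the last two recommendations, here is a sketch of both conversions using ffmpeg (assumed installed; the file names and the 24 kbps Opus bitrate are placeholder choices, not tested values):

```python
import subprocess

# Lossless: WAV -> FLAC. Smaller file, audio features fully preserved.
subprocess.run(
    ["ffmpeg", "-y", "-i", "call.wav", "-c:a", "flac", "call.flac"], check=True
)

# Last resort, lossy: WAV -> OGG Opus. Much smaller, some accuracy loss.
subprocess.run(
    ["ffmpeg", "-y", "-i", "call.wav", "-c:a", "libopus", "-b:a", "24k", "call.ogg"],
    check=True,
)
```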

If you need an even higher level of compression than that, weigh it against the loss of audio features, which will directly hurt speech accuracy.

As usual, I encourage you to conduct your own experiments with the different types of compression, see the results for yourself, and choose the audio format that best suits your needs.

Click HERE to learn more about IBM Watson Speech to Text, or go through the STT Getting Started video HERE.


Marco Noel
Sr Product Manager, IBM Watson Speech / Language Translator. Very enthusiastic and passionate about AI technologies and methodologies. All views are my own.