Analysis of Automatic Speech Recognition Tools

Jay Mehta
Published in Version 1 · 6 min read · Mar 30, 2023

At the Version 1 Innovation Labs, we are working to identify suitable state-of-the-art speech-to-text tools for transcribing audio/video files and to compare them against a set of evaluation parameters.

For this, we created a sample dataset of audio files from team recordings and generated transcripts for them. Each transcript was then carefully corrected by removing unwanted terms, fixing misrecognized words, and so on. These corrected transcripts serve as the ground truth for the assessment.

We identified potential tools/models for generating transcripts, either via an API or offline. Installing these tools is quick and simple. A few of them, such as Azure and AWS, require subscriptions and charge for their services.

Later in this analysis, we evaluate the hypothesis transcripts against the ground truth using a few measurable parameters, such as Word Error Rate (WER), to find the most accurate and cost-efficient tool for generating transcripts from audio/video.


1. Azure Cognitive Services

Speech to Text is a feature of the Speech service, which is part of Azure Cognitive Services. The feature is available via the Speech SDK, the REST API, and the Speech CLI.

Batch transcription is used to transcribe large quantities of audio data held in storage, such as Azure Blob Storage. It is built on a Universal Language Model trained on Microsoft-owned data.

In many instances, the base model is insufficient; in such cases, a custom speech model can be built by training with additional data.

Training a custom model can increase accuracy and reliability over the base model.
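As a minimal sketch of recognizing speech from a file through the Speech SDK for Python (the subscription key, region, and file name below are placeholders, not values from this analysis):

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials: use your own Speech resource key and region.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="westeurope")
audio_config = speechsdk.audio.AudioConfig(filename="meeting.wav")

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

# Recognize a single utterance from the audio file.
result = recognizer.recognize_once_async().get()
print(result.text)
```

Note that recognize_once_async returns only the first recognized utterance; longer recordings are better served by continuous recognition or the batch transcription REST API.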

The documentation is available here.

2. AWS Transcribe

Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text. You can use Amazon Transcribe as a standalone transcription service or to add speech-to-text capabilities to any application.

It transforms audio data stored as a media file in an Amazon S3 bucket to text data. You are conducting batch transcription if you are transcribing media files stored in a bucket.

As with Azure, the model used here is a pre-trained Amazon model. You can also create a custom language model to meet your specific needs.
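As a rough sketch of starting a batch job with the boto3 SDK (the job name and S3 URI are placeholder assumptions; credentials and region come from your AWS configuration):

```python
import time

import boto3

transcribe = boto3.client("transcribe")

# Placeholder job name and S3 URI for a media file in your bucket.
transcribe.start_transcription_job(
    TranscriptionJobName="team-recording-demo",
    Media={"MediaFileUri": "s3://my-bucket/meeting.wav"},
    MediaFormat="wav",
    LanguageCode="en-US",
)

# Batch jobs run asynchronously: poll until the job finishes.
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName="team-recording-demo")
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

if status == "COMPLETED":
    # URI of the JSON file containing the transcript.
    print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
```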

The documentation is available here.

3. OpenAI Whisper API

The OpenAI Whisper model is an automatic speech recognition model developed by OpenAI, designed to produce accurate and natural-reading transcripts.

It is a Transformer-based encoder-decoder model trained on a large and diverse corpus of multilingual audio, which makes it robust to accents, background noise, and technical vocabulary.

The Whisper model is available through OpenAI’s API, which allows developers to integrate it into their applications and use it to transcribe or translate audio for a variety of purposes.
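A minimal sketch of calling the API with the openai Python package as it looked at the time of writing; the API key and file name are placeholders:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder key

# The API accepts common audio formats (mp3, wav, m4a, ...) up to 25 MB.
with open("meeting.mp3", "rb") as audio_file:
    transcript = openai.Audio.transcribe("whisper-1", audio_file)

print(transcript["text"])
```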

The documentation is available here.


4. AssemblyAI API

AssemblyAI’s API can be used to transcribe and understand audio/video files with their AI models.

This API can transcribe audio/video files that are available via a URL, for example objects in S3 buckets or blobs in Azure Storage.

Files can also be uploaded straight into their API and accessed from there.
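As a hedged sketch of the submit-then-poll workflow against AssemblyAI’s v2 REST API, with a placeholder API key and audio URL:

```python
import time

import requests

API_KEY = "YOUR_API_KEY"  # placeholder
headers = {"authorization": API_KEY}

# Submit a transcription job for a publicly reachable audio URL
# (e.g. a pre-signed S3 or Azure Blob URL).
resp = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    json={"audio_url": "https://example.com/meeting.mp3"},
    headers=headers,
)
transcript_id = resp.json()["id"]

# Poll until the job completes or fails.
while True:
    poll = requests.get(
        f"https://api.assemblyai.com/v2/transcript/{transcript_id}",
        headers=headers,
    ).json()
    if poll["status"] in ("completed", "error"):
        break
    time.sleep(3)

print(poll.get("text"))
```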

The documentation is available here.

5. OpenAI Whisper (offline)

Whisper is a speech recognition model that can be used for a variety of purposes. It is a multitasking model trained on a large dataset of diverse audio, and it can perform multilingual speech recognition, speech translation, and language identification.

There are five model sizes available (tiny, base, small, medium, and large); four of them also have English-only versions, which offer a better trade-off between speed and accuracy for English audio.

This model can be used offline (on-premises).

In this analysis, we used the large model. Depending on the model selected, creating a transcript may take some time.

In comparison to OpenAI’s API, the offline model has no size restriction on the audio/video input, which makes it much easier to use.
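A minimal sketch of offline transcription with the open-source whisper package (it assumes ffmpeg is installed on the machine, and "meeting.wav" is a placeholder file name):

```python
import whisper

# "large" gives the best accuracy; smaller models (tiny/base/small/medium)
# are faster and need less GPU memory.
model = whisper.load_model("large")

result = model.transcribe("meeting.wav")
print(result["text"])
```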

The documentation is available here.

6. Vosk

Vosk is an offline open-source speech recognition toolkit.

Vosk models are small (about 50 MB) but provide continuous large-vocabulary transcription, zero-latency response with a streaming API, a reconfigurable vocabulary, and speaker identification.

To get the most out of this toolkit, you can create your own language model and use it to get more accurate results for your domain.
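A minimal sketch of offline transcription with Vosk; the model directory and audio file below are placeholders, and the audio is assumed to be 16-bit mono PCM:

```python
import json
import wave

from vosk import KaldiRecognizer, Model

# Placeholder path to an unpacked Vosk model directory.
model = Model("vosk-model-small-en-us-0.15")

# Vosk expects 16-bit mono PCM audio; convert beforehand if needed.
wf = wave.open("meeting.wav", "rb")
rec = KaldiRecognizer(model, wf.getframerate())

# Feed the audio to the recognizer in chunks (this is the streaming API).
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

# FinalResult returns a JSON string holding the full transcript.
print(json.loads(rec.FinalResult())["text"])
```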

The documentation is available here.

7. Others

There are numerous other offline models that can be used to produce transcripts and perform a variety of other tasks in automatic speech recognition.

We tried several of these on our sample audio to obtain transcripts. They may not be as useful as anticipated, due to factors such as the custom model training required, the high-end GPU infrastructure needed to run them on a local machine, or simply the steeper learning curve involved in using them.


Evaluation

We used the Python package JiWER to evaluate the transcripts generated by the different tools and compare which came out on top.

JiWER is a simple and fast Python package for evaluating an automatic speech recognition system.

Using this, we calculated the following measures:

  • word error rate (WER)
  • match error rate (MER)
  • word information lost (WIL)
  • word information preserved (WIP)

For example, WER measures the ratio of errors in a transcript to the total number of words spoken: WER = (S + D + I) / N, where S, D, and I are the counts of substituted, deleted, and inserted words, and N is the number of words in the reference. A lower WER in speech-to-text implies better speech recognition accuracy.
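As an illustrative sketch with made-up reference/hypothesis strings (not from our dataset), the four measures can be computed like this:

```python
import jiwer

reference = "we will review the quarterly results on friday"   # corrected ground truth
hypothesis = "we will review the quarterly result on friday"   # tool-generated transcript

print("WER:", jiwer.wer(reference, hypothesis))
print("MER:", jiwer.mer(reference, hypothesis))
print("WIL:", jiwer.wil(reference, hypothesis))
print("WIP:", jiwer.wip(reference, hypothesis))
```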

Our results clearly show that the WER for Azure and the offline OpenAI Whisper model is the lowest; as a result, they are the most effective speech-to-text tools among those we evaluated.

Pricing

When it comes to pricing, each service provider has its own model and structure. Here is a rough estimate, in tabular form, of the cost per minute/hour, based on the length of the audio.

Conclusion

After comparing the performance and capabilities of the different tools, Azure Speech Service and the OpenAI Whisper model (used on-premises) stand out for speech-to-text transcription.

Azure Speech Service is the better choice when accessed via its REST APIs. It generates speech transcriptions in numerous languages with high accuracy and reliability, offers customizable language and acoustic models, and supports different audio formats.

The OpenAI Whisper model, on the other hand, is better suited for on-premises use because it allows more control and customization over the transcription process, such as fine-tuning the models and adjusting hyperparameters as needed. The Whisper model also offers competitive transcription accuracy and speed, with the added benefit of handling a broader range of speech types and accents.

As a result, the choice between these two options is heavily influenced by the particular use case and requirements. If high-volume, cloud-based transcription is required, Azure Speech Service is the better option. However, if greater customization, control, and flexibility are needed, and the transcription is to be done on-site, the OpenAI Whisper model is the better choice.

About the Author:
Jay Mehta is a Microsoft .NET Developer at the Version 1 Innovation Labs.
