Comparison of Speech-to-Text APIs


Speech-to-text systems are a critical intermediate step toward natural language processing and clean, human-readable transcripts. Automatic Speech Recognition (ASR) is one of the most demanding fields in the vast scope of machine learning, and with error rates dropping with each new iteration of a model, the competition is at its peak. Manual transcription is gradually being taken over by machines, which still leave room for error, unlike the near-zero error a human transcriber offers. For machines and humans to understand each other, there is a dire need for speech processing.

You can see that the state of speech recognition and artificial intelligence still has a way to go to match human capability. To close that gap, we'll be looking for advancements in voice-recognition technologies that resolve existing accuracy and security issues and can fully operate as embedded solutions.

You do not have to become trained in Natural Language Processing to use speech recognition; you just need to use one of the APIs already available for ASR. Load an HTTP request with the audio/video, send it to the API's server, and you will receive the transcript in response.
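As a rough illustration, that round trip might look like this in Python. This is a minimal sketch using the `requests` library; the endpoint URL, API key, and response shape are hypothetical placeholders, since every provider defines its own endpoint, authentication scheme, and supported audio formats.

```python
import requests

# Hypothetical endpoint and key -- substitute your provider's values.
API_URL = "https://api.example-stt.com/v1/recognize"
API_KEY = "YOUR_API_KEY"

# Stream the raw audio bytes in the request body.
with open("sample.wav", "rb") as audio_file:
    response = requests.post(
        API_URL,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "audio/wav",
        },
        data=audio_file,
    )

response.raise_for_status()
print(response.json())  # most providers return the transcript as JSON
```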

If you're interested in creating your own speech-processing system, it will take a lot of training, time, and data, not to forget a model that caters to your ideal needs.

The Big Four already provide us with near-perfect speech-processing APIs, so why do we need to look at other APIs?

There are many APIs, other than those provided by the Big Four, which offer comparable accuracy.

Here is a list of APIs that I have compared.

  • IBM Watson
  • Amazon Transcribe
  • Google Cloud
  • Microsoft Azure
  • Speechmatics
  • Twilio

The comparisons I made were on the basis of Word Error Rate (WER).

Word Error Rate

Word Error Rate is a common metric for evaluating Automatic Speech Recognition systems. It compares a reference transcript with a hypothesis (the transcript produced by the system).

Word Error Rate = (S + D + I) / N

where

  • S is the number of substitutions,
  • D is the number of deletions,
  • I is the number of insertions and
  • N is the number of words in the reference

REFERENCE: Some of words
HYPOTHESIS: Sum of words

A substitution occurs in this case: ‘Some’ is substituted by ‘Sum’.
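Here S = 1, D = 0, I = 0, and N = 3, so WER = (1 + 0 + 0) / 3 ≈ 0.33.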

WER calculation is based on Levenshtein distance at the word level.

Levenshtein Distance
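To make this concrete, here is a minimal sketch of a word-level WER computation in Python. It assumes simple whitespace tokenization; real evaluations usually normalize case and punctuation before scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein (edit) distance."""
    ref = reference.split()
    hyp = hypothesis.split()

    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no cost
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j - 1],  # substitution
                    dp[i - 1][j],      # deletion
                    dp[i][j - 1],      # insertion
                )

    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("Some of words", "Sum of words"))  # 1 edit / 3 words ≈ 0.33
```

Running it on the example above gives 1/3, matching the hand calculation.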

I worked with stereo sound (two separate audio channels) across six files, each with a single speaker and no background disturbance. This kind of assessment does not analyze the types of transcription errors, so further work is required to identify the main source(s) of error and to focus any research effort.

WER Chart

The speakers in the audio files speak with three different accents: American, Asian, and British.

I calculated the mean WER for each of the APIs and drew a graph based on that.

These cloud APIs keep working to improve their models using customer data, so the numbers computed here may differ by the time you read this.

.

.

.

This is my first article ever, and I would like to keep writing while improving at the same time. Any advice on writing or presenting better would be appreciated.
