Comparative Analysis of Three Open-Source Automatic Speech Recognition (ASR) Neural Network Models

By examining the accuracy and efficiency of three ASR neural network models, this article identifies the strongest open-source ASR model as of 2021.

Khoa Tran
7 min read · Jul 21, 2021
The parts involved in transcribing audio to text and in conversational AI

Summary

In this project, I examined three pre-trained automatic speech recognition models as an update to a 2020 study. The analysis builds on Nick Nagari’s previous article, which compared four open-source ASR models, by adding two new open-source models and Nvidia’s updated version of NeMo and testing their accuracy and processing times. Using six audio files from the Open Speech Repository, which contain Harvard sentences spoken by both male and female voices, I compared NeMo, Speech2Text, and Wav2Vec2 on accuracy, measured as word error rate (WER), and on processing time.

Background

With the rise of automation and voice-activated technologies, voice recognition has become critical to innovation and the growth of technology. Because human speech contains colloquialisms and abbreviations, extensive computation is required to turn natural language into accurate, meaningful output. Natural language processing (NLP) has two critical parts: automatic speech recognition (ASR) and natural language understanding (NLU). ASR converts speech to text, whereas NLU extracts meaning from that text. This article focuses on ASR, because its accuracy determines the quality of the final output.

About NeMo: Developed by Nvidia, NeMo is a Python toolkit for building, training, and fine-tuning GPU-accelerated conversational AI models through a simple interface. It is built for researchers working on automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech synthesis (TTS). Its primary objective is to help researchers in industry and academia reuse pretrained models and to make it easier to create new conversational AI models.

About Speech2Text: Developed at Facebook, the Speech to Text Transformer (S2T) is an end-to-end sequence-to-sequence transformer model trained for automatic speech recognition (ASR). It is trained with a standard autoregressive cross-entropy loss and generates transcripts autoregressively. It is a medium-sized pretrained model.

About Wav2Vec2: Also developed at Facebook, Wav2Vec2 is a large model pretrained and fine-tuned on 960 hours of Libri-Light and Librispeech, using speech audio sampled at 16kHz. The model only accepts audio input sampled at 16kHz.
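Because a 16kHz input rate is a hard requirement, it can help to check each file before transcription. The following is a minimal sketch, not the code used in this project: it assumes uncompressed WAV input and that FFmpeg (the tool used for the conversion described in the Results section) is installed. The function names and file names are hypothetical.

```python
import wave

TARGET_RATE_HZ = 16_000  # input rate the article states Wav2Vec2 requires


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, rate_hz=TARGET_RATE_HZ):
    """True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != rate_hz


def ffmpeg_resample_command(src, dst, rate_hz=TARGET_RATE_HZ):
    """Build an FFmpeg argument list that resamples src into dst.

    -y overwrites dst if it already exists; -ar sets the output sample rate.
    """
    return ["ffmpeg", "-y", "-i", src, "-ar", str(rate_hz), dst]
```

The returned list can be passed to subprocess.run(command, check=True) to perform the conversion. Building the command as a list rather than a single string avoids shell-quoting problems with file names that contain spaces.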

Results

The six audio files from the Open Speech Repository, which contain Harvard sentences read by male and female speakers, are sampled at 8kHz. I converted them to 16kHz using FFmpeg to match the audio input requirement of the three ASR models. Performance is reported as word error rate (WER), given as a percentage, and as processing time, expressed as a processing ratio: the number of seconds of audio processed per second of computation. The charts below show the difference in performance between the ASR models.
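For readers unfamiliar with the two metrics, here is a minimal sketch of how they can be computed. It is an illustration under assumptions, not the scoring code used for the charts: WER is taken as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript, and both transcripts are assumed to be already lowercased and free of punctuation. The function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # seen so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


def processing_ratio(audio_seconds, processing_seconds):
    """Seconds of audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


reference = "the lazy cow lay in the cool grass"
hypothesis = "the lazy cow lay in cool glass"
print(word_error_rate(reference, hypothesis))  # 0.25: one deletion and one substitution over 8 words
```

Multiplying the result by 100 gives the WER percentage. A higher processing ratio means faster transcription, while a lower WER means more accurate transcription.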

Results of word error rate and processing time for the three automatic speech recognition models described above on female audio. Nvidia’s NeMo had the fewest errors and the fastest processing time on all three audio clips.
Performance Results of ASR Models on Female Audios

This chart compares NeMo and Speech2Text on female audio transcription. NeMo had a significantly lower word error rate and processing time. Wav2Vec2 isn’t optimized for female audio and therefore isn’t shown.

Results of word error rate and processing time for the three automatic speech recognition models described above on male audio. Nvidia’s NeMo had the fewest errors and the fastest processing time on all three audio clips.
Performance Results of ASR Models on Male Audios

This chart compares the three ASR models described above on multiple male audio clips. The differences in word error rate (WER) and processing time between the models are much smaller for male audio than for female audio; however, Nvidia’s NeMo consistently transcribes audio with greater accuracy and efficiency, showing a lower WER and shorter processing times.

Averaged results of word error rate and processing time for the three automatic speech recognition models described above. Nvidia’s NeMo had the fewest errors and the fastest processing time on both male and female audio clips.
Averaged Performance Results of ASR Models on Male and Female Audios

Key Findings

  • With NeMo’s recent update, processing times and word error rate have dropped significantly, resulting in efficient and accurate transcription of both male and female audio.
  • Speech2Text and Wav2Vec2 had similar WER and processing times. Wav2Vec2 is slightly more accurate but takes more memory and time to process the audio. Speech2Text is much faster, with only a small increase in WER.
  • NeMo consistently used less memory and data to process the audio, which results in more efficient transcription.
  • Female audio produced a higher error rate than male audio across all ASR models. This is likely due to a lack of sufficient female voice samples in the models’ training data.
  • Wav2Vec2 could not process the female audio even when the speech input was sampled at 16kHz. The model does not appear to be trained or optimized for female audio input and used too much memory. As a result, Wav2Vec2 shows N/A in all female audio sections.

Conclusion

Overall, Nvidia’s NeMo had the best performance on both female and male audio in terms of processing time and accuracy, as measured by word error rate. Speech2Text and Wav2Vec2 performed well on male audio, but they were not comparable to NeMo. Female audio still causes issues for all three ASR models, but among open-source ASR options, Nvidia’s NeMo is the best with respect to processing time, accuracy, and memory requirements as of August 2021.

NeMo’s Application Stack

Appendix

Reference transcriptions of the Harvard sentence lists used in the audio files:

List 5: “A king ruled the state in the early days. The ship was torn apart on the sharp reef. Sickness kept him home the third week. The wide road shimmered in the hot sun. The lazy cow lay in the cool grass. Lift the square stone over the fence. The rope will bind the seven books at once. Hop over the fence and plunge in. The friendly gang left the drug store. Mesh wire keeps chicks inside.”

List 3: “The small pup gnawed a hole in the sock. The fish twisted and turned on the bent hook. Press the pants and sew a button on the vest. The swan dive was far short of perfect. The beauty of the view stunned the young boy. Two blue fish swam in the tank. Her purse was full of useless trash. The colt reared and threw the tall rider. It snowed rained and hailed the same morning. Read verse out loud for pleasure.”

List 2: “The boy was there when the sun rose. A rod is used to catch pink salmon. The source of the huge river is the clear spring. Kick the ball straight and follow through. Help the woman get back to her feet. A pot of tea helps to pass the evening. Smoky fires lack flame and heat. The soft cushion broke the man’s fall. The salt breeze came across from the sea. The girl at the booth sold fifty bonds.”

List 72: “A gold ring will please most any girl. The long journey home took a year. She saw a cat in the neighbor’s house. A pink shell was found on the sandy beach. Small children came to see him. The grass and bushes were wet with dew. The blind man counted his old coins. A severe storm tore down the barn. She called his name many times. When you hear the bell come quickly.”

List 57: “Paint the sockets in the wall dull green. The child crawled into the dense grass. Bribes fail where honest men work. Trample the spark else the flames will spread. The hilt of the sword was carved with fine designs. A round hole was drilled through the thin board. Footprints showed the path he took up the beach. She was waiting at my front lawn. A vent near the edge brought in fresh air. Prod the old mule with a crooked stick.”

List 23: “A pencil with black lead writes best. Coax a young calf to drink from a bucket. Schools for ladies teach charm and grace. The lamp shone with a steady green flame. They took the axe and the saw to the forest. The ancient coin was quite dull and worn. The shaky barn fell with a loud crash. Jazz and swing fans like fast music. Rake the rubbish up and then burn it. Slash the gold cloth into fine ribbons.”
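ASR models typically return text without punctuation or consistent casing, so reference transcripts like the ones above are usually normalized before a word error rate is computed. The article does not state how this was done here, so the following is only a sketch of one common approach: lowercase the text, remove punctuation (including curly quotes), and collapse whitespace. The function name is hypothetical.

```python
import string

# ASCII punctuation plus the curly quotes and apostrophes used in the lists above.
_PUNCTUATION = string.punctuation + "“”‘’"
_STRIP_TABLE = str.maketrans("", "", _PUNCTUATION)


def normalize_transcript(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_STRIP_TABLE).split())


print(normalize_transcript("“The soft cushion broke the man’s fall.”"))
# the soft cushion broke the mans fall
```

Note that removing apostrophes turns “man’s” into “mans”. The same normalization should be applied to the model output so that both sides are compared on equal terms.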

Khoa Tran

B.S. in Electrical and Computer Engineering and Informatics minor at the University of Washington