Comparing 4 Popular Open-Source Speech-to-Text Neural Network Models

I compared pre-trained models for Vosk, NeMo QuartzNet, wav2letter, and DeepSpeech2 during my summer internship. For my company's needs, I recommended the NeMo QuartzNet model from NVIDIA.

Nick Nagari
4 min read · Jul 16, 2020

Speech-to-text software is becoming more and more popular as our relationship with technology deepens. Siri and Google Assistant are core components of modern smartphones, and many people rely on this type of software to aid their day-to-day activities.

The open-source options for Automatic Speech Recognition (ASR) software are limited. Implementing one of these free systems can bring real benefits, but the many nuances of the English language add another layer of complexity.

In this analysis, I took six audio files of men and women speaking the Harvard sentences in an American accent, sourced from the Open Speech Repository, and ran them through four different ASR neural networks at a sample rate of 16 kHz. During testing, I noticed that some of the audio spoken by women was lower quality, but I decided to include those files to see how accurately the ASRs would transcribe them despite the issues.

The speech-to-text systems I used were Vosk, NeMo, wav2letter, and DeepSpeech2. I compared model load time, inference time, and word error rate (WER). Each ASR has good documentation and unique features, which are highlighted below.
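
Since WER drives most of the conclusions below, it is worth spelling out: WER is the minimum number of word substitutions, deletions, and insertions needed to turn the reference transcript into the ASR's output, divided by the number of words in the reference. Here is a minimal sketch of the computation in Python (the function name and the lowercasing step are my own choices, not taken from any of the toolkits):

    def wer(reference, hypothesis):
        """Word error rate: word-level edit distance over reference length."""
        ref, hyp = reference.lower().split(), hypothesis.lower().split()
        # d[i][j] = fewest edits turning the first i reference words
        # into the first j hypothesis words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i          # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j          # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                deletion = d[i - 1][j] + 1
                insertion = d[i][j - 1] + 1
                d[i][j] = min(substitution, deletion, insertion)
        return d[len(ref)][len(hyp)] / len(ref)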

About Vosk:

Vosk is speech-to-text software made by Alpha Cephei. It comes ready to transcribe multiple languages, including English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, and Vietnamese. Vosk also runs on edge devices, with small models that fit mobile phones or IoT applications. Its language models range from 36MB to 3.2GB, with the larger models allowing for better transcription accuracy. Vosk can be easily implemented with a simple Python script built around KaldiRecognizer, the class that accepts raw audio and produces transcriptions. In this analysis, I used the 'daanzu' model.
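
For reference, a minimal Vosk transcription script looks something like this (the model directory and WAV filename are placeholders, and the audio is assumed to be 16 kHz, 16-bit mono PCM):

    import json
    import wave

    from vosk import Model, KaldiRecognizer

    wf = wave.open("sample.wav", "rb")            # 16 kHz, 16-bit mono PCM
    model = Model("model")                        # path to an unpacked Vosk model
    rec = KaldiRecognizer(model, wf.getframerate())

    # Feed the audio to the recognizer in chunks
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)

    print(json.loads(rec.FinalResult())["text"])  # final transcription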

About NeMo:

NeMo (Neural Modules) was developed by NVIDIA. Neural modules are building blocks that take typed inputs (such as a .wav file) and produce typed outputs (such as the transcription). Like Vosk, NeMo offers multiple models, allowing transcription accuracy to be traded against inference time. It can be implemented in a simple Python script, without needing a preprocessor to prepare the audio. In this analysis, I used the 'QuartzNet15x5' model.
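
As a rough sketch, transcribing with a pretrained QuartzNet checkpoint through NeMo's ASR collection looks like this (the model name 'QuartzNet15x5Base-En' and the file path follow NVIDIA's published examples, and the exact API has shifted between NeMo releases):

    import nemo.collections.asr as nemo_asr

    # Downloads and loads the pretrained QuartzNet checkpoint
    asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
        model_name="QuartzNet15x5Base-En"
    )

    # transcribe() takes a list of audio file paths, no preprocessor needed
    transcriptions = asr_model.transcribe(["sample.wav"])
    print(transcriptions[0])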

About wav2letter:

wav2letter was made by Facebook AI Research. It offers both pre-trained and trainable models. Compared to NeMo and Vosk, it was tedious to get the necessary components installed, but once everything was working properly I did not encounter any further issues. In this analysis, I used the pre-trained model included in the wav2letter download.

About DeepSpeech:

DeepSpeech was developed by Mozilla. Installation and use require much less effort than Vosk, NeMo, or wav2letter. It also includes extra features, such as the ability to attach a microphone for live transcription. In this analysis, I used the pre-trained model included in the DeepSpeech2 download.
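
For comparison, transcription with Mozilla's DeepSpeech Python package is only a few lines (the model filenames follow the 0.x release naming and are placeholders; the audio is again assumed to be 16 kHz, 16-bit mono PCM):

    import wave

    import deepspeech
    import numpy as np

    # Load the acoustic model and the optional external scorer
    model = deepspeech.Model("deepspeech-0.7.4-models.pbmm")
    model.enableExternalScorer("deepspeech-0.7.4-models.scorer")

    # Read the whole WAV file into a 16-bit sample buffer
    wf = wave.open("sample.wav", "rb")
    audio = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)

    print(model.stt(audio))  # returns the transcription as a string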

[Chart: ASR Performance on Male Audio]
[Chart: ASR Performance on Female Audio]
[Table: ASR Performance Results]

Key Findings:

  • NeMo performs very well on clear audio files, but poorer quality files cause a steep increase in WER
  • wav2letter performs the most consistently across varying levels of audio quality
  • Vosk is less accurate and slower than NeMo and wav2letter
  • DeepSpeech2 has the slowest transcription time, and its WER increases drastically as audio quality drops

Overall, NeMo performs the best in terms of transcription time and can be very accurate (as seen in the male audio results). wav2letter performs the most consistently across the board, in both transcription time and WER. Paid options are also available, but free open-source ASRs are becoming more and more promising.
