Two minutes NLP — Speech Recognition options with Python
DeepSpeech, SpeechBrain, SpeechRecognition, Speech-to-Text APIs
Speech-related tasks overview
Automatic Speech Recognition (ASR) is the task of transforming speech to text. Other common speech-related tasks are:
- Spoken Language Understanding: mapping speech directly to its meaning (speech-to-semantics).
- Speaker Recognition: identifying or verifying speaker identities from speech recordings.
- Speech Enhancement: improving the quality of the speech signal by removing noise.
- Speech Separation: separating multiple speakers speaking at the same time.
- Speaker Diarization: detecting who spoke when.
- Multi-microphone signal processing: combining the information recorded by multiple microphones.
Open-source Speech Recognition
The biggest drawback of open-source solutions is that the computing power for speech recognition has to come from your own hardware. Open-source options are also usually less accurate than cloud-based APIs, so you’re probably better off with a cloud solution if accuracy is important to your project.
- CMU Sphinx: builds on over 20 years of CMU research. Its tools are designed specifically for low-resource platforms, have a flexible design, and focus on practical application development rather than research.
- DeepSpeech: an open-source engine from Mozilla, based on the Deep Speech paper published by Baidu’s research team. It can run offline and on-device, on hardware ranging from a Raspberry Pi to the high-end GPUs used to train models in industry.
- SpeechBrain: an open-source, all-in-one speech toolkit. It is designed to make the research and development of neural speech processing technologies easier by being simple, flexible, user-friendly, and well-documented. It integrates with HuggingFace transformers.
- SpeechRecognition: an open-source wrapper around several speech recognition engines and APIs, covering both open-source engines and closed-source cloud services.
You can find more comparisons of open-source speech recognition libraries here.
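As a rough illustration of how the SpeechRecognition wrapper exposes different engines behind one interface, here is a minimal sketch. It assumes the `SpeechRecognition` package is installed (plus `pocketsphinx` for the offline backend). The `BACKENDS` table, the `transcribe` helper, and the file name are hypothetical names introduced for this example; only `Recognizer`, `AudioFile`, `record`, `recognize_sphinx`, and `recognize_google` come from the library.

```python
# Hypothetical helper: short backend names mapped to Recognizer methods.
BACKENDS = {
    "sphinx": "recognize_sphinx",  # CMU Sphinx, runs offline (needs pocketsphinx)
    "google": "recognize_google",  # Google Web Speech API, needs network access
}


def transcribe(path: str, backend: str = "sphinx") -> str:
    """Transcribe a WAV/AIFF/FLAC file with the chosen backend."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")

    # Imported lazily so the module loads even without the package installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file into memory

    recognize = getattr(recognizer, BACKENDS[backend])
    try:
        return recognize(audio)
    except sr.UnknownValueError:
        # The engine could not make sense of the audio.
        return ""
```

With a recording at hand, `transcribe("sample.wav")` would run fully offline, while `transcribe("sample.wav", backend="google")` sends the audio to a cloud service. Switching between the two is a one-argument change, which is the main appeal of a wrapper library.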
Cloud-based Speech Recognition
Cloud solutions for building a speech recognition project have the big advantages of being easy to use, being more accurate than open-source options, and not requiring you to host any models on your own hardware. The main drawback of some cloud solutions is the cost.
Examples of closed-source cloud solutions are Google Cloud Speech-to-Text API, Wit.ai, Microsoft Azure Speech, Houndify API, and IBM Speech to Text.
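If accuracy drives your choice between an open-source engine and a cloud API, it helps to measure it on your own audio. A common metric is word error rate (WER): the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. The sketch below is a minimal, dependency-free version. The function name and the sample sentences are made up for illustration, and the normalization (lowercasing and whitespace splitting) is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up transcripts: one substitution and one deletion over 6 reference words.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Running the same set of recordings through each candidate engine and comparing the average WER gives a more reliable picture than general claims about which option is more accurate.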