Two minutes NLP — Speech Recognition options with Python

DeepSpeech, SpeechBrain, SpeechRecognition, Speech-to-Text APIs

Fabio Chiusano

Published in

NLPlanet

3 min read · Dec 6, 2021

Speech-related tasks overview

Automatic Speech Recognition (ASR) is the task of transforming speech to text. Other common speech-related tasks are:

Spoken Language Understanding: speech-to-semantics.
Speaker Recognition: identifying or verifying speaker identities from speech recordings.
Speech Enhancement: improving the quality of the speech signal by removing noise.
Speech Separation: separating multiple speakers speaking at the same time.
Speaker Diarization: detecting who spoke when.
Multi-microphone signal processing: combining the information recorded by multiple microphones.

Open-source Speech Recognition

The biggest drawback of open-source solutions is that the computing power required for speech recognition has to come from your own hardware. Open-source options are also usually less accurate than cloud-based APIs, so if accuracy is critical to your project, you’re probably better off with a cloud solution.

CMU Sphinx: collects over 20 years of CMU research. Its tools are designed specifically for low-resource platforms, have a flexible design, and focus on practical application development rather than on research.
DeepSpeech: an open-source speech-to-text engine from Mozilla, based on the Deep Speech paper by Baidu’s research team. It can run offline and on-device, on hardware ranging from a Raspberry Pi to the high-end GPUs used to train models in industry.
SpeechBrain: an open-source, all-in-one speech toolkit. It is designed to make the research and development of neural speech processing technologies easier by being simple, flexible, user-friendly, and well-documented. It integrates with HuggingFace Transformers.
SpeechRecognition: an open-source Python wrapper around several speech recognition engines and APIs, covering both open-source engines and closed-source cloud services.
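
As a rough illustration of the wrapper approach, here is a minimal sketch that transcribes an audio file with the SpeechRecognition package. The helper name transcribe_file and the ENGINE_METHODS table are made up for this example; the sketch assumes the package is installed (pip install SpeechRecognition), that pocketsphinx is also installed if you pick the offline engine, and that the input is a WAV, AIFF, or FLAC file.

```python
# Hypothetical lookup table for this sketch: short engine name -> Recognizer method.
ENGINE_METHODS = {
    "sphinx": "recognize_sphinx",  # offline CMU Sphinx, needs the pocketsphinx package
    "google": "recognize_google",  # Google Web Speech API, needs an internet connection
}


def transcribe_file(path: str, engine: str = "sphinx") -> str:
    """Return the transcript of an audio file, or "" if the speech is unintelligible."""
    if engine not in ENGINE_METHODS:
        raise ValueError(f"unknown engine {engine!r}, expected one of {sorted(ENGINE_METHODS)}")

    # Imported lazily so the engine check above works without the package installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file into memory

    try:
        return getattr(recognizer, ENGINE_METHODS[engine])(audio)
    except sr.UnknownValueError:
        # The engine ran but could not make sense of the audio.
        return ""
    except sr.RequestError as exc:
        # The engine is missing, or the remote API could not be reached.
        raise RuntimeError(f"{engine} recognition failed: {exc}") from exc
```

Calling transcribe_file("meeting.wav") would use the offline engine, while transcribe_file("meeting.wav", engine="google") sends the same audio to a cloud API. The file name is a placeholder. Keeping the engine behind one argument makes it easy to try both kinds of solution on the same recording.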

You can find more comparisons of open-source speech recognition libraries here.

Cloud-based Speech Recognition

Cloud solutions for building a speech recognition project have three big advantages: they are easy to use, they are more accurate than open-source options, and they don’t require you to host any models on your own hardware. The main drawback of some cloud solutions is the cost.

Examples of closed-source cloud solutions are Google Cloud Speech-to-Text API, Wit.ai, Microsoft Azure Speech, Houndify API, and IBM Speech to Text.
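
If accuracy drives the choice between an open-source and a cloud option, it helps to measure it on your own audio. The article does not name a metric; a common one is word error rate (WER), the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. Below is a minimal sketch, assuming simple lowercase, whitespace-based tokenization. The function name and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up transcripts: one substituted word out of five reference words.
print(word_error_rate("turn on the kitchen lights", "turn on the kitchen light"))  # 0.2
```

Running the same recordings through two engines and comparing their WER against hand-written reference transcripts gives a project-specific view of the accuracy gap. Lower is better, and the value can exceed 1.0 when the output contains many extra words.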
