Two minutes NLP — Speech Recognition options with Python

DeepSpeech, SpeechBrain, SpeechRecognition, Speech-to-Text APIs

Fabio Chiusano
NLPlanet
3 min read · Dec 6, 2021

Speech-related tasks overview

Automatic Speech Recognition (ASR) is the task of transforming speech to text. Other common speech-related tasks are:

  • Spoken Language Understanding: extracting meaning, such as intents and entities, directly from speech (speech-to-semantics).
  • Speaker Recognition: identifying or verifying speaker identities from speech recordings.
  • Speech Enhancement: improving the quality of the speech signal by removing noise.
  • Speech Separation: separating multiple speakers speaking at the same time.
  • Speaker Diarization: detecting who spoke when.
  • Multi-microphone signal processing: combining the information recorded by multiple microphones.

Open-source Speech Recognition

The biggest drawback of open-source solutions is that you have to supply the computing power for speech recognition yourself, on your own hardware. Open-source options are also usually less accurate than cloud-based APIs, so if accuracy is critical to your project, you’re probably better off with a cloud solution.
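Accuracy comparisons like this are commonly expressed as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into a reference transcript, divided by the number of reference words. The function below is a minimal sketch for checking candidate engines on your own audio. The name `word_error_rate` and the sample sentences are illustrative, and it only lowercases and splits on whitespace, so punctuation is not normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative transcripts: one substituted word out of five.
reference = "turn on the kitchen lights"
hypothesis = "turn on the chicken lights"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")  # WER: 0.20
```

Lower is better, and a WER above 1.0 is possible when the hypothesis contains many inserted words.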

  • CMU Sphinx: builds on over 20 years of CMU research. Its tools are designed specifically for low-resource platforms, have a flexible design, and focus on practical application development rather than research.
  • DeepSpeech: started as a paper on speech recognition techniques by Baidu’s research team; the open-source engine of the same name is Mozilla’s implementation of that approach. It runs offline and on-device, on hardware ranging from a Raspberry Pi to the GPUs used to train models in industry.
  • SpeechBrain: an open-source, all-in-one speech toolkit. It is designed to make the research and development of neural speech processing technologies easier by being simple, flexible, user-friendly, and well-documented. It integrates with HuggingFace transformers.
  • SpeechRecognition: an open-source Python wrapper around several speech recognition engines and APIs, covering both open-source engines and closed-source cloud services.
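As a rough sketch of what offline recognition can look like, the function below uses the SpeechRecognition package with its CMU Sphinx backend. It assumes `SpeechRecognition` and `pocketsphinx` are installed (for example with pip) and that the input is a WAV, AIFF, or FLAC file. The function name `transcribe_offline` is made up for this example.

```python
from pathlib import Path


def transcribe_offline(audio_path: str) -> str:
    """Transcribe a WAV/AIFF/FLAC file locally with CMU Sphinx."""
    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(f"audio file not found: {path}")

    # Imported lazily so the missing-file check works even when the
    # optional third-party packages are not installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(str(path)) as source:
        audio = recognizer.record(source)  # read the whole file into memory

    try:
        # Runs entirely on the local machine; no network access is needed.
        return recognizer.recognize_sphinx(audio)
    except sr.UnknownValueError:
        return ""  # Sphinx could not make sense of the audio
```

Calling `transcribe_offline("meeting.wav")` (a hypothetical file) returns the recognized text, or an empty string if nothing intelligible was found. Everything runs on your own CPU, which is the trade-off described above.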

You can find more comparisons of open-source speech recognition libraries here.

Cloud-based Speech Recognition

Cloud solutions for building a speech recognition project have the big advantages of being easy to use, being more accurate than open-source options, and not requiring you to host any models on your own hardware. The main drawback of some cloud solutions is the cost.

Examples of closed-source cloud solutions are Google Cloud Speech-to-Text API, Wit.ai, Microsoft Azure Speech, Houndify API, and IBM Speech to Text.
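For comparison, here is a hedged sketch of calling one of these services, Wit.ai, through the SpeechRecognition wrapper. It assumes the `SpeechRecognition` package is installed and that you have a Wit.ai API key. The environment variable name `WIT_AI_KEY` and the function name `transcribe_with_wit` are choices made for this example, not part of any official API.

```python
import os
from pathlib import Path
from typing import Optional


def transcribe_with_wit(audio_path: str, api_key: Optional[str] = None) -> str:
    """Send a WAV/AIFF/FLAC file to Wit.ai and return the transcript."""
    # WIT_AI_KEY is an assumed variable name; use whatever your setup defines.
    key = api_key or os.environ.get("WIT_AI_KEY")
    if not key:
        raise ValueError("no Wit.ai API key given (argument or WIT_AI_KEY)")

    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(f"audio file not found: {path}")

    # Imported lazily so the argument checks work without the package.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(str(path)) as source:
        audio = recognizer.record(source)

    try:
        # The audio is uploaded to the service; recognition happens remotely.
        return recognizer.recognize_wit(audio, key=key)
    except sr.UnknownValueError:
        return ""  # the service returned no usable transcript
    except sr.RequestError as error:
        raise RuntimeError(f"Wit.ai request failed: {error}") from error
```

The structure mirrors local recognition, but the model runs on the provider's servers, so you need network access and credentials, and usage may be billed depending on the service.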
