20 MB is all you need for speech-to-text

Alireza Kenarsari
Picovoice

--

Speech-to-text is a technology that automatically converts human speech into text; hence it is also known as automatic speech recognition (ASR). It is a crucial piece of technology: the amount of available spoken data is growing exponentially, and the first step towards generating any value from it is transcribing it. The rest of this post is about speech-to-text engines and why I thought it was a good idea to create one more in 2022!

Where are we?

  • Existing competitive engines are too big and slow to run on commodity hardware, and hence are either cloud-based or on-premises.
  • Motivated by cost savings, Google, Amazon, and Apple started pushing their speech-to-text engines closer to where data resides. But only for their first-party products (i.e. Google Pixel, Amazon Echo, and iPhone).
  • Third parties that don’t have Google-like resources are stuck with cloud-based or on-prem offerings.

Why another speech-to-text engine?

The current paradigm of sending audio data to someone else’s cloud infrastructure is expensive. The high cost is an inherent limitation as bandwidth and dedicated cloud-compute are costly. Also, cloud computing doesn’t help with privacy or latency. If you want to keep it private and fast, you need to bring algorithms closer to the data.

Introducing Leopard

Leopard is the speech-to-text engine we’ve built at Picovoice.

  • It performs all voice processing on-device. It is 20 MB in size and runs on virtually anything.
  • It matches cloud-level accuracy (see benchmark section below).
  • It is at least 10x more cost-effective than alternatives.
  • It allows adding custom vocabulary so that your model understands what matters to you the most.
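Processing on-device boils down to a few SDK calls. Below is a minimal sketch using Picovoice's Python SDK (`pvleopard`); the AccessKey placeholder and audio file path are illustrative, not real values.

```python
import pvleopard

# AccessKey is obtained from the Picovoice Console (placeholder below).
leopard = pvleopard.create(access_key="${ACCESS_KEY}")

# All processing happens on-device; no audio leaves the machine.
transcript, words = leopard.process_file("/path/to/audio.wav")
print(transcript)

for word in words:
    # Each word comes with start/end timestamps and a confidence score.
    print(f"{word.word}: {word.start_sec:.2f}s - {word.end_sec:.2f}s")

leopard.delete()
```

Since inference runs locally, the same code works offline once the 20 MB model is on disk.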

Where to go from here?

Start building for free! No credit card is required.

Open-Source Benchmark

Word Error Rate (WER)

The accuracy of Leopard is benchmarked against major cloud providers (Google, Amazon, Microsoft, and IBM). Mozilla's DeepSpeech is also included as an on-device alternative. The benchmark averages results across LibriSpeech, CommonVoice, and TED talks.
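For reference, WER is the word-level edit distance between the reference transcript and the engine's hypothesis (substitutions + deletions + insertions), divided by the number of words in the reference. A straightforward implementation, assuming whitespace tokenization:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()

    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )

    return dp[-1][-1] / len(ref)
```

For example, one wrong word in a four-word reference yields a WER of 0.25. Production benchmarks additionally normalize text (casing, punctuation, number formats) before scoring.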

Word error rate (WER) benchmark. Mozilla DeepSpeech, IBM Watson, Google Speech-to-Text, Picovoice Leopard, Amazon Transcribe, and Azure Cognitive Services.

Runtime

We have also included runtime metrics for both DeepSpeech and Leopard. Leopard is 60x smaller and 9x faster.

CPU and memory usage. Picovoice Leopard vs Mozilla DeepSpeech. Leopard achieves 11% WER and Mozilla DeepSpeech is at 22.86%.

--