
20 MB is all you need for speech-to-text

Speech-to-Text is the technology that automatically converts human speech into text; hence, it is also known as automatic speech recognition (ASR). It is a crucial piece of technology: the amount of available spoken data is growing exponentially, and the first step towards generating any value from it is transcribing it. The rest of this post is about speech-to-text engines and why I thought it was a good idea to create one more in 2022!

Where are we?

  • Existing competitive engines are too big and slow to run on commodity hardware, and hence are offered either as cloud APIs or as on-premise deployments.
  • Motivated by cost savings, Google, Amazon, and Apple started pushing their speech-to-text engines closer to where data resides. But only for their first-party products (i.e. Google Pixel, Amazon Echo, and iPhone).
  • Third parties that don’t have Google-like resources are stuck with cloud-based or on-prem offerings.

Why another speech-to-text engine?

The current paradigm of sending audio data to someone else’s cloud infrastructure is expensive. The high cost is an inherent limitation as bandwidth and dedicated cloud-compute are costly. Also, cloud computing doesn’t help with privacy or latency. If you want to keep it private and fast, you need to bring algorithms closer to data.

Introducing Leopard

Leopard is the speech-to-text engine we’ve built at Picovoice.

  • It performs all voice processing on-device. It is 20 MB in size and runs on virtually anything.
  • It matches cloud-level accuracy (see benchmark section below).
  • It is at least 10x more cost-effective than the alternatives.
  • It allows adding custom vocabulary, so your model understands what matters most to you.

Where to go from here?

Start building for free! No credit card required. Picovoice gives away 100 hours of free speech-to-text per month.

Open-Source Benchmark

Word Error Rate (WER)

We benchmarked Leopard’s accuracy against major cloud providers (Google, Amazon, Microsoft, and IBM), and included Mozilla’s DeepSpeech as an on-device alternative. The reported figures are averaged across LibriSpeech, CommonVoice, and TED talks.

Word error rate (WER) benchmark. Mozilla DeepSpeech, IBM Watson, Google Speech-to-Text, Picovoice Leopard, Amazon Transcribe, and Azure Cognitive Services.
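For reference, WER is the word-level edit distance (substitutions, insertions, and deletions) between the hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch in Python — this is not Picovoice’s benchmark code; the function name and example sentences are illustrative:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

In practice, benchmark harnesses also normalize punctuation and number formatting before scoring, since those conventions differ across engines.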


We have also included runtime metrics for both DeepSpeech and Leopard: Leopard is 60x smaller and 9x faster.

CPU and memory usage. Picovoice Leopard vs Mozilla DeepSpeech. Leopard achieves 11% WER, while Mozilla DeepSpeech is at 22.86%.
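From the numbers in the caption above, the relative improvement can be computed directly — a quick arithmetic check, not additional benchmark data:

```python
leopard_wer = 0.11       # Leopard WER from the figure above
deepspeech_wer = 0.2286  # Mozilla DeepSpeech WER from the figure above

# Relative WER reduction: Leopard roughly halves DeepSpeech's error rate.
relative_reduction = (deepspeech_wer - leopard_wer) / deepspeech_wer
print(f"{relative_reduction:.1%}")  # 51.9%
```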




Picovoice is the end-to-end platform for building voice products on your terms. Unlike Alexa and Google services, Picovoice runs entirely on-device while being more accurate.

Alireza Kenarsari


Founder at Picovoice
