OpenAI: Introducing Whisper

Sep 23, 2022

Speech recognition is the ability of a machine or program to identify words spoken aloud and convert them into readable text. The most basic speech recognition software can identify only a limited number of words and phrases, and only when they are spoken very clearly. More advanced software can handle natural speech, different accents, and multiple languages. The technology draws on research in computer science, linguistics, and engineering, and many common devices and text-based programs now include it to support hands-free operation. OpenAI has introduced Whisper, an open-source neural network trained to approach human-level robustness and accuracy on English speech recognition.

Speech Recognition

Speech recognition is the ability of a computer program to process human speech and convert it into written text. This is different from voice recognition, which simply seeks to identify an individual user by their voice.

Speech recognition is used to identify words in spoken language.
Voice recognition is a biometric technology for identifying an individual’s voice.

Speech recognition systems use computer algorithms to process and interpret spoken words and convert them into text. A software program turns the sound a microphone records into written language that computers and humans can understand. This is done by following these four steps:

analyze the audio
break it into parts
digitize it into a computer-readable format
use an algorithm to match it to the most suitable text representation
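The four steps above can be sketched in a few lines of Python. This is only a toy illustration, not how Whisper or any production recognizer works: the function names, the frame sizes, the energy feature, and the `templates` dictionary are all assumptions made up for this example, and the steps are applied in a simplified order.

```python
import math

def digitize(samples, bits=16):
    """Quantize floats in [-1.0, 1.0] to signed integers (a computer-readable format)."""
    max_int = 2 ** (bits - 1) - 1
    return [round(max(-1.0, min(1.0, s)) * max_int) for s in samples]

def split_into_frames(samples, frame_size, hop):
    """Break the signal into overlapping, fixed-size frames."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, hop)]

def frame_energy(frame):
    """Analyze one frame: mean squared amplitude as a very crude feature."""
    return sum(s * s for s in frame) / len(frame)

def best_match(features, templates):
    """Pick the label whose (hypothetical) template is closest to the features."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(templates, key=lambda label: distance(features, templates[label]))

# One second of a 440 Hz tone sampled at 8 kHz stands in for recorded audio.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
pcm = digitize(tone)
frames = split_into_frames(pcm, frame_size=400, hop=200)
features = [frame_energy(f) for f in frames]
```

Real systems replace the energy feature with spectrograms and the nearest-template lookup with a trained statistical or neural model, but the overall flow is similar.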

The algorithms that process and organize audio into text for speech recognition software must be trained on different speech patterns, speaking styles, languages, dialects, accents and phrasings to adapt to the highly variable and context-specific nature of human speech. The software also separates spoken audio from background noise that often accompanies the signal.

Whisper

Whisper is an automatic speech recognition (ASR) system that transcribes speech in multiple languages. It is trained on 680,000 hours of multilingual and multitask supervised data collected from the web, which improves its robustness to accents, background noise, and technical language. OpenAI has open-sourced the models and inference code so that people can build useful applications on them and conduct further research on robust speech processing.
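To give a feel for how the open-source release can be used, here is a minimal sketch built on the `openai-whisper` Python package. It assumes the package and `ffmpeg` are installed, and the file name in the usage comment is a placeholder. The two formatting helpers are my own additions for readability, not part of the library.

```python
def format_timestamp(seconds):
    """Render a time in seconds as MM:SS.mmm."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:06.3f}"

def format_segments(segments):
    """Turn Whisper-style segments (dicts with start, end, text) into readable lines."""
    return [
        f"[{format_timestamp(s['start'])} -> {format_timestamp(s['end'])}] {s['text'].strip()}"
        for s in segments
    ]

def transcribe_file(path, model_name="base"):
    """Transcribe an audio file; returns the full text and per-segment lines."""
    # Imported lazily so the helpers above work without the package installed.
    import whisper  # pip install -U openai-whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return result["text"], format_segments(result["segments"])

# Usage (assumes a local file named audio.mp3 exists):
# text, lines = transcribe_file("audio.mp3")
# print(text)
```

Larger model names generally trade speed for accuracy, so "base" is just a convenient starting point.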

Architecture

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer.

Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and passed into the encoder. The decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
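The 30-second chunking step is easy to picture in code. The sketch below is a plain-Python illustration, not Whisper's actual preprocessing: the function name is hypothetical, the 16 kHz default sample rate is an assumption taken from the reference implementation, and the log-Mel spectrogram step is left out.

```python
def chunk_audio(samples, sample_rate=16000, chunk_seconds=30):
    """Split a list of audio samples into fixed-length chunks.

    The final chunk is zero-padded so every chunk has the same length.
    """
    chunk_len = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = list(samples[start:start + chunk_len])
        # Pad a short final chunk with silence.
        chunk.extend([0.0] * (chunk_len - len(chunk)))
        chunks.append(chunk)
    return chunks

# 70 "seconds" of audio at a toy rate of 1 sample per second gives 3 chunks.
toy_chunks = chunk_audio([1.0] * 70, sample_rate=1, chunk_seconds=30)
```

Fixed-length inputs keep the encoder's input shape constant, which is why the short last chunk is padded rather than passed through as is.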

Whisper is a speech recognition model that was trained on a large and diverse dataset.

Performance

Because Whisper was trained on a large and diverse dataset rather than fine-tuned to a specific one, it does not outperform models that specialize in a single benchmark such as LibriSpeech. Other existing approaches frequently use smaller, more closely paired audio-text training datasets, or use broad but unsupervised audio pre-training. However, when measured zero-shot across many diverse datasets, Whisper is much more robust and makes 50% fewer errors than those models.
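To make "fewer errors" concrete, here is a small sketch that assumes errors are counted as word error rate (WER), the usual metric for speech recognition. The function names and example sentences are mine and only illustrate the arithmetic.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def relative_error_reduction(baseline_wer, new_wer):
    """Fraction of the baseline's errors that the new system removes."""
    return (baseline_wer - new_wer) / baseline_wer

# One substitution and one deletion against six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Under this reading, a model with a WER of 0.10 makes 50% fewer errors than one with a WER of 0.20 on the same data.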

English vs Non-English

Approximately one-third of Whisper’s audio dataset is in languages other than English. During training, the model is alternately given the task of transcribing this audio in its original language or translating it into English. OpenAI reports that this approach is particularly effective at teaching speech-to-text translation, and that Whisper outperforms the supervised state of the art on CoVoST2 to-English translation in a zero-shot setting, that is, without training on CoVoST2 data.
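How does one model know whether to transcribe or translate? In the multitask format described in the Whisper paper, the choice is signalled by special tokens placed at the start of the decoder's output. The sketch below builds that prefix as plain strings. The function name is hypothetical and the code does not touch the real tokenizer.

```python
def task_prefix(language, task, timestamps=False):
    """Build the special-token prefix that tells the decoder what to do."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask for plain text without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens

# Japanese audio transcribed in Japanese versus translated into English.
transcribe_ja = task_prefix("ja", "transcribe")
translate_ja = task_prefix("ja", "translate")
```

In the `openai-whisper` package the same choice is exposed as the `task` option, for example `model.transcribe(path, task="translate")`.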

OpenAI hopes that Whisper’s high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications. Check out the paper, model card, and code to learn more details and to try out Whisper.

Conclusion

The technology of speech recognition is constantly improving. This makes it possible for people to communicate with computers without having to type. This technology is used in many different business applications. Speech recognition programs have come a long way in the last 60 years. They continue to get better, especially because of artificial intelligence. Whisper is just one of the advancements in this field. There’s way more to come!

For more interesting and exciting blogs, stay tuned!

Follow me: M. Haseeb Hassan