How to Run Whisper Speech Recognition Model and Create a Simple App

OpenAI has open-sourced a neural net called Whisper that approaches human-level robustness and accuracy in English speech recognition. I will show how to run a simple example.

Franco Vega
5 min read · Oct 1, 2022
Image generated using Stable Diffusion.

This tutorial is simply my way of getting started and seeing how OpenAI’s Whisper performs.

Seeing the news that OpenAI had open-sourced Whisper was super exciting. Speech recognition was my first challenge out of academia and has a special place in my heart.

What is speech recognition?

Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability that enables a program to process human speech into a written format.

What is Whisper?

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

Whisper architecture¹

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation¹.

How to run Whisper?

Let’s explore how to run Whisper using a Jupyter Notebook.

Step 1: Install Whisper

The following command will pull and install the latest commit from the Whisper GitHub repository, along with its Python dependencies:
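!pip install git+https://github.com/openai/whisper.git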

It also requires the command-line tool ffmpeg to be installed on your system, which is available from most package managers:
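# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on macOS using Homebrew
brew install ffmpeg

# on Windows using Chocolatey
choco install ffmpeg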

Step 2: Prepare Whisper in Python

Once Whisper is installed, you can import it into your code:

import whisper

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed².
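At the time of writing, the Whisper repository lists them roughly as follows (all values are approximate):

Size     Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
tiny     39 M         tiny.en              tiny                 ~1 GB           ~32x
base     74 M         base.en              base                 ~1 GB           ~16x
small    244 M        small.en             small                ~2 GB           ~6x
medium   769 M        medium.en            medium               ~5 GB           ~2x
large    1550 M       N/A                  large                ~10 GB          1x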

Let’s load the base and the large multilingual models for comparison. Whisper’s performance varies widely depending on the language and the size of the model.
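Loading them takes a single call each; the weights are downloaded the first time a model is requested (the variable names below are just my choice):

# Load a small multilingual model and the large multilingual model
base_model = whisper.load_model("base")
large_model = whisper.load_model("large")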

Step 3: Run Whisper

To test the power of Whisper, we will use an audio file. I will use a famous clip from The Dark Knight Rises, extracted from Moviessoundclips.net. In addition to the mp3 file, the site also provides the original transcription of the audio.

!wget -O audio.mp3 http://www.moviesoundclips.net/movies1/darkknightrises/darkness.mp3

The original transcription:

“Oh, you think darkness is your ally. But you merely adopted the dark. I was born in it, molded by it. I didn’t see the light until I was already a man, by then it was nothing to me but blinding!”

A caveat to consider is that Whisper processes audio in 30-second segments. The transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
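A minimal sketch of transcribing the downloaded clip with both models loaded above (transcribe() returns a dictionary whose "text" field holds the full transcription):

# Transcribe the clip with the base and large models
result_base = base_model.transcribe("audio.mp3")
result_large = large_model.transcribe("audio.mp3")

print(result_base["text"])
print(result_large["text"])

The transcription from the base model: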

Oh you think darkness is your ally? Are you merely adopted the dark? I was born in it, more lived by it. I didn't see the light until I was already a man, but then it was nothing to me but... the brand…

And the transcription from the large model:

Oh, you think darkness is your ally. You merely adopted the dark. I was born in it. Molded by it. I didn't see the light until I was already a man. By then it was nothing to me but blinding.

Without doing any fancy analysis, we can see that the transcription from the larger model is more accurate.

Using lower-level access to the model, we can also detect the language and transcribe the audio. Let’s give it a try:
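This sketch follows the lower-level usage shown in the Whisper repository, applied to our clip with the large model:

# Load the audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Make a log-Mel spectrogram and move it to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(large_model.device)

# Detect the spoken language
_, probs = large_model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(large_model, mel, options)
print(result.text)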

This gives the following result:

Detected language: en
Oh, you think darkness is your ally. You merely adopted the dark. I was born in it. Molded by it. I didn't see the light until I was already a man. By then it was nothing to me but blinding.

Now, to create a simple web app, we need to install Gradio. It is a great tool for building and demoing machine learning projects with a friendly web interface. I encourage you to have a look at https://gradio.app/ to get more familiar with it.

To use Gradio, you basically need three things: an input, an output, and a function.

First, we need to install Gradio:

!pip install gradio -q

Next, we build the function inference(), which takes an audio input and returns a transcription.
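A minimal version might look like this (I use the base model here for speed; the assumption is that Gradio hands the recorded or uploaded audio to the function as a file path):

def inference(audio):
    # audio is the path to the file provided by the Gradio Audio component
    result = base_model.transcribe(audio)
    return result["text"]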

And then we just need to define the Gradio interface with the following:
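This is a minimal sketch assuming the inference() function above; the title string is just a placeholder of my choosing:

import gradio as gr

demo = gr.Interface(
    fn=inference,
    # Pass the uploaded/recorded audio to inference() as a file path
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Whisper speech-to-text demo",
)

demo.launch()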

If you run the previous cell, you should get something like this:

Conclusion

In this tutorial, I covered the basic usage of Whisper by running it in Python in a Jupyter notebook. The code used in this article can be found here.

An example of the deployed app in HuggingFace spaces 🤗 can be found here.

References

1- https://openai.com/blog/whisper/

2- https://github.com/openai/whisper

