How to Run Whisper Speech Recognition Model and Create a Simple App

OpenAI has open-sourced a neural net called Whisper that approaches human-level robustness and accuracy in English speech recognition. I will show how to run a simple example.

Franco Vega
5 min read · Oct 1, 2022
Image generated using Stable Diffusion.

This tutorial is simply my way of getting started and seeing how OpenAI’s Whisper performs.

Seeing the news that OpenAI had open-sourced Whisper was super exciting. Speech recognition was my first challenge out of academia and has a special place in my heart.

What is speech recognition?

Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability that enables a program to process human speech into a written format.

What is Whisper?

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

Whisper architecture¹

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation¹.

How to run Whisper?

Let’s explore how to run Whisper using a Jupyter Notebook.

Step 1: Install Whisper

The following command will pull and install the latest commit from the Whisper GitHub repository, along with its Python dependencies:
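!pip install git+https://github.com/openai/whisper.git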

It also requires the command-line tool ffmpeg to be installed on your system, which is available from most package managers:
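# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on macOS using Homebrew
brew install ffmpeg

# on Windows using Chocolatey
choco install ffmpeg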

Step 2: Prepare Whisper in Python

Once Whisper is installed, you can import it into your code:

import whisper

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed².
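At the time of writing, the Whisper repository lists them roughly as follows (all values are approximate):

Size     Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
tiny     39 M         tiny.en              tiny                 ~1 GB           ~32x
base     74 M         base.en              base                 ~1 GB           ~16x
small    244 M        small.en             small                ~2 GB           ~6x
medium   769 M        medium.en            medium               ~5 GB           ~2x
large    1550 M       N/A                  large                ~10 GB          1x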

Let’s load the base and the large multilingual models for comparison. Whisper’s performance varies widely depending on the language and the size of the model.
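Loading them takes a single call each; the weights are downloaded the first time a model is requested (the variable names below are just my choice):

# Load a small multilingual model and the large multilingual model
base_model = whisper.load_model("base")
large_model = whisper.load_model("large")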

Step 3: Run Whisper

To test the power of Whisper, we will use an audio file. I will use a famous clip from The Dark Knight Rises, extracted from Moviessoundclips.net. In addition to the mp3 file, the site also provides the original transcription of the audio.

!wget -O audio.mp3 http://www.moviesoundclips.net/movies1/darkknightrises/darkness.mp3

The original transcription:

“Oh, you think darkness is your ally. But you merely adopted the dark. I was born in it, molded by it. I didn’t see the light until I was already a man, by then it was nothing to me but blinding!”

A caveat to consider is that Whisper processes audio in 30-second segments. The transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
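A minimal sketch of transcribing the downloaded clip with both models loaded above (transcribe() returns a dictionary whose "text" field holds the full transcription):

# Transcribe the clip with the base and large models
result_base = base_model.transcribe("audio.mp3")
result_large = large_model.transcribe("audio.mp3")

print(result_base["text"])
print(result_large["text"])

The transcription from the base model: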

Oh you think darkness is your ally? Are you merely adopted the dark? I was born in it, more lived by it. I didn't see the light until I was already a man, but then it was nothing to me but... the brand…

And the transcription from the large model:

Oh, you think darkness is your ally. You merely adopted the dark. I was born in it. Molded by it. I didn't see the light until I was already a man. By then it was nothing to me but blinding.

Without doing any fancy analysis, we can see that the transcription from the larger model is more accurate.

Using lower-level access to the model, we can also detect the language and transcribe the audio. Let’s give it a try:
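This sketch follows the lower-level usage shown in the Whisper repository, applied to our clip with the large model:

# Load the audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Make a log-Mel spectrogram and move it to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(large_model.device)

# Detect the spoken language
_, probs = large_model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(large_model, mel, options)
print(result.text)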

This gives the following result:

Detected language: en
Oh, you think darkness is your ally. You merely adopted the dark. I was born in it. Molded by it. I didn't see the light until I was already a man. By then it was nothing to me but blinding.

Now, to create a simple web app, we need to install Gradio. It is a great tool for building and demoing machine learning projects with a friendly web interface. I encourage you to have a look at https://gradio.app/ to get more familiar with it.

To use Gradio, you basically need three things: an input, an output, and a function.

First, we need to install Gradio:

!pip install gradio -q

Next, we build the function inference(), which takes an audio input and returns a transcription.
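A minimal version might look like this (I use the base model here for speed; the assumption is that Gradio hands the recorded or uploaded audio to the function as a file path):

def inference(audio):
    # audio is the path to the file provided by the Gradio Audio component
    result = base_model.transcribe(audio)
    return result["text"]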

And then we just need to define the Gradio interface with the following:
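This is a minimal sketch assuming the inference() function above; the title string is just a placeholder of my choosing:

import gradio as gr

demo = gr.Interface(
    fn=inference,
    # Pass the uploaded/recorded audio to inference() as a file path
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Whisper speech-to-text demo",
)

demo.launch()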

If you run the previous cell, you should get something like this:

Conclusion

In this tutorial, I covered the basic usage of Whisper by running it in Python in a Jupyter notebook. The code used in this article can be found here.

An example of the deployed app in HuggingFace spaces 🤗 can be found here.

References

1- https://openai.com/blog/whisper/

2- https://github.com/openai/whisper

