How to Run Whisper Speech Recognition Model and Create a Simple App
OpenAI is open-sourcing a neural net called Whisper that approaches human-level robustness and accuracy in English speech recognition. In this post, I will show how to run a simple example.
This tutorial is meant as a quick start: just enough to get going and see how OpenAI's Whisper performs.
Seeing the news about OpenAI open-sourcing Whisper was super exciting. Speech recognition was my first challenge out of academia and has a special place in my heart.
What is speech recognition?
Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability that enables a program to process human speech into a written format.
What is Whisper?
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation¹.
How to run Whisper?
Let’s explore how to run Whisper using a Jupyter Notebook.
Step 1: Install Whisper
The following command will pull and install the latest commit from this repository, along with its Python dependencies.
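```bash
pip install git+https://github.com/openai/whisper.git
```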
It also requires the command-line tool ffmpeg to be installed on your system, which is available from most package managers:
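For example, per the Whisper README:

```bash
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
```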
Step 2: Prepare Whisper in Python
Once Whisper is installed, you can import it into your code:
import whisper
There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed².
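For reference, this is the table from the Whisper README at the time of writing (values are approximate):

| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|--------|------------|--------------------|--------------------|---------------|----------------|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
| base | 74 M | base.en | base | ~1 GB | ~16x |
| small | 244 M | small.en | small | ~2 GB | ~6x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |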
Let’s load the base and the large multilingual models for comparison. Whisper’s performance varies widely depending on the language and the size of the model.
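A minimal sketch using whisper.load_model(); the variable names model_base and model_large are my own:

```python
import whisper

# Load the multilingual base and large checkpoints.
# The weights are downloaded automatically on first use;
# the large model needs roughly 10 GB of VRAM.
model_base = whisper.load_model("base")
model_large = whisper.load_model("large")
```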
Step 3: Run Whisper
To test the power of Whisper, we will use an audio file. I will use a famous audio clip from The Dark Knight Rises, extracted from Moviesoundclips.net. In addition to the mp3 file, there is also the original transcription of the audio.
!wget -O audio.mp3 http://www.moviesoundclips.net/movies1/darkknightrises/darkness.mp3
The original transcription:
“Oh, you think darkness is your ally. But you merely adopted the dark. I was born in it, molded by it. I didn’t see the light until I was already a man, by then it was nothing to me but blinding!”
A caveat to consider is that Whisper processes audio in 30-second segments. The transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
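Here is a minimal sketch of that call, reusing the model_base and model_large objects loaded above:

```python
# transcribe() reads the whole file, handles the 30-second
# windowing internally, and returns a dict with the full text.
result_base = model_base.transcribe("audio.mp3")
result_large = model_large.transcribe("audio.mp3")

print(result_base["text"])
print(result_large["text"])
```

The transcription from the base model: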
Oh you think darkness is your ally? Are you merely adopted the dark? I was born in it, more lived by it. I didn't see the light until I was already a man, but then it was nothing to me but... the brand…
And the transcription from the large model:
Oh, you think darkness is your ally. You merely adopted the dark. I was born in it. Molded by it. I didn't see the light until I was already a man. By then it was nothing to me but blinding.
Even without any fancy analysis, we can see that the transcription from the larger model is more accurate.
Using lower-level access to the model we can also detect the language and transcribe the audio. Let’s give it a try:
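The following sketch follows the usage example in the Whisper README, applied to the large model loaded above; note that decode() works on a single 30-second window rather than the whole file:

```python
# Load the audio and pad/trim it to fit a 30-second window.
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and move it to the model's device.
mel = whisper.log_mel_spectrogram(audio).to(model_large.device)

# Detect the spoken language.
_, probs = model_large.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the audio window.
options = whisper.DecodingOptions()
result = whisper.decode(model_large, mel, options)
print(result.text)
```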
This gives the following result:
Detected language: en

Oh, you think darkness is your ally. You merely adopted the dark. I was born in it. Molded by it. I didn't see the light until I was already a man. By then it was nothing to me but blinding.
Now, to create a simple web app, we need to install Gradio. It is a great tool for building and demoing machine learning projects with a friendly web interface. I encourage you to have a look at https://gradio.app/ to get more familiar with it.
To use Gradio, you basically need three things: an input, an output, and a function.
First, we need to install Gradio:
pip install gradio -q
Next, we build the function inference(), which takes an audio input and returns a transcription.
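A minimal version, assuming Gradio hands the function the path of the uploaded or recorded file and reusing the base model loaded earlier:

```python
def inference(audio):
    # `audio` is the filepath of the uploaded/recorded clip.
    result = model_base.transcribe(audio)
    return result["text"]
```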
And then we just need to define the Gradio interface with the following:
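A sketch using Gradio's Interface API (the title is my own choice):

```python
import gradio as gr

gr.Interface(
    fn=inference,                       # the function defined above
    inputs=gr.Audio(type="filepath"),   # pass the audio as a filepath
    outputs="text",
    title="Whisper Speech Recognition",
).launch()
```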
If you run the previous cell, you should get something like this:
Conclusion
In this tutorial, I covered the basic usage of Whisper by running it in Python using a Jupyter notebook. The code used in this article can be found here.
An example of the deployed app in Hugging Face Spaces 🤗 can be found here.
References