How to use Whisper — an OpenAI Speech Recognition Model that turns audio into text with up to 99% accuracy

Egor Menyaylo · Published in GIMZ · 5 min read · Feb 16, 2023


[Image generated by Midjourney]

Whisper is a speech transcription system from the creators of ChatGPT. Anyone can use it, and it’s completely free, but there’s one problem.

Whisper does not have a web version like ChatGPT. You have to install it manually, read guides written by developers for developers, write some code, and so on.

But in fact, you can try the neural network without even opening GitHub. Here’s how to do it.

Who can use it

Editors, speakers, and anyone else who needs speech-to-text. And the use cases vary widely:

  • Had a work call on Zoom or Google Meet? You get a transcript that helps you dive back into the context, write a follow-up, and not miss any details.
  • Conducted an interview? You get a ready-made text draft right away.
  • Gave a talk at a conference? You get an article with minimal effort.
  • Recorded a lecture or a project demo? You get a ready text version.
  • Or you need subtitles, and so on.

The system was trained on 680,000 hours of speech data collected from the web and recognises 99 languages.

How to use Whisper

There are three main ways:

1. Hardcore, but the best (local installation). Go to GitHub, dig into the source, read the tutorials, and install Whisper locally on your computer (both Mac and PC will work); a minimal sketch of this route appears a little further down.

  • Pros: works offline and fast, especially on good hardware.
  • Cons: not everyone will want to deal with the setup.

2. Simple, but slow (in the cloud). Right in the browser, and it takes literally five minutes to set up. You’ll need Google Colab (kind of like Google Docs, only for writing Python code) and a few simple commands.

  • Pros: no installation hassle, and it works on any device. A good way to get familiar with Whisper.
  • Cons: it’s slow, and you have to re-download the model (up to 3 GB) every time you restart the runtime. The service’s free computing resources are also limited, and all data is deleted after 12 hours; the restrictions can be lifted with a paid subscription.

3. Convenient, but paid (app). MacWhisper is a native Mac app, but its free version only supports the simpler recognition models.

  • Pros: install and use.
  • Cons: you have to pay 10 euros for good results; the free version does not support the most advanced recognition model, large-v2.

We’ll leave the first method for next time (it has its own nuances) and walk through the second one, so that everyone can test the neural network and decide whether they need it.
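That said, if you are curious about the local route right now, here is a minimal sketch, assuming you already have Python and ffmpeg on your machine (the package name comes from the official README; "recording.mp3" is a placeholder for your own file):

pip install -U openai-whisper
whisper "recording.mp3" --model small

The CLI works the same way locally as it does in Colab below, just without the leading "!".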

Whisper in your browser

The method should work on any device, even smartphones, but we can’t promise that. What is certain: it works in desktop browsers.

1. Create a new Google Colab notebook. Simply open colab.research.google.com and click “New notebook”.

2. Enable the GPU (Whisper works without it, but it’s much faster with one).

Menu → Runtime → Change runtime type

Select GPU as the hardware accelerator and click “Save”.
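To double-check that the GPU is actually attached, you can run the standard NVIDIA utility in a cell; on the free tier it typically reports a Tesla T4:

!nvidia-smi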

3. Install Whisper

Paste the code below into an empty cell and run it (the Play button to the left of the cell, or Ctrl + Enter). The installation takes a couple of minutes.

!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install ffmpeg

4. Upload an audio file.

It definitely supports .mp4 video and .mp3, .wav, .m4a audio; we tested those.

In the left sidebar, click the folder icon and upload the file any way you like, for example by simply dragging and dropping it into the browser window.
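If the sidebar is inconvenient, you can also upload the file from code using Colab’s built-in helper. This is just an alternative to drag-and-drop, not part of the original recipe:

# Colab-only helper: opens a browser file picker and saves the chosen
# files into the notebook's current working directory.
from google.colab import files

uploaded = files.upload()
print(list(uploaded))  # the file names, exactly as the whisper command will see them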

5. Run Whisper.

Run the command below with your own file name (we took this one).

!whisper "Polyglot speaking in 12 languages.mp3"

Then press Play. Whisper will start transcribing and will save the text files to the same folder where you uploaded the audio. You can download them in .json, .srt, .tsv, .txt and .vtt formats (the plain .txt has no timestamps; the subtitle formats do).
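By default the CLI writes all five formats next to the audio. If you want just one of them, or a separate folder, there are flags for that; you can confirm them in the !whisper -h output mentioned below ("transcripts" here is just an example folder name):

!whisper "Polyglot speaking in 12 languages.mp3" --output_format txt --output_dir transcripts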

The results are impressive. Whisper recognised the English and Spanish speech and transcribed both into text. But the remaining languages were immediately translated into English, because it had detected English as the main language of the file.

The source language can be specified by adding the --language parameter. For example:

!whisper "Polyglot speaking in 12 languages.mp3" –-language Italian

We also tried an ordinary recording made on a voice recorder: no processing, just the raw source. If a recording is really poor, you can first run it through Adobe’s Enhance Speech to improve the sound quality; it works very well, too.

Models and quality of transcription

Whisper comes with several recognition models: the bigger the model, the better the result and the longer the run time.

The most advanced, large-v2, is trained on the same dataset as large, but for 2.5 times more epochs, which improves the final result.

By default, the command uses the small model. In our test, it transcribed a five-minute recording in 40 seconds and an hour-long one in 8 minutes. It’s fast, but the quality matches the size: not bad, the meaning of the text comes through, but the nuances can slip away.
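For orientation, here is the model lineup as listed in the Whisper README (approximate figures; speed is relative to large):

Model    Parameters   Required VRAM   Relative speed
tiny     39 M         ~1 GB           ~32x
base     74 M         ~1 GB           ~16x
small    244 M        ~2 GB           ~6x
medium   769 M        ~5 GB           ~2x
large    1550 M       ~10 GB          1x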

Let’s run the same file, but on a large-v2 model:

!whisper "Polyglot speaking in 12 languages.mp3" --model large-v2

The results are even better. This time, the model transcribed not only English and Spanish, but all other languages as well.

Whisper has a number of other parameters; you can list them all with this command:

!whisper -h
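If you prefer Python over shell commands, the same package exposes a Python API. A minimal sketch, using the file name from the examples above:

import whisper

# Downloads (on first use) and loads a model; "small" is the CLI default.
model = whisper.load_model("small")

# Transcribes the uploaded file; pass language="Italian" to force the
# source language, same as the --language flag.
result = model.transcribe("Polyglot speaking in 12 languages.mp3")

print(result["text"])      # the full transcript as one string
print(result["language"])  # the language Whisper detected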

For now, you can play around with it right in the browser. The next step is installing it on your computer and comparing the speed. When we do, we’ll tell you everything.
