OpenAI Audio (Whisper) API Guide

bezbos.
Apr 2, 2023

OpenAI provides an API for transcribing audio files, powered by its Whisper model. Whisper is an automatic speech recognition system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It’s capable of generating transcriptions in multiple languages, as well as generating English-translated transcriptions from various languages.

For example, if you were a call center that recorded all calls, you could use Whisper to transcribe all the conversations and allow for easier searching and categorization. You could also use it to transcribe YouTube videos so that you don’t have to do it manually.

Overall, Whisper is a recent addition to the OpenAI family and it’s a very powerful tool.

Creating Transcriptions

The core feature of Whisper is transcribing audio. It can recognize speech in dozens of languages, so it’s capable of generating transcripts for them as well.

Transcribing English Audio

Let’s create a transcription of an audio file. I’m going to use one of my recordings from the course, in which I talk about OpenAI community libraries.

The transcriptions endpoint is located at https://api.openai.com/v1/audio/transcriptions and it accepts a POST request with a form-data payload containing two mandatory parameters:

  • file, which represents the audio recording we want to transcribe (supported audio formats: mp3, mp4, mpeg, mpga, m4a, wav, and webm);
  • model, which represents the audio model (at the time of this writing, only whisper-1 is available).

curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/english_recording_1.wav"' \
--form 'model="whisper-1"'

Here is the response:

{
  "text": "If you wish to follow this course using a different programming language, you can visit this page to find a list of community libraries that currently exist for OpenAI. Note that OpenAI does not verify the correctness or security of these libraries, so make sure to take a good look at those projects before you decide to use them. In this course, we will be using the official OpenAI libraries available for JavaScript and Python, so I suggest you to do the same. Calling OpenAI API endpoints is not difficult and you don't necessarily need a library, but they do help you to get started faster and they also have some nice features like logic and more graceful error handling."
}

The transcript is, dare I say, perfect. Not only did it properly transcribe the speech, but it also makes proper use of punctuation and capitalization. It genuinely looks as if a human wrote this transcription.

Now, this doesn’t mean that the Whisper API is without faults. In many cases, it might get unusual words and names wrong. For example, I will send a recording with some difficult-to-pronounce words like “DALL-E” and “GPT-3”, and my name, “Boško Bezik”.

Here is the request:

curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/english_recording_2.wav"' \
--form 'model="whisper-1"'

Here is the response:

{
  "text": "Hello, this is a test recording about GPT-3, Dolly and other OpenAI models. I'm trying to use some words difficult to pronounce, but I do hope the Whisper API will pick them up properly. My name is Boško Bezic and this is the recording that I've made."
}

As we can see, it got a few words wrong, like the DALL-E model name and my own name. In such cases, we can help Whisper by providing the prompt parameter, which explains what the transcription is about and helps the model spell unusual words correctly. I am going to tell OpenAI that this recording is about DALL-E and GPT-3, and that it’s recorded by Boško Bezik (me). From this prompt, Whisper will be able to infer the correct spelling of DALL-E and my name:

curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/english_recording_2.wav"' \
--form 'model="whisper-1"' \
--form 'prompt="This is a recording about DALL-E and GPT-3, by Boško Bezik"'

Here is the response:

{
  "text": "Hello, this is a test recording about GPT-3, DALL-E and other OpenAI models. I'm trying to use some words difficult to pronounce, but I do hope the Whisper API will pick them up properly. My name is Boško Bezik and this is the recording that I've made."
}

It got everything right this time!

Transcribing Non-English Audio

Whisper is also capable of transcribing other languages. For example, I will provide an audio file of me speaking Bosnian.

Here is the request:

curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/bosnian_recording_1.wav"' \
--form 'model="whisper-1"'

Here is the response:

{
"text": "Pozdrav, ovo je snimak na Bosanskome gdje ja, Boško Bezik, pričam o OpenAI modelima Dolly, GPT-3 i najnovi Whisper model za prepoznavanje snimaka i pretvaranje tih snimaka u tekst"
}

It’s almost perfect, but it got the DALL-E spelling wrong again (“Dolly”); however, that can easily be fixed with a well-defined prompt parameter. Although Whisper is very good at detecting the language, it’s best if you tell it what language to expect. We can do that by setting the language parameter to an ISO-639-1 language code:

curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/bosnian_recording_1.wav"' \
--form 'model="whisper-1"' \
--form 'language="bs"'
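We can also combine the language parameter with the prompt parameter to fix the DALL-E spelling in the same request. Here is a sketch; the response is omitted, but the prompt should nudge Whisper toward the correct spellings:

curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/bosnian_recording_1.wav"' \
--form 'model="whisper-1"' \
--form 'language="bs"' \
--form 'prompt="DALL-E, GPT-3, Whisper, Boško Bezik"'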

Response Formats

The Audio API is also capable of returning responses in formats suitable for video subtitles, such as the SubRip format (srt) or Web Video Text Tracks (vtt). You can also request a plain-text format or a verbose JSON format that contains various metadata, including tokens and segments. To change the format, simply set the response_format parameter to one of these options: json, text, srt, verbose_json, or vtt. I will send a request for each of these formats (except for JSON, which we’ve already seen) and show you the response:

curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/english_recording_2.wav"' \
--form 'model="whisper-1"' \
--form 'response_format="vtt"' \
--form 'prompt="This is a recording about DALL-E and GPT-3, by Boško Bezik"'

Web Video Text Tracks (response_format=vtt):

WEBVTT

00:00:00.000 --> 00:00:08.640
Hello, this is a test recording about GPT-3, DALL-E and other OpenAI models.

00:00:08.640 --> 00:00:14.040
I'm trying to use some words difficult to pronounce, but I do hope the Whisper API will

00:00:14.040 --> 00:00:15.840
pick them up properly.

00:00:15.840 --> 00:00:35.840
My name is Boško Bezik and this is the recording that I've made.

SubRip File Format (response_format=srt):

1
00:00:00,000 --> 00:00:08,640
Hello, this is a test recording about GPT-3, DALL-E and other OpenAI models.

2
00:00:08,640 --> 00:00:14,040
I'm trying to use some words difficult to pronounce, but I do hope the Whisper API will

3
00:00:14,040 --> 00:00:15,840
pick them up properly.

4
00:00:15,840 --> 00:00:35,840
My name is Boško Bezik and this is the recording that I've made.

Plain text (response_format=text):

Hello, this is a test recording about GPT-3, DALL-E and other OpenAI models. I'm trying to use some words difficult to pronounce, but I do hope the Whisper API will pick them up properly. My name is Boško Bezik and this is the recording that I've made.

Verbose JSON (response_format=verbose_json):

{
  "task": "transcribe",
  "language": "english",
  "duration": 19.79,
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 8.64,
      "text": " Hello, this is a test recording about GPT-3, DALL-E and other OpenAI models.",
      "tokens": [2425, 11, 341, 307, 257, 1500, 6613, 466, 26039, 51, 12, 18, 11, 413, 15921, 12, 36, 293, 661, 7238, 48698, 5245, 13],
      "temperature": 0.0,
      "avg_logprob": -0.2631749353910747,
      "compression_ratio": 1.2857142857142858,
      "no_speech_prob": 0.17765560746192932,
      "transient": false
    },
    {
      "id": 1,
      "seek": 0,
      "start": 8.64,
      "end": 14.040000000000001,
      "text": " I'm trying to use some words difficult to pronounce, but I do hope the Whisper API will",
      "tokens": [286, 478, 1382, 281, 764, 512, 2283, 2252, 281, 19567, 11, 457, 286, 360, 1454, 264, 41132, 610, 9362, 486],
      "temperature": 0.0,
      "avg_logprob": -0.2631749353910747,
      "compression_ratio": 1.2857142857142858,
      "no_speech_prob": 0.17765560746192932,
      "transient": false
    },
    {
      "id": 2,
      "seek": 0,
      "start": 14.040000000000001,
      "end": 15.84,
      "text": " pick them up properly.",
      "tokens": [1888, 552, 493, 6108, 13],
      "temperature": 0.0,
      "avg_logprob": -0.2631749353910747,
      "compression_ratio": 1.2857142857142858,
      "no_speech_prob": 0.17765560746192932,
      "transient": false
    },
    {
      "id": 3,
      "seek": 1584,
      "start": 15.84,
      "end": 35.84,
      "text": " My name is Boško Bezik and this is the recording that I've made.",
      "tokens": [1222, 1315, 307, 3286, 7891, 4093, 879, 89, 1035, 293, 341, 307, 264, 6613, 300, 286, 600, 1027, 13],
      "temperature": 0.0,
      "avg_logprob": -0.3431914578313413,
      "compression_ratio": 0.927536231884058,
      "no_speech_prob": 0.011091013438999653,
      "transient": false
    }
  ],
  "text": "Hello, this is a test recording about GPT-3, DALL-E and other OpenAI models. I'm trying to use some words difficult to pronounce, but I do hope the Whisper API will pick them up properly. My name is Boško Bezik and this is the recording that I've made."
}

Temperature

We can also specify the temperature parameter which affects how random or deterministic the transcription will be. Higher temperature values will allow the audio model to pick tokens with lower log probability scores, while lower temperature values will ensure the model picks tokens with the highest log probability scores.

Temperature values below 0.2 will make the output focused and deterministic, while temperature values above 0.8 will make the output diversified and random. The default temperature value is 0, while the maximum is 2, although I don’t recommend going above 1 because it will result in gibberish output.

Let’s see an example of a transcript request with the temperature set to 1:

curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/english_recording_2.wav"' \
--form 'model="whisper-1"' \
--form 'prompt="This is a recording about DALL-E and GPT-3, by Boško Bezik"' \
--form 'temperature="1"'

Here is the response:

{
  "text": "Hello, this is a test recording about GPT-3, DALL-E and other OpenAI models I'm trying to use some words difficult to pronounce but I do hope WHISPER API will pick them up properly my name is Boško Besick and this is the recording that I've made."
}

So as we can see, it transcribed a few things differently, like the Whisper model name and my name. Also, this is what the output would look like with the temperature set to 2:

{
  "text": "Hello cadre tört LUIE π tragic這些 program ации"
}

It’s total nonsense. That’s why it makes no sense to set the temperature above 1, unless you explicitly want to generate gibberish.

Creating Translated Transcriptions

We can also use the Audio API to generate English transcriptions from non-English audio recordings. For example, I could generate an English transcript of a Bosnian audio recording. This can be very useful when you want to translate non-English speeches or recordings.

The endpoint for translated transcriptions is located at https://api.openai.com/v1/audio/translations and it accepts nearly the same request parameters as normal transcriptions, minus the language parameter. I will send one of my audio recordings in Bosnian:

curl --location 'https://api.openai.com/v1/audio/translations' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/bosnian_recording_1.wav"' \
--form 'model="whisper-1"' \
--form 'prompt="This is a recording in Bosnian about DALL-E and GPT-3, by Boško Bezik"' \
--form 'temperature="0"' \
--form 'response_format="json"'

Here is the response:

{
  "text": "Hello, this is a recording in Bosnian where I, Boško Bezik, am talking about OpenAI models DALL-E, GPT-3 and the newest Whisper model for recognizing and converting recordings into text."
}

Just as with the transcriptions endpoint, we can specify the prompt, temperature, and response_format parameters. The language parameter is not available; instead, we have to rely on Whisper to detect the language properly, or we can specify it in the prompt.

Limitations

At the time of this writing, Whisper is limited to accepting files up to 25MB in size. This can make it inconvenient to transcribe longer recordings. To get around this limit, we can compress the audio files at lower bitrates or split the recording into multiple files.

You can easily re-encode audio files at cloudconvert.com, which is a superb resource for all your conversion and compression needs. You can significantly reduce the file size by re-encoding the audio as an MP3 at a 128 kbps bitrate, which still sounds good at a much smaller size.
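
If you prefer the command line, ffmpeg can do the same re-encoding locally. Here is a minimal sketch, assuming ffmpeg is installed; the file names are placeholders for your own recordings:

# Re-encode a WAV recording as a 128 kbps MP3 to shrink it below the 25MB limit.
ffmpeg -i long_recording.wav -b:a 128k long_recording.mp3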

However, sometimes even after compression, your file might still be too big. In such cases, you can split the file into multiple chunks. There are different tools for splitting, depending on your platform, but if all else fails, you can always use an online tool.
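
For example, ffmpeg can split a recording into fixed-length chunks, and a small shell loop can then send each chunk to the transcriptions endpoint. Here is a minimal sketch with placeholder file names; keep in mind that naive fixed-length splitting can cut a word in half at a chunk boundary, which may cost some accuracy where the chunks meet:

# Split the recording into 10-minute chunks without re-encoding.
ffmpeg -i long_recording.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3

# Transcribe each chunk and append the plain-text results to one transcript file.
for chunk in chunk_*.mp3; do
  curl --location 'https://api.openai.com/v1/audio/transcriptions' \
    --header 'Authorization: Bearer API_KEY' \
    --form "file=@$chunk" \
    --form 'model="whisper-1"' \
    --form 'response_format="text"' >> transcript.txt
done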

Summary

Let’s summarize this article in several key points:

  • Provides a simple and easy-to-use REST API for uploading audio files and returning transcriptions;
  • Supports popular audio formats (mp3, mp4, mpeg, mpga, m4a, wav, and webm);
  • Supports popular transcription formats (json, text, srt, verbose_json, and vtt);
  • Provides parameters (prompt, language, temperature) for guiding and affecting the transcript output;
  • Capable of translating non-English audio into English transcriptions (v1/audio/translations);
  • File size limited to 25MB. Can be circumvented by re-encoding audio at lower bitrates or splitting;

Integrate OpenAI API in Your Projects

🚀 Learn how to integrate state-of-the-art AI language models used by ChatGPT into your projects: https://bezbos.com/p/complete-openai-integration

Complete OpenAI Integration Course — Bring the Power of OpenAI Models to Your Applications!

📚🧐 You will learn all about the API endpoints that are available, including mechanisms for completion, edits, moderations, images, image edits, image variations, embeddings, fine-tuning, and other utility APIs.

💻🤝 With hands-on exercises, detailed explanations, and real-world examples, you will have a clear understanding of how to integrate OpenAI APIs into almost any project.

🚀👨‍💻 By the end of this course, you’ll be able to integrate OpenAI API into any project!
