OpenAI provides an API for transcribing audio files called Whisper. Whisper is an automatic speech recognition system trained on 680,000 hours of multilingual supervised data collected from the web. It’s capable of generating transcriptions in multiple languages, as well as English-translated transcriptions from various source languages.
For example, a call center that records all calls could use Whisper to transcribe the conversations, making them easier to search and categorize. You could also use it to transcribe YouTube videos so that you don’t have to do it manually.
Overall, Whisper is a recent addition to the OpenAI family and it’s a very powerful tool.
Creating Transcriptions
The core feature of Whisper is transcribing audio. It can recognize dozens of human languages and generate transcripts for all of them.
Transcribing English Audio
Let’s create a transcription of an audio file. I’m going to use one of my recordings from the course where I talk about OpenAI community libraries:
The transcriptions endpoint is located at https://api.openai.com/v1/audio/transcriptions and it accepts a POST request with a form-data payload containing two mandatory parameters:
- file — the audio recording we want to transcribe (supported audio formats: mp3, mp4, mpeg, mpga, m4a, wav, and webm);
- model — the audio model to use (at the time of this writing, only whisper-1 is available).
curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/english_recording_1.wav"' \
--form 'model="whisper-1"'
Here is the response:
{
"text": "If you wish to follow this course using a different programming language, you can visit this page to find a list of community libraries that currently exist for OpenAI. Note that OpenAI does not verify the correctness or security of these libraries, so make sure to take a good look at those projects before you decide to use them. In this course, we will be using the official OpenAI libraries available for JavaScript and Python, so I suggest you to do the same. Calling OpenAI API endpoints is not difficult and you don't necessarily need a library, but they do help you to get started faster and they also have some nice features like logic and more graceful error handling."
}
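If you would rather call the endpoint from Python than from curl, a minimal sketch using the third-party requests package might look like this. The file path and the OPENAI_API_KEY environment variable are placeholders, and build_transcription_form is just a helper name I made up for assembling the form fields:

```python
import os

TRANSCRIPTIONS_URL = "https://api.openai.com/v1/audio/transcriptions"


def build_transcription_form(model="whisper-1", **extra):
    """Assemble the non-file form fields (prompt, language, etc. go in extra)."""
    return {"model": model, **extra}


def transcribe(audio_path, **extra):
    """POST the audio file and return the transcribed text."""
    import requests  # third-party: pip install requests

    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    with open(audio_path, "rb") as audio_file:
        response = requests.post(
            TRANSCRIPTIONS_URL,
            headers=headers,
            data=build_transcription_form(**extra),
            files={"file": audio_file},  # the audio goes in as multipart data
        )
    response.raise_for_status()
    return response.json()["text"]


if __name__ == "__main__":
    print(transcribe("/home/bezbos/Downloads/english_recording_1.wav"))
```

The same helper works for the optional parameters covered later in this article, e.g. `transcribe(path, prompt="...", language="en")`.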
The transcript is, dare I say, perfect. Not only did it properly transcribe the words, but it also makes proper use of punctuation and capitalization. It genuinely looks as if a human wrote this transcription.
Now, this doesn’t mean that the Whisper API is without faults. In many cases, it might get certain words and names wrong. For example, I will send a recording with some hard-to-transcribe words like “DALL-E” and “GPT-3”, as well as my name, “Boško Bezik”:
Here is the request:
curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/english_recording_2.wav"' \
--form 'model="whisper-1"'
Here is the response:
{
"text": "Hello, this is a test recording about GPT-3, Dolly and other OpenAI models. I'm trying to use some words difficult to pronounce, but I do hope the Whisper API will pick them up properly. My name is Boško Bezic and this is the recording that I've made."
}
As we can see, it got a few words wrong, like the DALL-E model name and my surname. In such cases, we can help Whisper by providing the prompt parameter, which explains what the recording is about and helps with spelling. I am going to tell OpenAI that this recording is about DALL-E, GPT-3 and that it’s recorded by Boško Bezik (me). From this prompt, Whisper will be able to infer the correct spelling of DALL-E and my name:
curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/english_recording_2.wav"' \
--form 'model="whisper-1"' \
--form 'prompt="This is a recording about DALL-E and GPT-3, by Boško Bezik"'
Here is the response:
{
"text": "Hello, this is a test recording about GPT-3, DALL-E and other OpenAI models. I'm trying to use some words difficult to pronounce, but I do hope the Whisper API will pick them up properly. My name is Boško Bezik and this is the recording that I've made."
}
It got everything right this time!
Transcribing Non-English Audio
Whisper is also capable of transcribing other languages. For example, I will provide an audio file of me speaking in the Bosnian language:
Here is the request:
curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/bosnian_recording_1.wav"' \
--form 'model="whisper-1"'
Here is the response:
{
"text": "Pozdrav, ovo je snimak na Bosanskome gdje ja, Boško Bezik, pričam o OpenAI modelima Dolly, GPT-3 i najnovi Whisper model za prepoznavanje snimaka i pretvaranje tih snimaka u tekst"
}
It’s almost perfect, but it got the DALL-E spelling wrong. However, that can easily be fixed with a well-defined prompt parameter. Although Whisper is very good at detecting the language, it’s best if you tell it which language to expect. We can do that by setting the language parameter to an ISO-639-1 language code:
curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/bosnian_recording_1.wav"' \
--form 'model="whisper-1"' \
--form 'language="bs"'
Response Formats
The Audio API is also capable of returning responses in various formats suitable for video subtitles, such as SubRip File Format (srt) or Web Video Text Tracks (vtt). You can also request a plain text format or verbose JSON, which contains various metadata including tokens and segments. To change the format, simply set the response_format parameter to one of these options: json, text, srt, verbose_json, or vtt. I will send a request for each of these formats (except for JSON, which we’ve already seen) and show you the response:
curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/english_recording_2.wav"' \
--form 'model="whisper-1"' \
--form 'response_format="vtt"' \
--form 'prompt="This is a recording about DALL-E and GPT-3, by Boško Bezik"'
Web Video Text Tracks (response_format=vtt):
WEBVTT
00:00:00.000 --> 00:00:08.640
Hello, this is a test recording about GPT-3, DALL-E and other OpenAI models.
00:00:08.640 --> 00:00:14.040
I'm trying to use some words difficult to pronounce, but I do hope the Whisper API will
00:00:14.040 --> 00:00:15.840
pick them up properly.
00:00:15.840 --> 00:00:35.840
My name is Boško Bezik and this is the recording that I've made.
SubRip File Format (response_format=srt):
1
00:00:00,000 --> 00:00:08,640
Hello, this is a test recording about GPT-3, DALL-E and other OpenAI models.
2
00:00:08,640 --> 00:00:14,040
I'm trying to use some words difficult to pronounce, but I do hope the Whisper API will
3
00:00:14,040 --> 00:00:15,840
pick them up properly.
4
00:00:15,840 --> 00:00:35,840
My name is Boško Bezik and this is the recording that I've made.
Plain text (response_format=text):
Hello, this is a test recording about GPT-3, DALL-E and other OpenAI models. I'm trying to use some words difficult to pronounce, but I do hope the Whisper API will pick them up properly. My name is Boško Bezik and this is the recording that I've made.
Verbose JSON (response_format=verbose_json):
{
"task": "transcribe",
"language": "english",
"duration": 19.79,
"segments": [
{
"id": 0,
"seek": 0,
"start": 0.0,
"end": 8.64,
"text": " Hello, this is a test recording about GPT-3, DALL-E and other OpenAI models.",
"tokens": [
2425,
11,
341,
307,
257,
1500,
6613,
466,
26039,
51,
12,
18,
11,
413,
15921,
12,
36,
293,
661,
7238,
48698,
5245,
13
],
"temperature": 0.0,
"avg_logprob": -0.2631749353910747,
"compression_ratio": 1.2857142857142858,
"no_speech_prob": 0.17765560746192932,
"transient": false
},
{
"id": 1,
"seek": 0,
"start": 8.64,
"end": 14.040000000000001,
"text": " I'm trying to use some words difficult to pronounce, but I do hope the Whisper API will",
"tokens": [
286,
478,
1382,
281,
764,
512,
2283,
2252,
281,
19567,
11,
457,
286,
360,
1454,
264,
41132,
610,
9362,
486
],
"temperature": 0.0,
"avg_logprob": -0.2631749353910747,
"compression_ratio": 1.2857142857142858,
"no_speech_prob": 0.17765560746192932,
"transient": false
},
{
"id": 2,
"seek": 0,
"start": 14.040000000000001,
"end": 15.84,
"text": " pick them up properly.",
"tokens": [
1888,
552,
493,
6108,
13
],
"temperature": 0.0,
"avg_logprob": -0.2631749353910747,
"compression_ratio": 1.2857142857142858,
"no_speech_prob": 0.17765560746192932,
"transient": false
},
{
"id": 3,
"seek": 1584,
"start": 15.84,
"end": 35.84,
"text": " My name is Boško Bezik and this is the recording that I've made.",
"tokens": [
1222,
1315,
307,
3286,
7891,
4093,
879,
89,
1035,
293,
341,
307,
264,
6613,
300,
286,
600,
1027,
13
],
"temperature": 0.0,
"avg_logprob": -0.3431914578313413,
"compression_ratio": 0.927536231884058,
"no_speech_prob": 0.011091013438999653,
"transient": false
}
],
"text": "Hello, this is a test recording about GPT-3, DALL-E and other OpenAI models. I'm trying to use some words difficult to pronounce, but I do hope the Whisper API will pick them up properly. My name is Boško Bezik and this is the recording that I've made."
}
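The start and end timestamps in the verbose JSON segments are enough to rebuild the srt or vtt output ourselves. Here is a small sketch of that idea (the segment dicts below mirror the response shape shown above; the two helper names are my own):

```python
def format_timestamp(seconds, separator=","):
    """Render seconds as HH:MM:SS,mmm (SRT uses ',', VTT uses '.')."""
    millis = round(seconds * 1000)
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{millis:03d}"


def segments_to_srt(segments):
    """Convert verbose_json segments into SubRip cues."""
    cues = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        cues.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}")
    return "\n\n".join(cues)
```

For example, the first segment above (0.0 to 8.64 seconds) becomes a cue with the timing line `00:00:00,000 --> 00:00:08,640`, matching the srt response we saw earlier.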
Temperature
We can also specify the temperature parameter, which affects how random or deterministic the transcription will be. Higher temperature values allow the audio model to pick tokens with lower log-probability scores, while lower values make it stick to the tokens with the highest log-probability scores.
Temperature values below 0.2 will make the output focused and deterministic, while values above 0.8 will make it diversified and random. The default temperature is 0 and the maximum is 2, although I don’t recommend going above 1 because it will result in gibberish output.
Let’s see an example of a transcription request with the temperature set to 1:
curl --location 'https://api.openai.com/v1/audio/transcriptions' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/english_recording_2.wav"' \
--form 'model="whisper-1"' \
--form 'prompt="This is a recording about DALL-E and GPT-3, by Boško Bezik"' \
--form 'temperature="1"'
Here is the response:
{
"text": "Hello, this is a test recording about GPT-3, DALL-E and other OpenAI models I'm trying to use some words difficult to pronounce but I do hope WHISPER API will pick them up properly my name is Boško Besick and this is the recording that I've made."
}
As we can see, it transcribed a few things differently, like the Whisper model name and my surname. Also, this is what it would look like with the temperature set to 2:
{
"text": "Hello cadre tört LUIE π tragic這些 program ации"
}
It’s total nonsense. That’s why it makes no sense to set the temperature above 1, unless you explicitly want to generate gibberish.
Creating Translated Transcriptions
We can also use the Audio API to generate English transcriptions from non-English audio recordings. For example, I could generate an English transcript of a Bosnian audio recording. This can be very useful when you want to translate non-English speeches or recordings.
The endpoint for translated transcriptions is located at https://api.openai.com/v1/audio/translations and accepts nearly identical request parameters as normal transcriptions, minus the language parameter. I will send one of my audio recordings in Bosnian:
curl --location 'https://api.openai.com/v1/audio/translations' \
--header 'Authorization: Bearer API_KEY' \
--form 'file=@"/home/bezbos/Downloads/bosnian_recording_1.wav"' \
--form 'model="whisper-1"' \
--form 'prompt="This is a recording in Bosnian about DALL-E and GPT-3, by Boško Bezik"' \
--form 'temperature="0"' \
--form 'response_format="json"'
Here is the response:
{
"text": "Hello, this is a recording in Bosnian where I, Boško Bezik, am talking about OpenAI models DALL-E, GPT-3 and the newest Whisper model for recognizing and converting recordings into text."
}
Just as with the transcriptions endpoint, we can specify the prompt, temperature, and response_format parameters. The language parameter is not available; instead, we have to rely on Whisper to detect the language properly, or we can mention it in the prompt.
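In Python, the translation call would differ from a transcription only in its URL and in the absence of the language field. A hedged sketch using the third-party requests package (file path, OPENAI_API_KEY, and the helper names are placeholders of mine):

```python
import os

TRANSLATIONS_URL = "https://api.openai.com/v1/audio/translations"


def build_translation_form(model="whisper-1", prompt=None, temperature=None,
                           response_format="json"):
    """Assemble the form fields; note there is no 'language' field here."""
    form = {"model": model, "response_format": response_format}
    if prompt is not None:
        form["prompt"] = prompt
    if temperature is not None:
        form["temperature"] = str(temperature)
    return form


def translate(audio_path, **options):
    """POST a non-English recording and return the English transcript."""
    import requests  # third-party: pip install requests

    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    with open(audio_path, "rb") as audio_file:
        response = requests.post(TRANSLATIONS_URL, headers=headers,
                                 data=build_translation_form(**options),
                                 files={"file": audio_file})
    response.raise_for_status()
    return response.json()["text"]
```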
Limitations
At the time of this writing, Whisper is limited to accepting files up to 25 MB in size. This can be inconvenient when transcribing longer recordings. To get around this limit, we can re-encode the audio at a lower bitrate or split the recording into multiple files.
You can easily re-encode audio files at cloudconvert.com, which is a superb resource for all your conversion and compression needs. You can significantly reduce the file size by re-encoding the audio as an MP3 at a 128 kbps bitrate, which still sounds good at a small size.
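A quick back-of-the-envelope check shows how much audio fits under the cap at a given bitrate (assuming decimal megabytes; `max_minutes` is just a name I picked for the calculation):

```python
def max_minutes(limit_mb=25, bitrate_kbps=128):
    """Return the longest recording (in minutes) that fits the size limit."""
    bytes_per_second = bitrate_kbps * 1000 / 8  # kbps -> bytes per second
    seconds = limit_mb * 1_000_000 / bytes_per_second
    return seconds / 60
```

At 128 kbps this works out to roughly 26 minutes of audio per request, and halving the bitrate to 64 kbps roughly doubles that.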
However, sometimes even after compression, your file might still be too big. In such cases, you can split the file into multiple chunks. There are different tools for splitting, depending on your platform, but if all else fails, you can always use an online tool.
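For WAV files specifically, splitting can be done with nothing but the Python standard library. A sketch (the function name is mine; note that splitting on fixed time boundaries can cut a word in half, so in practice you may prefer to split on silence):

```python
import wave


def split_wav(path, chunk_seconds, prefix="chunk"):
    """Split a WAV file into chunk_seconds-long pieces; return their paths."""
    written = []
    with wave.open(path, "rb") as source:
        params = source.getparams()
        frames_per_chunk = source.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = source.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{prefix}_{index:03d}.wav"
            # Copy the source's channel count, sample width, and rate so
            # each chunk stays a valid standalone WAV file.
            with wave.open(out_path, "wb") as chunk:
                chunk.setparams(params)
                chunk.writeframes(frames)
            written.append(out_path)
            index += 1
    return written
```

Each chunk can then be sent to the transcriptions endpoint separately, and the resulting texts concatenated.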
Summary
Let’s summarize this article in several key points:
- Whisper provides a simple and easy-to-use REST API for uploading audio files and returning transcriptions;
- Supports popular audio formats (mp3, mp4, mpeg, mpga, m4a, wav, and webm);
- Supports popular transcription formats (json, text, srt, verbose_json, and vtt);
- Provides parameters (prompt, language, temperature) for guiding and affecting the transcript output;
- Capable of translating non-English audio into English transcriptions (v1/audio/translations);
- File size is limited to 25 MB, which can be circumvented by re-encoding the audio at a lower bitrate or splitting it into multiple files.
Integrate OpenAI API in Your Projects
🚀 Learn how to integrate state-of-the-art AI language models used by ChatGPT into your projects: https://bezbos.com/p/complete-openai-integration
📚🧐 You will learn all about the API endpoints that are available, including mechanisms for completion, edits, moderations, images, image edits, image variations, embeddings, fine-tuning, and other utility APIs.
💻🤝 With hands-on exercises, detailed explanations, and real-world examples, you will have a clear understanding of how to integrate OpenAI APIs into almost any project.
🚀👨💻 By the end of this course, you’ll be able to integrate OpenAI API into any project!