Whisper API Pricing and Use Cases

Ivan Campos
Sopmac Labs
6 min read · Apr 5, 2023


Whisper API

Created by the company behind ChatGPT, Whisper is OpenAI’s general-purpose speech recognition model. Primarily, it’s used to convert spoken language into written text.

API Endpoints:

  • /transcriptions (transcribes audio in its source language)
curl --request POST \
--url https://api.openai.com/v1/audio/transcriptions \
--header 'Authorization: Bearer YOUR_OPENAI_API_KEY' \
--header 'Content-Type: multipart/form-data' \
--form file=@/path/to/file/openai.mp3 \
--form model=whisper-1
  • /translations (translates audio into English)
curl --request POST \
--url https://api.openai.com/v1/audio/translations \
--header 'Authorization: Bearer YOUR_OPENAI_API_KEY' \
--header 'Content-Type: multipart/form-data' \
--form file=@/path/to/file/boston-accent.mp3 \
--form model=whisper-1

Formats: M4A, MP3, MP4, MPEG, MPGA, WAV, and WEBM

Max file size: 25MB (roughly 26 minutes of audio at a 128 kbps bitrate)

Pricing: $0.006 per minute

Pricing

At 6 cents per 10 minutes, the API is an affordable way to transcribe your audio files, or to translate them into English text.

You can transcribe your entire life for $8.64 per day (1,440 minutes × $0.006).

Practical pricing examples related to media for the Whisper API
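
If you want to sanity-check these numbers for your own audio, the arithmetic is simple enough to script. The following sketch (plain Python, nothing Whisper-specific) estimates the cost of a recording and, assuming a typical 128 kbps MP3, roughly how much audio fits under the 25 MB upload limit:

# Rough cost estimator for the figures above (illustrative only).
PRICE_PER_MINUTE = 0.006             # USD, from the pricing above
MAX_UPLOAD_BYTES = 25 * 1000 * 1000  # 25 MB file-size limit

def transcription_cost(minutes):
    """Estimated cost in USD for a recording of the given length."""
    return minutes * PRICE_PER_MINUTE

def max_minutes_at_bitrate(kbps=128):
    """Approximate recording length that fits in 25 MB at a given bitrate."""
    bytes_per_minute = kbps * 1000 / 8 * 60
    return MAX_UPLOAD_BYTES / bytes_per_minute

print(f"10 minutes: ${transcription_cost(10):.2f}")        # $0.06
print(f"24 hours:   ${transcription_cost(24 * 60):.2f}")   # $8.64
print(f"Fits in 25 MB at 128 kbps: ~{max_minutes_at_bitrate():.0f} min")  # ~26 min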

Use Cases

The examples above focus on existing entertainment/media files, but this Automatic Speech Recognition (ASR) system has numerous potential use cases across a wide range of applications. Some of the primary use cases include:

  1. Accessibility tools: Whisper can be used to develop applications for individuals with hearing impairments or other disabilities, such as providing real-time captions for live events or transcribing speech into text for reading.
  2. Voice assistants: Whisper can be integrated into voice assistants, like those found in smartphones and smart home devices, to improve their accuracy and responsiveness in recognizing and processing voice commands from users.
  3. Facilitating global communication: By providing accurate speech recognition in multiple languages, Whisper can play a role in breaking down language barriers and fostering better communication between people from different linguistic backgrounds.
  4. Telemedicine: Whisper can facilitate remote medical consultations by transcribing patient descriptions of symptoms and other relevant information for healthcare professionals.
  5. Meeting and conference solutions: Whisper can be integrated into meeting and conference tools to provide live transcriptions, which can be useful for attendees who have difficulty understanding spoken language, are in noisy environments, or want to review the content later.
  6. Call center automation: By integrating Whisper into call center systems, businesses can automate various tasks, such as transcribing customer interactions, providing real-time support for agents, or analyzing customer sentiment and feedback.
  7. Podcast and radio analysis: Whisper can be used to analyze spoken content from podcasts or radio shows for data mining or trend analysis purposes.
  8. Virtual Reality (VR) and Augmented Reality (AR) applications: Whisper can be incorporated into VR and AR experiences, enabling users to interact with virtual environments using voice commands.

API

If you have a specific use case in mind for the OpenAI Whisper API, you can utilize the OpenAI Python library to implement it.

To install the openai library, you can use the following command:

pip install openai

Once you have installed the library, you can create an API key by going to the OpenAI website and opening the API Keys tab. You can then generate a key by clicking the Create New Secret Key button.
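
One note on handling that key: rather than pasting it directly into your code, it is generally safer to load it from an environment variable. A minimal sketch (the variable name OPENAI_API_KEY is just a common convention):

import os
import openai

# Load the secret key from the environment instead of hard-coding it.
openai.api_key = os.environ["OPENAI_API_KEY"]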

Once you have an API key, you can use it to make requests to the Whisper API. To make a request to the transcriptions endpoint, you can use the following code:

import openai
openai.api_key = "YOUR_OPENAI_API_KEY"
audio_file = open("/path/to/file/openai.mp3", "rb")
transcript = openai.Audio.transcribe("whisper-1", audio_file)
print(transcript)

Here’s a breakdown of what the code does:

  1. audio_file = open("/path/to/file/openai.mp3", "rb"): This line opens the audio file located at "/path/to/file/openai.mp3" in binary read mode ("rb"). The file is opened in binary mode because audio data is not plain text. The variable audio_file stores the opened file object.
  2. transcript = openai.Audio.transcribe("whisper-1", audio_file): This line calls the transcribe method from the openai.Audio module, passing in two arguments: "whisper-1" and audio_file. The string "whisper-1" specifies the model version to be used for transcription, which is OpenAI's Whisper ASR system in this case. The second argument, audio_file, is the file object we opened earlier, representing the audio file we want to transcribe. The method processes the audio file and returns the transcription as text, which is then stored in the variable transcript.
  3. print(transcript): This line prints the content of the transcript variable to the console, displaying the transcribed text obtained from the audio file.

In summary, this code snippet opens an audio file, transcribes it using OpenAI’s Whisper ASR system, and then prints the transcription. Note that in practice, you would need to replace “/path/to/file/openai.mp3” with the actual path of the audio file you want to transcribe.
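
Also note that the response is not a bare string: by default the endpoint returns a JSON object whose text field holds the transcription, so in practice you would typically pull that field out, roughly like this (the output filename is just an example):

# The default JSON response carries the transcription under "text".
text = transcript["text"]

# Save it next to the audio for later use (example filename).
with open("openai-transcript.txt", "w", encoding="utf-8") as out:
    out.write(text)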

Using the Python library, you can similarly call the translations endpoint using the following code:

audio_file = open("/path/to/file/boston-accent.mp3", "rb")
translation = openai.Audio.translate("whisper-1", audio_file)
print(translation)
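
If you plan to call both endpoints regularly, a small helper keeps the file handling in one place. The sketch below is just one way to wrap the two calls shown above (the function name and defaults are my own, not part of the library):

import openai  # assumes openai.api_key has already been set as shown earlier

def whisper_text(path, translate=False, model="whisper-1"):
    """Return transcribed text, or an English translation if translate=True."""
    with open(path, "rb") as audio_file:  # context manager closes the file for us
        if translate:
            result = openai.Audio.translate(model, audio_file)
        else:
            result = openai.Audio.transcribe(model, audio_file)
    return result["text"]

# Example usage (replace with a real file path):
# print(whisper_text("/path/to/file/boston-accent.mp3", translate=True))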

Training Architecture

If you are interested in how the Whisper model works under the hood, below are both a technical and a non-technical description of the underlying training architecture.

Whisper Training Architecture

Technical Description

In the paper “Robust Speech Recognition via Large-Scale Weak Supervision,” the authors from OpenAI introduce a transformer-based sequence-to-sequence architecture designed for automatic speech recognition (ASR) and trained on a vast weakly supervised dataset. The model comprises an encoder and a decoder: the encoder ingests mel-spectrogram sequences, which represent the audio in the time-frequency domain, and the decoder then generates a token sequence, such as words or subwords.

The encoder consists of multiple self-attention layers, which enable the model to focus on distinct portions of the input sequence. This is achieved by computing a weighted average of input features, with weights determined by feature similarity. These self-attention layers are succeeded by a feed-forward neural network.

Similarly, the decoder employs a stack of self-attention layers, but it also accepts encoder output as an additional input. This capability allows the decoder to utilize encoder-derived information to generate the output sequence, and it is also followed by a feed-forward network.

Training the model involves utilizing a loss function based on the cross-entropy between predicted tokens and ground truth tokens. The model is trained on a large-scale weakly supervised dataset consisting of audio and text that lack direct transcription labels. Instead, the model predicts label sets correlated with transcriptions, such as speaker identity, audio language, or background noise presence.

This transformer model achieves state-of-the-art performance on various ASR tasks, exhibiting robustness to noise and adaptability to challenging conditions.
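
To make the shape of that architecture a bit more concrete, here is a deliberately tiny PyTorch sketch of an encoder-decoder transformer trained with cross-entropy on (mel-spectrogram, token) pairs. This is not Whisper’s actual implementation (the real model adds convolutional downsampling of the audio, special task and language tokens, and far more parameters), and every dimension and name below is a placeholder:

import torch
import torch.nn as nn

class TinySpeechSeq2Seq(nn.Module):
    """Minimal encoder-decoder ASR sketch: mel frames in, token logits out."""

    def __init__(self, n_mels=80, d_model=256, vocab_size=8000):
        super().__init__()
        self.input_proj = nn.Linear(n_mels, d_model)   # mel frame -> model dimension
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)  # decoder state -> token logits

    def forward(self, mel, tokens):
        # mel:    (batch, time, n_mels)  spectrogram frames fed to the encoder
        # tokens: (batch, seq)           previous tokens fed to the decoder
        src = self.input_proj(mel)
        tgt = self.token_embed(tokens)
        # Causal mask so each position only attends to earlier tokens.
        causal = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.lm_head(hidden)                    # (batch, seq, vocab_size)

# One toy training step using the cross-entropy objective described above.
model = TinySpeechSeq2Seq()
mel = torch.randn(2, 300, 80)                # fake batch of spectrogram clips
tokens = torch.randint(0, 8000, (2, 20))     # fake target token ids
logits = model(mel, tokens[:, :-1])          # predict the next token at each step
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)),     # flatten to (batch*seq, vocab)
    tokens[:, 1:].reshape(-1),               # next-token targets
)
loss.backward()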

Non-technical Description

OpenAI made a computer program that can understand what people are saying in different situations. This program is like a recipe with two main parts: the encoder and the decoder.

The encoder listens to the sounds and turns them into a special kind of picture. Then the decoder takes that picture and figures out what words were said. To do this, both the encoder and decoder use something called “self-attention”, which helps them focus on the important parts of the sounds.

The program learns how to understand speech by practicing with lots of examples, but these examples don’t have the exact words people are saying. Instead, it learns from clues, like who’s talking or what language they’re speaking.

The Whisper model works really well and can understand people talking even when it’s noisy.

Conclusion

Trained on 680,000 hours of multilingual and multitask supervised data collected from the web, Whisper can transcribe speech in multiple languages, as well as translate from those languages into English. It can also perform related tasks such as spoken language identification and voice activity detection.

OpenAI has also open-sourced the Whisper model to serve as a foundation for building useful applications and for further research on robust speech processing.
