Whisper ASR in MLX: How much faster is speech recognition, really?
~50.85% faster transcriptions using Whisper in MLX on an M1 Mac (compared to Whisper without MLX)
Whisper is an automatic speech recognition (ASR) system developed by OpenAI, designed to convert spoken language into written text. In the realm of speech-to-text technology, the choice of underlying infrastructure significantly impacts performance outcomes.
This project undertakes a quick but systematic comparison of three distinct Whisper configurations:
- “Local Whisper”: Whisper running on an M1 Mac
- “MLX Local Whisper”: Whisper in MLX on an M1 Mac
- “OpenAI API”: Whisper via the OpenAI API
The project aims to test these different configurations with a few audio files, gather data on how long it takes for the transcription to be created, and use this data to understand when to use each Whisper configuration.
The Experiment
Audio Files Used
A variety of audio files from AmericanRhetoric.com serve as the basis for the tests, allowing for diverse evaluations based on different lengths and content types. Due to the OpenAI API’s 25 MB file upload limit, smaller audio files are used for API tests. Conversely, larger audio files are used to assess the performance of local-only models, namely “Local Whisper” and “MLX Local Whisper.”
I used these audio files:
- MLK: #1 “I have a dream” (16:27)
- Pearl Harbor: #4 “Pearl Harbor Address to the Nation” (7:42)
- Impeach: #13 “On the Articles of Impeachment” (13:03)
- Berlin: #22 “Ich bin ein Berliner” (8:29)
Files greater than 25 MB, and therefore not tested with the OpenAI API, only with MLX Local Whisper and Local Whisper:
- Reelection: #75 “On Not Seeking Re-Election” (40:23)
- Crisis: #88 “A Crisis of Confidence” (32:33)
Code
I used the following code for each evaluation, running it in a Jupyter notebook within VS Code. The code is organized into sections based on the method of accessing Whisper. Transcription time is measured with Python’s time library, by taking the difference between timestamps captured immediately before and after the transcript is generated.
MLX Whisper
# libraries
# note: here "whisper" is the MLX port of Whisper (e.g. from Apple's
# mlx-examples repo), not the openai-whisper package from PyPI
import whisper
import time
# define audio
audio = "/path/to/audio"
# define transcription function with timer
def mlx_whisper(audio):
    start = time.time()
    text = whisper.transcribe(audio)["text"]
    end = time.time()
    # print the first 100 characters of the transcript
    print(text[:100])
    print(f"Time taken for {audio}: {end - start}")
# run transcription function over audio
mlx_whisper(audio)
Local Whisper (non-MLX)
# libraries
import whisper  # openai-whisper from PyPI
import time
# define audio
audio = "/path/to/audio"
# define transcription function with timer
def local_whisper(audio):
    start = time.time()
    # note: the model load is included in the measured time
    model = whisper.load_model("base")
    result = model.transcribe(audio)
    text = result["text"]
    end = time.time()
    # print the first 100 characters of the text
    print(text[:100])
    print(f"Time taken for {audio}: {end - start}")
# run transcription function over audio
local_whisper(audio)
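Note that load_model sits inside the timed region, so the reported time includes model loading as well as inference. A minimal variant (hypothetical, not part of the experiment) that isolates inference time would hoist the load out:
# load the model once, outside the timed region
model = whisper.load_model("base")
def local_whisper_inference_only(audio):
    start = time.time()
    text = model.transcribe(audio)["text"]
    end = time.time()
    print(text[:100])
    print(f"Inference-only time for {audio}: {end - start}")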
OpenAI Transcriptions Endpoint
Full documentation is available under Speech to text in the OpenAI API docs.
# libraries
import time  # needed for the timer below
from openai import OpenAI
# define audio
audio = "/path/to/audio"
# set up client
client = OpenAI(api_key="...")
start = time.time()
# open the audio file in binary mode and send it to the transcriptions endpoint
with open(audio, "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
end = time.time()
# print time taken
print(f"Time taken for {audio}: {end - start}")
After running each of the audio files through each of these functions, I collected all my data in a dataframe. This will be shown and discussed in the Results section below.
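For reference, a minimal sketch of how such a dataframe might be assembled with pandas; the column names and values here are illustrative placeholders, not the actual measurements:
import pandas as pd
# illustrative structure only; the seconds column would hold the measured times
results = pd.DataFrame({
    "audio": ["MLK", "Pearl Harbor", "Berlin", "Impeach"],
    "experiment": ["MLX Local Whisper"] * 4,
    "seconds": [0.0, 0.0, 0.0, 0.0],
})
print(results.groupby("experiment")["seconds"].mean())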
Results
On average, over all files tested:
- MLX Local Whisper is ~50.85% faster than Local Whisper.
- Local Whisper is ~25.66% faster than OpenAI API.
Figure 0 shows the data that the following charts and graphs are based on.
When grouped by experiment, i.e., by model (OpenAI API vs. Local Whisper vs. MLX Local Whisper), we can see the disadvantage of the OpenAI API’s 25 MB file-size cutoff: the transcription performance of larger files cannot be evaluated at all. However, when we consider only the files examined across all three experiments, the OpenAI API and Local Whisper aren’t that different. On average, across all audio files excluding “Crisis” and “Reelection,” Local Whisper is approximately 25.66% faster than the OpenAI API.
- Pearl Harbor (72.37%): Local Whisper transcribes “Pearl Harbor” approximately 72.37% faster than OpenAI API.
- MLK (-12.86%): Local Whisper transcribes “MLK” approximately 12.86% slower than OpenAI API.
- Berlin (36.69%): Local Whisper transcribes “Berlin” approximately 36.69% faster than OpenAI API.
- Impeach (6.42%): Local Whisper transcribes “Impeach” approximately 6.42% faster than OpenAI API.
- Mean Percentage Difference (25.66%): averaged over these four files, Local Whisper is approximately 25.66% faster than OpenAI API.
Transcription Time v. Speech Length
MLX Local Whisper is the clear winner in transcription speed, and the gap is most evident on longer audio files, as seen in Fig 2.1. The upper right-hand corner of the graph highlights this, particularly for “Crisis” and “Reelection” (32 and 40 minutes, respectively), where the transcription time roughly halves when moving from Local Whisper to MLX Local Whisper.
It’s also worth noting the OpenAI API’s limitation with files of comparable size: the API currently enforces a 25 MB upload limit. This becomes apparent for audio files approaching or exceeding ~1000 seconds, whose file size exceeds the API’s capacity. In such scenarios, MLX Local Whisper is the go-to solution for efficient, timely transcription of larger files.
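In practice, you can route files by size before picking a backend. A minimal, hypothetical sketch (the 25 MB threshold comes from the API docs; everything else is illustrative):
import os
API_LIMIT_BYTES = 25 * 1024 * 1024  # OpenAI's 25 MB upload limit
def can_use_openai_api(audio):
    # the API rejects uploads larger than 25 MB, so check before sending
    return os.path.getsize(audio) <= API_LIMIT_BYTES
if can_use_openai_api(audio):
    print("within the API limit: any of the three configurations works")
else:
    print("over the API limit: use MLX Local Whisper or Local Whisper")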
Transcription Time v. Speech File Size
Fig 2.2 diverges from the previous representation by focusing on the audio file size (measured in MB) in relation to the time required for transcription, as opposed to the duration of the audio file in seconds. While my knowledge about audio files is limited, I speculate that certain patterns observed, particularly in the lower left-hand corner, may be attributed to variations in file quality.
Intriguingly, some audio files show nearly identical transcription times for Local Whisper and OpenAI, while others show a similar pattern for Local Whisper and MLX Local Whisper. Strikingly, for the same audio, the OpenAI API can be significantly slower. One plausible hypothesis for this inconsistency is variability in API response times.
It remains unclear exactly why these disparities exist; exploring underlying factors such as file quality and API response consistency could provide valuable insight into the observed patterns.
Results by Audio File
How much faster is MLX Local Whisper over non-MLX Local Whisper? About 50.85% faster. The table below shows the exact percentage difference in the time it takes to transcribe an audio file when switching from Local Whisper to MLX Local Whisper:
+--------------+-----------------------+
| Audio | Percentage Difference |
+--------------+-----------------------+
| Pearl Harbor | -12.26% |
| MLK | -44.65% |
| Berlin | -67.86% |
| Impeach | -66.15% |
| Crisis | -57.39% |
| Reelection | -56.80% |
+--------------+-----------------------+
Broken down by audio file, we see that:
- Pearl Harbor (-12.26%): MLX Local Whisper transcribes “Pearl Harbor” approximately 12.26% faster than Local Whisper.
- MLK (-44.65%): MLX Local Whisper transcribes “MLK” approximately 44.65% faster than Local Whisper.
- Berlin (-67.86%): MLX Local Whisper transcribes “Berlin” approximately 67.86% faster than Local Whisper.
- Impeach (-66.15%): MLX Local Whisper transcribes “Impeach” approximately 66.15% faster than Local Whisper.
- Crisis (-57.39%): MLX Local Whisper transcribes “Crisis” approximately 57.39% faster than Local Whisper.
- Reelection (-56.80%): MLX Local Whisper transcribes “Reelection” approximately 56.80% faster than Local Whisper.
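For clarity, the negative values follow the usual percentage-difference convention relative to the Local Whisper baseline. A quick sketch of the calculation (the example times are placeholders, not measured values):
def pct_difference(baseline_seconds, new_seconds):
    # negative result means the new configuration is faster than the baseline
    return (new_seconds - baseline_seconds) / baseline_seconds * 100
# e.g. if Local Whisper took 100 s and MLX Local Whisper took 43 s:
print(pct_difference(100, 43))  # -57.0, i.e. ~57% faster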
Considerations & Outlook
This experimentation does not take into consideration the quality of the output, only the speed at which the output was produced. An improved experiment would take this into consideration.
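For instance, given reference transcripts, output quality could be scored with word error rate (WER). A minimal sketch using the third-party jiwer library (my choice here; any WER implementation would do):
from jiwer import wer  # pip install jiwer
reference = "i have a dream that one day this nation will rise up"
hypothesis = "i have a dream that one day this nation will rise"
# lower is better; 0.0 means a perfect match against the reference
print(wer(reference, hypothesis))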
Further, it’s unclear whether OpenAI’s whisper-1 is comparable to the open-source Whisper’s base model. I also realize in retrospect that the default model for Whisper in MLX is tiny (see line 78 of the MLX example’s source). So it’s definitely not a 100% fair comparison. Despite this, I think the main conclusion still stands: MLX is worlds faster and well worth using.
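A fairer re-run could pin both local configurations to the same checkpoint. On the non-MLX side that’s a one-line change to the earlier snippet (the MLX example’s default would need a matching edit in its own source, as noted above):
# load the same model size as the MLX default for an apples-to-apples test
model = whisper.load_model("tiny")
result = model.transcribe(audio)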
Conclusion
In exploring Whisper’s three configurations (deployment on an M1 Mac with and without MLX, alongside the OpenAI API), distinct patterns in transcription speed emerged. MLX Local Whisper significantly outperforms its non-MLX counterpart, transcribing ~50.85% faster on average, with the gap most evident on lengthier audio files.
Notably, the OpenAI API, constrained by a 25 MB file upload limit, cannot handle larger audio files at all, highlighting the advantage of local models in such scenarios. Comparing Local Whisper against the OpenAI API, Local Whisper is on average ~25.66% faster across the audio files that fall within the size constraint, though one file (“MLK”) bucked the trend.
Surely it’s only a matter of time before we see OpenAI expand the file size limit, but until then, running Whisper in MLX is the best option by a long shot.