Using OpenAI’s Whisper to Transcribe Real-time Audio

Jeremy Savage
4 min read · Apr 12, 2024


The availability of advanced technology and tools, AI in particular, is increasing at an ever-faster rate. In this post, I am going to see just how easy it is to create an AI-powered real-time speech-to-text program.


With the release of Whisper in September 2022, it is now possible to run audio-to-text models locally on your devices, powered by either a CPU or a GPU.

In this brief guide, I will show you how to take audio from your microphone and convert it into text in real-time.

I am using an M1 MacBook Pro and was having trouble utilising the GPU properly with the standard Python libraries, so I decided to use whisper.cpp, a high-performance C++ inference implementation of Whisper that treats Apple Silicon as a ‘first-class citizen’. Thanks to Georgi Gerganov and the other 293 contributors, as this was the only way I could find to successfully run Whisper models on the GPU of an M1 Mac. There is a bunch of other cool functionality built into the tool too, which is worth checking out!

First I cloned the git repository by following the instructions here, and then I created a simple Python wrapper around the repository’s main binary.
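For reference, the setup looked roughly like this (a sketch based on the whisper.cpp README; `base.en` is just one of several model options you could download):

```shell
# Clone the repository and build the main binary
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make

# Download a ggml model for inference (base.en is a reasonable starting point)
bash ./models/download-ggml-model.sh base.en
```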

import subprocess

def transcribe_to_txt(input_filename: str, output_filename: str):
    print('Running whisper transcription...')
    # Compose the command: -otxt writes a .txt file, -of sets the output
    # filename, -np suppresses whisper.cpp's own printing
    command = ['./main', '-f', input_filename, '-otxt', '-of', output_filename, '-np']

    # Execute the command and surface any failure
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stderr)

I then break the audio into 5-second chunks using sounddevice.InputStream, giving it a callback that runs every 5 seconds so we can process each chunk of audio with Whisper.

with sd.InputStream(callback=callback, dtype='int16', channels=1, samplerate=16000, blocksize=16000*5):
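The blocksize arithmetic is worth spelling out: at a 16 kHz sample rate (the rate Whisper models expect), 5 seconds of mono audio works out to 80,000 frames per callback invocation:

```python
SAMPLE_RATE = 16000   # Hz; the sample rate Whisper models expect
CHUNK_SECONDS = 5     # length of each audio chunk handed to the callback

# Frames delivered to each callback invocation
blocksize = SAMPLE_RATE * CHUNK_SECONDS
print(blocksize)  # 80000
```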

We then define our callback to write the 5-second audio chunk to a temporary file, process it with whisper.cpp to extract the text from the audio, and print the result to the console.

import os
import tempfile
import wave

def callback(indata, frames, time, status):
    # Report any stream status flags (e.g. input overflow)
    if status:
        print(status)

    # Create a tempfile to save the audio to, with autodeletion
    with tempfile.NamedTemporaryFile(delete=True, suffix='.wav', prefix='audio_', dir='.') as tmpfile:
        # Save the 5-second audio chunk to a .wav file
        with wave.open(tmpfile.name, 'wb') as wav_file:
            wav_file.setnchannels(1)      # Mono audio
            wav_file.setsampwidth(2)      # 16-bit audio
            wav_file.setframerate(16000)  # Sample rate
            wav_file.writeframes(indata)

        # Prepare the output filename (whisper.cpp appends .txt itself)
        output_filename = tmpfile.name.replace('.wav', '')

        # Transcribe the audio to text using our whisper.cpp wrapper
        transcribe_to_txt(tmpfile.name, output_filename)

        # Print the transcribed text
        with open(output_filename + '.txt', 'r') as file:
            print(file.read())

        # Clean up the transcription file
        os.remove(output_filename + '.txt')

We purposely save the files to disk, since we may want to do more with this data further down the line; for now, we simply clean up the .txt files with the os module.
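If you do want to keep the text around, one simple option (a sketch of my own, not part of the script above) is to append each chunk’s transcription to a running log before deleting the per-chunk file:

```python
def append_transcript(text: str, log_path: str = 'transcript.log') -> None:
    """Append one chunk's transcription to a running log file."""
    with open(log_path, 'a') as log:
        log.write(text.rstrip() + '\n')

# Each 5-second chunk's text accumulates in transcript.log
append_transcript('okay so this is just some audio')
append_transcript('so just a few five second chunks')
```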

Now we have enough code to run:

import sounddevice as sd

try:
    # Start recording with a rolling 5-second buffer
    with sd.InputStream(callback=callback, dtype='int16', channels=1, samplerate=16000, blocksize=16000*5):
        print("Recording... Press Ctrl+C to stop.")
        while True:
            pass
except KeyboardInterrupt:
    print('Recording stopped.')

When running this we get:

Recording... Press Ctrl+C to stop.
Running whisper transcription...
Successful run.
(keyboard clicking)
[BLANK_AUDIO]

Running whisper transcription...
Successful run.
okay so this is just some audio that's

Running whisper transcription...
Successful run.
I'm just going to show that it's happening in real time, I'm just going to record a few of these things.

Running whisper transcription...
Successful run.
So just a few five second chucks.

The transcription happens fast, but don’t just take my word for it: I used the timeit module to measure how quickly the transcriptions happen:

import timeit

def test_function():
    transcribe_to_txt('audio_xkf1lhwh.wav', 'output.txt')

execution_time = timeit.timeit('test_function()', globals=globals(), number=100)
print(f"Average execution time: {execution_time / 100:.5f} seconds")

The output:

Successful run.
Running whisper transcription...
So this is running successfully. What am I trying to do here? I am trying...

Successful run.
Running whisper transcription...
So this is running successfully. What am I trying to do here? I am trying...

Successful run.
Average execution time: 0.48541 seconds

As you can see, the transcription is exceptionally fast, taking less than 0.5 seconds to process 5 seconds of audio containing speech.
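Put another way, we can express this as a real-time factor (my own framing, not something the script reports): processing time divided by audio duration, where anything below 1.0 keeps up with live audio.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; values < 1.0 keep up with live audio."""
    return processing_seconds / audio_seconds

rtf = real_time_factor(0.48541, 5.0)
print(f"RTF: {rtf:.3f}")  # roughly 0.097, i.e. about 10x faster than real time
```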

Conclusion

Building this was surprisingly easy, with the largest challenge coming from the lack of support for the new Apple ARM devices, although I discovered a great C++ library that made it all possible.

The tech is surprisingly fast and easy to use, and it certainly has a lot of use cases, especially when combined with other technologies such as LLMs, where the data can be used in various ways, from call centres to error monitoring on factory shop floors.

I hope this guide was useful!
