Transcribing Audio with Python and Distil Whisper

WhisperJax is Dead. Long live Distil Whisper for ASR

Mark Craddock
Prompt Engineering

--

Distil-Whisper is a distilled variant of OpenAI’s original Whisper model. The distilled model is 5.8 times faster and has 51% fewer parameters than its predecessor. These metrics alone make it a game-changer for real-world applications that require quick responses.

But it’s not just about speed and size. When Distil-Whisper was put to the test, it performed within 1% Word Error Rate (WER) of the original model on out-of-distribution data in a challenging zero-shot transfer setting. This demonstrates that the model retains a high level of accuracy despite its reduced size.

Additionally, while the Whisper model is known for its robustness under tough acoustic conditions, Distil-Whisper goes a step further by being less susceptible to hallucination errors on long-form audio.

The Code

See the other parts of this series for data collection.

Runtime Checks:

This section ensures we are running on a GPU for efficient computation, which is especially beneficial for deep learning tasks. This code was written for an A100 GPU, but I’ve put in comments where you can change it to run on a CPU.

try:
    gpu_info = !nvidia-smi
except Exception:
    print('No GPU')
else:
    # Check and print GPU details
    gpu_info = '\n'.join(gpu_info)
    if gpu_info.find('failed') >= 0:
        print('Not connected to a GPU')
    else:
        print(gpu_info)

Memory Checks:

Check how much RAM is available, to make sure we have enough memory for the task.

from psutil import virtual_memory

ram_gb = virtual_memory().total / 1e9
# Check and print RAM details
print(f'Your runtime has {ram_gb:.1f} GB of available RAM')

Setting Up Google Drive Integration:

Here, Google Drive is mounted into the Colab environment so we can read and write files.

from google.colab import drive
drive.mount('/content/gdrive')

Directory Setup for Knowledge Base, Audio, and Transcripts:

It’s essential to keep data organised. Here we’re defining the paths for different categories of data.

import os

KB_FOLDER = "/content/gdrive/Shareddrives/AI/WardleyKB"
#... Other directories defined ...

# Check and create directories if they don't exist
os.makedirs(KB_FOLDER, exist_ok=True)

Setting Up the Transcription Model:

Install necessary libraries and load a pretrained model.

!pip install -q --upgrade transformers accelerate
#... Other installations ...

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU if one is available; change to "cpu" to run on a CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

Configuring the Transcription Pipeline:

Define the ASR (Automatic Speech Recognition) pipeline that will be used to transcribe audio files.

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)
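
Note that the pipeline object is assigned to asr_pipeline rather than reusing the name pipeline, so the transformers helper we imported isn’t shadowed. Before processing a whole library of audio, it’s worth a quick smoke test on a single file; sample.wav below is a hypothetical stand-in for one of your own files:

# Quick smoke test; "sample.wav" is a placeholder filename
result = asr_pipeline("sample.wav", return_timestamps=True)
print(result["text"])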

Detailed Breakdown of the Transcription Logic

This section of the code deals with the actual transcription of the audio files. The primary focus is to ensure that audio files are transcribed if and only if they haven’t been processed before. Let’s break it down step by step; a consolidated sketch of the whole loop follows at the end of the breakdown.

Checking for Existing Transcripts:

This line checks if the transcript for the given audio file already exists. The logic is designed to avoid redundant processing, which can be time-consuming and resource-intensive.

if not os.path.isfile(transcript_filename):

Transcribing Audio Files:

This checks if the audio file itself exists.

if os.path.isfile(audio_filename):

Measuring Transcription Time:

It’s often useful to know how long the transcription takes, especially for optimisation purposes. These lines record the start time, perform the transcription, and then calculate the total time taken. Rounding up ensures a cleaner display and better readability.

start = time.time()
transcription = transcribe_file(audio_filename)
runtime = time.time() - start
rounded_runtime = math.ceil(runtime)  # Round up to the nearest second
print("Runtime: ", rounded_runtime, " seconds")

Transcribe the Audio Files:

This function uses the ASR pipeline defined above to transcribe an audio file, then passes the result through replace_wordly_with_wardley to tidy the transcript.

def transcribe_file(filename):
    print(f"Transcribing new file: {filename}")
    transcription = asr_pipeline(filename, return_timestamps=True)
    transcription = replace_wordly_with_wardley(transcription)
    return transcription
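
replace_wordly_with_wardley isn’t shown in this part of the series; the name suggests it fixes a common mishearing of “Wardley” in the transcripts. As a rough guide only, here’s a minimal sketch of what such a helper might look like, assuming a simple substitution over the transcript text (the exact word list is an assumption):

def replace_wordly_with_wardley(transcription):
    # Assumed behaviour: swap likely mishearings for "Wardley" in the text
    for wrong in ("Wordly", "wordly", "Worldly", "worldly"):
        transcription['text'] = transcription['text'].replace(wrong, "Wardley")
    return transcription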

Previewing the Transcription:

This provides a quick glimpse of the first 100 characters of the transcribed text. Useful for a quick sanity check.

print(transcription['text'][:100])

Saving the Transcription:

Once transcribed, the text is stored in JSON format. This structure retains the transcribed text and associated metadata, such as the start and end timestamps for each chunk of speech (returned because we passed return_timestamps=True).

with open(transcript_filename, 'w') as f:
    f.write(json.dumps(transcription))

Handling Missing Audio Files:

If the audio file is missing, the script provides a warning. This is especially useful for debugging or ensuring data integrity.

else:
    print(f"File does not exist: {audio_filename}")
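
Putting those fragments together, here’s a minimal sketch of the full per-file loop. The AUDIO_FOLDER and TRANSCRIPTS_FOLDER names and the .json naming convention are assumptions standing in for the directories defined earlier:

import json
import math
import os
import time

# Hypothetical paths standing in for the directories defined earlier
AUDIO_FOLDER = os.path.join(KB_FOLDER, "audio")
TRANSCRIPTS_FOLDER = os.path.join(KB_FOLDER, "transcripts")

for name in os.listdir(AUDIO_FOLDER):
    audio_filename = os.path.join(AUDIO_FOLDER, name)
    transcript_filename = os.path.join(
        TRANSCRIPTS_FOLDER, os.path.splitext(name)[0] + ".json"
    )

    if not os.path.isfile(transcript_filename):  # skip already-processed files
        if os.path.isfile(audio_filename):
            start = time.time()
            transcription = transcribe_file(audio_filename)
            runtime = time.time() - start
            print("Runtime: ", math.ceil(runtime), " seconds")
            print(transcription['text'][:100])  # quick sanity check
            with open(transcript_filename, 'w') as f:
                f.write(json.dumps(transcription))
        else:
            print(f"File does not exist: {audio_filename}")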

In essence, this script transforms audio content into a textual format, making it searchable and more accessible. With Google Drive integration and the Distil-Whisper model, it offers a robust solution for creating a Wardley Mapping knowledge base.
