Speech-to-Text | Quickly get a transcription of a large audio file in any language with “Faster-Whisper”

Pierre Guillou
5 min read · Dec 4, 2023


Credit: ChatGPT4 with the prompt “An image showing a cozy, brightly colored office with a diverse array of decorations and plants. A professional of any gender and descent is sitting at a wooden desk, working on a laptop with a large screen displaying the ‘Faster Whisper’ software interface. The software is actively transcribing a large audio file into text, with multiple languages being displayed side by side. The audio file is represented by a clear, simple waveform on the screen. The environment is modern yet approachable, with a large window in the background showing a sunny day outside, and the room is filled with books and personal touches to add warmth and color.”

The Dawn of a New Era in Speech-to-Text Technology: November 2023 will remain a landmark in the realm of open-source Speech-to-Text technology. With the release of three state-of-the-art models (Whisper Large v3, multilingual, from OpenAI; SeamlessM4T v2, multilingual, from Meta; and Distil-Whisper, English, from Hugging Face), the field has taken a significant leap forward. In this first post, let's explore how to quickly use Whisper Large v3 through the faster-whisper library to obtain transcriptions of large audio files in any language. A notebook is provided.

[ EDIT 06/12/2023 ] If you need an mp3 file to run the code in this post, you can use my Web APP YouTube Video-to-Audio. Check the post Video-to-Audio | A notebook and Web APP to get mp3 audio file from any YouTube video to get the link.

[ EDIT 08/12/2023 ] See also the follow-up post: Speech-to-Text | Get transcription WITH SPEAKERS from large audio file in any language (OpenAI Whisper + NeMo Speaker Diarization)

About the Speech-to-Text models

Whisper Large v3 (multilingual) from OpenAI

Whisper Large v3, the latest iteration from OpenAI, stands out with its multilingual capabilities. This model represents a significant advancement in automatic speech recognition (ASR), trained on hundreds of thousands of hours of multilingual data. It's not just the scale of the training that's impressive, but also the model's ability to handle various accents, dialects, and even noisy backgrounds with remarkable accuracy. This makes Whisper Large v3 an ideal tool for global businesses and multilingual environments, transcending language barriers with ease. (model on the Hugging Face models hub)
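If you want to try this checkpoint directly from the hub, here is a minimal sketch (not from this post) using the transformers pipeline; "audio.mp3" is a placeholder file name.

import torch
from transformers import pipeline

# "openai/whisper-large-v3" is the hub checkpoint linked above;
# chunk_length_s enables long-form transcription by chunking the audio
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,
    device=0 if torch.cuda.is_available() else "cpu",
)

# return_timestamps adds start/end times for each transcribed chunk
result = asr("audio.mp3", return_timestamps=True)
print(result["text"])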

SeamlessM4T v2 (multilingual) from Meta

Meta’s SeamlessM4T v2 is a versatile marvel in language processing technology. Uniquely designed to handle multiple tasks, it excels in Speech-to-speech translation (S2ST), Speech-to-text translation (S2TT), Text-to-speech translation (T2ST), Text-to-text translation (T2TT), and Automatic speech recognition (ASR). The multilingual support of SeamlessM4T v2 breaks down language barriers, fostering global connectivity in business, education, and social interactions. (model on the Hugging Face models hub)
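As an aside (this model is not used in the rest of this post), here is a minimal sketch of the documented transformers API for SeamlessM4T v2, shown for the speech-to-text direction; "audio.wav" is a placeholder and the sketch assumes a mono recording.

import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2Model

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

# Load the audio and resample it to the 16 kHz the model expects
waveform, sample_rate = torchaudio.load("audio.wav")  # placeholder file
waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(
    audios=waveform.squeeze(0).numpy(),  # assumes a mono waveform
    sampling_rate=16_000,
    return_tensors="pt",
)
# tgt_lang picks the output language; generate_speech=False returns text tokens
tokens = model.generate(**inputs, tgt_lang="eng", generate_speech=False)
print(processor.decode(tokens[0].tolist()[0], skip_special_tokens=True))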

Distil-Whisper (English) from Hugging Face

Hugging Face, known for their user-friendly and accessible AI models, has released Distil-Whisper, an English-focused model. It is a distilled version of the Whisper v2 model that is 6 times faster, 49% smaller, and performs within 1% word error rate (WER) of Whisper on out-of-distribution evaluation sets. (model on the Hugging Face models hub)
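As a quick sketch (the checkpoint id "distil-whisper/distil-large-v2" is the hub name at the time of writing), Distil-Whisper works with the same transformers pipeline as the Whisper Large v3 example above:

from transformers import pipeline

# English-only distilled checkpoint; 15-second chunks follow the
# model card's recommendation for long-form audio
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    chunk_length_s=15,
)

print(asr("audio.mp3")["text"])  # "audio.mp3" is a placeholder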

Implications and Future Prospects

These releases mark a significant moment in speech-to-text technology, showcasing the rapid advancements in AI. As we look to the future, we can expect these models to evolve and integrate more deeply into various aspects of our digital lives. They could transform customer service, healthcare, legal transcription, and more, making interactions more efficient and breaking down language barriers.

Code | Use of Whisper Large v3 via the faster-whisper library

faster-whisper is a reimplementation of OpenAI’s Whisper model using CTranslate2, which is a fast inference engine for Transformer models.

This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

We can easily use Whisper Large v3 as the base model for faster-whisper!

Any large audio file, any language

Whether your audio is an mp3 or a wav file, you don't need to segment it into smaller files. Simply run the faster-whisper code (on CPU or GPU) on your audio file and obtain a transcription that respects silences, includes timestamps, and follows correct written style (capital letters, periods, commas, etc.)!

Basic Python code

[ source ]

# !pip install faster-whisper

from faster_whisper import WhisperModel

model_size = "large-v3"

# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")

# or run on GPU with INT8
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")

# Transcribe the whole file; the built-in VAD filter skips silences
# longer than 500 ms
segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

# Print each segment with its start and end timestamps
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
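Note that segments is a generator: the transcription only starts when you iterate over it, for example with the for loop above or by collecting the results with list(segments).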

Notebook to get styled transcript

In this notebook, you’ll see how to get your timestamped transcription into a DataFrame and then into a styled written paragraph, split sentence by sentence.
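Before opening the notebook, here is a minimal sketch of the idea (not the notebook's exact code; the naive regex sentence split is an assumption): collect the segments returned by model.transcribe() into a pandas DataFrame, then join the text and split it sentence by sentence.

import re
import pandas as pd

# "segments" is the generator returned by model.transcribe() above
rows = [
    {"start": seg.start, "end": seg.end, "text": seg.text.strip()}
    for seg in segments
]
df = pd.DataFrame(rows)
print(df.head())

# Join all segments, then split naively on sentence-ending punctuation
full_text = " ".join(df["text"])
sentences = re.split(r"(?<=[.!?])\s+", full_text)
print("\n\n".join(sentences))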

As an example (see notebook), here are the first sentences transcribed from the lesson 1 video of the DeepLearning.AI course about “Building and Evaluating Advanced RAG Applications” (audio file: lesson1_of_RAG_course_with_DeepLearningAI.mp3).

Retrieval Augmented Generation, or RAG, has become a key method for getting LMs answered questions over a user's own data.

But to actually build and productionize a high-quality RAG system, it costs a lot to have effective retrieval techniques, to give the LM highly relevant context to generate his answer, and also to have an effective evaluation framework to help you efficiently iterate and improve your RAG system, both during initial development and during post-deployment maintenance.

This course covers two advanced retrieval methods, sentence window retrieval and auto-merging retrieval, that deliver a significantly better context to the LM than simpler methods.

It also covers how to evaluate your LM question-answering system with three evaluation metrics, context relevance, drowdeness, and answer relevance.

I'm excited to introduce Jerry Liu, co-founder and CEO of LarmRatex and Anupam Data, co-founder and CEO of LarmRatex.

For a long time, I've enjoyed following Jerry and LarmRatex on social media and getting tips on evolving RAG practices, so I'm looking forward to him teaching this body of knowledge more systematically here.

And Anupam has been a professor at CMU and has done research for over a decade on trustworthy AI and how to monitor, evaluate, and optimize AI app effectiveness.

Thanks, Andrew.

It's great to be here.

Great to be with you, Andrew.

(...)

We can see that Whisper Large v3 made only a few mistakes, on proper names (LarmRatex instead of LlamaIndex, Anupam Data instead of Anupam Datta).

Et voilà :-)

About the author: Pierre Guillou is an AI consultant (Generative AI & Deep Learning) in Brazil and France. Contact him via his LinkedIn profile.
