Speech-to-Text | Get transcription WITH SPEAKERS from large audio file in any language (OpenAI Whisper + NeMo Speaker Diarization)

3 min read · Dec 7, 2023

Image credit: the image was created by ChatGPT4 with the prompt “A cozy and warm office scene with a professional of any gender and descent, working on a computer. The computer screen displays two software interfaces, ‘Whisper’ and ‘NeMo’, each showing a large audio file being transcribed into text with speaker detection. On each screen, include an image representing a speaker, clearly indicating the transcription process involving multiple speakers. The room is warmly lit and inviting, decorated with plants, framed pictures, and a bookshelf, blending advanced technology with a homey office environment.”

In a previous post, I showed how Whisper Large v3 (OpenAI’s newest multilingual speech-to-text model as of November 2023) can be used to quickly get a transcription of a large audio file in any language. In this post, I share a notebook with the code to go one step further: the transcription with speakers!

(Optional) Get an mp3 audio file from a YouTube video

If you don’t have an mp3 audio file to run the code shown in this article, you can easily download one from any YouTube video. Just read the blog post “Video-to-Audio | A notebook and Web APP to get mp3 audio file from any YouTube video”.


For this article, I used the web app from that blog post to get the mp3 audio file of the YouTube video “Andrew Ng: AI regulation, education, and where we are headed for healthcare and beyond”.


(Optional) Transcription WITHOUT speaker diarization

If you don’t need speaker labels, there is already a blog post about that (with an associated notebook): “Speech-to-Text | Quickly get a transcription of a large audio file in any language with Faster-Whisper”.


Transcription WITH speaker diarization

The following image is a screenshot from the end of the notebook. It shows the kind of speaker-labeled transcription you can expect for your own audio file.

Note: the notebook also creates an srt file of the transcript (with timestamps).

result of the code: transcription with speakers
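To give an idea of what such an srt file contains, here is a minimal Python sketch that turns speaker-labeled segments into SRT text. It is not the notebook's code: the function names (`format_srt_timestamp`, `segments_to_srt`) and the segment keys (`start`, `end`, `speaker`, `text`) are assumptions made for this illustration.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from segments (hypothetical dict layout).

    Each segment is a dict with 'start' and 'end' in seconds,
    a 'speaker' label and the spoken 'text'.
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_timestamp(seg["start"])
        end = format_srt_timestamp(seg["end"])
        # One SRT cue: counter, time range, then the subtitle text.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['speaker']}: {seg['text']}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


example_segments = [
    {"start": 0.0, "end": 2.5, "speaker": "Speaker 0", "text": "Welcome to the show."},
    {"start": 2.5, "end": 4.0, "speaker": "Speaker 1", "text": "Thanks for having me."},
]
print(segments_to_srt(example_segments))
```

The resulting text can be saved with a `.srt` extension and loaded as subtitles in most video players.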

How do you get this result? Just run the following notebook on Google Colab :-)

Note: I only made a small update to the notebook Whisper_Transcription_+_NeMo_Diarization.ipynb by Mahmoud Ashraf (all of its text was kept), who deserves full credit.

language-models/speech_to_text_transcription_with_speakers_Whisper_Transcription_+_NeMo_Diarization…

pre-trained Language Models. Contribute to piegu/language-models development by creating an account on GitHub.

github.com

[ Explanation from Mahmoud Ashraf ] This notebook combines Whisper ASR capabilities with Voice Activity Detection (VAD) and Speaker Embedding to identify the speaker for each sentence in the transcription generated by Whisper. First, the vocals are extracted from the audio to increase the speaker embedding accuracy, then the transcription is generated using Whisper, then the timestamps are corrected and aligned using WhisperX to help minimize diarization error due to time shift. The audio is then passed into MarbleNet for VAD and segmentation to exclude silences, TitaNet is then used to extract speaker embeddings to identify the speaker for each segment, the result is then associated with the timestamps generated by WhisperX to detect the speaker for each word based on timestamps and then realigned using punctuation models to compensate for minor time shifts.
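To make the last part of this explanation more concrete, here is a minimal Python sketch of how each word can be matched to a speaker from timestamps alone. It is a simplification and not the notebook's code: the function names (`assign_speakers`, `group_by_speaker`), the dictionary keys and the "largest time overlap" rule are assumptions chosen for illustration, and the punctuation-based realignment is left out.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def gap(word, turn):
    """Distance from the word midpoint to a speaker turn (0 if inside)."""
    mid = (word["start"] + word["end"]) / 2
    if mid < turn["start"]:
        return turn["start"] - mid
    if mid > turn["end"]:
        return mid - turn["end"]
    return 0.0


def assign_speakers(words, turns):
    """Label each word with the speaker turn it overlaps the most.

    words: dicts with 'word', 'start', 'end' (word-level timestamps).
    turns: dicts with 'speaker', 'start', 'end' (diarization output).
    A word that overlaps no turn falls back to the nearest turn.
    """
    labeled = []
    for word in words:
        best = max(
            turns,
            key=lambda t: (
                overlap(word["start"], word["end"], t["start"], t["end"]),
                -gap(word, t),
            ),
        )
        labeled.append({**word, "speaker": best["speaker"]})
    return labeled


def group_by_speaker(labeled_words):
    """Merge consecutive words of the same speaker into one segment."""
    segments = []
    for w in labeled_words:
        if segments and segments[-1]["speaker"] == w["speaker"]:
            segments[-1]["text"] += " " + w["word"]
            segments[-1]["end"] = w["end"]
        else:
            segments.append(
                {"speaker": w["speaker"], "start": w["start"],
                 "end": w["end"], "text": w["word"]}
            )
    return segments


words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "there.", "start": 0.5, "end": 0.9},
    {"word": "Hi!", "start": 1.2, "end": 1.5},
]
turns = [
    {"speaker": "Speaker 0", "start": 0.0, "end": 1.0},
    {"speaker": "Speaker 1", "start": 1.0, "end": 2.0},
]
for segment in group_by_speaker(assign_speakers(words, turns)):
    print(f"{segment['speaker']}: {segment['text']}")
```

This also shows why accurate word timestamps matter: if the timestamps drift, words near a speaker change end up in the wrong turn, which is the error the alignment step is meant to reduce.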

Et voilà :-)

And after…

With a speaker-labeled transcription like this, you can now apply any Deep Learning model. It depends on what you want:

Translation into any language
Topic and/or named entity detection
Classification of texts
Summarization
Extraction of all key information
…

And what if you used an LLM (Large Language Model) like ChatGPT, Gemini or Llama2? ;-)

About the author: Pierre Guillou is an AI consultant (Generative AI & Deep Learning) in Brazil and France. Contact him via his LinkedIn profile.