Speech-to-Text | Get transcription WITH SPEAKERS from large audio file in any language (OpenAI Whisper + NeMo Speaker Diarization)

Pierre Guillou
3 min read · Dec 7, 2023

Image credit: the image was created by ChatGPT4 with the prompt “A cozy and warm office scene with a professional of any gender and descent, working on a computer. The computer screen displays two software interfaces, ‘Whisper’ and ‘NeMo’, each showing a large audio file being transcribed into text with speaker detection. On each screen, include an image representing a speaker, clearly indicating the transcription process involving multiple speakers. The room is warmly lit and inviting, decorated with plants, framed pictures, and a bookshelf, blending advanced technology with a homey office environment.”

In a previous post, I showed how Whisper Large v3 (OpenAI’s newest multilingual speech-to-text model as of November 2023) can be used to quickly transcribe a large audio file in any language. In this post, I share a notebook with the code to get more: the transcription with speakers!

(Option) Get an mp3 audio file from a YouTube video

If you don’t have an mp3 audio file to run the code shown in this article, you can easily download one from any YouTube video. Just read the blog post “Video-to-Audio | A notebook and Web APP to get mp3 audio file from any YouTube video”.

For this article, I used the Web APP from that blog post to get the mp3 audio file of the YouTube video “Andrew Ng: AI regulation, education, and where we are headed for healthcare and beyond”.

YouTube video “Andrew Ng: AI regulation, education, and where we are headed for healthcare and beyond”

(Option) Transcription WITHOUT speaker diarization

If you don’t want speakers, there’s already a blog post about it (with an associated notebook): “Speech-to-Text | Quickly get a transcription of a large audio file in any language with Faster-Whisper”.
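For readers who just want a feel for that speaker-free route, here is a minimal sketch. It is not the code of the notebook mentioned above. It assumes the faster-whisper package is installed and a CUDA GPU is available (as on a Colab GPU runtime); the helper names `format_segments` and `transcribe`, the `Segment` stand-in and the file name `audio.mp3` are hypothetical.

```python
from collections import namedtuple

# Stand-in for a transcription segment, used only to demo the formatter
# without loading a model (hypothetical helper, not part of faster-whisper).
Segment = namedtuple("Segment", "start end text")


def format_segments(segments):
    """Turn segments (objects with .start, .end, .text) into readable lines."""
    return [
        f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text.strip()}"
        for seg in segments
    ]


def transcribe(audio_path, model_size="large-v3"):
    """Transcribe an audio file with Faster-Whisper (no speaker labels)."""
    # Imported lazily so the formatter above works without the library.
    from faster_whisper import WhisperModel

    # Assumption: a CUDA GPU is available; use device="cpu" and
    # compute_type="int8" otherwise.
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, info = model.transcribe(audio_path, beam_size=5)
    print(f"Detected language: {info.language}")
    return format_segments(segments)


if __name__ == "__main__":
    # Demo on fake data; call transcribe("audio.mp3") for a real file.
    for line in format_segments([Segment(0.0, 2.5, " Hello there.")]):
        print(line)
```

Each line gives a time range and the text spoken in it, but says nothing about who spoke. That is the gap the rest of this post fills.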

Transcription WITH speaker diarization

The following image is a screenshot of the end of the notebook. It shows the kind of transcription with speakers you can expect for your own audio file.

Note: the notebook also created the srt file of the transcript (with elapsed times).
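To make the srt format concrete, here is a small sketch of how speaker-labeled segments can be written as SRT cues (index, time range as HH:MM:SS,mmm, then text). The function names and the “Speaker 0” label are assumptions for illustration; this is not the notebook’s own export code.

```python
def srt_timestamp(seconds):
    """Convert a time in seconds to the SRT format HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues):
    """Build SRT text from (start, end, speaker, text) tuples."""
    blocks = []
    for index, (start, end, speaker, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{speaker}: {text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.2, "Speaker 0", "Hi"), (1.2, 3.0, "Speaker 1", "Hello")]))
```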

result of the code: transcription with speakers

How do you get that result? Just run the following notebook on Google Colab :-)

Note: to get it, I only made a small update to the notebook Whisper_Transcription_%2B_NeMo_Diarization.ipynb by Mahmoud Ashraf (all texts were kept), who deserves full credit.

[ Explanation from Mahmoud Ashraf ] This notebook combines Whisper ASR capabilities with Voice Activity Detection (VAD) and Speaker Embedding to identify the speaker for each sentence in the transcription generated by Whisper. First, the vocals are extracted from the audio to increase the speaker embedding accuracy, then the transcription is generated using Whisper, then the timestamps are corrected and aligned using WhisperX to help minimize diarization error due to time shift. The audio is then passed into MarbleNet for VAD and segmentation to exclude silences, TitaNet is then used to extract speaker embeddings to identify the speaker for each segment, the result is then associated with the timestamps generated by WhisperX to detect the speaker for each word based on timestamps and then realigned using punctuation models to compensate for minor time shifts.
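The last step of that explanation, detecting the speaker for each word based on timestamps, can be illustrated with a simplified sketch. This is not the notebook’s code: here each word simply gets the speaker whose turn overlaps it the most, and the data layout (tuples of start, end, text or speaker) and the function names are assumptions for illustration.

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most.

    words: list of (start, end, text) tuples from the aligned transcript.
    turns: list of (start, end, speaker) tuples from the diarization.
    Words that overlap no turn get the label None.
    """
    labeled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled


def group_by_speaker(labeled):
    """Merge consecutive words from the same speaker into one line."""
    lines = []
    for speaker, text in labeled:
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)
        else:
            lines.append((speaker, [text]))
    return [f"{speaker}: {' '.join(tokens)}" for speaker, tokens in lines]


if __name__ == "__main__":
    words = [(0.0, 0.4, "Hello"), (0.5, 0.9, "everyone."), (1.2, 1.5, "Thanks!")]
    turns = [(0.0, 1.0, "Speaker 0"), (1.0, 2.0, "Speaker 1")]
    for line in group_by_speaker(assign_speakers(words, turns)):
        print(line)
```

This also shows why accurate word timestamps matter: if a word’s time range is shifted, it can fall into the wrong speaker’s turn.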

Et voilà :-)

And after…

With such a great transcription with speakers, you can now apply any Deep Learning model to it! It depends on what you want:

  • Translation into any language
  • Topic and/or named entity detection
  • Classification of texts
  • Summarization
  • Extraction of all key information

And what if you used an LLM (Large Language Model) like ChatGPT, Gemini or Llama 2? ;-)

About the author: Pierre Guillou is an AI consultant (Generative AI & Deep Learning) in Brazil and France. Contact him via his LinkedIn profile.

