Diarising Audio Transcriptions with Python and Whisper: A Step-by-Step Guide

In this tutorial we will transcribe audio and produce an output file that annotates the transcription with speaker labels (SPEAKER_00, SPEAKER_01, and so on).

You can then simply replace each SPEAKER_NN label with the speaker’s actual name.
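Swapping the generic labels for real names is a simple find-and-replace. A minimal sketch (the names and transcript text here are hypothetical):

```python
def rename_speakers(text, mapping):
    """Replace each SPEAKER_NN label in the transcript with a real name."""
    for label, name in mapping.items():
        text = text.replace(label, name)
    return text

# Hypothetical mapping of pyannote labels to real names.
speakers = {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"}
transcript = "SPEAKER_00: Hi there!\nSPEAKER_01: Hello!"
print(rename_speakers(transcript, speakers))
```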

Create a WAV file from the video file

To do this we will execute the following code, which reads the input and output paths from environment variables and calls ffmpeg to create the WAV file.

Note: you may need to install ffmpeg separately.
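A sketch of this step, shelling out to ffmpeg from Python. The environment variable names and default file paths are assumptions for illustration:

```python
import os
import subprocess

def build_ffmpeg_cmd(input_path, output_path):
    """Build an ffmpeg command that extracts 16 kHz mono PCM audio,
    a format that works well with Whisper and pyannote."""
    return [
        "ffmpeg", "-y",
        "-i", input_path,   # input video file
        "-vn",              # drop the video stream
        "-ac", "1",         # mono
        "-ar", "16000",     # 16 kHz sample rate
        output_path,
    ]

# Paths come from environment variables; the defaults are hypothetical.
input_path = os.environ.get("INPUT_VIDEO", "input.mp4")
output_path = os.environ.get("OUTPUT_WAV", "input.wav")
cmd = build_ffmpeg_cmd(input_path, output_path)
print(" ".join(cmd))
# Uncomment to actually run ffmpeg (it must be on your PATH):
# subprocess.run(cmd, check=True)
```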

Create a model for speaker diarization

Next we’ll create a model for speaker diarization via pyannote.
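A sketch of loading the pretrained pyannote pipeline and saving its output. This assumes `pyannote.audio` is installed, that you have a Hugging Face access token, and that you have accepted the model’s terms of use on the Hub; the file names are illustrative:

```python
from pyannote.audio import Pipeline

# Requires a Hugging Face token with access to the pyannote models.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token="YOUR_HF_TOKEN"
)

# Run diarization on the WAV file we created earlier.
diarization = pipeline("input.wav")

# Save the result as text; each line is a segment with a speaker label.
with open("diarization.txt", "w") as f:
    f.write(str(diarization))
```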

Next we’ll write a function that takes a time string and converts it into an integer number of milliseconds.
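One way to write this conversion, assuming the `H:MM:SS.mmm` time format that pyannote’s text output uses:

```python
def millisec(time_str):
    """Convert an 'H:MM:SS.mmm' time string into integer milliseconds."""
    hours, minutes, seconds = time_str.split(":")
    return round((int(hours) * 3600 + int(minutes) * 60 + float(seconds)) * 1000)

print(millisec("0:01:30.500"))  # 90500
```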

Now we’ll split the diarization output into groups.
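The idea is to merge consecutive segments spoken by the same speaker into one group. A sketch, assuming each line of the saved diarization text ends with a speaker label such as `SPEAKER_00` (the sample lines below are hypothetical):

```python
def group_by_speaker(lines):
    """Group consecutive diarization lines that share the same speaker."""
    groups, current, last_speaker = [], [], None
    for line in lines:
        speaker = line.split()[-1]  # e.g. SPEAKER_00
        if current and speaker != last_speaker:
            groups.append(current)
            current = []
        current.append(line)
        last_speaker = speaker
    if current:
        groups.append(current)
    return groups

lines = [
    "[ 00:00:00.497 -->  00:00:07.222] A SPEAKER_00",
    "[ 00:00:07.400 -->  00:00:09.100] B SPEAKER_00",
    "[ 00:00:09.500 -->  00:00:12.000] C SPEAKER_01",
]
print(group_by_speaker(lines))  # two groups: SPEAKER_00 (2 lines), SPEAKER_01 (1 line)
```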

Split Audio File into Groups

Next we’ll need to split the audio file based on the groups, writing the segments into the tmp directory.
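A sketch of this step: compute each group’s start and end in milliseconds from the timestamps in its lines, then cut the WAV file with pydub (which must be installed). The timestamp format is assumed to match pyannote’s text output:

```python
import os
import re

def group_bounds(group):
    """Return (start_ms, end_ms) covering a whole group of segments."""
    def millisec(t):
        h, m, s = t.split(":")
        return round((int(h) * 3600 + int(m) * 60 + float(s)) * 1000)
    times = re.findall(r"\d+:\d+:\d+\.\d+", " ".join(group))
    return millisec(times[0]), millisec(times[-1])

def export_groups(wav_path, groups, out_dir="tmp"):
    """Cut the WAV file into one file per group, e.g. tmp/0.wav, tmp/1.wav."""
    from pydub import AudioSegment  # deferred so the helper above stays importable
    os.makedirs(out_dir, exist_ok=True)
    audio = AudioSegment.from_wav(wav_path)
    for i, group in enumerate(groups):
        start, end = group_bounds(group)
        audio[start:end].export(os.path.join(out_dir, f"{i}.wav"), format="wav")

print(group_bounds(["[ 00:00:00.497 -->  00:00:07.222] A SPEAKER_00"]))  # (497, 7222)
```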

Transcribe the audio files from the groups

Now let’s go ahead and transcribe the audio files from those groups.
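A sketch using the `openai-whisper` package (assumed installed), transcribing each per-group segment written to `tmp/` in the previous step and saving the text alongside it:

```python
import os
import whisper

# Load a Whisper model; "base" is fast, larger models are more accurate.
model = whisper.load_model("base")

# Transcribe each segment file in numeric order: tmp/0.wav, tmp/1.wav, ...
segment_files = sorted(
    (f for f in os.listdir("tmp") if f.endswith(".wav")),
    key=lambda f: int(f.split(".")[0]),
)
for name in segment_files:
    result = model.transcribe(os.path.join("tmp", name))
    # Save the text next to the audio segment, e.g. tmp/0.txt for tmp/0.wav
    with open(os.path.join("tmp", name.replace(".wav", ".txt")), "w") as out:
        out.write(result["text"])
```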

Output the Transcription with Diarisation

Next we’ll create a transcription file with the speakers.
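The final file pairs each group’s speaker label with its transcribed text. A minimal sketch, where the example entries are hypothetical stand-ins for the per-group transcripts produced above:

```python
def format_transcript(entries):
    """Render (speaker, text) pairs as 'SPEAKER: text' lines."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in entries)

# Hypothetical example: one transcribed segment per diarization group.
entries = [
    ("SPEAKER_00", "Hi, thanks for joining."),
    ("SPEAKER_01", "Happy to be here."),
]
with open("transcript.txt", "w") as f:
    f.write(format_transcript(entries))
print(format_transcript(entries))
```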

Combine Sentences for Easy Reading

Let’s now go ahead and create a combined-sentences file so that each speaker’s turn appears on a single line.
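One way to do this: walk the transcript lines and merge consecutive lines from the same speaker. The sample lines are hypothetical:

```python
def combine_lines(lines):
    """Merge consecutive 'SPEAKER: text' lines from the same speaker."""
    combined = []
    for line in lines:
        speaker, text = line.split(": ", 1)
        if combined and combined[-1][0] == speaker:
            # Same speaker as the previous line: append to their turn.
            combined[-1] = (speaker, combined[-1][1] + " " + text)
        else:
            combined.append((speaker, text))
    return [f"{s}: {t}" for s, t in combined]

lines = [
    "SPEAKER_00: Hi there.",
    "SPEAKER_00: How are you?",
    "SPEAKER_01: Fine, thanks.",
]
print(combine_lines(lines))
```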

Calculating % of time each speaker spoke

You can also calculate the % of time each speaker spoke as follows:
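A sketch of the calculation: sum each speaker’s segment durations (here taken as hypothetical millisecond tuples derived from the diarization output) and divide by the total speaking time:

```python
def speaking_percentages(segments):
    """segments: (speaker, start_ms, end_ms) tuples.
    Returns each speaker's share of total speaking time as a percentage."""
    totals = {}
    for speaker, start, end in segments:
        totals[speaker] = totals.get(speaker, 0) + (end - start)
    grand_total = sum(totals.values())
    return {s: round(100 * t / grand_total, 1) for s, t in totals.items()}

# Hypothetical segments in milliseconds.
segments = [
    ("SPEAKER_00", 0, 7000),
    ("SPEAKER_01", 7500, 10000),
    ("SPEAKER_00", 10500, 11000),
]
print(speaking_percentages(segments))  # {'SPEAKER_00': 75.0, 'SPEAKER_01': 25.0}
```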

Hopefully this guide gives you some pointers on how to transcribe with diarisation.


Group Product Manager @Twilio - Part-Time Crossfit Athlete.
