Speaker Diarization

Separation of Multiple Speakers in an Audio File.

Jaspreet Singh
Dec 20, 2018 · 7 min read

Diarization: the process of partitioning an input audio stream into homogeneous segments according to speaker identity.

Identifying the number of Speakers in an Audio File

Diarization answers “Who spoke when?”: when did speaker 1 or speaker 2 start and stop speaking.

Applications:

  1. Medical records: separating doctor and patient speech (to produce more structured notes).
  2. Automatic note generation for meetings.
  3. Call-center data analysis.
  4. Courthouses and parliaments.
  5. Broadcast news (TV and radio).
Audio File in Graph Form

The task of Speaker Diarization encounters many difficulties:

  1. The number of speakers in the program is unknown.
  2. There is no prior knowledge about the identity of the people in the program.
  3. Many speakers may speak at the same time.
  4. There may be different audio recording conditions.
  5. The audio channel may contain not only speech but also music and other non-speech sources (applause, laughter, etc.).
Problems to Identify Multiple Speakers in an Audio File.

Anatomy (Internal Workings) and Infrastructure of Diarization

  1. Speech detection — use a Voice Activity Detector (VAD) to remove noise and non-speech.
  2. Speech segmentation — extract short segments from the audio with a sliding window and run an LSTM network to produce a d-vector for each window.

Generating d-vectors from segments

3. Embedding extraction (recognition part) — for each segment, aggregate the d-vectors belonging to that segment to produce segment-wise embeddings (attach the d-vector to the sliding window).

4. Clustering — finally, cluster the segment-wise embeddings to produce the diarization result: the number of speakers, with each speaker's time stamps.

Infrastructure of Speaker Diarization Process
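The four stages above can be sketched end to end in plain NumPy. This is a toy illustration of the pipeline, not the production system: the `dummy_d_vector` function stands in for the trained LSTM encoder, and a simple k-means loop stands in for the final clustering step.

```python
import numpy as np

def sliding_windows(signal, win, hop):
    """Stage 2: cut the audio into overlapping short segments."""
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]

def dummy_d_vector(segment):
    """Toy stand-in for the LSTM speaker encoder (stages 2-3): embeds a
    segment as its mean and standard deviation instead of a learned d-vector."""
    return np.array([segment.mean(), segment.std()])

def kmeans(X, k, iters=50, seed=0):
    """Stage 4: cluster segment embeddings; each cluster is one speaker."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy "audio": a low-amplitude speaker followed by a high-amplitude one.
rng = np.random.default_rng(1)
sig = np.concatenate([rng.normal(0.0, 1.0, 800), rng.normal(5.0, 1.0, 800)])

segments = sliding_windows(sig, win=200, hop=100)              # stage 2
embeddings = np.stack([dummy_d_vector(s) for s in segments])   # stage 3
labels = kmeans(embeddings, k=2)                               # stage 4
print(labels)  # same label = same speaker; the label switches mid-stream
```

In a real system the embeddings come from a trained network and the clustering step is usually spectral clustering, as discussed next.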

These problems in identifying multiple speakers in an audio file can be solved with diarization.

Spectral Clustering: it helps determine the number of clusters.

The gap has its maximum value when k is 8, so 8 clusters are optimal in this case.

Spectral Clustering to find the Optimal Clusters

Spectral clustering finds an optimal graph cut for the data, which is what makes it useful for speaker diarization.
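As a sketch of the idea, the cluster count can be estimated from the eigen-gap of a similarity graph built over the segment embeddings: the largest gap between consecutive Laplacian eigenvalues marks the optimal k. This is a minimal NumPy illustration using an unnormalized Laplacian on synthetic data; production systems typically refine the affinity matrix and use a normalized Laplacian.

```python
import numpy as np

def estimate_num_speakers(embeddings, max_k=10):
    """Eigen-gap heuristic: the number of near-zero Laplacian eigenvalues
    approximates the number of well-separated clusters in the affinity graph."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    A = np.clip(X @ X.T, 0.0, None)          # cosine affinity, clipped to >= 0
    L = np.diag(A.sum(axis=1)) - A           # unnormalized graph Laplacian
    eigvals = np.sort(np.linalg.eigvalsh(L))
    gaps = np.diff(eigvals[:max_k + 1])      # gaps between consecutive eigenvalues
    return int(np.argmax(gaps)) + 1          # largest gap at index k-1 => k clusters

# Three synthetic, well-separated "speakers" in embedding space:
rng = np.random.default_rng(0)
emb = np.concatenate([rng.normal(c, 0.05, size=(20, 3)) for c in np.eye(3) * 5.0])
print(estimate_num_speakers(emb))  # -> 3
```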

Diarization provides a solution to the problems above.

What Diarization is NOT?

Diarization != Speaker Change Detection:

Diarization assigns a label whenever a new speaker appears, and if the same speaker returns, it assigns the same label again. Speaker change detection, by contrast, only marks the points where the speaker changes; no such labels are given.
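A tiny illustration of the difference, using a made-up tag sequence: diarization keeps the same label when a speaker returns, while change detection only reports boundary positions.

```python
# Word-level speaker tags as diarization would label them (illustrative data):
tags = [1, 1, 2, 2, 1, 1]   # speaker 1 returns and gets the SAME label

# Speaker change detection only marks the indices where the speaker switches:
changes = [i for i in range(1, len(tags)) if tags[i] != tags[i - 1]]
print(changes)  # -> [2, 4]
```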

Speaker Change Detection

Diarization != Speaker Recognition

No enrollment: diarization systems don't save voice prints of any known speaker and don't register any speaker's voice before running the program; speakers are discovered dynamically.

The steps to execute Google Cloud Speech diarization are as follows:

Step 1: Create an account with Google Cloud.

Step 2: Create a Project.

Step 3: To acquire the key, go to the Service Account Key page.

The steps for acquiring a service account key:

(1) From the Service account drop-down list, select New service account.

(2) Enter a name into the Service account name field.

(3) From the Role drop-down list, select Project > Owner.

(4) Click Create. A JSON file that contains your key downloads to your computer.

Steps to perform in Python/Python IDE Terminal:

1. pip install google-cloud

2. pip install google-cloud-speech

3. Before executing the program export the JSON file:

For Linux/Ubuntu

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"

For Windows PowerShell

$env:GOOGLE_APPLICATION_CREDENTIALS="[PATH]"
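Alternatively, the credential path can be set from inside Python before the client is created. The path below mirrors the placeholder above:

```python
import os

# Equivalent to the shell export; must run before SpeechClient() is constructed.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/user/Downloads/[FILE_NAME].json"
print(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])
```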

Python code to Implement Speaker Diarization:

# -*- coding: UTF-8 -*-
import sys

def transcribe_file_with_diarization(file_path):
    """Transcribe the given audio file synchronously with diarization."""
    # [START speech_transcribe_diarization_beta]
    from google.cloud import speech_v1p1beta1 as speech
    client = speech.SpeechClient()

    with open(file_path, 'rb') as audio_file:
        content = audio_file.read()

    audio = speech.types.RecognitionAudio(content=content)

    config = speech.types.RecognitionConfig(
        encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=48000,
        language_code='en-US',
        enable_speaker_diarization=True,
        enable_automatic_punctuation=True,
        diarization_speaker_count=4)

    print('Waiting for operation to complete...')
    response = client.recognize(config, audio)

    # The transcript within each result is separate and sequential per result.
    # However, the words list within an alternative includes all the words
    # from all the results thus far. Thus, to get all the words with speaker
    # tags, you only have to take the words list from the last result:
    result = response.results[-1]

    words_info = result.alternatives[0].words

    speaker1_transcript = ''
    speaker2_transcript = ''
    speaker3_transcript = ''
    speaker4_transcript = ''
    # Printing out the output:
    for word_info in words_info:
        if word_info.speaker_tag == 1:
            speaker1_transcript += word_info.word + ' '
        if word_info.speaker_tag == 2:
            speaker2_transcript += word_info.word + ' '
        if word_info.speaker_tag == 3:
            speaker3_transcript += word_info.word + ' '
        if word_info.speaker_tag == 4:
            speaker4_transcript += word_info.word + ' '
    print("speaker1: '{}'".format(speaker1_transcript))
    print("speaker2: '{}'".format(speaker2_transcript))
    print("speaker3: '{}'".format(speaker3_transcript))
    print("speaker4: '{}'".format(speaker4_transcript))
    # [END speech_transcribe_diarization_beta]

transcribe_file_with_diarization(sys.argv[1])

Diarization results on Jupyter terminal

Other companies offering a similar service:

Amazon (AWS Transcribe): it offers a similar solution, returning a JSON file that can be used for further applications.

Amazon Transcribe provides high-quality, affordable speech-to-text transcription for a wide range of use cases.

Step 1: Create a transcription job.

Output: a JSON file with the transcript results is uploaded to the S3 bucket.

JSON Transcript file
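That JSON transcript can be consumed programmatically. The snippet below parses a hand-made sample shaped like Amazon Transcribe's speaker-label output (the field names follow Transcribe's documented format; the data itself is illustrative):

```python
import json

# Illustrative sample shaped like an Amazon Transcribe result with speaker labels:
transcript_json = json.loads("""
{
  "results": {
    "transcripts": [{"transcript": "hello there hi"}],
    "speaker_labels": {
      "speakers": 2,
      "segments": [
        {"start_time": "0.0", "end_time": "1.2", "speaker_label": "spk_0"},
        {"start_time": "1.2", "end_time": "2.0", "speaker_label": "spk_1"}
      ]
    }
  }
}
""")

labels = transcript_json["results"]["speaker_labels"]
print("speakers:", labels["speakers"])
for seg in labels["segments"]:
    print("{} spoke from {}s to {}s".format(
        seg["speaker_label"], seg["start_time"], seg["end_time"]))
```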

Microsoft: Cognitive Speaker Recognition.

It helps recognize users based on their voice, since every voice has unique characteristics that can be used to recognize and identify the speaker.

Just like everyone has a unique fingerprint, everyone has a unique voice.

How to Implement Speaker Identification

Step 1: Create an account with Microsoft Azure.

Step 2: In All Services -> Go to AI+Machine Learning.

Step 3: Click on Cognitive Services.

Step 4: Click on Add -> Click on More in Cognitive Services.

Step 5: Click on Speaker Recognition.

Step 6: Click on Create.

This will generate two keys, which remain active as long as your account is active.

Code Implementation for Speaker Identification

Step 1: Profile creation: create a profile ID with the subscription key.

Step 2: Profile enrollment: enroll a voice sample with the profile ID and subscription key.

Step 3: Speaker identification: pass a voice sample with the profile ID.

Output: if the voice sample matches the one associated with the profile ID, the program displays the profile ID; otherwise it displays 00000….

Note: the following files are required before implementing the code (they are needed to call the Microsoft Azure services) and are available on GitHub:

IdentificationServiceHttpClientHelper

IdentificationProfile

IdentificationResponse

EnrollmentResponse

ProfileCreationResponse

Output of the Speaker Identification

Speaker Identification

Integration of Google and Microsoft Code to form Stenography:

# -*- coding: UTF-8 -*-
import sys
import IdentificationServiceHttpClientHelper

def identify_file(subscription_key, file_path, force_short_audio, profile_ids):
    """Identify the speaker in the file against the enrolled Azure profiles."""
    helper = IdentificationServiceHttpClientHelper.IdentificationServiceHttpClientHelper(
        subscription_key)
    identification_response = helper.identify_file(
        file_path, profile_ids, force_short_audio.lower() == "true")

    print('Identified Speaker = {0}'.format(
        identification_response.get_identified_profile_id()))
    print('Confidence = {0}'.format(identification_response.get_confidence()))

    # Map the returned profile IDs to human-readable names:
    if identification_response.get_identified_profile_id() == '3c4712ea-c6b9-4f66-ac83-d21de36cfaf6':
        print('Shins_ID')
    if identification_response.get_identified_profile_id() == 'bcebf33f-7d4d-4f3c-a5ae-6f51b1348c95':
        print('Jaspreets_ID')
    if identification_response.get_identified_profile_id() == 'a091add1-87a0-464c-aa8b-7a953273f3d7':
        print('Deans_ID')
    if identification_response.get_identified_profile_id() == '3f81d6d5-9f77-4da0-8f65-8188c0978219':
        print('JTs_ID')

def transcribe_file_with_diarization(subscription_key, file_path, force_short_audio, profile_ids):
    """Transcribe the given audio file synchronously with diarization."""
    helper = IdentificationServiceHttpClientHelper.IdentificationServiceHttpClientHelper(
        subscription_key)
    identification_response = helper.identify_file(
        file_path, profile_ids, force_short_audio.lower() == "true")

    # [START speech_transcribe_diarization_beta]
    from google.cloud import speech_v1p1beta1 as speech
    client = speech.SpeechClient()

    with open(file_path, 'rb') as audio_file:
        content = audio_file.read()

    audio = speech.types.RecognitionAudio(content=content)

    config = speech.types.RecognitionConfig(
        encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code='en-US',
        audio_channel_count=1,
        enable_speaker_diarization=True,
        enable_automatic_punctuation=True,
        diarization_speaker_count=4)

    print('Diarization Results :')
    response = client.recognize(config, audio)

    # The transcript within each result is separate and sequential per result.
    # However, the words list within an alternative includes all the words
    # from all the results thus far. Thus, to get all the words with speaker
    # tags, you only have to take the words list from the last result:
    result = response.results[-1]
    words_info = result.alternatives[0].words

    speaker1_transcript = ''
    speaker2_transcript = ''
    speaker3_transcript = ''
    speaker4_transcript = ''
    # Printing out the output:
    for word_info in words_info:
        if word_info.speaker_tag == 1:
            speaker1_transcript += word_info.word + ' '
        if word_info.speaker_tag == 2:
            speaker2_transcript += word_info.word + ' '
        if word_info.speaker_tag == 3:
            speaker3_transcript += word_info.word + ' '
        if word_info.speaker_tag == 4:
            speaker4_transcript += word_info.word + ' '

    # Replace the generic "speaker1" tag with the identified Azure profile ID:
    print(identification_response.get_identified_profile_id(), ":", speaker1_transcript)
    print("speaker2: '{}'".format(speaker2_transcript))
    print("speaker3: '{}'".format(speaker3_transcript))
    print("speaker4: '{}'".format(speaker4_transcript))
    # [END speech_transcribe_diarization_beta]

if __name__ == "__main__":
    if len(sys.argv) < 5:
        print('Usage: python IdentifyFile.py <subscription_key> <identification_file_path> '
              '<force_short_audio> <profile_ids>...')
        print('\t<subscription_key> is the subscription key for the service')
        print('\t<identification_file_path> is the audio file path for identification')
        print('\t<force_short_audio> True/False waives the recommended minimum audio limit needed '
              'for enrollment')
        print('\t<profile_ids> the profile IDs for the profiles to identify the audio from.')
        sys.exit('Error: Incorrect Usage.')
    identify_file(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4:])
    transcribe_file_with_diarization(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4:])

Output of Integrated Code

Conclusion: In this article we described the problem of identifying the speech of multiple speakers in an audio file. Google speaker diarization is a powerful technique for transcribing speech with speaker tags. The speaker diarization technique has few limitations and is easy to implement.

Limitation: because there is no enrollment process, the speaker diarization technique doesn't recognize specific speakers.

Published in Data Driven Investor.