Audio Segmentation for Unsupervised Audio Data

Nimramuzamal
3 min read · Feb 20, 2024


Let's first talk about unsupervised audio data: it is data that has no speaker labels and no information about who speaks when.

The main goal when dealing with unsupervised audio data is to segment it by speaker, i.e. to work out who speaks when. Clustering is one way to achieve this goal.

The clustering task for audio segmentation is based on similarities in the audio features: similar parts of the audio are grouped together, segmenting the audio into clusters.

Feature extraction

Relevant features are extracted from the audio signal on the Mel-frequency scale, on which pitches are judged by listeners to be equally distant from one another. The Mel scale is logarithmic and approximates the human auditory system's response to different frequencies.
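
As a minimal sketch of extracting Mel-scale features, here is an example using librosa (librosa is not part of the pipeline below, it is only for illustration, and path is assumed to point to an audio file):

import librosa

# Load the audio; sr=None keeps the file's native sampling rate
y, sr = librosa.load(path, sr=None)
# 13 Mel-frequency cepstral coefficients (MFCCs) per frame
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)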

Agglomerative clustering is used for this purpose. Each segment of the audio starts as its own cluster, and the most similar clusters are merged iteratively until a stopping criterion is met. The sequence of merges can be drawn as a dendrogram, which represents the dissimilarity between the clusters.

clustering = AgglomerativeClustering(num_speakers).fit(embeddings)
labels = clustering.labels_
for i in range(len(segments)):
    segments[i]["speaker"] = 'SPEAKER ' + str(labels[i] + 1)

This code snippet shows how the clustering is done.
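
The dendrogram mentioned above is not produced by this snippet, but you can visualise one with SciPy once the embeddings exist (a sketch, not part of the original code; it assumes the embeddings array built later in this post):

from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# 'ward' matches the default linkage of sklearn's AgglomerativeClustering
Z = linkage(embeddings, method='ward')
dendrogram(Z)
plt.xlabel('segment index')
plt.ylabel('dissimilarity')
plt.show()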

Audio Segmentation

Audio segmentation is the process of dividing an audio signal into smaller, more manageable segments or regions based on certain criteria such as silence, speech, music, noise, or other acoustic characteristics. The goal of audio segmentation is to break down the audio stream into meaningful units that can be analyzed or processed independently.
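
For instance, a simple energy-based split on silence can be sketched with librosa (again only for illustration, not part of the speaker-diarization pipeline below; path is assumed to point to an audio file):

import librosa

y, sr = librosa.load(path, sr=None)
# Treat anything more than 30 dB below the peak as silence and return the non-silent intervals
intervals = librosa.effects.split(y, top_db=30)
for start, end in intervals:
    print(f"{start / sr:.2f}s - {end / sr:.2f}s")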

Code

# upload audio file
from google.colab import files
uploaded = files.upload()
path = next(iter(uploaded))

Set the number of speakers in your audio file accordingly, along with the transcription language and the Whisper model size.

num_speakers = 2 #@param {type:"integer"}

language = 'any' #@param ['any', 'English']

model_size = 'large' #@param ['tiny', 'base', 'small', 'medium', 'large']


model_name = model_size
if language == 'English' and model_size != 'large':
    # Whisper ships English-only variants (tiny.en ... medium.en); there is no large.en
    model_name += '.en'

Install libraries

!pip install -q git+https://github.com/openai/whisper.git > /dev/null
!pip install -q git+https://github.com/pyannote/pyannote-audio > /dev/null
import whisper
import datetime
import subprocess
import torch
import pyannote.audio
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb",
    device=torch.device("cuda"))

from pyannote.audio import Audio
from pyannote.core import Segment
import wave
import contextlib
from sklearn.cluster import AgglomerativeClustering
import numpy as np
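
If you are running without a GPU, torch.device("cuda") above will fail; a small tweak (an assumption, not in the original notebook) picks the device automatically:

# Fall back to CPU when CUDA is unavailable
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb",
    device=device)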

If your audio file is not already a .wav file, convert it to .wav with ffmpeg.

if path[-3:] != 'wav':
    subprocess.call(['ffmpeg', '-i', path, 'audio.wav', '-y'])
    path = 'audio.wav'

Load the Whisper model and transcribe the audio.

model = whisper.load_model(model_size)
result = model.transcribe(path)
segments = result["segments"]
with contextlib.closing(wave.open(path, 'r')) as f:
    frames = f.getnframes()
    rate = f.getframerate()
    duration = frames / float(rate)
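
Each entry in segments is a dict produced by Whisper; the pipeline below only relies on its "start", "end" and "text" keys. A quick way to inspect them (the printed values depend on your audio):

for seg in segments[:3]:
    print(round(seg["start"], 2), round(seg["end"], 2), seg["text"])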

Generate embeddings for the audio segments

audio = Audio()

def segment_embedding(segment):
    start = segment["start"]
    # Whisper overshoots the end timestamp in the last segment
    end = min(duration, segment["end"])
    clip = Segment(start, end)
    waveform, sample_rate = audio.crop(path, clip)
    return embedding_model(waveform[None])

embeddings = np.zeros(shape=(len(segments), 192))
for i, segment in enumerate(segments):
    embeddings[i] = segment_embedding(segment)

embeddings = np.nan_to_num(embeddings)

Clustering

clustering = AgglomerativeClustering(num_speakers).fit(embeddings)
labels = clustering.labels_
for i in range(len(segments)):
    segments[i]["speaker"] = 'SPEAKER ' + str(labels[i] + 1)
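
If the number of speakers is not known in advance, scikit-learn also lets you cluster with a distance threshold instead of a fixed cluster count (the threshold value below is only an assumption and needs tuning per recording):

# Merge clusters until the linkage distance exceeds the threshold,
# then read the number of speakers off the resulting labels
clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0).fit(embeddings)
labels = clustering.labels_
num_speakers = len(set(labels))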

Audio Segmentation

def time(secs):
    return datetime.timedelta(seconds=round(secs))

f = open("transcript.txt", "w")

for (i, segment) in enumerate(segments):
    if i == 0 or segments[i - 1]["speaker"] != segment["speaker"]:
        f.write("\n" + segment["speaker"] + ' ' + str(time(segment["start"])) + '\n')
    f.write(segment["text"][1:] + ' ')
f.close()
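
The resulting transcript.txt groups consecutive segments by speaker and prints a timestamp each time the speaker changes; the lines below are purely illustrative:

SPEAKER 1 0:00:00
Hello everyone, welcome to the show.

SPEAKER 2 0:00:07
Thanks for having me.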
