Fast Audio Machine Learning Model Debugging using Embedding Space Clustering

Daniel Klitzke
7 min read · Jul 13, 2023


tl;dr

Compute embeddings for your audio data with a pretrained model, cluster them hierarchically, and flag clusters where your evaluation metric drops significantly. This automatically surfaces problematic data slices and speeds up audio model debugging. Our library sliceguard does all of this in a few lines of code.

Motivation

Debugging audio machine learning models can be a real pain. This is mainly because of two simple reasons:

  1. The data is unstructured, so filtering and searching it through simple hand-crafted rules is impossible!
  2. Detailed analysis mostly requires listening to audio samples, which makes the model debugging even more time-consuming.

In this post, I will show you how to significantly speed up your model debugging using model embeddings and clustering.

Noise in audio recordings, like the low-frequency humming in this case, can significantly influence the performance of automatic speech recognition models.

Note that the full code of the example used in this post is available HERE. If you want to run the code yourself, just download the notebook, as this post does not contain the full code.

How It Works

Our approach is based on a fairly simple methodology and requires the following as input data:

  1. The ground-truth labels for your audio ML problem
  2. The predictions of the model you want to debug
  3. The raw audio files, preferably in wav format

Our library sliceguard then performs the following steps to detect issues in the audio data.

Compute audio embeddings for the raw audio data

Unstructured data is hard to process computationally. Embeddings are a way to turn raw audio into semantically meaningful numerical vectors. Once you have those vectors, you can use them to compute similarities between your audio files. This makes navigating through your data much easier and lets you recognize patterns in model failures. It also facilitates subsequent steps like clustering your data.

Computing audio embeddings and reducing them to two dimensions using dimensionality reduction techniques will make the audio navigable by a similarity measure.

If you are looking for Python libraries to compute audio embeddings, I recommend you take a look at the Huggingface Transformers library and its Model Hub.

Note that depending on which model you use, you will capture different properties of the data. E.g., if you use a model trained for classifying environmental noise, you will mostly capture general conditions regarding audio quality and background noise. However, if you use a model trained for speaker identification, you can identify issues related to speaker voice while staying largely invariant to other conditions such as background noise.
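To make this concrete, here is a minimal sketch of how such embeddings could be computed with the Transformers library. The mean-pooling strategy, the use of librosa for loading, and the embed_audio helper are illustrative choices, not sliceguard internals:

# Minimal sketch: compute one embedding vector per audio file by mean-pooling
# the hidden states of a pretrained audio model from the Huggingface Hub.
import librosa
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_name = "MIT/ast-finetuned-audioset-10-10-0.4593"  # audio-event / noise oriented model
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed_audio(path: str) -> torch.Tensor:
    # Load the file and resample it to the rate the feature extractor expects.
    waveform, _ = librosa.load(path, sr=feature_extractor.sampling_rate)
    inputs = feature_extractor(
        waveform, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the sequence of hidden states into a single vector per file.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)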

Perform a hierarchical embedding clustering

With the previous step, we can already navigate through our data much more easily. However, to detect data slices where our model doesn’t perform well, we need explicit groups to compute our evaluation metrics on. One way to get these groups out of the computed embeddings is clustering. In this case, we use hierarchical clustering, implemented as part of the dimensionality reduction method h-nne.

Hierarchical clustering divides the data into different partitionings with varying granularity. Treemaps are a nice way of visualizing the structure of the clustering result.
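As a rough sketch, and assuming the hnne package together with the embed_audio helper from the previous sketch, obtaining both the 2D projection and the cluster hierarchy could look like this:

# Rough sketch: project the audio embeddings to 2D with h-NNE and reuse the
# cluster hierarchy it builds internally (one partitioning per granularity level).
import numpy as np
from hnne import HNNE

# audio_paths: assumed list of paths to the wav files
embeddings = np.stack([embed_audio(path).numpy() for path in audio_paths])

hnne = HNNE(dim=2)
projection = hnne.fit_transform(embeddings)                   # 2D points for visualization
partitions = hnne.hierarchy_parameters.partitions             # cluster id per sample and hierarchy level
partition_sizes = hnne.hierarchy_parameters.partition_sizes   # number of clusters per level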

But why do we use a hierarchical clustering approach? Doesn’t that make everything more complicated? It matters for one main reason:

The desired partitioning of the data varies depending on the type of data issue you want to detect. Two common data issues and the view you need to detect them could be the following:

  1. Outliers with an overdriven signal → Fine-granular data partitioning showing clusters with few contained samples.
  2. Larger unwanted bias → Coarse-granular data partitioning with cluster sizes large enough to shift the data distribution significantly.

Identify data issues in the clustering hierarchy

While we have now computed embeddings to make our data navigable and have clustered it on different levels of granularity, two questions remain:

  1. How can we determine which clusters are problematic for our model?
  2. How can we determine the hierarchy level at which a problem is captured best?

The first question is easy to answer: simply compute your evaluation metric for each cluster and compare it to the metric on the overall dataset. If the drop is large enough, there is a potential problem. Packages like fairlearn can help you with that.
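A minimal sketch of this comparison using fairlearn's MetricFrame; y_true, y_pred, and clusters are assumed arrays, with clusters holding the cluster assignment of each sample for one hierarchy level:

# Minimal sketch: compute the metric per cluster and compare it to the overall value.
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score

mf = MetricFrame(
    metrics=accuracy_score,       # or any metric taking (y_true, y_pred)
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=clusters,  # assumed: one cluster id per sample
)
print(mf.overall)    # metric on the whole dataset
print(mf.by_group)   # metric per cluster; large drops indicate potential problems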

The second question is strongly related to the description of hierarchical clustering above. Depending on what you are looking for, problems will most likely be captured best in either a more fine-granular or a more coarse-granular partitioning. One way to determine which clusters to mark is to define a minimum cluster support and a minimum metric drop. Then go through the clustering hierarchy and mark all clusters that fulfill the min. support and min. drop criteria, potentially unmarking their parent clusters if the problems are better captured by their children.
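A hypothetical sketch of that marking logic (not sliceguard's actual implementation), assuming numpy arrays, a metric where higher is better, and leaving out the parent/child unmarking step:

import numpy as np

def mark_problem_clusters(y_true, y_pred, hierarchy, metric, min_support=5, min_drop=0.1):
    # hierarchy: list of arrays, one per level, assigning each sample a cluster id.
    overall = metric(y_true, y_pred)
    issues = []
    for level, assignments in enumerate(hierarchy):
        for cluster_id in np.unique(assignments):
            mask = assignments == cluster_id
            if mask.sum() < min_support:
                continue  # too few samples to trust the metric estimate
            drop = overall - metric(y_true[mask], y_pred[mask])
            if drop >= min_drop:
                issues.append(
                    {"level": level, "cluster": int(cluster_id),
                     "support": int(mask.sum()), "drop": float(drop)}
                )
    return issues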

Review automatically identified issues

Especially when using only unstructured data for identifying issues, the results are uncertain and not easily interpretable. However, visualizing and listening to the audio data is an effective way to distinguish real issues from false positives. A tool that is optimized for this kind of review is Renumics Spotlight.

Hands-on: Finding Issues in Your Audio Data

So, now let’s see how this analysis looks hands-on! Luckily, if you are willing to use our library sliceguard, you don’t have to implement the described behavior yourself. Just install it as follows:

pip install sliceguard

This way, every analysis step is just a single function call. To run the full example, please download the notebook from our GitHub repository!

Checking for environmental noise

One thing you want to do is check whether certain environmental noise conditions or issues with the recording quality might cause the model to fail. As described above, our approach is based on clustering audio embeddings. This means that the model we generate the embeddings from has to capture those environmental noise properties. In sliceguard, this call will do all the work for us:

# df (the input data) and wer_metric (word error rate) are defined in the full notebook.
from sliceguard import SliceGuard
from renumics.spotlight import Audio

# Perform an initial detection aiming for relatively small clusters of a minimum of 3 similar samples
sg = SliceGuard()
issue_df = sg.find_issues(
    df,
    ["audio"],
    "sentence",
    "prediction",
    wer_metric,
    metric_mode="min",
    embedding_models={"audio": "MIT/ast-finetuned-audioset-10-10-0.4593"},
    min_support=3,
    min_drop=0.2,
)
sg.report(spotlight_dtype={"audio": Audio})

Here, we supply a pandas DataFrame containing the following (a hypothetical example follows the list):

  1. Column audio containing the paths to the wav audio files
  2. Column sentence containing the transcript ground-truth
  3. Column prediction containing the model prediction
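A hypothetical example of such a DataFrame (the file paths and texts below are made up):

import pandas as pd

df = pd.DataFrame(
    {
        "audio": ["data/sample_0001.wav", "data/sample_0002.wav"],
        "sentence": ["the quick brown fox", "hello world"],
        "prediction": ["the quick brown fox", "hello word"],
    }
)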

We also supply a metric function to evaluate our model with; in this case, we simply compute the word error rate. Last but not least, we choose the model to compute the embeddings with. To capture environmental noise properties, we go for this Audio Spectrogram Transformer trained on AudioSet.
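The metric function could, for example, be implemented with the jiwer package; this is an assumption about what wer_metric looks like, not code taken from the notebook:

from jiwer import wer

def wer_metric(y_true, y_pred):
    # Word error rate over the ground-truth transcripts and predictions of a slice.
    return wer(list(y_true), list(y_pred))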

Interactive visualization is a key component in reviewing which potentially problematic clusters really pose an issue. Especially for unstructured data such as audio signals, being able to look at and listen to the raw data is essential. Renumics Spotlight offers an audio player and spectrogram view for this.

We end up with the above view, which gives the following insights:

  1. A lot of issues are not related to the audio itself but are simple model failures where, e.g., a stop token was not correctly generated.
  2. There are quite a few issues with really loud and really quiet samples. Both can impact performance significantly.
  3. Some audio recordings contain artifacts like a low-frequency humming noise or loud high-frequency background noises.
  4. There are some outliers where the sampling rate seems to be off, making the recording sound slow and low-pitched.

Explore the data yourself in this Huggingface Space!

Checking for speaker-related issues

Environmental noise is not all we are interested in. We also want to detect issues that are caused by the characteristics of certain speakers. So how can we do this if the speakers are not explicitly labeled? Simply use an embedding model that was trained on a speaker identification task!

# Perform a detection using a speaker identification model for computing embeddings.
# This will help to recover problematic speakers even though they are not explicitly labeled.
sg = SliceGuard()
issue_df = sg.find_issues(
    df,
    ["audio"],
    "sentence",
    "prediction",
    wer_metric,
    metric_mode="min",
    embedding_models={"audio": "superb/wav2vec2-base-superb-sid"},
    min_support=3,
    min_drop=0.4,
)

We end up with a similar view again, this time showing us problematic slices based on speaker similarity.

Having a set of automatically detected issues as a starting point for review can significantly speed up your model debugging efforts, although you will still need to review the clusters manually.

Insights we get from this are:

  1. There are some speakers with exceptionally bad recording quality, which is often not even recognizable to humans.
  2. In many cases, the detected problems involve speakers with a strong accent. In fact, this would often be detectable via the accent feature. While sliceguard cannot yet detect this, it is planned for one of the next releases.
  3. With the speaker embeddings, we also find speakers who frequently record in noisy environments.

Explore the data yourself in this Huggingface Space!

Conclusion

Of course, there is much more to a complete model evaluation than detecting problematic data slices. However, if you are looking for an easy way to go beyond global evaluation metrics and find biases and outliers in your data with a few lines of code, try out our library sliceguard or implement the described approach yourself. If you are more eager to try out the interactive exploration capabilities, try Renumics Spotlight.
