Evaluating AI-Based Audio Representation Models in 2022

Bogdan Teleaga
Qosmo Lab
Feb 10, 2023

Overview

2022 has been one of the most exciting years in AI so far. OpenAI announced DALL-E 2 in April and released it in a private beta in July, and Twitter was soon flooded with new, exciting images coming out of simple text prompts. Prompt engineering quickly became a thing, and in August Stability AI took the world by storm by open sourcing Stable Diffusion, a similar model, making it accessible to anyone with enough compute at their fingertips.

Stable Diffusion’s growth was staggering, as can be seen in graphs like the one below showing its adoption rate on GitHub.

https://twitter.com/PeterDiamandis/status/1595495173030612992

More quietly, efforts to replicate OpenAI’s CLIP model have also been steadily progressing, with releases such as https://github.com/mlfoundations/open_clip, which in some cases surpasses the performance of the original.

Things have been quieter in the audio domain, although plenty of work is being done towards a big release. Riffusion was released shortly after Stable Diffusion by fine-tuning it on spectrograms, showing that it is indeed possible to repurpose it for sound generation. More recently, Harmonai launched a 24/7 streaming YouTube channel that plays music generated by a diffusion model trained exclusively on audio.

https://www.youtube.com/watch?v=kJgxC9d0p50 (now available at https://www.youtube.com/watch?v=2nzSQ3up1kw)

However, the goal of this post is to talk about audio embeddings and the progress we’ve seen on that front in 2022. We will look at four exciting new models released last year and evaluate them on the HEAR benchmark to see how they perform against the best models available at the time of the NeurIPS competition in 2021.

The HEAR Benchmark

The HEAR benchmark tries to answer a simple question:

What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning?

It gathers a variety of tasks across multiple audio domains such as music, speech and environmental sounds. For each task it uses a pretrained model to extract features and trains a shallow classifier on top to gauge how useful those features are, in an approach very similar to transfer learning.

Check out their dedicated tasks page for in-depth information about the tasks being used.
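
To make this more concrete, below is a rough sketch of what a model wrapper looks like under the HEAR API: heareval imports a module exposing load_model, get_timestamp_embeddings and get_scene_embeddings, and trains its shallow classifiers on whatever these return. The stand-in network, embedding sizes and frame rate here are placeholders rather than any real submission, so treat this as an illustration and check the official API docs for the exact requirements.

```python
from typing import Tuple

import torch
from torch import Tensor


def load_model(model_file_path: str = "") -> torch.nn.Module:
    # Stand-in network: one strided conv producing a 512-d vector every 50 ms
    # of 16 kHz audio. A real wrapper would load its pretrained weights here.
    model = torch.nn.Conv1d(1, 512, kernel_size=800, stride=800)
    model.sample_rate = 16000
    model.scene_embedding_size = 512
    model.timestamp_embedding_size = 512
    return model


def get_timestamp_embeddings(audio: Tensor, model: torch.nn.Module) -> Tuple[Tensor, Tensor]:
    # audio: (n_sounds, n_samples) mono audio at model.sample_rate.
    # Returns frame-level embeddings plus the centre of each frame in milliseconds.
    with torch.no_grad():
        embeddings = model(audio.unsqueeze(1)).transpose(1, 2)  # (n_sounds, n_frames, 512)
    n_frames = embeddings.shape[1]
    hop_ms = 1000.0 * 800 / model.sample_rate
    timestamps = (torch.arange(n_frames) + 0.5) * hop_ms
    timestamps = timestamps.unsqueeze(0).expand(audio.shape[0], -1)
    return embeddings, timestamps


def get_scene_embeddings(audio: Tensor, model: torch.nn.Module) -> Tensor:
    # One embedding per clip; mean pooling over time is the simplest option.
    embeddings, _ = get_timestamp_embeddings(audio, model)
    return embeddings.mean(dim=1)
```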

Candidate Models

Among the selected candidates we have models created with the specific goal of generating audio embeddings, models that act as neural codecs or compressors, and finally intermediate representations in generative models. In alphabetical order:

Archisound

Archinet AI, an open source AI lab from Switzerland, released one of the first open source audio diffusion generation models on the internet at https://github.com/archinetai/audio-diffusion-pytorch. Part of the diffusion pipeline is an autoencoder which can significantly compress the input audio. A few pretrained autoencoders are available at https://github.com/archinetai/archisound under a very simple API. For the purposes of this post we use the dmae1d-ATC64-v1 model.
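
As a rough sketch, and assuming the ArchiSound.from_pretrained / encode / decode interface described in the archisound README (worth double-checking against the repository), extracting a compressed latent looks something like this:

```python
import torch
from archisound import ArchiSound

# Load one of the pretrained autoencoders; "dmae1d-ATC64-v1" is the one used in this post.
autoencoder = ArchiSound.from_pretrained("dmae1d-ATC64-v1")

x = torch.randn(1, 2, 2**18)        # (batch, channels, samples) of dummy stereo audio
with torch.no_grad():
    z = autoencoder.encode(x)       # heavily compressed latent, used here as the embedding
    y = autoencoder.decode(z)       # reconstruction, showing the codec-style round trip
print(x.shape, z.shape)
```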

CLAP

LAION, the developers of open_clip, have also been working on an effort to create a network similar to CLIP, but one that understands the connections between text and audio. Their efforts have been fully open sourced on GitHub at https://github.com/LAION-AI/CLAP. To our knowledge, it is currently the model trained on the largest publicly available dataset of captioned audio.
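
A minimal sketch of extracting clip-level embeddings with the laion_clap package, following its README; the checkpoint defaults and argument names are assumptions that may differ between releases:

```python
import laion_clap

# enable_fusion=True lets the model combine crops of clips longer than 10 seconds.
model = laion_clap.CLAP_Module(enable_fusion=True)
model.load_ckpt()  # downloads a default pretrained checkpoint

audio_files = ["dog_bark.wav", "guitar.wav"]  # hypothetical file paths
audio_embed = model.get_audio_embedding_from_filelist(x=audio_files, use_tensor=False)
text_embed = model.get_text_embedding(["a dog barking", "someone strumming a guitar"])

# Audio and text land in the same embedding space, so they can be compared directly.
print(audio_embed.shape, text_embed.shape)
```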

Encodec and SoundStream

These two models are both neural codecs, released by Facebook (Encodec) and Google (SoundStream) respectively. Their purpose isn’t necessarily to embed audio for downstream tasks, but to compress it efficiently for transmission. Still, as we know, compression is intelligence, so perhaps they can turn out to be useful here as well.
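
For illustration, here is roughly how encoding looks with the encodec package, following its README (SoundStream ships as a TF Lite model, so its usage is quite different); the audio path is hypothetical:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # target bitrate in kbps

wav, sr = torchaudio.load("example.wav")  # hypothetical file path
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # (batch, channels, samples)

with torch.no_grad():
    encoded_frames = model.encode(wav)

# Each frame is a (codes, scale) pair; concatenating gives discrete codes of
# shape (batch, n_codebooks, n_frames), one candidate input for downstream tasks.
codes = torch.cat([frame_codes for frame_codes, _ in encoded_frames], dim=-1)
print(codes.shape)
```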

Challenges

There were a couple of challenges along the way. The HEAR benchmark contains tasks where a fine-grained representation is needed at the timestamp level, as well as tasks that require a representation of the entire audio clip.

This poses different challenges for different models. By default, CLAP requires a minimum of 10 seconds of audio and pads anything shorter. It also does not normally return intermediate embeddings, so a few changes were required to retrieve timestamp-level embeddings. At the same time, we wanted to keep the fusion capabilities for longer tracks, so the implementation was trickier than for the other models.

The other models, on the other hand, return only fine-grained embeddings. Here the solution for retrieving a full-clip embedding (taking the mean over time) is much easier to implement, but as we will see later it is far from ideal in terms of quality.
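
Concretely, the clip-level embedding for these models is just the time average of the frame-level embeddings, along the lines of:

```python
import torch

# Dummy frame-level embeddings: (n_clips, n_frames, embedding_dim)
frame_embeddings = torch.randn(4, 250, 64)

# Collapse the time axis by averaging to get one embedding per clip.
clip_embeddings = frame_embeddings.mean(dim=1)  # (n_clips, embedding_dim)
print(clip_embeddings.shape)
```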

A more surprising challenge was the lack of support for TF Lite models such as SoundStream within the heareval kit, but we will be sending a pull request to address this issue.

The code used for producing the results can be found in this notebook on Google Colab. If you have any questions or suggestions leave a comment or send us an email!

Results

We display the results in a bar plot showing the previous best performance per task as well as the performance of each of our chosen models. On top of that, we show the performance of openl3, one of the top performing models from the competition.

We notice that CLAP obtains new SOTA performance on several tasks (FSD50K, ESC-50, Mridangam Tonic), but performs very poorly on others such as Maestro, NSynth and VoxLingua107. It is likely that there is some overlap between the test data used by the HEAR benchmark and the training data used by CLAP, which would explain the new SOTA numbers, since the CLAP paper itself reports lower numbers on these tasks.

Another interesting outlier is the archisound autoencoder, which achieves great performance on the Beehive task. It is also the best of the investigated models on the pitch detection task, but falls short on more complex ones like FSD50K.

Both the neural audio codecs and the archisound autoencoder perform poorly on the more complex tasks with longer audio clips (FSD50K, GTZAN Genre), suggesting that simply aggregating very fine-grained embeddings by taking a mean is not the right approach.

Conclusion

Given the large amount of training data, we expected CLAP to perform well on a wide variety of tasks, but it has not delivered in this trial. Moreover, while neural codecs can compress audio to bitrates previously thought impossible, naively using their outputs as inputs to other tasks will not always lead to good performance. As suggested by recent generative models like MusicLM, a combination of a high-level, semantic representation like CLAP and a low-level one like the neural codecs is one way to go forward.

This post was written at the beginning of January to summarize the progress that happened in 2022. However, the start of 2023 has already been very eventful for audio AI, with several models released by different groups showcasing text-to-audio diffusion, one of them based on an updated version of the archisound autoencoder. 2023 is already starting to look like one of the most exciting years in recent history for AI audio as a whole, and it is likely this will apply to the audio representation side as well.
