Comparative Audio Analysis With Wavenet, MFCCs, UMAP, t-SNE and PCA
The results are hosted in a small web application on my university's servers — have a play with it! Run the mouse over the purple dots to hear the sounds that are associated with the two-dimensional position vector.
Feel free to play with the features used (MFCCs or Wavenet latent variables) and the method of dimensionality reduction (UMAP, t-SNE or PCA). UMAP and t-SNE also have parameters, such as the number of steps or the perplexity, that can be tweaked.
So what do we mean by dimensionality? It is an important topic in machine learning and data science that describes the potential complexity of a dataset. A dataset will comprise a multitude of data points, each having a fixed number of features, or dimensions. For example, I might be an avid bird watcher and create a dataset on the birds I come across in my travels. If each of the data points stored values such as beak length, wingspan and feather colour, then I would say my dataset has a dimensionality of three.
And why exactly do we care about dimensionality? Take this analogy:
You have dropped some cash somewhere along a straight path. You want to find it, so you walk along the line and find it after a relatively short amount of searching.
Clumsily, you are playing sport and drop the cash again in the field you were playing in. Finding the cash is considerably harder now as there are two axes to check for every position. Consequently, finding the cash takes considerably more time.
You magically become the world's most clumsy astronaut. During a spacewalk, the cash slips out of your back pocket. Irritated, you set out the next day to find the cash. You are now searching through three dimensions trying to find the cash in the vacuum of space. This takes far more time and resources than before and understandably the guys at Houston are not happy.
Hopefully it is clear that as the dimensions increase (often well beyond three dimensions), finding solutions and relevant areas (i.e. where the cash is) requires greater amounts of time and resources. This holds true for both humans and computers. Another important issue is that you’ll need more data to precisely model higher dimensional space. As dimensionality increases, the volume of the space grows exponentially, so the available data becomes sparse, and it may be too sparse to support a statistically significant model, as the data points all seem dissimilar in many ways over the many dimensions.
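To put rough numbers on the analogy, here is a minimal sketch that counts how many places there are to check as axes are added. The function name and the choice of 100 search positions per axis are purely illustrative assumptions.

```python
def positions_to_search(positions_per_axis: int, dimensions: int) -> int:
    """Count the grid cells in a space with the given number of axes."""
    return positions_per_axis ** dimensions


if __name__ == "__main__":
    # The path, the field and the spacewalk from the analogy, then a 10-D dataset.
    for name, dims in [("path", 1), ("field", 2), ("space", 3), ("10-D dataset", 10)]:
        print(f"{name}: {positions_to_search(100, dims):,} positions")
```

The count is multiplied by 100 with every extra axis, which is the exponential growth in volume described above.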
Dimensionality reduction is an important topic for those who practice machine learning, as high dimensionality can lead to a high computational cost as well as a tendency to overfit the data. With this in mind we begin to unravel the subject aptly named the curse of dimensionality, which refers to the phenomena that arise when computing high dimensional datasets in some manner.
What Is Dimensionality Reduction?
In dimensionality reduction we are looking to reduce our dataset’s dimensions. The higher the number of dimensions, the harder the data is to visualise, and the features can be correlated, which increases the redundancy of information in the dataset.
The simplest approach to dimensionality reduction might be to choose a subset of features that best describes our data and discard the rest of the dimensions, which is called feature selection. This unfortunately is likely to throw away information.
A slightly better class of solutions could be instead transforming the dataset to a lower dimensional one. This is called feature extraction and is the main thrust of this article.
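The difference between the two approaches can be sketched in a few lines of numpy. The bird measurements and the projection matrix below are made-up values for illustration only; methods such as PCA learn the projection from the data rather than having it written by hand.

```python
import numpy as np

# Three birds, three features: beak length (cm), wingspan (cm), feather colour (hue, 0-1).
birds = np.array([
    [1.2, 25.0, 0.10],
    [3.5, 60.0, 0.55],
    [2.1, 40.0, 0.30],
])

# Feature selection: keep a subset of the original columns and discard the rest.
selected = birds[:, [0, 1]]

# Feature extraction: build new features as combinations of all the originals.
# This projection matrix is arbitrary and only shows the shape of the operation.
projection = np.array([
    [0.5, 0.0],
    [0.1, 0.2],
    [0.0, 1.0],
])
extracted = birds @ projection
```

Both results have two columns, but only the extracted version still carries some information from the feather colour column.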
What Data Are We Using?
As an audio nerd, I thought it would be nice to try and reduce audio files (each of an arbitrary length) down to a couple of values, so that they might be plotted and explored in a two-dimensional graph. This would enable one to explore an audio dataset and hopefully find similar sounds quickly. In Python, we can easily obtain the audio PCM data by using the librosa library. Here we loop through a folder of samples and load the audio data for each file, provided it is a wav file.
In this project the idea is to load the sample into memory and create a sequence of features from the audio. The features are then processed in a manner covered later so that it doesn’t matter how long the sequence of features is. After this, the features can be reduced in dimensionality by some method, for example PCA.
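A minimal sketch of that loading loop, assuming librosa is installed, might look like this. The function names and the dictionary return type are illustrative choices, not the original notebook code; 22050 Hz is librosa's default sample rate.

```python
from pathlib import Path


def find_wav_files(directory):
    """Return the paths of all .wav files directly inside `directory`, sorted."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".wav"
    )


def load_samples(directory, sample_rate=22050):
    """Load every wav file in `directory` as a mono float array of PCM data."""
    import librosa  # third-party; imported here so find_wav_files works without it

    samples = {}
    for path in find_wav_files(directory):
        # librosa.load returns (audio, sample_rate) and resamples to `sr`.
        audio, _ = librosa.load(str(path), sr=sample_rate)
        samples[path.name] = audio
    return samples
```

Calling `load_samples("samples/")` (a hypothetical folder name) would return one array per wav file, keyed by file name.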
There are a number of ways we could take an array of PCM data and transform it to best describe the sound. We could turn the sound into frequencies over time and look at things like the spectral centroid, or the zero crossing rate. But next we’ll look at a robust feature that is prolific in speech recognition systems, the Mel-Frequency Cepstral Coefficients.
Mel-Frequency Cepstral Coefficients (MFCCs) can actually be seen as a form of dimensionality reduction; in a typical MFCC computation, one might pass a snippet of 512 audio samples and receive 13 cepstral coefficients that describe that sound. Whilst MFCCs were initially developed to represent the sounds made by the human vocal tract, they turn out to be a pretty solid timbral, pitch invariant feature that has all sorts of uses outside of automatic speech recognition tasks.
When obtaining MFCCs, the first step is computing the Fourier transform of our audio data, which takes our time domain signal and turns it into a frequency domain signal. This is computed by the fast Fourier transform, which is an incredibly important algorithm of our time.
We now take the power spectrum from the frequencies we just computed and apply the Mel filterbank to it. This is as simple as summing the energies in each filter. The Mel-frequency scale relates the perceived frequency of a pure tone to its actual measured frequency; we are much better at noticing small perturbations at low frequencies than we are at high frequencies. Applying this scale to the power spectrum relates the features more closely to what humans actually perceive.
We then compute the logarithm on each of the filtered energies, which is motivated by human hearing that doesn’t perceive loudness in a linear scale. This means if the sound is loud to begin with, large variations in volume won't sound that different.
The final step is to compute something called the cepstrum. A cepstrum is the spectrum of a spectrum. In plain English, that means computing the Discrete Cosine Transform (DCT) of the log filterbank energies, which gives us the periodicity of the spectrum and shows us how quickly the frequencies themselves are changing. The DCT is similar to the Fourier transform, but the DCT only returns real numbers (floating point), whereas the DFT returns a complex signal of real and imaginary numbers.
Whilst it’s nice to have an overview of MFCCs, fortunately Python and librosa allow us to be slightly more terse than the author of this article and compute the features in one line.
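The four steps above (Fourier transform, Mel filterbank, logarithm, DCT) can be sketched from scratch in numpy. This is a simplified illustration under assumed settings (26 filters, a Hann window, a hop of 256 samples, an unnormalised DCT-II), so its values will not match any particular library's output exactly.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters whose centres are evenly spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):       # rising slope
            fbank[i - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):      # falling slope
            fbank[i - 1, k] = (right - k) / (right - centre)
    return fbank


def dct_matrix(n_out, n_in):
    """Unnormalised DCT-II basis: one row per output coefficient."""
    n = np.arange(n_in)
    k = np.arange(n_out)[:, None]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))


def mfcc_from_scratch(y, sr, n_mfcc=13, n_fft=512, hop_length=256, n_filters=26):
    """Return an array of shape (frames, n_mfcc); needs len(y) >= n_fft."""
    n_frames = 1 + (len(y) - n_fft) // hop_length
    idx = np.arange(n_fft)[None, :] + hop_length * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hanning(n_fft)
    # Step 1: Fourier transform of each frame, then its power spectrum.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / n_fft
    # Step 2: sum the energy under each Mel filter.
    energies = power @ mel_filterbank(n_filters, n_fft, sr).T
    # Step 3: logarithm (a small constant avoids log(0)).
    log_energies = np.log(energies + 1e-10)
    # Step 4: DCT of the log filterbank energies gives the cepstral coefficients.
    return log_energies @ dct_matrix(n_mfcc, n_filters).T
```

Each 512-sample frame goes in and 13 numbers come out, which is the reduction described earlier.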
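As a sketch of that one-liner, assuming librosa is installed: `librosa.feature.mfcc` returns an array shaped (coefficients, frames), so the wrapper below (a hypothetical helper name) transposes it to give one row per frame.

```python
def compute_mfccs(audio, sample_rate, n_mfcc=13):
    """Return an array of shape (frames, n_mfcc) for a mono audio signal."""
    import librosa  # third-party dependency, imported on first use

    # librosa returns (n_mfcc, frames); transpose so each row is one frame.
    return librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc).T
```

The default of 13 coefficients follows the figure quoted earlier in the article; the frame and hop sizes are left at librosa's defaults.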
Wavenet and Neural Audio Synthesis (NSynth)
Google’s project Magenta is a group that asks the question: Can machine learning be used to create compelling art and music? Neatly sidestepping the undefined, black-hole questions of computational creativity, they have engineered some incredibly cool generative tools that create various forms of media such as images or music.
DeepMind (another Google subsidiary) created one of the most prolific and impressive neural networks, called Wavenet. Magenta took this generative model and turned it into an autoencoder, and the resultant network was dubbed NSynth.
In case you haven’t come across an autoencoder before, they are simply a type of neural network often used for unsupervised learning. The aim of an autoencoder is usually to learn an efficient encoding of some data, often for the purpose of dimensionality reduction, and increasingly for generative models. A common feature of an autoencoder is its structure; it comprises two parts, the encoder and the decoder. Often, but not always, the decoder’s weights will be the transpose of the encoder’s (known as tied weights). If you are confused by the transpose operation, why not check out my guide on linear algebra!
As I mentioned, the aim of an autoencoder is often to compress the inputs into a smaller latent variable. Here, that latent variable Z is a low dimensional embedding that is a function of the input audio.
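To make the encoder and decoder idea concrete, here is a toy forward pass through a tiny autoencoder with tied weights, written in numpy. It is an untrained illustration with made-up sizes, and it is not NSynth's architecture, which is a much larger Wavenet-style network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_latent = 8, 2

# Encoder parameters; the decoder reuses the encoder weights transposed (tied weights).
W = rng.normal(scale=0.1, size=(n_latent, n_inputs))
b_enc = np.zeros(n_latent)
b_dec = np.zeros(n_inputs)


def encode(x):
    """Compress an input vector into the smaller latent variable z."""
    return np.tanh(W @ x + b_enc)


def decode(z):
    """Reconstruct an input-sized vector from z using the transposed weights."""
    return W.T @ z + b_dec


x = rng.normal(size=n_inputs)
z = encode(x)          # 8 numbers squeezed into 2
x_hat = decode(z)      # back up to 8 numbers
# Training would adjust W and the biases to make this error small.
reconstruction_error = float(np.mean((x - x_hat) ** 2))
```

After training, `z` is the part of interest for this article: a compact description of the input that can be used as a feature.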
It is shockingly simple to leverage this fantastic network. First, install Magenta (it’s just TensorFlow code!), and then download this model’s weights to your working directory. The code below will then obtain a vectorised sequence of hidden states from the network compressing the information in the original signal.
All of the samples in this dataset are of various sizes, which are denoted by the fifth column in the below console output.
As we compute features for each of these samples, be it MFCCs or NSynth, the difference in their lengths causes a difference in the length of each sequence of the resultant features. An issue we faced in this project was somehow taking a variable length sequence of features and compressing it into a fixed-length vector of numbers that would describe each sound well.
In the end, the feature vector for each sound was the concatenation of three things. Firstly, the mean feature, which gave us the average feature for that distribution of features in the sequence for that sound. This meant that for each dimension of the feature, the mean was computed. For MFCCs, the mean feature had a size of 13, and for NSynth, 16.
Secondly, the standard deviation of each dimension in the feature was computed. This had the same size as the mean feature, and told us the spread of the distribution of features.
Finally, we calculated the mean first order difference between the successive feature frames. This told us how much on average the features changed over time. This again had a size of 13 for MFCCs, and 16 for NSynth features.
The concatenation of these features means that from an end to end perspective any sample of an arbitrary length would be squashed from its respective length in samples to 39 numbers if the features used were MFCCs, or 48 numbers if the features were Wavenet based. Given an arbitrary length numpy array of arbitrary dimensional features, computing a single sized feature vector is as follows:
The first port of call is a textbook linear algebra algorithm, Principal Component Analysis. I recall Dr. Rebecca Fiebrink, who teaches an awesome machine learning course on Kadenze (seriously check it out), expressing mild exasperation over machine learning n00bs like me jumping to more complex algorithms such as t-SNE before exploring bread and butter algorithms such as PCA.
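One way to write this in numpy, as a sketch rather than the original notebook code, is shown below. It assumes the features are arranged with one row per frame, and it reads "mean first order difference" as the mean of the signed frame-to-frame differences; taking absolute differences would be a reasonable alternative.

```python
import numpy as np


def summarise_features(features):
    """Collapse a (frames, dims) feature sequence into one vector of length 3 * dims."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    if features.shape[0] > 1:
        # Mean first-order difference between successive frames (signed).
        mean_delta = np.diff(features, axis=0).mean(axis=0)
    else:
        # A single frame has no successor, so there is no change to measure.
        mean_delta = np.zeros_like(mean)
    return np.concatenate([mean, std, mean_delta])
```

A sequence of 13-dimensional MFCC frames of any length comes out as 39 numbers, and a sequence of 16-dimensional NSynth frames as 48.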
PCA works by trying to maximise the variance in data whilst reducing the dimensionality of it. It converts the data into linearly uncorrelated variables called principal components. Assuming we would like a two dimensional plot of this transformed data, we would use the two principal components with the largest variance to reveal the structure in the data. If you would like to understand it further, please see my linear algebra blog post for a numpy implementation and explanation.
We can easily compute PCA on features like so:
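A minimal numpy sketch of the projection, using the singular value decomposition, might look like this. The function name is illustrative; scikit-learn's `PCA(n_components=2).fit_transform(X)` should give the same result up to the sign of each axis.

```python
import numpy as np


def pca_embedding(feature_vectors, n_components=2):
    """Project (samples, dims) data onto its top principal components."""
    X = np.asarray(feature_vectors, dtype=float)
    X_centred = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
    return X_centred @ Vt[:n_components].T
```

Passing in the stacked 39- or 48-dimensional feature vectors returns one two-dimensional point per audio file, ready to plot.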
So what do the plots look like? Recall we essentially have two datasets, one with Wavenet-based features and one with MFCC derived features, and so each point in the two dimensional plot represents an audio file. We can see the Wavenet plot here:
And the MFCCs are plotted next:
Interestingly, both feature plots have a small concentration of comparable samples: kicks and other very short percussive sounds, often with a lot of low end energy in the signal. Empirically, this could mean that the two feature vectors discriminate well on these kinds of sounds.
In both plots for the two features, it would also roughly seem that the frequency content is dictated by the y-axis; if you try running the mouse from the top to the bottom of the screen over the plots, the hi-hats and other high frequency sounds are at the top and the kicks tend to be at the bottom, whilst the more mid-energy snares and claps are around the middle. However, it was harder to discern what the x-axis represented.
The next dimensionality reduction algorithm, t-Distributed Stochastic Neighbour Embedding (t-SNE), was designed for high dimensional datasets by Laurens van der Maaten and the neural network deity Geoffrey Hinton.
There are two stages to the t-SNE algorithm. It first constructs a probability distribution over pairs of high dimensional objects, so that similar objects are more likely to be picked. Because we want a low dimensional representation of these high dimensional objects, it constructs a similar probability distribution over the points in the low dimensional map. The divergence between the two probability distributions is then minimised. This divergence, or relative entropy, is called the Kullback–Leibler divergence.
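The two distributions and the divergence between them can be sketched in numpy as follows. This is a simplification: real t-SNE tunes a separate Gaussian width for every point from the perplexity, whereas this sketch uses one fixed `sigma` for all points, and it only evaluates the objective rather than optimising it.

```python
import numpy as np


def pairwise_sq_dists(X):
    """Squared Euclidean distance between every pair of rows."""
    sq = np.sum(X ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)


def high_dim_affinities(X, sigma=1.0):
    """Gaussian similarities in the original space, normalised to sum to 1."""
    logits = -pairwise_sq_dists(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbour
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)            # conditional p(j | i), row by row
    return (P + P.T) / (2.0 * len(X))            # symmetrise into a joint distribution


def low_dim_affinities(Y):
    """Student-t similarities in the low dimensional map, normalised to sum to 1."""
    inv = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()


def kl_divergence(P, Q, eps=1e-12):
    """Kullback-Leibler divergence KL(P || Q), the quantity t-SNE minimises."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))
```

t-SNE moves the low dimensional points `Y` by gradient descent until `kl_divergence(P, Q)` stops improving.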
Sklearn makes computing t-SNE embeddings easy.
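A minimal sketch using scikit-learn's `TSNE` class might look like this. The helper name is hypothetical. The scaling to the zero-to-one range follows the author's remark that the embeddings were scaled for plotting. The iteration-count keyword has been named differently across scikit-learn releases, so it is left to pass through `**tsne_kwargs`; check the name in the version you have installed.

```python
import numpy as np


def scaled_tsne_embedding(feature_vectors, perplexity=30.0, **tsne_kwargs):
    """Embed (samples, dims) data in 2-D with t-SNE, then min-max scale to [0, 1]."""
    from sklearn.manifold import TSNE  # third-party dependency, imported on first use

    X = np.asarray(feature_vectors, dtype=float)
    tsne = TSNE(n_components=2, perplexity=perplexity, **tsne_kwargs)
    embedding = tsne.fit_transform(X)
    # Scale each axis independently; guard against a zero-width axis.
    low = embedding.min(axis=0)
    span = embedding.max(axis=0) - low
    return (embedding - low) / np.where(span > 0, span, 1.0)
```

Note that scikit-learn requires the perplexity to be smaller than the number of samples.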
There were a few parameters for the t-SNE function, which are really well described here. I’ll leave a brief explanation however. The first argument for the algorithm is perplexity, which relates to the number of nearest neighbours used in other manifold learning algorithms. The perplexity changed for each new column. The other parameter is the iteration amount, which is how long t-SNE should optimise for. The iteration amount increased for every successive row. The iteration count had a far greater impact on the plots, and using the Wavenet features, we can see them here:
The MFCC feature based plots are next:
It was clear, for both feature datasets, that a solution was not correctly optimised in the plots when the iteration count was low (the first few rows). This was notably pointed out in the Distill article linked earlier on using t-SNE effectively.
There was some clustering of sound that started to appear at the later and higher amounts of iterations. However, for both the feature sets, sometimes the local structure did not have comparatively similar sounds. The global structure often did paint the trend of the sounds, i.e. one large portion of the plot would mostly be kicks and another might be hi-hats. The perplexity seemed to have little effect on the algorithm, which is well documented in the literature and sklearn’s documentation.
Uniform Manifold Approximation and Projection is a dimensionality reduction technique. It has produced some really exciting results and I strongly urge you to check it out. On the github page it is described as working as follows:
Uniform Manifold Approximation and Projection (UMAP) is founded on three assumptions about the data:
The data is uniformly distributed on a Riemannian manifold;
The Riemannian metric is locally constant (or can be approximated as such);
The manifold is locally connected (not globally, but locally).
From these assumptions it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.
It is very straightforward to use as it is designed to be functionally similar to sklearn’s t-SNE implementation. Here is the code I used to create the embeddings of the MFCC and Wavenet features.
Again I scaled the embeddings to be between zero and one, because the plots needed to be interpolated between one another. In the embeddings, scale doesn’t really matter as, like t-SNE, the only thing that matters is which points are close to one another. In the code, we can see that again, some lists are looped over in a nested for loop to parameterise the UMAP function so we can see how it affects the embeddings. Note that the parameter settings on the far left and far right of the lists were poor arguments, and it was just the author’s desire to see how the algorithm ran with such arguments.
The results for the Wavenet features were beautiful plots with very interesting global and local structures. Each column is the number of neighbours supplied to the algorithm, from a set of values [5, 10, 15, 30, 50]. A greater number of neighbouring points used in local approximations of manifold structure will result in global structure being gained where local structure is lost. Each row is one of a set of values [0.000, 0.001, 0.01, 0.1, 0.5] to parameterise the minimum distance, which controls how closely the embedding can pack the data points together. Having a larger value ensures the points are more evenly distributed, whereas smaller values will more accurately preserve local structure.
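A minimal sketch of such a parameter sweep, assuming the umap-learn package, might look like this. It is not the original listing: the helper name, the dictionary return type and the default Euclidean metric are illustrative assumptions.

```python
import numpy as np


def scaled_umap_embeddings(feature_vectors, neighbours, distances, metric="euclidean"):
    """Return {(n_neighbors, min_dist): (samples, 2) embedding scaled to [0, 1]}."""
    import umap  # third-party: the umap-learn package, imported on first use

    X = np.asarray(feature_vectors, dtype=float)
    embeddings = {}
    # Nested loop: one embedding per combination of the two parameters.
    for n_neighbors in neighbours:
        for min_dist in distances:
            reducer = umap.UMAP(
                n_neighbors=n_neighbors,
                min_dist=min_dist,
                n_components=2,
                metric=metric,
            )
            embedding = reducer.fit_transform(X)
            # Min-max scale each axis; guard against a zero-width axis.
            low = embedding.min(axis=0)
            span = embedding.max(axis=0) - low
            embeddings[(n_neighbors, min_dist)] = (
                (embedding - low) / np.where(span > 0, span, 1.0)
            )
    return embeddings
```

Calling it with `neighbours=[5, 10, 15, 30, 50]` and `distances=[0.000, 0.001, 0.01, 0.1, 0.5]` would produce a five by five grid of embeddings.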
And the MFCC plots are just as good looking.
What is striking about the plots is the local structure that arises at the lower parameter settings and, conversely, the global structure that emerges when both parameter settings are very high. Jumping between the two features, where the parameter settings are the same, it is noticeable that the Wavenet based features tend to preserve local structure slightly better than the MFCC based features.
In the interactive demo, try running the mouse over the local structure with the neighbours and distances sliders at a relatively low value like 1 or 2. You should notice that the algorithm correctly clusters these sounds together.
Largely, each of the algorithms was useful and it was really informative to parameterise and plot the outputs against both sets of features. One notable remark to make is on the interpretability of the plots. PCA seemed to be the strongest algorithm in this category, due to its relative simplicity. It was relatively easy to notice that the y-axis more or less encompassed the high frequency content of the sample, which was a nice revelation.
With UMAP’s minimum distance kept low and the neighbour count at a very low value, it is easy to say that UMAP had far better local structure. Often the little lines and clusters were samples that had very high perceptual similarity. Inverting the parameters so that we had a high neighbour count and minimum distance meant incorporating more global structure into the algorithm, and the global structure was a lot more convincing and empirically stronger than t-SNE’s or PCA’s structure.
The Wavenet features were actually thrown in on a whim over the weekend and they really proved to be robust and reliable features when paired with the dimensionality reduction techniques. There was no noticeable degradation in clustering when compared to MFCC plots, and in other cases the plots actually appeared to be improved by using Wavenet embeddings compared to MFCCs with the same algorithm parameter settings. However, it is also important to highlight the time taken to compute the features. MFCCs win easily here, as Wavenet is a complex network that takes a few seconds to process each sample on my laptop’s GPU. There is also the issue of downsampling the audio as an input for Wavenet. This degrades the quality of the audio inputs and potentially discards important information.
One nice takeaway is that we can do good, useful dimensionality reduction without complex newer techniques such as UMAP or t-SNE. PCA with MFCCs as the feature produces usable and interpretable graphs which, although they are not as visually striking, still have functionality that everyone can use. However, one cannot deny the structural beauty of the other techniques.
Very quickly, there are two code bases I want to share with you. The first is the notebook that was used to make this article. It’s not as polished as I would usually like but I’m working against the clock so here it is. Feel free to use it, abuse it, and extend it as you like.
I have also uploaded the code for visualising these plots in the browser to github. I used the Material Design Lite library to create the user interface in a relatively clean manner, and the THREE.js library to plot the data super fast and optimised. The audio was made easier with webaudiox.js.
As the author of this, I’d just like to say a massive thank you for reading this far. If you have any comments, please hit me up on twitter or leave a comment here. If you like the article, a share would go a long way. Peace!