# Audio signal feature extraction and clustering

Machine learning has been trending for almost a decade now. No wonder it has made countless claims and breakthroughs in the last few years. If you observe closely, probably you are using a tool made using ML right now. It is so amazing it almost feels like black magic and to be honest, if you knew what this field of Computer Science is capable of, you would end up feeling the same.

But since you are here reading this article, probably you already know. So let us not waste anymore time and get started.

## A word to the reader

- If you don’t have a Maths background, it is totally okay, I will provide visualizations to make it as easy as it can be plus this algorithm is probably one of the easiest in the whole field of ML. Probably a thing to learn from this is simple things can work wonders if they are designed right.
- This article will be lengthy so don’t give up. Take notes if necessary and code along to get the most out of it.
- The article and the following code have been inspired from a chapter of the book
*“Machine Learning with TensorFlow”.*Goes without saying that this is a more detailed approach to the problem using different set of tools.

This is a great book if you already know most of the well known algorithms or have at least some hint about them else you will just end up copying code from the book understanding nothing.

This book does not go into much detail about the algorithms and as the name suggests, it is not supposed to.

## Libraries and Frameworks

An important part of any application development involves picking the right tools and **NOT** reinventing the wheel. We will be using the following in our code( They are hyperlinked to their Code base ) :

Every library has its purpose and I’ll explain when to use what along the way.

*Installing them should be easy and I expect it to do it on your own.*

## A bird’s eye view

The fancy term for *Rules *is **Model. **A model is basically defined by parameters which is basically an array / matrix of numbers. The objective of any ML algorithm is to find the correct value of these numbers, so that we can use this trained model on data that hasn’t been seen by the model yet, in other words, new data.

There are two types of ML algorithms.

- In
**supervised**learning, the training data is labeled. For example, in this diagram, the ‘total_bill’ column is label while the rest is data. So for any given**x**in your training data you will be provided with**y.**

- In
**unsupervised**learning the label is not provided. So what are we supposed to do with the data. Our task in unsupervised learning is to find meaningful relations in the data. The example of this type of learning is soon to come as we would be implementing one of the most popular unsupervised learning algorithms.

## Audio Features

What do we mean when we say features? Something that can be quantified using numbers and should be observable in all the data points. For example a feature of a house would be number of bedrooms, price of the house, area of the house, age of the house etc. How about music files?

Every music file is basically made up of two important things, the **sample rate**, and the **sample data**. This has been beautifully described here in thes two videos. *I am no expert on audio signals hence if you wish to know more on the subject, I would suggest you go though these videos. For our purpose, it is totally optional.*

Now with the help of the sample rate and the sample data, one can perform several transformations on it to extract valuable features out of it.

- Zero Cross Rate
- Energy
- Entropy of Energy
- Spectral Centroid
- Spectral Spread
- Spectral Entropy
- Spectral Flux
- Spectral Roll off
- MFCC
- Chroma Vector
- Chroma Deviation

All of the above features have some or the other use but the one that we will be using is the ** Chroma Vector**. It is a representation of how humans relate colors to notes. In other words we think of

**same**notes but from two

**different**octaves to be of the same color. Thus we have 12 possible values at each window. A, A#, B, C, C#, D, D#, E, F, F#, G and G#. Of course they are not mutually exclusive hence for a given time fame one can have

*more than one*note. But to keep things simple we will only select the most prominent note for a particular window. Let us visualize the chroma vector using a chromagram.

As we can see, there is indeed more than one note being hit in the same time window. The arrows in the image below show the prominent notes that we would select for the given sample of a different example.

Now that we have a vector of notes, we will further find the note frequency. In other words we will find out which note is being hit how many time in the song. This image show the plot of the data we are expecting.

For example in this image, the 12th note was hit the most as compared to other notes. So finally we will be left with a (1x12) vector representing a data point. This will form the basic unit of our data set.

## K-means explanation

The algorithm in itself is pretty simple:

- Initialize all
**k**centroids. - Loop step 3 and 4 for given number of epochs
- Label the data-points with the closest centroid.
- Recalculate the centroids by taking the mean of all the data-sets with same labels.

Let us try to understand what is actually happening.

Each data point which is being represented as a set of 12 float values in our case if basically a 12 dimensional vector. Any point in 3D space can be represented as the set of 3 perpendicular vectors.

Well this would exactly be the case had our point been a feature vector of 3 features. Unfortunate we have 12 of them. So how do you visualize that? Well you don’t. That is the beauty of Maths, it lets you work with things you can’t even visualize in your head. So just extend all your vector geometry to a 12D space.

The distance formula now become:

Now to visualize what K-means is doing in each iteration, let us consider the data set to be 2D. Let k = 3. The algorithm first initializes 3 random centroids. Now the training loop starts and for each iteration, it paints each point in the data set with the color of the centroid that is closest (least distance) to it. When that is done, new centroids are calculated by taking the mean of the points with the same color. This animation shows the algorithm at work.

## Objective

Let’s jot down our main objectives here:

- Extract features and form an organized tabular table.
- Describe a model using the Tensorflow framework.
- Train the model using the feature table created in step 1.

The fun part of this DIY project is you can try it out on your own music files and you will be amazed to see the results.

## Code

We firstly import all the necessary libraries in our **jupyter-notebook**. This is mine.

Follow along.

import tensorflow as tf

import numpy as np

import pandas as pdfrom pyAudioAnalysis import audioBasicIO #A

from pyAudioAnalysis import audioFeatureExtraction #B

import matplotlib.pyplot as pltimport os #C

**#A — **This function is used to extract audio data like Frame rate and sample data of the audio signal.

**#B — **This function is responsible for extracting all the features from the audio signal that we talked about earlier.

**#C — **This is basically used to iterate through the music files in the file-system.

This is where we get our hands dirty. Let us start with the first point in our objective — **Extraction**. We start by defining the utility functions.

def preProcess( fileName ): [Fs, x] = audioBasicIO.readAudioFile(fileName) #A

#B

if( len( x.shape ) > 1 and x.shape[1] == 2 ):

x = np.mean( x, axis = 1, keepdims = True )

else:

x = x.reshape( x.shape[0], 1 )

#C

F, f_names = audioFeatureExtraction.stFeatureExtraction(

x[ :, 0 ],

Fs, 0.050*Fs,

0.025*Fs

)

return (f_names, F)

This function takes file name as an argument.

**#A — **We call the function provided by the library to read the audio data. This returns **Fs** and **x**. **Fs** is the *framerate* of the audio sample and **x** is a numpy array representing the *sample data* that you see in any music editing software.

**#B — **Did you notice that the picture I provided above has two sample data( The yellow wave live thing ) for same song. This is because it has two channels, one for the right and the other for left (Stereo). But we can only deal with a single channel (Mono). So we basically check if the sample data in fact has data for two channels, and if it does, then we take the mean of the two channel data.

**#C — **This function extracts all the 11 features that I had listed earlier and returns the feature names and their values in a numpy matrix. The **f_names** (feature names) will not be really useful as we know which row of the **F **matrix holds which data. The first argument is the sample data itself. The second and third argument are the **window size** ( 50 msec ) and **step amount** ( 25 msec ) respectively.

Lastly we return the the data from this function.

`def getChromagram( audioData ):`

# A

temp_data = audioData[ 21 ].reshape(

1,

audioData[ 21 ].shape[0]

)

chronograph = temp_data

# B

for i in range( 22, 33 ):

temp_data = audioData[ i ].reshape(

1,

audioData[ i ].shape[0]

)just say you love me

chronograph = np.vstack( [ chronograph, temp_data ] )

return chronographjust say you love me

Like we discussed earlier, we are planning to use only the chronagram feature of the audio signals hence we will separate out that data from the rest of the function.

**#A —** We pick the “chroma_1” feature from the audioData numpy matrix and create a new numpy matrix.

**#B — **We loop through the rest 11 chroma and stack them vertically on top of each other.

`def getNoteFrequency( chromagram ):`

numberOfWindows = chromagram.shape[1] #A

freqVal = chromagram.argmax( axis = 0 ) #B

histogram, bin = np.histogram( freqVal, bins = 12 ) #C

normalized_hist = histogram.reshape( 1, 12 ).astype( float ) / numberOfWindows #D

return normalized_hist

Like we discussed earlier, now we need to find the most prominent note in each window, and then we wish to find the frequency with which each of the 12 notes are hit.

**#A — **Total number of time frame aka windows. This will be useful while normalization step.

**#B —** Find the index of the most prominent note in the vertical axis ( axis = 0 ). This gives us a number between 0–11. The output might look something like this:

*[0, 0, 0, 1, 1, 3, 9, 9, 11, 0 …number of windows]*

**#C — **Now we basically get the count of each note using the inbuilt function.

**#D — **Finally we normalize the data for the reason we mentioned above. This is the final feature vector which will define our data point. This will be a (1x12) vector — The relative frequency for each of the 12 notes.

**Note: **The plotting functions are trivial and can be directly copied from matplotlib examples page. You can play around with these functions and different files to see how the chromagram and frequency plot changes with different genre of music.

fileList = []def getDataset( filePath ):

X = pd.DataFrame( )

columns=[ "G#", "G", "F#", "F", "E", "D#", "D", "C#", "C", "B", "A#", "A" ]

for root, dirs, filenames in os.walk( filePath ):

for file in filenames:

fileList.append( file )

feature_name, features = preProcess(filePath + file )

chromagram = getChromagram( features )

noteFrequency = getNoteFrequency( chromagram )

x_new = pd.Series(noteFrequency[ 0, : ])

X = pd.concat( [ X, x_new ], axis = 1 )

data = X.T.copy()

data.columns = columns

data.index = [ i for i in range( 0, data.shape[ 0 ] ) ]

return data

Nothing special in this, we are just iterating over all the files in the filepath directory. We have finally created a **pandas** **Dataframe** out of our feature vector. The reason I prefer *pandas dataframe* over *numpy matrix* is because it makes the data look beautiful with column names and indices. I believe in making my work look pleasing so it is totally fine if you disagree and stick to numpy. Here is my dataframe.

Each row is a data-point. Each column is a feature. Thus we have 19 files and 12 features each. We have also created a **file list. **This will come in handy later when we display our output.

Now we can start working on the second part of our objective —**Describe the model.**

*Note: Teaching Tensorflow is beyond the scope of this article. If you have no clue how to use it, I would recommend going through the **Sessions** and **Graphs** docs of Tensorflow*

k = 4epochs = 1000

We start by defining the hyper-parameters for the K-means clustering algorithm. **k **being the number of clusters we wish to segregate the data into. **epochs **is the number of iterations the algorithm will run for. Both of them can be changed as and when necessary.

`def initilizeCentroids( data, k ):`

centroids = data[ 0: k ] #A

return centroids

As we had discussed earlier, we will start the algorithm by selecting k data points as the initial centroids. This function does exactly that.

**#A — **Given **data** and **k, **return the first k data points and these k points will act as initial centroids.

`X = tf.placeholder( dtype = tf.float32 )`

C = tf.placeholder( dtype = tf.float32 )

C_labels = tf.placeholder( dtype = tf.int32 )

So these are the tensors that will act as placeholders for our data. Here **X** will represent the data, **C** is the list of k centroids and **C_labels** is the centroid index that has been assigned to each data point.

Now let us take a look at the computation graphs that we have described:

The code for this looks like this.

expanded_vectors = tf.expand_dims( X, 0 ) #Aexpanded_centroids = tf.expand_dims( C, 1 ) #Bdistance = tf.reduce_sum( tf.square( tf.subtract( expanded_vectors, expanded_centroids ) ), axis = 2 ) #CgetCentroidsOp = tf.argmin( distance, 0 ) #D

Each of the above line are really doing many things at once. Let use break them down one by one.

**#A — **We add another dimension to **X** at index 0. If **X** had a shape of (19x12), now it is (1x19x12).

**#B — **We add another dimenstion to **C** at index 1. If **C** had a dimension of (kx12), now it is (kx1x12).

**#C — **You must be awfully puzzled why on earth we did add dimension to **X** and **C** in the above two steps. This is to make use of ** broadcasting**. That is a topic in itself to explain but here is a video to get you up to speed. Now here is what we did.

Try to recollect what we are trying to achieve when we wish to assign new centroids to data points.

*We wish to subtract each data point with each centroid to find the distance between them, then select the centroid that gave the least distance for each data point.*

Thus we will do **(Kx19)** distance calculations, and for each calculation, we will process **12** features for each vector. Notice that the 3rd dimension is same for both — 12. Thus each data point is copied K times in the 1st dimension, and each centroid is copied 19 times in the 2nd dimension. Thus subtracting the two will give us (Kx19) elements each having 12 features in the 3rd dimension.

Let us slice this (k,19,12) tensor along the plane perpendicular to the 2nd dimension.

As you can see, this slice is basically the difference of the **i***th* data point with all the K centroids. I hope this visualization will help you understand the concept of broadcasting even better.

Now we need to do an element-wise square for the whole matrix and sum up the values along the 3rd dimension. And we will be left with the distance of each point with all the K vectors. All we have to do is find the least one for each of the data points. Thus we take the argmin across the 1st dimension.

Thus we are left with the new centroids for each data point.

`sums = tf.unsorted_segment_sum( X, C_labels, k ) #A`

counts = tf.unsorted_segment_sum( tf.ones_like( X ), C_labels, k ) #B

reCalculateCentroidsOp = tf.divide( sums, counts ) #C

This code basically calculates the new centroids from the assigned labels and the data values. The computation graph is as follows.

**#A —**This adds up the feature vectors of the data points with same label id. Will end up giving K feature vectors.

**#B —** This will do the same thing as the above step but instead of adding the feature vectors of data points, we add ones. This will be used to take the mean of the K feature vectors generated in the above step finally giving us the new centroids.

**#C — **Dividing the two tensors to generate the new centroids.

We have finally reached the final part of our objective —** Training**

data_labels = [] #A

centroids = [] #Awith tf.Session() as sess:

sess.run( tf.global_variables_initializer() )

centroids = initilizeCentroids( data, k ) #B

for epoch in range( epochs ):

data_labels = sess.run( getCentroidsOp, feed_dict = { X: data, C: centroids } ) #C

centroids = sess.run( reCalculateCentroidsOp, feed_dict = { X: data, C_labels: data_labels } ) #D

print( data_labels )

print( centroids )

This is the driver function for the whole algorithm. We feed the graph all the necessary data from here. The training loop is also defined here.

**#A —** Defining the variables outside the scope so that they can be used after outside the scope as well.

**#B —** initializing the centroids.

Now withing the training loop, for the given number of epochs, the following two steps will be done.

**#C —** Assign the centroids to the data points.

**#D — **Recalculate the centroids.

## Output

Now you can use the file list you have save earlier and the new data_labels you just calculated to see which audio files can be clustered together.

`final_labels = pd.DataFrame( { "Labels": data_labels, "File Names": fileList } )`

final_labels

# Parting words

With that you have successfully understood and implemented you very own K-means audio signal clustering algorithm. If you wish to improve the code I wrote or have some suggestions for me, lemme know.

## Edit:

I have added a new TF2.0 implementation for the same concept in this Kaggle notebook. Do check it if you are planning to use TF2.0 instead of TF1.x.

## Edit:

A followup article has been added for you all: