Classifying Music and Speech with Machine Learning
An audio classification walkthrough with code
Introduction
The difference between music and speech is crystal clear to human ears, but how do you train a machine to do the same?
My goal is to create a classifier that can differentiate between music and speech.
Like my earlier articles on Pokémon and waste classification, I’ll do this using a convolutional neural network.
I based my approach and model on this TensorFlow tutorial, which builds a speech recognition network that recognizes 10 different keywords:
Data Source
For this project, I’ll use the GTZAN music speech dataset:
It is part of the TensorFlow Datasets catalog and contains 128 tracks that are each 30 seconds long. Under “Display Examples…” at the above link, you can listen to samples from both the music and speech classes.
Setup
First things first, I pip install Pydub, a Python library for manipulating audio.
%pip install pydub
You can read more about Pydub here:
Alternatively, you can install Pydub within your command line.
Now I’ll import all the libraries we’ll need for this project:
import numpy as np
import pandas as pd
import os
import pathlib
import matplotlib.pyplot as plt
import seaborn as sns
from IPython import display
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras import layers
from tensorflow.keras import models
import tensorflow_datasets as tfds
Then, I load the dataset from TensorFlow, set up a path to the directory where the data is stored, and store the names of the categories in a list.
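Since the full code cells are embedded as gists, here is a minimal sketch of this step. It assumes the WAV files downloaded by TensorFlow Datasets end up extracted into a folder containing music_wav/ and speech_wav/ subdirectories; the exact path and the variable names (data_dir, categories) may differ in your setup.

# Load the GTZAN music/speech dataset; this also downloads and extracts the WAV files.
dataset, info = tfds.load('gtzan_music_speech', with_info=True)

# Point to the extracted music_speech folder (its location may vary on your machine).
downloads_dir = pathlib.Path(os.path.expanduser('~/tensorflow_datasets/downloads/extracted'))
data_dir = next(downloads_dir.rglob('music_speech'))

# Store the category names (the subfolder names, e.g. music_wav and speech_wav).
categories = np.array(sorted(p.name for p in data_dir.iterdir() if p.is_dir()))
print('Categories:', categories)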
Note: WAV is an audio file format.
Now, we’ll get all the filenames:
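A sketch of that cell, following the pattern from the TensorFlow tutorial:

# Collect the path to every WAV file and shuffle them so the two classes are mixed.
filenames = tf.io.gfile.glob(str(data_dir) + '/*/*.wav')
filenames = tf.random.shuffle(filenames)
num_samples = len(filenames)
print('Number of total examples:', num_samples)
print('Example file:', filenames[0].numpy())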
From the output, we see that we have 128 samples, with 64 for each class.
Finally, we’ll split the dataset into training and validation sets in a 3:1 ratio. The TensorFlow audio recognition tutorial also creates a test set, but I’ll skip that here as we’re working with a tiny dataset.
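Here’s a sketch of the split using the train_test_split helper imported earlier (the random_state value is arbitrary):

# 3:1 split: 96 files for training, 32 for validation.
train_files, val_files = train_test_split(filenames.numpy(), test_size=0.25, random_state=42)
print('Training set size:', len(train_files))
print('Validation set size:', len(val_files))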
Data Preprocessing
To start, let’s create a dataset with the waveform and label for each training file.
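The sketch below follows the TensorFlow tutorial: each file is read, decoded from WAV into a float waveform, and paired with its label, which is taken from the name of its parent folder.

AUTOTUNE = tf.data.AUTOTUNE

def decode_audio(audio_binary):
  # Decode a 16-bit PCM WAV file into a float32 waveform in [-1.0, 1.0].
  audio, _ = tf.audio.decode_wav(audio_binary)
  return tf.squeeze(audio, axis=-1)

def get_label(file_path):
  # The label is the name of the file's parent folder (e.g. music_wav or speech_wav).
  parts = tf.strings.split(file_path, os.path.sep)
  return parts[-2]

def get_waveform_and_label(file_path):
  label = get_label(file_path)
  audio_binary = tf.io.read_file(file_path)
  waveform = decode_audio(audio_binary)
  return waveform, label

files_ds = tf.data.Dataset.from_tensor_slices(train_files)
waveform_ds = files_ds.map(get_waveform_and_label, num_parallel_calls=AUTOTUNE)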
Here’s what the waveforms look like:
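Something along these lines produces the waveform grid (the 3×3 layout is just one way to show a few examples):

# Plot the waveforms of a few training examples.
rows, cols = 3, 3
fig, axes = plt.subplots(rows, cols, figsize=(10, 12))
for i, (audio, label) in enumerate(waveform_ds.take(rows * cols)):
  ax = axes[i // cols][i % cols]
  ax.plot(audio.numpy())
  ax.set_title(label.numpy().decode('utf-8'))
plt.show()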
As we’re using a convolutional neural network for this project, we need to transform the waveforms into spectrograms, which are visual representations of the spectrum of frequencies of signals over time. We’ll create a function for this conversion:
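The conversion below uses a short-time Fourier transform (STFT), as in the TensorFlow tutorial. Since the GTZAN clips are 30 seconds long at 22,050 Hz, I crop or zero-pad every waveform to a fixed length so all spectrograms share the same shape; the exact length and STFT parameters here are illustrative.

def get_spectrogram(waveform):
  # Crop or zero-pad to a fixed length (30 s at 22,050 Hz ≈ 660,000 samples).
  input_len = 660000
  waveform = tf.cast(waveform[:input_len], tf.float32)
  zero_padding = tf.zeros([input_len] - tf.shape(waveform), dtype=tf.float32)
  equal_length = tf.concat([waveform, zero_padding], 0)
  # Short-time Fourier transform: frequency content over time.
  spectrogram = tf.signal.stft(equal_length, frame_length=255, frame_step=128)
  # Keep only the magnitude and add a channel dimension for the CNN.
  spectrogram = tf.abs(spectrogram)
  return spectrogram[..., tf.newaxis]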
As an example, this is the conversion for one music sample:
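A sketch of that cell (it assumes the first shuffled example happens to be a music clip):

# Convert one training example to a spectrogram.
for waveform, label in waveform_ds.take(1):
  label = label.numpy().decode('utf-8')
  spectrogram = get_spectrogram(waveform)

print('Label:', label)
print('Waveform shape:', waveform.shape)
print('Spectrogram shape:', spectrogram.shape)
# Uncomment the two lines below to play the sample inside the notebook.
# print('Audio playback')
# display.display(display.Audio(waveform, rate=22050))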
Note: unfortunately, IPython’s display does not render properly within GitHub gists. If you uncomment the last two lines in this code cell, you will have the option of playing the audio sample within the cell’s output under Audio playback.
For comparison, we’ll plot both the waveform and the spectrogram for this sample. In a spectrogram, color reflects the amplitude of each frequency at each point in time.
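A sketch of the plotting cell, with a small helper that draws the log-scaled spectrogram magnitudes:

def plot_spectrogram(spectrogram, ax):
  # Log-scale the magnitudes so quieter frequencies stay visible,
  # and transpose so time runs along the x-axis.
  log_spec = np.log(spectrogram.T + np.finfo(float).eps)
  height, width = log_spec.shape
  X = np.linspace(0, np.size(spectrogram), num=width, dtype=int)
  Y = range(height)
  ax.pcolormesh(X, Y, log_spec, shading='auto')

fig, axes = plt.subplots(2, figsize=(12, 8))
axes[0].plot(waveform.numpy())
axes[0].set_title('Waveform')
plot_spectrogram(np.squeeze(spectrogram.numpy()), axes[1])
axes[1].set_title('Spectrogram')
plt.show()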
Now, we’ll do the same preprocessing for the rest of the training set and the validation set.
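A sketch of the preprocessing function, which maps every file to a (spectrogram, label id) pair:

def get_spectrogram_and_label_id(audio, label):
  spectrogram = get_spectrogram(audio)
  # Convert the string label to an integer id (its index in `categories`).
  label_id = tf.argmax(label == categories)
  return spectrogram, label_id

def preprocess_dataset(files):
  files_ds = tf.data.Dataset.from_tensor_slices(files)
  output_ds = files_ds.map(get_waveform_and_label, num_parallel_calls=AUTOTUNE)
  return output_ds.map(get_spectrogram_and_label_id, num_parallel_calls=AUTOTUNE)

train_ds = preprocess_dataset(train_files)
val_ds = preprocess_dataset(val_files)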
Training the Model
We’re now ready to train our classifier! If you’re using Google Colab, I recommend using the GPU hardware accelerator to speed up the process.
Let’s start by selecting the batch size and optimizing performance using cache() and prefetch():
batch_size = 32
train_ds = train_ds.batch(batch_size)
val_ds = val_ds.batch(batch_size)
train_ds = train_ds.cache().prefetch(AUTOTUNE)
val_ds = val_ds.cache().prefetch(AUTOTUNE)
Then we’ll build, compile, and fit the model:
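The sketch below mirrors the small CNN from the TensorFlow tutorial: a resizing layer to shrink the large spectrograms, a normalization layer, two convolutional layers with pooling and dropout, and a dense head. The layer sizes, epoch count, and early stopping settings are assumptions rather than the exact configuration that produced the results shown next.

# Infer the input shape from one batch of spectrograms.
for spectrogram, _ in train_ds.take(1):
  input_shape = spectrogram.shape[1:]
num_labels = len(categories)

# Normalization layer adapted to the training spectrograms.
norm_layer = preprocessing.Normalization()
norm_layer.adapt(train_ds.map(lambda spec, label: spec))

model = models.Sequential([
    layers.Input(shape=input_shape),
    preprocessing.Resizing(32, 32),   # downsample the large spectrograms
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_labels),
])
model.summary()

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=[tf.keras.callbacks.EarlyStopping(verbose=1, patience=2)],
)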
Results
Now that we have our trained classifier, let’s plot its loss and accuracy during training:
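A sketch of the plotting cell, reading the curves from the History object returned by model.fit:

metrics = history.history
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Loss curves.
axes[0].plot(history.epoch, metrics['loss'], label='training loss')
axes[0].plot(history.epoch, metrics['val_loss'], label='validation loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].legend()

# Accuracy curves.
axes[1].plot(history.epoch, metrics['accuracy'], label='training accuracy')
axes[1].plot(history.epoch, metrics['val_accuracy'], label='validation accuracy')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].legend()

plt.show()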
From the loss curve, we see that both training and validation loss decrease sharply before leveling off. The accuracy plot mirrors this: both training and validation accuracy climb quickly and then plateau. Remarkably, validation accuracy peaks and levels off at 100%!
Conclusion
We’ve successfully trained a classifier to differentiate between music and speech! But this is just a small step into the realm of audio-related machine learning. Here are some ideas for going further:
- Test out other machine learning models
- Experiment with data augmentation and hyperparameter tuning
- Use mel spectrograms instead of spectrograms and compare the resulting model’s performance to the one trained here
- Give transfer learning a try by following this guide:
- Look into other audio classification tasks and datasets! You can find a handful from TensorFlow at the following link:
References
In addition to the ones linked throughout this article, I wouldn’t have been able to complete this project without the help of this awesome tutorial:
[1] Medium | Music Genre Classification with Python by Parul Pandey