Classifying Music and Speech with Machine Learning
An audio classification walkthrough with code
Introduction
The difference between music and speech is crystal clear to human ears, but how do you train a machine to do the same?
My goal is to create a classifier that can differentiate between music and speech.
Like my earlier articles on Pokémon and waste classification, I’ll do this using a convolutional neural network.
I based my approach and model on this TensorFlow tutorial, which builds a speech recognition network that recognizes 10 different keywords:
Data Source
For this project, I’ll use the GTZAN music speech dataset:
It is part of the TensorFlow Datasets catalog and contains 128 tracks that are each 30 seconds long. Under “Display Examples…” at the above link, you can listen to samples from both the music and speech classes.
Setup
First things first, I pip install Pydub, a Python library for manipulating audio.
%pip install pydub
You can read more about Pydub here:
Alternatively, you can install Pydub within your command line.
Now I’ll import all the libraries we’ll need for this project:
import numpy as np
import pandas as pd
import os
import pathlib
import matplotlib.pyplot as plt
import seaborn as sns
from IPython import display
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras import layers
from tensorflow.keras import models
import tensorflow_datasets as tfds
Then, I load the dataset from TensorFlow, set up a path to the directory where the data is stored, and store the names of the categories in a list.
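Since the full code cells are embedded as gists, here is a minimal sketch of this step. It assumes the WAV files downloaded by TensorFlow Datasets end up extracted into a folder containing music_wav/ and speech_wav/ subdirectories; the exact path and the variable names (data_dir, categories) may differ in your setup.

# Load the GTZAN music/speech dataset; this also downloads and extracts the WAV files.
dataset, info = tfds.load('gtzan_music_speech', with_info=True)

# Point to the extracted music_speech folder (its location may vary on your machine).
downloads_dir = pathlib.Path(os.path.expanduser('~/tensorflow_datasets/downloads/extracted'))
data_dir = next(downloads_dir.rglob('music_speech'))

# Store the category names (the subfolder names, e.g. music_wav and speech_wav).
categories = np.array(sorted(p.name for p in data_dir.iterdir() if p.is_dir()))
print('Categories:', categories)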
Note: WAV is an audio file format.
Now, we’ll get all the filenames:
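A sketch of that cell, following the pattern from the TensorFlow tutorial:

# Collect the path to every WAV file and shuffle them so the two classes are mixed.
filenames = tf.io.gfile.glob(str(data_dir) + '/*/*.wav')
filenames = tf.random.shuffle(filenames)
num_samples = len(filenames)
print('Number of total examples:', num_samples)
print('Example file:', filenames[0].numpy())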
From the output, we see that we have 128 samples, with 64 for each class.
Finally, we’ll split the dataset into training and validation sets in a 3:1 ratio. The TensorFlow audio recognition tutorial also creates a test set, but I’ll skip that here as we’re working with a tiny dataset.
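Here’s a sketch of the split using the train_test_split helper imported earlier (the random_state value is arbitrary):

# 3:1 split: 96 files for training, 32 for validation.
train_files, val_files = train_test_split(filenames.numpy(), test_size=0.25, random_state=42)
print('Training set size:', len(train_files))
print('Validation set size:', len(val_files))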
Data Preprocessing
To start, let’s create a dataset with the waveform and label for each training file.
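The sketch below follows the TensorFlow tutorial: each file is read, decoded from WAV into a float waveform, and paired with its label, which is taken from the name of its parent folder.

AUTOTUNE = tf.data.AUTOTUNE

def decode_audio(audio_binary):
  # Decode a 16-bit PCM WAV file into a float32 waveform in [-1.0, 1.0].
  audio, _ = tf.audio.decode_wav(audio_binary)
  return tf.squeeze(audio, axis=-1)

def get_label(file_path):
  # The label is the name of the file's parent folder (e.g. music_wav or speech_wav).
  parts = tf.strings.split(file_path, os.path.sep)
  return parts[-2]

def get_waveform_and_label(file_path):
  label = get_label(file_path)
  audio_binary = tf.io.read_file(file_path)
  waveform = decode_audio(audio_binary)
  return waveform, label

files_ds = tf.data.Dataset.from_tensor_slices(train_files)
waveform_ds = files_ds.map(get_waveform_and_label, num_parallel_calls=AUTOTUNE)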
Here’s what the waveforms look like:
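Something along these lines produces the waveform grid (the 3×3 layout is just one way to show a few examples):

# Plot the waveforms of a few training examples.
rows, cols = 3, 3
fig, axes = plt.subplots(rows, cols, figsize=(10, 12))
for i, (audio, label) in enumerate(waveform_ds.take(rows * cols)):
  ax = axes[i // cols][i % cols]
  ax.plot(audio.numpy())
  ax.set_title(label.numpy().decode('utf-8'))
plt.show()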
As we’re using a convolutional neural network for this project, we need to transform the waveforms into spectrograms, which are visual representations of the spectrum of frequencies of signals over time. We’ll create a function for this conversion:
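The conversion below uses a short-time Fourier transform (STFT), as in the TensorFlow tutorial. Since the GTZAN clips are 30 seconds long at 22,050 Hz, I crop or zero-pad every waveform to a fixed length so all spectrograms share the same shape; the exact length and STFT parameters here are illustrative.

def get_spectrogram(waveform):
  # Crop or zero-pad to a fixed length (30 s at 22,050 Hz ≈ 660,000 samples).
  input_len = 660000
  waveform = tf.cast(waveform[:input_len], tf.float32)
  zero_padding = tf.zeros([input_len] - tf.shape(waveform), dtype=tf.float32)
  equal_length = tf.concat([waveform, zero_padding], 0)
  # Short-time Fourier transform: frequency content over time.
  spectrogram = tf.signal.stft(equal_length, frame_length=255, frame_step=128)
  # Keep only the magnitude and add a channel dimension for the CNN.
  spectrogram = tf.abs(spectrogram)
  return spectrogram[..., tf.newaxis]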
As an example, this is the conversion for one music sample:
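A sketch of that cell (it assumes the first shuffled example happens to be a music clip):

# Convert one training example to a spectrogram.
for waveform, label in waveform_ds.take(1):
  label = label.numpy().decode('utf-8')
  spectrogram = get_spectrogram(waveform)

print('Label:', label)
print('Waveform shape:', waveform.shape)
print('Spectrogram shape:', spectrogram.shape)
# Uncomment the two lines below to play the sample inside the notebook.
# print('Audio playback')
# display.display(display.Audio(waveform, rate=22050))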
Note: unfortunately, IPython’s display does not render properly within GitHub gists. If you uncomment the last two lines in this code cell, you will have the option of playing the audio sample within the cell’s output under Audio playback.
For comparison, we’ll plot both the waveform and the spectrogram for this sample. In a spectrogram, color reflects the amplitude of each frequency at each point in time.
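A sketch of the plotting cell, with a small helper that draws the log-scaled spectrogram magnitudes:

def plot_spectrogram(spectrogram, ax):
  # Log-scale the magnitudes so quieter frequencies stay visible,
  # and transpose so time runs along the x-axis.
  log_spec = np.log(spectrogram.T + np.finfo(float).eps)
  height, width = log_spec.shape
  X = np.linspace(0, np.size(spectrogram), num=width, dtype=int)
  Y = range(height)
  ax.pcolormesh(X, Y, log_spec, shading='auto')

fig, axes = plt.subplots(2, figsize=(12, 8))
axes[0].plot(waveform.numpy())
axes[0].set_title('Waveform')
plot_spectrogram(np.squeeze(spectrogram.numpy()), axes[1])
axes[1].set_title('Spectrogram')
plt.show()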
Now, we’ll do the same preprocessing for the rest of the training set and the validation set.
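A sketch of the preprocessing function, which maps every file to a (spectrogram, label id) pair:

def get_spectrogram_and_label_id(audio, label):
  spectrogram = get_spectrogram(audio)
  # Convert the string label to an integer id (its index in `categories`).
  label_id = tf.argmax(label == categories)
  return spectrogram, label_id

def preprocess_dataset(files):
  files_ds = tf.data.Dataset.from_tensor_slices(files)
  output_ds = files_ds.map(get_waveform_and_label, num_parallel_calls=AUTOTUNE)
  return output_ds.map(get_spectrogram_and_label_id, num_parallel_calls=AUTOTUNE)

train_ds = preprocess_dataset(train_files)
val_ds = preprocess_dataset(val_files)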
Training the Model
We’re now ready to train our classifier! If you’re using Google Colab, I recommend using the GPU hardware accelerator to speed up the process.
Let’s start by selecting the batch size and optimizing performance using cache() and prefetch():
batch_size = 32
train_ds = train_ds.batch(batch_size)
val_ds = val_ds.batch(batch_size)
train_ds = train_ds.cache().prefetch(AUTOTUNE)
val_ds = val_ds.cache().prefetch(AUTOTUNE)
Then we’ll build, compile, and fit the model:
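The sketch below mirrors the small CNN from the TensorFlow tutorial: a resizing layer to shrink the large spectrograms, a normalization layer, two convolutional layers with pooling and dropout, and a dense head. The layer sizes, epoch count, and early stopping settings are assumptions rather than the exact configuration that produced the results shown next.

# Infer the input shape from one batch of spectrograms.
for spectrogram, _ in train_ds.take(1):
  input_shape = spectrogram.shape[1:]
num_labels = len(categories)

# Normalization layer adapted to the training spectrograms.
norm_layer = preprocessing.Normalization()
norm_layer.adapt(train_ds.map(lambda spec, label: spec))

model = models.Sequential([
    layers.Input(shape=input_shape),
    preprocessing.Resizing(32, 32),   # downsample the large spectrograms
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_labels),
])
model.summary()

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=[tf.keras.callbacks.EarlyStopping(verbose=1, patience=2)],
)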
Results
Now that we have our trained classifier, let’s plot its loss and accuracy during training:
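A sketch of the plotting cell, reading the curves from the History object returned by model.fit:

metrics = history.history
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Loss curves.
axes[0].plot(history.epoch, metrics['loss'], label='training loss')
axes[0].plot(history.epoch, metrics['val_loss'], label='validation loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].legend()

# Accuracy curves.
axes[1].plot(history.epoch, metrics['accuracy'], label='training accuracy')
axes[1].plot(history.epoch, metrics['val_accuracy'], label='validation accuracy')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].legend()

plt.show()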
From the loss curve, we see that both training and validation loss decrease sharply before leveling off. The accuracy plot mirrors this: both training and validation accuracy climb quickly and then plateau. Remarkably, validation accuracy peaks and levels off at 100%!
Conclusion
We’ve successfully trained a classifier to differentiate between music and speech! But this is just a small step into the realm of audio-related machine learning. Here are some ideas for going further:
- Test out other machine learning models
- Experiment with data augmentation and hyperparameter tuning
- Use mel spectrograms instead of spectrograms and compare the resulting model’s performance to the one trained here
- Give transfer learning a try by following this guide:
- Look into other audio classification tasks and datasets! You can find a handful from TensorFlow at the following link:
References
In addition to the ones linked throughout this article, I wouldn’t have been able to complete this project without the help of this awesome tutorial:
[1] Medium | Music Genre Classification with Python by Parul Pandey