Urban Sound Classification using Convolutional Neural Networks with Keras: Theory and Implementation

Adrian Yijie Xu, PhD
Published in GradientCrescent · 7 min read · Feb 27, 2019

Introduction

Over the past five years, developments in artificial intelligence have moved into the medium of sound, whether in generating new forms of music (with varying degrees of success) or in identifying specific instruments from a video. Some of these projects, such as IBM’s Watson Beat, have already been released for commercial use; indeed, its creator claims that the network behind Watson Beat has learned the emotional responses associated with specific musical elements, something that is strongly subjective and was previously the exclusive domain of human composers.

Although Long Short-Term Memory networks (LSTMs) are usually associated with audio-based deep learning projects, elements of sound identification can also be tackled as a traditional multi-class image classification task using convolutional neural networks. In this tutorial, we will demonstrate how a simple neural network built in Keras, together with a few helpful audio analysis libraries, can distinguish between 10 different sounds with high accuracy, using the UrbanSound dataset available on Kaggle.

Not the sound wave we’re looking for, sadly!

It may seem counterintuitive to use convolutional neural networks for sound classification, but the theory is actually quite simple: any audio clip can be represented as a spectrogram image, depicting changes in frequency (Hz) and intensity/loudness (dB) over time. If the sounds in question are individually distinct, their spectrograms will look different enough for a CNN to tell them apart. Naturally, such an approach wouldn’t necessarily be as capable at distinguishing individual words in a sentence (which is what LSTMs are known for), but it is well suited to separating, say, a barking dog from regular conversation.
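To make this concrete, here is a minimal sketch (the filename is hypothetical) showing that a short clip loaded with LibROSA becomes a 2-D time-frequency array, which we can then treat exactly like an image:

import librosa
import numpy as np

# Hypothetical clip: load the waveform, then compute a mel spectrogram
clip, sample_rate = librosa.load("dog_bark.wav", sr=None)
S = librosa.feature.melspectrogram(y=clip, sr=sample_rate)   # shape: (n_mels, n_frames)
S_db = librosa.power_to_db(S, ref=np.max)                    # convert power to decibels
print(S_db.shape)   # e.g. (128, 173): 128 mel bands over the clip's time frames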

We’ll assume that the reader is familiar with the elements of deep learning and the principles behind convolutional neural networks.

Implementation

The UrbanSound dataset consists of 8,732 labelled short (less than 4 s) excerpts of urban sounds from 10 different classes. Both our training and test datasets consist of .wav files and an accompanying .csv spreadsheet listing each file’s ID and, in the case of the training data, its correct class, which serves as the label. To convert our data into spectrogram representations, we will use LibROSA, an open-source Python package for music and audio analysis.
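Before converting anything, it can be helpful to peek at the label spreadsheet with Pandas; this optional check uses the same ../input/train.csv path and ID/Class columns that appear later in the tutorial:

import pandas as pd

traindf = pd.read_csv('../input/train.csv', dtype=str)
print(traindf.head())                    # each row: the .wav file ID and its Class label
print(traindf["Class"].value_counts())   # rough balance across the 10 classes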

A LibROSA spectrogram of an input 1-minute sound sample

Our implementation was performed on Kaggle, but any GPU-enabled Python instance should be capable of achieving the same results. Our code is adapted from an earlier attempt that used transfer learning with fast.ai. In this tutorial, we’ll build a network from scratch in Keras, to better understand how network architecture affects the classification of highly similar images.

To begin, we import the Pandas and NumPy packages, along with a memory profiler to monitor memory use, given the large amount of data conversion we need to perform on Kaggle’s servers.

%matplotlib inline
from memory_profiler import memory_usage
import os
import pandas as pd
from glob import glob
import numpy as np

We then install libav-tools, an open-source audio and video processing framework, so that LibROSA can decode our .wav files. If you’re working locally, do the same via your console.

%%capture
!apt-get install libav-tools -y

Next, let’s import the Keras libraries needed to build our network, along with other necessary auxiliary packages. Of particular note is Python’s garbage collection module (gc), which lets us free up RAM during the data conversion process. Finally, we also create working directories in our Kaggle instance to store the converted images.

from keras import layers
from keras import models
from keras.layers.advanced_activations import LeakyReLU
from keras.optimizers import Adam
import keras.backend as K
import librosa
import librosa.display
import pylab
import matplotlib.pyplot as plt
from matplotlib import figure
import gc
from path import Path
!mkdir /kaggle/working/train
!mkdir /kaggle/working/test

Next, we begin the data conversion process by defining the functions that will convert our .wav files into .jpg images. Briefly, we use LibROSA to extract the audio time series and sampling rate of each .wav file, then build and plot a spectrogram of the data and save it as a corresponding image.

def create_spectrogram(filename, name):
    plt.interactive(False)
    clip, sample_rate = librosa.load(filename, sr=None)
    fig = plt.figure(figsize=[0.72, 0.72])
    ax = fig.add_subplot(111)
    ax.axes.get_xaxis().set_visible(False)
    ax.axes.get_yaxis().set_visible(False)
    ax.set_frame_on(False)
    S = librosa.feature.melspectrogram(y=clip, sr=sample_rate)
    librosa.display.specshow(librosa.power_to_db(S, ref=np.max))
    filename = '/kaggle/working/train/' + name + '.jpg'
    plt.savefig(filename, dpi=400, bbox_inches='tight', pad_inches=0)
    plt.close()
    fig.clf()
    plt.close(fig)
    plt.close('all')
    del filename, name, clip, sample_rate, fig, ax, S

Similarly, for our test data directory:

def create_spectrogram_test(filename, name):
    plt.interactive(False)
    clip, sample_rate = librosa.load(filename, sr=None)
    fig = plt.figure(figsize=[0.72, 0.72])
    ax = fig.add_subplot(111)
    ax.axes.get_xaxis().set_visible(False)
    ax.axes.get_yaxis().set_visible(False)
    ax.set_frame_on(False)
    S = librosa.feature.melspectrogram(y=clip, sr=sample_rate)
    librosa.display.specshow(librosa.power_to_db(S, ref=np.max))
    filename = Path('/kaggle/working/test/' + name + '.jpg')
    fig.savefig(filename, dpi=400, bbox_inches='tight', pad_inches=0)
    plt.close()
    fig.clf()
    plt.close(fig)
    plt.close('all')
    del filename, name, clip, sample_rate, fig, ax, S
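Before looping over the full dataset, it can be worth converting a single clip to confirm that a .jpg appears in /kaggle/working/train/; the file ID below is purely illustrative:

# Hypothetical single-file check (substitute an ID that exists in your copy of the data)
create_spectrogram('../input/train/Train/1001.wav', '1001')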

With that defined, let’s start converting our training data. We will do this in batches of 2,000 images at a time, calling the garbage collector between batches to keep memory use in check.

Data_dir = np.array(glob("../input/train/Train/*"))

%load_ext memory_profiler

%%memit
i = 0
for file in Data_dir[i:i+2000]:
    # "filename" is the full path; "name" is the numeric ID, which becomes the .jpg filename
    filename, name = file, file.split('/')[-1].split('.')[0]
    create_spectrogram(filename, name)
gc.collect()

%%memit
i = 2000
for file in Data_dir[i:i+2000]:
    filename, name = file, file.split('/')[-1].split('.')[0]
    create_spectrogram(filename, name)
gc.collect()

%%memit
i = 4000
for file in Data_dir[i:]:
    filename, name = file, file.split('/')[-1].split('.')[0]
    create_spectrogram(filename, name)
gc.collect()

Doing the same for our test dataset:

Test_dir = np.array(glob("../input/test/Test/*"))

%%memit
i = 0
for file in Test_dir[i:i+1500]:
    filename, name = file, file.split('/')[-1].split('.')[0]
    create_spectrogram_test(filename, name)
gc.collect()

%%memit
i = 1500
for file in Test_dir[i:]:
    filename, name = file, file.split('/')[-1].split('.')[0]
    create_spectrogram_test(filename, name)
gc.collect()

Unlike in our previous tutorial, the data labels in this project are not encoded in the image filenames but are stored in the accompanying .csv spreadsheet. You could go through the spreadsheet and rename every file accordingly, but Keras provides a data generator (flow_from_dataframe) that conveniently reads the correct labels from the spreadsheet while preparing data in specified batches for training and validation.

Let’s use it now. Note that we append the .jpg extension to the ID column of our training and test spreadsheets to ensure correct file association.

from keras_preprocessing.image import ImageDataGenerator

def append_ext(fn):
    return fn + ".jpg"

traindf=pd.read_csv('../input/train.csv',dtype=str)
testdf=pd.read_csv('../input/test.csv',dtype=str)
traindf["ID"]=traindf["ID"].apply(append_ext)
testdf["ID"]=testdf["ID"].apply(append_ext)

datagen=ImageDataGenerator(rescale=1./255.,validation_split=0.25)


train_generator = datagen.flow_from_dataframe(
    dataframe=traindf,
    directory="/kaggle/working/train/",
    x_col="ID",
    y_col="Class",
    subset="training",
    batch_size=32,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    target_size=(64,64))

valid_generator = datagen.flow_from_dataframe(
    dataframe=traindf,
    directory="/kaggle/working/train/",
    x_col="ID",
    y_col="Class",
    subset="validation",
    batch_size=32,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    target_size=(64,64))
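Before training, you can optionally pull one batch from the generator to confirm the image shape and the label mapping; a quick sanity check:

x_batch, y_batch = next(train_generator)
print(x_batch.shape, y_batch.shape)    # expected: (32, 64, 64, 3) and (32, 10)
print(train_generator.class_indices)   # mapping from class name to index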

Now for the fun part: let’s build our Sequential model using the RMSprop optimizer (try Adam or other optimizers and see if you can squeeze out some more accuracy!). Our network architecture consists of six convolutional layers with an increasing number of filters, in order to best extract the features of each image with each successive layer. The pooling and dropout layers serve to increase computational efficiency and to prevent overfitting, respectively.

In our experiments, we observed that adding layers with a high filter count near the end of the network boosted accuracy by up to 3%. Intuitively, this can be understood as follows: the filters in the earlier layers target features that are essentially shared by all spectrograms (lines and curves), while by the later layers the convolved feature maps are similar enough that a larger number of complex filters is needed to distinguish between them. Feel free to modify the network architecture to observe this first-hand.

from keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization
from keras.models import Sequential, Model
from keras.layers import Conv2D, MaxPooling2D
from keras import regularizers, optimizers
import pandas as pd
import numpy as np

model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same',
                 input_shape=(64,64,3)))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(Conv2D(128, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(128, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
model.compile(optimizers.rmsprop(lr=0.0005, decay=1e-6),loss="categorical_crossentropy",metrics=["accuracy"])
model.summary()

Now that everything’s ready, let’s fit and evaluate our model!

#Fitting the Keras model; no test generator for now
STEP_SIZE_TRAIN = train_generator.n//train_generator.batch_size
STEP_SIZE_VALID = valid_generator.n//valid_generator.batch_size
#STEP_SIZE_TEST = test_generator.n//test_generator.batch_size
model.fit_generator(generator=train_generator,
                    steps_per_epoch=STEP_SIZE_TRAIN,
                    validation_data=valid_generator,
                    validation_steps=STEP_SIZE_VALID,
                    epochs=150)
model.evaluate_generator(generator=valid_generator, steps=STEP_SIZE_VALID)
Loss and accuracy values from our model, trained over 150 epochs with a learning rate of 0.0005
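The curves above come from the training history; if you capture the object returned by fit_generator (which the snippet above does not do), a rough sketch for reproducing such a plot is shown below. Note that the history keys may be 'accuracy'/'val_accuracy' rather than 'acc'/'val_acc' in newer Keras versions.

# Same call as above, but keeping the returned History object
history = model.fit_generator(generator=train_generator,
                              steps_per_epoch=STEP_SIZE_TRAIN,
                              validation_data=valid_generator,
                              validation_steps=STEP_SIZE_VALID,
                              epochs=150)
plt.plot(history.history['acc'], label='training accuracy')
plt.plot(history.history['val_acc'], label='validation accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()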

It seems that our model fits the data quite well, with an accuracy approaching 95%. This project demonstrates the inherent power of good, clean, labelled data: running the network on only a fraction of the training data, say the first 2,000 images, yields an accuracy of only about 83%. This is linked to the similarity of our spectrograms; don’t worry, you won’t need 6,000 images to tell cats and dogs apart!

Let’s observe its performance by making predictions on the test set. To do so, we need to build a test generator to feed the test data to our network.

test_datagen = ImageDataGenerator(rescale=1./255.)
test_generator = test_datagen.flow_from_dataframe(
    dataframe=testdf,
    directory="/kaggle/working/test/",
    x_col="ID",
    y_col=None,
    batch_size=32,
    seed=42,
    shuffle=False,
    class_mode=None,
    target_size=(64,64))
STEP_SIZE_TEST = test_generator.n//test_generator.batch_size

Now let’s predict on the first seven sound clips in our test data.

test_generator.reset()
pred = model.predict_generator(test_generator,
                               steps=STEP_SIZE_TEST,
                               verbose=1)
predicted_class_indices = np.argmax(pred, axis=1)

#Fetch the label mapping from the training generator
labels = (train_generator.class_indices)
labels = dict((v,k) for k,v in labels.items())
predictions = [labels[k] for k in predicted_class_indices]
print(predictions[0:7])

Your output should look something like this:

['jackhammer', 'children_playing', 'drilling', 'dog_bark', 'street_music', 'jackhammer', 'air_conditioner']

How does this stack up? Very nicely. I’ve included the original sound clips in the playlist below; in short, our model definitely works!
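If you would like to save the full set of predictions (for example, as a Kaggle submission), one possible approach is to pair the test generator’s filenames with the decoded labels; the output filename below is arbitrary, and we use a ceiling step count so the last partial batch is included:

# Predict over the entire test set and write the decoded labels to a .csv
test_generator.reset()
steps = int(np.ceil(test_generator.n / test_generator.batch_size))
pred_all = model.predict_generator(test_generator, steps=steps, verbose=1)
all_predictions = [labels[k] for k in np.argmax(pred_all, axis=1)]
results = pd.DataFrame({"ID": test_generator.filenames, "Class": all_predictions})
results.to_csv("submission.csv", index=False)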

Let’s summarize what we’ve learned:

  • Convolutional neural networks can classify sound clips to a high degree of accuracy through the use of image representations.
  • Complex networks with more filters in later layers outperform simpler ones when working with similar images.
  • The amount of data is key to improving classification accuracy, particularly with similar images.

Thanks for reading. For the original code in .py format, please see my GitHub.

References

UrbanSound dataset

Hershey et al., CNN Architectures for Large-Scale Audio Classification
