Sound Classifier Using Convolutional Neural Network

Minhmovies
5 min read · May 2, 2020

During this time of self-quarantine, I have been noticing all kinds of different sounds near my apartment, which is one of the reasons I started thinking about a fun deep learning project that could perhaps identify different sounds in our daily life.

I started digging into how a convolutional neural network model could read an audio file as input. One of the ways I found was to convert the audio file into a spectrogram. Our model can then learn the patterns of the spectrograms we train it on.

To convert audio files to images, I used Librosa, a Python module for analyzing audio signals in general. First we will write a general function that converts a sound file into a spectrogram (note that we are going to convert thousands of audio files, so let's save as much memory as possible by deleting objects after we are done with them).

import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display

def spec_gen(filename, dest):
    # Load the audio at its native sampling rate
    y, sr = librosa.load(filename, sr=None)
    # Compute a mel spectrogram and rescale it to decibels
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    rescale = librosa.power_to_db(mel**2, ref=np.max)
    # Draw the spectrogram on a small, frameless figure with no visible axes
    ax = plt.figure(figsize=(1, 1), frameon=False).add_subplot(111)
    ax.set_frame_on(False)
    librosa.display.specshow(rescale, sr=sr, x_axis='time', y_axis='log')
    ax.axes.get_xaxis().set_visible(False)
    ax.axes.get_yaxis().set_visible(False)
    # Save the image and free everything we no longer need
    plt.savefig(dest, dpi=400, bbox_inches='tight', pad_inches=0)
    plt.close('all')
    del ax, y, sr, mel, rescale
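
Calling the function on a single clip is then just one line (both paths here are placeholders):

# Example call; both paths are placeholders for one audio clip and its output image
spec_gen('Train_song/audio/1234.wav', 'Train_song/Specto/1234.jpg')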

Now that we have a conversion function, we need to think about what our training data is going to be. This is where I ran into one of the biggest mistakes in this project: insufficient data.

My first idea was to train the model on real city sounds from the https://www.soundcities.com/index.php# database.

Getting all the sound files from the page was a struggle, and I put the code on GitHub for anyone who is curious, but it took me a really long time, not only to scrape the API but also to download the files and clean out the broken ones.

The problem was that the database is categorized in two ways: either by city or by mood. It contains sounds from 140 cities, or 23 moods, but there are only around 3,000 sound files, which works out to roughly 20 clips per city. So when I tried to train the model on this data, I could not get past the overfitting problem of high training accuracy and low validation accuracy.

However, I didn't want to throw away all the data I had worked so hard to get, so I decided to train the model on a different data set and then test it on the city sounds I had collected, to see what the model would predict.

Urban Sound is a popular data set that I came across on Kaggle. It contains around 8,000 audio files: roughly 5,000 of them are training files, while the other 3,000 are for testing. The sounds are categorized into 10 different classes.

The next step is pretty straightforward: we just need to convert all the sound files into spectrograms using the function we already have.
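
As a rough sketch (the audio folder name and the ".wav" extension are my assumptions here, since the exact layout depends on how the data set is unpacked), the conversion loop looks something like this:

import os
import pandas as pd

train = pd.read_csv('Train_song/train.csv', dtype=str)
for file_id in train['ID']:
    src = os.path.join('Train_song/audio', file_id + '.wav')    # assumed audio folder and extension
    dest = os.path.join('Train_song/Specto', file_id + '.jpg')  # images the generators read later
    spec_gen(src, dest)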

Then we will start building our model. I believe there are many ways to improve it, but this is one possibility that I tried.

from keras.layers import Dense, Activation, Dropout
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.models import Sequential, Model
from keras import regularizers, optimizers
import pandas as pd
import numpy as np

# Four convolution/pooling blocks, then global average pooling
# and a 10-way softmax, one output per sound class
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(256, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(GlobalAveragePooling2D())
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

We then just need to set up our training and validation inputs. We add the extension ".jpg" to the 'ID' column of our CSV files, which hold the correct labels, and feed the dataframes to an ImageDataGenerator.

from keras.preprocessing.image import ImageDataGenerator

train = pd.read_csv('Train_song/train.csv', dtype=str)
test = pd.read_csv('Test_song/test.csv', dtype=str)
train['ID'] = train['ID'] + '.jpg'
test['ID'] = test['ID'] + '.jpg'

# Rescale pixel values and hold out 25% of the training images for validation
img_datagen = ImageDataGenerator(rescale=1./255., validation_split=0.25)
train_generator = img_datagen.flow_from_dataframe(
    dataframe=train,
    directory="Train_song/Specto",
    x_col="ID",
    y_col="Class",
    subset="training",
    batch_size=32,
    seed=48,
    shuffle=True,
    class_mode="categorical",
    target_size=(64, 64))
valid_generator = img_datagen.flow_from_dataframe(
    dataframe=train,
    directory="Train_song/Specto",
    x_col="ID",
    y_col="Class",
    subset="validation",
    batch_size=32,
    seed=48,
    shuffle=True,
    class_mode="categorical",
    target_size=(64, 64))

Then we just train the model.

train_step = train_generator.n // train_generator.batch_size
valid_step = valid_generator.n // valid_generator.batch_size
model.fit_generator(generator=train_generator,
                    steps_per_epoch=train_step,
                    validation_data=valid_generator,
                    validation_steps=valid_step,
                    epochs=200)
model.evaluate_generator(generator=valid_generator, steps=valid_step)

And here is the result, reported as [validation loss, validation accuracy]. It is not the best accuracy we could hope for, but we can improve it by adding more layers and tuning the model further.

[0.40851712226867676, 0.8958333134651184]

After training finishes we can try the model out; let's first test it on the test files.

# Generator for the test images: no labels, no shuffling, same rescaling
test_datagen = ImageDataGenerator(rescale=1./255.)
test_generator = test_datagen.flow_from_dataframe(
    dataframe=test,
    directory="Test_song/Specto",
    x_col="ID",
    y_col=None,
    batch_size=32,
    seed=42,
    shuffle=False,
    class_mode=None,
    target_size=(64, 64))

test_step = test_generator.n // test_generator.batch_size
test_generator.reset()
pred = model.predict_generator(test_generator,
                               steps=test_step,
                               verbose=1)

# Map the predicted class indices back to their label names
predicted_class_indices = np.argmax(pred, axis=1)
labels = train_generator.class_indices
labels = dict((v, k) for k, v in labels.items())
predictions = [labels[k] for k in predicted_class_indices]
print(predictions[0:20])

The results are: ['drilling', 'dog_bark', 'drilling', 'dog_bark', 'street_music', 'jackhammer', 'jackhammer', 'children_playing', 'dog_bark', 'siren', 'dog_bark', 'siren', 'dog_bark', 'children_playing', 'street_music', 'siren', 'dog_bark', 'drilling', 'drilling', 'jackhammer'], which looks pretty accurate.

Now, for fun, let's try testing the model on the soundcities files, and here are the results I got.

Here are the first 10 sounds it was supposed to predict:

The results are obviously flawed because, unlike the Urban Sound clips, which are high quality and short, the soundcities recordings are much longer and more complex; each file can contain multiple sound effects.
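
One idea for a fairer test would be to cut each long soundcities recording into short fixed-length segments, convert each segment with spec_gen, and classify the segments separately. A rough sketch, assuming a four-second window (an arbitrary choice) and the soundfile package that Librosa itself relies on:

import librosa
import soundfile as sf

y, sr = librosa.load('soundcities_clip.wav', sr=None)  # placeholder file name
segment_len = 4 * sr  # four-second windows; the length is an assumption

for n, start in enumerate(range(0, len(y) - segment_len + 1, segment_len)):
    # Write each segment to disk, then reuse spec_gen to turn it into an image
    sf.write('segment_%d.wav' % n, y[start:start + segment_len], sr)
    spec_gen('segment_%d.wav' % n, 'segment_%d.jpg' % n)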

We can have even more fun by testing the model on our own real-life sounds. We just need to record something, run the audio file through our conversion function, and test the model on the resulting spectrogram. Based on the results, we can add more data and tune the model so it becomes more accurate.
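
A minimal sketch of that, assuming we saved a recording as my_recording.wav (the file names are placeholders, and the 64×64 size and 1/255 rescaling simply mirror the generators above):

from keras.preprocessing import image

# Convert our own recording into a spectrogram image
spec_gen('my_recording.wav', 'my_recording.jpg')

# Load the image the same way the generators do: 64x64, pixel values in [0, 1]
img = image.load_img('my_recording.jpg', target_size=(64, 64))
x = np.expand_dims(image.img_to_array(img) / 255., axis=0)

# Predict and map the winning class index back to its label name
pred = model.predict(x)
print(labels[np.argmax(pred)])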

My problems during the project

The biggest regret I had was definitely training the model on the soundcities data first. I tried so hard to get all the audio files from the database and was too excited to notice that there were too many label classes and not enough data for the model to train on. Hopefully in the future I can find a way around this problem, maybe by gathering more data from other sources.

GitHub Link

https://github.com/nguymi01/SoundClassifer

References

https://medium.com/gradientcrescent/urban-sound-classification-using-convolutional-neural-networks-with-keras-theory-and-486e92785df4

https://www.kaggle.com/msripooja/steps-to-convert-audio-clip-to-spectrogram
