Audio Classification using Deep Learning and TensorFlow: A Step-by-Step Guide

David Oluyale
5 min read · Sep 25, 2023



Audio classification is a fascinating field with numerous real-world applications, from speech recognition to sound event detection. In this article, we will walk through the process of building an audio classification model using deep learning and TensorFlow. We’ll cover everything from preparing the dataset to training the model and making predictions.

Dataset Preparation

To get started, we need a dataset of audio samples. In our case, we’ll work with a simple dataset containing two classes: “cat” and “dog.” The dataset is organized in a folder structure as follows:

training_data/
├── cat/
│   ├── cat_1.wav
│   ├── cat_2.wav
│   └── ...
└── dog/
    ├── dog_1.wav
    ├── dog_2.wav
    └── ...
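
Before loading any audio, it can help to confirm that the folder layout matches what the loader expects. The following is a minimal sketch using only the standard library. The helper names list_audio_files and count_per_class are hypothetical (they are not part of the original notebook), and the sketch assumes one subfolder per class containing .wav files.

```python
import os

def list_audio_files(data_dir, classes):
    """Return (file_path, label_index) pairs for every .wav file under data_dir/<class>/."""
    samples = []
    for label, class_name in enumerate(classes):
        class_dir = os.path.join(data_dir, class_name)
        # sorted() makes the order reproducible across operating systems
        for filename in sorted(os.listdir(class_dir)):
            if filename.lower().endswith('.wav'):
                samples.append((os.path.join(class_dir, filename), label))
    return samples

def count_per_class(samples, classes):
    """Count how many files were found for each class name."""
    counts = {name: 0 for name in classes}
    for _, label in samples:
        counts[classes[label]] += 1
    return counts
```

Calling count_per_class(list_audio_files('training_data', ['cat', 'dog']), ['cat', 'dog']) would reveal a class imbalance before any training time is spent.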

Our goal is to classify these audio files into the correct classes. But before we can do that, we need to preprocess the data.

Data Preprocessing

  1. Loading Audio Data: We use the librosa library to load the audio files. By default, librosa.load resamples audio to 22,050 Hz, so we pass sr=None to preserve each file's original sampling rate.
  2. Converting to Mel Spectrogram: Instead of feeding raw waveforms to the network, we convert each clip into a Mel spectrogram, a time-frequency representation whose frequency axis follows the perceptually motivated Mel scale. The result is a compact, image-like 2D array that a convolutional network can process. We compute these spectrograms with librosa.
  3. Resizing Spectrograms: Clips of different lengths produce spectrograms with different numbers of time frames, so we resize each one to a fixed target shape (e.g., 128x128). The neural network requires inputs of a consistent shape.
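
To see why resizing is needed, it helps to estimate the spectrogram shape for clips of different lengths. The sketch below assumes librosa's documented defaults for melspectrogram (128 Mel bands, a hop length of 512 samples, centered frames). The function name mel_spectrogram_shape and the two clip lengths are illustrative only.

```python
def mel_spectrogram_shape(num_samples, n_mels=128, hop_length=512):
    """Expected (n_mels, n_frames) of a Mel spectrogram computed with centered frames."""
    # With centered framing, frames are centered at multiples of hop_length,
    # starting at sample 0 and continuing to the end of the signal.
    n_frames = 1 + num_samples // hop_length
    return (n_mels, n_frames)

# Two hypothetical clips: 1 second at 22,050 Hz and 3 seconds at 44,100 Hz
print(mel_spectrogram_shape(1 * 22050))  # (128, 44)
print(mel_spectrogram_shape(3 * 44100))  # (128, 259)
```

The number of Mel bands stays fixed while the time axis grows with clip length, which is why every spectrogram is resized to the same target shape before training.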

Building the Model

Now that we have preprocessed our data, we can build our deep learning model. In our case, we create a convolutional neural network (CNN), which is a common choice for image and spectrogram-based tasks.

Start by importing the required libraries (librosa, NumPy, TensorFlow, and scikit-learn must already be installed):

import os
import librosa
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from tensorflow.image import resize
from tensorflow.keras.models import load_model

Model Architecture

Our model architecture consists of the following layers:

  • Input layer with the shape of our resized spectrograms.
  • Two convolutional layers with ReLU activation functions followed by max-pooling layers.
  • A flatten layer to prepare the data for the fully connected layers.
  • A dense hidden layer with ReLU activation.
  • An output layer with as many neurons as there are classes (in our case, two: “cat” and “dog”) and a softmax activation function to produce class probabilities.
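
The layer list above can be checked with a little arithmetic. The following sketch assumes 3x3 convolutions without padding, 2x2 max pooling, 32 and 64 filters, and a 64-unit hidden layer, matching the Keras defaults and the values used in this article's model code. The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Output size of a 'valid' (unpadded) convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output size of max pooling with non-overlapping windows (remainder dropped)."""
    return size // pool

def summarize(input_size=128, channels=1, num_classes=2):
    """Return (flattened_size, total_parameters) for the architecture described above."""
    size, params = input_size, 0
    for filters in (32, 64):
        params += 3 * 3 * channels * filters + filters  # kernel weights + biases
        size = pool_out(conv_out(size))
        channels = filters
    flat = size * size * channels
    params += flat * 64 + 64                  # dense hidden layer
    params += 64 * num_classes + num_classes  # softmax output layer
    return flat, params

flat, params = summarize()
print(flat, params)  # 57600 3705410
```

Almost all of the parameters sit in the connection between the flatten layer and the dense hidden layer, so the target spectrogram size has a large effect on model size.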

# Define your folder structure
data_dir = 'training_data'
classes = ['cat', 'dog']

# Load and preprocess audio data
def load_and_preprocess_data(data_dir, classes, target_shape=(128, 128)):
    data = []
    labels = []

    for i, class_name in enumerate(classes):
        class_dir = os.path.join(data_dir, class_name)
        for filename in os.listdir(class_dir):
            if filename.endswith('.wav'):
                file_path = os.path.join(class_dir, filename)
                audio_data, sample_rate = librosa.load(file_path, sr=None)
                # Perform preprocessing (e.g., convert to Mel spectrogram and resize)
                mel_spectrogram = librosa.feature.melspectrogram(y=audio_data, sr=sample_rate)
                mel_spectrogram = resize(np.expand_dims(mel_spectrogram, axis=-1), target_shape)
                data.append(mel_spectrogram)
                labels.append(i)

    return np.array(data), np.array(labels)

# Split data into training and testing sets
data, labels = load_and_preprocess_data(data_dir, classes)
labels = to_categorical(labels, num_classes=len(classes))  # Convert labels to one-hot encoding
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)

# Create a neural network model
input_shape = X_train[0].shape
input_layer = Input(shape=input_shape)
x = Conv2D(32, (3, 3), activation='relu')(input_layer)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(64, (3, 3), activation='relu')(x)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)
x = Dense(64, activation='relu')(x)
output_layer = Dense(len(classes), activation='softmax')(x)
model = Model(input_layer, output_layer)
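
The one-hot encoding applied to the labels before the split can be illustrated in plain Python. This is only a sketch of the idea, not how to_categorical is implemented, and the function name one_hot is hypothetical.

```python
def one_hot(labels, num_classes):
    """Convert integer labels to one-hot rows, e.g. 1 -> [0.0, 1.0] for two classes."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the true class
        rows.append(row)
    return rows

# Labels for one cat clip (index 0) followed by two dog clips (index 1)
print(one_hot([0, 1, 1], num_classes=2))  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

Each row has the same length as the softmax output layer, which is what lets the loss compare the prediction to the target position by position.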

Compiling the Model

We compile the model using the Adam optimizer and categorical cross-entropy loss function. The accuracy metric is used to monitor the model’s performance during training.

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
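
To make the loss concrete, here is a minimal pure-Python sketch of categorical cross-entropy for a single sample. The eps clipping value is an assumption chosen for illustration, and the probabilities are made up.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted class probabilities."""
    # Clip probabilities away from 0 so log() is always defined
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# True class is "dog" (index 1)
print(round(categorical_crossentropy([0, 1], [0.2, 0.8]), 4))  # 0.2231
print(round(categorical_crossentropy([0, 1], [0.9, 0.1]), 4))  # 2.3026
```

A confident correct prediction gives a loss near zero, while a confident wrong prediction is penalized heavily. This is the quantity the optimizer tries to reduce during training.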

Training the Model

With our model architecture in place, it’s time to train the model on our preprocessed dataset. We pass the held-out test split as validation_data so that Keras reports loss and accuracy on it after every epoch.

# Train the model
model.fit(X_train, y_train, epochs=200, batch_size=32, validation_data=(X_test, y_test))

Key Training Parameters

  • We train the model for 200 epochs with a batch size of 32. The number of epochs represents how many times the model will see the entire training dataset.
  • We monitor and report training metrics such as loss and accuracy.
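
The relationship between dataset size, batch size, and epochs can be worked out directly. The sketch below uses a hypothetical dataset of 250 files, since the article does not state the dataset size, together with the 80/20 split, batch size, and epoch count used in the training code.

```python
import math

def training_steps(num_files, test_size=0.2, batch_size=32, epochs=200):
    """Return (train_count, steps_per_epoch, total_updates) for a given dataset size."""
    test_count = math.ceil(num_files * test_size)  # scikit-learn rounds the test split up
    train_count = num_files - test_count
    steps_per_epoch = math.ceil(train_count / batch_size)  # last batch may be smaller
    return train_count, steps_per_epoch, steps_per_epoch * epochs

# Hypothetical dataset of 250 audio files
print(training_steps(250))  # (200, 7, 1400)
```

With a dataset this small, 200 epochs means the model sees every training clip 200 times, so it is worth watching the validation metrics for signs of overfitting.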

Model Evaluation

After training, we evaluate the model on the testing dataset. model.evaluate returns the loss followed by each compiled metric, so the second value is the accuracy, which tells us how well the model performs on held-out data. If the training accuracy is much higher than the test accuracy, the model is overfitting; techniques such as regularization can mitigate this.

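# Evaluate on the test set; evaluate() returns [loss, accuracy]
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f'Test accuracy: {test_accuracy:.4f}')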
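
A single accuracy number hides which class the errors fall on. The following pure-Python sketch computes accuracy together with a confusion matrix from one-hot labels and predicted probabilities. The helper names and the four example predictions are hypothetical, not output from the article's model.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy_and_confusion(y_true, y_prob, num_classes):
    """Compute accuracy and a confusion matrix (rows: true class, columns: predicted)."""
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for true_row, prob_row in zip(y_true, y_prob):
        confusion[argmax(true_row)][argmax(prob_row)] += 1
    correct = sum(confusion[i][i] for i in range(num_classes))
    return correct / len(y_true), confusion

# Hypothetical one-hot labels and softmax outputs for four test clips
y_true = [[1, 0], [1, 0], [0, 1], [0, 1]]
y_prob = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]]
print(accuracy_and_confusion(y_true, y_prob, 2))  # (0.75, [[1, 1], [0, 2]])
```

In this made-up example, the one mistake is a cat clip predicted as dog, which the overall accuracy of 0.75 would not reveal on its own.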

Saving and Loading the Model

Once our model is trained and evaluated, we save it to disk using the HDF5 format. This allows us to load and use the model for future predictions without having to retrain it each time.

# Save the model
model.save('audio_classification_model.h5')
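
The saved model file does not record the class names or the spectrogram shape used during preprocessing, so one option is to store them next to it. This is a sketch of that idea using the standard json module. The function names and the metadata file name in the usage note are assumptions, not part of the original notebook.

```python
import json

def save_metadata(path, classes, target_shape):
    """Write the class names and spectrogram shape used in training to a JSON file."""
    with open(path, 'w') as f:
        json.dump({'classes': list(classes), 'target_shape': list(target_shape)}, f)

def load_metadata(path):
    """Read the metadata back, restoring target_shape as a tuple."""
    with open(path) as f:
        meta = json.load(f)
    return meta['classes'], tuple(meta['target_shape'])
```

For example, save_metadata('audio_classification_metadata.json', ['cat', 'dog'], (128, 128)) could be called right after saving the model, and load_metadata used at prediction time so the class order and input shape cannot drift from what the model was trained on.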

Testing the Model on New Audio

We provide a function to test the trained model on new audio files. This function loads an audio file, preprocesses it, makes predictions, and displays the predicted class along with class probabilities. This is a crucial step in applying the model to real-world scenarios.

# Load the saved model
model = load_model('audio_classification_model.h5')

# Define the target shape for input spectrograms
target_shape = (128, 128)

# Define your class labels
classes = ['cat', 'dog']

# Function to preprocess and classify an audio file
def test_audio(file_path, model):
    # Load and preprocess the audio file
    audio_data, sample_rate = librosa.load(file_path, sr=None)
    mel_spectrogram = librosa.feature.melspectrogram(y=audio_data, sr=sample_rate)
    mel_spectrogram = resize(np.expand_dims(mel_spectrogram, axis=-1), target_shape)
    mel_spectrogram = tf.reshape(mel_spectrogram, (1,) + target_shape + (1,))

    # Make predictions
    predictions = model.predict(mel_spectrogram)

    # Get the class probabilities
    class_probabilities = predictions[0]

    # Get the predicted class index
    predicted_class_index = np.argmax(class_probabilities)

    return class_probabilities, predicted_class_index

# Test an audio file
test_audio_file = 'dog_barking_4.wav'
class_probabilities, predicted_class_index = test_audio(test_audio_file, model)

# Display results for all classes
for i, class_label in enumerate(classes):
    probability = class_probabilities[i]
    print(f'Class: {class_label}, Probability: {probability:.4f}')

# Display the predicted class and the model's confidence in it
# (the highest class probability; this is not an accuracy measurement)
predicted_class = classes[predicted_class_index]
confidence = class_probabilities[predicted_class_index]
print(f'The audio is classified as: {predicted_class}')
print(f'Confidence: {confidence:.4f}')
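
The highest class probability is the model's confidence in a single prediction, which is a different quantity from accuracy measured over a test set. In real-world use it can be helpful to refuse low-confidence predictions. The sketch below shows one way to do that. The function name interpret_prediction and the 0.7 threshold are assumptions chosen for illustration.

```python
def interpret_prediction(class_probabilities, classes, min_confidence=0.7):
    """Return (label, confidence); label is 'uncertain' when confidence is below the threshold."""
    probs = [float(p) for p in class_probabilities]
    index = max(range(len(probs)), key=lambda i: probs[i])
    confidence = probs[index]
    label = classes[index] if confidence >= min_confidence else 'uncertain'
    return label, confidence

print(interpret_prediction([0.05, 0.95], ['cat', 'dog']))  # ('dog', 0.95)
print(interpret_prediction([0.55, 0.45], ['cat', 'dog']))  # ('uncertain', 0.55)
```

A threshold like this matters most when the input may be neither a cat nor a dog, because a softmax layer always distributes its probability across the known classes.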

Conclusion

In this article, we’ve walked through the entire process of building an audio classification model using deep learning and TensorFlow. We started with dataset preparation and data preprocessing, followed by model architecture and training. We discussed model evaluation, saving and loading, and testing the model on new audio files.

Google Colab: https://colab.research.google.com/drive/1Ij2DjpBhCk66nUu1oKbxRTihFuQQqoTj?usp=sharing
