Sounds Right: An Introduction to Audio Classification

Dave Davies
11 min read · Nov 23, 2023


A beginner’s guide to audio classification, covering the end-to-end classification process and the basics of identifying and categorizing different types of sound with machine learning algorithms.

Written By: Mostafa Ibrahim

In this article, we will explore the topic of audio classification using machine learning. We will dive into the implementation of a simple audio classification example using Keras, one of the most popular deep learning libraries available, and discuss the importance of preprocessing audio data and utilizing convolutional neural network architectures to build accurate and efficient audio classifiers.

Overall, this article offers a glimpse into the fascinating world of audio classification and its potential impact on our understanding of the auditory world around us.

What Is Audio Classification in Machine Learning?

In machine learning, audio classification describes a computer’s ability to classify different sounds and audio events within a sound clip. It’s a pretty cool application of artificial intelligence that helps us make sense of the immense amount of audio out there.

There are many ways audio classification can be useful. Consider figuring out a song’s genre by listening to its beat or rhythm. Or, consider the power of transforming someone’s spoken words into written text through speech recognition. Audio classification even lets us detect the emotions in a person’s voice or identify who’s speaking by their unique vocal patterns.

Essentially, audio classification is a game changer in many fields, from music to environmental monitoring to communication. It’s a testament to the incredible advancements in machine learning that computers can now “listen” and understand audio just like we do!

How Does Audio Classification Work?


1- Spectrogram: Turning an audio waveform into a spectrogram is like creating a colorful map that shows the frequencies in a sound as they change over time. To create a spectrogram from an audio waveform, we split the audio into overlapping snapshots, smooth the edges with a window function, convert each snapshot into its frequency components using the Fourier Transform, and then assemble these components into a 2D image with time on the horizontal axis and frequency on the vertical axis. The brightness or color indicates the intensity of each frequency at different points in time.
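
As a quick illustration, here is a minimal sketch of that process using librosa and matplotlib; the file name is a placeholder, and the window and hop sizes are just common defaults:

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load an audio file at its native sampling rate (the path is a placeholder)
audio, sr = librosa.load("example.wav", sr=None)

# Short-Time Fourier Transform: overlapping windows -> frequency components
stft = librosa.stft(audio, n_fft=2048, hop_length=512)

# Convert magnitudes to decibels so intensities are easier to compare visually
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# Time on the horizontal axis, frequency on the vertical axis, color = intensity
librosa.display.specshow(spectrogram_db, sr=sr, hop_length=512, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram")
plt.show()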


2- CNN Architecture: The CNN audio classification architecture processes audio data that has been transformed into a visual format such as a spectrogram. The network starts with an input layer, which takes the spectrogram data, followed by convolutional layers that find local patterns in the input.

Each convolution is passed through a ReLU activation function. Next come the pooling layers, which help the CNN focus on the most important features: they shrink the feature maps by keeping only the strongest signals. Finally, the CNN flattens the remaining information into a one-dimensional format and feeds it into a few final dense layers that decide the audio’s classification.
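
As a rough sketch of what such a stack can look like in Keras (the input shape of (128, 128, 1) and the 10 output classes are placeholder assumptions, not values from the tutorial below):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

input_shape = (128, 128, 1)  # placeholder: spectrogram height x width x channels
num_classes = 10             # placeholder: number of sound categories

model = Sequential([
    # Convolutional layer with ReLU activation: finds local time-frequency patterns
    Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
    # Pooling layer: keeps the strongest responses and shrinks the feature maps
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    # Flatten to 1D, then dense layers decide the audio's classification
    Flatten(),
    Dense(128, activation="relu"),
    Dense(num_classes, activation="softmax"),
])
model.summary()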


3- Feature Maps: Feature mapping in audio classification is all about finding the important bits in an audio signal that help the model understand the sound better. It’s like creating a sketch highlighting the sound’s key parts.

When using a Convolutional Neural Network (CNN) for audio classification, we usually start with a spectrogram showing how frequencies change over time. Then, the CNN applies filters to this image to detect interesting patterns or structures. This filtering process results in a set of “feature maps” that emphasize different aspects of the sound.


In short, feature mapping simplifies the audio data, making it easier for the model to learn patterns and classify sounds accurately.
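
To make this concrete, here is a small sketch (with placeholder shapes and random input) that reads the feature maps produced by a single Keras convolutional layer; each of the eight filters yields its own 2D map:

import numpy as np
from keras.models import Sequential, Model
from keras.layers import Conv2D

# A single convolutional layer applied to a spectrogram-like input (placeholder shapes)
cnn = Sequential([Conv2D(8, (3, 3), activation="relu", input_shape=(128, 128, 1))])

# Wrap the network so we can read that layer's output directly
feature_extractor = Model(inputs=cnn.input, outputs=cnn.layers[0].output)

fake_spectrogram = np.random.rand(1, 128, 128, 1).astype("float32")  # placeholder input
feature_maps = feature_extractor.predict(fake_spectrogram)
print(feature_maps.shape)  # (1, 126, 126, 8): one 2D feature map per filter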

4- Linear Classifier: Once the features are mapped to a lower-dimensional space, a classification algorithm is used to classify the audio into one of several predefined classes. Linear classifiers, such as Support Vector Machines (SVMs) and logistic regression, are commonly used for audio classification, as they are simple and efficient.


Other algorithms commonly used in audio classification include decision trees, k-nearest neighbor (KNN) classifiers, and neural networks. These algorithms are often combined with feature mapping techniques to achieve better classification performance.
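
As a rough sketch of this idea, here is how a linear SVM and a logistic regression classifier can be fit with scikit-learn; the feature matrix below is random placeholder data standing in for per-clip feature vectors (for example, averaged MFCCs):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Placeholder data: one 40-dimensional feature vector per clip, 4 sound classes
X = np.random.randn(200, 40)
y = np.random.randint(0, 4, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear SVM classifier
svm = LinearSVC()
svm.fit(X_train, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))

# Logistic regression classifier
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print("Logistic regression accuracy:", accuracy_score(y_test, logreg.predict(X_test)))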

5- Final Output: The final output of an audio classification model is a prediction of the class label (or a binary label) for a given input audio sample, based on the relationships the model has learned between the input features and the class labels. In audio classification, the predicted class might be something like a car horn, a dog bark, an engine, or a drill.

Audio Classification Made Easy: The Best Libraries for the Job

Keras

Keras is a go-to choice for audio classification thanks to its ease of use and intuitive interface. With a wide range of pre-built models and neural network layers, like CNNs and RNNs, Keras makes building audio classification models accessible to both beginners and advanced users. These models can be easily customized and fine-tuned by adding or removing layers and adjusting hyperparameters, making it a versatile and flexible library.

What sets Keras apart, though, is its powerful tools for data preprocessing and augmentation. Keras makes it easy to load and preprocess audio data by converting audio files to spectrograms. With tools to augment data by adding noise, shifting pitches, or changing the tempo, Keras can help improve the performance of audio classification models, even when working with small datasets.
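
In practice, these augmentations are usually applied to the raw waveform (for example with librosa) before the data ever reaches the Keras model. A minimal sketch, assuming a placeholder file path:

import numpy as np
import librosa

def augment_waveform(audio, sr):
    # Return a few augmented copies of a waveform for training
    augmented = []

    # Add low-level Gaussian noise
    augmented.append(audio + 0.005 * np.random.randn(len(audio)))

    # Shift the pitch up by two semitones
    augmented.append(librosa.effects.pitch_shift(audio, sr=sr, n_steps=2))

    # Change the tempo by 10% without changing the pitch
    augmented.append(librosa.effects.time_stretch(audio, rate=1.1))

    return augmented

audio, sr = librosa.load("example.wav", sr=22050)  # placeholder path
augmented_clips = augment_waveform(audio, sr)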

In the tutorial below, we will create a Python audio classification model using the Keras library.

PyTorch

PyTorch is one of the most popular open-source deep learning libraries out there, and luckily for us, it is great for audio classification tasks. What sets it apart is its dynamic computational graph, which gives you more flexibility and control over the neural network architecture.

PyTorch also provides pre-built models like CNNs, RNNs, and transformers that can be easily customized and fine-tuned for specific audio classification tasks. And because it has a large and active community, there are plenty of resources and tutorials available for newcomers.

But what really makes PyTorch stand out is its powerful data preprocessing and augmentation tools. You can use these to convert audio files to spectrograms, add noise, shift pitches, and more, all of which can help improve the performance of your audio classification models. With excellent GPU support, PyTorch is a great option for anyone who needs to train models and run inference on large datasets quickly.
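
For example, here is a minimal torchaudio sketch for turning a clip into a mel spectrogram ready for a PyTorch model (the file path and transform parameters are placeholders):

import torch
import torchaudio
import torchaudio.transforms as T

# Load an audio file (the path is a placeholder); returns a (channels, samples) tensor
waveform, sample_rate = torchaudio.load("example.wav")

# Convert the waveform to a mel spectrogram, then to a decibel scale
mel = T.MelSpectrogram(sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=64)
to_db = T.AmplitudeToDB()
spectrogram = to_db(mel(waveform))  # shape: (channels, n_mels, time frames)

# Move the tensor to the GPU if one is available before feeding it to a model
device = "cuda" if torch.cuda.is_available() else "cpu"
spectrogram = spectrogram.to(device)
print(spectrogram.shape)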

Overall, PyTorch is a flexible and user-friendly library perfect for anyone looking to get more fine-grained control over their deep learning models.

Which Model Is Best for Audio Classification?

YAMNet


YAMNet is a really cool deep-learning model developed by the brilliant minds at Google for audio classification. Essentially, it’s a smart computer program that can identify all sorts of sounds — from musical instruments to human noises to environmental sounds — and put them into over 500 different categories.

One of the best things about YAMNet is that it can handle audio signals of different durations, from short bursts to longer soundscapes, because it slices the input into overlapping frames and classifies each one. This lets the model capture both short-term and longer-term audio features, which makes it highly accurate at classifying different types of sounds.

YAMNet was trained on AudioSet, a massive dataset of audio recordings, where it learned to distinguish between all sorts of sounds and achieved high accuracy on audio classification benchmarks. With its accuracy and flexibility, YAMNet is an incredibly powerful tool for audio classification research and applications.
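
If you want to try it, YAMNet is published on TensorFlow Hub. A minimal sketch (the audio path is a placeholder, and mapping the index to a human-readable label uses the class-map CSV that ships with the model):

import numpy as np
import librosa
import tensorflow_hub as hub

# Load the published YAMNet model from TensorFlow Hub
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

# YAMNet expects a mono float32 waveform sampled at 16 kHz (the path is a placeholder)
waveform, _ = librosa.load("example.wav", sr=16000, mono=True)

# The model returns per-frame class scores, embeddings, and a log-mel spectrogram
scores, embeddings, spectrogram = yamnet(waveform)

# Average the frame-level scores and report the strongest class index
mean_scores = np.mean(scores.numpy(), axis=0)
print("Top class index:", int(np.argmax(mean_scores)))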

VGGish

VGGish is a deep learning model developed by Google researchers that can be used for audio classification. It is based on the VGG architecture, which was originally developed for image classification.

To use VGGish for audio classification, one typically provides the model with a spectrogram of an audio signal. A spectrogram is a visual representation of how the frequency content of a sound changes over time. The VGGish model processes the spectrogram and produces a fixed-length embedding vector that encodes information about the audio. This embedding vector can then be used as a feature vector for various audio classification tasks, such as sound event detection, audio tagging, and acoustic scene classification.

The VGGish model was trained on a large dataset of audio recordings, which is publicly available as part of the AudioSet dataset. The model consists of a series of convolutional and pooling layers, followed by a set of fully connected layers that produce the final embedding.

Overall, VGGish is a powerful and flexible tool for audio classification. It has been shown to perform well on a variety of audio classification tasks and is commonly used in research and industry applications.
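
A minimal sketch of that workflow, using the VGGish model published on TensorFlow Hub (the audio path is a placeholder):

import librosa
import tensorflow_hub as hub

# Load the published VGGish model from TensorFlow Hub
vggish = hub.load("https://tfhub.dev/google/vggish/1")

# VGGish expects a mono float32 waveform sampled at 16 kHz (the path is a placeholder)
waveform, _ = librosa.load("example.wav", sr=16000, mono=True)

# The model returns one 128-dimensional embedding per roughly one-second frame
embeddings = vggish(waveform)
print(embeddings.shape)  # (number of frames, 128)

# These embeddings can then be fed to any downstream classifier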

An Example of Audio Classification Using Keras

The data set used in training this model: UrbanSound8K.

Step 1: Importing the Necessary Libraries

First, we need to import the necessary libraries.

import os
import numpy as np
import pandas as pd
import librosa
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten, Conv1D, MaxPooling1D

Step 2: Preprocessing the Data Set

Next, we load the UrbanSound8K metadata CSV and iterate through each of its rows, using the fold and file name columns to locate the corresponding audio file in the data_path directory. We load each audio file with librosa.load() and resample it to a target sampling rate of 22,050 Hz, then extract 40 MFCC coefficients and average them over time so that every clip is represented by a fixed-length feature vector.

def load_data(data_path, metadata_path):
    features = []
    labels = []

    # Read the UrbanSound8K metadata (file names, folds, and class labels)
    metadata = pd.read_csv(metadata_path)

    for index, row in metadata.iterrows():
        file_path = os.path.join(data_path, f"fold{row['fold']}", f"{row['slice_file_name']}")

        # Load the audio file and resample it
        target_sr = 22050
        audio, sample_rate = librosa.load(file_path, sr=target_sr)

        # Extract MFCC features and average them over time
        mfccs = librosa.feature.mfcc(y=audio, sr=target_sr, n_mfcc=40)
        mfccs_scaled = np.mean(mfccs.T, axis=0)

        # Append features and labels
        features.append(mfccs_scaled)
        labels.append(row['class'])

    return np.array(features), np.array(labels)

Step 3: Loading the UrbanSound8K Data Set

The data_path holds the path to the audio dataset, while the metadata_path holds the path to the CSV dataset.

data_path = "/kaggle/input/urbansound8k"
metadata_path = "/kaggle/input/urbansound8k/UrbanSound8K.csv"
features, labels = load_data(data_path, metadata_path)


# Encode labels
le = LabelEncoder()
labels_encoded = le.fit_transform(labels)
labels_onehot = to_categorical(labels_encoded)

Step 4: Splitting the Data Set

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, labels_onehot, test_size=0.2, random_state=42, stratify=labels_onehot)

Step 5: Building the 1D CNN Model

The 1D CNN model used in this code consists of several layers that aim to extract relevant features from the audio data and classify it into the correct category. The model starts with a 1D convolutional layer with 64 filters and a filter size of 3, followed by a max pooling layer and dropout layer to prevent overfitting. A second convolutional layer is added with 128 filters, followed by another max pooling and dropout layer.

The output of the second dropout layer is flattened and fed into a fully connected dense layer with 512 neurons and a ReLU activation function. Another dropout layer with a rate of 0.5 is added to further prevent overfitting. Finally, the output layer is a dense layer with a softmax activation function and as many neurons as there are classes in the dataset.

Overall, this model aims to learn the most important features from the audio data and make accurate predictions on the audio classification task.

input_shape = (X_train.shape[1], 1)
model = Sequential()
model.add(Conv1D(64, 3, padding='same', activation='relu', input_shape=input_shape))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.25))
model.add(Conv1D(128, 3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(le.classes_), activation='softmax'))

Step 6: Compiling the Model

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Step 7: Reshaping the Data To Fit the Input Shape of the Model

X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)

Step 8: Training the Model

model.fit(X_train, y_train, batch_size=32, epochs=100, validation_data=(X_test, y_test), verbose=1)

Step 9: Defining the Prediction Method

The predict_audio_class function is used to predict the class of a specific audio file. We will use this function on 5 audio files to check the model’s accuracy.

def predict_audio_class(file_path, model, le):
    # Load the audio file and resample it
    target_sr = 22050
    audio, sample_rate = librosa.load(file_path, sr=target_sr)

    # Extract MFCC features and average them over time
    mfccs = librosa.feature.mfcc(y=audio, sr=target_sr, n_mfcc=40)
    mfccs_scaled = np.mean(mfccs.T, axis=0)

    # Reshape the features to fit the input shape of the model
    features = mfccs_scaled.reshape(1, mfccs_scaled.shape[0], 1)

    # Predict the class
    predicted_vector = model.predict(features)
    predicted_class_index = np.argmax(predicted_vector, axis=-1)

    # Decode the class index to its corresponding label
    predicted_class = le.inverse_transform(predicted_class_index)

    return predicted_class[0]

Step 10: Evaluating the Model’s Performance

We trained our model for 100 epochs, after which it reached a validation accuracy of 0.9370.

Output:

Epoch 100/100

219/219 [==============================] - 1s 5ms/step - loss: 0.1221 - accuracy: 0.9641 - val_loss: 0.2957 - val_accuracy: 0.9370
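
You can also recompute the held-out metrics after training by evaluating the model directly on the test split:

# Evaluate the trained model on the held-out test set
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test loss: {test_loss:.4f}, test accuracy: {test_accuracy:.4f}")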

We will also test our trained model on 5 audio samples and print the results.

test_file_path1 = "/kaggle/input/urbansound8k/fold1/101415-3-0-2.wav"
predicted_class1 = predict_audio_class(test_file_path1, model, le)
print("Correct output is: Dog bark")
print("Predicted class:", predicted_class1)

Output: Correct output is: Dog bark

Predicted class: dog_bark

test_file_path2 = "/kaggle/input/urbansound8k/fold1/101415-3-0-3.wav"
predicted_class2 = predict_audio_class(test_file_path2, model, le)
print("Correct output is: Dog bark")
print("Predicted class:", predicted_class2)

Output: Correct output is: Dog bark

Predicted class: dog_bark

test_file_path3 = "/kaggle/input/urbansound8k/fold1/102305-6-0-0.wav"
predicted_class3 = predict_audio_class(test_file_path3, model, le)
print("Correct output is: Gun shots")
print("Predicted class:", predicted_class3)

Output: Correct output is: Gun shots

Predicted class: gun_shot

test_file_path4 = "/kaggle/input/urbansound8k/fold1/103074-7-0-2.wav"
predicted_class4 = predict_audio_class(test_file_path4, model, le)
print("Correct output is: Jack hammer")
print("Predicted class:", predicted_class4)

Output: Correct output is: Jack hammer

Predicted class: jackhammer

test_file_path5 = "/kaggle/input/urbansound8k/fold1/103074-7-4-3.wav"
predicted_class5 = predict_audio_class(test_file_path5, model, le)
print("Correct output is: Jack hammer")
print("Predicted class:", predicted_class5)

Output: Correct output is: Jack hammer

Predicted class: jackhammer

Conclusion

In this article, we have delved into the fascinating world of audio classification in machine learning. We have explored how audio classification can transform how we interact with and understand the sounds around us, from music genre identification to speech recognition and emotion detection. We’ve discussed the importance of preprocessing audio data, creating spectrograms, and utilizing CNN architectures to build accurate and efficient audio classifiers.

We also highlighted some of the best libraries available for audio classification, such as Keras and PyTorch, which offer powerful tools for building and training state-of-the-art models. Additionally, we touched upon advanced models like YAMNet and VGGish, which were developed by Google researchers and have proven to be highly effective in tackling various audio classification tasks.

In conclusion, audio classification is an exciting and rapidly evolving field in machine learning, with a wide range of applications across various domains. As technology advances and more sophisticated models emerge, we can expect even greater achievements in the realm of audio understanding, enhancing our ability to make sense of the auditory world around us.

http://wandb.me/quickstart
