Audio Deep Learning in Plain English
As part of our commitment at Hypa AI to bringing artificial intelligence innovation and education to technologically underrepresented communities, we regularly share our learnings and the results of our research along the way. In this blog, we will unpack the steps involved in classifying different sounds using a machine learning algorithm. We will be working with the Urban8k sound dataset, which contains over 8000 sound samples of 4 seconds or less, spread across 10 categories. This will be done using — you guessed it — the transformer architecture introduced for language modeling. The relative simplicity of the transformer architecture and its parallelizability (phew!) have contributed to its immense popularity in the field of machine learning. Its self-attention and cross-attention mechanisms allow it to understand long-range dependencies between different parts of an input. This is particularly useful when information needed to process a particular input is distant from that input. This core design gives it its versatility across different modalities.
So how do we feed an audio sample into a machine learning algorithm? You guessed right: we first convert it to an image. By visualizing an audio sample, we can loosely apply image processing techniques. At this point, we will take a useful detour to understand the intuition behind image classification using transformers.
Image Processing
A popular area of machine learning is teaching a neural network to classify, identify, and generate images. As with every machine learning problem, the first hurdle is how to feed image data to a machine learning algorithm. As explained in a previous post on constructing a language model, a machine learning algorithm is a formula that takes in numerical inputs and generates numerical outputs. Our first problem then reduces to: how do we convert a picture into a (set of) number(s)? A general approach is to take advantage of the digital makeup of an image — a grid of tiny squares of varying hues, with each square representing a pixel. Digitally speaking, a pixel is the atomic representation of an image and, depending on the color scheme used, can take on a varied but finite number of color and brightness values. The higher the number of pixels in an image, the higher its resolution. Conceptualized in this manner, an image can be represented by a matrix, where each entry is a numerical representation of the color of the corresponding pixel. Once we have a numerical representation of the image, we can apply the necessary data manipulation techniques to it before feeding it to the model.
For grayscale images, the color of the pixel can be numerically represented by an 8-bit value from 0 (00000000b) to 255 (11111111b), with 0 being the darkest gray (or black) and 255 being the lightest gray (or white). Essentially, every pixel represents a shade of gray, whose intensity correlates to the pixel value. To represent color images, we need a color schema, where each pixel color can span the spectrum of colors perceptible to the human eye. Fortunately, any color can be produced by combining red, green, and blue in the right intensities. This means that, as opposed to having one dimension, as in grayscale images, a colored image has pixels with three dimensions. You can think of the pixel value (or vector) as its intensity represented across a red channel, a green channel, and a blue channel. This means that every image can be digitally represented in a three-dimensional space — height, width, and number of channels.
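To make this concrete, here is a minimal sketch, assuming the Pillow and NumPy libraries and a placeholder file name (our choices for illustration, not necessarily the tooling used in this project), of how an image becomes a height x width x channels array of numbers:

```python
import numpy as np
from PIL import Image

# Load an image and convert it to NumPy arrays of pixel values.
# "example.jpg" is a placeholder path used for illustration.
img = Image.open("example.jpg")

gray = np.array(img.convert("L"))    # shape: (height, width), values 0-255
rgb = np.array(img.convert("RGB"))   # shape: (height, width, 3), values 0-255

print(gray.shape, gray.dtype)        # e.g. (224, 224) uint8
print(rgb.shape, rgb.dtype)          # e.g. (224, 224, 3) uint8
```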
Image Transformers
Currently, there are two popular algorithms for working with images — convolutional neural networks (CNNs) and image transformers. CNNs represent an older approach and are more effective when you have a relatively small data size. They learn the features in an image using filters or kernels that slide across the image in predetermined sizes, using the extracted local features to create a feature map. With image transformers, on the other hand, we first convert the image into a sequence of non-overlapping patches — essentially submatrices. These patches, like subwords in language modeling, constitute our tokens. This sequence of tokens is prepended with an additional classification token. This special token is added to the input sequence to aggregate information from all image patches. It allows the transformer to generate a global representation of the image, which can then be used for classification tasks. Next, the patches are projected into a higher-dimensional space (latent space) using a linear transformation that is learned during training. This linear transformation generally entails multiplying each patch submatrix by a learnable embedding matrix so that the result is a vector that the transformer can process effectively.
Since we are talking about images, a visualization of this preprocessing step seems in order.
- Original image is a three-dimensional matrix of pixels, with each plane representing a color channel.
- Image is split into a predetermined number of patches. The larger the size (height * width) of each patch, the lower the number of tokens generated from an image. This lowers the computation cost but also reduces the model’s ability to pay attention to details in the image. Conversely, the smaller the patches, the higher the computation cost and the better the model’s ability to understand the finer features in an image.
- Each patch is flattened out along each channel.
- Patches are concatenated along the channel dimension to get a sequence of tokens. The classification token is added at the beginning of the sequence.
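A rough sketch of these steps in PyTorch is shown below; the image size, patch size, and embedding dimension here are illustrative assumptions, not the values used later for the spectrograms:

```python
import torch
import torch.nn as nn

batch, channels, height, width = 8, 3, 64, 64   # illustrative image batch
patch_h, patch_w, n_embd = 16, 16, 192          # illustrative patch size and embedding dim

images = torch.randn(batch, channels, height, width)

# Split each image into non-overlapping patches and flatten each patch
# into a single vector of size channels * patch_h * patch_w.
patches = images.unfold(2, patch_h, patch_h).unfold(3, patch_w, patch_w)   # (B, C, H/ph, W/pw, ph, pw)
patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(3)                     # (B, H/ph, W/pw, C*ph*pw)
patches = patches.flatten(1, 2)                                            # (B, n_patches, C*ph*pw)

# Learned linear projection into the latent (embedding) space.
proj = nn.Linear(channels * patch_h * patch_w, n_embd)
tokens = proj(patches)                                                     # (B, n_patches, n_embd)

# Prepend a learnable classification token to every sequence.
cls_token = nn.Parameter(torch.zeros(1, 1, n_embd))
tokens = torch.cat([cls_token.expand(batch, -1, -1), tokens], dim=1)       # (B, 1 + n_patches, n_embd)
print(tokens.shape)
```

The same patchification can be written more compactly with a single strided convolution whose kernel size equals the patch size; the explicit version above is kept for clarity.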
Once we have this sequence of tokens, the rest of the process is similar to how the encoder works in natural language processing — each token is embedded in the latent space, learnable positional encodings are added, and the resulting sequence of embedded “patches” is passed through the image transformer encoder blocks.
After the transformer block, a classification head — a simple feedforward neural network — is fitted on the classification token. This linear transformation brings the dimension of the classification token from the embedding dimension size to the number of classes in the training dataset. For a more in-depth treatment of the image transformer, as well as an exploration of its performance on various image datasets, see our previous post.
Back to Audio Processing
A relevant question at this point is: Given an audio clip, how do we get an “array of pixels” which we can then subject to the preprocessing steps we used for images?
Time Domain to Frequency Domain
We hear sounds by perceiving changes in air pressure caused by vibrations emanating from the sound source. Our ears detect these changes in air pressure and transmit them to the brain, which then interprets them as sounds. The speed of the air particle vibrations determines the pitch of the sound. Slower vibrations produce lower-pitched sounds, while faster vibrations produce higher-pitched sounds. The eardrums vibrate inwards and outwards in response to the sound waves. The intensity of the vibrations determines the loudness of the sound. Humans can generally detect sounds in a frequency range of 20 Hz to 20 kHz, but infants can hear slightly higher frequencies.
For the purposes of audio processing, the two main characteristics of a sound wave are its amplitude and its frequency. In the time domain, we can visualize how the amplitude changes over time; however, it is more difficult to see the different frequencies change over time. To start off simple, let us consider a snapshot of a sound waveform in time. We will use a random example from the Urban8k sound dataset — one of an air conditioner.
The audio clip presented is a 4-second stereo recording of an air conditioner, hence the two channels. From the chart, we can immediately get a sense of how the amplitude varies over time. We can also tell that the waveform is not a pure sinusoid, so we cannot directly talk about its frequency. Fortunately, as ingeniously proven by Joseph Fourier, non-periodic signals can be represented as a sum — more accurately, an integral — of sine and cosine functions of different frequencies. This is called the Fourier transform. For discrete signals, like the ones we will be working with, we speak of the Discrete Fourier Transform (DFT). The DFT of a signal tells us to what extent different frequency components are present in the signal — this is commonly known as the frequency spectrum of a signal. The DFT is usually computed using an algorithm known as the Fast Fourier Transform — this method reduces the complexity of the calculation, yielding a quicker computation time.
If we examine the first 23ms of our waveform, we can see how difficult it is to tell what frequencies are present.
However, if we apply the DFT to this waveform, we get the frequency spectrum shown in the picture below. The horizontal axis represents the different frequency bins used in the computation of the DFT. For our transform, we are using 1025 frequency bins. The lowest frequency component that can be resolved from a discrete signal is set by the duration of the signal (its inverse gives the frequency resolution), while the highest frequency component that can be extracted is half the sampling rate, known as the Nyquist frequency.
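As a minimal illustration with NumPy, assuming a 44.1 kHz clip and a 2048-sample frame (which yields the 1025 frequency bins mentioned above); the file path is a placeholder:

```python
import numpy as np
import librosa

# Load one clip from the dataset (placeholder path, for illustration only).
y, sr = librosa.load("air_conditioner.wav", sr=44100, mono=True)

# Take one short frame and compute its magnitude spectrum with the FFT.
n_fft = 2048                                   # 2048 samples -> 1025 frequency bins
frame = y[:n_fft]
spectrum = np.abs(np.fft.rfft(frame))          # shape: (1025,)

# Bin k corresponds to a frequency of k * sr / n_fft Hz,
# up to the Nyquist frequency sr / 2.
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
print(spectrum.shape, freqs[-1])               # (1025,) 22050.0
```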
Spectrogram
For audio processing, it is useful to gather information about how the spectral content of our signal changes over time. With this information, we can construct what would amount to a “fingerprint” of our audio clip. To do this, we have to divide the signal into a specified number of time frames and perform a sliding DFT across these frames. This is called the Short-Time Fourier Transform (STFT), and it gives us a view of how the frequency components change over time. To obtain the STFT, we split the audio signal into predefined time steps, compute the Fast Fourier Transform (FFT) of each time step, and then combine the results.
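With librosa, a sketch of this step might look as follows; the window and hop sizes match the configuration values listed later in this post, while the file path is again a placeholder:

```python
import numpy as np
import librosa

y, sr = librosa.load("air_conditioner.wav", sr=44100, mono=True)   # placeholder path

# Short-Time Fourier Transform: a sliding FFT over overlapping frames.
stft = librosa.stft(y, n_fft=2048, hop_length=1024)

# Magnitude spectrogram: rows are frequency bins, columns are time frames.
spectrogram = np.abs(stft)
print(spectrogram.shape)   # (1025, n_frames)
```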
If we chop our waveform into 23ms windows and take the DFT of each time frame, the plot would look like the ones below. Each color represents a distinct time frame, and the highest magnitude of ~28 occurs somewhere between the 0th and the 200th frequency bin.
While this plot shows us the frequency content across different time frames, it does not present the information in a temporal sequence, which would be more intuitive. This is where the spectrogram shines. The spectrogram is a color map plot that shows how the frequency content changes over time — it has the frequency on the y-axis and the time frame on the x-axis. This plot is sometimes known as a waterfall plot.
The spectrogram for the air conditioner audio clip is shown above. The color bars on the right of each channel plot represent the magnitude — the darker the color, the lower the magnitude or intensity of that frequency component. Like our frequency spectrum plot earlier, this color bar tops out at a magnitude of ~28. From the spectrogram (and the frequency spectrum plot), it is clear that the highest amplitude occurs at the lower frequencies.
One of the pitfalls of the STFT is its fixed resolution, which involves a trade-off between time resolution and frequency resolution, governed by the length of the analysis window. As pointed out earlier, the number of frequency components that can be extracted from a signal using the DFT is directly proportional to the number of samples taken. For our spectrogram, the larger each time frame is, the more samples it contains and, consequently, the higher the number of frequency components that can be extracted from it. This gives a spectrogram with high frequency resolution. However, lengthening each time frame means reducing the overall number of time frames in an audio sample, resulting in a lower time resolution. Conversely, the shorter the time frame, the smaller the number of samples within it, and the smaller the number of extractable frequency components. We would, however, get more time frames out of an audio sample, giving us better time resolution. This is known as the time-frequency uncertainty principle.
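The trade-off is easy to see by varying the window length and inspecting the shape of the resulting spectrogram (a small experiment, using the same placeholder clip as before):

```python
import numpy as np
import librosa

y, sr = librosa.load("air_conditioner.wav", sr=44100, mono=True)   # placeholder path

for n_fft in (512, 2048, 8192):
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=n_fft // 2))
    # Longer windows -> more frequency bins but fewer time frames, and vice versa.
    print(f"n_fft={n_fft}: {S.shape[0]} frequency bins x {S.shape[1]} time frames")
```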
Melody Scale
The mel scale, short for melody scale, is a nonlinear scale that approximates human auditory perception of pitch. It is used to convert frequency values to a scale that better aligns with how humans perceive sound, making it valuable in various audio processing applications. This conversion is necessitated by the peculiar way our sense of hearing works.
When we plot a graph, the tick marks on an axis are usually spaced evenly. For example, let’s say we want to plot a graph of the height of a child (on the y-axis) vs their age (on the x-axis). The x-axis tick marks (in years) would be something like: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, and the y-axis tick marks (in centimeters) would be something like: 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240, 260, 280, 300, 320, 340. We label the y-axis this way because our eyes perceive the height difference between someone who is 60cm and someone who is 80cm the same way they perceive the height difference between someone who is 320cm and someone who is 340cm. However, our ears do not perceive pitch — the perceptual counterpart of a sound’s frequency — the same way. Consider a scenario where we are played two pairs of sounds: the first pair has frequencies 100Hz and 200Hz, the second pair has frequencies 1500Hz and 1600Hz. We are better able to discern the difference in pitch between the first pair than between the second pair. This is because we are better able to perceive the “distance” between sounds at lower frequencies than at higher frequencies.
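One widely used closed-form version of this mapping is the HTK convention, m = 2595 * log10(1 + f / 700); librosa exposes it as an option (its default otherwise uses the slightly different Slaney variant). Applying it to the two pairs of frequencies above shows how the perceptual gap shrinks at higher frequencies:

```python
import numpy as np
import librosa

def hz_to_mel_htk(f_hz):
    # HTK-style mel scale: m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

for low, high in [(100, 200), (1500, 1600)]:
    gap = hz_to_mel_htk(high) - hz_to_mel_htk(low)
    print(f"{low} Hz -> {high} Hz spans {gap:.1f} mel")   # ~132.7 mel vs ~49.3 mel

# librosa provides the same conversion:
print(librosa.hz_to_mel([100, 200, 1500, 1600], htk=True))
```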
Mel Spectrogram
As mentioned earlier, our spectrogram is a matrix whose number of rows corresponds to the number of frequency components in our FFT calculation and whose number of columns corresponds to the number of timeframes. Each entry in the matrix is the amplitude at a certain frequency and in a given timeframe — this can be understood as a representation of the spectral energy at a given moment in time. To convert it to a mel spectrogram, we need to convert the frequencies to a log scale and use a specified set of bandpass filters to capture the average energy for different frequency bands. These filters are also known as the mel filter banks. The filter banks are spaced so that they are narrower at lower frequencies and wider at higher frequencies. This mimics the human hearing experience, where we can better discern frequency differences at lower frequencies than we can at higher ones.
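In code, this step amounts to building a mel filter bank matrix and multiplying it with the power spectrogram; a sketch with librosa, reusing the earlier STFT parameters (the path is still a placeholder):

```python
import numpy as np
import librosa

y, sr = librosa.load("air_conditioner.wav", sr=44100, mono=True)         # placeholder path

power_spec = np.abs(librosa.stft(y, n_fft=2048, hop_length=1024)) ** 2   # (1025, n_frames)

# Mel filter bank: narrow triangular filters at low frequencies,
# wider ones at high frequencies.
mel_fb = librosa.filters.mel(sr=sr, n_fft=2048, n_mels=128)              # (128, 1025)

mel_spec = mel_fb @ power_spec                                           # (128, n_frames)
print(mel_spec.shape)
```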
Applying the transform to the spectrogram results in a mel spectrogram with the number of frequency bins equal to the number of mel filter banks. As can be seen in the mel spectrogram plot above, the energy is accentuated at the lower frequency components, similar to what we observed in the regular spectrogram.
Differences to note between the spectrogram and the mel spectrogram are the presence of more bright spots and higher tick values in the color bar. These can be attributed to the aggregation of energies across frequency bins by the mel filter banks. However, we can still see that a lot of the plot is dark. This is because, as we advance through time, the amplitudes range widely across the different frequency components. Hence, we need to compress the amplitudes into a range that we can work with.
Logarithmic Loudness
In addition to the logarithmic perception of pitch (or frequency), humans also perceive the loudness of a sound logarithmically. This means that equal ratios of sound pressure level are perceived as equal differences in loudness. Consequently, another transformation we have to do on the mel spectrogram is to move the amplitudes calculated to the decibel scale. By doing this, we can adjust the range of the amplitude so that most of the amplitudes are within the useful range of human hearing. We do this by specifying a decibel range for our signal — this is usually referred to as the top dB. The minimum dB to be represented is calculated by subtracting the top dB from the maximum dB. For example, if we have a signal where the highest decibel value is 110 dB and a specified top dB of 80, the following happens:
- High amplitude (110 dB): Stays at 110 dB.
- Amplitude corresponding to 70 dB: Stays at 70 dB.
- Amplitude corresponding to 40 dB: Stays at 40 dB.
- Amplitude corresponding to 20 dB: Clipped to 30 dB, because it falls below the threshold of 30 dB (110 dB - 80 dB).
This compression allows us to extract the most useful audio information out of our signal for processing. The mel spectrogram with amplitudes in dB is what we set out to get when we first spoke of the “fingerprint” of our audio signal.
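With librosa, this conversion is a single call; the top_db argument implements the clipping behavior described above (the parameters mirror our configuration, and the path is a placeholder):

```python
import numpy as np
import librosa

y, sr = librosa.load("air_conditioner.wav", sr=44100, mono=True)        # placeholder path
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                          hop_length=1024, n_mels=128)

# Convert power to decibels relative to the loudest value, clipping anything
# more than 80 dB below the peak; the result lies in the range [-80, 0] dB.
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max, top_db=80.0)
print(mel_spec_db.min(), mel_spec_db.max())
```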
Building the Data Pipeline and Model
An audio data preprocessing pipeline was first built to convert the audio clip into a normalized mel spectrogram. This spectrogram was then augmented — by masking out a specified number of timeframes and a specified number of frequency bins. The result was a matrix whose values corresponded to the input audio clip. With a 2-D numerical representation of our audio clip, we can treat the rest of the process as we would a matrix of pixels about to be fed into an image transformer. Refer to the section above on image processing for more details on this. The details of the preprocessing pipeline are as follows:
- Standardize channel size: Two channels were used for this exercise, since using only one channel would mean discarding information from stereo recordings. Audio samples with a single channel were duplicated into a second channel.
- Standardize sampling rate: This ensures that a second of recording for every audio clip holds the same number of samples.
- Standardize audio duration: Once each audio clip holds the same number of samples per second, we then make each clip the same length in seconds. We do this by either truncating or padding the signal.
- Time roll: This is a data augmentation technique used to improve the robustness of the model performance by introducing slight variations in the training data. Here, we shift the signal (with wraparound) to the left or right by a randomly specified percent.
- Create and normalize spectrogram: The audio signal is transformed into a mel spectrogram with amplitude in dB. This spectrogram is then normalized so its values lie between 0 and 1 — this allows faster convergence and prevents issues like vanishing or exploding gradients, which can occur when the input data has a wide range of values.
- Mask timeframes and frequency bins: This is another data augmentation technique, applied directly to the mel spectrogram, used to improve model robustness. We replace a specified percentage of the rows (frequency bins) and columns (timeframes) of the mel spectrogram matrix with their average values. A condensed code sketch of the full pipeline follows below.
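Below is a condensed sketch of such a pipeline using torchaudio. This is our reconstruction of the steps listed above with illustrative parameter values (shift range, mask sizes), not a verbatim copy of the training code:

```python
import random
import torch
import torch.nn.functional as F
import torchaudio
import torchaudio.transforms as T

STD_SR, STD_MS, N_CHANNELS = 44100, 4000, 2

def preprocess(path):
    wav, sr = torchaudio.load(path)                              # (channels, samples)

    # 1. Standardize channel count: duplicate mono clips into two channels.
    if wav.shape[0] == 1:
        wav = wav.repeat(N_CHANNELS, 1)

    # 2. Standardize sampling rate.
    if sr != STD_SR:
        wav = T.Resample(orig_freq=sr, new_freq=STD_SR)(wav)

    # 3. Standardize duration: pad with zeros or truncate to 4 seconds.
    n_samples = STD_SR * STD_MS // 1000
    if wav.shape[1] < n_samples:
        wav = F.pad(wav, (0, n_samples - wav.shape[1]))
    else:
        wav = wav[:, :n_samples]

    # 4. Time roll: shift the signal with wraparound by a random fraction.
    shift = int(random.uniform(-0.1, 0.1) * n_samples)
    wav = torch.roll(wav, shifts=shift, dims=1)

    # 5. Mel spectrogram in dB, normalized to [0, 1].
    spec = T.MelSpectrogram(sample_rate=STD_SR, n_fft=2048,
                            hop_length=1024, n_mels=128)(wav)
    spec = T.AmplitudeToDB(top_db=80)(spec)
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-8)

    # 6. Mask random frequency bins and time frames (augmentation).
    spec = T.FrequencyMasking(freq_mask_param=8)(spec, mask_value=spec.mean().item())
    spec = T.TimeMasking(time_mask_param=8)(spec, mask_value=spec.mean().item())
    return spec                                                   # (2, 128, n_timeframes)
```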
The model used is a multi-layer vanilla transformer encoder architecture with multi-head attention — 6 layers and 6 heads. A classification head is attached to the classification token which is output from the final layer of the transformer. This classification head is a simple feedforward neural network that takes the dimension of the classification token from the latent embedding space dimension to the number of classes in our dataset. This output, called the logits, is then passed to a softmax activation function for conversion into probabilities.
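A compact sketch of such a model in PyTorch is shown below. The hyperparameters mirror those in the runs that follow, while the implementation details (patch extraction, learnable positional embeddings, the number of time frames) are our assumptions about how such a model could be put together, not the exact training code:

```python
import torch
import torch.nn as nn

class AudioSpectrogramTransformer(nn.Module):
    def __init__(self, n_embd=192, n_head=6, n_layer=6, n_class=10,
                 n_channels=2, n_mels=128, n_timeframes=172,   # n_timeframes is illustrative;
                 patch_height=32, patch_width=4, dropout=0.2): # it must be divisible by patch_width
        super().__init__()
        self.patch_height, self.patch_width = patch_height, patch_width
        n_patches = (n_mels // patch_height) * (n_timeframes // patch_width)
        patch_dim = n_channels * patch_height * patch_width

        self.patch_embed = nn.Linear(patch_dim, n_embd)                 # linear projection of patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, n_embd))        # classification token
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, n_embd))

        layer = nn.TransformerEncoderLayer(d_model=n_embd, nhead=n_head,
                                           dim_feedforward=4 * n_embd,
                                           dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layer)
        self.head = nn.Linear(n_embd, n_class)                          # classification head

    def forward(self, spec):                       # spec: (B, channels, n_mels, n_timeframes)
        B = spec.shape[0]
        ph, pw = self.patch_height, self.patch_width
        patches = spec.unfold(2, ph, ph).unfold(3, pw, pw)              # (B, C, H/ph, W/pw, ph, pw)
        patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(3)          # (B, H/ph, W/pw, C*ph*pw)
        patches = patches.flatten(1, 2)                                 # (B, n_patches, patch_dim)

        tokens = self.patch_embed(patches)
        tokens = torch.cat([self.cls_token.expand(B, -1, -1), tokens], dim=1)
        tokens = tokens + self.pos_embed
        tokens = self.encoder(tokens)
        logits = self.head(tokens[:, 0])           # classify from the classification token only
        return logits                              # softmax (or CrossEntropyLoss) applied outside

# Example: a Run 1-like configuration on a dummy batch of normalized mel spectrograms.
model = AudioSpectrogramTransformer(n_embd=192)
dummy = torch.rand(4, 2, 128, 172)
probs = torch.softmax(model(dummy), dim=-1)        # (4, 10) class probabilities
```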
Discussion of Results
The Urban8k sound dataset used to train the model contains audio clips belonging to 1 of 10 categories: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The dataset was partitioned using an 80:10:10 split — 6986 audio clips were used for training, 873 for testing, and 873 were reserved for model validation. For this exercise, we wanted to evaluate the effect of the latent space embedding dimension on model performance. Several training runs were executed, each one doubling the embedding size of the previous run.
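As a point of reference, a split of these proportions can be reproduced with torch.utils.data.random_split; the dataset object below is a stand-in, since the real pipeline wraps the preprocessed clips and their labels:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in for the real dataset: in practice each item is a (mel spectrogram, label) pair.
dummy_dataset = TensorDataset(torch.zeros(8732, 1), torch.randint(0, 10, (8732,)))

train_set, test_set, val_set = random_split(
    dummy_dataset, [6986, 873, 873],
    generator=torch.Generator().manual_seed(0))     # fixed seed for a reproducible split
print(len(train_set), len(test_set), len(val_set))  # 6986 873 873
```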
Run 1
n_embd = 192
device = 'cuda' if torch.cuda.is_available() else 'cpu'
learning_rate = 4e-5
step_size = 20
gamma = 0.1
n_head = 6
n_layer = 6
n_class = 10
dropout = 0.2
epochs = 20
batch_size = 64
patch_width = 4
patch_height = 32
n_mels = 128
n_fft = 2048
hop_length = 1024
std_sampling_rate = 44100
std_audio_duration = 4000
This results in a model size of 2.75043M. After training for 25 epochs, the results were as follows:
epoch: 25
training loss: 1.4176
training accuracy: 54.61%
test loss: 1.4294
test accuracy: 53.81%
The prediction accuracy of ~54% across 10 classes and the slightly better performance (higher classification accuracy and lower loss) on the test data point to an insufficiently sized model. Ideally, the model should have better prediction accuracy on the training data, since this is the data used to update its weights during the optimization.
Run 2
n_embd = 384
device = 'cuda' if torch.cuda.is_available() else 'cpu'
learning_rate = 4e-5
step_size = 20
gamma = 0.1
n_head = 6
n_layer = 6
n_class = 10
dropout = 0.2
epochs = 20
batch_size = 64
patch_width = 4
patch_height = 32
n_mels = 128
n_fft = 2048
hop_length = 1024
std_sampling_rate = 44100
std_audio_duration = 4000
This results in a model size of 10.809246M. After training for 25 epochs, the results were as follows:
epoch: 25
training loss: 1.0712
training accuracy: 70.38%
test loss: 1.0654
test accuracy: 70.48%
Here, we increase the model size to ~11M parameters by doubling the latent space embedding dimension. We see an improvement in the steady-state classification accuracy after 25 epochs, even though the performance on the test data still slightly outdoes the performance on the training data. A good consideration in choosing the size of the model is the avoidance of overfitting — this happens when model performance on the test data is markedly worse than it is on the training data. We posit that as long as the model’s performance on the training data does not exceed its performance on the test data, there is still room for model growth without the risk of overfitting. Obviously, we are far from that point.
Run 3
n_embd = 768
device = 'cuda' if torch.cuda.is_available() else 'cpu'
learning_rate = 4e-5
step_size = 20
gamma = 0.1
n_head = 6
n_layer = 6
n_class = 10
dropout = 0.2
epochs = 20
batch_size = 64
patch_width = 4
patch_height = 32
n_mels = 128
n_fft = 2048
hop_length = 1024
std_sampling_rate = 44100
std_audio_duration = 4000
This results in a model size of 42.852126M. After training for 25 epochs, the results were as follows:
training loss: 0.7426
training accuracy: 82.22%
test loss: 0.8514
test accuracy: 77.62%
For the third run, we double the embedding dimension again, to 768. This swells the model size to ~43M and results in a considerable improvement in model performance. We can also see that the model’s performance on the training data starts to outstrip its performance on the test data. We are definitely at a point where we need to be cognizant of overfitting to our training data, which would most likely happen if we trained for more epochs.
Run 4
n_embd = 768
device = 'cuda' if torch.cuda.is_available() else 'cpu'
learning_rate = 4e-5
step_size = 20
gamma = 0.1
n_head = 6
n_layer = 6
n_class = 10
dropout = 0.2
epochs = 25
batch_size = 64
patch_width = 4 #should be a factor of n_timeframe
patch_height = 16 #should be a factor of n_mels
n_mels = 128
n_fft = 2048
hop_length = 1024
std_sampling_rate = 44100
std_audio_duration = 4000
This results in a model size of 42.885918M. After training for 25 epochs, the results were as follows:
epoch: 25
training loss: 0.8026
training accuracy: 79.93%
test loss: 0.8610
test accuracy: 77.71%
For our fourth run, we take a break from expanding the latent space embedding dimension and instead increase the resolution of the training tokens. While holding the embedding dimension at 768, we halve the patch height — from 32 to 16. This increases the frequency resolution of the tokens that each mel spectrogram decomposes into; as a result, each mel spectrogram now breaks down into twice as many tokens as in the previous runs. The intuition is to see whether model performance improves when the model is exposed to more granular frequency variations in the audio clip. Unfortunately, this approach resulted in slightly degraded training accuracy after 25 epochs — from ~82% to ~80%.
We suspect the increase in number of tokens per mel spectrogram should have been accompanied by an increase in the number of attention heads in the model and would need to run some more experiments to verify this.
Run 5
n_embd = 1536
device = 'cuda' if torch.cuda.is_available() else 'cpu'
learning_rate = 4e-5
step_size = 20
gamma = 0.1
n_head = 6
n_layer = 6
n_class = 10
dropout = 0.2
epochs = 25
batch_size = 64
patch_width = 4 #should be a factor of n_timeframe
patch_height = 32 #should be a factor of n_mels
n_mels = 128
n_fft = 2048
hop_length = 1024
std_sampling_rate = 44100
std_audio_duration = 4000
This results in a model size of 170.706462M. After training for 25 epochs, the results were as follows:
epoch: 25
training loss: 0.5261
training accuracy: 89.89%
test loss: 0.6097
test accuracy: 85.85%
This is the trial run with the largest model size in the experiment — a latent embedding dimension of 1536. In keeping with the trend observed so far, this model delivers the best performance of all the models trained — achieving a training accuracy of ~90% and a test accuracy of ~86% after 25 epochs. This performance comes at a memory and computation cost, both at training and at inference time. These costs can be mitigated using proven methods such as KV caching.
Conclusion
In this blog, we focused on developing an audio data preprocessing pipeline and a transformer model that takes in an audio clip from the Urban8k dataset and predicts which of 10 classes it falls into. The pipeline produced a 2D numerical representation of each audio clip, akin to an image matrix ready for processing by an image transformer model. The preprocessing steps included standardizing the channel count, sampling rate, and audio duration, applying data augmentation techniques like time rolling and time/frequency masking, and normalizing the resulting mel spectrogram.
Five training runs were executed, each doubling the embedding size of the previous run. For the fourth run, we took a detour from scaling up the embedding size and instead increased the number of ‘patches’ obtained from each mel spectrogram. The results showed that increasing the embedding dimension led to improved classification accuracy, with the largest embedding size of 1536 achieving the best performance.
The study’s limitations encompass several key areas that could affect the overall robustness and generalizability of the model for audio classification tasks. Firstly, the reliance on the Urban8k dataset, while widely used, may limit the model’s ability to generalize to a broader range of audio types and environments. Additionally, the specific parameters chosen for data augmentation, such as the number of timeframes and frequency bins masked, may not be optimal for all types of audio data, potentially impacting the model’s performance. In future posts, we hope to study the effect of varying timeframes and frequency bins on model performance. Another noteworthy point is that increasing the model’s complexity, particularly by raising the embedding dimension, can improve performance but also raises the risk of overfitting. This highlights the need for exploration of regularization techniques to mitigate this risk. Furthermore, the evaluation metrics primarily focused on accuracy, neglecting other important metrics such as precision and recall, which could provide a more comprehensive assessment of the model’s performance, especially in imbalanced datasets.
About the Authors
Christopher Ibe and Okezie Okoye continue to lead Hypa AI towards new frontiers in AI translation. Their dedication to leveraging advanced AI for genuine understanding and connection across language barriers is what sets Hypa AI apart in the field of artificial intelligence.
Hypa AI remains steadfast in its mission to pioneer intelligent solutions that are not just technologically advanced but are also culturally aware, ensuring that the future of AI is as diverse and inclusive as the world it serves.
AfroVoices, a subsidiary of Hypa AI, is dedicated to amplifying African voices, languages, and cultures in the intelligence age. Focused on bridging the digital representation gap, AfroVoices curates datasets and resources for African languages, promoting inclusivity and cultural appreciation in AI technologies. Their mission goes beyond technological innovation, aiming to celebrate the richness of African linguistic diversity on a global stage.