Unveiling the Transformer: Impact of Layers and Attention Heads in Audio Classification

Christopher Ibe
Jun 19, 2024

Introduction

Transformers have revolutionized various machine learning tasks, including audio classification. But how do we optimize these models for peak performance? In our prequel, we explored how transformers can be adapted from language models to classify sound clips by converting them into images and processing them as sequences of patches. Our initial experiments focused on varying the latent space embedding dimension, showing a clear trend: larger embedding dimensions generally improved classification accuracy. However, we also encountered limitations, including potential overfitting and the increased computational cost associated with larger models.

As part of our commitment at Hypa AI to bringing artificial intelligence innovation and education to technologically underrepresented communities, we regularly share our learnings and the results of our research along the way.

In this second installment, we delve deeper into the interplay between two key transformer components: the number of layers (depth) and the number of attention heads. We used the same UrbanSound8K dataset as in the prequel and kept all parameters and preprocessing steps consistent, varying only the components under ablation. Our goal is to uncover how these architectural choices impact model performance and to provide actionable insights for building more efficient and effective models.

# Template Config for a 6-layer, 6-head Audio Transformer
import torch

n_embd = 192 # Number of embedding dimensions
device = 'cuda' if torch.cuda.is_available() else 'cpu' # Use GPU if available, otherwise CPU
learning_rate = 4e-5 # Learning rate for training
step_size = 20 # Step size for learning rate scheduler
gamma = 0.1 # Learning rate decay factor
n_head = 6 # Number of attention heads
n_layer = 6 # Number of layers in the model
n_class = 10 # Number of output classes
dropout = 0.2 # Dropout rate
epochs = 15 # Number of training epochs
batch_size = 64 # Batch size for training
patch_width = 4 # Patch width (factor of n_timeframe)
patch_height = 32 # Patch height (factor of n_mels)
n_mels = 128 # Number of Mel frequency bins
n_fft = 2048 # FFT window size (number of samples in each FFT window)
hop_length = 1024 # Hop length (number of samples between successive FFT windows)
std_sampling_rate = 44100 # Standard sampling rate (Hz)
std_audio_duration = 4000 # Standard audio duration (milliseconds)

Experiment 1: Unveiling the Power of Depth

Our first experiment explored the impact of varying the number of layers in the transformer model on its ability to classify sounds accurately. We maintained consistent parameters across all runs, varying only the number of layers. We tested models with 2, 4, 6, 8, 12, 16, 24, 32, 48, and 96 layers. Each model was trained for 15 epochs using the same dataset and preprocessing pipeline described in our prequel.
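
To make the setup concrete, the sweep can be written as a simple loop over depth values while everything else stays at the template config above. AudioTransformer and train_and_evaluate below are hypothetical placeholders standing in for our actual model and training code, so treat this as a sketch rather than the exact implementation.

# Sketch of the depth ablation; AudioTransformer and train_and_evaluate are hypothetical placeholders
depth_values = [2, 4, 6, 8, 12, 16, 24, 32, 48, 96]
results = {}
for depth in depth_values:
    model = AudioTransformer(n_embd=192, n_head=6, n_layer=depth, n_class=10, dropout=0.2).to(device)
    train_acc, test_acc = train_and_evaluate(model, epochs=15, batch_size=64, learning_rate=4e-5)
    results[depth] = (train_acc, test_acc)  # compare accuracies across depths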

We observed a clear trend: as depth increased, so did the model’s performance on both training and test accuracy, with diminishing returns beyond 32 layers. Models with fewer layers (2 to 4) underfit the data, reflected in high losses and low accuracies. As we added layers, the models learned more complex relationships and patterns within the data, reducing training and test loss and increasing accuracy. This aligns with the core idea of scaling deep learning models. However, beyond 24 layers the accuracy gains were minimal, and we began to see signs of overfitting, characterized by a divergence between the training and test losses.

Optimal configuration: Models with 12 to 32 layers provided the best balance between model complexity and performance, avoiding the pitfalls of underfitting and overfitting.

Experiment 2: Demystifying Attention Heads

Next, we examined the effect of varying the number of attention heads on model performance. Attention heads act as parallel processors within each layer, allowing the model to focus on different parts of the input sequence simultaneously and potentially capture more nuanced patterns.
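
To see why the number of heads interacts with the embedding dimension, here is a minimal, generic PyTorch sketch of multi-head self-attention (not our exact module): the embedding is split into n_head subspaces of size n_embd // n_head, and each head attends within its own slice.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    # Generic illustration: each head works in a subspace of size n_embd // n_head
    def __init__(self, n_embd, n_head, dropout=0.2):
        super().__init__()
        assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
        self.n_head = n_head
        self.head_dim = n_embd // n_head   # e.g. 192 // 6 = 32 dims per head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                  # x: (batch, seq_len, n_embd)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape so each head attends independently over its own slice of the embedding
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        att = F.softmax(att, dim=-1)
        out = att @ v                      # (B, n_head, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.dropout(self.proj(out))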

We tested models with 2, 4, 6, 8, 12, 16, 24, 32, 48, and 96 attention heads while keeping the number of layers constant at 6.

Surprisingly, increasing the number of attention heads did not lead to improved performance. In fact, models with more attention heads performed worse, likely due to overfitting and the model learning noise instead of meaningful patterns. Models with 2 attention heads provided the best performance, balancing complexity and capacity effectively.

Optimal configuration: Models with 2 attention heads showed the best results, suggesting that more heads do not necessarily capture better information. One hypothesis is an “attention bottleneck”: the model cannot effectively utilize information from too many heads. This could be due to limitations in the model’s capacity (the number of embedding features, n_embd=192), and we speculate that increasing it (say, to n_embd=768, as in our prequel) would reverse the observed inverse relationship between model performance and number of attention heads.
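
A quick back-of-the-envelope check supports this: with n_embd fixed at 192, the subspace each head operates in shrinks as heads are added, down to just 2 dimensions per head at 96 heads.

# Per-head dimensionality for each head count tested, with n_embd fixed at 192
n_embd = 192
for n_head in [2, 4, 6, 8, 12, 16, 24, 32, 48, 96]:
    print(f"{n_head:2d} heads -> {n_embd // n_head:2d} dims per head")  # 2 heads -> 96 dims ... 96 heads -> 2 dims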

It’s also worth noting the computational cost: increasing the number of layers, attention heads, or embedding dimensions increases the computational complexity of the model, which can become a significant factor in resource-constrained environments.

Experiment 3: The Intricate Dance of Depth and Heads

The final experiment investigated the combined effect of varying both the number of layers and the number of attention heads. We tested configurations ranging from 2 to 48 for each, and here we observed an interplay between the two.
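
The exact pairing of configurations isn’t spelled out above; one reading is a paired sweep in which depth and head count grow together, sketched below with the same hypothetical AudioTransformer and train_and_evaluate helpers as in Experiment 1.

# Paired sweep over depth and heads (one possible reading of the configurations tested)
for n in [2, 4, 6, 8, 12, 16, 24, 32, 48]:
    model = AudioTransformer(n_embd=192, n_head=n, n_layer=n, n_class=10, dropout=0.2).to(device)
    train_acc, test_acc = train_and_evaluate(model, epochs=15, batch_size=64, learning_rate=4e-5)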

Combining increased layers and attention heads showed similar trends to varying layers alone, with diminishing returns in training loss and accuracy beyond 24 heads and layers. Models with fewer layers and heads underfit the data, while those with moderate complexity balanced performance and generalization effectively. Increasing both beyond a certain point led to overfitting, where the model learned noise rather than useful patterns. Similar to the analysis with just varying heads, there might be an “attention bottleneck” where the model struggles to utilize information from too many heads, even with additional layers.

Optimal configuration: Models with 12 to 24 heads and layers provided the best trade-off between complexity and performance.

Conclusions and Recommendations

Our experiments underscore the importance of balancing model complexity with the capacity to generalize effectively. An interesting observation is how increasing the number of attention heads progressively reduced classification accuracy. This unexpected result indicates that simply adding more heads does not necessarily capture better information and can, in fact, lead to overfitting.

There is another notable comparison between increasing a model’s number of layers versus its embedding dimension. A model with n_embd=192, n_head=2, and n_layer=96 had a parameter count of 42.73635M and achieved training and test accuracies of 74.59% and 70.12%, respectively. In contrast, the model from our prequel with n_embd=768, n_head=6, and n_layer=6, with a similar parameter count of 42.852126M, achieved higher training and test accuracies of 82.22% and 77.62%, respectively.
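
The near-identical parameter counts are no coincidence. Assuming a standard transformer block (query/key/value/output projections plus a 4x MLP), each block contributes roughly 12 * n_embd^2 parameters, so the block total scales as n_layer * n_embd^2; embeddings and the classifier head account for the small remainder. Under that assumption, the two shapes give exactly the same block estimate, since 96 * 192^2 = 6 * 768^2.

# Rough block-parameter estimate for a standard transformer block with a 4x MLP
# (ignores embeddings, positional parameters, layer norms, and the classifier head)
def approx_block_params(n_embd, n_layer):
    return 12 * n_embd ** 2 * n_layer

print(approx_block_params(192, 96) / 1e6)  # ~42.47M, close to the 42.73635M reported above
print(approx_block_params(768, 6) / 1e6)   # ~42.47M, close to the 42.852126M from the prequel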

This suggests that increasing the embedding size might yield better results for classification tasks since it allows the model to capture more diverse features. More layers, on the other hand, are beneficial when the data composition is hierarchical, as each layer can pick up different aspects of this hierarchy. For example, in image classification, early layers detect edges and textures, while deeper layers identify more complex patterns.

To further enhance model performance:

  • Regularization Techniques: Implement dropout, L2 regularization, and data augmentation to prevent overfitting.
  • Hyperparameter Tuning: Fine-tune other parameters like learning rate and batch size.
  • Early Stopping: Use early stopping based on validation loss to avoid overfitting (see the sketch after this list).
  • Ensemble Methods: Combine predictions from models with varying configurations to leverage their strengths.
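
As an illustration of the early-stopping point above, a minimal patience-based check on validation loss could look like the following; train_one_epoch and validate are hypothetical helpers standing in for the actual training loop.

# Minimal patience-based early stopping on validation loss
# train_one_epoch and validate are hypothetical helpers
best_val_loss, patience, bad_epochs = float('inf'), 3, 0
for epoch in range(epochs):
    train_one_epoch(model)
    val_loss = validate(model)
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), 'best_model.pt')  # keep the best checkpoint so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"Stopping early at epoch {epoch}")
            break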

About the Authors

Christopher Ibe and Okezie Okoye continue to lead Hypa AI towards new frontiers in AI translation. Their dedication to leveraging advanced AI for genuine understanding and connection across language barriers is what sets Hypa AI apart in the field of artificial intelligence.

Hypa AI remains steadfast in its mission to pioneer intelligent solutions that are not just technologically advanced but are also culturally aware, ensuring that the future of AI is as diverse and inclusive as the world it serves.

AfroVoices, a subsidiary of Hypa AI, is dedicated to amplifying African voices, languages, and cultures in the intelligence age. Focused on bridging the digital representation gap, AfroVoices curates datasets and resources for African languages, promoting inclusivity and cultural appreciation in AI technologies. Their mission goes beyond technological innovation, aiming to celebrate the richness of African linguistic diversity on a global stage.
