Continuing from our Lukthung (music genre) Classification — Part 1, we will go over the audio model in this blog. As in the last post, you can get full details of our work from our paper here. Now let’s get started!
For each song in our dataset, we extracted a 10-second audio spectrogram from the chorus using the parameters below. We used the Python library librosa to perform the extraction. You can refer to this blog post: Using LibROSA to extract audio features.
- Sampling rate: 22050 Hz, i.e. a sample period of about 4.54e-5 s
- Frequency range: 300–8000 Hz
- Number of Mel bins: 128
- Window length (n_fft): 2048 samples
- Time advance between frames (hop size): 512 samples
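To make these parameters concrete, here is a minimal sketch of how such an extraction might look with librosa. It is not our exact extraction script: the function name `extract_mel_db`, the `offset` argument for locating the chorus, and the conversion to decibels are assumptions made for illustration. The frame arithmetic assumes librosa's default centered framing.

```python
SAMPLE_RATE = 22050      # Hz
N_FFT = 2048             # window length in samples
HOP_LENGTH = 512         # samples between successive frames
N_MELS = 128             # number of Mel bins
FMIN, FMAX = 300, 8000   # frequency range in Hz
CLIP_SECONDS = 10        # length of the chorus excerpt


def expected_frames(n_samples, hop_length=HOP_LENGTH):
    """Number of frames under librosa's default centered framing."""
    return 1 + n_samples // hop_length


def extract_mel_db(path, offset=0.0):
    """Return a (N_MELS, n_frames) Mel spectrogram in dB for one clip.

    `offset` (seconds) is a hypothetical way to point at the chorus;
    the dB scaling is also an assumption, not taken from the post.
    """
    import librosa  # imported lazily so the arithmetic below runs without it

    y, sr = librosa.load(path, sr=SAMPLE_RATE, offset=offset,
                         duration=CLIP_SECONDS)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=N_FFT, hop_length=HOP_LENGTH,
        n_mels=N_MELS, fmin=FMIN, fmax=FMAX)
    return librosa.power_to_db(mel)


n_samples = SAMPLE_RATE * CLIP_SECONDS
print(n_samples)                   # 220500 samples in a 10-second clip
print(expected_frames(n_samples))  # 431 frames
print(HOP_LENGTH / SAMPLE_RATE)    # hop between frames, in seconds
print(N_FFT / SAMPLE_RATE)         # window length, in seconds
```

Under these assumptions each clip becomes a 128 x 431 array, with frames spaced roughly 23 ms apart and each window covering roughly 93 ms of audio.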
Here are two examples of the extracted spectrograms. Different genres show different characteristics in their spectrograms, which we will discuss later in this post.
Our model aims to automatically learn timbral and temporal properties of the songs. First, the spectrogram inputs are passed through an inception network to extract features. These features then go through a residual network, whose output is fed to fully connected layers to predict a binary output. The architecture is illustrated in the bottom part of the picture below.
In the Inception Network, both timbral and temporal features are extracted in parallel using a set of convolutional filters of different shapes. Examples of vertical and horizontal filters are shown in the spectrogram picture above. As Lukthung songs usually contain Thai traditional instruments such as Khene and Phin, the vertical filters aim to detect unique timbral features of these instruments along the spectrogram’s frequency axis.
In parallel, we place horizontal filters along the time axis to learn temporal features such as tempo and rhythm. We apply these horizontal filters before mean-pooling along the frequency axis in order to preserve vibrato or coloratura, a sound that oscillates within a small range of frequencies over time and a distinctive trait of the Lukthung singing style. In the spectrogram on the left above, the wavy bright lines are the vibrato.
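The difference between the two filter orientations can be sketched in a few lines of NumPy. This is an illustration only, not our model code: it uses a single random, untrained kernel per branch instead of a deep learning framework, the kernel sizes (96 x 3 and 1 x 33) are hypothetical, max-pooling on the vertical branch is an assumption, and the input is random data shaped like one spectrogram (128 Mel bins, 431 frames assumed).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv_branch(spec, kernel, pool):
    """Filter a (n_mels, n_frames) array with one kernel, apply ReLU,
    then pool over the frequency axis to get a (1, n_frames) feature."""
    kh, kw = kernel.shape  # kw must be odd so the time axis keeps its length
    # Zero-pad the time axis only, so every frame gets an output value.
    padded = np.pad(spec, ((0, 0), (kw // 2, kw // 2)))
    # windows[f, t] is the (kh, kw) patch whose top-left corner is (f, t).
    windows = sliding_window_view(padded, (kh, kw))
    fmap = np.maximum(np.einsum("ftij,ij->ft", windows, kernel), 0.0)
    return pool(fmap, axis=0, keepdims=True)


rng = np.random.default_rng(0)
spec = rng.random((128, 431))            # stand-in for one Mel spectrogram

# Vertical filter: tall in frequency, narrow in time (timbre).
vertical = rng.normal(size=(96, 3))
# Horizontal filter: one Mel bin high, long in time (tempo, rhythm, vibrato).
horizontal = rng.normal(size=(1, 33))

timbral = conv_branch(spec, vertical, np.max)       # pooling choice assumed
temporal = conv_branch(spec, horizontal, np.mean)   # mean-pool over frequency

# Stack the two representations along the frequency axis.
features = np.concatenate([timbral, temporal], axis=0)
print(features.shape)  # (2, 431)
```

Because the horizontal kernel spans a single Mel bin, it follows how energy in each bin changes over time before the bins are averaged, which is why pooling has to come after the filter and not before it.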
The table below summarizes the filter sizes and the number of filters of each size used in the Inception Network.
After obtaining the timbral and temporal representations, we concatenate them along the frequency axis and pass them into the binary classification module, which predicts whether a given audio clip is Lukthung. The classification module consists of a residual network with three convolution layers followed by a fully connected network. Details of the residual network architecture can be found in the references of our paper. Briefly, a residual network is a stack of convolution layers with shortcut connections that bypass one or more layers and add a layer's input to its output. These bypass connections mitigate vanishing gradients and, importantly, make it easy for the model to learn identity functions, so that the higher layers can perform at least as well as the lower layers.
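The identity property is easy to see in a toy example. The sketch below is a simplification, not our classification module: it uses small dense layers in NumPy in place of convolution layers, and the function name `residual_block` and all sizes are made up for illustration.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """Return relu(x + F(x)), where F is two small dense layers."""
    residual = relu(x @ w1) @ w2   # F(x): the part the block has to learn
    return relu(x + residual)      # the shortcut adds the input back


rng = np.random.default_rng(0)
x = rng.random((4, 16))            # non-negative, like post-ReLU activations
w1 = rng.normal(size=(16, 16))
w2_zero = np.zeros((16, 16))       # last layer's weights set to zero

out = residual_block(x, w1, w2_zero)
print(np.allclose(out, x))         # True: the block passes its input through
```

With the last layer's weights at zero, the block reduces to the identity, so an extra block only has to learn a correction on top of what the lower layers already produce.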
Below are the results of this audio model, as well as those from the previous blog’s lyrics model and our audio baseline models.
Stay tuned for our next blog post on the model that combines both lyrics and audio models to achieve better performance!