music2vec: Generating Vector Embeddings for Genre-Classification Task

Takk Learning - Ashish Bharadwaj Srinivasa, Rajat Hebbar

Nov 28, 2017

It has become very common in recent years to use vector embeddings as a form of feature representation for text-based information retrieval models. The aim of our project was to obtain a similar vector representation for music segments, using genre classification as the end task. We hope to capture the structural and stylistic information of the music in this low-dimensional space. Multiple methods have been attempted to this effect, using architectures such as DNN-HMMs, CRNNs, LSTMs, etc. Usually, vector embeddings are obtained as the latent descriptor of a specific classification task. Since the genre of a song represents its broad stylistic nature, we aim to learn music embeddings through the task of classifying genre.

Dataset

The Free Music Archive (FMA) is a recently released free dataset consisting of a large collection of 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length, high-quality audio in .mp3 format, pre-computed features, and track- and user-level metadata, tags, and free-form text such as biographies. Along with the full set, there is also a smaller, class-balanced subset of 30s audio segments from 8,000 songs, spread evenly across the 8 genres shown below. This subset is computationally much more feasible to work with from a learning standpoint than the full-length songs, and is what we use for our experiments.

Fig. 1: Genres included in Training Data-Set

Since a 30s segment of a song is too long to process as a single sample at 44.1kHz, we split each song into 20 overlapping 3-second segments. This results in a total of approximately 158,200 samples after discarding faulty segments. These samples are then split into training, validation and test sets in the ratio 85:5:10.
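A minimal sketch of this segmentation step, assuming the hop size (not stated above) is chosen so that 20 windows of 3s tile the 30s clip:

```python
import numpy as np

def split_into_segments(y, sr=44100, n_segments=20, seg_duration=3.0):
    """Split one clip into overlapping fixed-length windows."""
    seg_len = int(seg_duration * sr)
    # hop chosen so the last window ends at the end of the clip (assumption)
    hop = (len(y) - seg_len) // (n_segments - 1)
    return np.stack([y[i * hop: i * hop + seg_len] for i in range(n_segments)])

# segments = split_into_segments(audio_30s)   # shape: (20, 132300)
```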

Spectrograms as input features

Spectrograms are essentially 2-D time-frequency plots, typically used to visualize harmonic structure in audio. Mel spectrograms are a compressed version of spectrograms that are computationally cheaper to process. An example of a log-mel spectrogram is visualized below:

Fig. 2: Mel Spectrogram for Electronic Music

Here, the x-axis denotes time and the y-axis denotes frequency. For our experiments, we re-sample all input data to 44.1kHz. This is necessary to preserve generality in our model, since the dataset includes audio sampled at both 44.1kHz and 48kHz. We then extract the spectrogram for each 3s segment, which results in a spectrogram of dimension (120, 300); this serves as the input feature for our spectrogram-based architectures. Of the many experiments that we ran, we present the three most representative architectures and their results:
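Before moving to the architectures, here is a rough sketch of the feature-extraction step using librosa. The 120 mel bands match the (120, 300) input above; the FFT size and hop length are assumptions chosen so that a 3s clip at 44.1kHz yields roughly 300 frames:

```python
import numpy as np
import librosa

def log_mel_spectrogram(segment, sr=44100, n_mels=120, n_fft=2048, hop_length=441):
    """Compute a log-scaled mel spectrogram for one 3 s segment."""
    mel = librosa.feature.melspectrogram(
        y=segment, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)   # compress dynamic range
    return log_mel[:, :300]                          # trim to a fixed (120, 300) shape

# y, sr = librosa.load('track.mp3', sr=44100)
# feat = log_mel_spectrogram(y[:3 * 44100])
```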

Establishing a baseline…

Since the dataset is quite recent, there has been no published work on it. Hence, we deemed it necessary to establish a recurrent baseline. For this purpose, we treat the spectrogram as a time series of dimension 120 and feed it into a single LSTM layer with 256 cells. The output of the LSTM layer is fed into a fully-connected layer with 64 neurons, which in turn feeds into a softmax output layer with 8 nodes, one for each of our 8 classes. The layers used are depicted in the figure below:

Fig. 3: Baseline Model
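A minimal sketch of this baseline in Keras (the framework, optimizer and loss are assumptions; the layer sizes follow the description above):

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(300, 120))             # 300 time-steps of 120-d frames
x = layers.LSTM(256)(inp)                        # single LSTM layer with 256 cells
x = layers.Dense(64, activation='relu')(x)       # fully connected layer
out = layers.Dense(8, activation='softmax')(x)   # one output node per genre
baseline = models.Model(inp, out)
baseline.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
```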

As expected, the baseline is an over-simplified model for an 8-class classification task, and as such doesn't perform particularly well, reaching a peak training accuracy of 52% and a test-set accuracy of 43%. However, it does beat the chance baseline of 12.5% by a clear margin.

Architecture I : Conv-LSTM-DNN (CLDNN)

In this first approach, we use a Convolutional-LSTM-DNN (CLDNN) network. The convolutional part consists of 3 convolutional layers and 2 max-pool layers. The convolutional layers have rectangular filters of size [2, 1], so that they convolve only along frequency; the stride of the max-pool layers is likewise only along the frequency dimension. The output of the convolutional segment has shape [300 x 30 x 4], which is reshaped to [300, 120] so as to retain the sequence structure of our input feature. This output is fed into a stacked-LSTM layer consisting of two LSTMs with hidden dimension 120. The output from the last time step of the stacked LSTM is then fed into two fully connected layers of 64 and 8 nodes respectively. The architecture is depicted below:

Fig. 4: CLDNN Model
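A hedged Keras sketch of the CLDNN. The intermediate filter counts are assumptions; the frequency-only [2, 1] filters, the final 4-filter output, the [300, 120] reshape and the LSTM/dense sizes follow the description above:

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(300, 120, 1))          # (time, mel bands, channel)
x = layers.Conv2D(8, (1, 2), padding='same', activation='relu')(inp)  # convolve in frequency only
x = layers.MaxPool2D(pool_size=(1, 2))(x)        # pool along frequency only
x = layers.Conv2D(8, (1, 2), padding='same', activation='relu')(x)
x = layers.MaxPool2D(pool_size=(1, 2))(x)
x = layers.Conv2D(4, (1, 2), padding='same', activation='relu')(x)    # -> (300, 30, 4)
x = layers.Reshape((300, 120))(x)                # keep the 300-step sequence structure
x = layers.LSTM(120, return_sequences=True)(x)   # stacked LSTM, hidden size 120
x = layers.LSTM(120)(x)                          # last time-step output
x = layers.Dense(64, activation='relu')(x)
out = layers.Dense(8, activation='softmax')(x)
cldnn = models.Model(inp, out)
cldnn.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```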

The CLDNN gives us the best-performing model among those using spectrogram features. However, due to its high complexity, with stacked LSTM layers, it is also one of the slowest to train.

Architecture II : Joint Auto-Encoder with supervised Genre Classifier

For the second approach, we treat this as a multi-task problem, jointly training an auto-encoder and a genre classifier. The encoder segment is shared by both tasks and consists of 4 conv + max-pool layers, which bring the input dimension down to [10, 25] with 128 filters. This is fed into a dense layer of 250 units, which acts as our latent descriptor. The decoder then reverses the operations performed by the encoder, and the reconstruction loss is computed as the mean squared error between each pixel of the reconstructed and original spectrograms. The genre classifier consists of 3 fully connected layers on top of the latent descriptor and is trained to minimize the genre-classification loss. The architecture is depicted below:

Fig. 5: Joint Auto-Encoder Genre Classifier
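A rough Keras sketch of the joint model. Only the shared encoder, the 250-unit latent descriptor, the 3-layer classifier and the two losses follow the description above; the filter counts and the simplified dense decoder are assumptions:

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(120, 300, 1))
x = inp
for n_filters in (16, 32, 64, 128):              # 4 conv + max-pool blocks (filter counts assumed)
    x = layers.Conv2D(n_filters, 3, padding='same', activation='relu')(x)
    x = layers.MaxPool2D(2, padding='same')(x)
latent = layers.Dense(250, activation='relu')(layers.Flatten()(x))  # shared latent descriptor

# Decoder branch: reconstruct the input spectrogram (simplified here to a dense projection)
recon = layers.Dense(120 * 300, activation='linear')(latent)
recon = layers.Reshape((120, 300, 1), name='reconstruction')(recon)

# Classifier branch: 3 fully connected layers on the latent descriptor
y = layers.Dense(128, activation='relu')(latent)
y = layers.Dense(64, activation='relu')(y)
genre = layers.Dense(8, activation='softmax', name='genre')(y)

joint = models.Model(inp, [recon, genre])
joint.compile(optimizer='adam',
              loss={'reconstruction': 'mse',
                    'genre': 'sparse_categorical_crossentropy'},
              loss_weights={'reconstruction': 1.0, 'genre': 1.0})
```

The loss_weights values are placeholders; in practice they control the trade-off between faithful reconstruction and genre discrimination.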

In spite of the complex architecture employed, the model overfits the training data, reaching a training-set accuracy of 97% but a test-set accuracy of only 50%. Adding Dropout to the FC layers improves the test-set performance only marginally, to 52%.

Shifting to raw audio

From the previous section, it is clear that even sufficiently complex architectures fail to improve much on the fairly simple baseline. This is an indication that our features are a bottleneck for the task at hand. Since most recent advances in audio-based deep learning, such as WaveNet and SoundNet, operate directly on the raw waveform, we decided to briefly delve into raw-audio based models.

For this purpose, we downsample the raw audio from the original sampling rate of 44.1kHz to 8kHz and, as before, split each 30s segment into 20 overlapping segments of 3.75s each. This results in 1-D features of length 30,000, which serve as the input to our SoundNet-based architectures. Once again, we present three of our most representative experiments and their results:
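First, a minimal sketch of this raw-audio preprocessing with librosa (the file path and hop computation are placeholders; librosa resamples to 8kHz on load):

```python
import librosa

y, sr = librosa.load('track.mp3', sr=8000)        # resample 44.1 kHz -> 8 kHz on load
seg_len = int(3.75 * sr)                          # 30,000 samples per window
hop = (len(y) - seg_len) // 19                    # fit 20 overlapping windows into the clip
segments = [y[i * hop: i * hop + seg_len] for i in range(20)]
```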

Architecture I : SoundNet based Convolutional Architecture

In our first method, we emulate the architecture of SoundNet, with minor modifications to suit our input dimension. The SoundNet architecture typically consists of two conv + max-pool layers, followed by 3 conv layers, a max-pool layer, and finally 3 more conv layers. The convolutional blocks have an increasing number of filters and, inversely, decreasing filter sizes as the depth of the model increases. For our model, we retain most of this architecture with minor changes in filter size. The first two convolutional layers each have stride 8 so as to bring the input dimension down to a reasonable scale, and we reduce the final convolutional block to 2 conv layers. The output of this so-called SoundNet block (of dimension 6 x 1024) is then flattened and fed into three densely connected layers with 512, 128 and finally 8 nodes respectively.

The final architecture employed has been shown in the figure:

Fig. 6: SoundNet + DNN
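A rough Keras sketch of this model. The kernel sizes, filter counts, pool sizes and later strides are assumptions, so the pre-flatten feature map only approximates the stated 6 x 1024; the stride-8 front end and the 512/128/8 dense head follow the description above:

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(30000, 1))
x = layers.Conv1D(16, 64, strides=8, padding='same', activation='relu')(inp)  # 30000 -> 3750
x = layers.MaxPool1D(2)(x)                                                    # -> 1875
x = layers.Conv1D(32, 32, strides=8, padding='same', activation='relu')(x)    # -> 235
x = layers.MaxPool1D(2)(x)                                                    # -> 117
x = layers.Conv1D(64, 16, padding='same', activation='relu')(x)
x = layers.Conv1D(128, 8, padding='same', activation='relu')(x)
x = layers.Conv1D(256, 8, padding='same', activation='relu')(x)
x = layers.MaxPool1D(4)(x)                                                    # -> 29
x = layers.Conv1D(512, 4, strides=2, padding='same', activation='relu')(x)    # -> 15
x = layers.Conv1D(1024, 4, strides=2, padding='same', activation='relu')(x)   # -> 8
x = layers.Flatten()(x)
x = layers.Dense(512, activation='relu')(x)
x = layers.Dense(128, activation='relu')(x)
out = layers.Dense(8, activation='softmax')(x)
soundnet_dnn = models.Model(inp, out)
soundnet_dnn.compile(optimizer='adam',
                     loss='sparse_categorical_crossentropy',
                     metrics=['accuracy'])
```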

The SoundNet-based architecture showed a stark improvement over the spectrogram-based methods, with a training accuracy of 76% and a test-set accuracy of 62%. This implies that the SoundNet-based model generalizes much better than our previous attempts.

Architecture II : SoundNet + Stacked-LSTM

In order to add some temporal context to our model, we introduce a stacked-LSTM layer into the network. For this, we remove the final convolutional layer, which changes the output dimension of the SoundNet block to 15 x 512. We treat this as a time sequence of 15 steps of dimension 512 and feed it to a 2-layer stacked LSTM with 256 cells per layer. The rest of the model follows our previous architecture.

Fig. 7: SoundNet + Stacked-LSTM
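A sketch of the recurrent head, assuming a SoundNet-style block (as in the previous sketch, minus its final conv layer) that outputs a (15, 512) feature map; the dense size after the LSTM is an assumption:

```python
from tensorflow.keras import layers, models

features = layers.Input(shape=(15, 512))              # 15 time-steps x 512 channels
x = layers.LSTM(256, return_sequences=True)(features) # 2-layer stacked LSTM, 256 cells each
x = layers.LSTM(256)(x)
x = layers.Dense(128, activation='relu')(x)
out = layers.Dense(8, activation='softmax')(x)
lstm_head = models.Model(features, out)
# In the full model, this head replaces the Flatten + Dense stack of Architecture I.
```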

We expected the addition of temporal context to improve our results, and it does: both the training and test accuracies are boosted by the LSTM layers, to 81% and 66% respectively.

Architecture III : Deeper SoundNet + Stacked-LSTM

To better understand the architectures used in I and II, we modify the model so as to bring down the input dimensionality more gradually: instead of using stride 8 for each of the first 2 convolutional layers, we replace each of them with 3 conv layers of stride 2. This keeps the dimensionality of the SoundNet block's output the same as in Fig. 7 for Architecture II, while deepening the convolutional model. However, this doesn't perform as well as the first two architectures.
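A small sketch of this modification, assuming the same Keras setup as before; filter counts and kernel sizes are placeholders:

```python
from tensorflow.keras import layers

def strided_block(x, filters, kernel_size):
    """Three stride-2 convs downsample by 2^3 = 8, matching one stride-8 conv."""
    for _ in range(3):
        x = layers.Conv1D(filters, kernel_size, strides=2,
                          padding='same', activation='relu')(x)
    return x
```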

Results and Future Possibilities

As seen from the table, the SoundNet-based Conv + LSTM architecture on raw audio obtained the highest accuracy. In general, raw-audio based models performed better than spectrogram-based models. The SoundNet-based methods also take less time to train, since the amount of data fed to the LSTMs is drastically reduced. One can also observe that the deeper architecture seems to underfit, as is evident from the last entry in the table. To avoid this, we could adopt a residual-network based architecture.

Fig. 8: Tabulated Results

Future goals include training models for other labels present in the metadata, such as popularity and country. This would allow us to encode music into a general-purpose vector embedding that can be used for a wide range of tasks. Additionally, we also need to train the models on the complete dataset of ~100,000 songs.

References:

  1. Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson: FMA: A Dataset for Music Analysis.
  2. Yusuf Aytar, Carl Vondrick, Antonio Torralba: SoundNet: Learning Sound Representations from Unlabeled Video.
  3. Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu: WaveNet: A Generative Model for Raw Audio.
