In Thai music, Lukthung is a distinctive and popular genre. Some people listen to both Lukthung and mainstream genres such as Pop and Rock, while others listen only to Lukthung. Although music classification is a widely studied topic, classifying Lukthung songs is a particular challenge that we want to tackle at JOOX as we improve our music recommendation engine. The paper resulting from this research will be published at ICSEC 2019 and can be found here.
We will go through the paper in a series of blog posts. In this first part, we will explain the lyrics model used to classify Lukthung songs.
Our dataset consists of 10,547 Thai songs dating from 1985 to 2019. The figure below shows the number of songs labeled in each genre and era.
Since most Lukthung songs use a vocabulary that differs from other genres, especially words from the Isan dialect, we build a classification model based on lyrics. We construct word-based features using the entire lyrics, from the beginning to the end of each song. The lyrics are first tokenized using the deepcut library.
We build a simple bag-of-words (BoW) model. The vocabulary V is constructed from a larger set of roughly 85k songs (both unlabeled and labeled lyrics). The lyrics of each song are represented by a normalized bag-of-words vector a_i over the vocabulary V. Let c_{i,j} denote the number of occurrences of word j in the lyrics of song i. The normalized count of this word, a_{i,j}, is computed from c_{i,j} with a logarithmic transformation (see the full paper for the exact equation).
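To make the normalized bag-of-words vector concrete, here is a minimal sketch in plain Python. It assumes the lyrics are already tokenized, and it assumes the normalized count takes the form log(1 + c_{i,j}) divided by the largest such value within the song. That form is one common choice that matches the description here, but it is our assumption for illustration, so check the paper for the exact equation. The function names and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(tokenized_songs):
    """Map each distinct token to a column index (the vocabulary V)."""
    vocab = {}
    for tokens in tokenized_songs:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def normalized_bow(tokens, vocab):
    """Return the normalized bag-of-words vector a_i for one song.

    ASSUMPTION (not confirmed by the post):
        a_{i,j} = log(1 + c_{i,j}) / max_k log(1 + c_{i,k})
    """
    counts = Counter(t for t in tokens if t in vocab)  # c_{i,j}
    vector = [0.0] * len(vocab)
    if not counts:
        return vector  # no in-vocabulary words: all-zero vector
    max_log = max(math.log1p(c) for c in counts.values())
    for token, c in counts.items():
        vector[vocab[token]] = math.log1p(c) / max_log
    return vector


# Toy, already-tokenized lyrics (placeholders, not real data).
songs = [["rak", "rak", "rak", "jai"], ["jai", "ban", "na"]]
vocab = build_vocabulary(songs)
bow = normalized_bow(songs[0], vocab)
```

Under this assumed form, the most frequent word in a song always maps to 1, and the logarithm compresses the gap between a word that appears three times and one that appears once.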
The logarithmic transformation is applied to smooth the discrete count values. The BoW vector is then fed into a two-layer fully connected multi-layer perceptron (MLP). The input layer consists of |V| nodes, where |V| is the vocabulary size, followed by 100 hidden nodes in each intermediate layer before connecting to a single neuron in the output layer. We use rectified linear unit (ReLU) activation functions on the hidden nodes so that the model can learn non-linear mappings, and a sigmoid activation function on the output node to obtain the probability that a given song is Lukthung. The architecture of the model is shown below.
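As a rough sketch of this architecture, the forward pass can be written in plain Python as follows. It assumes two hidden layers of 100 ReLU units each, uses randomly initialized (untrained) weights, and uses a small placeholder vocabulary size. The helper names are hypothetical, and training (loss function, optimizer) is left out because it is not described in this post.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer):
    """Fully connected layer: one weighted sum plus bias per output node."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lukthung_probability(bow, layers):
    """|V| inputs -> 100 ReLU -> 100 ReLU -> 1 sigmoid output."""
    hidden1 = relu(dense(bow, layers[0]))
    hidden2 = relu(dense(hidden1, layers[1]))
    (logit,) = dense(hidden2, layers[2])  # single output neuron
    return sigmoid(logit)


rng = random.Random(0)
vocab_size = 50  # placeholder for |V|; the real vocabulary is far larger
layers = [
    init_layer(vocab_size, 100, rng),  # input  -> hidden 1
    init_layer(100, 100, rng),         # hidden 1 -> hidden 2 (assumed)
    init_layer(100, 1, rng),           # hidden 2 -> output
]
bow = [rng.random() for _ in range(vocab_size)]  # stand-in for a_i
prob = lukthung_probability(bow, layers)
```

Because the weights are untrained, `prob` is meaningless as a prediction. The sketch only shows how the layer shapes and activations fit together. In practice a deep learning framework would handle both this forward pass and the training.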
Because the data is imbalanced (Lukthung songs make up 20% of our dataset), we evaluated the models using precision, recall, and F1 scores. Below are the results of the lyrics model.
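To illustrate why these metrics suit imbalanced data better than accuracy, here is a small sketch with made-up labels (1 = Lukthung, 0 = other), where 2 of 10 songs are Lukthung. The labels and predictions are toy values for illustration only, not our results.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy data: 20% of the songs are Lukthung.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_other = [0] * 10  # a useless model that never predicts Lukthung

acc = accuracy(y_true, always_other)
metrics = precision_recall_f1(y_true, always_other)
```

Here the model that never predicts Lukthung still reaches 80% accuracy, while its precision, recall, and F1 for the Lukthung class are all zero, which is the failure that accuracy alone would hide.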
In the following posts, we will discuss the audio model, which uses audio spectrograms as input to train Inception and residual networks, as well as the final model that combines the lyrics model and the audio model. If you can’t wait, you can have a look at the full paper first!
Part 2 is now published here! Lukthung (music genre) Classification — Part 2