Topic Modelling Song Lyrics from the Million Song Dataset

We are currently developing a genre classification method and investigating which musical features contribute most strongly to defining a song’s genre. Part of this initiative involved looking at the lyrics of each song, to investigate a potential relationship between lyrical themes and genre. The dataset used was the subset of the Million Song Dataset (MSD), further narrowed to include only songs that also appear in both the Tagtraum dataset and the MSD lyrics subset.

A topic model was trained on the lyrics of all the songs in our dataset using Mallet’s “train-topics” tool. Several models were created with different numbers of topics. We finally settled on 50 topics, as this model was found to lack any “grab-bag” topics. However, this number still has room for optimization. One of the files output by Mallet, which was titled “song_topic_data.txt”, lists each song in our dataset along with the weight of each topic for that song. This file was then modified to include each song’s track ID (TID), artist name, song name, and given genre. The Python script used for this modification also changed how the weight information is stored: in the new file, the weights are sorted in decreasing order and each one is explicitly labelled with its topic number.
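The reshaping step described above can be sketched in Python roughly as follows. This is a minimal illustration, not the script actually used: it assumes the Mallet output is tab-separated with a document index, a document name (taken here to be the track ID), and then one weight per topic in topic order. The function names, the `metadata` lookup, and the `topic:weight` output layout are all hypothetical.

```python
def parse_doc_topics_line(line):
    """Parse one row: index, name, then one weight per topic (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    doc_name = fields[1]
    weights = [float(w) for w in fields[2:]]
    return doc_name, weights


def rank_topics(weights):
    """Return (topic_number, weight) pairs sorted by decreasing weight."""
    return sorted(enumerate(weights), key=lambda pair: pair[1], reverse=True)


def format_row(tid, artist, title, genre, ranked):
    """Build one tab-separated row with explicit topic numbers."""
    topic_fields = [f"{topic}:{weight:.4f}" for topic, weight in ranked]
    return "\t".join([tid, artist, title, genre] + topic_fields)


# Hypothetical metadata lookup keyed by track ID (made-up values).
metadata = {"TR0001": ("Some Artist", "Some Song", "Rock")}

# A made-up three-topic row standing in for one line of the Mallet output.
line = "0\tTR0001\t0.10\t0.70\t0.20\n"
tid, weights = parse_doc_topics_line(line)
ranked = rank_topics(weights)
artist, title, genre = metadata[tid]
row = format_row(tid, artist, title, genre, ranked)
print(row)
```

Keeping the topic number next to each weight means the ranking can be read directly from a row, without having to count columns to recover which topic a weight belongs to.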

After this cleaned file was created, it was used to look for a relationship between genre and topic number. A simplistic method was used, in which only each song’s first genre and first (largest-weighted) topic were considered. Further research into such a relationship should take a more nuanced approach to topics. For example, looking at all topics whose weight for a specific song exceeds a certain threshold, in relation to that song’s given genre(s), would probably provide more insight. Our analysis showed χ² = 315182, leading to a reduced χ² of 249, and p = 0.0. A p-value this small suggests that a song’s main topic and its genre are not independent.
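The post does not say how the statistic reported above was computed. One common choice for this kind of question is Pearson’s chi-squared test of independence on a table of genre against main topic, and the sketch below illustrates that test under that assumption. The function name and the toy data are hypothetical; only the statistic and degrees of freedom are computed here.

```python
from collections import Counter


def chi_squared_independence(pairs):
    """Pearson chi-squared statistic and degrees of freedom for (genre, topic) pairs."""
    pairs = list(pairs)
    total = len(pairs)
    observed = Counter(pairs)
    genre_totals = Counter(genre for genre, _ in pairs)
    topic_totals = Counter(topic for _, topic in pairs)

    statistic = 0.0
    for genre, genre_count in genre_totals.items():
        for topic, topic_count in topic_totals.items():
            # Expected count if genre and topic were independent.
            expected = genre_count * topic_count / total
            statistic += (observed[(genre, topic)] - expected) ** 2 / expected

    dof = (len(genre_totals) - 1) * (len(topic_totals) - 1)
    return statistic, dof


# Made-up example: two genres, two main topics, 80 songs in total.
toy_pairs = (
    [("Rock", 0)] * 30 + [("Rock", 1)] * 10
    + [("Latin", 0)] * 10 + [("Latin", 1)] * 30
)
statistic, dof = chi_squared_independence(toy_pairs)
reduced = statistic / dof  # reduced chi-squared: statistic per degree of freedom
print(statistic, dof, reduced)
```

A p-value would come from the upper tail of the chi-squared distribution with `dof` degrees of freedom, for example via `scipy.stats.chi2.sf(statistic, dof)`. For very large statistics that tail probability can underflow to 0.0 in floating point, which is one plausible reading of a reported p = 0.0.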

Fig. 1. This image shows the percentage of the main topic for each genre.
Fig. 2. This image plots the songs in the most relevant topic along with all the songs in our dataset for a specific genre.

The bar graph above shows, for each genre, the percentage of songs in that genre whose main topic is the genre’s most common topic. The most common topic number for each genre is shown on top of its bar. As we can see, four genres (Pop, Electronic, Rock, and Blues) all have topic 0 as their most common topic, with only about 10% of songs from each genre falling into it. These genres therefore most likely contain a wide variety of lyrics, and topic 0 most likely captures general ideas found in many types of songs. With a larger sample size, this could point to a lyrical relationship between those genres. By contrast, for New Age, World, and Latin music, a large percentage of each genre’s songs fall into a single topic. This suggests that songs within each of these genres are likely to be lyrically similar, based on our dataset.
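The per-genre percentages described above can be computed along these lines. This is a sketch only, assuming each song has already been reduced to a (first genre, main topic) pair; the function name and the toy data are hypothetical and are not the figures behind the graph.

```python
from collections import Counter, defaultdict


def top_topic_share(pairs):
    """For each genre, return (most common topic, percentage of songs with it)."""
    topics_by_genre = defaultdict(Counter)
    for genre, topic in pairs:
        topics_by_genre[genre][topic] += 1

    shares = {}
    for genre, counts in topics_by_genre.items():
        topic, count = counts.most_common(1)[0]
        shares[genre] = (topic, 100.0 * count / sum(counts.values()))
    return shares


# Made-up example: Rock is spread thinly over topics, Latin is concentrated.
toy_pairs = (
    [("Rock", t) for t in [0, 0, 1, 2, 3, 4, 5, 6, 7, 8]]
    + [("Latin", 12)] * 6 + [("Latin", 0)] * 4
)
shares = top_topic_share(toy_pairs)
for genre, (topic, percent) in shares.items():
    print(f"{genre}: topic {topic} ({percent:.0f}% of songs)")
```

A low percentage, as for Rock in this toy data, corresponds to a genre whose songs are spread across many topics, while a high percentage corresponds to a genre dominated by one topic.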

Works cited:

[1] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

[2] Hendrik Schreiber. Improving genre annotations for the million song dataset. In Proceedings of the 16th International Conference on Music Information Retrieval (ISMIR), pages 241–247, 2015.

[3] musiXmatch dataset, the official lyrics collection for the Million Song Dataset, available at:

[4] McCallum, Andrew Kachites. “MALLET: A Machine Learning for Language Toolkit.” 2002.

Modeling Music

Machine learning and music analysis


Written by Paul Salminen
