How I taught a neural network to understand similarities in music audio

Sylvester Cardorelle
7 min read · Aug 6, 2018


The automated curation of music playlists has become a significant challenge over the last decade with the rise of colossal streaming platforms. Current state-of-the-art recommender systems rely on collaborative filtering. However, these systems suffer from the cold start problem: they break down when no historical listening data is available, and as a result they cannot recommend new or unpopular songs.

Photo by Mohammad Metri

Over the last year, I worked on a content-based music recommendation system using deep learning (in Python). My aim was to recommend songs to a user without popularity bias, based solely on the musical qualities within the audio, thereby avoiding the cold start problem. This project was inspired by the work of Sander Dieleman, who demonstrated the power of using a CNN to create playlists based on a query song.

This system uses a Convolutional Neural Network (CNN) to classify songs into different genres based on the audio signal. The genre probability distributions it outputs are then used to create a high-dimensional genre space; similar songs can then be found in this space to generate playlists for the listener.

Recommender system architecture

The Mel-Spectrogram — The king of spectrograms

A first step towards better understanding an audio signal is to transform it so that its features are more accessible for subsequent processing steps. A popular feature representation in MIR (Music Information Retrieval) is the mel-spectrogram, a time-frequency representation of sound. Unlike a raw spectrogram, it uses a (roughly logarithmic) mel scale for the frequency axis. This reduces the dimensionality and represents the audio closer to how humans perceive sound (which is awesome).

This mel-spectrogram represents the first 30 seconds of Joey Bada$$ — World Domination

The orange dots represent peaks in power over time (horizontal axis) and frequency (vertical axis). As this is a Hip-Hop song, the drum loop naturally occupies the lower frequencies, around 100Hz-500Hz, which explains the interesting pattern in the lower region. We can also see a pause around the 19th second, causing a drop in power.
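To make this concrete, here is a minimal sketch of how a mel-spectrogram like the one above can be computed with Librosa; the file name and parameter values are illustrative, not the exact settings used in the project.

```python
import numpy as np
import librosa

# Load the first 30 seconds of a track (file path is a placeholder).
y, sr = librosa.load("world_domination.mp3", duration=30.0)

# Power mel-spectrogram: short-time Fourier transform, then projection onto 128 mel bands.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)

# Convert power to decibels, which better matches how we perceive loudness.
S_db = librosa.power_to_db(S, ref=np.max)

print(S_db.shape)  # (n_mels, time_frames), roughly (128, 1292) for 30 s at 22.05 kHz
```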

Convolutional Neural Network — The engine

Normally used for image recognition, CNNs have been used in recent music recommendation systems to extract features from spectrograms. In the proposed system, the CNN is used to predict the genre of a song. This prediction is expressed as a probability distribution in the softmax layer.

For example, in a 2-genre classification a given song could be predicted 60.3% Hip-Hop and 39.7% Classical.

CNN architecture built with Tensorflow
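The figure above shows the actual architecture. As a rough sketch of the kind of model involved, here is a small Keras CNN that maps a mel-spectrogram to a softmax distribution over genres; the layer sizes and input shape are assumptions, not the project's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_GENRES = 3  # e.g. Hip-Hop, Rock, Classical

# Input: one mel-spectrogram treated as a single-channel image
# (128 mel bands x 1292 time frames for a 30-second clip).
model = models.Sequential([
    layers.Input(shape=(128, 1292, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 4)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 4)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    # Softmax layer: a probability distribution over the genres.
    layers.Dense(NUM_GENRES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```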

Dataset and training the CNN

Two datasets were used: GTZAN and the Free Music Archive (FMA). The FMA contains 106,574 songs across 161 genres. The full dataset boasts a size of 917GB (lol), so due to storage constraints a subset of 8,000 songs (30-second samples) was used instead. The FMA dataset was used to train the CNN, while the GTZAN dataset was used to evaluate the CNN’s accuracy on the genre classification task.

A significant challenge I faced was constructing a suitable dataset. CNNs require a large amount of data to train their parameters, so a large collection of audio tracks was needed for the training phase.
However, it is very difficult to find large-scale datasets of music audio because of the copyright and licensing restrictions on freely distributable music.

The situation is improving with the recent release of the FMA dataset, which contains 106,574 songs, but this still pales in comparison to the industry-scale catalogues used by Spotify and Apple Music. Most of the available datasets are created by volunteers and researchers in the MIR field, so there is little support when problems arise with them. For example, human error can lead to mislabelled song genres, which can be hard to detect.

Playlist Generation

Cocoa Butter Kisses by Chance the Rapper is a popular Hip-Hop song with gospel influences. The pie chart below visualises the song’s genre composition after it was processed by the system. The CNN correctly predicted the song’s genre, with Hip-Hop receiving the highest probability score of 70%. It was also able to identify other genre elements, such as Classical (possibly due to the gospel influences). The CNN did not think there was much Rock in the song, which explains the low probability score of 5%.

The genre composition of Cocoa Butter Kisses

The probabilities output by the CNN are then used to create a high-dimensional genre space, in which the query song is plotted against a library of songs from the dataset. A playlist generation algorithm then uses a similarity metric (Euclidean distance) to find and rank the top 10 most similar songs, which are output to the user. An example playlist within the genre space can be seen in the plot below for the query song Cocoa Butter Kisses.
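As a minimal sketch of this step (with made-up genre vectors and song titles, not the project's actual data), the nearest-neighbour search in the genre space boils down to:

```python
import numpy as np

def generate_playlist(query_vec, library_vecs, library_titles, k=10):
    """Rank library songs by Euclidean distance to the query song's genre vector.

    query_vec: genre probability vector of the query song, e.g. [0.70, 0.25, 0.05]
    library_vecs: (n_songs, n_genres) array of genre vectors output by the CNN
    library_titles: list of n_songs track names
    """
    dists = np.linalg.norm(library_vecs - query_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    return [(library_titles[i], float(dists[i])) for i in nearest]

# Toy example (genre order: Hip-Hop, Classical, Rock).
library = np.array([[0.65, 0.30, 0.05],
                    [0.10, 0.85, 0.05],
                    [0.20, 0.10, 0.70]])
titles = ["Song A", "Song B", "Song C"]
print(generate_playlist(np.array([0.70, 0.25, 0.05]), library, titles, k=2))
```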

The resulting playlist is interesting because 3 of the 4 closest songs were also made by the same artist, Chance the Rapper. This suggests that the CNN could be identifying higher-level features such as vocal style and range from the data. One of the songs on the playlist was Santa on tour by Dee Yan-key, a jazz instrumental track. Surprisingly, the track has a similar ambience and slow-paced drums to Cocoa Butter Kisses.

This highlights the system’s ability to recommend similar songs regardless of genre!

Results/Evaluation

The CNN model achieved 75% accuracy on 3-genre classification (Hip-Hop, Rock and Classical). By treating genres as a meaningful measure of music rather than simple categories, the genre space offers a new way to find similar songs.

Evaluating the system’s performance was one of the more difficult parts of the project, especially when dealing with music, probably the most subjective art form known today. Although many automatic playlist generation algorithms have been proposed over the years, there is currently no standard evaluation procedure.

Current quantitative evaluation schemes either reduce the problem to an information retrieval setting or rely on simplifying assumptions that may not hold in practice. The Semantic Cohesion test is one such scheme: it uses the frequency of metadata co-occurrence (e.g. songs by the same artist or of the same genre) or the entropy of the distribution of genres within a given playlist. Under this view, the higher the semantic cohesion, the ‘better’ the playlist.
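For illustration, the entropy side of such a cohesion measure could be computed along these lines; this is a sketch, not the exact formulation used in the literature.

```python
import math
from collections import Counter

def genre_entropy(playlist_genres):
    """Shannon entropy (in bits) of the genre distribution within a playlist.
    Lower entropy means a more cohesive playlist under the Semantic Cohesion view."""
    counts = Counter(playlist_genres)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Example: a 10-song playlist that is mostly Hip-Hop.
print(genre_entropy(["Hip-Hop"] * 7 + ["Classical"] * 2 + ["Rock"]))  # ~1.16 bits
```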

This assumption is truly unrealistic, as songs generally map to multiple meta-data tags. Assigning each song to precisely one semantic description discards a great deal of information. A more fundamental flaw lies in the assumption that cohesion accurately characterises playlist quality. In reality, this assumption is rarely justified, and evidence suggests that users often prefer playlists that are diverse in style and genre.

Therefore, I decided to evaluate the system qualitatively with a human evaluation study. Each subject was asked to choose a query song, constrained to the genres the CNN was able to predict (Hip-Hop, Classical and Rock). A playlist was then generated from each subject’s chosen song for them to listen to. They were then each given a survey to fill in so they could evaluate the playlist. The results showed that the model produces sensible recommendations.

The system was built using Python, Librosa and TensorFlow.

Fork me on GitHub!

Conclusion/Future Work

In conclusion, this project demonstrates how deep learning can be applied to music recommendation to tackle the cold start problem. It creates a recommendation model that uses only the audio signal, without relying on historical usage data the way collaborative filtering does. Furthermore, the system not only has potential for stand-alone use but could also enhance a hybrid music recommendation system.

A final thought for future work would be to use the deep learning component in a completely different manner: instead of the system choosing songs to recommend, it could generate the music itself. Recent studies in MIR have attempted to generate music using deep learning, and as progress is made in this area, it could lead to a revolution in music production.

Photo by Mike Giles

Thank you for reading!

If you have any questions, please leave a comment!
