For years now I’ve been heavily involved in local live music. I go to shows, perform in shows, and have even booked and organized a lot of shows. To pull together my live music and data science interests, I thought it would be interesting to take a data science approach to helping people discover live music in their area.
The result uses unsupervised learning techniques, audio feature extraction with the LibROSA Python library, and both the Spotify and Songkick APIs to generate a playlist of songs by artists with upcoming shows in the user’s city, based on the user’s favorite artists.
For my training data, I revisit the Pitchfork top 200 albums of the 2010s list, web-scraping a list of the artists it features using Selenium and BeautifulSoup. I then use the Spotify API to get each artist’s top tracks and the corresponding pre-defined audio features.
I then use Selenium and YouTubeDL to scrape the mp3s for each song in the dataset and use LibROSA to extract audio features from the mp3s, namely the Mel Frequency Cepstral Coefficients (MFCCs).
MFCCs are frequently used features for problems like speech recognition. They allow the spectral content of a sound to be boiled down to evenly spaced frequency bands on the mel scale, which is intended to represent human auditory perception better than a linear frequency scale does.
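To make the mel scale a bit more concrete, here is a small sketch of one widely used Hz-to-mel conversion, the 2595 * log10(1 + f / 700) form. It is an illustration only: the function names are made up for this example, and LibROSA's default conversion uses a slightly different (Slaney-style) formula.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (2595 * log10(1 + f / 700) form)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Edges (in Hz) of n_bands bands that are evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / n_bands
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 1)]

# Ten bands between 0 Hz and 8 kHz: equal width in mels, widening in Hz.
edges = mel_band_edges(0.0, 8000.0, 10)
```

Bands that are equally wide in mels get progressively wider in Hz, so low frequencies are resolved more finely than high ones.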
Further following the lead of speech recognition research, I drop all coefficients except for coefficients 2–13. These coefficients exist for each frame of an audio file, so to get my MFCC features to be song-wise instead of frame-wise, I aggregate them by taking the mean and variance of each coefficient within a 30-second clip of each song. Ultimately, that leaves me with 24 MFCC features for each song in my dataset.
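The aggregation step can be sketched roughly as follows. This is not the exact code from my repository: the function names, the clip offset, and the choice of `n_mfcc=13` are assumptions for illustration, and I assume row 0 of the MFCC matrix is the first coefficient, so rows 1 through 12 correspond to coefficients 2–13.

```python
import numpy as np

def load_mfcc_frames(mp3_path, offset_s=0.0, clip_s=30.0, n_mfcc=13):
    """Frame-wise MFCCs for one clip, shape (n_mfcc, n_frames)."""
    import librosa  # imported lazily so the aggregation below runs without it
    y, sr = librosa.load(mp3_path, offset=offset_s, duration=clip_s)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def song_level_features(mfcc_frames):
    """Keep coefficients 2-13 and return their means followed by their variances."""
    # Rows 1..12 are coefficients 2-13 (assumes row 0 is coefficient 1).
    kept = np.asarray(mfcc_frames, dtype=float)[1:13]
    return np.concatenate([kept.mean(axis=1), kept.var(axis=1)])

# Synthetic stand-in for real frames, just to show the output shape.
fake_frames = np.random.default_rng(0).normal(size=(13, 1292))
features = song_level_features(fake_frames)  # 12 means + 12 variances
```

With a real file, `song_level_features(load_mfcc_frames("song.mp3"))` would produce the same 24-element vector, where `"song.mp3"` is a placeholder path.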
For a deeper discussion of MFCCs, check out this article… http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
In an attempt to extract some genre information from my MFCCs, I apply k-means clustering with four clusters. First, to avoid some of the issues associated with distance-based algorithms and high dimensionality, I apply PCA dimensionality reduction to my MFCC features. I end up with two PCA components that explain roughly 13% of the variance in the 24 MFCC features. Clustering on these components, I assign each song in my dataset to one of four clusters.
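As a rough sketch of this step, the snippet below reduces a 24-column feature table to two principal components and then runs k-means with four clusters. To keep it dependency-light it uses plain NumPy (SVD-based PCA and Lloyd's algorithm) rather than any particular library implementation, and it runs on synthetic data. Standardizing the features first, the random seed, and all function names are assumptions made for the example.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project standardized features onto the top principal components."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0                     # guard against constant columns
    Z = (X - X.mean(axis=0)) / std          # standardizing is an assumption
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)   # variance ratio per component
    return Z @ Vt[:n_components].T, explained[:n_components]

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k=4, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Recompute each centroid; keep the old one if its cluster is empty.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign(points, centroids), centroids

# Synthetic stand-in for the song-by-MFCC-feature table.
X = np.random.default_rng(42).normal(size=(200, 24))
components, explained = pca_project(X, n_components=2)
labels, centroids = kmeans(components, k=4)
```

Keeping the fitted centroids around matters later: new songs can be assigned to a cluster by projecting them with the same components and calling `assign` against those centroids.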
My dataset is not as large as I would like it to be for clustering, and many of its artists and songs are somewhat genre-fluid, so the genre clusters bleed into one another a little bit but can be characterized broadly as follows: cluster 1 = electronic/pop, cluster 2 = rap/hiphop, cluster 3 = rock, and cluster 4 = soft rock/indie.
To turn my results into the playlist recommender I set out to make, I need location data. Using the Songkick API, I obtain lists of artists with upcoming shows in a given set of cities and then apply the Spotify/LibROSA data acquisition pipeline to those lists of artists. This creates location datasets for the given cities. I also ask the user for three of their favorite artists and then apply the same data acquisition pipeline to each artist’s top three tracks (nine tracks total).
I assign all of the songs to clusters based on their MFCC PCA components, and then use Euclidean distances calculated on the Spotify audio features to find, for each top song by the user’s favorite artists, the most similar song from the location data that is within the same cluster. The result is a nine-song playlist of recommendations by artists with upcoming concerts in the user’s city.
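The matching step can be sketched as follows. This is a minimal illustration rather than the repository code: the function name and the toy arrays are hypothetical, and it assumes the audio features have already been put on comparable scales.

```python
import numpy as np

def recommend(user_feats, user_clusters, local_feats, local_clusters):
    """For each user song, return the index of the nearest local song
    in the same cluster, or None if that cluster has no local songs."""
    user_feats = np.asarray(user_feats, dtype=float)
    local_feats = np.asarray(local_feats, dtype=float)
    local_clusters = np.asarray(local_clusters)
    picks = []
    for feats, cluster in zip(user_feats, user_clusters):
        candidates = np.flatnonzero(local_clusters == cluster)
        if candidates.size == 0:
            picks.append(None)
            continue
        # Euclidean distance from this user song to every same-cluster candidate.
        dists = np.linalg.norm(local_feats[candidates] - feats, axis=1)
        picks.append(int(candidates[dists.argmin()]))
    return picks

# Toy example: two user songs, three local songs, two audio features each.
picks = recommend(
    user_feats=[[0.0, 0.0], [1.0, 1.0]],
    user_clusters=[0, 1],
    local_feats=[[0.1, 0.1], [0.9, 0.9], [5.0, 5.0]],
    local_clusters=[1, 0, 1],
)
```

In the toy example the first user song is matched to local song 1 even though local song 0 is closer, because only song 1 shares its cluster. The sketch does not prevent the same local song from being picked twice.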
When testing the recommender, I was able to rationalize almost all of my results. Entering Pinegrove, an alternative folk/indie rock band, I got all folky/country/acoustic results. Entering Eyehategod, a sludge metal band, I got only results that could be characterized by heavy distortion. Then entering Ela Minus, an experimental pop/electronic artist, I got mostly results with a similar dreamy/airy quality. My tests were done using NYC location data, and I expect recommendation quality to diminish for cities with less live music.
The main issue with my model is that using web-scraping to obtain mp3s is slow and imperfect. I view my model as a proof-of-concept project, and imagine a commercial implementation would be done by a company with easier access to the necessary data. The other issue with my model is the fluidity of the clusters, which I believe can be easily improved with a larger dataset. Again, this is a task that is easily achieved with the data accessibility of a company in the music streaming industry.
Instructions for replicating this project and using the recommender are available on my GitHub (https://github.com/gab992/Content-Based-Live-Music-Recommender). Replication requires credentials for the Spotify and Songkick APIs, which are available upon request from the following pages…