Radio DJs hate this one simple machine learning model [SpotifAI Part 1]

Published in

VantageAI

6 min read · Dec 2, 2021

It’s Friday evening and you’re in a bar with your friends. One drink follows the other when a quite familiar song comes along.

Having a good time talking about the variance-bias tradeoff

At the tables around you the patrons light up and all seem to enjoy this new tune. Whispers go back and forth about a new so-called ‘summer hit’, and everyone is taken aback by this undiscovered gem. Everyone, that is, but you and your group of friends, who have been playing this song non-stop since you first told them about it. Instantly you are regarded as a genius, a musical prophet, because of your recommendation months ago. Little do your friends know that you have a nifty way of predicting future hits.

Enter SpotifAI; a model that uses historical data from Spotify to predict hits from the future.

These hits are then nicely packaged in a weekly Spotify playlist just for you. In this three-part series we’re going to take a look at how you can build your very own SpotifAI clone, from (1) sourcing and visualising data to (2) building a model and (3) the crown jewel: deployment to the cloud.

How does it work?

Ok but really, how does this app work? Well first of all we need a dataset of songs and their rankings on Spotify to train our model on. The model tries to learn the hidden patterns in what makes a hit stand out from the rest, the signal within the noise. We then feed it a playlist of new songs, generated by Spotify each week, and ask it to rank these tunes by ‘hit potential’. From our ranking then, we can see which songs are predicted to fare better than others, and which ones won’t fare at all. As a last step, we take the top 20 in this ranking, the real bangers, and release them to the public via an automatically generated Spotify playlist.
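To make the ranking step concrete, here is a minimal sketch of "score every new song, sort, keep the top 20". The `dummy_score` function and the `hit_potential` field are hypothetical stand-ins for the trained model from part 2, and the song names are placeholders.

```python
from typing import Callable, Dict, List

Song = Dict[str, object]


def top_hits(songs: List[Song], score: Callable[[Song], float], n: int = 20) -> List[Song]:
    """Rank songs by predicted hit potential (highest first) and keep the n best."""
    ranked = sorted(songs, key=score, reverse=True)
    return ranked[:n]


def dummy_score(song: Song) -> float:
    """Hypothetical stand-in for the model: reads a precomputed score."""
    return float(song["hit_potential"])


# Placeholder songs with made-up scores, only to show the mechanics.
new_releases = [
    {"name": "song_a", "hit_potential": 0.2},
    {"name": "song_b", "hit_potential": 0.9},
    {"name": "song_c", "hit_potential": 0.5},
]

playlist = top_hits(new_releases, dummy_score, n=2)
print([song["name"] for song in playlist])  # ['song_b', 'song_c']
```

In the real app, `score` would wrap the model's prediction and `n` would stay at its default of 20.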

Data Sourcing

Yeah yeah, most people just skip ahead to the guts of the model in part 2, but as a true modelling specialist, you know that the data is where it’s at. So dear reader, let’s get started with some data.

In this project, we use data from two different sources. For historical data, we make grateful use of the work of six brave Russian university students who compiled an impressive 3-year database of Spotify’s daily top 200 charts. The new potential hits for which we make predictions are sourced from Spotify’s very own API, from which we collect its ‘New Music Friday’ playlist. Let’s take a look at these two data sources.

Historical Kaggle Data

For the purpose of training our model, we will not be using any custom engineered features, sticking only to the features provided via the Spotify API, since we can acquire those as well for the New Music Friday Spotify list. This is in order to guarantee that future datasets from Spotify will be compatible with the model we train.

We start by creating a virtual environment. A virtual environment keeps the packages and version dependencies of different projects separate by giving each project its own isolated Python installation. There are different tools available for this, but we used conda in the terminal and named the environment spotify.

To give you an idea of the data that we are working with, here is a sample:

The ‘pandas_profiling’ python library is a handy tool to scrutinise the data yourself. We generated a nice-looking and interactive profile report like this:

The data itself is squeaky clean. No missing values or duplicates to speak of; it seems they train their students well in Russia. Some songs are listed multiple times, however, each time with a new URL (the ‘unique’ identifier) and with different featured artists, remixes or edits. Of course it’s possible that a song only becomes a hit once it’s been chopped up, sampled and glued back together, so we’ll leave this nuance in the data as-is.

A heat-map of the correlation matrix containing all the features looks as follows:

These are interesting features (read more about them here); it’s not every day you see danceability as an input variable. We can already notice some relationships: loudness and energy seem to track well, while both have a negative relationship with acousticness. It’s also interesting to note that songs containing explicit language oftentimes also have a high danceability score; fruity language and silky moves mix well together.
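If you want to see what sits behind such a heat-map, the sketch below computes a Pearson correlation matrix in plain Python. In practice a call like pandas' `DataFrame.corr()` gives the same numbers (Pearson is its default), and a plotting library draws the colours. The toy columns are made-up values chosen only to illustrate the signs of the relationships, not real Spotify data.

```python
from math import sqrt
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise Pearson correlations between all named columns."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}


# Made-up toy values, for illustration only.
toy = {
    "loudness": [-12.0, -8.0, -5.0, -3.0],
    "energy": [0.3, 0.5, 0.7, 0.9],
    "acousticness": [0.9, 0.6, 0.3, 0.1],
}

matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

On the toy data, loudness and energy come out positively correlated and both are negatively correlated with acousticness, which is the pattern described above.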

Getting new data from Spotify’s API

Once we have trained a model on historical data, which you will read all about in part 2 of this series, we of course want to apply this model to predict future hits! We extract newly released tracks by scraping the New Music Friday playlist from Spotify (https://open.spotify.com/playlist/37i9dQZF1DX4JAvHpjipBk). How, I hear you ask?

We start off by importing the spotipy Python package, which is a Python wrapper around the Spotify API. Be sure to run pip install spotipy before moving on to the next section.

In order to connect to the Spotify API, we need a client id and a secret. To obtain the required credentials, you will need to take care of the following:

Have a Spotify account
Log in with that account on Spotify for developers (https://developer.spotify.com/dashboard/)
Create your own Spotify application there

Now that you have an ID and secret ready for use, you can set up the connection as follows. Of course you need to replace the empty cid and secret strings with your credentials.

It is recommended to store your credentials safely instead of hard-coding them in your program. For this you can make use of environment variables or a secrets manager in the cloud.
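A minimal sketch of that setup, assuming the credentials live in the environment variables `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` (the names spotipy conventionally reads) and that the client-credentials flow is enough for reading public playlists. The helper names `load_credentials` and `make_client` are ours, not part of spotipy.

```python
import os
from typing import Tuple


def load_credentials() -> Tuple[str, str]:
    """Read the client id and secret from environment variables."""
    cid = os.environ.get("SPOTIPY_CLIENT_ID", "")
    secret = os.environ.get("SPOTIPY_CLIENT_SECRET", "")
    if not cid or not secret:
        raise RuntimeError("Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET first")
    return cid, secret


def make_client():
    """Build a spotipy client using the client-credentials flow."""
    # Imported here so load_credentials() works even without spotipy installed.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    cid, secret = load_credentials()
    auth_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
    return spotipy.Spotify(auth_manager=auth_manager)


# Usage, once the two environment variables are set:
# sp = make_client()
```

Keeping the secret out of the source file also means you can commit the script without leaking anything.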

Now that we have the connection all set up, it is time to explore! We can paste a playlist URL into the playlist function, extract our first song in the playlist and take a look at its track dictionary:
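As a rough sketch, fetching the playlist and picking out the first song could look like this. Here `sp` is assumed to be an authenticated spotipy client, and the nesting `tracks` → `items` → `track` reflects the playlist response shape as we understand it; the helper name `first_track` is ours.

```python
from typing import Any, Dict

# The New Music Friday playlist mentioned above.
PLAYLIST_URL = "https://open.spotify.com/playlist/37i9dQZF1DX4JAvHpjipBk"


def first_track(sp: Any, playlist_url: str = PLAYLIST_URL) -> Dict[str, Any]:
    """Return the track dictionary of the first song in a playlist."""
    response = sp.playlist(playlist_url)  # accepts a playlist URL, URI or id
    items = response["tracks"]["items"]   # one entry per song in the playlist
    return items[0]["track"]


# Usage with a real client:
# track = first_track(sp)
# print(track.keys())
```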

The track value as part of the response:

If we want to extract the name of the artist, we can use the following code:
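A small sketch of that lookup: a track dictionary carries a list of artists, each with a `name`. The `sample_track` below is a hand-written, heavily trimmed stand-in for a real API response, with a placeholder title.

```python
from typing import Any, Dict


def first_artist_name(track: Dict[str, Any]) -> str:
    """Name of the first (main) artist listed on a track."""
    return track["artists"][0]["name"]


# Hand-written stand-in for a real track dictionary; only the fields we need.
sample_track = {
    "name": "placeholder title",
    "artists": [{"name": "Anuel AA"}],
}

print(first_artist_name(sample_track))  # Anuel AA
```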

As you can see, the artist of the first song is the one and only Anuel AA. Here’s how we got most of the other features from the API response:
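The original snippet is not reproduced here, so the following is only a sketch of how such an extraction might look. It assumes `sp` is an authenticated spotipy client whose app has access to the audio-features endpoint, and it flattens each song into one row. The helper names and the exact selection of fields are our own choices.

```python
from typing import Any, Dict, List

# Audio features we copy over from the audio-features response.
AUDIO_KEYS = [
    "danceability", "energy", "loudness", "speechiness", "acousticness",
    "instrumentalness", "liveness", "valence", "tempo",
]


def track_to_row(track: Dict[str, Any], audio: Dict[str, Any]) -> Dict[str, Any]:
    """Combine track metadata and audio features into one flat dictionary."""
    row = {
        "id": track["id"],
        "name": track["name"],
        "artist": track["artists"][0]["name"],
        "explicit": track["explicit"],
        "popularity": track["popularity"],
        "duration_ms": track["duration_ms"],
    }
    # Missing audio features become None instead of raising a KeyError.
    row.update({key: audio.get(key) for key in AUDIO_KEYS})
    return row


def playlist_to_rows(sp: Any, playlist_url: str) -> List[Dict[str, Any]]:
    """One row per song on the first page of a playlist."""
    items = sp.playlist(playlist_url)["tracks"]["items"]
    tracks = [item["track"] for item in items if item.get("track")]
    audio = sp.audio_features([track["id"] for track in tracks])
    # audio_features can return None for a track it knows nothing about.
    return [track_to_row(t, a) for t, a in zip(tracks, audio) if a is not None]
```

The resulting list of dictionaries can be handed straight to `pandas.DataFrame(rows)` to get a table with the same columns as the training data.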

The best way to get familiar with the data is by visualising it. Let’s explore two of the most popular songs by Johnny Cash (“Hurt” and “Ring of Fire”) by plotting them on a selection of audio features:
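The original figure is not reproduced here. One way to draw such a comparison is a radar chart, sketched below under the assumption that every chosen feature lies between 0 and 1 and that matplotlib is installed. The feature selection and helper names are ours; you would pass in the real feature dictionaries fetched from the API.

```python
from math import pi
from typing import Dict, List, Tuple

# Assumed selection of features that all range from 0 to 1.
FEATURES = ["danceability", "energy", "valence", "acousticness", "speechiness"]


def radar_points(song: Dict[str, float], features: List[str]) -> Tuple[List[float], List[float]]:
    """Angles and values for one song; the first point is repeated to close the polygon."""
    angles = [2 * pi * i / len(features) for i in range(len(features))]
    values = [song[name] for name in features]
    return angles + angles[:1], values + values[:1]


def plot_radar(songs: Dict[str, Dict[str, float]], features: List[str] = FEATURES):
    """Draw one polygon per song on a shared polar axis and return the figure."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    for title, song in songs.items():
        angles, values = radar_points(song, features)
        ax.plot(angles, values, label=title)
        ax.fill(angles, values, alpha=0.2)
    ax.set_xticks([2 * pi * i / len(features) for i in range(len(features))])
    ax.set_xticklabels(features)
    ax.legend()
    return fig


# Usage, with hurt and ring_of_fire being audio-feature dictionaries from the API:
# plot_radar({"Hurt": hurt, "Ring of Fire": ring_of_fire}).savefig("cash.png")
```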

As you might expect, “Ring of Fire” scores much higher on valence than “Hurt”.

Data, data and more data. That’s what we looked at in this first of a three-part series on how you, yes even you, can build your very own SpotifAI app. In the next blog post, we will dive into the modelling part, and in the final blog post we’ll take a look at an enigma that is missed in many failed data science projects: deployment to the cloud. Spooky right? Stay tuned.

Written by Andrei Pascanean and Karlijn Schipper

We’re always on the lookout for cool data nerds, so feel free to hit us up on LinkedIn here and here :)

Shoutout to Björn van Dijkman and Emiel de Heij for all the feedback and help! Feel free to let us know what you think down in the comments, dear reader.
