Clustering the Most Listened to Songs of the 2010s Using Spotify Data.
Create your own content-based recommendation engine with the K-Means algorithm.
I think that combining interesting datasets with practical machine learning methods gives learning a boost. Readers get a better feel for the data and a clearer picture of how machine learning is applied in practice. This is the main reason I chose the Spotify dataset for this article.
On the other hand, clustering is one of the most popular areas of unsupervised machine learning, and it can be applied to customers, products, or almost any other type of object. It is useful for:
- Segmentation: You can segment your business entities and plan the marketing efforts with a higher targeting accuracy.
- Exploratory analysis: You can understand the dataset more deeply and discover new patterns in it.
- Deriving new features: By interpreting the clusters, you can get useful new inputs for downstream machine learning models.
- Recommendation: If a person likes an item, other items in the same segment may appeal to that person too, which lets clusters serve as recommendation lists.
The last use case does not suit every problem, and it might not be as successful as dedicated recommendation engine algorithms. However, I think it fits our case well. In this article, I will try to explain how to apply it and include all the required Python scripts. I hope you enjoy it!
What is K-means?
K-means is one of the most fundamental and popular clustering algorithms. It is also known as Lloyd’s algorithm. The k in the name stands for the number of clusters. Each feature of the dataset is treated as a dimension of a vector space, and each sample is a point in this space. The aim of the algorithm is to find the centroids, one per cluster, that represent the clusters meaningfully given the positions of the samples.
Basically, it consists of 4 steps:
1. Initializing k randomly created centroids in the space.
2. Assigning every sample to its nearest centroid according to the Euclidean distance.
3. Calculating the mean of every cluster and moving each centroid to that mean.
4. Repeating steps 2 and 3 until the centroids stop moving, that is, until the algorithm converges.
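The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the script used in this article: the function names `kmeans` and `nearest_centroid` are made up for the example, the toy data is synthetic, and the initial centroids are drawn from the samples themselves (an assumption, one common way to create random starting points).

```python
import numpy as np


def nearest_centroid(X, centroids):
    """Index of the closest centroid (Euclidean distance) for every sample."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids (assumption).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign each sample to its nearest centroid.
        labels = nearest_centroid(X, centroids)
        # Step 3: move each centroid to the mean of its assigned samples.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, nearest_centroid(X, centroids)


# Toy data: two well-separated groups of 20 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
print(labels)
```

On data this clearly separated, each group should end up with its own label and a centroid near its center.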
Some notes on k-means:
- By its nature, k-means is sensitive to outliers, so handling outliers before modeling is critical. Scaling the features is also essential; otherwise, features on larger scales dominate the distance calculation and the outputs become meaningless.
- It initializes with a random factor, which makes K-means a nondeterministic algorithm. For this reason, every training run might end up with a different output. In addition, the algorithm may converge to a local minimum. This means we can get low-performing results even though better ones were possible. To overcome this, it is recommended to train the model more than once with different seeds and keep the most successful one.
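The advice about trying several seeds could look like the sketch below. It assumes scikit-learn's `KMeans` and uses random stand-in data instead of the Spotify features; the number of seeds (10) and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for a scaled feature matrix (50 samples, 9 features).
rng = np.random.default_rng(0)
X = rng.random((50, 9))

inertias = []
best_model = None
for seed in range(10):
    # n_init=1 makes each fit a single run from one random initialization.
    model = KMeans(n_clusters=5, n_init=1, random_state=seed).fit(X)
    inertias.append(model.inertia_)
    # Keep the run with the lowest inertia (sum of squared distances).
    if best_model is None or model.inertia_ < best_model.inertia_:
        best_model = model

print(min(inertias), max(inertias))
```

scikit-learn can also do this internally: the `n_init` argument of `KMeans` runs the algorithm several times and keeps the best result.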
Getting Spotify Data
After a very brief explanation of k-means, let’s get our hands dirty with some real-world data. I chose Spotify as my clustering playground. The first reason for this choice is that music data is interesting to work with. Secondly, it is easy to access the Spotify API from Python using the spotipy library. Last but not least, Spotify provides some audio features for free, and our clustering will be based on these features of the tracks.
To access the Spotify API, you need to log in to the developer dashboard and create a client ID of your own. This is a pretty straightforward process, and after obtaining the client ID and client secret, we will use them in our Python code.
As the input dataset, we use the tracks from a Spotify playlist: Most Streamed Songs of the Decade. This playlist contains the 50 most popular songs of the 2010s, and we will cluster these songs according to their features.
Most Streamed Songs of the Decade, a playlist by Spotify
In the script, after accessing the Spotify API, we get the track information and then the audio features of these tracks. You can find explanations of the features in the Spotify API documentation.
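A minimal sketch of this step with spotipy might look as follows. It is not the original script: the helper names `make_client` and `fetch_playlist_features`, and the placeholder credentials and playlist ID, are hypothetical. It assumes the client-credentials flow, and whether the audio-features endpoint is available to your app may depend on Spotify's current API access rules.

```python
import pandas as pd


def make_client(client_id, client_secret):
    """Build a spotipy client using the client-credentials flow."""
    # Imported here so the rest of the sketch works without spotipy installed.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    auth = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
    return spotipy.Spotify(auth_manager=auth)


def fetch_playlist_features(sp, playlist_id):
    """Return one row per track: artists, track name, and audio features."""
    page = sp.playlist_items(playlist_id)
    items = list(page["items"])
    while page.get("next"):  # follow pagination for longer playlists
        page = sp.next(page)
        items.extend(page["items"])

    tracks = [item["track"] for item in items if item.get("track")]
    rows = []
    # The audio-features endpoint accepts at most 100 track ids per call.
    for start in range(0, len(tracks), 100):
        batch = tracks[start:start + 100]
        features = sp.audio_features([t["id"] for t in batch])
        for track, feats in zip(batch, features):
            if feats is None:  # no audio features returned for this track
                continue
            rows.append({
                "artists": ", ".join(a["name"] for a in track["artists"]),
                "name": track["name"],
                **feats,
            })
    return pd.DataFrame(rows)


# Hypothetical usage; replace the placeholders with your own values:
# sp = make_client("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
# df = fetch_playlist_features(sp, "YOUR_PLAYLIST_ID")
```

The resulting DataFrame holds one row per track, with the audio features as columns next to the artist and track names.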
Histograms of all features we will use:
Training the model
The clustering code starts with the normalization of the columns with a scaling function. As you may have noticed, all the features we use range between 0 and 1, except two of them: loudness and tempo. We scale these two to [0, 1] to make them comparable with the other columns in the vector space.
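One way to do this scaling is scikit-learn's `MinMaxScaler`, sketched below. The choice of scaler is an assumption, and the three rows are made-up stand-in values, not real Spotify data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in rows; real values would come from the Spotify API.
df = pd.DataFrame({
    "danceability": [0.79, 0.72, 0.80],
    "loudness": [-5.6, -3.1, -8.2],   # decibels, typically negative
    "tempo": [104.0, 98.0, 160.0],    # beats per minute
})

# Rescale only the two columns that are not already in [0, 1].
columns_to_scale = ["loudness", "tempo"]
df[columns_to_scale] = MinMaxScaler().fit_transform(df[columns_to_scale])
print(df)
```

After this step, the smallest value in each scaled column maps to 0 and the largest to 1, while danceability is left untouched.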
The next step is one of the challenging parts of the k-means algorithm: deciding the optimal number of clusters, in other words, k. We will use the elbow method for this purpose. The script starts with k=2, increases k iteratively up to a chosen limit, and at each step measures the performance of the model using the inertia. Inertia is the sum of squared distances of the samples to their cluster centroid. A lower inertia indicates tighter clusters, but inertia keeps decreasing as k grows, so the lowest value alone is not the goal. When we plot the inertia against k, we look for the point where the decrease slows down sharply, which resembles an elbow; that k is the optimum for us.
In this part of the code, the random_state argument of KMeans is important for making our model deterministic. This way, we can compare the outputs across runs. You are free to change it, but do not forget to use the same seed afterward.
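The elbow search described above could be sketched like this, assuming scikit-learn's `KMeans`. The data is a random stand-in for the scaled features, and the upper limit of k=10 and the seed 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled feature matrix (50 tracks, 9 features).
rng = np.random.default_rng(0)
X = rng.random((50, 9))

inertias = {}
for k in range(2, 11):
    # A fixed random_state makes every run reproducible and comparable.
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

Plotting `inertias` against k (for example with matplotlib) gives the elbow chart; on random data like this stand-in, no clear elbow should be expected.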
According to the chart below, the elbow shape is not very distinct; however, there is a slight bend at k=5, and it is better than nothing. 😏 It seems that a clear convergence would need more than 9 clusters, so to keep the result reasonable to interpret, I stopped the search at k=5. k=7 also looks like a good alternative, though. Bear in mind that there is no simple solution to this problem; you can keep experimenting by increasing the sample size, handling outliers, changing the distributions of features, applying PCA, etc. For more information about feature engineering techniques, you can refer to my article about it.
It is time to see the results of our model. One of the best ways of interpreting the clusters is to examine the feature averages for every cluster, so that we can sense the logic behind the clustering. For instance, the table below shows the averages of all input features. Cells in bright red and bright green indicate a distinguishing feature for that cluster.
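A table of per-cluster averages can be produced with a pandas `groupby`, as in the sketch below. The feature values are random stand-ins and only four feature columns are shown; k=5 follows the choice made above, while the seed is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled features of 50 tracks.
rng = np.random.default_rng(0)
feature_names = ["danceability", "energy", "valence", "tempo"]
df = pd.DataFrame(rng.random((50, 4)), columns=feature_names)

# Fit the final model and attach each track's cluster label.
model = KMeans(n_clusters=5, n_init=10, random_state=42).fit(df[feature_names])
df["cluster"] = model.labels_

# Average of every feature within each cluster.
cluster_means = df.groupby("cluster")[feature_names].mean()
print(cluster_means.round(2))
```

Comparing the rows of `cluster_means` column by column shows which features stand out for each cluster.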
Let’s interpret our clusters in light of the table above. Please note that the characteristics of the clusters are deduced by comparing them with each other, and thus they are relative. When we say a cluster includes songs with a low tempo, this does not mean low relative to the whole music industry.
| Index | Artists | Track |
| --- | --- | --- |
| 0 | Drake, WizKid, Kyla | One Dance |
| 1 | Major Lazer, MØ, DJ Snake | Lean On (feat. MØ & DJ Snake) |
| 10 | Clean Bandit, Jess Glynne | Rather Be (feat. Jess Glynne) |
| 16 | Imagine Dragons | Radioactive |
| 23 | Drake | God’s Plan |
| 26 | Post Malone, 21 Savage | rockstar (feat. 21 Savage) |
| 27 | Bruno Mars | Grenade |
| 30 | Kendrick Lamar | HUMBLE. |
| 34 | Katy Perry, Juicy J | Dark Horse |
| 41 | Jennifer Lopez, Pitbull | On The Floor — Radio Edit |
| 44 | LMFAO, Lauren Bennett, GoonRock | Party Rock Anthem |
| 45 | Drake | In My Feelings |
| 46 | The Chainsmokers, Coldplay | Something Just Like This |
This is our first cluster, with a high value of liveness, which means audience sounds are detected in the songs. Low values of speechiness, acousticness, instrumentalness, and valence are other significant characteristics of this cluster. There are songs of different genres in this cluster. 🎹
| Index | Artists | Track |
| --- | --- | --- |
| 2 | Post Malone, Swae Lee | Sunflower — Spider-Man: Into the Spider-Verse |
| 13 | Ed Sheeran | Shape of You |
| 14 | The Chainsmokers, Halsey | Closer (feat. Halsey) |
| 19 | XXXTENTACION | SAD! |
| 20 | Ed Sheeran | Thinking out Loud |
| 49 | Eminem | Not Afraid |
This cluster has high danceability, acousticness, and loudness, with a low tempo. 🎤
| Index | Artists | Track |
| --- | --- | --- |
| 5 | Macklemore & Ryan Lewis | Can’t Hold Us — feat. Ray Dalton |
| 12 | fun., Janelle Monáe | We Are Young (feat. Janelle Monáe) |
| 24 | The Chainsmokers, Daya | Don’t Let Me Down |
| 29 | Luis Fonsi, Daddy Yankee | Despacito |
| 37 | Luis Fonsi, Daddy Yankee, Justin Bieber | Despacito — Remix |
| 40 | Pharrell Williams | Happy — From “Despicable Me 2” |
This is our cluster with high energy, valence, and tempo. It is definitely the most dynamic cluster. 💃
| Index | Artists | Track |
| --- | --- | --- |
| 4 | Adele | Rolling in the Deep |
| 7 | Avicii | Wake Me Up |
| 8 | Eminem, Rihanna | Love The Way You Lie |
| 11 | Carly Rae Jepsen | Call Me Maybe |
| 15 | OMI, Felix Jaehn | Cheerleader — Felix Jaehn Remix Radio Edit |
| 17 | Shawn Mendes, Camila Cabello | Señorita |
| 18 | B.o.B, Hayley Williams | Airplanes (feat. Hayley Williams) |
| 22 | Calvin Harris | Summer |
| 25 | Macklemore & Ryan Lewis, Wanz | Thrift Shop (feat. Wanz) |
| 28 | Mike Posner, Seeb | I Took A Pill In Ibiza — Seeb Remix |
| 31 | Lil Nas X, Billy Ray Cyrus | Old Town Road — Remix |
| 32 | Shakira, Freshlyground | Waka Waka (This Time for Africa) |
| 33 | Don Omar, Lucenzo | Danza Kuduro |
| 35 | Sia | Cheap Thrills |
| 36 | Mark Ronson, Bruno Mars | Uptown Funk |
| 38 | Rihanna | Only Girl (In The World) |
| 42 | Nicki Minaj | Starships |
| 43 | Flo Rida | Whistle |
Another cluster with high energy but low acousticness. Like cluster 0, these songs look like a bit of an all-around mix. 🎼
| Index | Artists | Track |
| --- | --- | --- |
| 3 | Gotye, Kimbra | Somebody That I Used To Know |
| 6 | Ariana Grande | 7 rings |
| 9 | Billie Eilish | bad guy |
| 21 | Wiz Khalifa, Charlie Puth | See You Again (feat. Charlie Puth) |
| 39 | Post Malone, Ty Dolla $ign | Psycho (feat. Ty Dolla $ign) |
| 47 | John Legend | All of Me |
| 48 | Passenger | Let Her Go |
Finally, our last cluster is best described by high acousticness and instrumentalness. It has the slowest music style among the clusters, with a little R&B and rock. 🎻
Our analysis is now complete, and we end up with 5 clusters. The songs in the clusters might not match the genres you have in mind and may seem unrelated. But please keep in mind that this work is based on the features provided by Spotify, which are mostly technical indicators, and the outputs depend heavily on them. Regarding the quality of the results, I think they are not perfect but reasonable, and they are suitable for recommendation purposes. A person who listens to a song from a cluster might be delighted to hear another one from the same list. In a similar way, you can analyze your own playlists and discover hidden gems that you may like.
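Using clusters as recommendation lists can be as simple as the sketch below. The `recommend` helper is hypothetical, the five rows are taken from the cluster tables in this article, and the cluster numbers 0 to 2 follow the order in which the clusters are presented (an assumption about the labels).

```python
import pandas as pd

# A few rows from the cluster tables above; cluster ids follow table order.
songs = pd.DataFrame({
    "name": ["One Dance", "Radioactive", "Shape of You", "Thinking out Loud", "Despacito"],
    "cluster": [0, 0, 1, 1, 2],
})


def recommend(songs, liked_name):
    """Return the other tracks that share a cluster with `liked_name`."""
    liked_cluster = songs.loc[songs["name"] == liked_name, "cluster"].iloc[0]
    same_cluster = songs[(songs["cluster"] == liked_cluster) & (songs["name"] != liked_name)]
    return same_cluster["name"].tolist()


print(recommend(songs, "Shape of You"))  # ['Thinking out Loud']
```

With the full clustered DataFrame in place of this small sample, the same lookup returns every other track in the liked song's cluster.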