BYOC — Build Your Own Clustering Algorithm
Welcome back to the second stage of our project using the Spotify API in Python! If you want to learn the basics about APIs in general and how the Spotify response payload is structured, check out part 1.
In this part, we’ll look for clusters in our song data and get a better feel for clustering algorithms in general by building our own!
Data Querying and Processing
Let’s start by functionalizing the query to the API we built in part 1:
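The snippet from part 1 isn't reproduced here, so below is a rough sketch of what that helper could look like, using the requests library against Spotify's playlist-tracks and audio-features endpoints. The function name, the token argument, and the choice to keep only track_name, danceability, and valence are my placeholders, not necessarily the original code:

```python
import pandas as pd
import requests

def query_playlist_features(token, playlist_id):
    """Pull a playlist's tracks and their audio features into a dataframe."""
    headers = {"Authorization": f"Bearer {token}"}

    # Get the tracks in the playlist (first page; playlists under 100 songs fit in one call)
    tracks_url = f"https://api.spotify.com/v1/playlists/{playlist_id}/tracks"
    items = requests.get(tracks_url, headers=headers).json()["items"]
    track_ids = [item["track"]["id"] for item in items]
    track_names = [item["track"]["name"] for item in items]

    # Get the audio features (danceability, valence, etc.) for those tracks
    features_url = "https://api.spotify.com/v1/audio-features"
    response = requests.get(features_url, headers=headers,
                            params={"ids": ",".join(track_ids)})
    features = pd.DataFrame(response.json()["audio_features"])
    features["track_name"] = track_names

    return features[["track_name", "danceability", "valence"]]
```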
Now, we can call this function to generate our training data:
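Something along these lines, with placeholder credentials (the actual token handling and playlist ID are covered in part 1):

```python
# Placeholders; substitute your own access token and playlist ID from part 1
token = "YOUR_ACCESS_TOKEN"
training_playlist_id = "YOUR_TRAINING_PLAYLIST_ID"

train_df = query_playlist_features(token, training_playlist_id)
print(train_df.head())
```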
Notice the danceability and valence columns; these are the features I want to investigate. Next, let's define and call a function for a basic scatter plot of these two variables:
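A minimal version of that plotting function might look like this (matplotlib assumed; the function name is mine):

```python
import matplotlib.pyplot as plt

def scatter_songs(df, x="danceability", y="valence"):
    """Basic scatter plot of two audio features."""
    plt.figure(figsize=(8, 6))
    plt.scatter(df[x], df[y])
    plt.xlabel(x)
    plt.ylabel(y)
    plt.title(f"{y} vs. {x}")
    plt.show()

scatter_songs(train_df)
```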
No distinct groups jump out at me, but that is why techniques like clustering are so powerful! Clustering algorithms are a form of unsupervised learning, meaning we give the computer no target variable to predict (and therefore no ready-made answer key to check its accuracy against). This is useful in situations like the scatter plot above, where we don't have a predetermined idea of what groups my music taste falls into.
Clustering Conceptually
In order to generate "ideal" clusters, we want each observation to belong to one and only one cluster, points within the same cluster to be as close as possible, and points in different clusters to be as far apart as possible. The intuition is that we want clear separation between our groups, yet similarity among the individuals within each group. By these criteria, Case II is preferable to Case I.
We can calculate metrics like inertia, the sum of the squared distances between each point and the centroid of the cluster that point belongs to, to evaluate these criteria. In fact, a common method of determining how many clusters to use relies on inertia.
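In code, that calculation is only a few lines. This sketch assumes the points and centroids are NumPy arrays and that labels holds each point's cluster index:

```python
import numpy as np

def inertia(points, centroids, labels):
    """Sum of squared distances from each point to the centroid of its assigned cluster."""
    differences = points - centroids[labels]  # vector from each point to its own centroid
    return float(np.sum(differences ** 2))
```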
Elbow Method
The elbow method plots the within-cluster sum of squares (the sum of the squared distances between each point and its cluster's centroid) as the number of clusters increases.
For our data, the elbow method looks like this:
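A sketch of how that plot can be produced with sklearn's KMeans and its inertia_ attribute, assuming the train_df built above:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X_train = train_df[["danceability", "valence"]]

inertias = []
cluster_range = range(1, 11)
for k in cluster_range:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X_train)
    inertias.append(model.inertia_)

plt.plot(cluster_range, inertias, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.title("Elbow method")
plt.show()
```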
Logically, inertia decreases as the number of clusters we subdivide the data into increases. The "elbow" in the name refers to the point at which inertia stops noticeably decreasing. One could argue that anywhere from 3 to 6 clusters achieves a balance between under- and overfitting our data.
K-Means Theory
Okay! Now we know how many clusters to form, but where do we pinpoint them? This is where the K-Means algorithm comes in. For a given number of clusters n, K-Means will:
1. Plot initial centroids using the (x, y) values of n randomly selected points
2. Assign every data point to the closest centroid
3. Recalculate each centroid by averaging the (x, y) values of all points currently assigned to its cluster
4. Repeat steps 2 and 3, reassigning points and recalculating centroids, until the centroids no longer change, points stop being assigned to different clusters, or some maximum number of iterations (I've seen 300 as a convention) is reached
K-Means Implementation
We’ll try to implement K-Means clustering from scratch using base Python and pandas, and we can compare its performance vs. the sklearn package!
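The original snippets aren't shown here, so what follows is my reconstruction rather than the article's exact code. First, steps 1 and 2: pick n random songs as the starting centroids and assign every song to the closest one (straight-line distance). The helper names and the features default are placeholders:

```python
import pandas as pd

def initialize_centroids(df, n_clusters, features=("danceability", "valence"), seed=None):
    """Step 1: use n randomly selected songs as the starting centroids."""
    sample = df.sample(n=n_clusters, random_state=seed)
    return sample[list(features)].reset_index(drop=True)

def assign_clusters(df, centroids, features=("danceability", "valence")):
    """Step 2: label each song with the centroid it sits closest to."""
    labels = []
    for _, row in df.iterrows():
        distances = {
            c: sum((row[f] - centroids.loc[c, f]) ** 2 for f in features) ** 0.5
            for c in centroids.index
        }
        labels.append(min(distances, key=distances.get))
    return pd.Series(labels, index=df.index, name="cluster")
```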
Now, we can begin the iterative process of step 4, reassigning and recalculating cluster centroids:
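A sketch of that loop. The name matches the kmeans_training_from_scratch function mentioned in the final thoughts, but the body is my best guess at the approach, not the original code:

```python
def kmeans_training_from_scratch(df, n_clusters, features=("danceability", "valence"),
                                 max_iter=300, seed=None):
    """Steps 3 and 4: recompute centroids and reassign songs until nothing changes."""
    centroids = initialize_centroids(df, n_clusters, features, seed)
    labels = assign_clusters(df, centroids, features)

    for _ in range(max_iter):
        # Step 3: each new centroid is the mean of the songs currently in that cluster
        new_centroids = (
            df.assign(cluster=labels)
              .groupby("cluster")[list(features)]
              .mean()
        )
        # Step 2 again: reassign every song to its closest (updated) centroid
        new_labels = assign_clusters(df, new_centroids, features)

        # Stop once assignments and centroids are unchanged between iterations
        if new_labels.equals(labels) and new_centroids.equals(centroids):
            break
        centroids, labels = new_centroids, new_labels

    return centroids, labels
```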
Calling our function is easy, and we can see how our clusters moved as the model fit our training data. In simple terms, clear groups can now be seen!
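For example, with 4 clusters (the discussion below refers to Clusters 1 through 4):

```python
import matplotlib.pyplot as plt

manual_centroids, manual_labels = kmeans_training_from_scratch(train_df, n_clusters=4)

plt.scatter(train_df["danceability"], train_df["valence"], c=manual_labels, cmap="viridis")
plt.scatter(manual_centroids["danceability"], manual_centroids["valence"],
            c="red", marker="X", s=200, label="centroids")
plt.xlabel("danceability")
plt.ylabel("valence")
plt.legend()
plt.show()
```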
Now, we benchmark ourselves with sklearn’s version:
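A sketch of the sklearn side, fit on the same two features:

```python
import pandas as pd
from sklearn.cluster import KMeans

sklearn_model = KMeans(n_clusters=4, n_init=10, random_state=42)
sklearn_labels = sklearn_model.fit_predict(train_df[["danceability", "valence"]])
sklearn_centroids = pd.DataFrame(sklearn_model.cluster_centers_,
                                 columns=["danceability", "valence"])
```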
Our plot output for sklearn shows that we weren’t far off with our version!
I decided to evaluate the difference between my model and sklearn's by counting the number of songs the two models assigned to different clusters. I also calculated the sum of the absolute distances between each of my centroids and the corresponding sklearn centroid.
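The article's validation code isn't shown, so here is one way those two checks could be written. Pairing each of my centroids with its nearest sklearn centroid first is my own assumption (the two models won't necessarily number their clusters the same way):

```python
import numpy as np

def validate_against_sklearn(manual_centroids, manual_labels,
                             sklearn_centroids, sklearn_labels):
    """Count disagreeing cluster assignments and total the distance between matched centroids."""
    # Pair each manual cluster with its nearest sklearn cluster
    # (a simple nearest-match; ties and collisions aren't handled)
    mapping = {}
    for i, centroid in manual_centroids.iterrows():
        distances = np.abs(sklearn_centroids - centroid.values).sum(axis=1)
        mapping[i] = int(distances.idxmin())

    # Songs the manual model puts in a different cluster than sklearn does
    mismatches = sum(mapping[m] != s for m, s in zip(manual_labels, sklearn_labels))

    # Sum of absolute distances between each manual centroid and its sklearn counterpart
    centroid_distance = sum(
        np.abs(manual_centroids.loc[i].values - sklearn_centroids.loc[j].values).sum()
        for i, j in mapping.items()
    )
    return mismatches, centroid_distance
```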
Calling our validation function on the models built on the training data:
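Along the lines of:

```python
mismatched_songs, centroid_gap = validate_against_sklearn(
    manual_centroids, manual_labels, sklearn_centroids, sklearn_labels
)
print(f"Songs assigned differently: {mismatched_songs} of {len(train_df)}")
print(f"Total distance between matched centroids: {centroid_gap:.3f}")
```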
By these standards, we incorrectly assigned 15 songs, or 15% of the training dataset. The total distance between each of our centroids and its sklearn counterpart (Cluster 1 to Cluster 1, Cluster 2 to Cluster 2, and so on) was less than 0.2! Our manual model's performance isn't too bad relative to sklearn's!
K-Means Testing
Now, we have to put our K-Means to the test in a more applicable way: using test data. Let's load in a dataframe for a new, 21-song playlist I made:
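Reusing the query helper from the start of this post, with another placeholder playlist ID:

```python
test_playlist_id = "YOUR_TEST_PLAYLIST_ID"  # placeholder
test_df = query_playlist_features(token, test_playlist_id)
```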
We'll have to define a new function for the test data, using our cluster centroids fit on the training data:
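Since all of the learning already happened on the training data, the test-time version only needs the assignment step. A sketch (the function name is mine):

```python
def kmeans_predict_from_scratch(df, centroids, features=("danceability", "valence")):
    """Assign each new song to the closest centroid learned from the training data."""
    return assign_clusters(df, centroids, features)
```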
Now, when we call the manual function above, we get:
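That call, with a scatter plot like the earlier ones, would look roughly like:

```python
import matplotlib.pyplot as plt

manual_test_labels = kmeans_predict_from_scratch(test_df, manual_centroids)

plt.scatter(test_df["danceability"], test_df["valence"], c=manual_test_labels, cmap="viridis")
plt.scatter(manual_centroids["danceability"], manual_centroids["valence"],
            c="red", marker="X", s=200, label="training centroids")
plt.xlabel("danceability")
plt.ylabel("valence")
plt.legend()
plt.show()
```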
On the sklearn side, the function will have to be different as well to only predict rather than fit and predict:
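A thin wrapper that only predicts, reusing the model already fit on the training data (the name sklearn_predict is mine):

```python
def sklearn_predict(df, fitted_model, features=("danceability", "valence")):
    """Assign test songs to clusters using the already-fitted sklearn model."""
    return fitted_model.predict(df[list(features)])
```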
When we call the sklearn function above, we get:
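Roughly:

```python
sklearn_test_labels = sklearn_predict(test_df, sklearn_model)
# plotted with the same scatter code as the manual version above
```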
Our model from scratch looks especially good on the test data! Cluster 1 is off compared to sklearn's output, but the differences in Clusters 2, 3, and 4 seem very minor. Let's confirm with our validation function:
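A call along the same lines as before, assuming the same validation helper and variable names:

```python
test_mismatches, test_centroid_gap = validate_against_sklearn(
    manual_centroids, manual_test_labels, sklearn_centroids, sklearn_test_labels
)
print(f"Songs assigned differently: {test_mismatches} of {len(test_df)}")
print(f"Total distance between matched centroids: {test_centroid_gap:.3f}")
```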
On our 21-song test data set, we achieved over 90% agreement with sklearn's cluster predictions and a total distance of 0.14 from sklearn's centroids. Not bad!
Model Conclusions
The purpose of this project was to see if groups emerged in the songs I listened to. Both the training and test data had clusters at similar coordinates, which makes our findings seem more credible.
Songs in Cluster 4 have both high danceability and high valence — these are probably my go-to hype songs that I headbop to on sunny days, probably pop or hip hop. I would interpret songs with moderate valence but high danceability in Cluster 2 as slow dance songs, probably R&B or indie. The beauty of a labels column is that we can confirm our theories!
Final Thoughts
Admittedly, our manual model isn't the most stable. If I were to rerun the code, it would produce slightly different clusters each time (and at times, wildly different ones). I think the random sampling of points at the start of the training process is the source of this. If the initial random centroids are placed on the periphery of the data, they would likely pull the clusters farther out (similar to the effect of an outlier on a model).
Sklearn’s version probably does not stop iterating when cluster assignments or centroids remain the same for a single additional iteration. I could address this within the conditional block of the kmeans_training_from_scratch function by stopping the function if cluster assignments or centroids are stable for 2 or 3 further iterations.
Finally, iterrows is known to be inefficient. I could look into applying other methods like itertuples, since we're only dealing with two variables, and use other dataframes or lists to hold intermediate calculations and assignments.
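As one illustration, the assignment step could be rewritten without any row-by-row loop at all. This isn't the itertuples route mentioned above but a NumPy-vectorized alternative, shown only as a sketch:

```python
import numpy as np
import pandas as pd

def assign_clusters_vectorized(df, centroids, features=("danceability", "valence")):
    """Vectorized replacement for the iterrows-based assignment step."""
    points = df[list(features)].to_numpy()          # shape: (n_songs, n_features)
    centers = centroids[list(features)].to_numpy()  # shape: (n_clusters, n_features)
    # Distance from every song to every centroid, shape: (n_songs, n_clusters)
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    closest = distances.argmin(axis=1)
    return pd.Series(centroids.index[closest], index=df.index, name="cluster")
```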
I’d like to thank Analytics Vidhya for this helpful article on K-Means and clustering concepts, and encourage you to approach puzzles like this your own way! Click here for the full code, and thank you for reading!