Scaling Data on an AWS Instance for a Spotify Recommender System

David Hernandez
5 min read · Dec 18, 2021


The Data

As mentioned in my previous entry, the modeling was done with 2% of the data (20,000 samples), which for some might or might not be large enough.

When we talk about big data, 20k samples is definitely not big enough. Our full dataset, with 1 million playlists, 66.3 million tracks across those playlists, and 2.2 million unique tracks, is a serious dataset.

Distribution of the dataset — Each file contains 1 thousand playlists

All the processing for the 2% sample was done locally on a 32GB RAM computer without any problems; the most challenging part was the training and scoring, which took about 3 hours.

When we tried to train and score on the full 1M dataset, it simply did not fit in memory, so we needed to move to an AWS instance.

AWS

AWS Instance

AWS offers a variety of options, so you can use the instance that best fits your needs. As you can see, instances range from the specs of an everyday computer up to a big instance with 384GB of memory (not pictured here).

AWS Size and Pricing

We tried different instances and, after some trial and error, settled on 192GB of memory with 48 vCPUs because it was the lowest-cost instance that could fit our entire dataset in memory.

Even though the price per hour does not seem high, it starts to add up when your processes need to run for hours or days, which was the case for our processing.

AWS Processing

Now that we had everything set up on AWS with the necessary notebooks and data, we started to train the models on the full 1M dataset. What we found after 5 hours of running was that a single train-and-score run was taking 4 hours, which was simply outside our budget and available time (at least 16 days and a thousand dollars).

Keep in mind why this is so time intensive: to select the right model and number of clusters, the 4 models (KMeans, Birch, Agglomerative and Gaussian Mixture) need to be trained with k = 2 to 100, roughly as in the sketch below.
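For reference, here is a minimal sketch of what that sweep looks like with scikit-learn; the feature matrix X and the estimator settings are assumptions, not the exact code we ran.

```python
from sklearn.cluster import KMeans, Birch, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

def candidate_models(k):
    """Return the four candidate clustering models for a given k."""
    return {
        "KMeans": KMeans(n_clusters=k, random_state=42),
        "Birch": Birch(n_clusters=k),
        "Agglomerative": AgglomerativeClustering(n_clusters=k),
        "GaussianMixture": GaussianMixture(n_components=k, random_state=42),
    }

def sweep(X, k_range=range(2, 101)):
    """Fit every model for every k and collect the cluster labels."""
    labels = {}
    for k in k_range:
        for name, model in candidate_models(k).items():
            # All four estimators expose fit_predict, so one loop covers them
            labels[(name, k)] = model.fit_predict(X)
    return labels
```

That is 4 models times 99 values of k, so roughly 400 full fits per pass over the data.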

Why was it taking so long?

Yes, Big-O notation was chasing us. We were using three different scoring functions: Silhouette, Davies-Bouldin and Calinski-Harabasz, with time complexity of O(n²), O(n log n) and O(n log n) respectively. After looking at this, it is no surprise it was taking that long.

Big-O Notation
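For reference, this is how those three scores can be computed with scikit-learn; Silhouette is the expensive one, and its sample_size argument is one way to trade accuracy for speed. X and labels here are placeholders, not our exact variables.

```python
from sklearn.metrics import (
    silhouette_score,        # the costly one: built on pairwise distances
    davies_bouldin_score,    # much cheaper, based on cluster centroids
    calinski_harabasz_score, # also cheap: between/within-cluster dispersion ratio
)

def score_labels(X, labels, sample_size=None):
    """Compute the three internal clustering scores used in this project."""
    return {
        # sample_size lets Silhouette run on a random subset instead of all n points
        "silhouette": silhouette_score(X, labels, sample_size=sample_size, random_state=42),
        "davies_bouldin": davies_bouldin_score(X, labels),
        "calinski_harabasz": calinski_harabasz_score(X, labels),
    }
```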

At this point, we knew that on 2% of the data KMeans with k=17 was the best model, but we did not know whether this would hold true for the rest of the data, and to be fair, 2% was too small a sample.

We needed to regroup and think about the best approach to solve this. After discussing all the ideas and estimating the processing time, we decided to run it with 10% of the data and evaluate the results.

Modeling

The processing lasted 24 hours, and the results were very similar to those for 2% of the data.

Scores between 2% and 10% of the data

Sure enough, the best model and number of clusters were the same: KMeans with k=17. These results gave us the confidence to go straight to the 1M dataset with KMeans and k=17.

After training the model, these are the results. The cluster formation is close to that of the 2% sample, and the same goes for the playlist names.

Clustering 1M playlists
Playlists names for each cluster
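The final fit itself is short. Here is a sketch assuming a playlist feature matrix and a DataFrame of playlist names; both names and the layout are placeholders, not our exact code.

```python
from sklearn.cluster import KMeans

def fit_final_model(X_full, playlists_df, k=17):
    """Fit the winning KMeans configuration and summarize playlist names per cluster.

    X_full: playlist feature matrix (assumed name); playlists_df: pandas DataFrame
    with one row per playlist and a "name" column (assumed layout).
    """
    model = KMeans(n_clusters=k, random_state=42)
    labels = model.fit_predict(X_full)

    labeled = playlists_df.assign(cluster=labels)
    # Ten most common playlist names in each cluster, like the table above
    top_names = labeled.groupby("cluster")["name"].apply(
        lambda names: names.value_counts().head(10)
    )
    return model, top_names
```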

Analysis

Why are we getting similar results?

After analyzing these results, they make total sense: we were careful to take a random sample for the initial 2% of the data, so when scaling to the full 1M it is no surprise to get similar results. At the end of the day, the dataset has the same distribution, and barely any new unique songs are added as playlists are added, as can be seen in the distribution graph at the top of this article.
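That sampling step is worth making explicit. Here is a sketch of drawing a uniform random sample across the slice files (1,000 playlists per file); the file naming follows the dataset's standard layout, and the helper itself is illustrative rather than our exact code.

```python
import json
import random
from pathlib import Path

def sample_playlists(data_dir, fraction=0.02, seed=42):
    """Draw a random ~2% of playlists from every slice file of the dataset."""
    rng = random.Random(seed)
    sample = []
    for slice_file in sorted(Path(data_dir).glob("mpd.slice.*.json")):  # assumed slice-file naming
        with open(slice_file) as f:
            playlists = json.load(f)["playlists"]
        # Sampling within each file keeps memory low and still covers the whole dataset
        sample.extend(rng.sample(playlists, int(len(playlists) * fraction)))
    return sample
```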

What does that mean for our recommender system?

After trying the system with

Lessons learned

Working with a large dataset is never easy, and you need to be as efficient as possible. We had to optimize different parts of our code because we reached a point where it was not possible, or too expensive, to run it with inefficient code.

One of the first things we needed to optimize was the Spotify API calls, moving from a single API call per track to a single API call for 100 tracks. This saved us around 300 hours of processing time.
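As a sketch, batching can be done with spotipy's audio-features endpoint, which accepts up to 100 track IDs per request; the exact endpoint and client setup we used may differ.

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Credentials are read from the SPOTIPY_* environment variables
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

def fetch_audio_features(track_ids, batch_size=100):
    """Fetch audio features in batches of up to 100 IDs instead of one call per track."""
    features = []
    for i in range(0, len(track_ids), batch_size):
        batch = track_ids[i:i + batch_size]
        features.extend(sp.audio_features(tracks=batch))  # one request per batch
    return features
```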

Another area of improvement was building a SQLite database instead of using a JSON file as the database, which saved us time and storage space: from 35GB down to 8GB.
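A minimal sketch of that switch using Python's built-in sqlite3 module; the table layout here is illustrative, not our actual schema.

```python
import sqlite3

def build_tracks_db(db_path, tracks):
    """Store track rows in SQLite instead of one large JSON file.

    tracks: iterable of (track_uri, track_name, artist_name, duration_ms)
    tuples (illustrative schema).
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tracks (
               track_uri   TEXT PRIMARY KEY,
               track_name  TEXT,
               artist_name TEXT,
               duration_ms INTEGER
           )"""
    )
    # Bulk insert; indexed lookups by track_uri replace scanning a huge JSON file
    conn.executemany("INSERT OR IGNORE INTO tracks VALUES (?, ?, ?, ?)", tracks)
    conn.commit()
    conn.close()
```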

Finally, understanding the data and its underlying structure is key when working with large datasets because, as seen in this analysis, we ended up with the same model as with the 2% sample.
