Why Spotify Wrapped Drops in November

Aditya Chheda
Published in GDSC VIT Vellore
8 min read · Jan 7, 2024

Before we begin, no, I do not have an answer to how Pritam showed up among all our top artists yet again this year. That is a man-sized “How?” we’re talking about.

So, back to our burning question: Why November? While it’s tempting to believe that they’re just trying to dodge the inevitability of “Last Christmas” by Wham! topping everyone’s Wrapped, the reality is that this timing is a thoroughly thought-out decision.

With a whopping 574 million active users as of Q3 2023, Spotify isn’t just streaming music; it’s collecting a massive amount of data. Imagine every track played, every song skipped, and every playlist created by all these users throughout the year. The task of analyzing a full year of this data is nothing short of colossal. Releasing Wrapped in November means they can use most of the year’s data, giving them enough time to create something that feels personal for each user.

It is also a genius marketing move. Spotify could easily hold the version prepared for November until New Year’s, but it is smarter than that. In 2020, Wrapped campaigns led to a 20% increase in app downloads. Releasing Wrapped in November creates a buzz when most other platforms are just gearing up for holiday campaigns.

The graph shows user growth peaking with the release of Wrapped. Via MoEngage

Understanding how Spotify pieces together gigantic amounts of data full of countless interactions into our beloved personal Wrapped is vital for us to be able to empathize with their Wrapped team. Before we unravel this, let’s take a step back and start where Spotify does:

Disclaimer: This is our best guess at how Spotify’s procedure for creating Wrapped might work, based on the limited sources available and common practices in Data Analysis.

Data Collection: Consensual for a Change

This step is an enormous undertaking because every interaction from 574 million users is logged over the year in detail. This data lake is not just a simple log of song plays; it includes a collection of metadata such as timestamps, device types, and even locations. To manage this massive, continuously flowing lake of data, Spotify uses database services, namely Google Cloud Bigtable and Apache Cassandra.

Google Cloud Bigtable is a NoSQL database service designed for high performance and scalability. It’s particularly good for handling various types of unstructured data that Spotify generates, such as user interactions and song playbacks. This is crucial for real-time features like song recommendations.

Apache Cassandra is another NoSQL database service built to manage huge amounts of data across many servers, maintaining high performance even as the dataset grows. Cassandra’s ability to add more servers easily (horizontal scaling) is especially useful for Spotify as its user base and data volume continue to grow. It is fault-tolerant, which means the system stays up even if parts of it fail, ensuring Spotify’s service remains stable and uninterrupted.

These systems are specifically designed to handle vast amounts of data with low latency, which is crucial for real-time data processing. This initial collection of data marks the beginning of the pipeline, where petabytes of raw data start their journey towards becoming more interesting and pretty.

A visual representation of a petah-byte

This year, Spotify changed things up by counting music data past October 31st. It’s a small but important win. And for those curious, your offline listening counts too. The songs you download and play offline are included in your Wrapped once you reconnect to the internet. Anything you play for over 30 seconds is counted into your final Wrapped.

Pro tip: Say, you are an ASMR enjoyer (or the musical equivalent would be liking Machine Gun Kelly) and do not want to see “40 Minutes of Collarbone Tapping” on your Wrapped. You can just exclude the playlist from your taste profile, or better yet, start a private session so whatever you’re listening to won’t show up on your top songs but will still be counted in the total listening minutes.
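To make the counting rules above concrete, here is a minimal sketch in Python. It is only our guess at the logic: the `Play` record, the `private_session` flag, and the choice to apply the 30-second threshold to listening minutes as well are all assumptions made for illustration, not Spotify’s actual schema.

```python
from collections import Counter
from dataclasses import dataclass

MIN_PLAY_MS = 30_000  # the "over 30 seconds" rule mentioned above


@dataclass
class Play:
    track: str
    ms_played: int
    private_session: bool = False  # hypothetical flag, not a real Spotify field


def summarize(plays):
    """Return (play counts eligible for top songs, total listening minutes)."""
    top_tracks = Counter()
    total_ms = 0
    for p in plays:
        if p.ms_played <= MIN_PLAY_MS:
            continue  # too short to count (assumed to apply to minutes too)
        total_ms += p.ms_played  # private listening still adds to minutes
        if not p.private_session:
            top_tracks[p.track] += 1  # only non-private plays reach top songs
    return top_tracks, total_ms / 60_000


plays = [
    Play("Last Christmas", 200_000),
    Play("Last Christmas", 10_000),  # skipped after 10 seconds
    Play("40 Minutes of Collarbone Tapping", 2_400_000, private_session=True),
]
top, minutes = summarize(plays)
print(top.most_common(1), round(minutes, 1))
```

The private-session play never appears in `top`, but its 40 minutes still land in the total.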

Data Cleaning and Standardization:

Once Spotify is done collecting the data, the phase of cleaning and standardization comes into play. With the massive amount of data gathered, it’s bound to be dirty (inconsistent). To detect said inconsistencies and correct them, a common practice is to use automated scripts to scan through the dataset. They correct discrepancies like mismatched song metadata and filter out anomalies like accidental song repeats (because nobody willingly repeats an MC Stan song), which could otherwise skew the analysis.

“No, not that kind of cleaning!”
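A cleaning script of the kind described above could look something like this. It is a rough sketch under our own assumptions: the event fields (`user`, `track`, `ts`) and the 5-second repeat window are made up for the example, and a real pipeline would run on a distributed engine rather than a Python list.

```python
def normalise(text):
    # Collapse whitespace and ignore case so "  last  christmas " matches "Last Christmas"
    return " ".join(text.split()).casefold()


def clean(events, repeat_window_s=5):
    """Tidy track names and drop same-user, same-track events that repeat
    within `repeat_window_s` seconds of the previous one."""
    last_seen = {}  # (user, normalised track) -> timestamp of the previous event
    cleaned = []
    for e in sorted(events, key=lambda e: e["ts"]):
        key = (e["user"], normalise(e["track"]))
        prev = last_seen.get(key)
        last_seen[key] = e["ts"]
        if prev is not None and e["ts"] - prev <= repeat_window_s:
            continue  # treated as an accidental repeat or duplicate log line
        cleaned.append({**e, "track": " ".join(e["track"].split())})
    return cleaned


events = [
    {"user": "u1", "track": "Last Christmas", "ts": 100},
    {"user": "u1", "track": "  last  christmas ", "ts": 102},  # repeat within 5 s
    {"user": "u1", "track": "Last Christmas", "ts": 400},      # genuine replay later
]
cleaned = clean(events)
print([e["ts"] for e in cleaned])
```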

The next part of this process is normalization, where data from various sources is standardized to a common format through various techniques like feature scaling and data imputation, making it easier to analyze. Imagine you’re looking at two different types of data from users — the number of songs played and the total hours spent listening. These are measured in entirely different units (count of songs vs. hours). Directly comparing or analyzing these can be misleading because of their different scales.

Feature scaling solves this by adjusting these values to a common scale. A common method (for the sake of explanation) is to use a range, like 0 to 1, where 0 represents the lowest value in the dataset and 1 represents the highest. So, if a user listened to 100 songs, and that’s the highest in the dataset, this would be scaled to 1. If another user listened for 50 hours, and this is the highest for the listening time, it also gets scaled to 1. Now, both of these top values, despite being different in actual terms (100 songs and 50 hours), are comparable on the same scale.
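The 0-to-1 scaling described above is usually called min-max scaling. Here is a tiny sketch with made-up numbers (the lists below are illustrative, not real Spotify data):

```python
def min_max_scale(values):
    """Rescale values so the smallest becomes 0.0 and the largest becomes 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


songs_played = [20, 60, 100]    # counts of songs
hours_listened = [10, 30, 50]   # hours

scaled_songs = min_max_scale(songs_played)
scaled_hours = min_max_scale(hours_listened)
print(scaled_songs, scaled_hours)
```

Both lists end up on the same 0-to-1 scale, so the user with 100 songs and the user with 50 hours each sit at 1.0 for their respective feature.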

Data Imputation is another important technique in this stage. It’s used to fill in missing data, which is inevitable given the scale of Spotify’s operations. For example, if a user’s listening data is missing for a week, complex regression or other predictive models can estimate those missing values based on correlations with other aspects of the user’s listening habits.
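As a toy version of the missing-week example above, the sketch below fits a straight line through the weeks we do have and uses it to estimate the gap. This is a deliberately simple stand-in for the "complex regression" mentioned in the text; the weekly numbers and function names are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute(weekly_hours):
    """Fill None entries with the value predicted by a line fitted to known weeks."""
    known = [(i, h) for i, h in enumerate(weekly_hours) if h is not None]
    slope, intercept = fit_line([i for i, _ in known], [h for _, h in known])
    return [
        h if h is not None else slope * i + intercept
        for i, h in enumerate(weekly_hours)
    ]


weekly_hours = [10.0, 12.0, None, 16.0, 18.0]  # week 3 is missing
filled = impute(weekly_hours)
print(filled)
```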

This ensures that the final dataset is not just large but also complete and consistent, ready for the analysis that follows.

Data Processing and Analysis:

In the next phase, Spotify’s data scientists utilize a clean and standardized dataset to start the intricate process of data processing and analysis. Here, a combination of advanced machine learning algorithms and analytical techniques is employed to make a LOT of sense out of otherwise seemingly useless data. Some key algorithms and techniques used are:

Collaborative Filtering: This machine learning technique is based on the idea that users who agreed in the past will probably agree in the future about certain preferences. In Spotify’s case, collaborative filtering helps predict user preferences based on their listening history and similarities to other users. For instance, if User A and User B have similar tastes in music and User B likes a new song, that song might be recommended to User A. This is also an oversimplification of the start of a situationship.

Clustering Algorithms: These algorithms group users into different clusters based on their listening patterns. For example, one cluster might consist of users who predominantly listen to indie rock, while another cluster might be made up of classical music enthusiasts. Understanding these clusters helps Spotify deliver more targeted and personalized content recommendations.

Natural Language Processing (NLP): NLP plays a significant role, especially in interpreting human inputs, such as search queries, and translating them into a language that machines can understand. This allows the platform to recommend songs and artists that align closely with these queries. Spotify also uses NLP to analyze podcast listening behaviors. It helps Spotify understand the content and context of podcasts by extracting topics, sentiments, and even emotions from the spoken words.
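To ground the User A and User B example from the collaborative filtering point above, here is a bare-bones user-based version using cosine similarity on play counts. It is a sketch of the general technique only; the users, songs, and scoring rule are our own assumptions and say nothing about how Spotify’s recommender is actually built.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two {track: play_count} dictionaries."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(target, plays, top_n=1):
    """Suggest tracks the target has not played, weighted by user similarity."""
    scores = {}
    for other, history in plays.items():
        if other == target:
            continue
        sim = cosine(plays[target], history)
        for track, count in history.items():
            if track not in plays[target]:
                scores[track] = scores.get(track, 0.0) + sim * count
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


plays = {
    "user_a": {"song_1": 5, "song_2": 3},
    "user_b": {"song_1": 4, "song_2": 2, "song_3": 6},  # similar taste to user_a
    "user_c": {"song_4": 9},                            # nothing in common
}
print(recommend("user_a", plays))
```

Because user_b overlaps heavily with user_a, user_b’s extra track outranks anything from user_c, whose similarity is zero.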

Personalization and Visualization: Yassifying Data

Personalization is a pretty complex process enhanced by modern data engineering techniques, particularly through Reverse ETL (Extract, Transform, Load) pipelines. Unlike traditional ETL, which moves data from various sources to a centralized data warehouse for analysis, this advanced method allows Spotify to efficiently move detailed user data from its large-scale data warehouse directly to various systems and applications.

Reverse ETL simplifies Data Activation, which is the process of using the data for specific purposes like personalized marketing or customer experience enhancement. It involves transforming data to be readily accessible and actionable in different environments, such as marketing channels and product experiences.
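Stripped to its bones, a Reverse ETL sync reads rows that already live in the warehouse, reshapes them, and pushes them out to whatever tool needs them. The sketch below fakes both ends with plain Python objects; the row fields, `to_payload`, and the `send` callable are all hypothetical names chosen for the example.

```python
# Hypothetical warehouse rows: one pre-aggregated record per user.
warehouse_rows = [
    {"user_id": "u1", "top_artist": "Pritam", "minutes": 43210},
    {"user_id": "u2", "top_artist": "Wham!", "minutes": 1200},
]


def to_payload(row):
    """Reshape a warehouse row into what a made-up messaging tool might expect."""
    return {
        "recipient": row["user_id"],
        "headline": f"Your top artist was {row['top_artist']}",
        "minutes_listened": row["minutes"],
    }


def sync(rows, send):
    """Push every transformed row to a destination through the `send` callable."""
    for row in rows:
        send(to_payload(row))


outbox = []                       # stands in for a marketing or in-app system
sync(warehouse_rows, outbox.append)
print(outbox[0]["headline"])
```

Swapping `outbox.append` for a real client call is where an actual pipeline would differ; the shape of the flow stays the same.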

Posted by Tejas Manohar on Hightouch, in an interesting, in-depth blog about the science behind Wrapped.

For Visualization, Spotify’s design team collaborates with their data scientists using Reverse ETL to access and translate complex data into engaging and interactive stories, through a variety of appealing visuals. This ensures that the Wrapped experience is not just informative but also aesthetically pleasing, providing users with an intuitive and funky way to explore their year in music.

Quality Assurance:

Before Spotify Wrapped is unveiled to the public, it goes through a rigorous quality assurance process. This stage of the pipeline involves extensive testing across different devices and platforms to ensure both consistency and optimal performance. The QA team checks for data accuracy, visual consistency, and functional correctness. They simulate various user scenarios to ensure the Wrapped feature performs reliably under all possible conditions.

The 2021 QA team was clearly not competent enough because the design that year was a COMPLETE disaster.

Made me want to switch to Apple Music for a solid second.

Cue Exit Music (For A Film):

As a little going-away gift for the geeks out there, you can email Spotify asking for your “endsongs.json” file. It’s a bit of a hassle, but you’ll get a ton of detailed statistics about your listening habits. It’s a cool way to get even more insight into the development of your taste in music over the year. I personally wouldn’t check mine because my data is surely tainted by the countless times my grandma played Om Bhur Bhuva Swaha on Alexa.
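If you do get the file, a few lines of Python are enough to build a mini Wrapped of your own. The sketch below assumes each entry carries an artist name and a play duration in milliseconds; the field names `master_metadata_album_artist_name` and `ms_played` are what we believe the export uses, so check them against your own file. The sample data is invented so the example runs without a real export.

```python
import json
from collections import Counter

# Stand-in for the exported file. With a real export you would load the
# JSON from disk instead of from this string.
raw = """
[
  {"master_metadata_album_artist_name": "Del Water Gap", "ms_played": 210000},
  {"master_metadata_album_artist_name": "Dominic Fike", "ms_played": 180000},
  {"master_metadata_album_artist_name": "Del Water Gap", "ms_played": 150000},
  {"master_metadata_album_artist_name": null, "ms_played": 5000}
]
"""


def minutes_by_artist(entries):
    """Total listening minutes per artist, skipping rows without an artist."""
    totals = Counter()
    for entry in entries:
        artist = entry.get("master_metadata_album_artist_name")
        if artist:
            totals[artist] += entry.get("ms_played", 0) / 60_000
    return totals


totals = minutes_by_artist(json.loads(raw))
top_artist, top_minutes = totals.most_common(1)[0]
print(top_artist, top_minutes)
```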

So, when you check out your Spotify Wrapped this year, remember that there’s a lot of work that goes into it. From handling huge amounts of data to making smart marketing moves, there’s a reason why it hits your app in November. It’s not just a pretty summary of your music taste; it’s somewhat of a cultural event for our generation (we are doomed), many of whom use their Wrapped results to flaunt, joke, and connect. In 2023, over 122 million users shared their Wrapped stories on social media, generating 5 billion views, which only proves the sheer brilliance of Spotify’s marketing team.

At the end of the day, I know what I was grateful for last year — Del Water Gap and Dominic Fike dropping before November.
