From Zero to dbt: How to Analyze and Build Data Models from Spotify’s Million Playlist Data

Douenergy · Published in In the Pipeline · Apr 12, 2024 · 10 min read

Spodbtify — Data modeling in dbt with the Spotify Million Playlist Dataset

Over the coming weeks, I will guide you through using dbt (data build tool) to turn Spotify's Million Playlist Dataset into an end-to-end analytics project. You will learn how to transform real-world, middle-small-big raw data (a nod to George Orwell's lower-upper-middle class 😄) into an interactive data model that you can use to explore and discover data insights.

What you’ll learn

  1. Transforming 31GB of raw JSON data into a 5.7GB Parquet file for efficiency and scalability.
  2. Converting the Parquet file into multiple dbt models for in-depth exploration and analysis.
  3. Understanding why using dbt is a best practice for data transformation processes.
  4. Learning how to validate every dbt model change to ensure data integrity and accuracy (Spoiler: Just use Recce — an open source dbt model code review tool).

Answer important questions from the data

Each week, I’ll post two questions about the Spotify dataset on the Recce LinkedIn page, such as:

  • How many playlists contain at least 3 Taylor Swift songs?
  • What are Jay Chou's top 10 most popular songs?
  • How many playlists contain both BLACKPINK 💗 and Post Malone?

I will then implement the question that receives the most votes in the open source repository.

Who’s it for?

Anyone interested in learning how to use dbt for data transformation, whether for business BI or ML, is welcome to join us. Additionally, I’ll share some dbt best practices along the way that will be beneficial for your continued work as a data or analytics engineer.

Preparing the one-million-playlist dataset

Ready to get started? Good. Let's get to it. For this project, we'll be using the Spotify Million Playlist Dataset.

The Spotify Million Playlist Dataset

“In 2018, Spotify helped organize the RecSys Challenge 2018, a data science research challenge focused on music recommendation… As part of that challenge, Spotify introduced The Million Playlist Dataset: a dataset of 1 million playlists consisting of over 2 million unique tracks by nearly 300,000 artists.” — Source

Download the dataset

Following Spotify's policy, you need to register and download the raw data from here. We'll use spotify_million_playlist_dataset.zip (which is 5.4 GB 👀).

This zip expands to 31GB, so make sure you’ve got enough space for it! (Don’t worry, the conversion to Parquet later will shrink it back down)

31 Gigabytes!

After downloading and decompressing the dataset, you’ll discover it’s organized into a thousand partitioned JSON files in the data folder. These files are named with the following pattern:

  • mpd.slice.0-999.json
  • mpd.slice.1000-1999.json
  • …
  • mpd.slice.999000-999999.json
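
If you want to double-check that all one thousand slices survived the download and extraction, a quick count from inside the spotify_million_playlist_dataset folder will confirm it:

$ ls data/mpd.slice.*.json | wc -l
1000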

A typical example of a playlist entry from one of the slice JSON files looks like this:

Example of a playlist entry from the Spotify Million Playlist Dataset

Each JSON file in the dataset represents 1,000 playlists, collectively amounting to a total of 1 million playlists. The file prefix “mpd” stands for “Million Playlist Dataset”.

The Spotify team has provided scripts in the ./src folder to help validate the integrity of these JSON files using MD5 checksums, and to compute some basic statistics with the command:

$ python src/stats.py data

As noted in Spotify's README documentation, the output of this program should align with the contents of stats.txt. The execution time of stats.py may vary, potentially exceeding 30 minutes depending on the performance of your laptop.
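
One simple way to compare the two is to capture the output and diff it against the provided stats file (a sketch; it assumes stats.py writes its report to stdout):

$ python src/stats.py data > my_stats.txt
$ diff my_stats.txt stats.txt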

Preliminary Insights

We'll start by doing some exploration to get preliminary insights and understand the dataset better. This will be done with the raw JSON; however, for more advanced data interaction, converting the data into a more efficient format will be necessary. We'll use Parquet for that (detailed below), which is ideal for use with dbt and for transforming the raw data.

Converting the dataset to Parquet is considered a best practice for optimizing data processing and analysis workflows.

Let's rewind to the beginning with Spotify's 1,000 raw data files, and I'll show you the sweat and tears involved in a typical data analysis workflow.

DuckDB with jq

DuckDB and jq are both excellent tools for interacting with JSON data. Use your preferred package manager to install these awesome tools. For example:

$ brew install duckdb
$ brew install jq

Understand the structure of the JSON

We can use jq to take a quick glance at the JSON data, followed by employing DuckDB for a more in-depth analysis. Let’s take a look at one of the 1000 files:

$ cd spotify_million_playlist_dataset/data
$ jq 'keys' mpd.slice.0-999.json
Keys in each of the JSON files

All slices of the JSON contain just two keys.

  • info — this is merely the JSON file’s metadata
  • playlists — the data we actually care about
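
In other words, the jq 'keys' command above prints nothing more than:

[
  "info",
  "playlists"
]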

As you can probably guess, "playlists" is an array, so let's take a look at the first element of the playlists array.

The playlists key holds an array of playlist objects

Partial playlist data (showing just the first track) would look like this:

A partial example from one of the playlists
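
If you want to reproduce a similar partial view yourself, a jq one-liner along these lines will do it (a sketch, not necessarily the exact command behind the screenshot): it picks the first playlist and trims its tracks array down to the first entry.

$ jq '.playlists[0] | .tracks = .tracks[0:1]' mpd.slice.0-999.json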

Analysis with DuckDB

Once we’ve obtained an overview of the JSON data’s structure, we can begin to analyze the data using DuckDB. DuckDB provides powerful features, including automatically detecting the SQL data types within your JSON file. This allows you to effortlessly apply SQL syntax for your analysis.

Open the DuckDB Interactive Shell

By typing duckdb in your terminal, you will enter an interactive shell, similar to those provided by psql for PostgreSQL and the SQLite shell.

$ duckdb

Adjust DuckDB’s maximum_object_size

Run the following command and you’ll see the error below:

SELECT * FROM read_json_auto('./mpd.slice.0-999.json');

-- Invalid Input Error: "maximum_object_size" of 16777216 bytes exceeded
-- while reading file "./mpd.slice.0-999.json" (>33554428 bytes).
-- Try increasing "maximum_object_size".

This error occurs because the Million Playlist Dataset's sliced JSON files are larger than DuckDB's default maximum_object_size, so we need to fix that by increasing it to 40MB 🫰:

SELECT * 
FROM read_json_auto('./mpd.slice.0-999.json', maximum_object_size = 40000000);

┌──────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ info │ playlists │
│ struct(generated_o… │ struct("name" varchar, collaborative varchar, pid bigint, modified_at bigint, num_tracks bigint, num_albums bigint, num_followers bigint, tracks struct(pos bi… │
├──────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {'generated_on': 2… │ [{'name': Throwbacks, 'collaborative': false, 'pid': 0, 'modified_at': 1493424000, 'num_tracks': 52, 'num_albums': 47, 'num_followers': 1, 'tracks': [{'pos': … │
└──────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

UNNEST the JSON

We are really only interested in the playlists column, which is a nested structure. Therefore, we can use UNNEST to normalize the playlists column:

SELECT UNNEST(playlists) 
FROM read_json_auto('./mpd.slice.0-999.json', maximum_object_size = 40000000)
LIMIT 5;

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ unnest(playlists) │
│ struct("name" varchar, collaborative varchar, pid bigint, modified_at bigint, num_tracks bigint, num_albums bigint, num_followers bigint, tracks struct(pos bigint, artist_name varch… │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {'name': Throwbacks, 'collaborative': false, 'pid': 0, 'modified_at': 1493424000, 'num_tracks': 52, 'num_albums': 47, 'num_followers': 1, 'tracks': [{'pos': 0, 'artist_name': Missy … │
│ {'name': Awesome Playlist, 'collaborative': false, 'pid': 1, 'modified_at': 1506556800, 'num_tracks': 39, 'num_albums': 23, 'num_followers': 1, 'tracks': [{'pos': 0, 'artist_name': … │
│ {'name': korean , 'collaborative': false, 'pid': 2, 'modified_at': 1505692800, 'num_tracks': 64, 'num_albums': 51, 'num_followers': 1, 'tracks': [{'pos': 0, 'artist_name': Hoody, 't… │
│ {'name': mat, 'collaborative': false, 'pid': 3, 'modified_at': 1501027200, 'num_tracks': 126, 'num_albums': 107, 'num_followers': 1, 'tracks': [{'pos': 0, 'artist_name': Camille Sai… │
│ {'name': 90s, 'collaborative': false, 'pid': 4, 'modified_at': 1401667200, 'num_tracks': 17, 'num_albums': 16, 'num_followers': 2, 'tracks': [{'pos': 0, 'artist_name': The Smashing … │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

DuckDB provides some really handy options in the UNNEST function, including recursive := true, which will recursively unnest the column:

SELECT UNNEST(playlists, recursive := true)
FROM read_json_auto('./mpd.slice.0-999.json', maximum_object_size = 40000000)
LIMIT 5;
┌──────────────────┬───────────────┬───────┬─────────────┬────────────┬────────────┬───────────────┬────────────────────────────────┬───────────┬─────────────┬─────────────┬─────────────┐
│ name │ collaborative │ pid │ modified_at │ num_tracks │ num_albums │ num_followers │ tracks │ num_edits │ duration_ms │ num_artists │ description │
│ varchar │ varchar │ int64 │ int64 │ int64 │ int64 │ int64 │ struct(pos bigint, artist_na… │ int64 │ int64 │ int64 │ varchar │
├──────────────────┼───────────────┼───────┼─────────────┼────────────┼────────────┼───────────────┼────────────────────────────────┼───────────┼─────────────┼─────────────┼─────────────┤
│ Throwbacks │ false │ 0 │ 1493424000 │ 52 │ 47 │ 1 │ [{'pos': 0, 'artist_name': M… │ 6 │ 11532414 │ 37 │ │
│ Awesome Playlist │ false │ 1 │ 1506556800 │ 39 │ 23 │ 1 │ [{'pos': 0, 'artist_name': S… │ 5 │ 11656470 │ 21 │ │
│ korean │ false │ 2 │ 1505692800 │ 64 │ 51 │ 1 │ [{'pos': 0, 'artist_name': H… │ 18 │ 14039958 │ 31 │ │
│ mat │ false │ 3 │ 1501027200 │ 126 │ 107 │ 1 │ [{'pos': 0, 'artist_name': C… │ 4 │ 28926058 │ 86 │ │
│ 90s │ false │ 4 │ 1401667200 │ 17 │ 16 │ 2 │ [{'pos': 0, 'artist_name': T… │ 7 │ 4335282 │ 16 │ │
└──────────────────┴───────────────┴───────┴─────────────┴────────────┴────────────┴───────────────┴────────────────────────────────┴───────────┴─────────────┴─────────────┴─────────────┘

This is very convenient for deeply nested JSON data.

Combine the JSON into a single table

Currently, we are only dealing with a single JSON file. What if we want to combine all 1,000 sliced JSON files into a single table?

DuckDB provides glob syntax that allows you to read multiple JSON files in one pass. You simply modify './mpd.slice.0-999.json' to './mpd.slice*.json'.

CREATE TABLE playlists AS 
SELECT UNNEST(playlists, recursive := true)
FROM read_json_auto('./mpd.slice*.json', maximum_object_size = 40000000);

It takes 30 seconds to create the playlists DuckDB table on my laptop (M3 MacBook). We're now ready to convert the data to Parquet.
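
A quick sanity check before moving on: since UNNEST produces one row per playlist, the table should contain exactly one million rows.

SELECT count(*) FROM playlists;
-- expected: 1000000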

Convert to Parquet

The conversion process may require some temporary files to avoid an out-of-memory error, so, still inside the DuckDB shell, first run:

SET temp_directory='./tmp';

You're now ready to export the playlists DuckDB table to a Parquet file, which is also very simple. When you use the COPY command and specify a file with the .parquet extension, DuckDB automatically knows you want to export it as a Parquet file.

COPY playlists TO 'playlists.parquet';

Easy, huh?
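
To confirm the export worked, you can query the Parquet file directly; DuckDB treats the file path just like a table. A minimal check (a sketch):

DESCRIBE SELECT * FROM 'playlists.parquet';
SELECT count(*) FROM 'playlists.parquet'; -- should match the row count of the playlists table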

Parquet vs. JSON

So, why Parquet?

The Parquet logo

Parquet is a superior file format for analytics, offering notable advantages over JSON with its columnar storage approach. This design enhances data compression and encoding, reducing storage space and speeding up data access for data analysis workflows.
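
You can feel that columnar advantage directly in DuckDB: selecting only a couple of columns from the Parquet file scans just those columns instead of parsing every JSON object in full. For example (a sketch):

SELECT name, num_tracks
FROM 'playlists.parquet'
ORDER BY num_tracks DESC
LIMIT 5;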

It also supports complex nested data structures and offers flexible schema evolution, enabling new columns to be added without modifying existing data. This makes it an ideal format for scenarios where schema changes over time are common.

Moreover, its compatibility with major data warehouses makes Parquet especially valuable for dbt users, streamlining data integration and analysis workflows. You can easily import and export Parquet files from your favorite data warehouse.

With DuckDB and jq, analyzing gigabytes of JSON data becomes simple on your laptop.

Repeat “ON — YOUR — LAPTOP” 💻

Summary

The original dataset consisted of 1,000 JSON files, totaling 31GB.

Original size of the dataset: 31GB

After executing three simple DuckDB queries and spending about a minute of processing time, you end up with a single 5.7GB Parquet file, roughly a 5x reduction in size.

After converting to Parquet: 5.7GB

Now you can answer questions like "How many Taylor Swift songs are on the playlists?" in seconds, on your laptop, which is mind-blowing.

Query the Parquet with DuckDB
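
The exact query from the screenshot isn't reproduced here, but a query in the same spirit looks like this (a sketch; tracks is still a nested list inside each playlist row, so we unnest it first):

SELECT count(*) AS taylor_swift_tracks
FROM (
    SELECT UNNEST(tracks, recursive := true)
    FROM 'playlists.parquet'
)
WHERE artist_name = 'Taylor Swift';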

That's it for part one of this data analytics endeavor. Come back for part two soon.

Next time …

dbt — “Data Build Tool”

dbt is jQuery, not Terraform. — Kshitij Aranke

Having successfully transitioned our dataset into a Parquet file, our next endeavor will be to harness the power of dbt to unlock deeper, more compelling analytical insights from the One Million Playlists dataset.

Apply software engineering best practices to data

While the DuckDB shell offers us the capability to interactively analyze our dataset, the need for a more structured, collaborative, and version-controlled approach to our SQL transformations becomes apparent.

This is where dbt shines 🤩. dbt allows us to treat our data transformations as code, which means we can apply software engineering practices like version control, code review (Recce shines 💖), and automated testing to our data workflows.
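
To make that concrete, here is a minimal sketch of what a dbt model file for this project could look like. The model and source names (stg_playlists, spotify) are hypothetical placeholders, not names from the actual repository:

-- models/staging/stg_playlists.sql (hypothetical path and names, shown only as a sketch)
SELECT
    pid,
    name AS playlist_name,
    num_tracks,
    num_followers
FROM {{ source('spotify', 'playlists') }} -- assumes a 'spotify' source pointing at the playlists data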

Build on each other’s work

By organizing multiple SQL queries into logical dbt models, we not only enhance the clarity and maintainability of our data transformations, but also enable our data team to build upon each other’s work iteratively. This collaborative approach ensures that our data models are robust, accurate, and reflect the latest business logic and analytical insights.

A trustworthy environment

Moreover, dbt’s documentation features allow us to generate comprehensive documentation of our data models automatically, making it easier for new team members to understand the data landscape and for stakeholders to trust our data-driven decisions.

Data-driven development

In summary, dbt equips us with the necessary tools to efficiently manage our SQL transformations, fostering a collaborative and iterative data culture that is essential for staying competitive in today’s data-driven world.

See you in Part 2

Make sure to follow us on LinkedIn to stay updated and catch more exciting developments! 🤩

Update: Part 2 is now available. Follow along with my sample project as we take a look at why dbt is great for modeling data projects. Oh, and we also answer some of those important Spotify playlist questions!
