From Zero to dbt: Part 2 — Modeling Spotify’s Million Playlist Dataset

Douenergy · Published in In the Pipeline · 12 min read · May 13, 2024

In the first part of our Spodbtify series, we tackled the challenge of converting Spotify’s 31GB Million Playlist Dataset into a more manageable 5GB Parquet file. That transformation streamlines the data and makes it significantly easier to manipulate for analysis.

In this part, we will do the following:

  • Explore the dataset with DuckDB to lay the groundwork for modeling.
  • Introduce dbt and build staging and mart models on top of the data.
  • Generate dbt docs and inspect the lineage DAG.
  • Take a sneak peek at Recce for validating changes to the pipeline.

Now, we can explore the more interesting aspects of the Spotify Million Playlist dataset. 🎧

Data analysis groundwork

Our aim is to create a dbt pipeline from the Spotify MPD data, but first we need to explore the dataset further to deepen our understanding of it. This detailed analysis is essential for creating accurate models in dbt.

We’ll continue to use DuckDB to learn more about the dataset and help us to:

  • Identify the presence of missing or NULL values.
  • Confirm the datatype of each column in the dataset.

These steps will provide a solid foundation for further analysis.

Describe and Summarize

We’ll be using the DESCRIBE and SUMMARIZE commands. Make sure your playlists.parquet file (Part 1 explains how to generate this) is located in the current directory, and then start the DuckDB shell by typing duckdb in your terminal.

The DESCRIBE command in DuckDB reads the metadata to provide information about the datatype, null constraints, and other attributes of each column in the dataset:

D DESCRIBE SELECT * FROM './playlists.parquet';
┌───────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├───────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ name │ VARCHAR │ YES │ │ │ │
│ collaborative │ VARCHAR │ YES │ │ │ │
│ pid │ BIGINT │ YES │ │ │ │
│ modified_at │ BIGINT │ YES │ │ │ │
│ num_tracks │ BIGINT │ YES │ │ │ │
│ num_albums │ BIGINT │ YES │ │ │ │
│ num_followers │ BIGINT │ YES │ │ │ │
│ tracks │ STRUCT(pos BIGINT, artist_name VARCHAR, track_uri VARCHAR, artist_uri VARCHAR, track_name VARCHAR, album… │ YES │ │ │ │
│ num_edits │ BIGINT │ YES │ │ │ │
│ duration_ms │ BIGINT │ YES │ │ │ │
│ num_artists │ BIGINT │ YES │ │ │ │
│ description │ VARCHAR │ YES │ │ │ │
├───────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┤
│ 12 rows 6 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The data types all appear to be correctly assigned.

The SUMMARIZE command calculates statistics for each column in the dataset, such as the percentage of null values, the minimum, the maximum, and more.

SUMMARIZE SELECT * FROM './playlists.parquet'

Tip: Since we only want a broad overview, use the SAMPLE clause to summarize just a portion of the data. This is faster than summarizing the entire dataset and still provides useful insights.

IMHO, DuckDB’s SAMPLE clause is more intuitive than the traditional TABLESAMPLE clause used in other databases. This simplicity makes it easier to quickly obtain a sample of the data for preliminary analysis:

D SUMMARIZE SELECT * FROM './playlists.parquet' USING SAMPLE 10%;
┌───────────────┬──────────────────────┬──────────────────────┬──────────────────────┬───────────────┬───┬────────────┬────────────┬────────┬─────────────────┐
│ column_name │ column_type │ min │ max │ approx_unique │ … │ q50 │ q75 │ count │ null_percentage │
│ varchar │ varchar │ varchar │ varchar │ int64 │ │ varchar │ varchar │ int64 │ decimal(9,2) │
├───────────────┼──────────────────────┼──────────────────────┼──────────────────────┼───────────────┼───┼────────────┼────────────┼────────┼─────────────────┤
│ name │ VARCHAR │ // indie // * │ 🦋🦋🦋 │ 33544 │ … │ │ │ 126976 │ 0.00 │
│ collaborative │ VARCHAR │ false │ true │ 2 │ … │ │ │ 126976 │ 0.00 │
│ pid │ BIGINT │ 11000 │ 982759 │ 126281 │ … │ 450960 │ 703797 │ 126976 │ 0.00 │
│ modified_at │ BIGINT │ 1276214400 │ 1509494400 │ 1958 │ … │ 1490297453 │ 1505519352 │ 126976 │ 0.00 │
│ num_tracks │ BIGINT │ 5 │ 250 │ 246 │ … │ 49 │ 92 │ 126976 │ 0.00 │
│ num_albums │ BIGINT │ 2 │ 239 │ 232 │ … │ 37 │ 68 │ 126976 │ 0.00 │
│ num_followers │ BIGINT │ 1 │ 10911 │ 182 │ … │ 1 │ 1 │ 126976 │ 0.00 │
│ tracks │ STRUCT(pos BIGINT,… │ [{'pos': 0, 'artis… │ [{'pos': 0, 'artis… │ 126980 │ … │ │ │ 126976 │ 0.00 │
│ num_edits │ BIGINT │ 1 │ 201 │ 183 │ … │ 10 │ 22 │ 126976 │ 0.00 │
│ duration_ms │ BIGINT │ 235739 │ 635073792 │ 123833 │ … │ 11436494 │ 21428596 │ 126976 │ 0.00 │
│ num_artists │ BIGINT │ 3 │ 238 │ 212 │ … │ 29 │ 52 │ 126976 │ 0.00 │
│ description │ VARCHAR │ !! │ 😜😈🌞 │ 2313 │ … │ │ │ 126976 │ 98.15 │
├───────────────┴──────────────────────┴──────────────────────┴──────────────────────┴───────────────┴───┴────────────┴────────────┴────────┴─────────────────┤
│ 12 rows 12 columns (9 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Findings

After using the SUMMARIZE and DESCRIBE commands, we can observe the following:

  • pid is probably the primary key.
  • Apart from the optional description column, the null percentage is essentially zero, indicating that Spotify released a fairly clean dataset.
  • DuckDB’s type inference has performed well.

Verify the Primary Key

Let’s verify that pid is the primary key by running the following SQL query in DuckDB:

D SELECT COUNT(DISTINCT pid) FROM './playlists.parquet';

┌─────────────────────┐
│ count(DISTINCT pid) │
│        int64        │
├─────────────────────┤
│             1000000 │
└─────────────────────┘

This query counts the distinct pid values. The result exactly matches the total number of rows in the playlists file (1,000,000), so every row has a unique pid. We’ve verified our primary key.
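If you prefer to compare both numbers in a single query, a small sketch like the following also works in DuckDB (the aliases are just illustrative):

SELECT
    COUNT(*)                       AS total_rows,     -- total playlists in the file
    COUNT(DISTINCT pid)            AS distinct_pids,  -- unique pid values
    COUNT(*) = COUNT(DISTINCT pid) AS pid_is_unique   -- true only if pid never repeats
FROM './playlists.parquet';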

Why dbt?

We could continue analyzing the data like this, but the nested tracks column makes it less convenient. That means it’s time to introduce dbt and model the data properly so we can perform better analysis.

dbt also has the following benefits, which makes it perfect for teams and growing data projects:

  • Version controlled — Managing code versions with branches is at the core of dbt, which means you can fully test code and data before merging branches into main and pushing data to prod.
  • Scalable and maintainable — dbt’s standard project structure keeps a growing project organized and easier to maintain.
  • Repeatable — Organizing SQL into logical blocks (models) makes transformations repeatable.
  • Testable — dbt provides built-in testing, which gives a good foundation for data integrity (see the sketch after this list). We’ll also be augmenting dbt tests with Recce data checks later on.
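As a small taste of that testability, here is a hypothetical singular test (a .sql file placed in the project’s tests/ directory) that fails if the pid column of the playlists model we build later ever stops being unique. It’s a sketch, not part of the Spodbtify repo:

-- tests/assert_pid_is_unique.sql (hypothetical example)
-- dbt treats every row returned by a singular test as a failure,
-- so this test passes only when no pid appears more than once.
SELECT
    pid,
    COUNT(*) AS occurrences
FROM {{ ref('playlists') }}
GROUP BY pid
HAVING COUNT(*) > 1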

Let’s get stuck in

Nothing is better than just doing! Let’s see how dbt organizes SQL code, elevating data engineering to adhere more closely to software engineering best practices.

dbt 📊+ Spotify 🎧 => Spodbtify

The only preliminary dependency we need is dbt-duckdb, so just install it in your favorite Python virtual environment:

pip install dbt-duckdb  # This installs both dbt-core and dbt-duckdb

Sample Repository

For beginners, it’s simpler to clone this repository and start from here, rather than creating a new project with dbt init. This approach saves time and streamlines the learning process. Additionally, I’ve included some useful Makefile commands to facilitate project management.

git clone https://github.com/douenergy/Spodbtify
cd Spodbtify

Database

As mentioned earlier, dbt orchestrates our SQL code. Therefore, there must be a database in place to execute our SQL, right? Indeed, we’ll use DuckDB for this purpose 🦆.

There is a profiles.yml in the project root that contains the configuration for the project:

  • spodbtify is the profile name
  • outputs specifies different environments (e.g., dev and prod)
  • type indicates the database type (DuckDB)
  • path defines the file path for the database
  • threads specifies the number of threads to use
  • The target section specifies the default environment to use.

DuckDB operates similarly to SQLite — it’s a single-file database where all your data is efficiently stored within one file.

spodbtify:
  outputs:
    dev:
      type: duckdb
      path: 'dev.duckdb'
      threads: 4

    prod:
      type: duckdb
      path: 'prod.duckdb'
      threads: 24

  target: dev

Models directory

There are several directories in the dbt project, but the most important one will always be the models directory. We'll skip the other directories for now and will revisit them another day.

What are dbt models?

A model is a piece of SQL code, and each file in the models folder represents a specific data transformation, like aggregating data or joining tables. Models help organize these transformations, simplifying data management and change tracking.

People might choose to write assembly or machine code, but the majority prefer using Python for programming. Likewise, while you could opt for raw SQL for transformations, many people turn to dbt for this purpose.

Why would you choose to use dbt models, instead of writing raw SQL?

Raw SQL versus dbt model

Let’s explore how to address a question about the MPD dataset by writing raw SQL and then compare it to using dbt.

I am a Taylor Swift fan, so diving into a dbt project analyzing Spotify’s Million Playlist dataset was a dream come true (BTW, Taylor Swift releasing “Fortnight” while I was writing this series was truly amazing, just as expected).

When I first got my hands on the MPD data, one burning question 🔥 consumed my thoughts:

“What’s the percentage of playlists with a Taylor Swift song?”

Using raw SQL

To address the query, let’s denormalize (unnest the nested column) the raw Parquet file generated in Spodbtify Part 1. Starting from the raw Parquet file, we iteratively extract data, enabling us to answer any question about the dataset.

Mentally model the data

Mentally, we envision a graphical representation akin to a Directed Acyclic Graph (DAG).

We can accomplish this task solely using SQL, and then execute that SQL within DuckDB:
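The embedded query isn’t reproduced here, but a minimal single-query sketch (using the column names from the DESCRIBE output above; the exact SQL in the original gist may differ) looks something like this:

-- Unnest the tracks STRUCT list, then aggregate (a sketch, not the original gist).
WITH playlists AS (
    SELECT pid, tracks
    FROM './playlists.parquet'
),
tracks AS (
    SELECT
        pid,
        UNNEST(tracks, recursive := true)  -- explode the STRUCT list into one row per track
    FROM playlists
),
swifties AS (
    SELECT
        (SELECT COUNT(*) FROM playlists) AS total_playlists,
        COUNT(DISTINCT pid)              AS taylor_swift_playlists
    FROM tracks
    WHERE artist_name = 'Taylor Swift'
)
SELECT
    total_playlists,
    taylor_swift_playlists,
    taylor_swift_playlists / total_playlists AS taylor_swift_playlist_ratio  -- integer division yields a DOUBLE in DuckDB
FROM swifties;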

While this works, there are limitations that would become more and more apparent as the complexity of your project increases.

  • Lack of version control, making it challenging to track and manage changes over time.
  • Readability can suffer, particularly with complex queries, hindering comprehension.
  • Understanding the DAG (Directed Acyclic Graph) relation between Common Table Expressions (CTEs) becomes difficult, impeding debugging efforts.
  • As CTEs proliferate 🦠 , maintenance can become unwieldy, and their reusability in addressing different Spotify-related queries may be limited.

Using dbt models

Using dbt, we can break down these lengthy CTEs into three separate models, each focusing on a specific aspect of the transformation process.

The modular approach

This modular approach enhances code organization and readability, simplifies maintenance, and facilitates reusability of code components across different transformations.

Here’s the idea: the raw SQL is broken up into the logical SQL blocks that form the dbt models, which you’ll see sketched in the Staging and Marts sections below.

Encapsulated logic

By encapsulating related logic within individual models, dbt promotes a more structured and manageable development workflow, enabling data engineers to efficiently build and maintain complex data pipelines with greater ease and flexibility.

These three models (again, think of them as pieces of SQL code) are situated within the models folder:

  • Playlists and Tracks are located in models/staging
  • Swifties resides in models/marts

Why do we create Staging and Mart subdirectories?

Within the models directory, dbt recommends separating models into staging and mart layers, each model still being a piece of SQL code.

Staging models

Staging models, located in models/staging, act as an intermediary step where raw data is loaded and prepared for further transformation. This ensures data consistency and integrity before moving on to more complex transformations.

Playlists (staging model)
Tracks (staging model)
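The model screenshots aren’t reproduced here, but rough sketches of the two staging models (assuming the source defined in the Sources section below and the column names from the DESCRIBE output; the repo’s actual SQL may differ) could look like this:

-- models/staging/playlists.sql (sketch)
-- One row per playlist, with the nested tracks column dropped.
SELECT
    pid,
    name,
    collaborative,
    modified_at,
    num_tracks,
    num_albums,
    num_artists,
    num_followers,
    num_edits,
    duration_ms,
    description
FROM {{ source('external_source', 'raw_spodbtify_playlists') }}

-- models/staging/tracks.sql (sketch)
-- One row per playlist/track pair, produced by unnesting the tracks STRUCT list.
SELECT
    pid,
    UNNEST(tracks, recursive := true)
FROM {{ source('external_source', 'raw_spodbtify_playlists') }}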

Marts models

Mart models, found in models/marts, contain finalized datasets optimized for analysis. This separation promotes modularity and scalability, facilitating efficient data processing from raw to analytical-ready datasets.

Swifties (mart model)
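Again, the screenshot isn’t reproduced, but here is a sketch of the mart model, which reuses the staging models through ref() instead of re-reading the Parquet file (the repo’s actual SQL may differ):

-- models/marts/swifties.sql (sketch)
-- DuckDB returns a DOUBLE for integer division, so no explicit cast is needed.
SELECT
    (SELECT COUNT(*) FROM {{ ref('playlists') }})                       AS total_playlists,
    COUNT(DISTINCT pid)                                                 AS taylor_swift_playlists,
    COUNT(DISTINCT pid) / (SELECT COUNT(*) FROM {{ ref('playlists') }}) AS taylor_swift_playlist_ratio
FROM {{ ref('tracks') }}
WHERE artist_name = 'Taylor Swift'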

Sources

You might also observe a difference in the staging models — we’re not directly reading playlists.parquet from a hardcoded location, as commonly done in raw SQL, but instead referencing a source:

{{ source('external_source', 'raw_spodbtify_playlists') }}

Sources are defined in source.yaml. There, the external location is read from an environment variable named SPODBTIFY_SOURCE_PARQUET, which dbt substitutes automatically, and the file is given an alias (raw_spodbtify_playlists) that is easy to reference within the models.

models/staging/source.yaml

sources:
  - name: external_source
    meta:
      external_location: "{{ env_var('SPODBTIFY_SOURCE_PARQUET') }}"
    tables:
      - name: raw_spodbtify_playlists
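One thing to call out: dbt’s env_var() will raise an error if SPODBTIFY_SOURCE_PARQUET isn’t set, so unless the repo’s Makefile sets it for you, export it first and point it at your Parquet file, for example:

export SPODBTIFY_SOURCE_PARQUET='./playlists.parquet'  # adjust the path to wherever your file lives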

Everything is set, so now simply type the command:

dbt run --target dev 

The --target dev flag selects the dev output defined in profiles.yml, telling dbt to execute all transformations against the dev.duckdb database. This command compiles the models' code and executes all the transformations we've written.

Now, if you open the newly generated dev.duckdb database you can query the new models:

duckdb dev.duckdb 
D SELECT * FROM Swifties;
┌─────────────────┬────────────────────────┬─────────────────────────────┐
│ total_playlists │ taylor_swift_playlists │ taylor_swift_playlist_ratio │
│      int64      │         int64          │           double            │
├─────────────────┼────────────────────────┼─────────────────────────────┤
│         1000000 │                  40701 │                    0.040701 │
└─────────────────┴────────────────────────┴─────────────────────────────┘

Ta-dah! We’ve found the Swifties! 😊 But this is only my second-favorite aspect of dbt.

dbt docs

Remember, we adhere to a DAG mental model, guiding us step by step through data transformations until we arrive at a final dataset capable of answering our questions. dbt automates this process effortlessly with just a simple command.

dbt docs generate && dbt docs serve

dbt docs generate creates your project's documentation website, while dbt docs serve starts a web server on port 8080.

Here’s the lineage graph automatically generated by dbt docs, closely resembling our mental model, right? Because we use ref (and source) inside the model code, dbt can automatically figure out the DAG dependencies. This feature is quite handy when debugging SQL transformations.

dbt Lineage DAG

A model DAG graph is invaluable for visualizing relationships between models and simplifies debugging compared to navigating through a 500-line giant CTE. It provides a clear overview of dependencies, aiding in understanding data flows and identifying potential issues more efficiently.

This is definitely my favorite dbt feature.

Once you’ve tried an ETL framework like dbt, it’s hard to return to using raw SQL alone 😉

Adding more dbt models

Now, let’s explore how we can leverage existing dbt models for reusability.

Let’s say we wanted to determine “What’s the percentage of playlists with a Post Malone song?”. We can efficiently extract and analyze the relevant data by leveraging existing dbt models within our new dbt models.

Here’s a new model to answer that question:

Posty (mart model)
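The screenshot isn’t shown here; a sketch of what Posty might contain, mirroring the Swifties model and reusing the existing staging models via ref() (the repo’s actual SQL may differ):

-- models/marts/posty.sql (sketch)
SELECT
    (SELECT COUNT(*) FROM {{ ref('playlists') }})                       AS total_playlists,
    COUNT(DISTINCT pid)                                                 AS post_malone_playlists,
    COUNT(DISTINCT pid) / (SELECT COUNT(*) FROM {{ ref('playlists') }}) AS post_malone_playlist_ratio
FROM {{ ref('tracks') }}
WHERE artist_name = 'Post Malone'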

Run dbt again to build the new model, and then we can query it.

dbt run --target dev
duckdb dev.duckdb
D SELECT * FROM Posty;
┌─────────────────┬───────────────────────┬────────────────────────────┐
│ total_playlists │ post_malone_playlists │ post_malone_playlist_ratio │
│      int64      │         int64         │           double           │
├─────────────────┼───────────────────────┼────────────────────────────┤
│         1000000 │                 93227 │                   0.093227 │
└─────────────────┴───────────────────────┴────────────────────────────┘
dbt docs generate && dbt docs serve

Then check the lineage DAG to see what the project looks like now:

The Lineage DAG after adding a new model

See how effortlessly we extend transformations by reusing existing models. It streamlines development and ensures consistency across data transformation processes.

Conclusion and next steps

In Part 2, I took you on a journey through the wonders of dbt in transforming data. We explored how dbt brings a whole new level of ease to managing, reading, and reusing code compared to the old-school SQL methods. And, let’s not forget about the dbt DAG — that nifty graph showing us how all our models are connected. It’s like having a roadmap for our data journey, making troubleshooting a breeze.

As we look ahead to the next part, get ready for even more excitement. We’ll be diving deeper into dbt, adding more models, tweaking the ones we’ve got, and managing complexity in day-to-day dbt development.

Trust me, applying best practices and using some dbt CI tools from day one will save you from future headaches. You’ll thank yourself later for taking that extra step early on. It’s like investing in a good pair of earphones — your ears (and your data) will thank you!

Recce Sneak Peek

What if I told you there’s a fantastic tool that makes reviewing changes to dbt models 10 times easier? You can effortlessly see the DAG (lineage) and query differences, giving you the confidence to make changes to your dbt models.

Fortunately, such a tool existsssssssssssssss! ✌️

In part 3 I’ll also introduce Recce, a data validation toolkit for validating changes to your data pipeline. One of its great features is Lineage Diff, which displays the differences in your DAG before and after making changes. For example, here’s what the DAG Diff looks like after adding the Posty model (the new model is highlighted in green).

I’m looking forward to the next part! Be sure to follow us on LinkedIn and Twitter to stay updated on the rest of the series and other dbt tips. Don’t miss out! 😎
