How to Build a Modular Data Stack — Data Platform with Prefect, dbt and Snowflake
Build a data platform with Snowflake & dbt, and use Prefect to observe and coordinate your data stack
Data platform engineers are increasingly crossing the boundaries of teams, business domains, systems, repositories, and data pipelines. Running things on a schedule alone often doesn’t cut it anymore: some dataflows are real-time, event-driven, or triggered ad hoc. A reliable development lifecycle with CI/CD and flexible building blocks are now required to keep up with a rapidly changing landscape of tools, data stores, and frequent migrations.
This series of articles covers possible ways of addressing some of these challenges. We discuss how you can build a scalable and painless data platform with Prefect, dbt, and Snowflake. The series takes a top-down approach, starting from the big picture and progressively adding more detail. The code for the entire blog post series is included in the prefect-dataplatform repository.
If you found this repository helpful, feel free to give it a star! ⭐️
Part 1: defining the problem
The first post describes the problems that data platform engineers still struggle with despite the Modern Data Stack. It discusses the desired solution and possible ways to implement it using Prefect.
Part 2: the end outcome of this tutorial series
The second post presents the end outcome to show what you can expect as the final result of this tutorial series. It discusses how you can implement the Modular Data Stack using Prefect Blocks.
This post also shows how Prefect Cloud workspaces help you create development and production environments with all required dependencies to interact with dbt, Snowflake, and Prefect, and how to switch between environments using profiles.
Part 3: implementing the building blocks
Part 3 is hands-on and demonstrates:
- how to create blocks to interact with your stack in a secure and extensible way
- how to create deployments using various types of storage and infrastructure blocks
- how to leverage Prefect Cloud workspaces and Infrastructure as Code to create dev and prod environments and establish environment parity between them, so that building new flows and deploying them to dev and prod is (finally) painless
- how to automate the creation of new deployments based on a simple config file for defining flow entrypoints (see the sketch after this list)
- …and more!
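To make the last two points more concrete, here is a minimal sketch of registering a couple of blocks and then building one deployment per entry in a config file. The block names, the config layout, and the entrypoint strings are illustrative assumptions, not the exact code from the prefect-dataplatform repository.

```python
import importlib

import yaml  # assumes PyYAML is available
from prefect.blocks.system import JSON, Secret
from prefect.deployments import Deployment

# 1) Store secrets and shared configuration as blocks, so any flow
#    can load them by name instead of hardcoding credentials.
Secret(value="my-snowflake-password").save("snowflake-password", overwrite=True)
JSON(value={"account": "xyz-12345", "warehouse": "COMPUTE_WH"}).save(
    "snowflake-config", overwrite=True
)

# 2) Build one deployment per flow entrypoint listed in a config file, e.g.:
# flows:
#   - entrypoint: flows.ingestion:ingest_raw_data
#     name: dev
#     work_queue: dev
#   - entrypoint: flows.transformation:run_dbt
#     name: dev
with open("deployments.yaml") as f:
    config = yaml.safe_load(f)

for entry in config["flows"]:
    module_name, flow_name = entry["entrypoint"].split(":")
    flow = getattr(importlib.import_module(module_name), flow_name)
    Deployment.build_from_flow(
        flow=flow,
        name=entry["name"],
        work_queue_name=entry.get("work_queue", "default"),
    ).apply()  # register (or update) the deployment with the Prefect API
```

Running the same script once against the dev workspace and once against the prod workspace is one way to keep the two environments in parity.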
Part 4: scheduling, data ingestion, backfilling, and getting started with local dataflow development
This hands-on post demonstrates how to run your workflows ad hoc and on a regular schedule, and how to inspect the state of both parent and child flow runs from the UI. It shows various ways to attach a schedule to your deployment. It also highlights the benefits of blocks and of separating scheduling from flow run execution.
The next part of this post walks you through how to get started with local development and then dives into data ingestion and backfills.
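As a hedged illustration of the backfilling pattern, the sketch below shows a flow whose date parameters default to "yesterday" for scheduled runs and can be overridden for a historical backfill. The flow, deployment, and parameter names are made up for this example.

```python
from datetime import date, timedelta
from typing import Optional

from prefect import flow, get_run_logger
from prefect.deployments import run_deployment


@flow
def ingest_raw_data(start_date: Optional[str] = None, end_date: Optional[str] = None):
    # Scheduled runs pass no dates, so default to "yesterday";
    # backfill runs override the parameters explicitly.
    logger = get_run_logger()
    yesterday = (date.today() - timedelta(days=1)).isoformat()
    start_date = start_date or yesterday
    end_date = end_date or yesterday
    logger.info("Ingesting raw data from %s to %s", start_date, end_date)
    # ... extract from the source system and load into Snowflake ...


if __name__ == "__main__":
    # Ad-hoc backfill: trigger the existing deployment with a historical
    # window instead of the default "yesterday" parameters.
    run_deployment(
        name="ingest-raw-data/prod",  # assumed "<flow-name>/<deployment-name>"
        parameters={"start_date": "2022-01-01", "end_date": "2022-06-30"},
    )
```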
Finally, it explains how you can create a custom block (such as the SnowflakePandas block from the demo) to build modular components that can be reused across various use cases and teams.
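For reference, a custom block along those lines might look roughly like the sketch below. This is only an approximation of the idea, not the actual SnowflakePandas implementation from the repository; the field and method names are assumptions.

```python
import pandas as pd
import snowflake.connector
from prefect.blocks.core import Block
from pydantic import SecretStr


class SnowflakePandas(Block):
    """Reusable Snowflake connection details plus pandas helpers."""

    account: str
    user: str
    password: SecretStr
    database: str
    schema_name: str
    warehouse: str

    def _connect(self):
        return snowflake.connector.connect(
            account=self.account,
            user=self.user,
            password=self.password.get_secret_value(),
            database=self.database,
            schema=self.schema_name,
            warehouse=self.warehouse,
        )

    def read_sql(self, query: str) -> pd.DataFrame:
        # Run a query and return the result as a pandas DataFrame.
        with self._connect() as conn:
            return pd.read_sql(query, conn)


# Register once, then reuse from any flow or team:
# SnowflakePandas(account="...", user="...", password="...", ...).save("default")
# df = SnowflakePandas.load("default").read_sql("select * from raw.customers")
```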
Part 5: failure handling with alerts & retries, dbt transformations & separate tasks per dbt node
This post covers automated and manual retries. It also covers two ways of orchestrating dbt from Prefect:
- Simple parametrized dbt CLI commands (see the sketch after this list)
- More complex but also more observable dbt CLI commands
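The first approach can be as small as a flow that takes the dbt command as a parameter. The sketch below assumes the prefect-dbt collection and its trigger_dbt_cli_command task, plus a profiles.yml in the default location; exact argument names may differ between prefect-dbt versions.

```python
from prefect import flow
from prefect_dbt.cli.commands import trigger_dbt_cli_command


@flow
def dbt_flow(dbt_command: str = "dbt build", project_dir: str = "dbt_project"):
    # The command is a flow parameter, so the same deployment can run
    # `dbt run`, `dbt test`, or `dbt build --select my_model` ad hoc.
    return trigger_dbt_cli_command(command=dbt_command, project_dir=project_dir)


if __name__ == "__main__":
    dbt_flow(dbt_command="dbt build --select staging")
```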
Here are key concepts covered in part 5 with links to the relevant sections:
- Failure handling: alert notifications & retries
  - Slack alerts on failure
  - Automated retries
  - Manual retries from the UI
- Simple dbt transformation flow
  - Parametrization
  - Alerts on failure in dbt tests
- Parsing the dbt manifest
  - Orchestrating dbt from manifest (see the sketch after this list)
  - Slack alerts and troubleshooting of dbt runs
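To illustrate the manifest-based approach, here is a simplified sketch that creates one Prefect task per dbt model and wires the dependencies from manifest.json. It assumes the dbt CLI is on the PATH and that the flow runs from the dbt project directory; this toy version leaves out details such as dbt tests and sources.

```python
import json
import subprocess

from prefect import flow, task


@task
def run_dbt_node(node_name: str):
    # Run a single model so every dbt node gets its own task run,
    # with its own logs and retry behavior.
    model = node_name.split(".")[-1]
    subprocess.run(["dbt", "run", "--select", model], check=True)


@flow
def dbt_from_manifest(manifest_path: str = "target/manifest.json"):
    with open(manifest_path) as f:
        manifest = json.load(f)

    models = {
        name: node
        for name, node in manifest["nodes"].items()
        if node["resource_type"] == "model"
    }

    # Submit each model as a task that waits for the task runs of its
    # parent models, so the dbt DAG is respected.
    futures = {}
    remaining = set(models)
    while remaining:
        for name in sorted(remaining):
            parents = [p for p in models[name]["depends_on"]["nodes"] if p in models]
            if all(p in futures for p in parents):
                futures[name] = run_dbt_node.submit(
                    name, wait_for=[futures[p] for p in parents]
                )
        remaining -= set(futures)
```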
Part 6: ML, analytics & BI reporting flows, and running dbt from a GitHub repository block
Here are key concepts covered in part 6 with links to the relevant sections:
- Run code from other repositories: a demo of the GitHub block
- Triggering a dbt build using the GitHub block (see the sketch after this list)
- Flows for analytics & reporting
- Flows for ML and forecasting
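As a rough sketch of that pattern, the snippet below registers a GitHub filesystem block, pulls the referenced repository into a local directory, and runs dbt build there. The repository URL, block name, and directory layout are invented for this example, and the GitHub block API may vary slightly across Prefect 2 releases.

```python
import subprocess

from prefect import flow
from prefect.filesystems import GitHub

# One-time registration of the block (e.g., in a setup script):
# GitHub(repository="https://github.com/acme/dbt-project", reference="main").save("dbt-repo")


@flow
def dbt_build_from_github(block_name: str = "dbt-repo", local_path: str = "dbt_project"):
    # Pull the referenced repository into a local directory...
    GitHub.load(block_name).get_directory(local_path=local_path)
    # ...and run dbt against it (assumes dbt and a profiles.yml are available).
    subprocess.run(["dbt", "build"], cwd=local_path, check=True)
```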
Part 7: Parent flow tying all components together
In this final part, we investigate two ways of managing dependencies across multiple teams using a parent flow: subflows and deployments (both patterns are sketched after the list below).
- Orchestrating the data platform with subflows
- Orchestrating the data platform with deployments
- The key takeaway about both approaches
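Condensed to its essence, the difference between the two patterns looks roughly like this; the flow and deployment names are illustrative.

```python
from prefect import flow
from prefect.deployments import run_deployment


@flow
def ingestion():
    ...


@flow
def transformation():
    ...


# Pattern 1: subflows — the child flows are plain function calls and run
# on the parent flow's infrastructure.
@flow
def parent_with_subflows():
    ingestion()
    transformation()


# Pattern 2: deployments — each child runs as its own flow run from its own
# deployment (and potentially its own infrastructure); the parent coordinates.
@flow
def parent_with_deployments():
    run_deployment(name="ingestion/prod")
    run_deployment(name="transformation/prod")
```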
Thanks for reading, and happy engineering!