Modular Data Stack — Build a Data Platform with Prefect, dbt and Snowflake

This is a conceptual post about building a data platform with Prefect, dbt, and Snowflake — hands-on demos follow later

Scheduling is just one aspect of the dataflow coordination spectrum

DAGs are optional
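To make that concrete, here is a minimal sketch of a Prefect 2 flow (the flow, task, and source names are hypothetical). The execution graph emerges from ordinary Python control flow at runtime; nothing resembling a DAG is declared up front.

    from prefect import flow, task


    @task
    def extract(source: str) -> list:
        # pretend this pulls rows from an API or a database
        return [{"source": source, "value": i} for i in range(3)]


    @task
    def load(rows: list) -> None:
        print(f"loading {len(rows)} rows into Snowflake")


    @flow
    def ingest(sources: list = ["orders", "customers"]):
        # plain Python loops and conditionals; the graph is discovered at runtime
        for source in sources:
            rows = extract(source)
            if rows:
                load(rows)

Retries, logging, and state tracking still apply to every task run, even though no static graph was ever registered.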

Problems data practitioners still struggle with despite the Modern Data Stack

Overly tight dependency on the scheduler

  • “I want to trigger (or schedule) a data processing flow from my custom application when needed via an API call regardless of whether this flow usually runs on schedule or not.”
  • “I’m scraping data from an API and loading it into a Snowflake table. Once I hit a certain request limit, I want to stop further processing and dynamically schedule the next parametrized run at a time that isn’t known in advance. I want to pass the last scraped timestamp as a parameter value to that next run.” (see the sketch below)
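Both of these requests map naturally onto Prefect deployments. Below is a minimal sketch, assuming a deployment of this flow named scrape/snowflake-ingest already exists (the deployment name and parameter are hypothetical): the flow schedules its own next parametrized run instead of relying on a fixed cron schedule.

    from datetime import datetime, timedelta, timezone

    from prefect import flow
    from prefect.deployments import run_deployment


    @flow
    def scrape(last_scraped: str = "1970-01-01T00:00:00+00:00"):
        # ... scrape the API and load into Snowflake until the request limit is hit ...
        new_cursor = datetime.now(timezone.utc).isoformat()

        # dynamically schedule the next parametrized run of this same deployment
        run_deployment(
            name="scrape/snowflake-ingest",  # hypothetical deployment name
            parameters={"last_scraped": new_cursor},
            scheduled_time=datetime.now(timezone.utc) + timedelta(hours=1),
            timeout=0,  # fire and forget: don't wait for that run to finish
        )

The same deployment can also be triggered ad hoc from a custom application through Prefect's REST API or the prefect deployment run CLI, regardless of whether a schedule is attached to it.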

Static DAGs force boilerplate code and are hard to change

  • “I want to reuse and schedule the same workflow logic, but just with different parameter values and in a different environment.”
  • “I want to get notified when my dataflow fails without having to write redundant code checking for status and configuring alerts.”
  • “I want to start my parametrized dbt run independently of another flow orchestrating a full data platform refresh pipeline, and without having to create a separate DAG for it.” (see the sketch below)
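A rough sketch of what that can look like, assuming dbt is installed and the paths below exist (every name here is hypothetical): one parametrized flow backs any number of deployments, each with its own default parameters and schedule, and there is no per-environment DAG to copy.

    import subprocess

    from prefect import flow, get_run_logger


    @flow(retries=1)
    def dbt_flow(
        command: str = "dbt run",
        project_dir: str = "dbt/jaffle_shop",  # hypothetical project path
        target: str = "dev",
    ):
        logger = get_run_logger()
        logger.info("Running %r against target %r", command, target)
        subprocess.run(
            f"{command} --project-dir {project_dir} --target {target}",
            shell=True,
            check=True,  # a non-zero dbt exit code fails the flow run
        )

Because a failed dbt command fails the flow run, alerting can be configured once through Prefect's notifications instead of status-checking code inside every flow, and the same flow can be deployed for prod with target="prod" or triggered entirely on its own, outside any platform-wide refresh pipeline.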

Managing environments and infrastructure is painful

  • “Switching between environments is too difficult. I want to run the same workflow in dev and prod without modifications to my code.”
  • “My transformation flow requires a Kubernetes job with GPU resources, while the ingestion process only requires a CPU.”
  • “One team is using a different programming language. I want to run their workflows within separate Docker containers, Kubernetes jobs, or ECS tasks to make code dependency management easier.” (see the sketch below)
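In Prefect 2 these concerns live in infrastructure blocks rather than in the flow code. A rough sketch of a one-off setup script (image names and block names are hypothetical); each deployment then references whichever block matches its resource and dependency needs:

    from prefect.infrastructure import DockerContainer, KubernetesJob

    # a CPU-only container for lightweight ingestion flows
    DockerContainer(image="prefecthq/prefect:2-python3.10").save("ingestion-cpu")

    # a Kubernetes job for GPU-heavy transformations; GPU resource requests can be
    # layered on through the block's job customizations
    KubernetesJob(
        image="my-registry/transform:latest",  # hypothetical image
        namespace="data-platform",
    ).save("transform-gpu")

Switching between dev and prod, or letting another team ship its flows in its own image, becomes a matter of building the deployment against a different block; the flow code itself stays the same.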

Handoff between data teams leads to conflicts

  • “I’m responsible for the data platform, but I don’t write ML flows or dbt transformations myself. I want to orchestrate work from multiple teams without stepping on each other’s toes. I don’t want merge conflicts or communication overhead.”
  • “One step in my workflow depends on a process maintained by another team: I want to trigger a workflow from that team’s deployment and wait for its completion before starting my task or flow.” (see the sketch below)
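That second point is exactly what triggering a run from another team's deployment looks like in practice. A minimal sketch (the deployment name is hypothetical); by default, run_deployment waits for the triggered run to reach a final state before returning:

    from prefect import flow
    from prefect.deployments import run_deployment


    @flow
    def ml_training():
        # trigger the other team's deployment and wait for it to finish
        upstream_run = run_deployment(name="feature-engineering/prod")  # hypothetical name
        if not upstream_run.state.is_completed():
            raise RuntimeError(f"Upstream run ended in state {upstream_run.state.name!r}")
        # ... continue with the training tasks ...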

Observability and failure handling require unhealthy tradeoffs

  • “I got a requirement from my client to run each dataflow in a separate container or pod, but I still want to leverage custom integrations and get visibility into what’s happening within that container.”
  • “Something broke, and I need to trigger a backfilling ingestion process to Snowflake or rerun my dbt transformations from a model that failed, but I don’t want to rewrite my code just to gain backfill and restart-from-failure features.” (see the sketch below)
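One way to avoid that rewrite is to parametrize the date range from day one, so a backfill is nothing more than an ad-hoc, parametrized run of a deployment that already exists. A minimal sketch (the flow and task names are hypothetical):

    from datetime import date, timedelta
    from typing import Optional

    from prefect import flow, task


    @task
    def load_day_into_snowflake(day: date) -> None:
        ...  # copy one day's worth of data into Snowflake


    @flow
    def ingest(start: Optional[date] = None, end: Optional[date] = None):
        # a scheduled run covers today; a backfill passes a wider date range
        start = start or date.today()
        end = end or date.today()
        current = start
        while current <= end:
            load_day_into_snowflake(current)
            current += timedelta(days=1)

The same idea applies to dbt: expose the selection flags as flow parameters so that rerunning from a failed model (for example with dbt's state-based result:error+ selector) is a parametrized run rather than a code change. And because deployments can run in Docker containers or Kubernetes pods while still reporting task-level state back to the Prefect API, per-container isolation doesn't have to cost you visibility.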

The desired solution

The good news is that you can solve all of the above-mentioned data platform challenges using Prefect in combination with dbt and Snowflake.

The implementation

  1. Parent flows orchestrating subflows
  2. Parent flows orchestrating runs from deployments
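Here is a compact sketch of both patterns side by side (all flow and deployment names are hypothetical). Subflows are simply flows called from another flow and run in-process; runs from deployments hand the work off to whatever infrastructure the referenced deployment defines.

    from prefect import flow
    from prefect.deployments import run_deployment


    @flow
    def ingest_raw_data(source: str = "shop_api"):
        ...  # extract and load tasks live here


    @flow
    def dbt_transform(command: str = "dbt build"):
        ...  # dbt tasks live here


    @flow
    def data_platform_refresh():
        # Pattern 1: subflows, called directly and shown as child runs of this parent
        ingest_raw_data()
        dbt_transform()

        # Pattern 2: a run from a deployment, owned by another team and executed on
        # its own infrastructure; run_deployment waits for the run to finish
        run_deployment(name="ml-training/prod")  # hypothetical deployment name

Pattern 1 keeps everything in one process and typically one repository; pattern 2 decouples teams, infrastructure, and release cycles, which is what makes the repository question below largely a matter of preference.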

Monorepo or separate repository for dbt and ML?

Link to the poll on Twitter
Link to the poll on LinkedIn

There is no single right or wrong approach — a good orchestration tool can support you in implementing both.

Next steps
