Modular Data Stack — Build a Data Platform with Prefect, dbt and Snowflake
This is a conceptual post about building a data platform with Prefect, dbt, and Snowflake — hands-on demos follow later
Orchestration platforms historically only allowed managing dependencies within individual data pipelines. The typical result was a series of massive DAGs and brittle engineering processes. Those DAGs were difficult to change. To avoid breaking anything, you would think twice before touching the underlying logic.
Today, data practitioners, especially data platform engineers, are crossing the boundaries of teams, repositories, and data pipelines. Running things on a regular schedule alone doesn’t cut it anymore. Some dataflows are event-driven or triggered via ad-hoc API calls. To meet the demands of the rapidly changing world, data practitioners need to react quickly, deploy frequently, and have an automated development lifecycle with CI/CD.
DAGs are optional
DAGs can no longer keep up with the dynamic world of data and modern APIs. Moving from static DAGs (including DAGs of datasets defined in static declarative code) to API-first building blocks makes your dataflows more adaptable to change and easier to deploy and manage. In Prefect, you can still build DAGs, but they are optional — you can use a DAG structure only when you need it, and it can be dynamically constructed from runtime-discoverable business logic.
In this series of posts, we’ll build a data platform Proof of Concept using Prefect, dbt, and Snowflake. The “M” in the MDS we’re talking about here is focused on modularity as well as proven engineering concepts rather than modern hype. We’ll demonstrate how you can go from a simple, local, parametrized data pipeline running dbt CLI commands to multiple (but still simple) observable ingestion and transformation flows deployed with Docker.
Once we’re done with the development stage demo, switching between development and production environments will be as simple as pointing to a different Prefect Cloud workspace from your terminal.
Problems data practitioners still struggle with despite the Modern Data Stack
Imagine you want to run data transformation jobs with dbt. Before you can do that, you need to ensure that all necessary ingestion pipelines have finished. Otherwise, your dbt models might be built upon stale data.
This scenario is just one of many problems that data practitioners struggle with and for which the existing Modern Data Stack tools don’t offer enough help. Below are common user stories, categorized into buckets of pain.
Too tight dependency on the scheduler
- “I want to trigger (or schedule) a data processing flow from my custom application when needed via an API call regardless of whether this flow usually runs on schedule or not.”
- “I’m scraping data from an API and loading it into a Snowflake table. Once I hit a certain request limit, I want to stop further processing and dynamically schedule the next parametrized run at a not yet determined time. I want to provide the last scraped timestamp as a parameter value to the next run.”
Static DAGs force boilerplate code and are hard to change
- “I want to reuse and schedule the same workflow logic, but just with different parameter values and in a different environment.”
- “I want to get notified when my dataflow fails without having to write redundant code checking for status and configuring alerts.”
- “I want to start my parametrized dbt run independently from another flow orchestrating a full data platform refresh pipeline and without having to create a separate DAG for it.”
Managing environments and infrastructure is painful
- “Switching between environments is too difficult. I want to run the same workflow in dev and prod without modifications to my code.”
- “My transformation flow requires a Kubernetes job with GPU resources, while the ingestion process only requires a CPU.”
- “One team is using a different programming language. I want to run their workflows within separate Docker containers, Kubernetes jobs, or ECS tasks to make code dependency management easier.”
Handoff between data teams leads to conflicts
- “I’m responsible for the data platform, but I don’t write ML flows or dbt transformations myself. I want to orchestrate work from multiple teams without stepping on each other’s toes. I don’t want merge conflicts or communication overhead.”
- “One step in my workflow depends on a process maintained by another team — I want to trigger a workflow from that team’s deployment and wait for its completion before starting my task or flow.”
Observability and failure handling require unhealthy tradeoffs
- “I got a requirement from my client to run each dataflow in a separate container or pod, but I still want to leverage custom integrations and get visibility into what’s happening within that container.”
- “Something broke, and I need to trigger a backfilling ingestion process to Snowflake or rerun my dbt transformations from a model that failed, but I don’t want to rewrite my code only to gain backfill and restart from failure features.”
The desired solution
The target outcome is to build observable and composable workflows that can be assembled together from lightweight modular components, including tasks, flows, subflows, deployments, and building blocks with secure configuration and capabilities to interact with external systems.
Each data pipeline is independent but can be called, as needed, either from another flow acting as a parent flow or from a deployment run. Additionally, the desired implementation must be able to detect when child flows successfully finish their execution to ensure that we don’t start any processing until it’s safe to do so.
Parametrization is key to solving many of these problems. Parametrized workflows allow you to write your code once but create multiple scheduled deployments or ad-hoc runs with modified input parameters when needed.
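As a minimal sketch of this idea, the flow below takes the dbt command, the model selection, and the target as parameters, so the same code could back several deployments or ad-hoc runs. The flow and task names, the `staging` selector, and the default values are hypothetical. The `try`/`except` fallback only exists so the sketch runs without Prefect installed, and the command is printed rather than executed.

```python
from typing import List

try:
    from prefect import flow, task
except ImportError:
    # Fallback so the sketch runs without Prefect: decorators become no-ops.
    def _passthrough(fn=None, **_options):
        return fn if callable(fn) else (lambda f: f)

    flow = task = _passthrough


@task
def build_dbt_command(command: str, select: str, target: str) -> List[str]:
    """Assemble a dbt CLI invocation from the flow's parameters."""
    return ["dbt", command, "--select", select, "--target", target]


@flow
def dbt_flow(command: str = "run", select: str = "staging", target: str = "dev") -> List[str]:
    """Hypothetical parametrized flow: written once, run with different inputs."""
    cmd = build_dbt_command(command, select, target)
    # A real flow would execute the command here, e.g. subprocess.run(cmd, check=True).
    print("Would run:", " ".join(cmd))
    return cmd
```

Calling `dbt_flow()` uses the development defaults, while `dbt_flow(command="test", target="prod")` reuses the same logic with different inputs.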
The good news is that you can solve all of the above-mentioned data platform challenges using Prefect in combination with dbt and Snowflake.
Let’s look at how this parameterized flow-of-flows pattern can be designed using the core Prefect concepts.
There are two design patterns you may consider to implement the desired solution:
- Parent flows orchestrating subflows
- Parent flows orchestrating runs from deployments
One of Prefect’s unique advantages over other orchestration tools is the first-class concept of subflows. Instead of creating implicit dependencies between DAGs of tasks, subflows allow you to build explicit dependencies between the steps of your workflows via simple decorated functions. This works the same way regardless of whether your workflow steps touch data or not.
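The subflow pattern could look roughly like the sketch below: a parent flow calls other flows as ordinary functions, so the transformation step only starts once every ingestion call has returned. All flow names, dataset names, and the placeholder logic are assumptions made for illustration. The fallback decorator is only there so the sketch runs without Prefect installed.

```python
from typing import List

try:
    from prefect import flow
except ImportError:
    # Fallback so the sketch runs without Prefect: the decorator becomes a no-op.
    def flow(fn=None, **_options):
        return fn if callable(fn) else (lambda f: f)


@flow
def ingest(dataset: str) -> str:
    """Placeholder ingestion subflow: returns the raw table it would load."""
    return f"raw.{dataset}"


@flow
def transform(raw_tables: List[str]) -> List[str]:
    """Placeholder transformation subflow: returns the models it would build."""
    return [table.replace("raw.", "analytics.") for table in raw_tables]


@flow
def full_refresh(datasets: List[str]) -> List[str]:
    """Parent flow: the dependency between steps is explicit in plain Python."""
    # Each call runs as a subflow; an exception here stops the parent flow.
    raw_tables = [ingest(dataset) for dataset in datasets]
    # Reached only after every ingestion subflow has returned.
    return transform(raw_tables)
```

Because the dependency is expressed through ordinary function calls and return values, no separate DAG definition is needed to make the transformation wait for ingestion.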
If you prefer that each of your dataflow components runs in its own infrastructure, e.g., a separate process, (serverless) Docker container, or a Kubernetes job, you can leverage the run_deployment utility, as discussed in the release blog post. Using this pattern, you would create a deployment for each subflow (likely from CI/CD) and reference that deployment name in your parent flow, orchestrating the full refresh pipeline for your Snowflake data platform in a modular and extensible way.
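A hedged sketch of the deployment-based pattern follows. It assumes two deployments named `ingest/prod` and `dbt-run/prod` already exist; those names, the parameters, and the `run_and_wait` helper are hypothetical. With its default `timeout=None`, `run_deployment` waits for the triggered run to finish, so the helper can check the final state before the parent flow moves on. The optional `trigger` argument is an assumption added here so the helper can be exercised without a Prefect API.

```python
from typing import Any, Callable, Dict, Optional

try:
    from prefect import flow
    from prefect.deployments import run_deployment
except ImportError:
    # Fallback so the sketch can be imported without Prefect installed.
    run_deployment = None

    def flow(fn=None, **_options):
        return fn if callable(fn) else (lambda f: f)


def run_and_wait(name: str, parameters: Dict[str, Any], trigger: Optional[Callable] = None):
    """Trigger a run from a deployment and fail loudly unless it completed."""
    trigger = trigger or run_deployment
    flow_run = trigger(name=name, parameters=parameters)
    if flow_run.state is None or not flow_run.state.is_completed():
        raise RuntimeError(f"Run from deployment {name!r} did not complete successfully")
    return flow_run


@flow
def full_refresh(start_date: str):
    """Parent flow: each step runs on the infrastructure of its own deployment."""
    # Hypothetical deployment names in the form "<flow name>/<deployment name>".
    run_and_wait("ingest/prod", {"start_date": start_date})
    # Starts only after the ingestion run has completed.
    run_and_wait("dbt-run/prod", {"command": "run"})
```

Running `full_refresh` for real requires a reachable Prefect API and the two deployments to exist. The parent flow only references deployment names, so each team can ship its own flow code independently.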
We’ll look at how to implement both design patterns in the upcoming tutorials. Note that these approaches are not mutually exclusive. You can use both in tandem, depending on your needs.
Monorepo or separate repository for dbt and ML?
In many organizations, dbt transformations or ML workflows are written by a different team than the orchestration code. Separating orchestration and ingestion flows from dbt transformations into individual repositories allows for a reliable handoff between teams. This way, the development of orchestration pipelines by data engineers doesn’t block analytics engineers and vice versa.
We asked the data community on Twitter and LinkedIn whether they prefer to use a monorepo or separate repository for transformation, ingestion, and ML. It turned out the responses were as disparate as the data teams themselves.
These results demonstrate that rigid orchestration tools may cause problems down the road. Don’t plan for a static world and a single opinionated approach. Instead, design your architecture for change because change is the only constant in engineering (and life in general).
While a monorepo might be your preferred approach today, you may end up with conflicts if multiple teams need to agree on some structure. Or it might be the other way around: you may be using one repository per project and later switch to a monorepo because your team lost track of dependencies across projects.
There is no single right or wrong approach — a good orchestration tool can support you in implementing both.
In Prefect, running subflows is best for modularity, especially if you maintain your data platform code in a monorepo. However, you can still clone custom repositories from any Prefect flow using, e.g., the GitHub block.
Running deployments is best suited for orchestrating work that may be developed in separate repositories by decentralized independent teams or if your workflow components need infrastructure isolation.
This post covered the problems that data practitioners struggle with when building a data platform and discussed, on a conceptual level, how those problems can be approached using Prefect. In upcoming posts, we’ll dive into the implementation using Prefect, dbt, and Snowflake.
Modular Data Stack — Build a Data Platform with Prefect, dbt and Snowflake (Part 2)
Thanks for reading, and happy engineering!