Modular Data Stack — Build a Data Platform with Prefect, dbt and Snowflake (Part 7)
Coordinate dataflow across multiple domains and teams, and orchestrate your data platform with a parent flow in Prefect
This is a continuation of a series of articles about building a data platform with Prefect, dbt, and Snowflake. If you’re new to this series, check out the summary linking to previous posts. This demo will discuss how you can manage dependencies across multiple teams using a parent flow. First, we’ll implement that pattern using multiple subflows. Then, we’ll inspect how you can coordinate runs triggered from deployments. Finally, we’ll discuss when you should use each option.
To make the demo easy to follow, you’ll see this 🤖 emoji highlighting sections that prompt you to run or do something (rather than only explaining something). The code for the entire tutorial series is available in the prefect-dataplatform GitHub repository.
Table of contents
- Orchestrating data platform with subflows
  - Pros of subflows for data platform orchestration
  - Cons of subflows (and how to overcome those)
  - TL;DR of subflows
- Orchestrating data platform with deployments
  - How to run deployments
  - 🤖 Parent flow running deployments
  - Pros of running deployments
  - Cons of running deployments
  - TL;DR of running deployments
- The key takeaway about both approaches
- Fun fact
- Next steps
Orchestrating data platform with subflows
So far, this tutorial series has covered all the pieces needed to run data ingestion, transformation, downstream analytics, and ML. This section will demonstrate how to put all the pieces together using Prefect. Let’s start with a workflow that, by now, should be familiar to you:
This is the same parent flow that we executed in Part 4:
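In case you don’t have the repository open, here is a minimal sketch of what such a parent flow looks like. The child flow names and import paths below are placeholders rather than the exact ones from the repository, so adjust them to your project layout:

```python
from prefect import flow

# Placeholder imports: swap in the actual child flows from your project
from flows.ingestion.ingest_jaffle_shop import ingestion
from flows.transformation.dbt_run import transformation
from flows.analytics.dashboards import dashboards
from flows.ml.sales_forecast import sales_forecast


@flow
def parent():
    # Subflows run sequentially and block: if one fails,
    # the downstream calls never execute
    ingestion()       # load raw data into Snowflake
    transformation()  # run dbt transformations
    dashboards()      # refresh dashboards and KPIs
    sales_forecast()  # generate ML-driven sales forecasts


if __name__ == "__main__":
    parent()
```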
This parent flow ensures that everything runs in the right order. It starts from ingestion. It then runs dbt transformations and updates all critical dashboards and KPIs. Finally, it generates ML-driven sales forecasts based on recent sales numbers.
🧠 LPT: what if you want to run sales_forecast even if the dashboards subflow fails? You can add the return_state=True flag to the dashboards subflow call: dashboards(return_state=True). This way, even if this child flow fails, the parent flow will continue executing downstream tasks and flows (e.g., to run some cleanup steps or important final processes).
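For illustration, reusing the placeholder child flows from the sketch above, the parent flow could look like this:

```python
@flow
def parent():
    ingestion()
    transformation()
    # Capture the state instead of raising on failure, so that
    # sales_forecast still runs even if the dashboards refresh fails
    dashboards_state = dashboards(return_state=True)
    sales_forecast()
```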
Pros of subflows for data platform orchestration
This subflow pattern, demonstrated in the above flow, is:
- observable — you can immediately see which child flows have been executed, and from there, you can navigate to individual tasks
- painless to deploy — there is only one parent flow process that needs to be deployed and maintained (no moving parts that you would need to manage when orchestrating tens of different Kubernetes jobs for this process)
- simple to troubleshoot — there’s only this one parent flow that runs on schedule and orchestrates your data platform — all you need is a notification when this process fails; the Prefect UI will tell you everything else you need to know for troubleshooting (what failed, when, and why)
- effortless to orchestrate — subflows are blocking, which means that there is no additional orchestration (dependency setting, waiting, or polling logic) required to ensure that if the ingestion flow fails, the transformation flow shouldn’t start — this happens automatically when you leverage subflows
Cons of subflows (and how to overcome those)
Problem: Using subflows running in a single flow run process may not work well if some of your flows (e.g., those for ML or dbt transformations) require package versions that conflict with those installed in your parent flow’s infrastructure environment.
Solution: keep using the parent flow approach, but trigger workflows that require different dependencies using the run_deployment pattern presented in the section below. This way, you can have the best of both worlds.
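As a rough illustration of mixing both patterns (again with placeholder flow imports and an illustrative deployment name):

```python
from prefect import flow
from prefect.deployments import run_deployment

# Placeholder imports for lightweight child flows that share the parent's dependencies
from flows.ingestion.ingest_jaffle_shop import ingestion
from flows.transformation.dbt_run import transformation


@flow
def hybrid_parent():
    # These run as subflows in the parent flow's own process
    ingestion()
    transformation()
    # The ML flow needs conflicting package versions, so it is triggered
    # from its deployment and runs on separate, dedicated infrastructure
    run_deployment(name="sales-forecast/default")
```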
Another problem is that running subflows concurrently (while possible) is not straightforward. There is an open issue on GitHub to add .submit() for subflows to make that process easier.
TL;DR of subflows
This pattern is best described by simplicity, modularity, and ease of use, at the cost of orchestration and infrastructure configurability. It’s particularly useful for largely standardized and homogeneous deployment patterns, often maintained in a monorepo.
Orchestrating data platform with deployments
The alternative to subflows is the run_deployment utility, which involves triggering flow runs from deployments. This pattern is especially helpful if you want to run each flow within a separate container, Kubernetes pod, or other infrastructure. It also helps coordinate work maintained by multiple teams without stepping on each other’s toes.
How to run deployments
Here is a simple flow that demonstrates how you can apply that pattern to run the same flows using a different execution mechanism. Instead of running those directly in a parent flow run process, each subflow run is executed within its own deployment-specific infrastructure (local process, Docker container, Kubernetes job, or a serverless container):
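The following is a condensed sketch of that flow. The deployment names are illustrative, so substitute the ones created by your setup script:

```python
from prefect import flow
from prefect.deployments import run_deployment


@flow
def parent():
    # Each call creates a flow run from the named deployment and,
    # by default, blocks until that run reaches a final state
    run_deployment(name="ingestion/default")
    run_deployment(name="transformation/default")
    run_deployment(name="dashboards/default")
    run_deployment(name="sales-forecast/default")


if __name__ == "__main__":
    parent()
```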
This flow calls the run_deployment method for each respective flow’s deployment in the order you defined. As a reminder, all deployments have been created by the automated setup script in Part 3. All we have to do now is call them in the right order in a parent flow.
By default, the run_deployment method will:
- Create a (child) flow run for a given flow’s deployment
- Poll for its completion status, wait, and block further execution until this (child) flow run finishes
The parent flow triggers the next run from a deployment only if the previous one succeeded (in the same way as subflows), unless you disable polling by adding timeout=0, which results in fire-and-forget behavior. If any child flow run (triggered from a deployment) fails, the parent flow run will also be marked as failed.
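For example, to trigger a deployment without waiting for its result (again, with an illustrative deployment name):

```python
# Fire-and-forget: the run is created, but the parent doesn't poll for completion
run_deployment(name="cleanup/default", timeout=0)
```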
🤖 Parent flow running deployments
While we’ve already executed the parent flow orchestrating subflows, we haven’t run deployments from a parent flow.
Use the following command:
python flows/orchestration/run_deployments/parent.py
Or trigger a run from the deployment in the UI:
Or from CLI:
prefect deployment run parent/local-process
Within the flow run page, you should see that each child flow run displays additional attributes, including the deployment name, deployment tags, and the work queue name. The image below compares the Subflow Runs tab for both subflows (on the left) and runs from deployments (on the right).
Now that you know how to implement and execute parent flows using the run_deployment method, let’s look at the pros and cons of that approach.
Pros of running deployments
- Each child flow runs in its own infrastructure — this often makes it easier to manage execution environments and resources. You can use it to leverage a separate Docker image. It can also help allocate a GPU or a specific amount of memory to a given ML-specific Kubernetes job. This also allows you to orchestrate processes with complex, potentially non-Python, library dependencies.
- Given that each component of this parent flow is a deployment, it can be triggered either from that parent flow or independently; for instance, you can trigger the parent flow or only a single individual flow from the UI, and the underlying execution metadata for that deployment is governed in one place.
- The fire-and-forget method can be handy if you are interacting with some processes that don’t immediately affect your downstream work.
- It allows setting custom flow run names (see the example below).
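For example, assuming a Prefect version where run_deployment accepts the flow_run_name argument (and an illustrative deployment name):

```python
# Give the triggered child run a descriptive, custom name
run_deployment(
    name="sales-forecast/default",
    flow_run_name="sales-forecast-triggered-by-parent",
)
```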
Cons of running deployments
Problem: Troubleshooting might be a little more challenging as you have more components to govern, and each child flow runs in its own process (more moving parts).
Solution: we are actively working on a feature called Automations that will allow you to observe the state of runs from any deployment and take an automated action on it (such as triggering alerts or other flow runs and more).
TL;DR of running deployments
This pattern of running deployments comes with both the benefits and drawbacks of decoupled per-flow-run infrastructure. It is particularly useful for heterogeneous deployment patterns and for coordinating work developed by decentralized, independent teams (potentially in separate repositories), or when individual workflow components need to run on dedicated infrastructure.
The key takeaway about both approaches
When using Prefect, you don’t have to choose: Prefect provides flexible components you can leverage to build something great. You don’t need to pick exclusively between subflows and running flows from deployments. Instead, use whichever matches your scenario and combine both approaches when it’s helpful.
Fun fact
We asked the Prefect community how they would name a flow orchestrating other flows. The moniker “parent flow” won the poll, but the “Lord of the flows” won our hearts 💙
Next steps
This post demonstrated how to orchestrate larger data platform workflows maintained by multiple teams. We investigated two patterns to tie all components together: subflows and the run_deployment pattern. Both have pros and cons and can be used in tandem to satisfy various use cases.
If anything we’ve discussed in this post is unclear, feel free to tag me when asking a question in the Prefect Community Slack.
Thanks for reading, and happy engineering!