Modular Data Stack — Build a Data Platform with Prefect, dbt and Snowflake (Part 5)
Failure handling with alert notifications and retries for data ingestion and dbt transformations in your Modern Data Stack
This is a continuation of a series of articles about building a data platform with Prefect, dbt, and Snowflake. If you’re new to this series, have a look at the summary linking to previous posts. This demo will dive into handling failure, including alert notifications and retries, and orchestrating Snowflake data transformations implemented with dbt.
To make the demo easy to follow, you’ll see this 🤖 emoji highlighting sections that prompt you to run or do something (rather than only explaining something). The code for the entire tutorial series is available in the prefect-dataplatform GitHub repository.
Table of contents
· Failure handling: alert notifications & retries
∘ Alerting
∘ 🤖 Configure Slack alerts on failure
∘ 🤖 Failing successfully with automated retries
∘ 🤖 Trigger a flow run with a task that will fail and retry
∘ 📚 Summary of alerts & automated retries
· 🤖 Manual retries from the UI
∘ 🤖 Important notes ❗️
∘ 🤖 Trigger a run from deployment to test alerts
∘ 🤖 Retry the flow run manually
∘ 🤖 Inspect the logs
∘ 📚 Summary of data ingestion workflows
· Simple dbt transformation flow
∘ Parametrization
∘ 🤖 Jaffle shop: dbt build from the Prefect UI
∘ 🤖 Inspecting the dbt run from the UI
∘ 🤖 Alerts on failure in dbt tests
∘ 📚 Summary of simple dbt transformation flows
· Parsing the dbt manifest
∘ Pros and cons of parsing the dbt manifest
∘ Using the Dbt Prefect block
∘ 🤖 Orchestrating dbt from manifest
∘ 🤖 From Slack alert to the logs in the Prefect UI
∘ 🤖 Fixing that issue with manual retries
∘ 📚 Summary of flows parsing the dbt manifest
· Next steps
Failure handling: alert notifications & retries
The last post demonstrated how to ingest data into a Snowflake data platform. However, data ingestion represents the most error-prone category of workflows. This section will discuss how to prepare for failure scenarios by leveraging alert notifications and retries.
Alerting
You can configure flow run notifications directly from the Prefect UI. From the Notifications page, click the plus button to create a new notification.
From here, you can select:
- the type of alert, including Slack, Teams, or Email
- the states you want to get alerted about, including Completed, Failed, Late, Scheduled, Pending, Running, Cancelled, or Crashed
🤖 Configure Slack alerts on failure
Suppose we are interested in failure notifications via Slack. To configure one, select the Slack Webhook notification type and add your webhook URL. Then, select the Failed state and click on Create.
Optionally, you can select tags (for instance, a tag related to dataplatform workflows) to apply that alert only to specific deployments. This is useful if you plan to send alerts for Analytics workflows to a different Slack channel than notifications related to, for example, ML workflows. This can help make alerts actionable for the teams responsible for those deployment tags.
🤖 Failing successfully with automated retries
Let’s make our flow occasionally fail by randomly raising a ValueError:
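(The sketch below only illustrates the idea; the table names and flow name are made up, and the repository version differs in its details.)

from random import random

from prefect import flow, task


@task(retries=3, retry_delay_seconds=10)
def extract_and_load(table: str) -> None:
    # simulate a flaky API or network issue that fails roughly half of the time
    if random() > 0.5:
        raise ValueError(f"Randomly failing while ingesting {table} 🤷")
    print(f"Loaded raw data into {table}")


@flow
def raw_data_jaffle_shop():
    for table in ["customers", "orders", "payments"]:
        extract_and_load(table)


if __name__ == "__main__":
    raw_data_jaffle_shop()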
With three retries, we can expect one of the automated retries to fix the randomly occurring issue (unless we are extremely unlucky and the randomness gods conspire against us).
🤖 Trigger a flow run with a task that will fail and retry
Automated retries work both for ad-hoc flow runs triggered as Python functions from your IDE and for flow runs triggered from a deployment (e.g., from the UI or via the prefect deployment run CLI command).
To test automated retries, you can trigger a run either from the UI, CLI, or by running the Python script:
python flows/ingestion/ingest_jaffle_shop_randomly_failing.py
When this task run fails, it will be retried. The error will be recorded in the backend, but you won’t get notified unless this task causes the flow run to fail (remember, notifications are on a flow run level, not on a task run level):
Despite the failure, the automated retry fixed the problem, and the flow run eventually ended in a Completed state (the task failed successfully 🎉). Note, however, that the flow run count is equal to 1 because the flow run was executed only once; only the task run was retried.
When you visit the task run page, you’ll be able to see that the task run count is equal to 2, which shows that the second run (after the retry) has been completed without failure:
📚 Summary of alerts & automated retries
So far, we’ve covered how to configure alerts and automated retries. The next section will deliberately force a failure to demonstrate what the flow run alerts on failure look like and how you can leverage manual retries from the UI to help fix failed runs when automated retries aren’t sufficient.
🤖 Manual retries from the UI
Let’s comment out the automated retries from the task decorator to experiment with two additional features: alert notifications that we configured earlier and manual retries from the UI.
Make sure to also comment out retries set on a flow decorator:
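(In the sketch from before, this amounts to commenting out the retry arguments on both decorators, roughly as shown below; the flow-level retry value is omitted here because it is specific to the repository code.)

@task  # retries=3, retry_delay_seconds=10  <- task-level retries disabled
def extract_and_load(table: str) -> None:
    if random() > 0.5:
        raise ValueError(f"Randomly failing while ingesting {table} 🤷")
    print(f"Loaded raw data into {table}")


@flow  # retries=...  <- flow-level retries disabled as well
def raw_data_jaffle_shop():
    for table in ["customers", "orders", "payments"]:
        extract_and_load(table)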
With automated retries disabled at both the task and flow level, the flow run may end in a Failed state (remember, the error is raised at random), and when it does, you will get a failure notification.
🤖 Important notes ❗️
To leverage manual retries from the UI, it’s best to enable result persistence so that any dataflow that passes data between tasks and subflows can store and later retrieve those results:
prefect config set PREFECT_RESULTS_PERSIST_BY_DEFAULT=true
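If you prefer not to flip the global setting, Prefect also lets you enable persistence selectively on individual tasks and flows; a minimal sketch:

from prefect import flow, task


@task(persist_result=True)
def extract_and_load(table: str) -> None:
    ...


@flow(persist_result=True)
def raw_data_jaffle_shop():
    extract_and_load("orders")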
To read more about results, check the documentation.
⚠️ Note that, in contrast to automated retries, manual retries from the UI work only for flow runs triggered from deployments.
🤖 Trigger a run from deployment to test alerts
To start a flow run from a deployment, you can use the command:
prefect deployment run raw-data-jaffle-shop/local-process
If this flow run fails, you should get a Slack notification similar to this one:
🤖 Retry the flow run manually
Following this URL, you should see a flow run page that has a friendly Retry button allowing you to manually retry that failed run:
It will ask you to confirm (to be sure that you didn’t click the retry button by accident):
That’s everything you need to do to manually retry a flow run from the UI. You may end up having to repeat that a couple of times, but eventually, it should succeed (unless the randomness gods really conspire against you):
This simple use case demonstrates how valuable the Retry from failure feature can be, especially when dealing with flaky APIs.
🤖 Inspect the logs
Another way you can troubleshoot failures is by inspecting the logs. You can filter for a specific log level to find out what happened:
📚 Summary of data ingestion workflows
So far, we’ve covered how to run and troubleshoot data ingestion workflows to load data into Snowflake and handle failures. In the next section, we’ll start looking at dbt transformations in more detail.
Simple dbt transformation flow
Often, when using dbt, all that you want to orchestrate is a single dbt build command. The demo repository has a flow that you can use to perform that operation in a single function call:
Here is a full dbt transformation flow:
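(The repository contains the full flow; the sketch below is a simplified version built on the trigger_dbt_cli_command task from the prefect-dbt collection and a DbtCliProfile block. The block name default and the project directory are placeholders, not the repository’s exact values.)

from prefect import flow
from prefect_dbt.cli import DbtCliProfile
from prefect_dbt.cli.commands import trigger_dbt_cli_command


@flow
def dbt_jaffle_shop(dbt_command: str = "dbt build") -> None:
    # Snowflake credentials are stored on the securely saved block
    dbt_cli_profile = DbtCliProfile.load("default")
    # use the dbt command itself as the task run name, then trigger the dbt CLI
    trigger_dbt_cli_command.with_options(name=dbt_command)(
        command=dbt_command,
        dbt_cli_profile=dbt_cli_profile,
        overwrite_profiles=True,
        project_dir="dbt_jaffle_shop",  # placeholder path to the dbt project
    )


if __name__ == "__main__":
    dbt_jaffle_shop()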
💡 Note that:
- the dbt() function invokes the task trigger_dbt_cli_command from the prefect-dbt collection
- it passes the dbt command as the name of that task; by default, it will use dbt build, but you can adjust that at any time
- to avoid hardcoding the dbt project directory, this dbt function will infer the path from your flow file
- it will load the profile configuration (with credentials to your Snowflake data warehouse) from the securely stored DbtCliProfile block to avoid hardcoding credentials; this block has already been configured in Part 3 of this tutorial series
Parametrization
Since we included the dbt build command as the argument to the flow function, the flow can now leverage the Prefect parametrization feature. If we had instead hardcoded this command within the flow, it would be difficult to adjust when needed, e.g., to rerun only failed models using the dbt command:
dbt build --select result:error+ --defer --state ./target
The flow parametrization allows you to trigger any custom dbt command directly from the Prefect UI, CLI, or API. Here is how you can pass this custom dbt command to the Prefect UI custom run page:
Alternatively, if you know which model failed and you want to directly trigger a run from that failed model and all downstream dependencies, you can use the select flag:
dbt run --select failed_model_name+
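Either of those commands can also be passed programmatically rather than through the UI; for instance, with run_deployment (assuming the flow parameter is named dbt_command, as in the sketch earlier, and using the simple-local-process deployment mentioned later in this post):

from prefect.deployments import run_deployment

run_deployment(
    name="dbt-jaffle-shop/simple-local-process",
    parameters={"dbt_command": "dbt build --select result:error+ --defer --state ./target"},
)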
🤖 Jaffle shop: dbt build from the Prefect UI
Let’s trigger the dbt build step for the jaffle shop example from the Prefect UI:
🤖 Inspecting the dbt run from the UI
When we inspect the logs, we can see which dbt models and dbt tests are built during that run:
🤖 Alerts on failure in dbt tests
Let’s simulate a failure in one of the dbt tests to validate that alert notifications are working properly. To do that, let’s assume that the jaffle shop no longer accepts returns:
This run should fail as expected because our data contains returned orders:
By following the URL, we can dive into the logs that inform us which tests didn’t pass:
📚 Summary of simple dbt transformation flows
This section covered a simple way to orchestrate dbt with parametrized flows. This process is performant and proven to work well among many dbt users orchestrating their Snowflake data transformations with Prefect. But some users prefer to get a separate alert for each dbt model that failed, and they want to see each node of the dbt DAG as separate Prefect tasks. If you’re one of those users, you can use the approach from the next section — otherwise, feel free to skip it.
Parsing the dbt manifest
If you want to turn every node from the dbt DAG into a separate Prefect task, you need to parse the dbt manifest file. Luckily, we’ve done that for you. The prefect-dataplatform GitHub repository contains a custom Dbt block whose dbt_run_from_manifest method performs that operation. Prefect blocks allow you to implement that logic once and then reuse and extend it as you see fit.
Pros and cons of parsing the dbt manifest
Parsing the dbt manifest can be a bit tedious to implement and is slightly less performant than the method presented in the previous section, but it offers some benefits:
a) it helps to distinguish dbt models and tests from other dbt commands such as dbt compile (have a look at the pretty emojis on the task run page below):
b) it gives more detailed alerts telling you which dbt run or test failed, allowing you to quickly fix the issue and rerun from that failed node.
Using the Dbt Prefect block
Here is how you can use the Dbt block to orchestrate dbt transformation flows in just three lines of code (see the sketch after the list below):
- Load the Dbt block
- Compile the dbt manifest; technically, this step is optional, but it’s best to first compile the manifest to be sure that the parsed representation matches the latest state of your dbt DAG
- Invoke the dbt_run_from_manifest block capability to run your dbt models and tests from the compiled manifest
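Put together inside a flow, those three steps might look roughly like this; note that the import path, the block name default, and the compile method name are assumptions for illustration, while dbt_run_from_manifest is the method mentioned above:

from prefect import flow

from blocks.dbt import Dbt  # custom block from the prefect-dataplatform repo; import path assumed


@flow
def dbt_jaffle_shop():
    dbt = Dbt.load("default")    # 1. load the Dbt block (block name assumed)
    dbt.dbt_compile()            # 2. compile the manifest (method name assumed)
    dbt.dbt_run_from_manifest()  # 3. run dbt models and tests from the parsed manifest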
🤖 Orchestrating dbt from manifest
Let’s visit the flow page in the UI and select the dbt-jaffle-shop flow. You should see that this flow has two deployments — the first one named simple-local-process with the parametrized logic and the second, more complicated one named local-process, which parses the dbt manifest. Select the second one:
Then, trigger a run from that local-process deployment:
We deliberately keep the incomplete list of accepted order status values so that the run fails, letting us test the alert functionality:
🤖 From Slack alert to the logs in the Prefect UI
The flow run should generate a Slack notification similar to this one:
When following the URL, you’ll be able to inspect the logs for exactly that dbt test (rather than having to browse through all dbt logs):
If you click on the phenomenal-lemming (i.e., the flow run name) and navigate from there to the task run page, you’ll get an overview of the dbt nodes that were executed, and you’ll be able to see at a glance which step failed (the dbt test for the table stg_orders) and which models and tests were not executed due to that upstream failure (all nodes in the NotReady state, marked grey).
🤖 Fixing that issue with manual retries
Let’s fix that issue in the dbt test by bringing back returns as accepted order status values:
Once the dbt test is fixed in the dbt project code (or, more realistically, once the issue has been addressed in the data), go back to the UI and click the Retry button:
The previously failed run should turn blue 🔵 and move into a Retrying state:
You should notice that the retries of the phenomenal lemming phenomenally fixed the issue:
The run count of 2 confirms that the flow-level retry was triggered properly.
📚 Summary of flows parsing the dbt manifest
This last section discussed why parsing the dbt manifest can often be useful for troubleshooting. It gives you informative error notifications and allows you to easily retry your dbt transformations from the model that failed. If you’re interested in how exactly the Dbt block is implemented under the hood, check out this file in the prefect-dataplatform repository.
Next steps
This demo covered how to run data ingestion and transformation flows and how to handle failure with alert notifications and retries. It covered the differences between automated and manual retries and demonstrated how they could help with troubleshooting dataflow failures. This post also presented two ways of orchestrating dbt transformations and discussed tradeoffs between them.
If anything about what we’ve discussed in this post is unclear, feel free to tag me when asking a question in the Prefect Community Slack.
Thanks for reading, and happy engineering!