Modular Data Stack — Build a Data Platform with Prefect, dbt and Snowflake (Part 6)
Coordinate ML, analytics & BI reporting data platform workflows with Prefect, dbt, Snowflake, and modular building blocks
Welcome to another article about building a data platform with Prefect, dbt, and Snowflake. If you’re new to this series, check out the summary linking to previous posts. This post will cover the processes that run after data ingestion and transformation, including ML, analytics, and reporting flows. We’ll also demonstrate how you can manage dependencies across multiple teams using the GitHub block along with other modular building blocks available in Prefect.
To make the demo easy to follow, you’ll see this 🤖 emoji highlighting sections that prompt you to run or do something (rather than only explaining something). The code for the entire tutorial series is available in the prefect-dataplatform GitHub repository.
Table of contents
· Run code from other repositories
∘ Motivation
∘ dbt repository for analytics engineers
∘ 🤖 Creating a private dbt repository
∘ 🤖 Creating a personal access token in GitHub
∘ 🤖 Creating a GitHub block
∘ Using the GitHub block with dbt
∘ 🤖 Triggering a dbt build using the GitHub block
∘ 📚 Summary: orchestrating code from multiple repositories
· Flows for analytics & reporting
∘ Orchestrate anything after dbt transformations
∘ Example reporting workflow
· Flows for ML and forecasting
∘ Example ML workflow
∘ Where ML flows fit within a data platform orchestration
· Next steps
Run code from other repositories
Motivation
In many organizations, dbt transformations and ML workflows are maintained by a different team than the one that owns the orchestration code. Separating orchestration and ingestion flows from dbt transformations and ML into individual repositories often helps establish a more reliable handoff between independent teams without them blocking each other.
Orchestrating work this way involves working with multiple Git repositories. This section will demonstrate how you can leverage Prefect's GitHub file system block to coordinate dataflow across teams.
dbt repository for analytics engineers
The most prevalent example of this pattern is the orchestration of dbt transformations. Prefect users often don’t work with dbt themselves. Instead, a separate analytics engineering team maintains that process in its own repository containing dbt transformations that run against the Snowflake data platform.
🤖 Creating a private dbt repository
If you don’t have a dbt repository yet, you can leverage this public repository template to create a private GitHub repo for your Snowflake data transformations.
If you create a new repository, make sure to make it private:
🤖 Creating a personal access token in GitHub
To use the GitHub block with any private repository, Prefect needs to communicate with your GitHub account. To allow secure authentication, create a token with the repository scope. To do that, go to your GitHub account settings, select Developer settings, and generate a new token:
Use a classic token and select the repo scope:
At the bottom of the page, click on Generate token:
Then, copy that token and add it to your .env file or paste it into the form of your GitHub block in the Prefect UI. The next section will demonstrate two ways of creating a GitHub block.
🤖 Creating a GitHub block
You can create the GitHub block either from the UI or from code.
1) To do it directly from the UI, select the GitHub block, then paste the previously generated token along with your repository URL and branch name:
🧠 LPT: configure the branch name to match your environments. For instance, you may have a dev branch for your development workspace and main for your production workspace. Setting the branch name on the block according to your Prefect Cloud workspace makes it easy to distinguish between the two environments. If you create blocks from code, set the branch name as an environment variable in your .env file, e.g. GITHUB_DBT_REPO_BRANCH=main.
2) You can also create the GitHub block from Python code (especially useful if you have multiple GitHub repository blocks to set up):
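Below is a minimal sketch of what that could look like, assuming the GitHub block from prefect.filesystems and python-dotenv for reading the .env file; the repository URL, branch variable, and block name are placeholders:

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed
from prefect.filesystems import GitHub

# Load GITHUB_PERSONAL_ACCESS_TOKEN (and optionally the branch) from the .env file
load_dotenv()

github_block = GitHub(
    repository="https://github.com/your-org/your-dbt-repo",  # replace with your repo
    reference=os.environ.get("GITHUB_DBT_REPO_BRANCH", "main"),  # branch per workspace
    access_token=os.environ["GITHUB_PERSONAL_ACCESS_TOKEN"],  # only needed for private repos
)
github_block.save("dbt-jaffle-shop", overwrite=True)  # placeholder block name
```

Running this script once registers the block in your workspace, after which any flow (or teammate) can load it by name.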
Make sure to replace the placeholder repository URL and branch with your own. The code also assumes that the root project directory contains a .env file with a GITHUB_PERSONAL_ACCESS_TOKEN environment variable holding your access token.
💡 The access token is only required for private repositories.
Using the GitHub block with dbt
Here are the three steps required to use the GitHub block in a Prefect flow orchestrating dbt (sketched below):
1) Load the block within the flow.
2) Call the block’s get_directory() method to download the code.
3) Point the dbt function to the downloaded project path.
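Here’s a rough sketch of such a flow, assuming the prefect-dbt integration’s trigger_dbt_cli_command task; the block name and project directory are placeholders matching the sketch above:

```python
from prefect import flow
from prefect.filesystems import GitHub
from prefect_dbt.cli.commands import trigger_dbt_cli_command  # assumes prefect-dbt is installed


@flow
def dbt_jaffle_shop(dbt_command: str = "dbt build"):
    # 1) Load the previously saved GitHub block
    github_block = GitHub.load("dbt-jaffle-shop")  # placeholder block name
    # 2) Download the dbt project into the flow run's working directory
    github_block.get_directory()
    # 3) Point dbt at the downloaded project and profiles directories
    trigger_dbt_cli_command(
        command=dbt_command,
        project_dir="dbt_jaffle_shop",   # placeholder path inside the repo
        profiles_dir="dbt_jaffle_shop",
    )


if __name__ == "__main__":
    dbt_jaffle_shop()
```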
Anytime you run this flow from a deployment, the repository is cloned into the flow run’s temporary directory, so you don’t need to worry about cleaning up folders or manually modifying paths.
🤖 Triggering a dbt build using the GitHub block
Enough talk. Let’s trigger the above script from a terminal to validate that this flow works. Following the folder structure in the prefect-dataplatform repository, use the command:
```bash
python flows/transformation/jaffle_shop/dbt_repo.py
```
You should see the same flow run in your Prefect UI:
📚 Summary: orchestrating code from multiple repositories
This section started by explaining why using multiple repositories is often required to coordinate work maintained by separate teams. We covered how you can create a personal access token to clone a private repository at runtime. We then created a GitHub Prefect block to securely store that information, including the branch corresponding to a given Prefect workspace. Finally, we used that block to orchestrate dbt transformations for your Snowflake data platform.
Flows for analytics & reporting
As a reminder, we started this tutorial series from a big-picture perspective (Part 1 discussed the problem, and Part 2 explained the desired solution).
Part 3 covered how to build the building blocks and deployments. Part 4 showed how to run the entire parent flow orchestrating all pieces, from ingestion and backfilling through dbt transformations to ML and BI reporting, and discussed the ingestion workflows in depth. Part 5 demonstrated dbt transformations along with failure handling and alerting.
By now, all our data is fully processed and orchestrated, and we can talk more about coordinating analytics and ML processes that can only run after ingestion and transformation.
Orchestrate anything after dbt transformations
To demonstrate what you can do once your ingestion and transformation steps are finished, we’ll trigger downstream processes that will:
- Update reporting dashboards (refreshing extracts or clearing a cache) to ensure your dashboards operate on the most recent data
- Run an ML workflow generating a forecast based on recent sales and attribution numbers built with dbt transformations
- Send custom reports with current KPIs (the infamous report to your boss with updated KPIs — let’s assume everyone is still doing this)
- Feed the preprocessed data into online transaction processing applications, e.g., using a reverse ETL tool (such as Census or Hightouch).
Example reporting workflow
The following flow is a placeholder for anything you would like to do in your data platform after completing the essential processing (including data ingestion and transformation). This specific example performs several tasks concurrently, including updating dashboards, extracting current KPI values using the SnowflakePandas block, sending those as custom reports, and feeding those KPIs back to operational systems using reverse ETL.
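A minimal skeleton of such a flow might look like the following; the task bodies are stubs where the SnowflakePandas block, BI API calls, and reverse ETL calls would go:

```python
from prefect import flow, task


@task
def refresh_dashboards():
    """Refresh BI extracts or clear dashboard caches via your BI tool's API."""


@task
def get_current_kpis():
    """Read the latest KPI tables built by dbt, e.g. with the SnowflakePandas block."""


@task
def send_kpi_report(kpis):
    """Send the current KPIs as a custom report, e.g. via email or Slack."""


@task
def reverse_etl(kpis):
    """Feed the KPIs back to operational systems, e.g. via Census or Hightouch."""


@flow
def reporting_and_reverse_etl():
    # Submitting tasks to the task runner lets independent work run concurrently
    refresh_dashboards.submit()
    kpis = get_current_kpis.submit()
    send_kpi_report.submit(kpis)
    reverse_etl.submit(kpis)


if __name__ == "__main__":
    reporting_and_reverse_etl()
```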
You could split those tasks into individual flows as you see fit based on your organization's needs.
Flows for ML and forecasting
Often, ML workflows can only start once certain tables in your Snowflake data warehouse have been updated. For instance, the sales forecast flow needs to wait for both the jaffle shop and attribution dbt models to be refreshed in order to build the forecast on the most recent sales and attribution numbers. The ML training job can then read those tables, apply some ML training algorithm magic ✨ to predict anticipated sales numbers for the upcoming days and weeks, and load those forecasts back into Snowflake.
Example ML workflow
Here is an ML workflow example for generating a sales forecast:
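A simplified sketch, with the Snowflake reads and writes and the model training left as stubs:

```python
import pandas as pd
from prefect import flow, task


@task
def extract_training_data() -> pd.DataFrame:
    """Read the sales and attribution tables built by dbt from Snowflake."""
    return pd.DataFrame()  # stub


@task
def train_and_forecast(training_data: pd.DataFrame) -> pd.DataFrame:
    """Fit a forecasting model and predict sales for the upcoming days and weeks."""
    return pd.DataFrame()  # stub


@task
def load_forecast(forecast: pd.DataFrame) -> None:
    """Write the forecast table back to Snowflake."""


@flow
def sales_forecast():
    training_data = extract_training_data()
    forecast = train_and_forecast(training_data)
    load_forecast(forecast)


if __name__ == "__main__":
    sales_forecast()
```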
Where ML flows fit within a data platform orchestration
Depending on your needs, you can either trigger your ML workflows as subflows, packaged and imported into your parent flow, or run them as separate deployments triggered from your main data platform refresh workflow. The next part of the series will explore both options in more detail.
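As a rough illustration of both options (the import path and deployment name below are hypothetical):

```python
from prefect import flow
from prefect.deployments import run_deployment

from flows.ml.sales_forecast import sales_forecast  # hypothetical import path


@flow
def data_platform_refresh():
    # ... ingestion and dbt transformations run first ...

    # Option 1: run the ML workflow in-process as a subflow
    sales_forecast()

    # Option 2: trigger the ML workflow's own deployment instead
    run_deployment(name="sales-forecast/prod")  # hypothetical "flow-name/deployment-name"
```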
Next steps
This post demonstrated how to coordinate work maintained by multiple teams. It started by showing how to create a GitHub block with a personal access token to orchestrate dbt transformations developed in a separate Git repository. Then, the demo covered workflows that run once dbt transformations are completed, including various ML and BI reporting processes.
If anything we’ve discussed in this post is unclear, feel free to tag me when asking a question in the Prefect Community Slack.
Thanks for reading, and happy engineering!