Declarative Dataflow Deployments with Prefect Make CI/CD a Breeze
DevOps engineers love Command Line Interfaces and YAML — we bring the same tools to DataflowOps
In Prefect, Python is the API — we believe that your code is the best representation of your workflow. But when it comes to deploying your flows, an imperative Python API can become cumbersome. This post will discuss how Prefect CLI and Blocks allow you to follow engineering best practices for reliable dataflow operations.
What is Prefect?
Prefect is more than an orchestrator — it allows you to design, reliably execute, and observe all dataflows regardless of the underlying execution environment. As long as your main workflow function is decorated with @flow, any run of that flow becomes observable from the Prefect UI.
Often you don’t want to be responsible for triggering those runs yourself. Instead, you may want to schedule your workflows, start them from an API call, or run those on remote compute clusters or cloud VMs by leveraging agents and work queues. In those scenarios, you need deployments.
What is a Prefect Deployment?
At its core, a flow deployment is a server-side workflow definition, allowing you to turn any flow into an API-managed entity. This definition represents metadata, including:
- Entrypoint — which flow you want to deploy and where it is located in your project’s directory; e.g., a flow function hello defined in a script demo.py has the entrypoint demo.py:hello,
- Name — what you want to call that deployment, e.g. -n dev; the name is a required argument to distinguish between (potentially) multiple deployments of the same flow; this might be useful, e.g., to run the same flow with different parameters on different schedules or to distinguish between development and production deployments,
- Remote storage block — where this flow and its module dependencies are located, and whether you want Prefect to automatically upload the code to that storage location, e.g. -sb s3/dev,
- Infrastructure block — where and how to deploy your flow; you can use a preconfigured shared block or let Prefect create a default block for you. The infrastructure block type might be a Docker container, Kubernetes job, a remote subprocess, and more, e.g. -ib kubernetes-job/prod,
- Override flags — one or more flags for infrastructure overrides; you can pass several of those to a deployment build command to override multiple arguments, e.g. --override env.PREFECT_LOGGING_LEVEL=DEBUG,
- Work queue name — defines which agent work queue should pick up flow runs from that deployment; this enables you to, e.g., point any given deployment to your development (-q dev) or production (-q prod) agents,
- Tags — using tags, you can add extra information to organize your flows based on projects and the needs of your organization (e.g. -t datateam -t mlproject). Tags are passed not only to deployments but also to flow runs generated from them. This way, you can easily filter your dashboard for runs and deployments related to your project or team,
- Version — the version of that build artifact, e.g. -v GITHUB_SHA; this argument makes it easy to attach your Git commit hash to a deployment, troubleshoot, and revert if something goes wrong.
The above are the main arguments you should be aware of. But there’s more! To learn about more advanced deployment patterns, including passing the schedule information directly from the CLI, use the command:
prefect deployment build --help
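To see how the arguments above combine in practice, a fuller build command might look like the following sketch (the block names s3/dev and kubernetes-job/prod are illustrative and assume you created those blocks beforehand; running it requires a Prefect API to apply against):

```shell
# Build a production deployment for the flow "hello" in demo.py,
# versioned with the current Git commit hash from CI
prefect deployment build demo.py:hello \
    -n prod -q prod \
    -v "$GITHUB_SHA" \
    -sb s3/dev -ib kubernetes-job/prod \
    -t datateam -t mlproject \
    -o hello.yaml
```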
Simple deployment example
Let’s assume that you have a flow hello defined in a script demo.py.
To turn this flow into a deployment, execute the build and apply commands:
prefect deployment build demo.py:hello -n dev -q dev -o hello.yaml
prefect deployment apply hello.yaml
💡 With the deployment build command, you can specify the location for the declarative YAML manifest that gets generated as a build artifact. This facilitates reproducible deployments. Here, we store it in a file with -o hello.yaml. In the dataflow-ops repository listed below, the CI/CD workflow will generate a single GitHub Actions artifact package with all deployment manifests.
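For reference, the generated hello.yaml looks roughly like this (abridged sketch; exact field names vary between Prefect 2.x versions, and the values mirror the build command above):

```yaml
name: dev
flow_name: hello
entrypoint: demo.py:hello
work_queue_name: dev
tags: []
parameters: {}
schedule: null
infra_overrides: {}
```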
The build command shown above will, by default, assume that you execute everything locally, i.e., local storage with files living on your computer and running within a local process.
To create a run from that deployment and start an agent to execute it, run:
prefect deployment run hello/dev
prefect agent start -q dev
The last command starts an agent that will pick up and execute the flow run because the queue name provided on the deployment matches the queue name assigned to the agent (-q dev).
Custom deployments as repository templates
We have only scratched the surface. There are:
- remote storage blocks for remote file systems such as S3, GCS, or Azure Blob Storage
- infrastructure blocks allowing you to fully customize your Kubernetes and Docker configuration and reuse generic configuration across multiple deployments
- a .prefectignore file to configure which files get uploaded in your CI/CD,
- and more!
We’ll cover those via GitHub repository templates so that you can use them directly. The repository linked below is one example, including a CI/CD workflow for Amazon S3 and ECS and a one-click serverless agent deployment built with Configuration as Code:
What are the benefits of infrastructure and storage blocks?
Here are the benefits of storage and infrastructure blocks that serve as building blocks of your dataflow deployments:
- Server-side packaging & no boilerplate: you don’t need to develop and maintain custom Python modules or configuration files to securely and reliably package the infrastructure and storage definition for your flows and integrate those with your CI/CD. The building blocks for repeatable, standardized deployments already exist and can be defined via code, CLI, API, or even from the UI.
- Deployment build process that “just works”: the deployment build process is able to manage full directories and thus preserves relative imports without you having to worry about manually adding your custom modules to the PYTHONPATH. Finally, you can stop worrying about the infamous ModuleNotFoundError and why your code worked locally but not on your production agent — Prefect ensures that it just works.
- Modularity & composability: with blocks, you can have a parent infrastructure block defining a generic structure you want to standardize on for most deployments, e.g., a generic Kubernetes job definition approved by your DevOps experts. But some of your flow deployments will most certainly deviate from that standard pattern (e.g., changing the default image, a Kubernetes namespace, or modifying a value of an environment variable). You can flexibly override those with no code duplication. The child block inherits a generic configuration from a parent block. At the same time, this child (anonymous) block created for any given deployment is still fully independent and can be modified even from the UI.
- Adaptability: you can define your generic configuration (e.g., for your S3 storage block or KubernetesJob infrastructure block) in one place, making your DevOps process adaptable to change
- Secure configuration & simplified security processes: if you need to rotate your AWS access keys, you can do that without having to redeploy your flows or underlying infrastructure
- Changing the infrastructure or storage configuration is easy: if you want to change the logging level for any given deployment or add a new environment variable, you don’t have to redeploy your flow — go to the UI and adjust the environment variable when and if needed.
If you want to talk about your dataflow operations or ask about anything, you can reach us via our Community Slack.