Why Prefect
Orchestration is just one part of the dataflow equation
--
⚠️ Note: I no longer work at Prefect. This post may be entirely outdated, and it will not be updated. Refer to Prefect docs and website for current information.
What is Prefect?
It’s a flexible framework to build, reliably execute and observe your dataflow while supporting a wide variety of execution and data access patterns.
Why should you care?
The Modern Data Stack encompasses a large array of highly specialized components. You can find tools for data ingestion, transformation, analysis, validation, cataloging — the list goes on. Domain-specific orchestrators focus on supporting:
- one platform (Kubernetes, Databricks, Snowflake, AWS, GCP, Azure),
- specific interface for workflow definition (drag-and-drop, low-code, no-code, YAML),
- or a dedicated target audience (business users, data warehouse builders, microservice developers, analytics, or ML engineers).
However, custom business logic and more advanced applications are left behind. They don’t match the language of specialized tools and remain separated from the rest of the stack. As a result, you never get a full picture of the current state of your entire dataflow.
Imagine a world where you can automate dataflow across any stack and observe its behavior within your data infrastructure.
Coordination plane
As someone who relies on robust dataflow, you may ask:
- What’s the state of your dataflow right now?
- Is the infrastructure healthy? If some component is down, was the platform team alerted?
- Did your critical data warehouse refresh workflow succeed? How long did it take?
- Did all data quality checks pass? If not, was a support ticket created automatically? As part of a data governance strategy, you often need to guarantee that data consumers are informed about failed validation checks.
- The Sales table should be updated daily by 9 AM — did we miss that SLA today? If so, was the team notified via an automatically created alert?
With Prefect, you can answer those questions and automate how you want to react to those circumstances. The coordination plane gives you a framework to define how data should flow through your system. Prefect will then inform you when those expectations are violated.
At any time, you can observe the state of your data stack. You can use the observed state to drive orchestration and trigger automated event-driven actions.
Scheduled AND event-driven workflows
A common challenge in building a data platform is doing it in an extensible and adaptable way. Tool and database migrations are an inevitable part of engineering work in the constantly evolving world of technology. Therefore, your data platform must be designed for change. Your dataflow might be a batch processing job today, but a near real-time streaming workflow tomorrow. Similarly, the same business logic, processing files from S3, may simultaneously run as a daily scheduled job and as an event-driven workflow, depending on the context and use case. You shouldn’t have to duplicate that logic and use separate execution platforms only to support both access patterns. Prefect supports both scheduled and event-driven workflows.
Embracing change highlights the importance of adaptable user-facing abstractions. Prefect provides those out of the box. The most notable ones are blocks, deployments, and subflows.
Modular components: Blocks
Blocks are connectors for code. They hold credentials and business logic to interact with external systems in your stack. You can choose which blocks to include based on your current toolset. And as your stack evolves, blocks can gain new capabilities to reflect new business logic. They make it easy to add configuration for new systems and replace existing blocks with new ones. You can see your stack and its capabilities at a glance.
Blocks encourage modularity, standardized configuration, and reusability of components. You can set up your credentials once, rotate them in one place when they expire (via a single API call), and use them in any other component that needs them.
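For illustration, here is a minimal sketch using the built-in `Secret` block; the block name and value are placeholders:

```python
from prefect.blocks.system import Secret

# Store a credential once; "warehouse-password" is just an example name.
Secret(value="super-secret-password").save("warehouse-password", overwrite=True)

# Any flow or task can later load the same block by name.
password = Secret.load("warehouse-password").get()
```

Rotating the credential means saving the block again with a new value; every flow that loads it by name picks up the change on its next run.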
Modular components: Deployments
Infrastructure-agnostic
Prefect isn’t tied to any specific cloud or execution environment. You could run the same flow simultaneously across all public clouds and on-prem. When you create a deployment, a server-side workflow definition turns your flow into an API-managed entity. This definition represents metadata, including the flow entrypoint, work queue name, storage, and infrastructure blocks.
API-first
Triggering a flow run from a deployment and running your flow on a cloud VM or a remote compute cluster is just an API call away.
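As a sketch, assuming a deployment named `hello/prod` already exists, a run can be triggered from any Python process:

```python
from prefect.deployments import run_deployment

# "hello/prod" is a placeholder for "<flow name>/<deployment name>".
# By default this call waits for the triggered run to finish.
flow_run = run_deployment(name="hello/prod")
print(flow_run.state)
```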
Flexible storage
Your flow code can be baked into a Docker image along with its dependencies, or it can live in S3, a monorepo, or a project-specific GitHub branch. You only need to provide metadata on your storage block to point Prefect to the location of your code. Prefect will dynamically pull that code at runtime. You can switch between various storage options without code modifications.
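For example, a storage block pointing at a (hypothetical) GitHub repository can be registered once and referenced by any deployment:

```python
from prefect.filesystems import GitHub

# Placeholder repository; deployments referencing this block pull the code at runtime.
github_repo = GitHub(
    repository="https://github.com/acme/dataflows",
    reference="main",  # branch, tag, or commit SHA
)
github_repo.save("dataflows-repo", overwrite=True)
```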
Infrastructure at any scale: from a small VM to serverless and a distributed Kubernetes cluster
Switching a deployment’s infrastructure between a local `Process`, a `KubernetesJob`, and a serverless container is as simple as pointing to a different infrastructure block. You can start with a single VM and a local `Process` block and, if required, later move to serverless containers in the cloud or a `KubernetesJob`. You can change the infrastructure block via a CLI flag, e.g. by switching `-ib process/prod` to `-ib kubernetes-job/prod`.
Scheduling is optional, and it’s decoupled from execution
When you create a deployment, you can attach a schedule to it, but it’s optional. This separation of concerns allows you to have a deployment executed ad-hoc (e.g., event-driven or from the UI).
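Putting the last few sections together, here’s a minimal sketch against the Prefect 2 (Orion) Python API. The flow name, work queue, and the `Process` infrastructure block named `prod` are placeholders assumed to exist; the cron schedule is optional:

```python
from prefect import flow
from prefect.deployments import Deployment
from prefect.infrastructure import Process
from prefect.orion.schemas.schedules import CronSchedule


@flow(log_prints=True)
def refresh_sales_table():
    print("refreshing the Sales table")


# Swapping Process.load("prod") for KubernetesJob.load("prod") is the only
# change needed to run the same flow on Kubernetes instead of a local process.
deployment = Deployment.build_from_flow(
    flow=refresh_sales_table,
    name="prod",
    work_queue_name="default",
    infrastructure=Process.load("prod"),
    # Optional: drop the schedule to trigger runs only ad hoc.
    schedule=CronSchedule(cron="0 9 * * *", timezone="UTC"),
)
deployment.apply()
```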
Security
Prefect takes security and privacy seriously, from our Cloud infrastructure to our systems and practices. Prefect undergoes a SOC 2 Type II compliance audit annually. The hybrid execution model meets the strict standards of major financial and healthcare institutions and companies that work with regulated data.
Prefect Cloud never receives your code or data. Your flows run on your private infrastructure and exchange only state information and metadata with the API. All storage systems are encrypted with industry best practice algorithms. Data is encrypted at all times, in transit and at rest, with a minimum of TLS 1.2 enforced on all of our endpoints.
Dataflow coordination is driven purely by metadata transferred over an authenticated API. Prefect does not require ingress access to the customer environment as the connection is opened outbound via the Prefect agent.
Orchestration
Dataflow orchestration is an umbrella term for features allowing you to control how your flow should transition between various states. The Prefect Orion API provides a rich array of orchestration states, ensuring the desired flow of execution.
Retries & alerts
If a task or flow run fails, the orchestrator can retry that run or send you a failure notification. You may choose to continue executing downstream steps even if this component fails.
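A short sketch of what this looks like in code; the flaky API call is simulated:

```python
import random

from prefect import flow, task


@task(retries=3, retry_delay_seconds=30)
def call_flaky_api() -> dict:
    # Stand-in for a request that occasionally fails; Prefect retries the task
    # up to three times, waiting 30 seconds between attempts.
    if random.random() < 0.3:
        raise RuntimeError("transient API error")
    return {"status": "ok"}


@flow
def ingest():
    call_flaky_api()
```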
Caching
If one part of your computation is expensive, Prefect can cache that run’s result and reuse it in later runs.
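For example, caching a task run on its inputs (a sketch; the transformation is a placeholder):

```python
from datetime import timedelta

from prefect import task
from prefect.tasks import task_input_hash


@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=12))
def expensive_transformation(dataset: str) -> str:
    # Re-running with the same input within 12 hours reuses the cached result
    # instead of recomputing it.
    return dataset.upper()
```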
Data passing & dataflow
You can pass data in memory without having to manually persist the intermediate results.
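A minimal sketch of in-memory data passing between tasks:

```python
from prefect import flow, task


@task
def extract() -> list:
    return [1, 2, 3]


@task
def transform(rows: list) -> list:
    return [r * 10 for r in rows]


@flow
def pipeline():
    # Task results flow in memory; no manual persistence of the intermediate
    # list is required.
    rows = extract()
    return transform(rows)
```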
Parallelism & concurrency
Scaling out your execution to distributed compute clusters is as simple as adding a Dask or Ray task runner to your `@flow` decorator.
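For instance, with the separately installed `prefect-dask` collection, a parallel flow could look like this sketch:

```python
from prefect import flow, task
from prefect_dask import DaskTaskRunner  # pip install prefect-dask


@task
def score(item: int) -> int:
    return item * item


@flow(task_runner=DaskTaskRunner())
def parallel_scoring(items: list):
    # .submit() hands each task run to the Dask task runner, which executes
    # them concurrently on a local (or remote) Dask cluster.
    futures = [score.submit(i) for i in items]
    return [f.result() for f in futures]
```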
Concurrency limits
If you want to, for instance, mitigate pressure on your database, concurrency limits can ensure that only a certain number of concurrent runs get executed — the other ones will be queued.
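As a sketch, the limit is created once for a tag (here the hypothetical tag `database`), and any task carrying that tag respects it:

```python
from prefect import task

# Create the limit once, e.g. via the CLI:
#   prefect concurrency-limit create database 10
# At most ten task runs tagged "database" will then execute at the same time;
# the rest wait in line.


@task(tags=["database"])
def write_to_database(batch):
    ...
```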
Robust failure handling
You can take automated action if a flow fails, crashes, or gets canceled.
Manual approval
If some element of your workflow needs to wait for manual approval, dedicated orchestration states allow you to pause execution and resume it later.
Code as workflows
You can think of an orchestrator as control flow rules added to your business logic. Defining that control flow in Prefect is as easy as `.py`. There’s no custom DSL for branching logic or handling exceptions. To define your control flow, you write native Python with function calls, `if/else` statements, or `for/while` loops.
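A short sketch of plain-Python control flow inside a flow (the source check is simulated):

```python
import random

from prefect import flow, task


@task
def source_is_ready() -> bool:
    return random.random() > 0.5


@task
def process(partition: str):
    print(f"processing {partition}")


@flow(log_prints=True)
def branching_flow(partitions: list):
    # Plain if/else and for loops define the control flow; no DSL required.
    if source_is_ready():
        for partition in partitions:
            process(partition)
    else:
        print("source not ready, skipping this run")
```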
The next level of orchestration
DAGs are optional
Most orchestrators expect you to define exactly how your workflow will behave in advance, either imperatively or declaratively. This isn’t always desirable or even possible. Imagine that you want to start a flow when a given file arrives in S3. If you know exactly when this will happen (e.g., every workday at 7 AM), this workflow is highly predictable and well suited to statically defined DAGs or CRON jobs. But suppose that one of the following conditions is true:
- you don’t know when this file will arrive
- you don’t know how many files will arrive
- you don’t know whether this file will arrive at all (third-party tools, unreliable APIs, things out of your control)
Take automated action on a missed SLA
Given the conditions above, it’s helpful to define an SLA on that file-arrival event and trigger some action if the desired event doesn’t happen by 7 AM. If that SLA is missed, you can trigger another workflow that extracts the same data from an alternative source system or API.
You may want to “just” get an alert telling you which data didn’t arrive at the expected time and which downstream dataflow was affected by that missed SLA. As simple as this use case may sound, implementing this in practice, especially at scale, is far from trivial. You may end up building a second orchestrator only to incorporate this SLA-based pattern.
Prefect allows you to combine explicit orchestration rules defined in code with the implicit state of the world outside of Prefect orchestration.
🛠️ Observability
Observe everything
Many users have workflows built with a legacy orchestrator, and those workflows can’t be easily migrated. Similarly, a common point of friction in large data organizations is that they can’t agree on a single orchestrator. As Martin Kleppmann put it in Designing Data-Intensive Applications:
“The difficulty of getting different organizations to agree on anything outweighs most other concerns.”
Suppose you use Airflow or an in-house orchestrator for existing predictable workflows. You can keep using it and observe its health with Prefect. For instance, you can emit an event when your Airflow DAG completes, which then triggers a Prefect flow run.
Orchestrate anything
With the coordination plane, you can observe the state of a workflow in any orchestrator or tool in your stack via emitted events, which can trigger Prefect flows (or other actions of your choice).
Three elements make up Automations:
- An Event informs Prefect about something that happened in your data infrastructure (see the sketch after this list).
- An emitted event can (optionally) serve as a Trigger for an Action.
- An Action can be virtually any API call — trigger a new flow run, pause a work queue, toggle a deployment schedule off, and much more.
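As a sketch, assuming a Prefect Cloud workspace and a Prefect 2 release that ships `prefect.events.emit_event`, a custom event could be emitted like this (the event name and resource ID are placeholders):

```python
from prefect.events import emit_event

# Emit a custom event that an Automation trigger can react to.
emit_event(
    event="acme.file.received",
    resource={"prefect.resource.id": "s3://acme-bucket/raw/orders.csv"},
)
```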
The Automations feature has been generally available in Prefect Cloud since January 18, 2023. For more details about the scope of this GA release, check this Discourse announcement.
Next steps
Here is how you can give Prefect a try. Create a script called `myflow.py`:
```python
from prefect import flow


@flow(log_prints=True)
def hello():
    print("You're the Prefectionist now!")


if __name__ == "__main__":
    hello()
```
From your terminal, run `pip install prefect` and then `python myflow.py`. Start the Prefect Orion API server using `prefect orion start`. That’s everything you need to build, run, schedule, orchestrate, and observe your dataflow. For everything else, check our docs and sign up for Prefect Cloud.
If you have any questions, jump into Discourse or Slack Community.
Happy Engineering!
📚 Read next
🔹 Declarative Dataflow Deployments with Prefect Make CI/CD a Breeze
🔹 How to Build a Modular Data Stack — Data Platform with Prefect, dbt and Snowflake
🔹 Prefect + AWS ECS Fargate + GitHub Actions Make Serverless Dataflows As Easy as .py
🔹 Serverless Real-Time Data Pipelines on AWS with Prefect, ECS and GitHub Actions
🔹 Scheduled vs. Event-driven Data Pipelines — Orchestrate Anything with Prefect
🔹 Event-driven Data Pipelines with AWS Lambda, Prefect and GitHub Actions