The Prefect Blog
Published in

The Prefect Blog

Why Prefect

Orchestration is just one part of the dataflow equation

What is Prefect?

It’s a flexible framework to build, reliably execute and observe your dataflow while supporting a wide variety of execution and data access patterns.

Why should you care?

The Modern Data Stack encompasses a large array of highly specialized components. You can find tools for data ingestion, transformation, analysis, validation, cataloging — the list goes on. Domain-specific orchestrators focus on supporting:

  • one platform (Kubernetes, Databricks, Snowflake, AWS, GCP, Azure),
  • specific interface for workflow definition (drag-and-drop, low-code, no-code, YAML),
  • or a dedicated target audience (business users, data warehouse builders, microservice developers, analytics, or ML engineers).

However, custom business logic and more advanced applications are left behind. They don’t match the language of “modern” specialized tools and remain separated from the rest of the stack. As a result, you never get the full picture of what’s the status of the entire dataflow now.

Imagine a world where you can automate dataflow across any stack and observe its behavior within your data infrastructure. That’s why at Prefect, we’re building a global coordination plane.

Coordination plane

As someone who relies on robust dataflow, you may ask:

  • What’s the state of your dataflow right now?
  • Is the infrastructure healthy? If some component is down, was the platform team alerted?
  • Did your critical data warehouse refresh workflow succeed? How long did it take?
  • As part of a data governance strategy, you often need to guarantee that data consumers get informed about failed validation checks. Did all data quality checks pass? And if not, was the support ticket automatically created?
  • The Sales table should be updated daily by 9 AM — did we miss that SLA today? If so, was the team notified via an automatically created alert?

With Prefect, you can answer those questions with confidence, and you can automate how you want to react to those circumstances. The coordination plane gives you a framework to define how data should flow through your system. Prefect will then inform you when those expectations are violated.

At any time, you can observe the state of your data stack and be sure that all components are healthy and working well together. You can use the observed state to drive orchestration and trigger automated event-driven actions in real time.

Simplicity

We often hear the word simplicity when users describe Prefect. You install the pip package, add a single decorator to your function, and run your script. That single decorator adds superpowers to your Python function, providing observability from the UI, automatic retries, timeouts, versioning, parameter validation, result persistence, and more.

With Prefect, you never have to learn more about the tool than you need. For instance, you don’t have to break your flow into tasks. Once your use case grows and you need parallelization and concurrency, you can learn about tasks and task runners. The same applies to scheduling, asynchronous execution, and many more.

Scheduled AND event-driven workflows

A common challenge in building a data platform is doing it in an extensible and adaptable way. Tool and database migrations are an inevitable part of engineering work in the constantly evolving world of technology. Therefore, your data platform must be designed for change. Your dataflow might be a batch processing job today, but a near real-time streaming workflow tomorrow. Similarly, the same business logic, processing files from S3, may simultaneously run as a daily scheduled job and as an event-driven workflow, depending on the context and use case. You shouldn’t have to duplicate that logic and use separate execution platforms only to support both access patterns. Prefect supports both scheduled and event-driven workflows.

Embracing change highlights the importance of adaptable user-facing abstractions. Prefect provides those out of the box. The most notable ones are blocks, deployments, and subflows.

Modular components: Blocks

Blocks are connectors for code. They hold credentials and business logic to interact with external systems in your stack. You can choose which blocks to include based on your current toolset. And as your stack evolves, blocks can gain new capabilities to reflect new business logic. They make it easy to add configuration to new systems and replace existing blocks with new ones. You can finally see your stack and its capabilities at a glance:

Blocks encourage modularity, standardized configuration, and reusability of components. You can set up your credentials once, rotate them in one place when they expire (via a single API call), and use them in any other component that needs them.

Modular components: Deployments

Infrastructure-agnostic

Prefect isn’t tied to any specific cloud or execution environment. You could run the same flow simultaneously across all public clouds and on-prem. When you create a deployment, a server-side workflow definition turns your flow into an API-managed entity. This definition represents metadata, including the flow entrypoint, work queue name, storage, and infrastructure blocks.

API-first

Triggering a flow run from a deployment and running your flow on a cloud VM or a remote compute cluster is just an API call away.

Flexible storage

Your code can be baked into a Docker image along with your code dependencies, or it can live in S3, a monorepo, or a project-specific GitHub branch. You only need to provide metadata on your storage block to point Prefect to the location of your code. Prefect will dynamically pull that code at runtime. You can switch between various storage options without code modifications.

Infrastructure at any scale: from a small VM to serverless and a distributed Kubernetes cluster

Switching a deployment’s infrastructure between a local Process, KubernetesJob and a serverless container is as simple as pointing to a different infrastructure block. You can start with a single VM and a local Process block and, if required, later move to serverless containers in the Cloud or a KubernetesJob. You can change the infrastructure block via a CLI flag, e.g. change -ib process/prod to -ib kubernetes-job/prod.

Source: Create Robust Data Pipelines with Prefect, Docker, and GitHub

Switching between dev and prod is seamless

Moving from a local development to a production Kubernetes job is as simple as swapping the infrastructure block name on your deployment, e.g. change -ib process/dev to -ib kubernetes-job/prod. Additionally, Prefect profiles and workspaces simplify the environment separation between dev and prod. The experience of switching environments is (finally) painless.

Scheduling is optional, and it’s decoupled from execution

When you create a deployment, you can attach a schedule to it, but it’s optional. This separation of concerns allows you to have a deployment executed ad-hoc (e.g., event-driven or from the UI).

Modular components: Subflows

Prefect is the first orchestration solution that supports subflows. If you have a wide variety of workflows that depend on each other, incorporating them within a single monolithic workflow can be difficult to maintain, reuse, scale, and observe. Subflows help solve that challenge.

Source: Orchestrate Your Data Science Project with Prefect 2.0

Security

We take security and privacy seriously, from our Cloud infrastructure to the systems and practices. Prefect undergoes a SOC 2 Type II compliance audit annually. The hybrid execution model meets the strict standards of major financial and healthcare institutions and companies that work with regulated data.

Prefect Cloud never receives your code or data. Your flows run on your private infrastructure by exchanging state information and metadata. All storage systems are encrypted with industry best practice algorithms. Data is encrypted at all times in transit and at rest, with a minimum of TLS 1.2 enforced on all of our endpoints.

Dataflow coordination is driven purely by metadata transferred over an authenticated API. Prefect does not require ingress access to the customer environment as the connection is opened outbound via the Prefect agent.

The Hybrid Model — to learn more about our security best practices, get in touch with sales@prefect.io for our security white paper

Orchestration

Dataflow orchestration is an umbrella term for features allowing you to control how your flow should transition between various states. Prefect Orion API provides a rich array of orchestration states, ensuring the desired flow of execution.

Retries & alerts

If a task or flow run fails, the orchestrator can retry that run or send you a failure notification. You may choose to continue executing downstream steps even if this component fails.

Caching

If one part of your computation is expensive, Prefect can cache that run’s result and reuse it in later runs.

Data passing & dataflow

You can pass data in memory without having to manually persist the intermediary results.

Parallelism & concurrency

Scaling out your execution to distributed compute clusters is as simple as adding a Dask or Ray task runner to your @flow decorator.

Concurrency limits

If you want to, for instance, mitigate pressure on your database, concurrency limits can ensure that only a certain number of concurrent runs get executed — the other ones will be queued.

Robust failure handling

You can take automated action if a flow fails, crashes, or gets canceled.

Manual approval

If some element of your workflow needs to wait for manual approval, dedicated orchestration states allow you to pause execution and resume it later.

Code as workflows

You can think of an orchestrator as control flow rules added to your business logic. Defining that control flow in Prefect is as easy as .py. There’s no custom DSL for branching logic or handling exceptions. To define your control flow, you write native Python with function calls, if/else statements, or for/while loops.

Source: Sending Slack Notifications in Python with Prefect

The next level of orchestration

DAGs are optional

Most orchestrators expect you to define exactly how your workflow will behave in advance, either imperatively or declaratively. This isn’t always desirable or even possible. Imagine that you want to start a flow when a given file arrives in S3. If you know exactly when this will happen (e.g., every workday at 7 AM), this workflow is highly predictable and well suited to statically defined DAGs or CRON jobs. But suppose that one of the following conditions is true:

  • you don’t know when this file will arrive
  • you don’t know how many files will arrive
  • you don’t know whether this file will arrive at all (third-party tools, unreliable APIs, things out of your control)

Take automated action on a missed SLA

Given the above criteria, it’s helpful to define an SLA on that file-arrival event and trigger some action if that desired event doesn’t happen by 7 AM. If that SLA is missed, you can trigger another workflow that extracts the same data from an alternative source system or API.

You may want to “just” get an alert telling you which data didn’t arrive at the expected time and which downstream dataflow was affected by that missed SLA. As simple as this use case may sound, implementing this in practice, especially at scale, is far from trivial. You may end up building a second orchestrator only to incorporate this SLA-based pattern.

Prefect allows you to combine explicit orchestration rules defined in code with the implicit state of the world outside of Prefect orchestration.

Source: (Re)Introducing Prefect: The Global Coordination Plane

🛠️ Observability

Observe everything

At Prefect, we want to meet users where they are. Many of them have workflows built with a legacy orchestrator, and those workflows can’t be easily migrated. Similarly, a common point of friction in large data organizations is that they can’t agree on a single orchestrator. As Martin Kleppman put it in Designing Data-Intensive Applications:

“The difficulty of getting different organizations to agree on anything outweighs most other concerns.”

Suppose you use Airflow or an in-house orchestrator for existing predictable workflows. You can keep using it and observe its health with Prefect. For instance, you can emit an event when your Airflow DAG completes, which then triggers a Prefect flow run.

Orchestrate anything

With the coordination plane, you can observe the state of a workflow in any orchestrator or tool in your stack via emitted events, which can trigger Prefect flows (or other actions of your choice).

Those three highlighted elements are components of Automations:

  • Event informs Prefect about something that happened in your data infrastructure
  • This emitted event can (optionally) serve as a Trigger for an Action
  • Action can be virtually any API call — trigger a new flow run, pause a work queue, toggle the deployment schedule off, and much more.

The Automations feature is generally available in Prefect Cloud since 18th of January 2023. For more details about the scope of this GA-release, check this Discourse announcement.

Next steps

Here is how you can give Prefect a try. Create a script called myflow.py:

from prefect import flow

@flow(log_prints=True)
def hello():
print("You're the Prefectionist now!")

if __name__ == "__main__":
hello()

From your terminal, run pip install prefect & python myflow.py. Start the Prefect Orion API server using prefect orion start. That’s everything you need to build, run, schedule, orchestrate and observe your dataflow. For everything else, check our docs and sign up for Prefect Cloud.

If you have any questions, jump into our Discourse or Slack Community.

Happy Engineering!

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store