The Prefect Blog

Scheduled vs. Event-driven Data Pipelines — Orchestrate Anything with Prefect

When and when not to schedule your workflows and why sensors and daemon processes are a waste of resources

[Image: Marvin observes what’s currently happening in the galaxy]

One of the most common data engineering challenges is triggering workflows in response to events such as when a new file arrives in a certain directory. The approach taken by legacy orchestrators is to deploy continuously running background processes, such as sensors or daemons that poll for status. In this post, we’ll discuss why this approach is a waste of resources and demonstrate an alternative — event-driven workflows observable with Prefect.

Use cases for scheduled dataflows

One prevalent use case for scheduled workflows is batch data ingestion and transformation. This process typically involves extracting data (possibly only the delta since the last run) from source systems, ingesting it into your data warehouse, and transforming it as needed. Once this is finished, the same workflow may run data quality tests, refresh data extracts for reporting, or trigger downstream automated actions such as alerting on incorrect KPIs or detecting anomalies in the data.
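As a sketch, such a batch workflow might look like this in Prefect (the task bodies below are placeholders standing in for real extract, load, and test logic):

```python
from prefect import flow, task


@task
def extract() -> list[dict]:
    # Pull the latest delta from the source system (placeholder)
    return [{"id": 1, "value": 42}]


@task
def load(records: list[dict]) -> None:
    # Ingest the records into the data warehouse (placeholder)
    print(f"Loaded {len(records)} records")


@task
def run_quality_tests() -> None:
    # Validate the freshly loaded data (placeholder)
    print("Data quality tests passed")


@flow
def daily_ingestion():
    records = extract()
    load(records)
    run_quality_tests()


if __name__ == "__main__":
    daily_ingestion()
```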

Apart from these common data engineering tasks, many data science use cases are also popular candidates for scheduled workflows, including re-materializing feature tables, retraining ML models, and generating batch predictions on a regular cadence.

In short, scheduling is typically needed for static and predictable batch processing workflows that usually don’t require frequent adjustments (until you get into trouble with DST, just kidding).

Use cases for event-driven dataflow

Triggers

Scheduling workflows is wasteful if you need to run your workflow only when something happens, which you can think of as a trigger:

  • When a new object arrives in your S3 bucket,
  • When a new change data capture (CDC) record gets ingested,
  • When a new record arrives in a DynamoDB, Aurora, Kafka, or Kinesis stream,
  • When a new event is received from some external system via an API call (MongoDB, Datadog, Zendesk, Salesforce, …),
  • When a new message is received in some publish-subscribe message queue,
  • …and many more.

Actions

There are many dataflow scenarios that fit into this pattern:

  • As soon as your data store gets updated in some custom application (a new record or object), run a flow to take action on that update, e.g., submit a trade, place an order, send a message to someone, or start some process,
  • When something happens (a new event or message arrives), process it immediately, e.g., insert that data into a table,
  • When some external job (e.g., a Databricks Spark job running data transformations) completes, run another flow to retrain an ML model, start some post-processing, or simply send a notification about its completion.

Event-driven workflows allow you to take action immediately when something happens. In the next section, we’ll look at possible solutions to implement the event-driven workflow pattern.

Possible solutions

There are several ways that you could implement the above-mentioned scenarios within your data pipelines. We’ll use the arrival of an S3 object as a running example to illustrate how each approach handles the same use case.

1. Manual polling 👎

The most obvious solution to the problem is polling for status manually. This may involve a while loop that continuously polls for status (e.g., via an API call to some external service) and breaks once the condition is satisfied. Here is how you could approach such a long-running polling job for the S3 use case:
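A minimal sketch of such a loop using boto3 (the bucket and object key below are placeholder values):

```python
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

BUCKET = "my-bucket"          # placeholder bucket name
KEY = "raw/new_file.csv"      # placeholder object key
POLL_INTERVAL_SECONDS = 60


def wait_for_object(bucket: str, key: str) -> None:
    """Block until the given S3 object exists, checking once per interval."""
    while True:
        try:
            s3.head_object(Bucket=bucket, Key=key)
            print(f"Object s3://{bucket}/{key} arrived")
            break
        except ClientError as exc:
            # A 404 means the object isn't there yet; anything else is a real error
            if exc.response["Error"]["Code"] != "404":
                raise
        time.sleep(POLL_INTERVAL_SECONDS)


if __name__ == "__main__":
    wait_for_object(BUCKET, KEY)
    # ...kick off downstream processing here...
```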

Drawbacks of this approach:

  • It requires a daemon process, i.e., a long-running job, which can be hard to maintain,
  • It’s fragile — if something goes wrong (say, one API call polling for status fails), your entire process fails as well, unless you write even more code to handle exceptions, restart the process, and alert you that something went wrong,
  • It’s expensive — you need to run this process 24/7 until the event you care about occurs.

2. Sensors & daemon processes from legacy orchestrators 👎

The approach here is essentially the same as above; the main difference is that the polling logic is hidden behind an abstraction implemented by the legacy orchestrator.

3. Event-driven workflows with Serverless (e.g., AWS Lambda) 👍

Event-driven architectures have matured to the point that the industry has adopted standards such as CNCF CloudEvents. Similarly, all major cloud providers offer event-driven services, including:

  • AWS Lambda, triggered by S3 event notifications or Amazon EventBridge rules,
  • Azure Functions, triggered by Azure Event Grid subscriptions,
  • Google Cloud Functions, triggered by Eventarc or Cloud Storage events.

Using any of the above tools provides a significant advantage as compared to the previous options with manual polling, sensors, and daemon processes.
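As an illustration of how little glue code this takes, the following sketch subscribes a Lambda function to object-created events on an S3 bucket (the bucket name and function ARN are placeholders, and the function must separately grant S3 permission to invoke it):

```python
import boto3

s3 = boto3.client("s3")

# Placeholder values - replace with your bucket and function ARN
BUCKET = "my-bucket"
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-new-object"

# Invoke the Lambda function whenever a new object is created in the bucket
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "Id": "new-object-trigger",
                "LambdaFunctionArn": LAMBDA_ARN,
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```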

Drawbacks of this approach:

  • Event-driven workflows scattered across cloud services, with no single place to observe their execution state, can be hard to monitor and maintain.

4. Event-driven workflows with serverless and Prefect 👍 + ❤️

You can combine the scalability of serverless (approach #3) with the convenience and observability provided by Prefect to orchestrate and observe both scheduled and event-driven workflows. Regardless of your chosen cloud provider, simply adding a flow decorator and pointing your serverless function to the Prefect Cloud URL gives you the ability to add retries, caching, and observability to your event-driven serverless workflows. You can also leverage Prefect blocks to securely store secrets and configuration data and send custom notifications.
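A minimal sketch of that pattern for an AWS Lambda handler reacting to S3 events (the function and flow names are illustrative; PREFECT_API_URL and PREFECT_API_KEY are assumed to be set as environment variables on the function):

```python
from prefect import flow, task


@task(retries=3, retry_delay_seconds=10)
def process_new_object(bucket: str, key: str) -> None:
    # ...download and process the new object here (placeholder)...
    print(f"Processing s3://{bucket}/{key}")


@flow(name="s3-event-handler")
def handle_s3_event(bucket: str, key: str) -> None:
    process_new_object(bucket, key)


def lambda_handler(event, context):
    # Extract the bucket and key from the S3 event notification payload
    record = event["Records"][0]["s3"]
    handle_s3_event(record["bucket"]["name"], record["object"]["key"])
```

Each invocation then shows up as a flow run in the Prefect UI, with retries and logs captured automatically.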

To see how to implement that pattern in practice, see our previous post using Prefect, AWS Lambda, and the Serverless Framework.

Next steps

This post discussed various use cases for scheduled and event-driven workflows. We looked at the pros and cons of several implementations and linked a recipe showing how to apply the pattern in practice.

If you want to talk about your scheduled and event-driven workflows or ask about anything, you can reach us via our Community Slack.

Happy engineering!

Anna Geller