Dancing in the Dark

What happens if the Prefect Cloud API goes down?

“What happens if your API goes down?” This is an understandably common question from Prefect’s enterprise customers, who depend on Prefect Cloud to automate mission-critical workflows.

I always explain that because of Prefect’s unique Hybrid Model, an API outage is not nearly as disruptive as what they probably expect from a SaaS product, and in some cases its effects can be entirely mitigated.

Last Friday, my claim was put to the ultimate test: Cloudflare experienced a DNS issue that made many websites and services, including the Prefect Cloud API, inaccessible for a short period. As expected, we saw a large spike in outstanding task runs that lost their connection to the backend and became “zombies” (more on this later). Despite this unpleasant situation, the moment the issue was resolved, work continued as scheduled and all affected workflows were easily (and in most cases automatically) resumed.

People who naively glance at our Hybrid Model might conclude that it is purely about separation of concerns (execution environment vs. platform environment), but as with most things at Prefect, it is the product of careful consideration to ensure that even in a worst-case event we are still proactively working on users’ behalf and providing value. In particular, thanks to its innovative design, even if our API is down:

  • your business-critical data is not lost or affected
  • your work is still being scheduled
  • a record of all outstanding jobs is maintained and curated (including sending notifications)
  • work will resume when API access is restored

Why do we have such confidence in our approach? Because it was designed to put resilience first. This is another example of how we designed Prefect as an insurance product — most useful when things go wrong. Through careful design of each component, we created a system that delivers value and recovers resiliently from failure even when a substantial portion of the global internet is down.

Scheduling

The Prefect Cloud scheduler service is an always-on, horizontally scalable service that is constantly parsing all flow schedules. Its job is simple: to create new flow runs and place them in a Scheduled state (with the appropriate future start time) for every flow that needs scheduling.

Once a run is placed in a Scheduled state, it is added to a work queue and stays there until a Prefect Agent picks it up at the appropriate time via an API query. This design ensures that scheduled work is never lost: if no Agents can communicate via the API, then at worst some runs begin late.
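
To make the "at worst some runs begin late" claim concrete, here is a minimal sketch of the queue-and-poll pattern described above. It is not Prefect's implementation: FlowRun, WorkQueue, agent_poll, and the api_available flag are hypothetical names invented for illustration, and the timestamps are arbitrary.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class FlowRun:
    # Hypothetical stand-in for a scheduled flow run record.
    flow_name: str
    scheduled_start: datetime
    state: str = "Scheduled"


@dataclass
class WorkQueue:
    # Hypothetical stand-in for the backend work queue.
    runs: List[FlowRun] = field(default_factory=list)

    def add(self, run: FlowRun) -> None:
        self.runs.append(run)

    def due_runs(self, now: datetime) -> List[FlowRun]:
        """Return unclaimed runs whose scheduled start time has passed."""
        return [
            r for r in self.runs
            if r.state == "Scheduled" and r.scheduled_start <= now
        ]


def agent_poll(queue: WorkQueue, now: datetime, api_available: bool) -> List[FlowRun]:
    """One polling cycle of a hypothetical agent."""
    if not api_available:
        # The agent cannot reach the queue, so nothing is claimed.
        # The runs are untouched and simply wait for the next poll.
        return []
    claimed = queue.due_runs(now)
    for run in claimed:
        run.state = "Submitted"
    return claimed


queue = WorkQueue()
start = datetime(2020, 1, 1, 9, 0)  # arbitrary example time
queue.add(FlowRun("etl", scheduled_start=start))

# During the outage the agent polls but cannot claim anything.
during_outage = agent_poll(queue, start + timedelta(minutes=5), api_available=False)

# After recovery the same run is still queued and is picked up, only late.
recovered_at = start + timedelta(minutes=30)
after_recovery = agent_poll(queue, recovered_at, api_available=True)
lateness = recovered_at - after_recovery[0].scheduled_start
```

The point of the sketch is that the run's record lives in the queue, not in the agent, so an agent that cannot connect loses nothing; the only visible effect is the lateness value.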

We are currently working on a feature that will allow users to receive notifications both for late flow runs and for Agents that stop communicating with the API, ensuring they are alerted that something might be awry (Note: because of this design, these alerts will be triggered even if the API is down!).

Impacts to in-flight work

Prefect flows have configurable executors that manage all dependency resolution of the tasks within a given flow (this is critical to the scale that Prefect enables). In normal operation, this is sufficient to ensure all work is visited and completed. However, in extreme circumstances, it is possible for task and flow runs to end in a half-completed state. For example, Kubernetes preemption events can shut down work without warning. Similarly, an API outage means that tasks cannot confirm their final state with the backend and consequently the flow run cannot complete.

Prefect Cloud has multiple services running behind the scenes that monitor for these types of situations. Two of the most visible are:

  • Zombie Killer Service: this service looks for task runs that are in a Running state but haven’t sent a heartbeat in the last 2 minutes; when it finds one, it places the run into either a Failed state or a Retrying state (if the task has configured retries). If no activity occurs on the retrying tasks, the Retrying states eventually make their way into the work queue for Agents to pick up. Advanced users will be able to configure zombie behavior separately from task-level retries.
  • Lazarus Service: this service looks for distressed flow runs and task runs that don’t appear to be making any progress. When it finds one, it places the run into the work queue for Agents to retry. If Lazarus visits the same flow 3 times in a row, it will conclude that the flow is fundamentally broken and automatically mark it as Failed, triggering any configured Cloud hooks to fire and send the appropriate notification. As a concrete example, the most common Lazarus event is a flow run that has been Submitted by an Agent but has not entered a Running state after some time, such as when the Agent is unable to deploy it into an execution cluster.
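
The decision rules of these two services can be sketched as follows. This is an illustrative model only, not Prefect Cloud's code: TaskRun, FlowRunRecord, zombie_check, and lazarus_visit are hypothetical names, and the sketch assumes that the third consecutive Lazarus visit is the one that marks the flow run as Failed. The 2-minute heartbeat window and the limit of 3 visits come from the descriptions above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

HEARTBEAT_TIMEOUT = timedelta(minutes=2)  # heartbeat window from the text
LAZARUS_MAX_VISITS = 3                    # consecutive visits from the text


@dataclass
class TaskRun:
    # Hypothetical task run record.
    state: str
    last_heartbeat: datetime
    retries_left: int = 0


@dataclass
class FlowRunRecord:
    # Hypothetical flow run record tracking consecutive Lazarus visits.
    state: str
    lazarus_visits: int = 0


def zombie_check(run: TaskRun, now: datetime) -> str:
    """Fail or retry a Running task whose heartbeat is older than the timeout."""
    if run.state == "Running" and now - run.last_heartbeat > HEARTBEAT_TIMEOUT:
        if run.retries_left > 0:
            run.retries_left -= 1
            run.state = "Retrying"
        else:
            run.state = "Failed"
    return run.state


def lazarus_visit(run: FlowRunRecord) -> str:
    """Reschedule a stalled flow run, or fail it on the third visit in a row.

    Assumption: the visit that reaches the limit is the one that fails the run.
    """
    run.lazarus_visits += 1
    if run.lazarus_visits >= LAZARUS_MAX_VISITS:
        run.state = "Failed"
    else:
        run.state = "Scheduled"  # back into the work queue for an Agent
    return run.state


now = datetime(2020, 1, 1, 9, 0)  # arbitrary example time
stale = now - timedelta(minutes=5)

no_retry = zombie_check(TaskRun("Running", stale), now)
with_retry = zombie_check(TaskRun("Running", stale, retries_left=1), now)
healthy = zombie_check(TaskRun("Running", now - timedelta(seconds=30)), now)

stuck = FlowRunRecord(state="Submitted")
lazarus_states = [lazarus_visit(stuck) for _ in range(3)]
```

In this model a task with a stale heartbeat ends up Failed or Retrying depending on its remaining retries, a task with a recent heartbeat is left alone, and a flow run that stays stuck is rescheduled twice before being marked Failed.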

These services (along with many others) guarantee that in the extreme event that work stops communicating with the API, items are re-added to the work queue for completion once API communication is possible again. Critically, a record of the event is easy to find in your UI dashboard, from which you can choose to manually restart the work or inspect it further.

Data Availability

Last but not least is the issue of data: more often than not, enterprises are concerned about their ability to access data during an outage. Independent of the outage we’re discussing here, Prefect’s Hybrid Model ensures that no proprietary data is ever stored in Prefect Cloud’s database. This means that your ability to access your business-critical data is completely unaffected by the availability of the Prefect Cloud API.

We say that the Hybrid Model provides “cloud convenience with on-prem security,” and indeed, it ensures that your code remains fully on-premises and in your control. This means that in an absolute worst-case event, you could call flow.run() yourself to guarantee your data is updated.

Our work continues!

All aspects of what I’ve described above are in a continual cycle of improvement, as we constantly seek to strengthen our guarantees. Prefect’s ultimate mission is to eliminate negative engineering by ensuring that data professionals can confidently and efficiently automate their data applications with the most user-friendly toolkit around.

Our design goal is to be minimally invasive when things go right and maximally helpful when they go wrong; what better proof than a global internet failure?

Please continue reaching out to us with your questions and feedback — we appreciate the opportunity to work with all of you!

Happy Engineering!

— The Prefect Team
