Common Pitfalls with Durable Execution Frameworks, like Durable Functions or Temporal
I recently came back from Temporal.io’s excellent Replay conference here in Seattle. It was great hearing from Temporal employees as well as their customers about the benefits and challenges of using “workflow-as-code” (sometimes also referred to as “code-as-workflow”) durable execution frameworks, like Temporal. As the creator of Azure Durable Functions, which is a similar project with the same historical roots, I found that much of the discussion applied to the developers I work with as well.
One of the biggest takeaways from this conference for me personally was confirmation about some of the biggest challenges that developers face when using this kind of replay-based durable execution technology in their applications. I thought it might be worth writing about some of these challenges as a way to both educate those who are exploring the technology as well as help me think more clearly about ways to solve them. Below are a few of the most heavily discussed challenges. Note that the content of this post assumes you have basic familiarity or experience with either Durable Functions or Temporal.
Orchestrations in Durable Functions and workflows in Temporal must be deterministic, meaning the side effects of the code must not change if executed multiple times with the same input. This effectively means that durable code must not perform any I/O, use random number generators, or fetch the current time using language-native APIs (since you would get a different result on each execution). The Durable Functions documentation includes a detailed list of these constraints, and Temporal has similar documentation on this topic. Failing to abide by these rules can result in execution failures when the most recent replay of the code doesn’t match the execution log of a previous replay. While it’s relatively easy to follow these rules, they represent a coding convention that most developers aren’t used to (yet).
This is the most obvious challenge developers will face but it’s also the easiest to deal with. For example, developers can work around determinism code constraints by wrapping non-deterministic code in activities, which have no such limitations. Static code analyzers, such as the Roslyn Analyzer for Durable Functions orchestrations written in C#, are also helpful in identifying determinism errors during development. However, static analyzers may not exist for all supported languages, and the rules for writing “correct” code may take some getting used to.
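To make the replay mechanics concrete, here is a minimal sketch in plain Python. It is not the Durable Functions or Temporal API: `ToyContext`, `call_activity`, and both orchestrator functions are hypothetical names invented for illustration, and the "history log" is just a list. The idea is only to show why a random call is safe inside an activity (its result is recorded and replayed) but not safe directly in orchestration code.

```python
import random


class NonDeterminismError(Exception):
    """Raised when replayed code diverges from the recorded history."""


class ToyContext:
    """Toy stand-in for an orchestration context; not a real SDK type."""

    def __init__(self, history=None):
        # Recorded (activity name, result) pairs from earlier executions.
        self.history = list(history or [])
        self._position = 0

    def call_activity(self, name, func, *args):
        if self._position < len(self.history):
            # Replaying: reuse the recorded result instead of re-running the activity.
            recorded_name, result = self.history[self._position]
            if recorded_name != name:
                raise NonDeterminismError(
                    f"history has {recorded_name!r} but the code scheduled {name!r}"
                )
        else:
            # First execution: run the activity and record its result.
            result = func(*args)
            self.history.append((name, result))
        self._position += 1
        return result


def roll_dice():
    # Non-deterministic work lives in an activity, where it is allowed.
    return random.randint(1, 6)


def safe_orchestrator(ctx):
    # The roll is recorded once, so every replay takes the same branch.
    roll = ctx.call_activity("roll_dice", roll_dice)
    step = "high_path" if roll > 3 else "low_path"
    return ctx.call_activity(step, lambda: step)


def unsafe_orchestrator(ctx):
    # Calling random directly means a replay may take a different branch
    # than the one recorded in the history.
    step = "high_path" if random.randint(1, 6) > 3 else "low_path"
    return ctx.call_activity(step, lambda: step)
```

Running `safe_orchestrator` a second time against the history of the first run returns the same result without re-rolling. Running `unsafe_orchestrator` against an existing history raises `NonDeterminismError` whenever the fresh random value picks a branch that differs from the recorded one.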
In practice, the challenge of determinism code constraints doesn’t come up much because it’s actually pretty easy to work with. In fact, I don’t recall much discussion about this during the Replay conference. However, I thought it was worth mentioning here for completeness. It’s the other challenges listed in this post that I think are much more interesting to developers.
One of the main benefits of durable execution guarantees is that you can focus on writing business logic rather than writing code to detect and recover from unexpected failures. The savings in terms of time spent writing corner-case code and time spent debugging live-site incidents can be significant, which is why these frameworks are incredibly attractive to the developers that discover them. However, another side effect of automatic retries is that a particular piece of code may be executed more than once, hence the phrase, “at-least-once” execution guarantees. This is not a problem for orchestrations or workflows; they’re already written to be completely deterministic and safe for replay. However, this may be a problem for activities. For example, if your activity sends an email to a customer, retrying the activity could result in multiple emails sent to the customer. If this happens just once, then it’s not a big deal, but if it happens many times due to many failures, then it could quickly escalate into a big problem.
Because of the at-least-once execution guarantees of orchestrations and workflows, it’s best practice to implement activity code to be idempotent. Whether this is practical or not depends entirely on what the activity is doing, and whether any external systems that these activities work with support idempotency. For example, you may be able to easily check whether a particular record was already written to a database before writing it again, but you may not be able to know whether an email has already been sent before sending it a second time.
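As a rough illustration of the database case, the sketch below makes a write idempotent by keying it on values that stay the same across retries. Everything here is an assumption for the sake of the example: `InMemoryOrderStore` stands in for an external database, and `save_order_activity`, `workflow_id`, and `step_name` are hypothetical names rather than anything provided by Durable Functions or Temporal.

```python
class InMemoryOrderStore:
    """Hypothetical stand-in for an external database."""

    def __init__(self):
        self.records = {}
        self.write_count = 0

    def write_once(self, key, record):
        # Check whether the record already exists before writing it again.
        if key not in self.records:
            self.records[key] = record
            self.write_count += 1
        return self.records[key]


def save_order_activity(store, workflow_id, step_name, order):
    # Build the key from values that are identical on every retry,
    # so a retried activity maps to the same record.
    key = f"{workflow_id}:{step_name}"
    return store.write_once(key, order)
```

Calling `save_order_activity` twice with the same `workflow_id` and `step_name` performs only one write. The email case is harder precisely because there is usually no equivalent of `write_once` to call on the external system.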
To be fair, this problem isn’t unique to Durable Functions or Temporal. This problem exists for many types of distributed systems that depend on queues or durable streams, and there is a large amount of literature dedicated to this topic. Nevertheless, it’s still something that developers writing orchestrations and workflows need to be aware of since both Durable Functions and Temporal provide at-least-once execution guarantees of activities by default. If you’ve mostly only worked with RPC frameworks that don’t involve retries (i.e., have “at-most-once” guarantees), then this may be an issue you haven’t encountered as frequently.
When working with replay-based durable execution frameworks like the Durable Task Framework (which powers Durable Functions) or Temporal, the code is tightly coupled to the durable execution history log. Deploying a change to the code that isn’t compatible with an existing history log can therefore cause your app to unexpectedly fail with non-determinism errors. However, the general problem of code changes not matching existing state has always existed for apps that depend on state in the form of queue messages or rows in a database. The difference for “workflow-as-code” is that it’s far less obvious for developers to understand whether a code change is safe to make or not when writing this kind of implicitly stateful code (i.e., in the form of workflows or orchestrations) vs. explicitly stateful code that reads queue messages or queries a database.
There are two primary approaches that I’ve seen for dealing with the code versioning challenge. One is to make the code aware of different versions by adding if/else checks against version numbers. It’s not unlike putting schema version numbers in queue messages. The downside of this approach is that it doesn’t scale well beyond a small number of changes and there isn’t a good way to test whether this was done correctly. Durable Functions instead proposes deploying code changes into a separate copy of the app (Temporal similarly advises running different versions on different task queues). Doing so removes the problem of needing to be careful about code changes but places a burden on the developer to manage multiple versions of their apps running side-by-side. In my opinion, this is one of the more important challenges that durable execution frameworks need to solve.
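The first approach, version checks inside the code, can be sketched roughly as follows. This is a toy in plain Python, not a real SDK mechanism: `order_workflow`, the activity names, and the idea of passing a `version` fixed at instance start are all assumptions made for illustration.

```python
def order_workflow(version, call_activity):
    """Toy workflow; `version` is assumed to be fixed when the instance starts."""
    call_activity("reserve_inventory")
    if version >= 2:
        # Step added in version 2. Instances started on version 1 never
        # recorded it, so they must keep skipping it on replay.
        call_activity("check_fraud")
    call_activity("charge_payment")


def scheduled_steps(version):
    # Collect the activity names the workflow schedules for a given version.
    steps = []
    order_workflow(version, steps.append)
    return steps
```

Here `scheduled_steps(1)` yields the original two steps while `scheduled_steps(2)` includes the new one, so in-flight version 1 instances replay cleanly. Each further change adds another branch, which is the scaling problem described above.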
The payload size problem, while arguably less severe than the versioning problem, is possibly more complicated. First of all, it’s not immediately obvious that things like activity input parameters and return values are being serialized; the programming model conveniently abstracts this detail away. Newer developers that aren’t yet aware of how parameter serialization works behind the scenes are likely to be surprised by this. In some cases, it’s something you don’t learn about until your workflow fails because you’ve exceeded some size limit, or when your app crashes because it ran out of memory trying to load these values from the history log.
The common workaround for dealing with payload size limits is to store large payloads in external blob storage, like Azure Storage or AWS S3 buckets, and pass references (URLs, etc.) to these blobs to the activities. This avoids the large payload limitations but adds unpleasant boilerplate to the code, which now needs to manage uploads, downloads, serialization, and possibly compression. Even worse, you’re now responsible for managing the lifecycle of those external blobs (i.e., deleting them when they’re no longer needed). Durable Functions tries to help with this by doing the blob storage and lifecycle for you behind the scenes when using the Azure Storage state provider. However, this comes with other hidden costs, such as high memory and high CPU usage when loading orchestration history with many large payloads. In such cases, the orchestration will usually continue to function, but performance may be severely impacted in a way that’s unexpected. This is why Durable Functions strongly recommends using smaller payload sizes even though there isn’t any enforced size limit.
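A minimal sketch of the manual workaround might look like the following. It uses an in-memory dictionary in place of a real blob service, so `InMemoryBlobStore` and both activity functions are hypothetical names for illustration only; a real implementation would call the storage provider’s own client library.

```python
import json
import uuid


class InMemoryBlobStore:
    """Hypothetical stand-in for a blob service such as Azure Storage or S3."""

    def __init__(self):
        self._blobs = {}

    def upload(self, data: bytes) -> str:
        # Return a short reference that is cheap to store in the history log.
        blob_id = f"blob-{uuid.uuid4()}"
        self._blobs[blob_id] = data
        return blob_id

    def download(self, blob_id: str) -> bytes:
        return self._blobs[blob_id]

    def delete(self, blob_id: str) -> None:
        # Lifecycle management is now the caller's responsibility.
        self._blobs.pop(blob_id, None)

    def __len__(self):
        return len(self._blobs)


def produce_report_activity(store, rows):
    # Upload the large result and return only the small reference.
    payload = json.dumps(rows).encode("utf-8")
    return store.upload(payload)


def summarize_report_activity(store, blob_id):
    # Download and deserialize the payload inside the activity.
    rows = json.loads(store.download(blob_id).decode("utf-8"))
    return {"row_count": len(rows)}
```

Only the short blob reference and the small summary would pass through the orchestration, but the code now carries the serialization and cleanup boilerplate mentioned above, including an explicit `store.delete(blob_id)` once the data is no longer needed.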
History log sizes
Another challenge that often surprises developers is the need to keep workflow or orchestration history log sizes manageable. Writing an infinite loop, for example, can explode the size of the history log, resulting either in runtime failures due to reaching a size limit or in hard-to-debug out-of-memory errors. But history log sizes can balloon quickly in other ways too, such as when doing large activity fan-outs. For example, an orchestration or workflow that schedules several thousand activities in parallel can grow the history very quickly.
Techniques for keeping history log sizes manageable include taking advantage of things like the continue-as-new APIs and breaking up large orchestrations or workflows into multiple smaller sub-orchestrations or child workflows. If an orchestration or workflow needs to process many documents in parallel and assigns each document to a single activity, techniques such as batching multiple documents into a single activity can also be helpful. Unfortunately, this is another issue that developers can run into unexpectedly and may require significant code refactoring (and therefore code versioning, as we already discussed).
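The batching technique can be sketched in a few lines of plain Python. The function names are hypothetical, and the activities run sequentially here rather than in parallel, since the point is only to show how batching reduces the number of scheduled activities (and therefore history entries).

```python
def chunk(items, batch_size):
    """Split items into lists of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_batch_activity(documents):
    # One activity invocation handles a whole batch of documents.
    return len(documents)


def fan_out(documents, batch_size):
    # Schedule one activity per batch instead of one per document.
    batches = chunk(documents, batch_size)
    processed = sum(process_batch_activity(batch) for batch in batches)
    return len(batches), processed
```

With 5,000 documents and a batch size of 100, `fan_out` schedules 50 activities instead of 5,000. The trade-off is that a failure in one document now causes the whole batch to be retried, which ties back to the idempotency discussion earlier.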
Dead-lettering or poison-message handling
Sometimes workflows or orchestrations will get stuck in retry loops because of an unrecoverable failure. For example, if a particular parameter value for an activity triggers code that is capable of crashing the worker process, then it will be impossible for that workflow to make progress. It may not even be a problem in your code: it could be a problem in the SDK itself that is triggered only under certain conditions (for example, out-of-memory errors when loading an orchestration’s history). What makes this especially problematic is that the infinite retries will continuously consume resources, preventing other orchestrations or workflows from making progress.
At the time of writing, neither Durable Functions nor Temporal has a built-in dead-lettering or poison-message handling feature. This means that these types of problems require manual mitigation. In some cases, it can require identifying and terminating problematic orchestrations or workflows. In more degenerate cases, it may require surgically modifying durable state to remove “poison” messages, which itself is fraught with danger.
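For readers unfamiliar with the term, the sketch below shows the general shape of dead-lettering as it appears in queue-based systems. It is not a feature of either framework: `drain`, `max_attempts`, and the in-memory queue are assumptions for illustration, and messages are assumed to be hashable so attempts can be counted per message.

```python
from collections import deque


def drain(messages, handler, max_attempts=3):
    """Process messages, setting aside any that fail max_attempts times."""
    pending = deque(messages)
    attempts = {}
    dead_letters = []
    while pending:
        message = pending.popleft()
        try:
            handler(message)
        except Exception:
            attempts[message] = attempts.get(message, 0) + 1
            if attempts[message] >= max_attempts:
                # Give up: move the message aside so it stops consuming resources.
                dead_letters.append(message)
            else:
                # Put it at the back of the queue and try again later.
                pending.append(message)
    return dead_letters
```

The key property is that a message which always fails is retried a bounded number of times and then parked for inspection, so the other messages can still make progress.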
In summary, these are some of the biggest challenges I’ve seen when helping customers adopt durable execution frameworks like Durable Functions and Temporal. It’s my opinion that these problems are far outweighed by the productivity benefits of the frameworks, but they are challenges nonetheless that are critically important for developers to be aware of as they build out production applications. It’s also critically important for durable execution frameworks to continue to evolve and create easy-to-use solutions to these challenges. My hope is that as these frameworks grow in popularity, more and more creative solutions will arise. For example, the folks at Convoy recently published a blog post, written by one of their summer interns, proposing a CI-based solution that they use for dealing with versioning.
We’re certainly working hard on these problems as we improve the developer experience for Durable Functions, and I look forward to learning more about the various solutions the Temporal community comes up with as well! My hope is that our respective communities can learn from each other and that all users can benefit from this shared learning.