How to Think About Cancellation

Terry Crowley

It’s striking how many systems do a poor job with cancellation. This post was kicked off by a recent thread about a proposal to add a standard cancellation mechanism to the JavaScript promises APIs. The lack of cancellation in that original design is in good company. The core Win32 APIs had no standard cancellation mechanism for most of their history. Windows 7 added some support, but generally only in the lowest level of APIs, so it was difficult to use these cancellation mechanisms in a consistent way at the application level. Even then, certain critical low-level APIs (like CreateFile) did not provide any cancel mechanism. Most middleware layers followed the lead of the lower levels: they did not provide any cancellation mechanism and were never extended to provide one. Using the lower-level mechanisms on some middleware computation was essentially equivalent to firing a shotgun into the API (cancellation tends to be a poorly tested code path)!

The lessons learned over the years resulted in a better standard mechanism in the WinRT APIs that were released with Windows 8.

So how should you think about cancellation?

The first thing to recognize is that once you make some request in a distributed system, the only guaranteed way to recognize the failure of that request is to simply decide that you have waited “too long”. The only fundamental mechanism you have is the timeout. Some API may return a failure code, but that is only because some underlying layer made the decision to time out for you. If you want to propagate the ability to decide that you have waited too long, you need to provide an explicit cancel mechanism. Since there are cases where the only really valid decision-maker is the user, having some mechanism to propagate cancel almost always makes sense.
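As a rough illustration of the point above, here is a minimal Python sketch (my own, using asyncio) of a wait that ends for one of three reasons: the reply arrived, the deadline passed, or someone explicitly decided they had waited too long. The names request_with_cancel and demo are hypothetical, and asyncio.sleep merely stands in for a real remote request.

```python
import asyncio
from typing import Any, Awaitable, Optional, Tuple


async def request_with_cancel(
    request: Awaitable[Any],
    cancel: asyncio.Event,
    timeout_s: float,
) -> Tuple[str, Optional[Any]]:
    """Wait for `request` until it finishes, `cancel` is set, or `timeout_s` elapses."""
    work = asyncio.ensure_future(request)
    cancel_wait = asyncio.ensure_future(cancel.wait())
    done, pending = await asyncio.wait(
        {work, cancel_wait},
        timeout=timeout_s,
        return_when=asyncio.FIRST_COMPLETED,
    )
    # Whatever is still pending is no longer interesting to this caller.
    for task in pending:
        task.cancel()
    if work in done:
        return ("completed", work.result())
    if cancel_wait in done:
        return ("cancelled", None)  # an explicit decision that we waited too long
    return ("timed out", None)      # the default decision, made by the deadline


async def demo() -> list:
    cancel = asyncio.Event()
    outcomes = [
        # Reply arrives well before the deadline.
        await request_with_cancel(asyncio.sleep(0.01, result="ok"), cancel, 1.0),
        # No reply: the timeout makes the decision for us.
        await request_with_cancel(asyncio.sleep(10), cancel, 0.05),
    ]
    cancel.set()  # stands in for the user pressing a Cancel button
    outcomes.append(await request_with_cancel(asyncio.sleep(10), cancel, 1.0))
    return outcomes
```

Running asyncio.run(demo()) exercises all three outcomes. Note that the "cancelled" and "timed out" branches do the same local work; the only difference is who made the decision.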

People sometimes get “cancel” confused with “rollback”. Rollback is virtually never the right way to think about cancel. Once you have made some request, that request may have gotten lost before it got to the server, it might have failed at the server, or the response might have failed to reach the client (I’m using “client” and “server” as equivalent to “requester” and “requestee”; they might just be separate processes on a single machine or separate asynchronous components within a single process). Trying to roll back will just worsen any problems the underlying system already has. Rollback needs to be an end-to-end system-level feature, not something you try to provide at an API-by-API level.

In practice, the best way to think about cancel is as “reclaim any local resources associated with my request because I am no longer interested in the result”. An API implementation may attempt to propagate the cancellation intent in order to optimize reclaiming of resources in other parts of the system, but that is really just a performance optimization.
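One way to picture this definition is the sketch below. It is only an illustration under my own assumptions: PendingRequest, its buffer, and the notify_remote callback are hypothetical stand-ins for whatever local bookkeeping a real API keeps per request.

```python
from typing import Callable, Optional


class PendingRequest:
    """Local bookkeeping for one outstanding request (hypothetical helper)."""

    def __init__(
        self,
        on_result: Callable[[str], None],
        notify_remote: Optional[Callable[[], None]] = None,
    ) -> None:
        self._on_result: Optional[Callable[[str], None]] = on_result
        # Stands in for local resources tied to the request.
        self._buffer: Optional[bytearray] = bytearray(64 * 1024)
        self._notify_remote = notify_remote
        self.cancelled = False

    def cancel(self) -> None:
        """Reclaim local resources; says nothing about what happened remotely."""
        if self.cancelled:
            return
        self.cancelled = True
        self._on_result = None  # no longer interested in the result
        self._buffer = None     # release local memory
        if self._notify_remote is not None:
            try:
                self._notify_remote()  # best effort, purely an optimization
            except Exception:
                pass  # failing to propagate does not make the cancel fail

    def deliver(self, result: str) -> bool:
        """Called when a reply arrives; returns False if the reply was dropped."""
        if self._on_result is None:
            return False
        self._on_result(result)
        return True
```

In this sketch cancel() always succeeds locally, even when the remote notification throws, and a reply that arrives afterwards is simply dropped.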

The challenge, of course, is that at this point the originating caller does not know the remote state of the system they are interacting with, because they do not know what actually happened to their request. The cancellation is all about reclaiming local resources, but the action they requested may or may not have happened. Really this is no different from the normal situation when trying to model your local understanding of remote state: your understanding may be wrong for any number of reasons, so you need some other mechanism to reconcile the two models. In the aftermath of cancel, the application just lets that reconciliation mechanism play out.

This is a good example of the end-to-end argument. You already need some overall end-to-end mechanism for reconciling the local and remote state, so there is no need to spend effort trying to maintain a consistent view at every API call. In fact, trying to do so is a poor design, especially because it is impossible!
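To make the reconciliation idea concrete, here is a small sketch under assumptions of my own: LocalView is a hypothetical client-side model, and fetch_remote is a hypothetical callable that returns the authoritative remote state. A real system would likely reconcile incrementally rather than by refetching everything.

```python
from typing import Callable, Dict, Set


class LocalView:
    """A client's local model of some remote state (hypothetical helper)."""

    def __init__(self, fetch_remote: Callable[[], Dict[str, str]]) -> None:
        self._fetch_remote = fetch_remote
        self.state: Dict[str, str] = {}
        self.uncertain: Set[str] = set()

    def note_cancelled(self, key: str) -> None:
        # After a cancel we do not know whether the write to `key` happened.
        # Nothing is rolled back; the uncertainty is just recorded.
        self.uncertain.add(key)

    def reconcile(self) -> None:
        # The one end-to-end mechanism: replace the local view with whatever
        # the remote side actually holds, however the views came to differ.
        self.state = dict(self._fetch_remote())
        self.uncertain.clear()
```

For example, if a rename was cancelled but had in fact already been applied remotely, the next reconcile() picks up the new title; if it never arrived, reconcile() leaves the old one. Neither case needs special handling at the call site that cancelled.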

When lower levels lack cancellation, an application that is trying to give control back to the user needs to simulate it, typically by leaving any outstanding request in a “zombie” state where the request is left outstanding but any return value is ignored. This introduces non-trivial complexity in the client, beyond the fact that the zombie request continues to use up local resources (thread stacks, etc.) and may prevent other operations from being executed because of constraints or throttling on local resources.
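The zombie pattern can be sketched in a few lines of Python with a thread pool. This is an illustration rather than a recommendation, and call_or_abandon is a hypothetical name. It assumes the blocking call offers no way to be interrupted, which is exactly the situation described above.

```python
import concurrent.futures as cf
from typing import Any, Callable, Tuple


def call_or_abandon(
    pool: cf.Executor,
    blocking_call: Callable[[], Any],
    timeout_s: float,
) -> Tuple[bool, Any]:
    """Run `blocking_call` on `pool`; stop waiting for it after `timeout_s` seconds."""
    future = pool.submit(blocking_call)
    try:
        return (True, future.result(timeout=timeout_s))
    except cf.TimeoutError:
        # The call cannot be stopped from here. It becomes a zombie: it keeps
        # its worker thread busy and its eventual result is never read.
        return (False, None)
```

With a pool created as cf.ThreadPoolExecutor(max_workers=1), a single abandoned call that never returns is enough to make every later call_or_abandon on that pool time out as well, because the only worker is still occupied by the zombie. That is the throttling effect in miniature.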

From the user interface perspective, one of the key challenges is actually determining what the user intends to cancel. Word could have on the order of 10 requests outstanding when opening a document, some of which needed to complete before the document could be shown to the user. For example, a linked template could contain an “OnOpen” macro that needed to run before the document was displayed. In that circumstance, “cancel” could cancel this request to open the template (which might be on an unreachable, and slow to time out, file server), allowing the document, which was already local, to be displayed to the user. So, counterintuitively, “cancel” would cause the document to display. In reality, this was a poor feature design. The original feature was designed in a world where the template was almost always local and typically stored with the document. The application should be aware that any remote request might fail or take arbitrarily long to complete, and it should leave the user aware of and in control of the application state. An overall system that recognizes the need for cancellation also recognizes that any of these requests and responses can take an unbounded amount of time to complete. Typically this should deeply impact the design of the user experience.
