Expecting the unexpected

A two-phase-commit approach to gain traceability of a distributed system, with Elixir

Authors: Ed Ellson, Dorian Iacobescu, Tobias Kräntzer & Qixxit Team

Here at Qixxit, we combine long-distance bus, train and flight options into a single route, allowing customers to book tickets through a single platform.

Checking out an order through our platform requires communication with a lot of 3rd parties. Products have to be booked with the train operator, money has to be collected from the customer, etc. Each of these actions can fail for a variety of reasons.

Because the checkout of an order is a critical process where both capturing the money and booking the product must succeed, it is important to have full traceability to recover from potential failures.

To gain full traceability, actions are first prepared, then persisted, and finally executed. With this, applying an action requires the manipulation of two sources (a local database and the 3rd party system) which cannot be committed in a single atomic transaction (e.g. a database transaction). These types of problems are commonly known as distributed transactions and are well covered by numerous publications. An interesting read can be found in Distributed Transactions: The Icebergs of Microservices.

This problem emerged in many parts of our system, but we could not find an established pattern for addressing it in Elixir. With this in mind, we tried to distill a more general approach.

In this article we’ll cover:

  • How traceability might falter, and why this matters;
  • A strategy to regain traceability and recoverability;
  • An Elixir-based implementation of this strategy;
  • Some conclusions, and further reading.

How traceability might falter

Consider a case where your application needs to capture money from a customer, via a 3rd party payment provider, persisting some record of the result.

Assuming we receive and process a response here, we’re good. We know whether the capture was successful, have a record of it, and can act accordingly. But what happens if we either don’t receive a response or fail to process the response correctly? Any of the following could have occurred, but we have no record of it:

  • The request never reached the provider, payment was not captured;
  • The request reached the provider, and payment was successfully captured;
  • The request reached the provider, but payment was not successfully captured.

Recovery from this situation is very difficult: we have no idea whether we have taken money from the customer, so there is no way to decide how to proceed. We need to come up with a strategy for recovering from incidents like this.

Regaining traceability

Before settling on a strategy, it might be useful to look at the problem from a slightly different angle: What issues might occur such that we end up in this unrecoverable state, where we have no record of a capture attempt? We’ve already identified what might have occurred in terms of the capture itself, but what might have been the cause of this situation?

  • The request never reached the provider;
  • The request reached the provider, but was not successfully processed;
  • The request was processed by the provider, but our application did not receive a response;
  • Our application received some response, however, it crashed or restarted before it could be processed.

Under any of these scenarios, we need to be able to gain certainty about what happened. We need to be able to identify which of the potential causes actually occurred, so that we can decide what to do next; whether that means dispatching an order, allowing a user to reattempt payment, or perhaps nothing at all.

To help us decide, along with knowledge of whether a capture succeeded or failed based on the response, we need to know that a capture was attempted. We can gain this knowledge by persisting some record of the request before it is made:

  1. Persist representation of capture request;
  2. Attempt capture with 3rd party;
  3. Persist representation of capture response.

With this change, we have a better picture of the state of the system. We attempt a capture if and only if the initial persistence succeeds. If we have a persisted request, but no related response, we have to assume we have attempted a request, but something has gone wrong. Where previously we only knew about the capture attempts which we processed a response for, we now also know about those which we attempted, but which may or may not have succeeded. We now know more about what we do not know, and with this knowledge, we can act appropriately.

This pattern, whereby we perform some preparation (in our example, persisting a representation of our request to the database) before committing an action (attempting capture, then persisting the response), resembles the two-phase commit algorithm. It’s beyond the scope of this article to get into the depths of it, but it is worth a look if you are interested.

Recovery

Now that we have this knowledge about captures which may or may not have succeeded, we can think about how to handle recovery. We need a mechanism for deciding whether a capture has succeeded or not.

This mechanism will depend on the functionality offered by the provider, but for our example with capturing payment there are 2 common solutions: an idempotent API, or a status endpoint. Each requires that our requests are made with some identifier.

  • Idempotent API: Making the same request to the 3rd party will always produce the same result. If the action has in fact already been processed by the provider, the new response indicates this along with some indication of the initial result. If the action has not been processed, it is handled like the initial request and the result is returned*.
  • Status endpoint: The 3rd party provides an endpoint to ask for the status of an action. If it has not been applied the request can be made again.

To be able to use one of these approaches, we simply need to persist the identifier as part of the initial persistence action, before calling the provider. With this, we can either attempt the same call again, with the same ID, or query for the result of a request made with it.
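As a rough illustration of the status-endpoint variant, the sketch below decides what to do with a checkout that still carries a persisted request. Everything in it is an assumption made for this example: the `:pending_capture` key, the shape of the checkout map, and `fetch_status`, a hypothetical function wrapping the provider's status endpoint that returns `:captured`, `:failed` or `:not_found` for a given request ID.

```elixir
defmodule CaptureRecovery do
  # A persisted request without a response: ask the provider what happened.
  # `fetch_status` is a hypothetical wrapper around a provider status endpoint.
  def recover(%{pending_capture: %{request_id: id, amount: amount}} = checkout, fetch_status) do
    case fetch_status.(id) do
      :captured ->
        # The capture went through; record the result we never processed.
        updated =
          checkout
          |> Map.delete(:pending_capture)
          |> Map.put(:captured_amount, amount)

        {:captured, updated}

      :failed ->
        # The provider processed the request and rejected it.
        {:failed, Map.delete(checkout, :pending_capture)}

      :not_found ->
        # The request never reached the provider; it can be made again
        # with the same request ID.
        {:retry, checkout}
    end
  end

  # No persisted request, so there is nothing to recover.
  def recover(checkout, _fetch_status), do: {:nothing_to_recover, checkout}
end
```

With an idempotent API, the `:not_found` branch and the status lookup collapse into one step: the persisted request is simply sent again with the same identifier.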

Example implementation

The following example shows a simplified implementation of a capture function that stores data to allow recovery from failure. To achieve this we create and store a unique request_id in a database, together with the request parameters of the checkout.

This request is then executed. Once the response is returned, either the checkout is updated with the captured amount, in case of success, or an error is returned in case of failure.

As mentioned above, recovery in the case that we do not receive a response is domain specific, so we’ve left the implementation out of this example.
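A minimal sketch of such a capture function is given below. It is not our production code: the database is replaced by an in-memory `Agent`, the payment provider by a function passed in as `provider`, and the names `CaptureExample`, `:pending_capture` and `:captured_amount` are made up for this illustration.

```elixir
defmodule CaptureExample do
  # Hypothetical stand-in for a database: an Agent holding a map of
  # checkout_id => checkout.
  def start_store(checkouts), do: Agent.start_link(fn -> checkouts end)

  # `provider` is a function that takes the request and returns
  # {:ok, captured_amount} or {:error, reason}.
  def capture(store, checkout_id, amount, provider) do
    # 1. Prepare the request with a unique request_id.
    request = %{request_id: new_request_id(), amount: amount}

    # 2. Persist the request before talking to the provider. The match on
    #    :ok means we only continue if persisting succeeded.
    :ok = Agent.update(store, &put_in(&1, [checkout_id, :pending_capture], request))

    # 3. Execute the request and 4. persist the result.
    case provider.(request) do
      {:ok, captured_amount} ->
        :ok =
          Agent.update(store, fn checkouts ->
            Map.update!(checkouts, checkout_id, fn checkout ->
              checkout
              |> Map.put(:captured_amount, captured_amount)
              |> Map.delete(:pending_capture)
            end)
          end)

        {:ok, captured_amount}

      {:error, reason} ->
        # The persisted request is left in place here; a real implementation
        # would likely record the failure as well.
        {:error, reason}
    end
  end

  defp new_request_id do
    Base.encode16(:crypto.strong_rand_bytes(16), case: :lower)
  end
end
```

If the process crashes between steps 2 and 4, the checkout still carries `:pending_capture`, which is exactly the record a recovery process needs.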

Generalising the implementation

While this implementation should be behaving as we want it to, it could still be improved. For instance: the module knows about both performing and committing an action; it would not handle concurrent requests on the same checkout correctly; and it feels likely we could split the general two-phase-commit logic from our domain-specific stuff. We’ll explore this next.

Separation of concerns

Our example above shows a simplified pseudo-implementation of capturing payment through a 3rd party provider. It does this by running the following steps:

  1. Preparing the request with the required parameters and a request_id
  2. Persisting the prepared request in a database (as part of the checkout data)
  3. Sending the prepared request to the payment provider
  4. Handling the response
  5. Updating the checkout on success

These steps can be separated by domain (business logic vs. persistency) and by stage (preparation vs. execution/committing). Using these terms the execution of an action can be performed with the following steps:

  1. Preparation / Business Logic
  2. Preparation / Persistency
  3. Committing / Business Logic
  4. Committing / Persistency

Having this separation in mind, the example implementation can be refactored into two components. One component is responsible for the business logic and does not know anything about persistence. The other component is responsible for the persistence and does not know anything about the business logic.

Capture Action

This component will handle the business logic. In its prepare step it simply returns what will be necessary for the commit. In commit, we will actually send a request to the provider, and handle its response.
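A sketch of what such an action module could look like follows. `PaymentProvider` is a hypothetical stand-in for a real provider client and always succeeds; the module name `CaptureAction`, the argument shapes and the error atoms are likewise assumptions for this example.

```elixir
defmodule PaymentProvider do
  # Hypothetical stand-in for a real HTTP client; always captures the
  # requested amount.
  def capture(%{request_id: _id, amount: amount}), do: {:ok, amount}
end

defmodule CaptureAction do
  # Preparation / business logic: build everything commit will need.
  # Nothing is sent to the provider yet.
  def prepare(_checkout, %{amount: amount}) when is_integer(amount) and amount > 0 do
    {:ok, %{request_id: new_request_id(), amount: amount}}
  end

  def prepare(_checkout, _args), do: {:error, :invalid_amount}

  # Committing / business logic: call the provider and derive the new
  # checkout state from its response.
  def commit(checkout, transaction) do
    case PaymentProvider.capture(transaction) do
      {:ok, captured} -> {:ok, Map.put(checkout, :captured_amount, captured)}
      {:error, reason} -> {:error, reason}
    end
  end

  defp new_request_id do
    Base.encode16(:crypto.strong_rand_bytes(16), case: :lower)
  end
end
```

Note that the module never touches a database: it receives the checkout state and returns either a transaction or a new state.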

Checkout store

This component is responsible for persistence. It takes a transaction in its prepare step (which in this case is the capture request) and associates it with the checkout. Then in commit, we update the checkout again, persisting some representation of the result.
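The persistence side could be sketched as follows, again with an `Agent` standing in for the database. The module name `CheckoutStore`, the `:pending_transaction` key and the `{:error, :not_found}` result are assumptions made for this illustration.

```elixir
defmodule CheckoutStore do
  # In-memory stand-in for a database table: an Agent holding id => checkout.
  def start_link(checkouts) do
    Agent.start_link(fn -> checkouts end, name: __MODULE__)
  end

  def get(id), do: Agent.get(__MODULE__, &Map.fetch(&1, id))

  # Preparation / persistency: attach the opaque transaction to the checkout.
  def prepare(id, transaction) do
    update(id, &Map.put(&1, :pending_transaction, transaction))
  end

  # Committing / persistency: persist the result and clear the pending
  # transaction.
  def commit(id, result) do
    update(id, fn checkout ->
      checkout
      |> Map.merge(result)
      |> Map.delete(:pending_transaction)
    end)
  end

  defp update(id, fun) do
    Agent.get_and_update(__MODULE__, fn checkouts ->
      case Map.fetch(checkouts, id) do
        # The tuple is {reply, new_state}: reply :ok, state with the update.
        {:ok, checkout} -> {:ok, Map.put(checkouts, id, fun.(checkout))}
        :error -> {{:error, :not_found}, checkouts}
      end
    end)
  end
end
```

The store never inspects the transaction it is given, which keeps it independent of the capture logic.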

Applying the action

With these new components, in order to run the checkout we just need to call our functions in the sequence outlined above:
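A self-contained sketch of that sequence is shown here. To keep it runnable on its own it ships with two tiny hypothetical components, `DemoStore` and `DemoAction`, which only mimic the store and action described above; `CheckoutRunner` is an assumed name as well.

```elixir
defmodule DemoStore do
  # Hypothetical in-memory store backed by an Agent (stand-in for a database).
  def start_link(checkouts), do: Agent.start_link(fn -> checkouts end, name: __MODULE__)
  def get(id), do: Agent.get(__MODULE__, &Map.fetch(&1, id))
  def prepare(id, txn), do: Agent.update(__MODULE__, &put_in(&1, [id, :pending], txn))
  def commit(id, checkout), do: Agent.update(__MODULE__, &Map.put(&1, id, checkout))
end

defmodule DemoAction do
  # Hypothetical action; commit pretends the provider captured the full amount.
  def prepare(_checkout, %{amount: amount}), do: {:ok, %{request_id: "req-1", amount: amount}}
  def commit(checkout, %{amount: amount}), do: {:ok, Map.put(checkout, :captured_amount, amount)}
end

defmodule CheckoutRunner do
  # Runs the four steps in order and stops at the first one that fails.
  def run(action, store, id, args) do
    with {:ok, checkout} <- store.get(id),
         # 1. Preparation / business logic
         {:ok, txn} <- action.prepare(checkout, args),
         # 2. Preparation / persistency
         :ok <- store.prepare(id, txn),
         # 3. Committing / business logic
         {:ok, updated} <- action.commit(checkout, txn),
         # 4. Committing / persistency
         :ok <- store.commit(id, updated) do
      {:ok, updated}
    end
  end
end
```

Because `with` returns the first non-matching value, a failure in any step is handed back to the caller unchanged, and whatever was persisted before that step remains available for recovery.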

Avoiding conflicts

A key feature of Elixir is the ability to run many processes in parallel, either on one node or across a cluster of several nodes. Given this, we might want to better handle cases where more than one process needs to operate on the same entity.

We can solve this by updating our persistence layer, so it provides an interface which allows only one pending transaction, and where the committed state must be the result of that pending transaction. We’ll address this using state revisions and transaction references.

  • When preparing a transaction the caller must provide the revision of the state the transaction is based on. The preparation returns a transaction reference.
  • When committing a change the caller must provide the transaction reference of the prepared transaction. Committing the change provides a new revision of the state.

The state revision and the transaction reference guarantee that the state of an entity is always consistent and that changes can only be applied sequentially.

For example, it is not possible to create the transaction D based on revision 2 because there is already a newer revision of the state. Also, transaction E cannot be created because there can only be one pending transaction (here transaction C). The new revision of the state can only be added when it is based on the pending transaction (here revision x must be the result of committing transaction C).
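These rules can be sketched with a small, purely functional model of a single entity. This is an illustration only: the struct fields, the error atoms and the use of `make_ref/0` for transaction references are assumptions, and a real store would enforce the same checks inside a database transaction.

```elixir
defmodule RevisionedEntity do
  defstruct state: %{}, revision: 0, pending: nil

  def new(state), do: %__MODULE__{state: state}

  # Only one pending transaction is allowed at a time.
  def prepare(%__MODULE__{pending: pending}, _revision, _txn) when pending != nil,
    do: {:error, :transaction_pending}

  # The caller must base the transaction on the latest revision.
  def prepare(%__MODULE__{revision: current}, revision, _txn) when revision != current,
    do: {:error, :stale_revision}

  def prepare(%__MODULE__{} = entity, _revision, txn) do
    ref = make_ref()
    {:ok, ref, %{entity | pending: {ref, txn}}}
  end

  # The new state is accepted only if it comes with the reference of the
  # pending transaction (`ref` appears twice, so both values must be equal).
  def commit(%__MODULE__{pending: {ref, _txn}} = entity, ref, new_state) do
    revision = entity.revision + 1
    {:ok, revision, %{entity | state: new_state, revision: revision, pending: nil}}
  end

  def commit(%__MODULE__{}, _ref, _new_state), do: {:error, :unknown_transaction}
end
```

Preparing against an old revision, preparing while another transaction is pending, and committing with the wrong reference are all rejected, so changes can only be applied one after another.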

General pattern

With the changes above, we can now generalise our implementation to work with the execution of any action that uses the two-phase-commit pattern. Applying the action is then only a matter of executing 4 functions in sequence where first the transaction is prepared, and later committed.

The action first needs the state of the entity and some specific args (in our case a checkout and the payment data). Its prepare step results in a transaction which is opaque to the outside — the store does not know or care what this contains, it must simply take it and prepare to commit.

When committing the action it only needs the state of the entity and the prepared transaction. This results in an updated entity state. The following Elixir-behaviour shows which interface an action needs to implement:
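A sketch of such a behaviour, under the assumption that both callbacks return tagged tuples, might read as follows. The module name `TwoPhaseCommit.Action`, the type names and the toy `CounterAction` implementation are hypothetical and not taken from our library.

```elixir
defmodule TwoPhaseCommit.Action do
  @type state :: term()
  @type args :: term()
  @type transaction :: term()

  # Builds the transaction from the current state and action-specific args.
  @callback prepare(state, args) :: {:ok, transaction} | {:error, term()}

  # Executes the prepared transaction and returns the updated state.
  @callback commit(state, transaction) :: {:ok, state} | {:error, term()}
end

defmodule CounterAction do
  # Toy implementation: the "entity" is an integer, the action adds to it.
  @behaviour TwoPhaseCommit.Action

  @impl true
  def prepare(_count, %{by: n}) when is_integer(n), do: {:ok, %{by: n}}
  def prepare(_count, _args), do: {:error, :invalid_args}

  @impl true
  def commit(count, %{by: n}), do: {:ok, count + n}
end
```

Declaring `@behaviour` lets the compiler warn when an action module is missing one of the two callbacks.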

As mentioned above, the store does not care about what a transaction contains, it is responsible only for persisting it, and making sure that the caller has the latest revision of the state. If the caller does not have the latest revision the preparation fails, guaranteeing that changes to an entity are applied sequentially. When committing the updated entity state, the store must verify the transaction reference to make sure that the new state is the result of the committed transaction.

For clarity we’ve omitted the implementation confirming that the caller has the latest revision (you can find an example in the links below), but the following Elixir-behaviour shows the interface a store needs to implement:
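One way to write down such an interface is sketched here. The module name `TwoPhaseCommit.Store`, the `get/1` callback, the type names and the return shapes are assumptions for this illustration rather than the definitions from our library.

```elixir
defmodule TwoPhaseCommit.Store do
  @type id :: term()
  @type state :: term()
  @type revision :: term()
  @type transaction :: term()
  @type transaction_reference :: term()

  # Returns the current state of an entity together with its revision.
  @callback get(id) :: {:ok, state, revision} | {:error, term()}

  # Persists the opaque transaction; fails if `revision` is not the latest
  # one or if another transaction is already pending.
  @callback prepare(id, revision, transaction) ::
              {:ok, transaction_reference} | {:error, term()}

  # Persists the new state; fails if the reference does not belong to the
  # pending transaction. Returns the new revision.
  @callback commit(id, transaction_reference, state) ::
              {:ok, revision} | {:error, term()}
end
```

The transaction type is deliberately left as `term()`, which reflects that the store treats it as opaque.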

Conclusion

Over the course of this article we’ve identified a potential source of error where we need to rely on 3rd party providers, we’ve considered how we might handle this, then walked through an example implementation in Elixir. We believe that our application will be more reliable with this change in place.

While we’ve focussed on the capture of payment here, this sort of strategy could be reused for similar purposes. For instance:

  • Handling refunds;
  • Handling the booking of tickets with a 3rd party;
  • Tracking sending of emails;
  • Handling a sequence of events all of which rely on a 3rd party response to decide how to continue.

As we think this pattern is more generally applicable, and not tied to the example described above, we’ve defined the interfaces of the store and action components in a library, along with a small example application:

We’d love to hear your thoughts on how this might be done differently or improved.

PS: We’re hiring! If you’d like to work with us on these kinds of problems, take a look here: https://www.qixxit.com/de/careers/


*It is possible that with repeated attempts to an idempotent API endpoint we still do not receive a response, or fail to process it. In this case, most likely we require some manual resolution process.