Money Transfer Saga, Part 5 — Results

- Part 1 - The Scenario
- Part 2 - The Implementation
- Part 3 - The Audit Log
- Part 4 - Supervision, error kernels and idempotency
- Part 5 - Results

We’ve now finished implementing the Money Transfer Saga, but we need a way to run it, and a way to generate failures. To run the saga, we’re going to use a simple console app:
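A minimal sketch of that console app is shown below. It assumes Proto.Actor's early C# API (Actor.FromProducer / Actor.Spawn) and a hypothetical Runner constructor that takes the simulation parameters; the values shown are simply the ones used in the first example further down:

```csharp
using System;
using Proto;

class Program
{
    static void Main(string[] args)
    {
        // failure-simulation parameters used by the Account actors
        var uptime = 99.99;             // probability (%) that an Account call does not fail outright
        var refusalProbability = 0.01;  // probability (%) that an Account refuses a credit/debit
        var busyProbability = 0.05;     // probability (%) that an Account returns ServiceUnavailable

        var retryAttempts = 3;          // how many times each operation is retried
        var numberOfTransfers = 1000;   // how many sagas to run

        // the Runner spawns the Account and TransferProcess actors and reports the results
        var props = Actor.FromProducer(() => new Runner(
            numberOfTransfers, uptime, refusalProbability, busyProbability, retryAttempts));
        Actor.Spawn(props);

        Console.ReadLine();
    }
}
```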

Here we set up some variables for use by the Account actors to simulate the various failure scenarios:
* uptime determines the probability of Account failures
* refusalProbability determines the probability that an Account will refuse a credit or debit request
* busyProbability determines the probability that an Account will return a ServiceUnavailable response

We also specify how many sagas we are going to run and how many retry attempts will be made. We then create a Runner actor to run the sagas.

Runner

The Runner actor is responsible for running the sagas and for gathering and reporting the results. It implements a scatter-gather pattern: it spawns the TransferProcess actors, then reports once they have all completed.

Once the Runner is started, it loops through the number of iterations, each time creating two Account actors and a TransferProcess actor and adding the TransferProcess PID to a _transfers collection. The Runner supervises each TransferProcess actor and is responsible for restarting it should it crash. Inside the TransferProcess actor, a call to context.Parent.Tell() informs the Runner of the result. The Runner then waits to receive results back from the TransferProcess actors:
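The sketch below captures the shape of the Runner, assuming Proto.Actor's early, Tell-based C# API. The constructor signatures of Account and TransferProcess, the result message types (SuccessResult and friends) and the Pid they carry are stand-ins for whatever the earlier parts of the series define:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Proto;

public class Runner : IActor
{
    private readonly int _numberOfTransfers;
    private readonly double _uptime;
    private readonly double _refusalProbability;
    private readonly double _busyProbability;
    private readonly int _retryAttempts;

    private readonly HashSet<PID> _transfers = new HashSet<PID>();
    private int _successResults;
    private int _failedButConsistentResults;
    private int _failedAndInconsistentResults;
    private int _unknownResults;

    public Runner(int numberOfTransfers, double uptime, double refusalProbability,
        double busyProbability, int retryAttempts)
    {
        _numberOfTransfers = numberOfTransfers;
        _uptime = uptime;
        _refusalProbability = refusalProbability;
        _busyProbability = busyProbability;
        _retryAttempts = retryAttempts;
    }

    public Task ReceiveAsync(IContext context)
    {
        switch (context.Message)
        {
            case Started _:
                // scatter: spawn two Accounts and one TransferProcess saga per iteration;
                // as their parent, the Runner supervises (and restarts) the sagas
                for (var i = 0; i < _numberOfTransfers; i++)
                {
                    var fromAccount = context.Spawn(Actor.FromProducer(() =>
                        new Account(_uptime, _refusalProbability, _busyProbability)));
                    var toAccount = context.Spawn(Actor.FromProducer(() =>
                        new Account(_uptime, _refusalProbability, _busyProbability)));

                    // the transfer amount and argument order are illustrative
                    var transfer = context.Spawn(Actor.FromProducer(() =>
                        new TransferProcess(fromAccount, toAccount, 10, _retryAttempts)));
                    _transfers.Add(transfer);
                }
                break;

            // gather: each saga reports its outcome via context.Parent.Tell(result)
            case SuccessResult msg:
                _successResults++;
                CheckForCompletion(msg.Pid);
                break;
            case FailedButConsistentResult msg:
                _failedButConsistentResults++;
                CheckForCompletion(msg.Pid);
                break;
            case FailedAndInconsistentResult msg:
                _failedAndInconsistentResults++;
                CheckForCompletion(msg.Pid);
                break;
            case UnknownResult msg:
                _unknownResults++;
                CheckForCompletion(msg.Pid);
                break;
        }
        return Actor.Done;
    }

    // CheckForCompletion is sketched below
}
```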

For each result type, a counter is incremented, and a completion check is then performed to determine whether all sagas have finished. If so, the results are output:
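Continuing the Runner sketch above, the completion check and the final report could look something like this; the output format simply mirrors the result listings below:

```csharp
// part of the Runner class sketched above
private void CheckForCompletion(PID pid)
{
    _transfers.Remove(pid);

    if (_transfers.Count > 0)
    {
        Console.WriteLine($"{_transfers.Count} transfers remaining");
        return;
    }

    // every saga has reported back: print the summary
    Console.WriteLine($"RESULTS for {_uptime}% uptime, {_refusalProbability}% chance of refusal, " +
                      $"{_busyProbability}% chance of being busy and {_retryAttempts} retry attempts:");
    Console.WriteLine($"{Percent(_successResults)} ({_successResults}/{_numberOfTransfers}) successful transfers");
    Console.WriteLine($"{Percent(_failedButConsistentResults)} ({_failedButConsistentResults}/{_numberOfTransfers}) failures leaving a consistent system");
    Console.WriteLine($"{Percent(_failedAndInconsistentResults)} ({_failedAndInconsistentResults}/{_numberOfTransfers}) failures leaving an inconsistent system");
    Console.WriteLine($"{Percent(_unknownResults)} ({_unknownResults}/{_numberOfTransfers}) unknown results");
}

private string Percent(int count) => $"{count * 100.0 / _numberOfTransfers:0.#}%";
```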

Some examples

So how do things look? Given good enough uptime and sufficient retry attempts, very well indeed:

RESULTS for 99.99% uptime, 0.01% chance of refusal, 0.05% chance of being busy and 3 retry attempts:

- 100% (1000/1000) successful transfers
- 0% (0/1000) failures leaving a consistent system
- 0% (0/1000) failures leaving an inconsistent system
- 0% (0/1000) unknown results

Even if we lower the uptime and increase the probability of being busy, things still look good:

RESULTS for 99% uptime, 0.01% chance of refusal, 0.1% chance of being busy and 3 retry attempts:
- 100% (1000/1000) successful transfers
- 0% (0/1000) failures leaving a consistent system
- 0% (0/1000) failures leaving an inconsistent system
- 0% (0/1000) unknown results

We have to significantly drop the uptime to start seeing something different:

RESULTS for 90% uptime, 0.01% chance of refusal, 0.1% chance of being busy and 3 retry attempts:
- 99.9% (999/1000) successful transfers
- 0.2% (2/1000) failures leaving a consistent system
- 0% (0/1000) failures leaving an inconsistent system
- 0% (0/1000) unknown results

Reducing the number of retry attempts significantly affects our results:

RESULTS for 90% uptime, 0.01% chance of refusal, 0.1% chance of being busy and 1 retry attempt:
- 92% (920/1000) successful transfers
- 0% (0/1000) failures leaving a consistent system
- 0% (0/1000) failures leaving an inconsistent system
- 8% (80/1000) unknown results

Dramatically increasing the retry attempts allows us to cope with a very failure-prone system:

RESULTS for 50% uptime, 0.01% chance of refusal, 0.1% chance of being busy and 15 retry attempts:
- 100% (1000/1000) successful transfers
- 0% (0/1000) failures leaving a consistent system
- 0% (0/1000) failures leaving an inconsistent system
- 0% (0/1000) unknown results

Increasing the probability of refusal has a big impact, since refusals are not retried:

RESULTS for 50% uptime, 20.1% chance of refusal, 0.2% chance of being busy and 15 retry attempts:
- 68.9% (689/1000) successful transfers
- 29.2% (292/1000) failures leaving a consistent system
- 4.6% (46/1000) failures leaving an inconsistent system
- 0.1% (1/1000) unknown results

The biggest effect comes from not retrying at all, as we are then in danger of timing out on our requests. The Account actor contains a Thread.Sleep(_random.Next(0, 150)) call, whilst the AccountProxy expects a response back within 100 milliseconds, so roughly a third of individual calls will miss that deadline:

RESULTS for 99.99% uptime, 0.01% chance of refusal, 0.01% chance of being busy and 0 retry attempts:
- 48.8% (488/1000) successful transfers
- 0.1% (1/1000) failures leaving a consistent system
- 0% (0/1000) failures leaving an inconsistent system
- 51.1% (511/1000) unknown results
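A quick, self-contained back-of-the-envelope check (plain C#, independent of the actor code) shows why so many results end up unknown when nothing is retried:

```csharp
using System;

class TimeoutOdds
{
    static void Main()
    {
        var random = new Random();
        const int calls = 100_000;
        var timedOut = 0;

        for (var i = 0; i < calls; i++)
        {
            // the same artificial delay the Account actor introduces
            var delayMs = random.Next(0, 150);

            // the AccountProxy gives up after 100 milliseconds
            if (delayMs > 100) timedOut++;
        }

        // prints roughly 32-33%: with no retries, a saga that makes several
        // such calls frequently ends with an unknown result
        Console.WriteLine($"{timedOut * 100.0 / calls:F1}% of calls exceeded the 100 ms timeout");
    }
}
```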

Overall, the results show the importance of retrying our operations, and the need for idempotent receivers that make those retries safe. We can get very good results from very failure-prone systems if we simply retry our operations.

This is of course an artificial scenario. In the real world you'd want more subtle retry strategies that give remote services a chance to recover from high demand or from transient failures; exponential back-off strategies are more useful than immediate retries. The ability to resume a saga from a given point using the audit log is also very important: if a remote service is down for a considerable amount of time, you can still attempt the saga once it has recovered.
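As a rough illustration (not tied to the saga code), an exponential back-off retry helper in C# might look like this:

```csharp
using System;
using System.Threading.Tasks;

public static class Retry
{
    // Retries an async operation, doubling the delay after each failure:
    // 100 ms, 200 ms, 400 ms, ... for up to maxAttempts tries in total.
    public static async Task<T> WithExponentialBackoff<T>(
        Func<Task<T>> operation, int maxAttempts = 5, int baseDelayMs = 100)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // give the remote service progressively more room to recover
                await Task.Delay(baseDelayMs * (1 << (attempt - 1)));
            }
        }
    }
}
```

In practice a little random jitter is often added to each delay so that many failing callers don't all retry at the same moment.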
