How we implemented a state machine framework

Joan Zapata
Memo Bank
5 min read · Aug 30, 2022

As explained in most of our articles, building a banking system requires two properties above all: traceability and consistency. That’s especially true for the financial part, i.e. anything that makes money move.

In our core banking system, we mainly have three services involved:

  1. The core, for tracking accounts, balances, debits and credits.
  2. The payment-processor, for connecting us to other banks and actually sending and receiving transfers.
  3. The transaction-engine, which coordinates the two, among other things.

The core being the source of truth for the entire system, it requires extreme traceability and consistency, so we used an architecture called “event sourcing”, implemented on the Elixir/Erlang stack, described in a dedicated article.

For the other two, we were looking for something more lightweight, that would take into account the high number of interactions with other services, while still maintaining 100% traceability and consistency. To do that, we used a state machine model. For example, in the transaction-engine, the state machine of a transaction, including its interactions with other services, looks like this:
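The original state diagram is not reproduced here, but in Kotlin such a set of states can be declared as pure data. Below is a hypothetical, heavily simplified sketch; all state names are invented for illustration:

```kotlin
// Hypothetical states of an outgoing transfer; the real state machine
// (and its interactions with the core and the payment-processor) is
// richer than this sketch.
sealed class TransactionState {
    object Created : TransactionState()          // transaction registered
    object FundsReserved : TransactionState()    // core reserved the funds
    object SentToProcessor : TransactionState()  // handed to the payment-processor
    object Settled : TransactionState()          // confirmed by the other bank
    data class Failed(val reason: String) : TransactionState()
}
```

Modeling states as a sealed hierarchy lets the compiler check that every transition handles every case.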

With regard to our requirements, we paid attention to:

  • Having a high level of formalization, in Kotlin;
  • Resiliency to all kinds of crashes and errors over long periods — transactions can span weeks — by having states stored in a database;
  • Side effects being executed with strong consistency guarantees;
  • History data being complete, and easily browsable for traceability.

The formalization is mostly done through the State interface.

It forces developers to declare all states, inputs, and entities as pure data, and implement onInput as a pure function.

For the same parameters, onInput will always return the same output, and it does not have any side effects. As such, it has two interesting properties: it’s easily testable, and by storing states and inputs we have perfect traceability over what happened.
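As a rough sketch, and assuming invented names for everything except onInput, such an interface and a tiny example machine might look like this (the article’s actual code is not reproduced):

```kotlin
// Minimal sketch of the formalization; only `onInput` comes from the
// article, the rest is invented for illustration.
interface Input
interface State {
    // Pure function: same state + same input => same next state, no side effects.
    fun onInput(input: Input): State
}

// Hypothetical example: a state waiting for an exact amount of funds.
data class Amount(val cents: Long)
data class FundsReceived(val amount: Amount) : Input

data class AwaitingFunds(val expected: Amount) : State {
    override fun onInput(input: Input): State = when (input) {
        is FundsReceived -> if (input.amount == expected) Completed else this
        else -> this  // unknown inputs leave the state unchanged
    }
}

object Completed : State {
    override fun onInput(input: Input): State = this  // terminal state
}
```

Because onInput is pure, a unit test is just a function call and an equality check on plain data.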

To make these state machines long-lived — some transactions can last for months if we take into account all return scenarios — we store them in a database. The minimum requirement is to store each state change with a timestamp, associated with the Entity's reference. As you may have noticed, states are not simple values but data objects; the framework delegates serialization to the user code and stores the result as a blob. Then, when a new input comes in, it uses the reference carried by the Input to rebuild the last state of the associated Entity, and calls onInput on it.
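A sketch of that persistence contract, with an in-memory list standing in for the database table (all names are invented; serialization is delegated to user code, as described):

```kotlin
import java.time.Instant

// Repeated here for self-containment.
interface Input
interface State { fun onInput(input: Input): State }

// Serialization is delegated to user code; the framework stores blobs.
interface StateSerializer {
    fun serialize(state: State): ByteArray
    fun deserialize(blob: ByteArray): State
}

// One row per state change: entity reference + timestamp + serialized state.
data class StateRow(
    val entityReference: String,
    val timestamp: Instant,
    val blob: ByteArray
)

class StateStore(private val serializer: StateSerializer) {
    private val rows = mutableListOf<StateRow>()  // stands in for a DB table

    fun append(reference: String, state: State) {
        rows += StateRow(reference, Instant.now(), serializer.serialize(state))
    }

    // When an input arrives, rebuild the entity's last state and apply it.
    fun handle(reference: String, input: Input): State {
        val last = rows.last { it.entityReference == reference }
        val next = serializer.deserialize(last.blob).onInput(input)
        append(reference, next)
        return next
    }
}
```

Since every state change is appended rather than overwritten, the full history stays browsable.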

Even though storing states is enough for it to work, we also store all inputs for extra traceability.

Now, a big part of the job of these state machines is to interact with other services, by sending commands or events. It’s very important that these “side effects” are eventually consistent with transitions, i.e. that it must be impossible to send a command and then fail to commit the state change, or to commit the state change without eventually sending the command, because both scenarios would leave the overall system in an inconsistent state.

To do that, we first had to make side effects declarative. For a transition to have a side effect, for example sending a command to another service, we declare it as data on the transition.
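In other words, instead of calling another service inside onInput, the transition returns a description of what should happen. A sketch, with invented names apart from EmitCommand:

```kotlin
// Side effects are declared as data on the transition, not executed inline.
sealed class SideEffect
data class EmitCommand(val service: String, val command: String) : SideEffect()

// Hypothetical outcome of a transition: the next state plus declared effects.
data class Outcome<S>(val nextState: S, val sideEffects: List<SideEffect> = emptyList())

// Hypothetical transition: moving to "sending" declares a command to emit.
fun onPaymentApproved(): Outcome<String> =
    Outcome(
        nextState = "sending",
        sideEffects = listOf(EmitCommand("payment-processor", "ExecuteTransfer"))
    )
```

Because the effect is plain data, it can be stored alongside the state change and executed later.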

The side effect EmitCommand is then run separately, only once we are sure the transition succeeded — fully committed to the database — using the mechanism we already described in this article. This gives us a 100% consistency guarantee.
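The linked mechanism isn’t detailed here; the sketch below only shows the general shape of such a commit-then-execute scheme, in the spirit of a transactional outbox (names and structure are ours, not the framework’s):

```kotlin
data class EmitCommand(val service: String, val command: String)

class Outbox {
    // In the real system, the state change and its declared side effects are
    // persisted in the same DB transaction, so an effect is recorded if and
    // only if the state change commits.
    private val pending = ArrayDeque<EmitCommand>()
    val executed = mutableListOf<EmitCommand>()

    fun commitTransition(newState: String, effects: List<EmitCommand>) {
        // ... persist newState and effects atomically here ...
        pending.addAll(effects)
    }

    // Run separately, after the transaction committed; retried until success,
    // which makes delivery at-least-once rather than exactly-once.
    fun drain() {
        while (pending.isNotEmpty()) {
            val effect = pending.removeFirst()
            executed += effect  // stands in for actually sending the command
        }
    }
}
```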

And of course, this side-effect data is also stored with each state change, for full traceability.

One of the big advantages of all this formalization is the ability to build smart tooling: we build it once and benefit from it in all further developments. In our case, we added a user interface to our back-office that displays all state changes, inputs, and side effects across both services on the same timeline.

Here is an example for an outgoing transfer (simplified). The blue rounded badges are the side effects emitted on entering each state.

All entities, inputs, states and side effects are clickable and show their data. Even better, in batch_creation for example, the batch itself is a state machine, and the reference can be clicked to see the batch’s timeline in the same UI.

This is invaluable for the development team during maintenance and new developments, as well as for the customer support team, who can give clients precise answers.

We’ve been using and improving our state machine framework at scale for the past 3 years. The best advice we can give is to make sure all business decisions are made in the states; that’s essential to keeping the state history valuable. To do that, we follow three rules:

  1. Documentation first. State machines are easy to represent with state diagrams, which makes them a great tool for quick brainstorming and, later, for documentation. Before working on the code, we have a strict rule: first get a peer review on the diagram updates. The code review is usually a no-brainer after that.
  2. Low abstraction level on inputs. Don’t create inputs that already say what needs to be done. For example, when reacting to events, we first made the mistake of converting each incoming event into a dedicated input. Instead, we now have a single EventInput containing the raw event, and it’s up to the state machine itself to give it meaning, or to ignore it.
  3. Low abstraction level on side effects. Side effects should be dumb; they shouldn’t make business decisions. Even an if in a side effect executor is suspicious. Also, if a transition from A to B needs to do two things, make two side effects with explicit names; don’t try to abstract them into one, or you’ll end up calling it DoSideEffectsOfTransitionFromAtoB and have a meaningless state machine history.
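Rule 2 can be illustrated with a small sketch: a single EventInput wraps the raw event, and each state decides what the event means. Everything below except EventInput is invented for illustration:

```kotlin
// A single low-level input wrapping the raw event, instead of one
// dedicated input type per event.
data class RawEvent(val type: String, val payload: String)
data class EventInput(val event: RawEvent)

sealed class TransferState {
    abstract fun onInput(input: EventInput): TransferState

    object AwaitingConfirmation : TransferState() {
        // The state machine itself gives the event a meaning...
        override fun onInput(input: EventInput): TransferState =
            when (input.event.type) {
                "transfer.confirmed" -> Confirmed
                "transfer.rejected" -> Rejected
                else -> this  // ...or ignores it, rather than pre-filtering upstream
            }
    }
    object Confirmed : TransferState() {
        override fun onInput(input: EventInput): TransferState = this
    }
    object Rejected : TransferState() {
        override fun onInput(input: EventInput): TransferState = this
    }
}
```

This keeps the business decision (what the event means in this state) inside the state machine, where it is traced.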

Building our own state machine framework was not the fastest choice, but it has paid off a thousand times over. It does not have the exact same guarantees as event sourcing — not all transition data is guaranteed to live in immutable timestamped events — but it offers 100% consistency and a fair amount of traceability, while being faster to develop with. It allowed us to gather data effectively in a single view and track a financial transaction across the whole system, which proves very helpful for development, maintenance, and customer support.

If this is the kind of system you’d like to work on, we’re always looking for new talent to join the team, so get in touch.

Originally published at https://memo.bank/en/magazine.
