Choosing an architecture
Creating a bank from scratch is nothing like a Sunday stroll. Picture it more like a trek through a remote, unknown jungle. Like any long and demanding endeavor, it requires good preparation and carefully chosen tools that let us move fast and won't break halfway.
In software, the decisive tool isn't the database or the framework we use, and it isn't even the programming language: it's the architecture. Simply put, the architecture is how we organize our code so that it meets the technical and operational requirements of our business domain.
We defined at least five key requirements for our Core Banking System.
- Traceability: in case of an audit, we must be able to justify every piece of data in the system with the series of operations that led to it;
- Performance: our customers see bank accounts and operations, while our accountants see the underlying transactions in a very different way, yet both should benefit from a very fast interface;
- Availability: we have a responsibility to keep and improve people's faith in payment systems; declining a card payment because we couldn't answer an authorization request in time would be unacceptable;
- Consistency: we must be confident in an account balance before we charge it;
- Maintainability: we must be able to quickly diagnose and fix problems, because you don't mess with people's money. Our code should be fully tested, and the whole team should be able to work without stepping on each other's toes.
The choice we’re making now will be decisive for the pace at which we’ll be able to adapt to our customers, and provide them with the best experience.
Our business lies in our data. Money itself is only data, an account balance is a number somewhere in a database that lets you pay your rent, your bills and your food. But that’s not enough, we want to be fully transparent on our data, be able to justify each and every operation that happened on an account, leading to its current balance. Our architecture must allow that by design.
There’s an emerging pattern in software architecture called Event Sourcing (ES). The term first appeared in 2005 in an article by Martin Fowler, but it only started to gain public attention around 2011, pushed by Greg Young, whose 2014 conference talk is a reference on the topic. The mantra of ES is that the state of a system is given by the full sequence of events that led to that state. Those events can never be modified once emitted, and they are stored in an append-only storage called the event store.
This is very good for traceability: everything that ever happened to the system is in our database. But now even the simplest query becomes complex. Getting an account balance, for example, would force us to iterate through the event store to sum deposits and withdrawals. Not only is this non-trivial to implement, it also gets slower and slower as events pile up.
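To make that concrete, here is a minimal sketch of deriving a balance by folding over the stream. It is written in Python for illustration (our actual system is in Elixir), and the event shapes are hypothetical:

```python
# Hypothetical event stream for one account; in a real event store these
# records would be loaded from append-only storage.
events = [
    {"type": "deposited", "amount": 1000},
    {"type": "withdrawn", "amount": 250},
    {"type": "deposited", "amount": 80},
]

def balance(events):
    """Replay every event to rebuild the current balance: O(n) per query."""
    total = 0
    for event in events:
        if event["type"] == "deposited":
            total += event["amount"]
        elif event["type"] == "withdrawn":
            total -= event["amount"]
    return total
```

Every single balance lookup walks the whole history, which is exactly the cost that grows without bound.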
That’s the reason why ES is used in combination with another pattern called Command Query Responsibility Segregation, or CQRS. This pattern separates the reads and the writes of the system into two very distinct entities. In this configuration, ES offers a very natural boundary between them. The combined architecture is usually referred to as CQRS/ES.
The read side waits for events from the event store, and has its own database. After each event, it updates its database accordingly.
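As a sketch (again in Python, with an illustrative schema), a projection is just an event handler that keeps a denormalized record up to date:

```python
# Read-side projection: its "database" here is a plain dict keyed by
# account id. In production this would be a table, updated in a transaction.
bank_accounts = {}

def project(event):
    """Apply one event from the event store to the read model."""
    account = bank_accounts.setdefault(event["account_id"], {"balance": 0})
    if event["type"] == "deposited":
        account["balance"] += event["amount"]
    elif event["type"] == "withdrawn":
        account["balance"] -= event["amount"]

for event in [
    {"account_id": "A1", "type": "deposited", "amount": 500},
    {"account_id": "A1", "type": "withdrawn", "amount": 120},
]:
    project(event)
```

Queries now read the precomputed record directly instead of replaying history.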
Such a denormalized read model, a BankAccount record for instance, is called a projection of the event stream. This approach has tremendous advantages when it comes to performance and maintainability.
- We can have as many projections as we want. For example, one for the client API containing users, accounts and operations, and another for the accountant API containing the general ledger accounts, accounting movements, assets and liabilities. Those projections are entirely decoupled, since the events are the only source of truth.
- We can easily add new projections or fix existing ones. All we need to do is replay all the events through the new projection, then switch over to the new database.
- We don’t have to use SQL everywhere: we can choose a different storage for each projection, depending on the kind of queries we’ll make. For example, if we need graph-related queries we can project the events into a Neo4j database. We could also project events into Elasticsearch for very fast search.
Another advantage of those projections is that they can live on their own, which brings me to another of our requirements.
Our system, like most systems, will need to handle far more reads than writes. Yet we need to keep response times short on every request, especially on the write side. If we take a little too long to respond to a card transaction authorization, the transaction is rejected, frustrating our customer.
That’s why the event store must be asynchronous. This way the write side can simply post events to it and move on, without waiting for all the projections to handle them. By moving projections to different machines, we can absorb huge traffic with zero impact on the write side. If the read side becomes overwhelmed, we can even replicate the same projection on multiple machines, all connected to the event store.
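A minimal sketch of that decoupling, using an in-process queue per projection as a stand-in for separate machines and real transport (names are ours, not a fixed API):

```python
import queue
import threading

# One inbox per projection; the write side only enqueues and moves on.
inboxes = [queue.Queue(), queue.Queue()]
results = []

def publish(event):
    """Write side: fire-and-forget append to every subscriber's inbox."""
    for inbox in inboxes:
        inbox.put(event)

def run_projection(name, inbox):
    """Read side: each projection drains its own inbox at its own pace."""
    while True:
        event = inbox.get()
        if event is None:  # shutdown sentinel
            break
        results.append((name, event["type"]))

threads = [
    threading.Thread(target=run_projection, args=(f"proj-{i}", inbox))
    for i, inbox in enumerate(inboxes)
]
for t in threads:
    t.start()

publish({"type": "deposited", "amount": 100})

for inbox in inboxes:
    inbox.put(None)
for t in threads:
    t.join()
```

The `publish` call returns immediately regardless of how slow any projection is, which is the whole point.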
While this is very good for performance, there’s now one problem.
We make our business decisions on the write side, based on a given command and the current state of the system. If that state lives in the projections, we're likely to perform a stale read, making a decision based on an outdated state.
To solve this problem, we divide the write side into small stateful components called aggregates. One of them would be “a bank account”. An aggregate handles commands and produces events based on its internal state. For a bank account, that state is the account balance.
The aggregate uses its state to make business decisions. For example, it can reject a withdrawal command due to insufficient funds.
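Sketched in Python (the command and event names are illustrative, not a fixed API), an aggregate pairs a pure decision function with a state-update function:

```python
class BankAccount:
    """Aggregate: holds its own state and decides which events to emit."""

    def __init__(self):
        self.balance = 0

    def handle(self, command):
        """Turn a command into events, or reject it, based on current state."""
        if command["type"] == "deposit":
            return [{"type": "deposited", "amount": command["amount"]}]
        if command["type"] == "withdraw":
            if command["amount"] > self.balance:
                raise ValueError("insufficient funds")
            return [{"type": "withdrawn", "amount": command["amount"]}]
        raise ValueError("unknown command")

    def apply(self, event):
        """Update internal state once an event has been emitted."""
        if event["type"] == "deposited":
            self.balance += event["amount"]
        elif event["type"] == "withdrawn":
            self.balance -= event["amount"]
```

Note the split: `handle` decides and never mutates, `apply` mutates and never decides, so the same `apply` can also replay history when the aggregate is reloaded.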
Since there’s only one instance of an aggregate per bank account, it ensures consistency. You’ll also note that we don’t need to rely on transactions and database locks.
Great news! The actor model of Elixir is a perfect fit for these aggregates. An actor, implemented in Elixir as a process (usually via a GenServer), has a state and receives messages in its mailbox. It guarantees by design that two messages can’t be processed at the same time. It’s also very cheap to start and stop, and hundreds of thousands of them can easily run on a single machine.
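Elixir gives this guarantee out of the box; as a rough Python approximation (our own sketch, not how the BEAM actually works), the same one-message-at-a-time property can be emulated with a single worker thread draining a mailbox queue:

```python
import queue
import threading

class Mailbox:
    """Serializes message processing: one thread drains the queue, so two
    messages can never be handled concurrently for the same aggregate."""

    def __init__(self, state):
        self.state = state
        self.queue = queue.Queue()
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def _loop(self):
        while True:
            message = self.queue.get()
            if message is None:  # shutdown sentinel
                break
            self.state += message  # handle exactly one message at a time

    def send(self, message):
        """Fire-and-forget: callers never touch the state directly."""
        self.queue.put(message)

    def stop(self):
        self.queue.put(None)
        self.thread.join()
```

Because all state changes funnel through one consumer, no locks around the state are needed, which mirrors the GenServer guarantee.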
Our final criterion was maintainability. The CQRS/ES architecture is harder to set up than traditional ones, but it improves maintainability over time.
Most parts of the system are pure functions, which means they are very straightforward to test. An aggregate, which is where most of our business logic resides, takes commands and produces events with no side effects. Most tests will inject some commands and simply check the produced events.
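Assuming a pure `decide(balance, command)` function shaped like our aggregates (the names here are illustrative), such a test is just data in, data out, with no database or mocks:

```python
def decide(balance, command):
    """Pure decision function: current state + command -> list of events."""
    if command["type"] == "withdraw":
        if command["amount"] > balance:
            return [{"type": "withdrawal_rejected",
                     "reason": "insufficient funds"}]
        return [{"type": "withdrawn", "amount": command["amount"]}]
    return []

# The test itself: inject a command, check the produced events.
events = decide(50, {"type": "withdraw", "amount": 80})
```

Comparing plain event values makes failures easy to read, and the tests stay fast because nothing touches I/O.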
Since most parts of the system are small and decoupled from each other, multiple developers can easily work on different parts without conflicts.
Projections are very tolerant to errors, since we can fix them retroactively. If we used the wrong rounding in the accounting projection, for example, we don’t need to migrate the data after the fix; we only need to replay the events.
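A sketch of that replay, with a hypothetical rounding bug: the stored events never change, and we simply rebuild the read model from scratch with the corrected rule.

```python
# Immutable event history: the source of truth, never migrated.
events = [{"type": "fee_charged", "tenths_of_cent": 15} for _ in range(3)]

def project_total_cents(events, to_cents):
    """Rebuild the fee-total projection from scratch with a conversion rule."""
    return sum(to_cents(e["tenths_of_cent"]) for e in events)

buggy = project_total_cents(events, to_cents=lambda t: t // 10)        # truncates
fixed = project_total_cents(events, to_cents=lambda t: (t + 5) // 10)  # rounds half up
```

Once the rebuilt projection is verified, we swap its database in and retire the buggy one; no event was touched in the process.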
CQRS/ES is not perfect though, so we carefully weighed its trade-offs.
It is a very opinionated pattern, and some tasks become more complicated than in traditional systems.
- Uniqueness checks are a good example. Since aggregates can’t communicate with each other, how do we check that a user email is not already taken?
It should be made really clear that CQRS/ES rarely fits a whole system, and should be used sparingly. Using it for user management and roles, for example, wouldn’t make much sense; that part better fits a more traditional CRUD system.
- How do we handle incoming requests that need to write then read? For example, a POST request to register a user should return the created user, but dispatching the command and then querying a projection wouldn’t work, since projections are updated asynchronously.
Those cases rarely happen in practice, and if they do, there are some workarounds.
- If events are immutable in the event store and must remain replayable, how do we deal with breaking changes in the events?
Short answer: we don’t make breaking changes. Migrating previous events is dangerous, a bit like traveling back in time and irremediably changing the course of events. Instead, we add a version to each event and handle the different versions in aggregates and projections. That means we must be very careful when designing those events in the first place. However, it isn’t more difficult than handling the versioning of a standard HTTP API, for instance.
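A sketch of that version handling on the read side (field names are illustrative): each event carries a version, and a small translation step upgrades old shapes at read time, leaving stored history untouched.

```python
def upcast(event):
    """Translate older event versions to the current shape when reading;
    the stored events themselves are never modified."""
    if event["version"] == 1:
        # v1 stored a single "name"; v2 split it into first/last.
        first, _, last = event["name"].partition(" ")
        return {
            "version": 2,
            "type": event["type"],
            "first_name": first,
            "last_name": last,
        }
    return event

v1_event = {"version": 1, "type": "user_registered", "name": "Ada Lovelace"}
current = upcast(v1_event)
```

Aggregates and projections then only ever see the latest shape, so the versioning logic stays in one place.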
- How do we deal with side effects? For example, how do we schedule jobs in the future, send a confirmation email to a user or interact with external systems?
There’s an additional concept coming from DDD/CQRS called a “saga” or a “process manager”. It reacts to domain events and produces side effects.
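Sketched in Python (the event shape and the email call are stand-ins for real infrastructure), a process manager is just another event subscriber whose job is the side effect:

```python
sent_emails = []

def send_email(to, subject):
    """Stand-in for a real email service client."""
    sent_emails.append((to, subject))

def welcome_process_manager(event):
    """Reacts to domain events and performs a side effect. Unlike a
    projection, its output leaves the system instead of updating a
    read model."""
    if event["type"] == "user_registered":
        send_email(event["email"], "Welcome!")

welcome_process_manager({"type": "user_registered", "email": "ada@example.com"})
```

Because it consumes the same event stream as projections, it needs no special hooks in the write side: emitting the domain event is enough to trigger the effect.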
Hopefully you now have a good overview of CQRS/ES and why it suits our needs as the foundation of our IT. At the price of some extra initial complexity, we believe it will help us scale in the right direction and achieve our goals. Being based on pure functions, and especially well suited to actors, this choice is consistent with our choice of Elixir as a language and platform.
If you happen to be in Paris (or are happy to relocate) with a crazy, irresistible urge to learn, write and ship code in a CQRS/ES architecture: let’s talk!