Using CQRS and eventual consistency to achieve high performance on complex systems
Contributed by Mario Bittencourt, Architect @MindGeek
Today, in the era of distributed, (micro) service-oriented systems, one common challenge is complexity or, more precisely, how to handle ever-increasing complexity and the pace of change required by the business.
One of the approaches we are leveraging is the use of a pattern called CQRS (Command Query Responsibility Segregation)[1].
Before going over the pattern in detail, let us present a situation that most can probably relate to. As you gather requirements for your system, you start modeling how you are going to persist the data it generates. If you still use a traditional RDBMS, this implies databases, tables, columns and their relationships. All is good, and you aim to model it in the highest normal form possible, as the data is the single source of truth of your system. If you are lucky, you can still uphold those standards as new requirements come in, and you release your system. Such a system can be represented by figure 1.
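To make this concrete, here is a minimal sketch of what such a normalized model could look like, using a hypothetical customers/orders domain (the tables, columns and the sample query are illustrative assumptions, not taken from a real system):

```python
import sqlite3

# Hypothetical, fully normalized write schema: every fact lives in exactly
# one place, and reads are expected to join their way to the answer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    email TEXT NOT NULL UNIQUE
);

CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    created_at  TEXT NOT NULL
);

CREATE TABLE order_lines (
    id               INTEGER PRIMARY KEY,
    order_id         INTEGER NOT NULL REFERENCES orders(id),
    sku              TEXT NOT NULL,
    quantity         INTEGER NOT NULL,
    unit_price_cents INTEGER NOT NULL
);
""")

# A typical read ("total spent per customer") has to join and aggregate
# across the normalized tables -- fine at first, costly as volume grows.
total_spent = conn.execute("""
    SELECT c.name, SUM(l.quantity * l.unit_price_cents) AS total_cents
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    JOIN order_lines l ON l.order_id = o.id
    GROUP BY c.id
""").fetchall()
```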
Then two things are likely to happen[2]:
a) You will notice that, all of a sudden, your now popular service is performing poorly.
b) The business will come with new requirements, be they new reports or new ways for users to retrieve their information from the system.
The usual solutions are just as predictable:
a) Add a caching layer to your application using Redis, Memcached or Elasticsearch (a minimal cache-aside sketch follows this list).
b) Start de-normalizing your database and/or add new indexes to satisfy the new access patterns.
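For solution (a), the typical shape is cache-aside: check the cache first, fall back to the database, and populate the cache on the way out. The sketch below assumes the redis-py client and a hypothetical load_report_from_db() helper; it is illustrative, not the article's actual code.

```python
import json

import redis  # assumes the redis-py client and a reachable Redis instance

cache = redis.Redis(host="localhost", port=6379)

def get_report(report_id: str) -> dict:
    """Cache-aside: serve from Redis when possible, fall back to the database."""
    cached = cache.get(f"report:{report_id}")
    if cached is not None:
        return json.loads(cached)

    report = load_report_from_db(report_id)  # hypothetical expensive query
    # Cache for 5 minutes; from this point on, readers may see stale data.
    cache.setex(f"report:{report_id}", 300, json.dumps(report))
    return report
```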
Been there, Done that™
While the aforementioned solutions are perfectly fine, as the systems become more complex we started to think there should be a better way, as these solutions seemed artificial, an afterthought, or even caused more problems[3]. This is where CQRS comes to the rescue.
CQRS brings front and center, and leverages, a characteristic that most of our systems have: the write access pattern and requirements differ from those of the read.
This may be different for you, but in some parts of our systems the frequency of writes is normally lower than that of reads. Despite their differences, we served both from the same model and persistence infrastructure, at least from a conceptual point of view.
With CQRS, you have one model dedicated to writes and another dedicated to reads. This gives you the freedom to optimize each model for its use case, instead of being tied to a single model and trying to make it perform optimally for both. See figure 2 for a standard CQRS system.
If you go back to our newly released system example, you would maintain the data integrity that you started with for the write model, and you would create a new read data store for the new use cases the business team asked for. Instead of having to figure out how to retrofit caching layers into your system, you already have a predefined path to follow from the beginning.
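A minimal sketch of what that split could look like follows: a command handler enforces the business rules against the write store and publishes an event, while a query service answers questions from a store that is already shaped for them. The class names, the write_store/read_store/event_bus collaborators and the order_placed event are illustrative assumptions, not part of the original design.

```python
from dataclasses import dataclass

# ---- Write side: a command plus a handler that enforces the business rules ----

@dataclass
class PlaceOrder:          # a command expresses the intent to change state
    order_id: str
    customer_id: str
    lines: list            # e.g. [("sku-1", 2), ("sku-2", 1)]

class OrderCommandHandler:
    def __init__(self, write_store, event_bus):
        self.write_store = write_store  # the normalized, consistent store (e.g. the RDBMS)
        self.event_bus = event_bus      # used to feed the read side asynchronously

    def handle(self, cmd: PlaceOrder) -> None:
        if not cmd.lines:
            raise ValueError("an order needs at least one line")
        self.write_store.save_order(cmd.order_id, cmd.customer_id, cmd.lines)
        self.event_bus.publish("order_placed", {
            "order_id": cmd.order_id,
            "customer_id": cmd.customer_id,
            "lines": cmd.lines,
        })

# ---- Read side: a query service over a store shaped for the questions asked ----

class OrderQueryService:
    def __init__(self, read_store):
        self.read_store = read_store  # denormalized view (e.g. one document per customer)

    def orders_for_customer(self, customer_id: str) -> list:
        # No joins or aggregation at read time: the view is already precomputed.
        return self.read_store.get(f"customer:{customer_id}", [])
```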
Where does eventual consistency play a part? Well, since updating the write store and the read data store will not happen in a single atomic operation, your system needs to embrace the fact that writes will only be propagated to the read data store after some time. At first this may sound unacceptable, but if you are using any form of caching, or even asynchronous replication, you are already living in an eventually consistent world.
A common element, not represented in figure 2, is the mechanism that propagates the writes. We usually leverage a messaging system such as RabbitMQ, or a distributed log such as Kafka, for this job. This gives you the added benefit of decoupling the write side even further from the read side, and it helps with scaling[4].
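Continuing the earlier sketch, the read side would be kept up to date by a projection: a consumer that turns write-side events into read-side views. The broker wiring (RabbitMQ queue, Kafka consumer group) is deliberately left out, and the read_store dictionary and handler name are assumptions for illustration.

```python
# A projection: the consumer that applies write-side events to the read store.
# In production this function would be invoked by a RabbitMQ consumer or a
# Kafka consumer group; the subscription mechanics are omitted to keep it short.

read_store = {}  # stand-in for the real read data store (Redis, Elasticsearch, ...)

def on_order_placed(event: dict) -> None:
    """Applied asynchronously, some time after the write was acknowledged."""
    key = f"customer:{event['customer_id']}"
    view = read_store.setdefault(key, [])
    view.append({
        "order_id": event["order_id"],
        "line_count": len(event["lines"]),
    })

# Until the event is consumed, queries simply see the previous state of
# read_store -- that window is the eventual consistency the system accepts.
```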
One added bonus, especially for any (micro) service system, is that you can more easily have two teams involved: one working on the more business-centric and complex write model, and another creating and managing the read models.
In conclusion, if you have a complex system and have to deal with a high volume of reads and writes, consider adopting the CQRS pattern. You may need to reconsider some concepts, but you will likely benefit from the changes.
[1] https://martinfowler.com/bliki/CQRS.html
[2] Assuming proper profiling did not reveal any obvious optimizations, such as missing indexes.
[3] Adding indexes increases the storage used and can slow down inserts. Try doing an ALTER TABLE on a table with 60M+ records.
[4] You can add more read nodes as needed and sustain some operations even if the write side is non-operational.