What they don’t tell you about event sourcing

Event sourcing and CQRS have gained a lot of popularity recently. The advantages are obvious, they share a very peculiar symbiosis with each other, and they fit well with the current state of the art, which makes them very relevant. However, after working with them in production for several years, I have found several caveats one should watch out for.

If you are not familiar with event sourcing, it comes down to modeling the application's state as a sequence of immutable events instead of saving only its latest state. Changes to the state are reflected by saving the event that triggered the change rather than mutating the current state. Replaying every event in the stream produces the latest state of that entity. You can find a detailed explanation by Martin Fowler here.
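To make the idea concrete, here is a minimal sketch in TypeScript (the bank account entity and its event names are purely illustrative): the current state is never stored directly, it is derived by replaying the stream.

```typescript
// Events are the source of truth; state is derived from them.
type AccountEvent =
  | { type: "AccountOpened"; owner: string }
  | { type: "MoneyDeposited"; amount: number }
  | { type: "MoneyWithdrawn"; amount: number };

interface AccountState {
  owner: string;
  balance: number;
}

// Instead of storing the latest state, we store what happened...
const stream: AccountEvent[] = [
  { type: "AccountOpened", owner: "Alice" },
  { type: "MoneyDeposited", amount: 100 },
  { type: "MoneyWithdrawn", amount: 30 },
];

// ...and derive the current state by replaying the events in order.
function replay(events: AccountEvent[]): AccountState {
  let state: AccountState = { owner: "", balance: 0 };
  for (const event of events) {
    if (event.type === "AccountOpened") state = { ...state, owner: event.owner };
    if (event.type === "MoneyDeposited") state = { ...state, balance: state.balance + event.amount };
    if (event.type === "MoneyWithdrawn") state = { ...state, balance: state.balance - event.amount };
  }
  return state;
}

console.log(replay(stream)); // { owner: "Alice", balance: 70 }
```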

At first glance it seems a terrible idea. Since each entity is represented by a stream of events, there is no way to reliably query the data, and I have yet to meet an application that doesn't require some sort of querying. This is especially evident in data intensive applications, where much of the business value relies on analyzing the data. That difficulty alone would make event sourcing unsuitable for most applications and relevant only for very isolated and specific use cases.

That’s where CQRS comes in. CQRS (Command Query Responsibility Segregation) describes the concept of having two different models, one to change information and one to read it, completely separated from each other. Basically, asking a question shouldn’t change the answer. Martin Fowler has a very interesting article about it here.
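In code, the split can be as simple as keeping two separate entry points, one that changes state and one that only answers questions. A toy sketch, with all names invented for illustration:

```typescript
type Product = { id: string; name: string; price: number };

// Write side: commands express a change and are the only way to mutate state.
class ProductCommandHandler {
  constructor(private store: Map<string, Product>) {}

  changePrice(id: string, newPrice: number): void {
    const product = this.store.get(id);
    if (!product) throw new Error(`unknown product ${id}`);
    this.store.set(id, { ...product, price: newPrice });
  }
}

// Read side: queries only answer questions; asking never changes the answer.
class ProductQueryService {
  constructor(private store: Map<string, Product>) {}

  cheaperThan(maxPrice: number): Product[] {
    return Array.from(this.store.values()).filter(p => p.price < maxPrice);
  }
}

// In this toy version both sides share one in-memory store; in a real CQRS
// system each side would typically use its own, independently scaled model.
const store = new Map<string, Product>([["p1", { id: "p1", name: "Mug", price: 12 }]]);
new ProductCommandHandler(store).changePrice("p1", 9);
console.log(new ProductQueryService(store).cheaperThan(10)); // [{ id: "p1", name: "Mug", price: 9 }]
```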

Greg Young identified quite well how CQRS and event sourcing share a symbiotic relationship. In fact, the limitation I mentioned earlier elegantly goes away when event sourcing is applied together with CQRS. Having the write model separated from the read model enables the most appropriate strategy for each, and allows the write and read models to scale independently. Event sourcing is a particularly efficient write model, since it works essentially as an append-only log where new information is always added, enabling minimal locking. Since each event is immutable and can never be removed, there are no updates or deletes, which gives good write performance. On the other hand, since the read model is completely independent, there is the freedom to choose the most adequate technology to optimize for queries, which can even be a completely different technology from the write side, for example a denormalized, non-relational data store built for search (winks at Elasticsearch). It really seems the best of both worlds. Or is it?
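A rough sketch of how the two sides fit together, with invented event names and in-memory structures standing in for the event store, the queue and the search-optimized read store:

```typescript
// Write model: an append-only event log. Read model: a denormalized view
// shaped for the queries we need to answer.
type OrderEvent =
  | { type: "OrderPlaced"; orderId: string; customer: string }
  | { type: "ItemAdded"; orderId: string; sku: string; price: number };

const eventLog: OrderEvent[] = [];

// Appending to a log needs no updates, deletes or heavy locking.
function append(event: OrderEvent): void {
  eventLog.push(event);
  project(event); // in practice this happens asynchronously via a queue
}

type OrderSummary = { orderId: string; customer: string; total: number };
const ordersByCustomer = new Map<string, OrderSummary[]>();

// The projector keeps the read model up to date from the event stream.
function project(event: OrderEvent): void {
  if (event.type === "OrderPlaced") {
    const summaries = ordersByCustomer.get(event.customer) ?? [];
    summaries.push({ orderId: event.orderId, customer: event.customer, total: 0 });
    ordersByCustomer.set(event.customer, summaries);
  } else {
    for (const summaries of Array.from(ordersByCustomer.values())) {
      const summary = summaries.find(s => s.orderId === event.orderId);
      if (summary) summary.total += event.price;
    }
  }
}

append({ type: "OrderPlaced", orderId: "o1", customer: "Alice" });
append({ type: "ItemAdded", orderId: "o1", sku: "mug-01", price: 12 });
console.log(ordersByCustomer.get("Alice")); // [{ orderId: "o1", customer: "Alice", total: 12 }]
```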

Event sourcing with CQRS is the kind of architecture whose sweet promise brings tears to my eyes, like witnessing a powerful and beautiful once-in-a-hundred-years meteor in the night sky. As long as it is applied to the right use case… Otherwise that meteor will come down and hit you in the face, and those tears will be of despair instead of happiness.

Eventual Consistency

Deep diving into CQRS with event sourcing, you will find that a queue separates the write and the read model. That means any change will become available for reads somewhere between a few milliseconds from now and the foreseeable future, which makes the functionality as a whole eventually consistent. Eventual consistency is the notion that a system will eventually converge on a value if no more updates are made to a given entity. While the system settles on that value it can return stale or inconsistent data, a period known as the inconsistency window.
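A toy illustration of that window, with the asynchronous projection reduced to a timer (all names invented):

```typescript
// The projection to the read model runs asynchronously, so a query issued
// right after a successful write can still return the old value.
let writeModelName = "old name";
let readModelName = "old name";

function handleRenameCommand(newName: string): void {
  writeModelName = newName;        // the event is persisted, the command succeeds
  setTimeout(() => {               // the projection catches up "eventually"
    readModelName = newName;
  }, 100);
}

handleRenameCommand("new name");
console.log(readModelName);                         // "old name": inside the inconsistency window
setTimeout(() => console.log(readModelName), 200);  // "new name": the system has converged
```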

By definition, that queue between the write and the read model can fill up: the system can face an unforeseen usage peak and take longer than expected to process it. You will tell yourself that this will hardly ever happen, that you have a fast component with strong infrastructure behind it. It will happen, though. It will happen in the middle of the sales season of your e-commerce platform, when the functionality is needed the most.

Eventual consistency became popular with the introduction of NoSQL databases and the challenges of distributed systems. Eric Brewer's CAP theorem illustrates how, in the face of network partitions, a system can be either available or consistent, but not both. Being eventually consistent allows a system to be scalable and stable, but at what cost?

Purists will say that consistency is a fairy tale, that in the highly distributed world of big data, to keep your system available you need to be eventually consistent. They will be ready to (mis)quote the CAP theorem as proof that consistency belongs with the countless shipwrecks of the past, drowned by the tsunami of big data.

This kind of mindset made it acceptable to sprinkle the magic dust of eventual consistency everywhere. How did we go from taking ACID for granted, from consistency being the very foundation of software and data storage, to saying “well, everything’s eventually consistent, deal with it”?

The theory says that in distributed systems everything is eventually consistent, but the pragmatic view of the real world says we need to be really careful about what we choose to make eventually consistent. Building a business-critical functionality around eventual consistency can have dire ramifications. There are use cases where availability is the property the system needs, but there are also use cases where consistency is, where it is better not to make a decision at all than to make it based on stale information. The sensibility to distinguish between these situations can be hard to master, and sometimes impossible, due to the transient nature of software development. This trade-off should always be questioned when choosing CQRS with event sourcing, which makes that decision always a risk.

Whole system fallacy

Maybe it’s the exotic complexity of CQRS, or the unquenchable thirst for knowledge that appeals to us developers, that leads us to try it on the most unsuitable use cases, but CQRS is not a top-level architectural pattern and should not be applied to a whole system.

A whole system with every component based on event sourcing will make the interactions between those components complex and hard to follow without digging down into each one. On the other hand, if every functionality affects the same event-sourced component, it will rapidly become an event-sourced monolith. Overall the pattern adds significant complexity, and it should always be weighed whether it is worth it. It typically shines the most when applied to carefully pinpointed parts of the system that benefit from it, a specific bounded context in DDD terms, but never to a whole system.

Task-based UIs

Task-based UIs focus their design on the user's intent. They are built so that the action the user needs to take is intuitive and is incorporated into the interaction the user has with the software, guiding the user through the process. This contrasts with a typical CRUD interface, where the user simply interacts with a given entity and is expected to understand how that entity is modeled.

Given that they are focused on the user's intent, they work quite well with DDD: it is seamless to create commands that translate that intent. However, if there is a strong requirement to follow a more traditional CRUD approach, the adaptation effort is rather cumbersome and the end result is anything but satisfying. Your events will end up as SomethingCreated or SomethingUpdated, which carry no business value at all; if your events are being designed like this, it is clear you are not really using DDD and you are better off without event sourcing (the sketch below contrasts the two styles). Finally, depending on how synchronous the UI and the task flow need to be, eventual consistency can, and most of the time will, give the interface a clunky feel and deliver a poor user experience.
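A small sketch of the contrast, with invented event names:

```typescript
// A generic CRUD-style event says nothing about why the data changed...
type CustomerUpdated = {
  type: "CustomerUpdated";
  customerId: string;
  // a bag of fields, some changed, some not
  name?: string;
  street?: string;
  email?: string;
};

// ...while task-based commands from the UI translate into events that capture
// intent and carry real business meaning.
type CustomerMoved = {
  type: "CustomerMoved";
  customerId: string;
  newStreet: string;
  newCity: string;
};

type CustomerEmailCorrected = {
  type: "CustomerEmailCorrected";
  customerId: string;
  newEmail: string;
};

// A downstream consumer can attach business meaning to intent-based events,
// something a generic CustomerUpdated could never support.
function onEvent(event: CustomerMoved | CustomerEmailCorrected): void {
  if (event.type === "CustomerMoved") {
    console.log(`Schedule a welcome-to-the-neighborhood voucher for ${event.customerId}`);
  }
}

onEvent({ type: "CustomerMoved", customerId: "c1", newStreet: "Main St 5", newCity: "Lisbon" });
```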

Event schema

Converting data between two different schemas while the system keeps running is a challenge when that system is expected to be always available. Due to the very nature of software development, new requirements are bound to appear that will affect the schema of your events; that is inevitable. A lot has been written about how to handle event schema changes, and there are several articles on the subject. One thing is certain: it is hard, it is complex, and there is no perfect way to do it.

There are techniques, similar to adapters, that convert events before returning them to the application, called upcasters. They can convert events between different versions, for example giving them more granularity. This however defeats part of the purpose of event sourcing: the stream of events is expected to show a history of what actually happened, yet the application is now publishing events it never stored. A related approach is to persist the new versions as they are produced, called lazy upcasting; now the stream reflects what is being published, but there are several different versions of the same event in the store, which is a nightmare to manage. It is also possible to migrate the schema of all events at once, as would be done with a SQL table, which can mean considerable downtime and a lot of complexity around the moment of the change, since all applications would have to switch at once.

In the end, events are immutable, so live with it. Having different versions of the events is the best way to handle schema changes, similar to a REST API where the application supports both the old and the new version for a given amount of time. The drawback is maintaining the code that handles all the different versions, but the consuming applications get time to adapt, and the stream stays intact, reflecting what actually happened, which is what event sourcing is supposed to do.
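A minimal upcaster sketch, with a hypothetical schema change (a single address string split into street and city); the names and versions are invented for illustration:

```typescript
// The store still holds version 1 events; the upcaster converts them to
// version 2 on the way out, so the rest of the application only has to
// understand the latest schema.
type AddressChangedV1 = { type: "AddressChanged"; version: 1; address: string };
type AddressChangedV2 = { type: "AddressChanged"; version: 2; street: string; city: string };

function upcast(event: AddressChangedV1 | AddressChangedV2): AddressChangedV2 {
  if (event.version === 2) return event;
  // Naive split for illustration; real migrations rarely map this cleanly.
  const [street, city = "unknown"] = event.address.split(",").map(s => s.trim());
  return { type: "AddressChanged", version: 2, street, city };
}

const storedEvent: AddressChangedV1 = { type: "AddressChanged", version: 1, address: "Main St 5, Lisbon" };
console.log(upcast(storedEvent)); // { type: "AddressChanged", version: 2, street: "Main St 5", city: "Lisbon" }
```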

Regardless of how schema changes are handled, managing them is one of the most complex and error-prone drawbacks of event sourcing. A strategy should be prepared upfront and considered in the system design.

Event granularity

One of the most important design decisions, and one of the hardest things to get right, is how granular the events should be. Too fine-grained and they won't carry enough information to be useful. Too coarse and they will have a high impact on performance due to serialization and deserialization, take up disk space and stress the message broker; most likely they won't mean anything and won't have any domain value either.

In theory, and I find it to be a good rule of thumb, your commands and events should reflect the intent of the user, staying true to DDD. They should be modeled using the ubiquitous language, and part of the domain value of the application will reside in those commands and events. However, by taking a more pragmatic approach and understanding the different needs of the consumers of your events, you can avoid some serious impacts on them. To illustrate with a simple example: if an AddressStreetChanged event is published, it clearly shows the user's intent of changing the street of the address, but how many of your listeners can use that information without, for example, the door number? To obtain it they have two options: either keep the state of the address internally or ask the service that owns the data for the missing information.

Both have dire consequences. With the first, you have to worry about disk space and the extra effort of building that internal state and keeping it synchronized; in a microservice architecture, several copies of the original system's data will appear everywhere, which is a nightmare to manage, especially if the schema changes. With the second, since the read model is eventually consistent, it is possible to retrieve information that is not yet up to date with the event, i.e. the consumer can receive the event faster than the originating system's read model is updated. In the previous example, if the consumer needed to validate the address, it could retrieve a stale address and fail the validation. Depending on the use case this can be unacceptable and trigger complex inconsistencies that are hard to trace.
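A sketch of that trade-off around the address example, with invented event shapes:

```typescript
// The fine-grained event faithfully captures intent but forces consumers that
// need the whole address to keep their own copy or query the (eventually
// consistent) owner of the data.
type AddressStreetChanged = {
  type: "AddressStreetChanged";
  customerId: string;
  newStreet: string;
};

// A slightly coarser event still expresses the intent but carries the full new
// address, so a consumer such as an address validator can act on the event alone.
type AddressStreetChangedWithSnapshot = {
  type: "AddressStreetChanged";
  customerId: string;
  newStreet: string;
  fullAddress: { street: string; doorNumber: string; city: string; postalCode: string };
};

function validateAddress(event: AddressStreetChangedWithSnapshot): boolean {
  const { street, doorNumber, postalCode } = event.fullAddress;
  return street.length > 0 && doorNumber.length > 0 && /^\d{4}-\d{3}$/.test(postalCode);
}

console.log(
  validateAddress({
    type: "AddressStreetChanged",
    customerId: "c1",
    newStreet: "Main St",
    fullAddress: { street: "Main St", doorNumber: "5", city: "Lisbon", postalCode: "1000-001" },
  })
); // true
```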

The events can’t be too small, nor too large; they have to be just right. Developing the instinct to get this right requires extensive knowledge of the system, the business and the consumer applications, and it is very easy to choose the wrong design.

Operation flexibility

Things are bound to go wrong no matter how thorough your quality assurance is, how much coverage your unit tests have or how many cases your integration tests exercise; they only control how badly things go wrong.

Whether caused by a bug or by human mistake, now and then a manual correction is required: someone has to run a SQL script or shuffle some data around. Usually there is a support team in charge of this, and they need the flexibility to fix things on the spot. On a traditional data store a simple update will suffice. However, the events in an event store are immutable and can't be deleted; undoing an action means issuing a command with the opposite, compensating action, as sketched below. It is harder to affect multiple entities and it requires knowledge of the system, unlike SQL, which everyone knows. Overall these operations are not easy without some kind of tooling prepared beforehand, which makes them more complex and error prone and makes it harder for the team supporting these problems to do their job.
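A sketch of such a correction, with invented event names:

```typescript
// There is no UPDATE or DELETE: the wrong deposit stays in the stream forever,
// and the fix is a new, compensating event appended through the normal command path.
type LedgerEvent =
  | { type: "MoneyDeposited"; accountId: string; amount: number; reason: string }
  | { type: "DepositReversed"; accountId: string; amount: number; reason: string };

const ledger: LedgerEvent[] = [
  { type: "MoneyDeposited", accountId: "a1", amount: 500, reason: "salary" },
  { type: "MoneyDeposited", accountId: "a1", amount: 500, reason: "salary (duplicate, operator error)" },
];

// The support "fix": append the opposite action instead of deleting the bad event.
ledger.push({ type: "DepositReversed", accountId: "a1", amount: 500, reason: "compensates duplicate salary" });

const balance = ledger.reduce(
  (total, e) => (e.type === "MoneyDeposited" ? total + e.amount : total - e.amount),
  0
);
console.log(balance); // 500, and the full history of the mistake is preserved
```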

Wrapping it up

CQRS and event sourcing are a mischievous lover: if you know their inner needs better than they know them themselves, they will be everything you ever wanted. Otherwise they will make your life a living hell.

They have their limitations, like everything in life. Knowing those limitations will empower you to make them truly shine.