Event sourcing makes errors worse
Mark Green

Mark, thank you for writing this post!

I think it is great that you bring attention to these aspects. I think that they aren’t really specific to event sourcing. Our industry’s reality is that we do sometimes experience the “sky is falling” syndrome, and there are reasons behind that (nobody and nothing is perfect), but we still somehow manage to keep things in a more or less working condition :)

I think there are at least a few different components you’re discussing.

The issue of the data being invalid (or malformed). The same applies to any software system. With a more traditional design approach, you would either get a runtime exception (constraint violation, invalid type, etc.), and your best chance of finding this out is through monitoring your logs; or it would lead to an exception further down the road, and those are harder to trace because of their delayed nature.

With event sourcing systems, at the very least there is a journaled cause of the error that will help you identify the issue. In a classic event sourcing system, you’d have to either stop or skip the event, depending on the domain. Failing early is usually the best bet to avoid further complications. In lazily event sourced systems, this is even less of a problem, because it will only affect a subset of queries against indexed events.
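
To make the “stop or skip” choice concrete, here is a minimal sketch (TypeScript, with made-up event and policy shapes) of a replay loop that either fails fast on a malformed event or skips it while keeping a record of what was skipped:

```typescript
// Hypothetical event shape and replay policy; the names are illustrative only.
type StoredEvent = { sequence: number; type: string; payload: unknown };

type ReplayPolicy = "stop" | "skip";

function replay(
  events: StoredEvent[],
  apply: (event: StoredEvent) => void,
  isValid: (event: StoredEvent) => boolean,
  policy: ReplayPolicy
): StoredEvent[] {
  const skipped: StoredEvent[] = [];
  for (const event of events) {
    if (!isValid(event)) {
      if (policy === "stop") {
        // Fail early: surface the journaled cause instead of corrupting state.
        throw new Error(`Invalid event at sequence ${event.sequence}: ${event.type}`);
      }
      // Skip, but keep a record so the issue is still visible.
      skipped.push(event);
      continue;
    }
    apply(event);
  }
  return skipped;
}
```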

I believe that in event sourcing systems especially, with their immutability requirements, it is even more important to define types and constraints (event schemas) well and to fail as early as possible (at compile time, when feasible).
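
As a rough illustration (the specific events and fields are invented for the example), one way to do this is to model the event schema as a discriminated union and validate incoming data before it ever reaches the journal:

```typescript
// Illustrative event schema as a discriminated union; the fields are assumptions.
type OrderPlaced = { type: "OrderPlaced"; orderId: string; quantity: number };
type OrderShipped = { type: "OrderShipped"; orderId: string };

type OrderEvent = OrderPlaced | OrderShipped;

// Runtime guard for data arriving from outside the type system (e.g. a message bus).
function parseOrderEvent(raw: unknown): OrderEvent {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("Malformed order event");
  }
  const candidate = raw as Record<string, unknown>;
  if (
    candidate.type === "OrderPlaced" &&
    typeof candidate.orderId === "string" &&
    typeof candidate.quantity === "number" &&
    candidate.quantity > 0
  ) {
    return { type: "OrderPlaced", orderId: candidate.orderId, quantity: candidate.quantity };
  }
  if (candidate.type === "OrderShipped" && typeof candidate.orderId === "string") {
    return { type: "OrderShipped", orderId: candidate.orderId };
  }
  // Reject malformed data before it reaches the immutable journal.
  throw new Error("Malformed order event");
}
```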

The issue of eventual consistency. You mentioned an example of having stock when the order arrived, but not having it when we’re processing the order. I think this is more an issue of designing your domain to reflect the workflow better.

The fact that we’ve received an order (say, OrderPlaced) doesn’t mean more than just that, well, the order has been placed. As you mentioned, “that thing has happened, you have to deal with it now”. And that’s the key moment. What happened here is just the receipt of the order. Shipping didn’t happen, so the OrderPlaced subscriber is in no position to deal with the stock.

A process manager orchestrating the ordering process would forward this order to a component that handles actual shipping (say, ShipOrder), which should only ship what is available in stock and should only emit OrderShipped if the order has, in fact, been shipped. Some kind of locking mechanism should be used in ShipOrder processing to ensure we don’t oversell stock, so that our events accurately represent what happened. If that is undesirable, or impossible, a compensating event can indeed be introduced to “undo” OrderShipped. Perhaps OrderShipmentCancelled.
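
A very rough sketch of what I mean, assuming TypeScript and a hypothetical stock store; only the ShipOrder, OrderShipped and OrderShipmentCancelled names come from the discussion above:

```typescript
// Hypothetical shapes; only the command/event names mirror the discussion above.
type ShipOrder = { type: "ShipOrder"; orderId: string; quantity: number };
type OrderShipped = { type: "OrderShipped"; orderId: string };
type OrderShipmentCancelled = { type: "OrderShipmentCancelled"; orderId: string; reason: string };

interface StockStore {
  // Assumed to reserve stock atomically (the "locking mechanism"); false if unavailable.
  tryReserve(quantity: number): boolean;
}

// Only emit OrderShipped if the stock reservation actually succeeded.
function handleShipOrder(command: ShipOrder, stock: StockStore): OrderShipped | null {
  if (!stock.tryReserve(command.quantity)) {
    return null; // a fuller model would emit an explicit failure event here
  }
  return { type: "OrderShipped", orderId: command.orderId };
}

// If an atomic reservation is not feasible, a later compensation can undo the shipment.
function cancelShipment(orderId: string, reason: string): OrderShipmentCancelled {
  return { type: "OrderShipmentCancelled", orderId, reason };
}
```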

Modelling beyond the happy path. Indeed, it is easier to pretend failures don’t happen. However, they do, and just like in Erlang (for example), you can embrace the idea of failure. In this case, that actually means having a more complete model of what is happening (something similar to a finite state machine).
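
For example, something like this sketch makes the non-happy paths explicit states rather than afterthoughts (the state names and the OrderShipmentFailed event are illustrative, not from the original post):

```typescript
// Illustrative order lifecycle as a small finite state machine.
type OrderState = "placed" | "shipped" | "shipment_failed" | "cancelled";

type OrderLifecycleEvent =
  | { type: "OrderPlaced" }
  | { type: "OrderShipped" }
  | { type: "OrderShipmentFailed" } // hypothetical failure event
  | { type: "OrderShipmentCancelled" };

// Transitions cover failures explicitly instead of only the happy path.
function nextState(state: OrderState, event: OrderLifecycleEvent): OrderState {
  switch (event.type) {
    case "OrderPlaced":
      return "placed";
    case "OrderShipped":
      return state === "placed" ? "shipped" : state;
    case "OrderShipmentFailed":
      return state === "placed" ? "shipment_failed" : state;
    case "OrderShipmentCancelled":
      return state === "shipped" ? "cancelled" : state;
  }
}
```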

In the longer term, it does lead to a better understanding and coverage of the domain model and to more reliable software. Sure, hastily written “happy path” code is easier to ship in the short term, but whether that is a reasonable decision depends on how much gradual quality decline matters for the project.


Thanks for highlighting interesting system design aspects!