Event Notification vs. Event-Carried State Transfer

Arvind Balachandran
Published in The Startup · Oct 27, 2019

As we move into a world where distributed systems are the norm and monoliths the exception, one of the key architectural decisions to make is around the mechanics of inter-service communication. Whilst the challenges of synchronous communication are reasonably well understood (unless, of course, you like a special kind of challenge and would love to manage a distributed monolith), the decision points around how asynchronous communication is enabled are still a bit murky. This post walks through a few of the factors I weigh when defining the pattern, specifically around eventing.

In this post, we’ll assume the organisation implements an ‘address’ microservice to manage its customers’ addresses and their usage. These could be billing addresses, shipping addresses or other locations. The example we’ll take belongs to a customer identified by customer id 911000100. This customer’s initial state in the system comprised 4 addresses:

Snapshot of customer 911000100’s addresses in the system prior to change

The customer has then made a change to their billing address: a minor correction of a typo in the street number (from 84 to 48). The final state should have been:

Snapshot of customer 911000100’s addresses in the system after the change

Event Notification:

In this mode, the event producer sends a notification to the event system that a change has happened to the entity. For instance, in the case of the address service, the event conveys only the bare minimum of information (“Customer 911000100 has had a change to the address entity”).

This can alternatively be refined to give a bit more metadata about the change. For example: “Customer 911000100 has had a change to their billing address”.

Note that in both these scenarios, the change itself is NOT specified in the event. Consumers are expected to query the read endpoint to learn the latest state of the data.
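The two notification shapes described above can be sketched as minimal payloads. This is an illustrative sketch only; the field names (`event_type`, `customer_id`, `address_type`) are assumptions, not taken from the post.

```python
# Bare notification: only states WHICH entity changed, never the new values.
bare_notification = {
    "event_type": "address.changed",
    "customer_id": "911000100",
}

# Refined variant: a little more metadata about the change, still no state.
refined_notification = {
    "event_type": "address.changed",
    "customer_id": "911000100",
    "address_type": "billing",
}

# Neither payload carries the actual address data; consumers must
# call the read endpoint to discover the current state.
assert "street" not in bare_notification
assert "street" not in refined_notification
```

The consumer cannot act on the payload alone; the read-back call is an inherent part of this pattern.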

Event-Carried State Transfer:

In stark contrast to the event notification model, the event-carried state transfer model puts the data as part of the event itself.

There are two key variants to implementing this: fine-grained events and snapshots.

Fine-Grained:

The address change event could be sent as a fine-grained event in one of two forms:

  1. As a specific change (customer 911000100 changed their ‘billing’ address house number from 84 to 48)

  2. As a snapshot of the specific record (customer 911000100 previously had ‘billing’ address ’84 Baker Street, London’ and has now changed it to ’48 Baker Street, London’)

Snapshot:

The second variant is to send snapshots of the main context that the inter-process communication should carry. As you’ll have noticed in the samples, the context of the message is the customer id as opposed to the address id itself. This choice should be based on an evaluation of what consumers would benefit from most: an understanding of data against a specific customer vs. specific entity records that ignore the larger context they operate in.
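The three state-transfer shapes discussed above can be contrasted side by side. Again a sketch under assumed field names; only the address values shown in the post’s example are used, and the customer’s remaining addresses are elided.

```python
# 1. Fine-grained delta: only the field that changed.
delta_event = {
    "customer_id": "911000100",
    "address_type": "billing",
    "field": "house_number",
    "old": "84",
    "new": "48",
}

# 2. Fine-grained record snapshot: before/after of the one record.
record_snapshot_event = {
    "customer_id": "911000100",
    "address_type": "billing",
    "before": "84 Baker Street, London",
    "after": "48 Baker Street, London",
}

# 3. Context snapshot: the post-change state of the whole customer
# context, so consumers never need a read-back call.
context_snapshot_event = {
    "customer_id": "911000100",
    "addresses": {
        "billing": "48 Baker Street, London",
        # ...the customer's other three addresses would appear here...
    },
}
```

Note how each step down the list carries more data over the wire but asks less of the consumer.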

First up, a non-negotiable that must be built into all consumers: idempotence.

A service is said to be idempotent if the same input being thrown at it multiple times does not change its state unexpectedly. The concept applies as much in computer science as it does in math. Practically, if your organisation mandates the usage of correlation IDs for requests (including eventing), idempotence can be achieved with simple introspection on the correlation IDs already processed. In the eventing world, very few platforms have modes that target exactly-once delivery (most platforms offer at-least-once delivery). Idempotence should always be part of the non-negotiable list of requirements (others in this category include auditing and logging; more on that another time) when building your service.
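The correlation-ID introspection described above can be sketched as a small dedup layer. This is a minimal in-memory sketch; a real service would persist (and eventually expire) the set of seen IDs.

```python
class IdempotentConsumer:
    """Drops redundant deliveries by tracking processed correlation IDs."""

    def __init__(self):
        self._seen_ids = set()
        self.processed = []

    def handle(self, event):
        correlation_id = event["correlation_id"]
        if correlation_id in self._seen_ids:
            # At-least-once delivery: redeliveries are silently dropped,
            # so state does not change unexpectedly.
            return False
        self._seen_ids.add(correlation_id)
        self.processed.append(event)
        return True

consumer = IdempotentConsumer()
event = {"correlation_id": "abc-1", "customer_id": "911000100"}
assert consumer.handle(event) is True   # first delivery is processed
assert consumer.handle(event) is False  # redelivery is ignored
assert len(consumer.processed) == 1
```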

Evaluation Criteria

Consuming systems’ ability to manage out of order delivery:

Unlike idempotence, out-of-order delivery is a different beast altogether. Handling it requires a much deeper introspection capability within your service or your processing pipeline. Message-order guarantees are expensive: almost all distributed messaging platforms either do not offer them or offer specific FIFO variants that cost more (both in dollars and in throughput; setting up the platform will also require careful thought about partitioning models). This is one factor that I assign significant weight to when deciding between notification and state transfer. If your consuming applications are unlikely to accurately reorder messages or cherry-pick which messages to process vs. ignore, notification is by far the simplest approach to adopt.

In the notification model, the relation between event time and processing time is severed. Regardless of when a consumer chooses to consume an event, the latest state of the record is retrieved from the data owner. Assuming the consuming system is idempotent, it can now accommodate significant delays in processing (which can be leveraged for batching optimisations). Having said that, this assumes the event producer has a highly available read service to serve the callback. In practice this is generally not an issue, though it is still something to keep an eye on, since the read can happen at any time in the future, not just at event time.
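The severed link between event time and processing time can be sketched as follows. `read_api` is a hypothetical stand-in for the producer’s highly available read endpoint; the point is that however late or out of order the notifications arrive, the consumer always ends up with the owner’s current state.

```python
def make_consumer(read_api):
    """Build a notification handler that ignores ordering entirely.

    On every notification, the latest state is fetched from the data
    owner, so stale or reordered deliveries converge to the same result.
    """
    local_cache = {}

    def on_notification(event):
        customer_id = event["customer_id"]
        local_cache[customer_id] = read_api(customer_id)

    return on_notification, local_cache

# A toy read endpoint representing the data owner's current truth.
owner_state = {"911000100": "48 Baker Street, London"}
consume, cache = make_consumer(lambda cid: owner_state[cid])

# Two notifications for the same change, possibly delayed or reordered:
consume({"customer_id": "911000100"})
consume({"customer_id": "911000100"})
assert cache["911000100"] == "48 Baker Street, London"
```

With state-transfer payloads, the same two deliveries would have to be compared and cherry-picked; here they are simply interchangeable.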

Batching:

There might be performance optimisations you’d prefer in the communication to certain downstream applications, for a whole range of reasons including SaaS solutions that charge based on the number of API calls. In this scenario, if the delay between the change event in the owning application and its propagation to the consuming application is negotiable, it is common practice to cluster related messages into single API calls. This is another scenario where the notification mode simplifies the overall system: in the state-transfer mode, multiple messages need to have their content compared before batching can be achieved, and the complexity increases significantly if the provider did not choose the snapshot mode.
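Because notifications carry no state, batching reduces to deduplicating customer IDs in the backlog window: one read-back (and one downstream API call) per customer suffices. A sketch, assuming the payload shape used earlier; a real pipeline would also bound the batch size and the delay.

```python
from collections import OrderedDict

def coalesce_notifications(events):
    """Collapse a backlog of notifications into one entry per customer.

    Duplicates for the same customer are interchangeable (the payload
    holds no state), so they can be dropped without inspecting content.
    """
    unique = OrderedDict()
    for event in events:
        unique.setdefault(event["customer_id"], event)
    return list(unique.values())

backlog = [
    {"customer_id": "911000100"},
    {"customer_id": "911000200"},
    {"customer_id": "911000100"},  # second change in the same window
]
batch = coalesce_notifications(backlog)
assert [e["customer_id"] for e in batch] == ["911000100", "911000200"]
```

Contrast this with state-transfer payloads, where merging the two events for 911000100 would require comparing (and correctly ordering) their contents.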

Event producer simplicity:

The complexity that event producers must bear is a key criterion for choosing the design pattern. As seen in the examples, the more verbose the event becomes, the more complex it is to extract the data and the more expensive it is to send over the wire. This, in turn, increases the risk of sending events out of order. Additionally, once the data starts appearing in the message itself, the need to protect the channels and messaging platforms (including encryption/masking needs, etc.) comes into play. Access control also needs to be applied to the event platform with the same diligence as to the read APIs.

Need for tracking all state changes:

If a downstream consumer needs to know every state change that the data owner has gone through, then the previous decision points are effectively overridden. Whilst I’m not particularly fond of the idea of external systems auditing every change, there might be specific reasons within the organisation that warrant it. Ideally, in a properly divvied-up ecosystem, auditing responsibility for an entity should fall within the same service that sent the event out in the first place (which should be the single source of truth for that audit). Consuming systems, instead, should be able to accept an eventually consistent replication from the source of truth: the data owner of the entity. That is probably a debate in itself. To the topic at hand, however, a system that must audit all state changes externally has no option but to handle out-of-order delivery properly, and that leaves event-carried state transfer as the only option to choose.

Conclusion

The event notification pattern is simple to define and implement from both an event producer’s and a consumer’s perspective. From the consumer’s side, there is a need to make an additional API call (the call itself is as simple as it gets), and there is an error condition to manage: the endpoint serving the current state of the data being unresponsive. However, I’d consider this complexity easier to deal with than managing out-of-order delivery and introspecting multiple parts of the consuming application to compare against the incoming event before making consumption decisions.

If the requirements, however, rule the notification mode out of the equation, keep the following strategies in mind:

  • Evaluate which part of the ecosystem takes responsibility for ordering: individual services vs. the streaming pipeline. A combination of a messaging platform like Kafka and a processing framework like Apache Beam is good at managing in-order, exactly-once delivery
  • Ensure the security rules that need to be applied to data (particularly related to PII) are not ignored on the eventing model
  • Consider a Lambda architecture if streaming is a bit of a stretch and temporary misalignment of data is acceptable. This assumes the event producer is capable of sending batch extracts to its consumers.

PS:

A couple of things that this post did not dig into:

Event vs. Command:

  • An event is something that has already happened and it does NOT tell any consumer what should happen next.
  • A command on the other hand notes within its construct a specific state change expected in another system.
  • Implementations of the choreography pattern have a tendency to blur the lines of this distinction.
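The distinction in the bullets above can be made concrete with two toy messages. The shapes are illustrative assumptions: the point is that an event records a past fact and names no recipient, while a command addresses a specific system and tells it what to do next.

```python
# An event: past tense, a fact about what already happened.
event = {
    "type": "AddressChanged",
    "customer_id": "911000100",
}

# A command: imperative, directed at a specific system, expecting
# a specific state change there.
command = {
    "type": "UpdateShippingLabel",
    "target_service": "fulfilment",   # hypothetical recipient
    "customer_id": "911000100",
}

assert "target_service" not in event  # events prescribe no next action
assert command["target_service"] == "fulfilment"
```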

Event Sourcing and CQRS:

This post did not get into another key topic, Event Sourcing (and its partner, CQRS) since that model has less to do with inter-process communication and more with how a given system chooses to hold its state. A topic for another day perhaps.
