The 6 Things You Need to Know About Event-Driven Architectures

Oskar uit de Bos · Published in The Startup · Nov 23, 2020

In the first blog of the ‘engineers guide to event-driven architectures’ series we discussed how event-driven architectures deal with complexity, how they provide agility, and the massive scaling potential they offer. It is my personal belief that every solution architect and software engineer looking to solve challenging problems at the heart of businesses should be equipped with a proper understanding of event-driven architectures.

This blog series aims to help you build that understanding and make you aware of the new challenges that arise with event-driven architectures. Based on my personal experience, absurd amounts of reading and a slew of tech talks on the subject, I’ve collected what I believe to be the key concepts and clustered them into 6 topics. While each project has different architectural drivers and constraints, understanding these topics is a great start on the journey to design, implement and run event-driven architectures.

Without further ado, here are the 6 things you need to know about event-driven architectures:

  • The importance of events, commands and queries
  • Event processing is more complex than you think
  • Performance and scale don’t come for free
  • The different ways to build event flows
  • The power of quick diagnosis with observability
  • The event-powered software design patterns

The importance of events, commands and queries

The first topic is the difference between events, commands and queries. In conversations about event-driven architecture, it’s not uncommon that every message in the system is referred to as an event. While that may be acceptable in a casual conversation at the coffee machine, it is not when designing and implementing a (complex) system where it’s vital to use the right concepts in order to capture intent, in both design and code.

All interactions taking place in an event-driven architecture are one of these three primitives: events, commands and queries.

You can think of an event as an observable fact, something that has happened. In the context of event-driven architectures, Wikipedia describes an event as “a significant change in state”. The event is broadcast, accompanied by relevant data, for other components in the system to act upon, thereby allowing side effects to occur. Events should not carry any expectation of future actions to be performed, or of a response to be returned, as they merely represent a fact.

Commands on the other hand, carry an authoritative instruction that there is specific work to be done. A command is sent and a specific receiver is expected to react. Often, an event happens off the back of a command being successfully processed, as this likely led to a significant change in state.

While the technical implementation of events and commands may be similar, using clear naming to separate these concepts ensures that engineers who come to the codebase later can understand the intent and build on that work. This is called the principle of least surprise, sometimes referred to as the principle of least astonishment. Now I am bound by craftsmanship ethics to link this cartoon because it is both funny and the truth.

The final of the three primitives is the query, which is a request to retrieve internal state and return it to the caller. A query doesn’t cause any state changes or other side effects on the system. The most common implementation of a query is an HTTP GET request or a database query to retrieve data.
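To make the distinction tangible, here is a minimal sketch in Python. The class names are illustrative, not prescribed (ProductCreated returns later in this blog); real systems would add schemas and versioning on top:

```python
# Illustrative sketch of the three primitives; naming alone captures intent.
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ProductCreated:   # event: past tense, a fact that has happened
    product_id: str
    occurred_at: datetime

@dataclass(frozen=True)
class CreateProduct:    # command: imperative, an instruction for one receiver
    name: str
    price_cents: int

@dataclass(frozen=True)
class GetProduct:       # query: a request to read state, no side effects
    product_id: str
```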

Event processing is more complex than you think

The second topic is the high complexity of event processing in a distributed system. The potential for duplicate processing and for out-of-order processing of messages, both caused by network failures, is the culprit behind that complexity.

Distributed systems have to be designed with the understanding that the network is not reliable. In the first blog of the series we discussed how this harsh reality results in the unavoidable tradeoff between consistency and availability, causing many distributed systems to have eventual consistency instead of strong consistency.

Besides consistency tradeoffs, the unreliable network can also cause failures in message processing. Losing messages is often not acceptable in an event-driven architecture. Therefore, most event-driven systems rely on at-least-once delivery or exactly-once processing guarantees to achieve reliable message delivery. With these guarantees, any transmission between the broker and its producers and consumers requires an acknowledgement. If an acknowledgement is not received, the message is resent. Not receiving an acknowledgement can have multiple causes:

  • The message from the sender to its receiver is lost in transport
  • The receiver is unavailable or malfunctioning and is not able to process the message
  • The acknowledgement from the receiver back to the sender is lost in transport

Adding the acknowledgement is vital for reliable message delivery over the network, but it does introduce the potential for duplicate processing and increases the chances of out-of-order processing.
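To make the failure mode concrete, here is a minimal Python sketch of acknowledgement-based redelivery; the `send` callable, timeout and backoff values are illustrative assumptions. If the acknowledgement is lost (the third cause above), the receiver ends up processing the same message twice:

```python
# Illustrative sketch: resend until an acknowledgement arrives.
# If an ack is lost in transport, the receiver sees a duplicate.
import time

def send_with_retry(send, message: dict, max_attempts: int = 3,
                    ack_timeout_s: float = 2.0) -> None:
    for attempt in range(max_attempts):
        if send(message, timeout=ack_timeout_s):  # True when acked in time
            return  # delivered and acknowledged
        time.sleep(2 ** attempt)  # back off, then resend the *same* message
    raise RuntimeError("message could not be delivered")
```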

Duplicate processing

When the acknowledgement of successfully processing a message is lost in transport, the same message is resent, causing it to be processed a second time. This is not problematic if messages are idempotent. Idempotent messages leave the system in the same state regardless of how many times they are (re)processed. For messages that are not idempotent, processing the message a second time means the same state change is applied a second time, putting the system into an unexpected and invalid state.

You might have noticed earlier that I said most event-driven systems rely on at-least-once delivery or exactly-once processing guarantees. I did this intentionally, since in a distributed system, having exactly-once delivery is impossible. However, using deduplication it’s possible to recognize previously processed messages and discard them, meaning that exactly-once processing is in fact possible and will prevent duplicate processing. There are solutions that provide an exactly-once processing guarantee, often with some fine print. This will be covered in an upcoming example.
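As a minimal sketch of the deduplication idea, assuming at-least-once delivery and using a plain in-memory set where a real system would use durable storage (all names are illustrative):

```python
# Deduplication sketch: turn at-least-once delivery into
# exactly-once *processing* by discarding already-seen message IDs.
processed_ids: set[str] = set()

def apply_state_change(payload: dict) -> None:
    ...  # the actual, possibly non-idempotent work (e.g. a database write)

def handle(message_id: str, payload: dict) -> None:
    if message_id in processed_ids:
        return  # duplicate redelivery: discard it
    apply_state_change(payload)
    processed_ids.add(message_id)  # record only after successful processing
```

Note that in a real implementation the state change and the deduplication record need to be committed atomically, for example in one database transaction; otherwise a crash between the two steps reintroduces duplicates.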

Out of order processing

When messages are processed by competing consumers they are distributed equally amongst the available consumers. Consumers don’t process messages at exactly the same speed, due to (network) speed fluctuations or transport loss causing messages to be resent, potentially to a different available consumer. This means that with competing consumers, messages might not be processed in the order they were produced.

If a message has no relation to other messages it doesn’t need to be processed in any particular order, as long as it gets processed in a timely fashion. In those cases the possibility of out-of-order processing has no impact.

For other use-cases where messages are related, you do want to ensure that the messages are processed in the right order. For example, a ProductCreated event and all the subsequent ProductUpdated events are related to each other. These events should be processed in the correct order, otherwise the state for that product might become invalid. There are two approaches to deal with ordered processing of related messages:

  • Using a single consumer, in combination with first-in-first-out (FIFO) delivery. This approach protects the global order of all messages, but obviously doesn’t scale well.
  • Use partitioned consumption with multiple consumers, in combination with FIFO delivery. Related messages are always routed to the same consumer based on a routing key attribute; in the example above, the unique ID of the product would be the routing key for the ProductCreated and ProductUpdated events. While partitioned consumption doesn’t protect the global order of all messages, it does protect the order of related messages, which is what matters in most cases. And the big benefit is that it’s much more scalable. There are solutions out there that provide ordered processing guarantees, which will be covered in an upcoming example. A minimal sketch of the routing idea follows this list.
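Here is that routing sketch in Python, with an illustrative partition count; brokers like Kafka apply the same hash-the-key principle when assigning messages to partitions:

```python
# Partitioned consumption sketch: hash the routing key so all related
# messages land on the same partition (and thus the same consumer).
import hashlib

NUM_PARTITIONS = 4  # illustrative

def partition_for(routing_key: str) -> int:
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# All events for product 42 map to one partition, preserving their order.
assert partition_for("product-42") == partition_for("product-42")
```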

Solving processing challenges

The complexities of duplicate processing and out-of-order processing can be mitigated by having idempotent messages that are not related to other messages. However, it’s highly unlikely that every message you have now, and every message you will need as requirements evolve, can be designed that way. At some point, probably sooner rather than later, you will face these complexities when designing event-driven architectures.

As mentioned, there are solutions that can help you address these problems. For the majority of teams I would recommend having a good look at existing solutions rather than implementing one yourself. First of all because the best code is the code you don’t write, as development and maintenance costs are high. Secondly, expert skills on storage and database performance optimizations are required to create a truly scalable solution to this complex problem.

That being said, using an existing solution doesn’t mean it’s simply plug and play with guaranteed success. You have to understand the edge-cases, tradeoffs and configuration choices of the solution and be thorough in architecting because having end-to-end guarantees is not as straightforward as it may seem. Let’s illustrate why this is the case by using a common modern application architecture.

We launched a shiny new app that’s supported by a REST API backend. We expose the REST API backend using an API gateway. The REST API handles incoming requests by producing messages to an Amazon SQS FIFO queue for further processing. This type of queue is a variant of SQS that offers an exactly-once processing guarantee, FIFO delivery and support for message groups for ordered processing of related messages. It ticks all the boxes for dealing with duplicate processing and out-of-order processing.

SQS implements the exactly-once processing guarantee through message deduplication, which is performed by SQS using a deduplication ID. That deduplication ID can be explicitly provided by the developer as part of the message, or SQS can generate one using the SHA-256 hash of the message body.

So there are implementation options. And that’s where the importance of being thorough in architecting comes into play. Let’s say we use the request ID that the API gateway generates for each API call as the deduplication ID and pass that to SQS to perform the deduplication of messages. All good right? Well, no. We haven’t quite thought everything through here…

Apps run on mobile devices which are highly susceptible to connectivity drops. With a bit of unfortunate timing the app sends a request that is successfully processed by our REST API, but the app never receives the response due to a connection drop. This will prompt the app (or the user) to try again. This second request gets a different request ID assigned by the API gateway. As we, in a rare moment of lesser brilliance, decided to use the request ID as the basis for SQS to deduplicate messages we now have a problem.

You might think this is more of a theoretical problem. Almost never happens. Wrong. Google “cell phone internet connection drops” if you’re sceptical. I’ll wait. So, we pick ourselves up and dust ourselves off. Having learned that the request ID is not the way to go, we let SQS generate the deduplication ID based on the message body instead. We also make sure that the message body does not contain any dynamic fields that change with retried requests, like minor shifts in some timestamps. As the whole body is hashed and used as the deduplication ID, any change in any property of the message body will result in a different hash. It’s these small implementation details that can make or break the effectiveness of processing guarantees.
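As a sketch of that fix in Python with boto3, deriving an explicit deduplication ID from stable business fields only; the queue URL, message fields and `send_order` helper are illustrative assumptions, not part of the original example:

```python
# Sketch: deduplication ID derived from stable fields only, so a client
# retry produces the same ID and SQS discards the duplicate.
import hashlib
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders.fifo"  # illustrative

def send_order(order: dict) -> None:
    # Hash only stable fields: no request IDs or timestamps that change on retry.
    stable = {"customer_id": order["customer_id"], "items": order["items"]}
    dedup_id = hashlib.sha256(
        json.dumps(stable, sort_keys=True).encode("utf-8")
    ).hexdigest()
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(order),
        MessageGroupId=order["customer_id"],  # related messages processed in order
        MessageDeduplicationId=dedup_id,      # deduplicated within SQS's window
    )
```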

Another thing to look out for, besides configuration and implementation details, is the solution itself and whether it has limitations to the guarantees it offers. SQS, for example, has a 5-minute deduplication interval for the exactly-once processing guarantee. This is likely a tradeoff made for performance and cost, and it should be fine for the vast majority of use-cases. However, it’s not configurable. If you need guarantees beyond 5 minutes you will have to implement additional measures yourself or select a different solution.

So, to recap. Leverage existing solutions when possible as processing guarantees are not easy to implement at scale. Review the limitations, tradeoffs and configuration options for solutions you are considering. Additionally, understand the points of failure that can cause processing issues in your own architecture.

Performance and scale don’t come for free

The third topic is dealing with performance and scale. Event-driven architectures introduce new concepts that are important to master in order to unlock the scaling potential of this type of architecture.

In event-driven architectures, when the load on the system increases the ingestion rate increases as well. The ingestion rate refers to the number of messages that the producers produce as a result of the incoming traffic. When the ingestion rate grows beyond what the consumers are capable of processing, the number of buffered messages waiting on the queue for consumption starts ramping up. While the queue and the asynchronous nature of the system do provide a buffer, queues that store messages in memory obviously cannot grow indefinitely. If the number of buffered messages keeps increasing, at some point the queue will run out of memory and collapse. Not all queues are equally susceptible to this; some, like Apache Kafka, store messages durably on disk instead of in memory.

However, regardless of message storage, the consumer lag will also increase. This is the number of messages the consumer still has to process before it is “all caught up”. A growing consumer lag negatively impacts a number of things in the system, like eventual consistency windows and asynchronous request-reply flows that expect a response within a timeout window.

In order to prevent high ingestion rates from overwhelming the system, it’s highly advisable to apply backpressure techniques. Backpressure is a feedback mechanism that allows systems to gracefully respond to increases in ingestion rate. It does so by slowing down the producers to match the processing speed of the consumers. In more extreme cases, the backpressure may bubble all the way up to the user, which is unfortunate, but still preferable over queues and consumer lag growing outside normal operating bandwidths.
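As a minimal in-process illustration of the principle in Python: a bounded buffer makes the producer block when consumers fall behind. Real brokers and reactive libraries offer equivalent mechanisms; this is just the shape of the idea:

```python
# Backpressure sketch: the bounded queue is the feedback mechanism.
# When consumers lag, put() blocks and slows the producer down.
import queue
import threading

buffer: queue.Queue = queue.Queue(maxsize=100)  # bounded on purpose

def consumer() -> None:
    while True:
        message = buffer.get()
        ...  # process at whatever speed the consumer can sustain
        buffer.task_done()

def producer() -> None:
    for i in range(1_000):
        buffer.put(f"message-{i}")  # blocks while the buffer is full

threading.Thread(target=consumer, daemon=True).start()
producer()
```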

Another relevant concept, that’s not specific to event-driven architecture but highly beneficial and worth mentioning, is elasticity, or elastic scaling. This practice allows resources to dynamically grow/shrink based on the load. Elasticity can be applied on (cloud) infrastructure, but also on the application level. For example, adding additional consumer instances when consumer lag starts ramping up is a common practice.
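A hedged sketch of that scaling decision in Python; the thresholds and limits are illustrative assumptions, and in practice this logic often lives in an autoscaler fed by lag metrics:

```python
# Elasticity sketch: scale consumer instances with observed consumer lag.
def desired_consumers(consumer_lag: int, current: int,
                      scale_up_at: int = 10_000,   # illustrative thresholds
                      scale_down_at: int = 1_000,
                      max_consumers: int = 16) -> int:
    if consumer_lag > scale_up_at and current < max_consumers:
        return current + 1  # lag ramping up: add a consumer instance
    if consumer_lag < scale_down_at and current > 1:
        return current - 1  # comfortably caught up: shed an instance
    return current  # within the normal operating bandwidth
```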

An event-driven architecture has great scaling potential, but it doesn’t come for free. Weak links in the processing chain still need to be found and addressed, and concepts like backpressure and elasticity help deal with spike loads.

The different ways to build event flows

The fourth topic is the different ways to design event flows. Building flows out of event-driven microservices can be challenging. Making the wrong decisions can put you in a position where it’s hard to understand the flow and to accommodate new requirements. When designing event flows there’s two main strategies, often referred to as peer-to-peer choreography and orchestration.

When an event flow is designed with peer-to-peer choreography, the various microservices communicate directly with each other using events in order to get something done. While building flows this way is common, there are some big challenges with peer-to-peer choreography:

  • It can be difficult to understand and evolve complex peer-to-peer choreography, as it requires reasoning across multiple microservice codebases to piece together how the flow works. The flow is not as explicit as looking at a single piece of code. The fact that each microservice has some level of flow awareness also impacts its re-usability.
  • It can be difficult to implement concerns like retries, timeouts and process overviews that require knowledge of different steps in the flow.

The other approach to event flows is called orchestration, which introduces an orchestrator that acts much like a conductor that steers an orchestra. It centralizes flow knowledge in a single place, allowing other microservices to not care about processes and instead just perform a single task extremely well. Netflix built their own microservice workflow engine called Conductor because they found peer-to-peer choreography hard to evolve.
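To show the shape of the orchestration style (not Conductor’s actual API), here is a minimal Python sketch in which the whole flow reads top to bottom in one place; the service calls are illustrative stubs:

```python
# Orchestration sketch: flow knowledge lives in the orchestrator; each
# service just executes a single command. All calls are illustrative stubs.
def charge_payment(order: dict) -> bool:
    ...  # send a ChargePayment command, await the outcome
    return True

def reserve_stock(order: dict) -> bool:
    ...  # send a ReserveStock command, await the outcome
    return True

def refund_payment(order: dict) -> None:
    ...  # compensating action when a later step fails

def ship_order(order: dict) -> None:
    ...  # send a ShipOrder command

def place_order(order: dict) -> None:
    # Unlike choreography, the entire flow is explicit in one codebase.
    if not charge_payment(order):
        return
    if not reserve_stock(order):
        refund_payment(order)  # undo the earlier step
        return
    ship_order(order)
```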

Look out for a future blog in the series where I will do both approaches justice with examples, visuals and a step-by-step explanation of how to evolve these flows as well as discuss the level of coupling.

The power of quick diagnosis with observability

The fifth topic is the need for observability. There was a time when, if you brought your car to the garage for diagnosis, a mechanic would rely on immediately visible and audible information as well as a mental list of the most probable causes. Nowadays, they connect a diagnostic device that can instantly read dozens if not hundreds of sensors. The diagnostic device removed the educated guesswork from the process and made diagnosis much faster.

It’s very similar to what observability can do for operating and running any distributed system. Observability is a group of patterns and practices that allows observing, tracing, and monitoring of a distributed system in real-time, usually with one to two weeks of history retained.

In a complex distributed system, with many small components working together, having this observability is massively important as it allows engineers:

  • To follow the flow of events triggered by a single request or workflow in real-time, across the different microservices involved. This is extremely helpful for understanding systems with complex event flows (a minimal sketch of the underlying mechanism follows this list).
  • To quickly diagnose the cause of an issue by pinpointing the exact point of failure in a processing chain of multiple microservices.
  • To see performance and throughput in event processing flows and identify bottlenecks that need to be addressed.
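Here is that sketch in Python: every message carries a correlation ID, so logs and traces from all microservices handling one request can be stitched together. The envelope shape and names are illustrative assumptions:

```python
# Correlation ID sketch: stamp each message so one request can be
# followed across services.
import uuid

def new_correlation_id() -> str:
    return str(uuid.uuid4())  # generated once, at the edge of the system

def publish(event: dict, correlation_id: str) -> None:
    envelope = {"correlation_id": correlation_id, "payload": event}
    ...  # hand the envelope to the broker

def handle(envelope: dict) -> None:
    cid = envelope["correlation_id"]
    print(f"[{cid}] processing {envelope['payload']}")  # every log line carries it
    # any follow-up events are published with the *same* correlation ID
```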

Unfortunately, I often see teams underestimate the importance of observability, or even ignore it. I will go as far as to claim that not addressing observability means guaranteed failure for any distributed system that runs in production. There’s a reason that every mechanic has a diagnostic device now.

Look out for a future blog in the series where I share combinations of tooling you can use to create observability in event-driven architectures.

The event-powered software design patterns

The sixth and final topic is becoming familiar with the commonly used event-powered software design patterns. As mentioned before when talking about events, commands and queries, the principle of least surprise is important in software development as it helps an engineer coming in fresh to quickly understand intent. And perhaps the argument I should have opened with: why invest time trying to solve a problem that has already been solved for you, in a manner that’s endorsed by the broader engineering community?

These are the patterns I most commonly see in event-driven architecture as well as ones I’ve applied myself in designing event-driven systems:

  • Event Notification
  • Event-Carried State Transfer
  • Queue-Based Load Leveling
  • Asynchronous Request-Reply
  • Claim Check
  • Event Sourcing
  • Command Query Responsibility Segregation (CQRS)
  • Stream processing

It’s not uncommon for an event-driven system to use a combination of these patterns, as they address specific challenges. Arguably, not all these patterns act on the same level of abstraction, or have the same overall significance in the system’s architecture, but I chose to combine them in one list for simplicity and completeness.

Look out for future blogs in the series where I can do these patterns justice with proper details discussing their strengths and considerations.

Join me for the rest of the series

I closed the first blog of the ‘engineers guide to event-driven architectures’ series stating that we had just scratched the surface of event-driven architectures.

In this blog we dug deeper, covering the 6 things I feel every engineer needs to know about event-driven architectures. We discussed duplicate processing, out-of-order processing, backpressure, consumer lag and many more important concepts and challenges.

There’s still a lot more to cover! Especially on the design patterns, event flows and observability. In subsequent blogs of the series, I will explore the design patterns in detail first, as they form such important building blocks for an event-driven architecture.

Hope you are along for the ride. If you feel I’ve missed something important or you want to learn more about a specific topic related to event-driven architectures, let me know with a comment! Until the next blog, I wish you an awesome day and happy coding!
