The Engineer's Guide to Event-Driven Architectures: Benefits and Challenges

Oskar uit de Bos · Published in The Startup · Nov 3, 2020

Many big technology companies like Netflix, Uber and Spotify have moved from a monolithic architecture to a microservices architecture in order to successfully build and run complex systems at a massive scale.

Event-driven architectures are becoming increasingly popular in the microservice space due to their scalability potential as well as their adaptability. It's a powerful architecture, but it does come with its challenges.

In order to help you determine whether it's an interesting architecture for your next technology endeavor, I've decided to write an in-depth series of blogs, called 'The Engineer's Guide to Event-Driven Architectures', that will cover topics like:

  • Different types of event-driven architectures and their use-cases
  • How to select the right tools for an event-driven architecture
  • How to design events
  • How to design event flows
  • How to ensure delivery of high-quality microservices
  • How to scale and performance tune

Let’s start at the beginning and break down how an event-driven architecture works and provide an overview of its benefits and challenges, laying the foundation for the subsequent blogs of the series.

How does an event-driven architecture work?

In an event-driven architecture, when a microservice performs an action that other microservices are potentially interested in, it publishes an event to the event broker. This microservice is referred to as a producer. Other microservices in the system, called consumers, can process the event.

There are two messaging models: message queuing and publish/subscribe. The difference is that with message queuing, each event is processed by exactly one consumer. With the publish/subscribe model, many different consumers can subscribe to particular types of events and consume them independently.

Once a consumer has successfully processed an event, it sends an acknowledgement to the event broker. The event broker acts as the intermediary between producers and consumers; producers don't know or care about consumers, and vice versa.

A simple illustration of events using the publish/subscribe messaging model
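To make the flow concrete, here's a minimal sketch of the publish/subscribe model in Python. It's a toy in-memory broker, not a real broker API like Kafka or RabbitMQ; the event name, the handlers and the acknowledgement convention are all illustrative assumptions.

```python
from collections import defaultdict

class EventBroker:
    """Toy in-memory broker. Real brokers add persistence, delivery
    guarantees, partitioning, etc.; this only shows the pub/sub shape."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # event type -> consumer callbacks

    def subscribe(self, event_type, consumer):
        self._subscribers[event_type].append(consumer)

    def publish(self, event_type, payload):
        # Publish/subscribe: every subscribed consumer receives the event.
        for consumer in self._subscribers[event_type]:
            acked = consumer(payload)
            if not acked:
                # A real broker would redeliver or hand the event to
                # another consumer instance (see fault tolerance below).
                print(f"no ack for {event_type}, would redeliver")

broker = EventBroker()

# Consumers: independent microservices interested in the same event.
def send_confirmation_email(event):
    print(f"emailing customer for order {event['order_id']}")
    return True  # acknowledge successful processing

def update_inventory(event):
    print(f"reserving stock for order {event['order_id']}")
    return True

broker.subscribe("OrderPlaced", send_confirmation_email)
broker.subscribe("OrderPlaced", update_inventory)

# Producer: publishes and moves on, unaware of who consumes the event.
broker.publish("OrderPlaced", {"order_id": 42, "customer_id": 7})
```

With message queuing, by contrast, the broker would hand each event to only one of the subscribed consumers instead of all of them.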

The reality of distributed systems

Before we dive into the benefits and challenges of event-driven architectures, it's important to realize that, like any microservice implementation, an event-driven architecture is a distributed system. This means the system's components are located on different networked computers, which communicate and coordinate their actions over the network in order to achieve a common goal.

Because of the hype around microservices and the focus on their benefits, many teams adopted this architectural style while the underlying theory of distributed computing was either not clear to them or underestimated and not sufficiently accounted for in system design.

The distributed computing theory I am referring to consists of the fallacies of distributed computing and the tradeoffs between consistency, availability and partition tolerance that follow from the CAP theorem.

The fallacies of distributed computing describe a set of false assumptions that engineers new to distributed applications invariably make about the network. Assuming that the network is reliable, latency is zero and bandwidth is infinite leads to systems designed without the robustness and resilience measures needed to deal with the harsh realities of communicating over a network, with a plethora of performance issues and cascading failures as a result.

Another false assumption is that it's possible to create a system with strong consistency and high availability while also being partition tolerant. Eric Brewer burst that bubble in the late 1990s when he formulated the CAP theorem.

The CAP theorem states that a distributed system cannot simultaneously guarantee consistency (C), availability (A) and partition tolerance (P); at most two of the three can hold at once. Partition tolerance means that the cluster continues to function even if there's a "partition", a communication break between nodes: both nodes are up but unable to communicate with each other.

When dealing with scenarios of high scale, partition tolerance and availability often cannot be sacrificed. Strong consistency, meaning the guarantee that data viewed immediately after an update is consistent for all requesters (as is the case with ACID guarantees), is then no longer feasible.

Instead, systems fall back to what is called eventual consistency. This is a theoretical guarantee that, provided no new updates to data are made, all reads of that data will eventually return the last updated value.

In monolithic architectures, eventual consistency was limited to the storage layer; implementation challenges like replica convergence and conflict resolution didn't bleed into the application layer.

This is different for most microservice architectures. The centralized database is forsaken to reduce coupling and a database-per-service model is adopted instead, meaning that instead of being able to update a bunch of things together in a single transaction, multiple resources need to be updated separately.

Now, I believe that a properly designed microservice architecture should avoid transactions across microservices as much as possible due to the complexity and performance penalty. But even without distributed transactions, the choice between consistency and availability is still present.

Let’s say we’re designing an eCommerce system, where we have a customer and an order microservice. These services can be designed to favor either consistency or availability:

  • Favoring consistency, the customer data would strictly remain under control of the customer microservice. The order microservice reaches out to the customer microservice for details like the shipping address. However, if the customer microservice is not reachable, the order microservice cannot perform its task. This might be an acceptable risk, or it might not.
  • Favoring availability, the dependency on the customer microservice needs to be eliminated by replicating the relevant subset of customer data to the order microservice. By replicating the data, you are limited to eventual consistency and must implement replica convergence and conflict resolution as a consequence (see the sketch after this list).
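As a rough sketch of the availability-favoring option, assuming a hypothetical 'CustomerUpdated' event and made-up field names, the order microservice could maintain a local replica of just the customer data it needs:

```python
# Local replica inside the order microservice: only the subset of
# customer data needed to fulfil orders (names are hypothetical).
customer_replica = {}

def on_customer_updated(event):
    """Consumer for 'CustomerUpdated' events emitted by the customer service."""
    customer_replica[event["customer_id"]] = {
        "shipping_address": event["shipping_address"],
    }
    return True  # acknowledge successful processing

def place_order(order_id, customer_id):
    # No synchronous call to the customer microservice: the order service
    # stays available even when the customer service is down, at the cost
    # of potentially reading slightly stale (eventually consistent) data.
    address = customer_replica[customer_id]["shipping_address"]
    print(f"shipping order {order_id} to {address}")

on_customer_updated({"customer_id": 7, "shipping_address": "1 Main St"})
place_order(42, customer_id=7)
```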

The benefits of an event-driven architecture

Now that we've thoroughly explored some of the key challenges of building distributed systems, let's explore the specific benefits and challenges that an event-driven architecture has over competing architectural styles like building REST microservices. First, the benefits:

  • Loose coupling: The producers and consumers have no awareness of each other as the event broker acts as an intermediary. This results in very loosely coupled microservices which makes modifying, testing and deploying them easier.
  • Scaling: It’s much easier to scale the system by horizontally scaling producers and/or consumers without any system redesign due to the loose coupling, asynchronous nature of events and the broker topology. This also allows for speed mismatches between producers and consumers, enabling smaller incremental scaling improvements.
  • Fault tolerance: If a consumer errors while consuming an event, or unexpectedly stops working, the event is not acknowledged to the event broker as successfully processed. The event is not lost, but instead gets re-consumed by another instance of the consumer, or consumed again when the consumer becomes available.
  • Replica convergence: As discussed earlier, one of the complexities of eventual consistency on an application level is implementing replica convergence. Events are a great mechanism for doing so, especially in more complex scenarios with multiple replicas of data. The microservice that owns that data doesn’t have to know who has a replica, it just needs to emit an event when that data changes.
  • Extensibility: Since produced events can be consumed by multiple consumers, there's a one-to-many relationship that allows events to be re-used for new features by simply adding an additional consumer. This makes event-driven architectures easy to extend without modifying existing microservices.
  • Realtime: An event-driven architecture offers unique transparency into everything that is happening in the system in real time. The ability to consume this information, extract real-time insights and facilitate automated decision making is a matter of adding consumers to existing events. Adding real-time capabilities to other architectures is possible, but requires more effort, additional tooling and changes to existing components.
  • Recovery: When using an event broker that offers persistent and durable event streams, it's possible to recover lost work or rebuild the system's state by "replaying" events from the past (a minimal sketch of this idea follows this list).
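Here's a minimal sketch of that replay idea, using a plain Python list as a stand-in for a durable event log and made-up event types:

```python
# A persisted, ordered event stream. In reality this would live in a
# durable log such as a Kafka topic; this list is just a stand-in.
event_log = [
    {"type": "FundsDeposited", "account": "a1", "amount": 100},
    {"type": "FundsWithdrawn", "account": "a1", "amount": 30},
    {"type": "FundsDeposited", "account": "a1", "amount": 50},
]

def rebuild_balances(events):
    """Rebuild state from scratch by replaying every event in order."""
    balances = {}
    for event in events:
        delta = event["amount"] if event["type"] == "FundsDeposited" else -event["amount"]
        balances[event["account"]] = balances.get(event["account"], 0) + delta
    return balances

print(rebuild_balances(event_log))  # {'a1': 120}
```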

The challenges of an event-driven architecture

Now that we’ve discussed all the benefits that make an event-driven architecture great, let’s explore the tradeoffs. Some of these tradeoffs are also present in competing architectural styles like building REST microservices, but in my opinion they are amplified with an event-driven architecture. Here they are:

  • Reasoning about the system: The loose coupling of producers and consumers and the openness of the pub/sub messaging model make it harder to reason about the system and more difficult to quickly understand how things fit together. While this is true to some extent for microservice architectures in general when compared to a monolith, in REST microservices you can see the direct interactions between microservices in the code. In an event-driven architecture, looking at the code of a producer gives no irrefutable evidence of what happens once an event is produced. Engineers are at the mercy of clear naming, documentation and watching events flow through the system in real time to understand it.
  • Reasoning about data: Understanding an application's data was easy with a centralized database. With the database-per-service model it becomes more difficult, especially if data is replicated to other microservices. Again, this is true to some extent for microservice architectures in general, but since events are great for replica convergence, the overall level of eventual consistency tends to be higher. The problem of data conflict resolution, however, remains hard: it's not always obvious what the source of truth is and how conflict resolution should be done.
  • Designing events: Designing events is hard. Since any consumer can subscribe to an event, it has to be re-usable instead of being tailored to the exact needs of one consumer. At the same time, it cannot be so generic that its intent becomes unclear.
  • Designing for failures: With the inherently unreliable network, events have to be designed to handle duplicate consumption and out-of-order processing. This is because transport loss can occur when the consumer acknowledges successful consumption to the event broker. Assuming at-least-once delivery guarantees are used (by far the most common), the broker will then send the event to the consumer again. If events are not designed to deal with these realities, they will cause widespread data consistency issues that have to be reconciled (see the idempotent-consumer sketch after this list).
  • Designing and tracking event flows: While components in the system are loosely coupled, and fairly autonomous as a result, in the end the system overall needs to perform business transactions. There are several approaches for designing event flows, with some pitfalls to be aware of to prevent flows from becoming hard to maintain over time.
  • Changing events: Systems change over time, meaning that events change as well in order to facilitate the new behaviors of the system. Changing events without impacting other microservices can prove difficult, especially at scale, when teams don't keep track of who consumes the events they produce. When events are persisted and replayed, the system also needs to be able to parse all historic versions of an event, which can cause massive complexity over time.
  • Synchronous flows: Due to the asynchronous nature of events, synchronous behaviors, where the result of an operation needs to be returned within the same request-reply scope, are not where this architecture shines. My recommendation is to facilitate these operations with REST, which is a much better pattern for synchronous request-reply. Your event-driven architecture is allowed to have REST as well.
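One common defense against duplicate consumption, as mentioned in the list above, is making consumers idempotent. Here's a minimal sketch, assuming hypothetical event names and an in-memory set standing in for durable deduplication storage:

```python
processed_event_ids = set()  # in production this would be durable storage

def credit_account(account, amount):
    print(f"crediting {amount} to {account}")

def handle_payment_received(event):
    """An idempotent consumer: safe under at-least-once delivery."""
    if event["event_id"] in processed_event_ids:
        # Duplicate delivery (e.g. after a lost acknowledgement):
        # acknowledge again, but don't apply the effect twice.
        return True
    credit_account(event["account"], event["amount"])
    processed_event_ids.add(event["event_id"])
    return True  # acknowledge successful processing

# The broker redelivers the same event after a lost ack; the second
# delivery is detected and skipped, so the account is credited once.
event = {"event_id": "evt-001", "account": "a1", "amount": 25}
handle_payment_received(event)
handle_payment_received(event)  # duplicate, ignored
```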

There's one final challenge left to discuss: the event broker. It is the most important part of an event-driven system, the single component that connects all others. If the broker fails, the whole system grinds to a halt. It must be maintained and operated with the utmost care, which requires specialist knowledge that is in most cases quite scarce. Using a hosted solution is therefore recommended if this knowledge is not available in the organization.

Join me for the rest of the series

We've scratched the surface of event-driven architectures, providing an overview of their benefits and challenges compared to competing microservice architectures. While there are definite complexity increases and pitfalls to avoid, the benefits this architecture brings are massive.

In subsequent blogs of 'The Engineer's Guide to Event-Driven Architectures' series I will explore this architecture in much more detail, sharing experiences, learnings and best practices to deal with the complexity and avoid the pitfalls.

The blogs will be linked here as soon as they are published, with a short description of their content.
