Event-Driven Architecture and the Outbox Pattern

Rod Shokrian
Engineering @Varo
5 min read · Oct 1, 2019

In an ideal world, a modern application would look something like this: a collection of stand-alone microservices, each isolated from the others with a unique set of messages and operations. They would have discrete codebases, an independent release schedule, and no overlapping dependencies. In reality, services need to communicate with one another, the coupling between them grows ever tighter, and this original sin of data exchange begets the race conditions, data inconsistencies, and a thousand other problems that cause so many hours of lost sleep.


Database updates and interservice communication must be atomic to avoid data inconsistencies and bugs. However, distributed transactions that span multiple services are complex, slow, and unreliable, especially when the services run on different machines or even in different data centers. “Eventually consistent” architecture patterns built around Change Data Capture (CDC) are an effective alternative, but many implementations tie message contracts to the producing service’s database schema. A transactional outbox is a design pattern that solves this problem by ensuring reliable data exchange while decoupling database schemas from message contracts.

  • Synchronicity

Data exchange is a perennial issue in designing and building microservices. The naive approach, which most of us adopt in our first forays into the world of microservices, is synchronous communication. Services invoke services through synchronous APIs, whether that be REST, gRPC, or something else. However, the amount of coupling this approach engenders is inevitably problematic. Service failure is a matter of when, not if, and an architecture that relies on synchronous service-to-service communication will suffer cascading failures. Buffering and retry logic can mitigate this issue somewhat, but an app built on synchronous communication will ultimately be impeded by issues of scalability and reliability.

  • Asynchronous Data Exchange

This problem can be handled by approaching data exchange asynchronously. By placing messages on an event queue like Kafka, consuming services have the autonomy to process requests at their own pace. At Varo, we decided early on to make an event-driven architecture and asynchronous data exchange core pillars of our app design. Once we made that decision, the next question was how to safely and consistently ensure the propagation of data across service boundaries.

I’ve mentioned Kafka as one piece of the puzzle, but alone it cannot ensure reliable data exchange. Consider an example: a microservice needs to write a record to its database and also propagate that data to other services. We would need one shared transaction across both actions to guarantee no possible inconsistencies, but Kafka does not support distributed transactions with database writes. We could persist the record to the database but fail to publish the message to a Kafka topic, say, because of a networking issue. Alternatively, we could publish the event while failing to persist the record to the local database.

Both lead to unpleasant inconsistencies across services, so we need a way to mimic the guarantees of a shared transaction. The solution is to make one of these processes depend on the other. Let’s first consider writing to Kafka. By gating the database write behind event publishing, we can ensure that the transaction will be rolled back if we encounter difficulty in publishing the event. This is an improvement in terms of data consistency but creates a dependency for our producing service on Kafka. While this guarantees eventual delivery, the service now needs to consume the Kafka event to access its own data and cannot, for instance, synchronously query the data it has just written. By relying on database transactions to drive event propagation through our event bus, however, we can ensure guaranteed delivery to consumers while giving producing services instant “read your own writes” capability.

  • The Outbox Pattern

Enter the aptly named Outbox pattern. Like the literal trays on office desks that once held outgoing letters and documents, this pattern describes writing messages to an outbox that ensures reliable, at-least-once delivery to an event bus and other services. It involves two steps: first, the message is committed to an outbox table (in the context of a relational database like Postgres) within the same transaction as the business write; second, a separate process polls the outbox for changes and publishes messages based on them.

  • Outbox Architecture

Here are the broad strokes of implementing the Outbox pattern. It’s important to note that the pattern is largely technology agnostic and can be implemented with a different database or event bus.

Let’s take a closer look at the outbox table:
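As a rough sketch (not Varo’s exact schema), an outbox table in Postgres might look something like the following; the column types, the created_at column, and the table name are illustrative assumptions:

    -- Illustrative outbox table; types and the created_at column are assumptions
    CREATE TABLE outbox (
        id             UUID         PRIMARY KEY,            -- unique identifier for the event
        topic          VARCHAR(255) NOT NULL,               -- Kafka topic the event is routed to
        aggregate_type VARCHAR(255) NOT NULL,               -- kind of event, e.g. 'AccountCreated'
        aggregate_id   VARCHAR(255) NOT NULL,               -- identifier of the affected entity, used for idempotency
        payload        JSONB        NOT NULL,               -- event body, independent of the source table's schema
        created_at     TIMESTAMPTZ  NOT NULL DEFAULT now()  -- when the event was recorded
    );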

The payload and ID fields are self-explanatory. What’s of interest here are Topic, AggregateType, and AggregateId. Kafka topics are used to categorize events and route them to the appropriate consumers. AggregateIds are essential for ensuring idempotency, and AggregateType describes the specific type of event published to a topic and can help a consumer deserialize the event.

Note the decoupling of the outbox and account tables. Our event payloads do not necessarily need to reflect the account model, freeing us to design contracts for our event bus independent of the producing service.
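To make the first step of the pattern concrete, here is a hedged sketch of a producing service committing a business row and its outbox row in a single database transaction; the account table, its columns, and the payload shape are hypothetical:

    BEGIN;

    -- 1. Apply the business change to the service's own table (columns are hypothetical)
    INSERT INTO account (id, customer_id, status)
    VALUES ('acct-123', 'cust-456', 'OPEN');

    -- 2. Record the corresponding event in the outbox within the same transaction,
    --    so both writes commit or roll back together
    INSERT INTO outbox (id, topic, aggregate_type, aggregate_id, payload)
    VALUES (
        gen_random_uuid(),     -- built in from Postgres 13; provided by the pgcrypto extension before that
        'account-events',
        'AccountCreated',
        'acct-123',
        '{"accountId": "acct-123", "status": "OPEN"}'
    );

    COMMIT;

One common refinement is to delete the outbox row immediately after inserting it, even within the same transaction, since log-based CDC tools like Debezium read changes from the write-ahead log rather than from the table itself; this keeps the outbox from growing without bound.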

I mentioned monitoring data changes in the outbox, so let’s not gloss over that. A critical component of effectively implementing this pattern is setting up appropriate tooling around CDC. At Varo, we use Debezium connectors for continuous, low-latency CDC, but there are a variety of ways to accomplish this. An essential part of implementing the Outbox pattern is evaluating CDC tooling against your own needs, which is out of scope for this article.

Another reason why Debezium is a robust connector is that it emits events in a complex “envelope” that can include additional event metadata. Kafka Connect expects to receive this data flattened, so we need to apply some configuration to how the connector handles these events:
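As a rough example, and assuming the Debezium Postgres connector, the configuration below applies Debezium’s ExtractNewRecordState single message transform (often called “unwrap”) to flatten the change-event envelope down to the plain row data. The connection details are placeholders, and exact property names vary somewhat between Debezium versions:

    {
      "name": "outbox-connector",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.internal.example",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "accounts",
        "database.server.name": "accounts-db",
        "table.include.list": "public.outbox",
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
        "transforms.unwrap.drop.tombstones": "false"
      }
    }

Debezium also ships a purpose-built outbox event router SMT (io.debezium.transforms.outbox.EventRouter) that can route each outbox row to a topic based on its columns, which is worth evaluating as an alternative to plain flattening.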

With an event bus like Kafka in the middle of the flow, we can enforce some degree of decoupling between services. Individual services can fail or be taken offline, only to process events at a later date. New services can be spun up and, with the right retention policies in place, build up state by working through the event history rather than through time-consuming backfill migrations. It’s good practice to treat events published by an outbox as part of the service’s API. That is, their structure should be changed minimally and with consideration for compatibility. By the same token, consuming services should be flexible in handling messages. While this only guarantees eventually consistent data between services, it’s not as glaring an issue as it might seem. The latency for these kinds of processes typically works out to a few seconds at most and rarely has an impact on the user experience.

Event-driven architectures for microservices are popular with good reason. The Outbox pattern can be a powerful tool to create flexibility and resiliency in an event bus and improve interservice communication. Hopefully, this post has inspired you to try something new. Let us know your thoughts and what experience you’ve had with it!
