Event-Driven Systems: Lessons from the Trenches

Siddharth · Sid's Tech Cafe · Feb 18, 2024

There is no dearth of reading material on Event-Driven Architectures (EDA) in the context of DDD and microservices: some great blogs by Martin Fowler and Mathias Verraes, plus some brilliant books, videos, courses, etc. (No affiliation with any of these.)

The idea for this post arose from a discussion in an internal ThoughtWorks technical chat thread on EDAs. It made me think hard about the mistakes I have made in designing such systems.

However, just like any other technical concept, there are two phases of learning: one is reading and the other is building. Both are critical to the process of learning and doing.

Digression: I have realized that "learning" as an activity unto itself is sometimes an escape hatch I use to avoid the activities that cause Resistance, i.e., tasks that matter and need action but are perceived as psychologically painful. I end up wasting time watching technical videos and reading blogs that make me feel productive, but in fact are not.

I was fortunate to get an opportunity to implement an EDA on one of my projects: a large-scale, high-traffic system with critical business data and stringent SLAs. In this post I'll skip the "Why" of adopting an asynchronous design and instead share the "What", "How" and "What Not" from that experience.

This is a subjective post and I expect (and hope for) some disagreements from the readers.

Source: https://blog.ippon.tech/event-driven-architecture-getting-started-with-kafka-part-2
  1. Consider using Event Schemas so that schema evolution is handled in a versioned way (a minimal versioning sketch follows this list). On the other hand, skip schemas when you only have a handful of command/event types.
  2. Consider Event Registries in advanced cases (e.g. when event types run into the hundreds and the number of consumers is very high). The highly decoupled nature of event-driven systems can reduce traceability of dependencies and of the event data-flow graph; registries act as bookkeepers of publisher/consumer contracts. But as with point 1 above, they are overkill for less complex systems.
  3. Consider idempotency conditions, since Exactly Once semantics are not reliable. Kafka fans will tout its Exactly Once guarantees, but for critical processes in financial or logistical systems, idempotency is the responsibility of the consumer (an idempotency sketch follows this list). The CEO won't be thrilled if double spending is commonplace, or if the same customer order is shipped twice.
  4. Consider the side effects of eventual consistency across systems. In more complex cases the systems need to behave transactionally and/or atomically; the Saga Pattern or the Outbox Pattern can help here (an Outbox sketch follows this list). For non-trivial, multi-step transactionality, a central orchestrator with an async workflow is a better fit (e.g. a complex tax workflow that combines automated, manual and batch operations). This opens the Pandora's box of BPM tools. Use them at your own risk.
  5. Consider the eventuality of replaying events or commands. Sometimes systems reject events while sending incorrect acknowledgments, and those events may have to be re-sent. An event store that persists the event data along with whether processing succeeded can be useful; most message brokers have this functionality built in.
  6. Consider the pull vs push model of event handling (e.g. Kafka is pull-based, while GCP Pub/Sub provides both). Pull-based systems give you more control and easy back-pressure configuration, since the consumer dictates the rate at which it pulls messages, but they couple the consumer to the broker and place much of the onus on the consumer (a pull-loop sketch follows this list).
    Push-based brokers are more decoupled from consumers, which only need to expose an API to the broker. But push-based systems need rate limiting for effective back pressure; this adds complexity, and the resulting 429 responses may be misread as errors.
    In my experience, the push-based model has been more reliable and robust.
  7. Consider retrying with exponential back-offs and configuring Dead Letter Topics/Queues (DLQs) as a default practice, along with intelligent retry jobs that drain the DLQ (a back-off sketch follows this list).
    This is a no-brainer in any non-trivial EDA system, especially one with high throughput.
  8. Consider business-facing reporting on the number of events received vs consumed vs failed. Most brokers provide this out of the box, along with alerting rules. In many cases human intervention is imperative, and such reports are vital for making those decisions. It saved our team's time when we could present hard numbers to executives.
  9. Explore the AsyncAPI standard for event-driven APIs (https://www.asyncapi.com). Although I haven't used it hands-on, it looks promising, with support for generating specifications (like OpenAPI), code generation, documenting events, governance, etc.
  10. Deliberate carefully on the contents of your event payload. It's a Type-1 (hard-to-reverse) decision and has caused me a great deal of pain. Balance API chattiness against bulky events (a payload sketch follows this list).
  11. And lastly, YAGNI. Consider lightweight queuing mechanisms instead of Kafka or other heavyweight brokers; you likely don't operate at LinkedIn scale. Postgres, Redis, RabbitMQ, etc. can act as suitable queues for many use cases (a Postgres-as-a-queue sketch follows this list; see also: Postgres as a Queue, Redis Pub/Sub).
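
To make point 1 concrete, here is a minimal Python sketch of version-tolerant event parsing. The `schema_version` envelope field and the field rename are my own illustration, not any particular registry's convention.

```python
import json
from dataclasses import dataclass

@dataclass
class OrderCreated:
    order_id: str
    amount: float

def parse_order_created(raw: bytes) -> OrderCreated:
    doc = json.loads(raw)
    version = doc.get("schema_version", 1)  # missing field => oldest version
    if version == 1:
        # v1 carried the amount in a field named "total".
        return OrderCreated(order_id=doc["order_id"], amount=doc["total"])
    if version == 2:
        # v2 renamed "total" to "amount"; consumers branch on the version
        # carried inside the payload, so both generations keep working.
        return OrderCreated(order_id=doc["order_id"], amount=doc["amount"])
    raise ValueError(f"unsupported schema_version: {version}")
```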
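
For point 3, a minimal sketch of consumer-side idempotency. The in-memory set stands in for a durable store; in a real system the dedup check and the business write should share one transaction, and `event_id` is an assumed unique field on the event.

```python
# Durable store of already-processed event IDs (a set here for brevity).
processed_ids: set[str] = set()

def charge_account(account: str, amount: float) -> None:
    print(f"charged {account}: {amount}")

def handle_payment(event: dict) -> None:
    event_id = event["event_id"]  # assumed unique per event
    if event_id in processed_ids:
        return  # duplicate delivery: skip, don't double-spend
    charge_account(event["account"], event["amount"])
    processed_ids.add(event_id)
```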
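
For point 4, a minimal sketch of the Outbox pattern, using sqlite3 purely as a stand-in for your OLTP database: the business write and the outgoing event commit in the same local transaction, and a separate relay publishes pending outbox rows to the broker.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, amount REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: str, amount: float) -> None:
    with db:  # one transaction: both rows commit, or neither does
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"type": "OrderCreated", "order_id": order_id}),))

def relay_once(publish) -> None:
    # The relay polls unpublished rows and marks them once the broker accepts
    # them; duplicates remain possible, hence idempotent consumers (point 3).
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(payload)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order("o-1", 42.0)
relay_once(print)
```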
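
For point 6, a sketch of a pull-based consumer loop. `broker.pull`, `broker.ack` and the message shape are hypothetical stand-ins (Kafka's poll/commit or Pub/Sub's pull/acknowledge play the same roles); the point is that the consumer sets the batch size and pace, which is where back pressure comes from.

```python
import time

def consume_loop(broker, handle, batch_size: int = 50) -> None:
    while True:
        # Consumer-dictated rate: we ask for at most batch_size messages.
        messages = broker.pull(max_messages=batch_size)
        if not messages:
            time.sleep(1)  # idle back-off instead of hot-spinning
            continue
        for msg in messages:
            handle(msg)
        # Acknowledge only after processing, so a crash means redelivery
        # rather than data loss (again, idempotent consumers help).
        broker.ack([m.id for m in messages])
```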
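
For point 7, a sketch of retry-with-exponential-back-off plus a dead-letter hand-off. `publish_to_dlq` is a placeholder for whatever your broker offers (a Kafka dead-letter topic, a Pub/Sub or SQS dead-letter queue, etc.).

```python
import random
import time

def process_with_retries(event, handle, publish_to_dlq,
                         max_attempts: int = 5, base_delay: float = 0.5) -> None:
    for attempt in range(max_attempts):
        try:
            handle(event)
            return
        except Exception:
            if attempt == max_attempts - 1:
                break
            # Exponential back-off with jitter to avoid thundering herds:
            # delay grows as base * 2^attempt, scaled by 0.5x-1.5x.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)
    publish_to_dlq(event)  # park it for the intelligent retry job
```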
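
For point 10, the two extremes of the payload trade-off, with purely illustrative fields:

```python
# Thin "notification" event: small payload, but every consumer must call
# back into the producer's API for details -> chatty.
thin_event = {
    "type": "OrderUpdated",
    "order_id": "o-1",
}

# Fat "event-carried state" payload: no callbacks needed, but messages are
# bulky and the wider contract is harder to evolve (see point 1).
fat_event = {
    "type": "OrderUpdated",
    "order_id": "o-1",
    "status": "SHIPPED",
    "items": [{"sku": "abc", "qty": 2}],
    "shipping_address": {"city": "Pune", "country": "IN"},
}
```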
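
And for point 11, a sketch of Postgres as a queue via SELECT ... FOR UPDATE SKIP LOCKED, so competing workers never claim the same job. It assumes a jobs(id, payload, done) table and the psycopg2 driver; the DSN is a placeholder.

```python
import psycopg2

def claim_and_process(dsn: str, handle) -> bool:
    """Claim one pending job, process it, mark it done. Returns False
    when the queue is drained."""
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:  # one transaction per job
            # SKIP LOCKED makes concurrent workers skip rows that another
            # transaction has already locked, instead of blocking on them.
            cur.execute("""
                SELECT id, payload FROM jobs
                WHERE done = false
                ORDER BY id
                FOR UPDATE SKIP LOCKED
                LIMIT 1
            """)
            row = cur.fetchone()
            if row is None:
                return False
            job_id, payload = row
            handle(payload)
            cur.execute("UPDATE jobs SET done = true WHERE id = %s", (job_id,))
            return True
    finally:
        conn.close()
```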

These are the lessons I could recollect from my experience. I am sure there are many more Whats and What-Nots. Hope this helps!
