Why we dropped event sourcing with Kafka Streams when given a second chance

Mateusz Jadczyk
DNA Technology
6 min read · Mar 23, 2022


The software architecture of IT systems keeps evolving every day. But what if you could pause for a couple of weeks, reflect on the choices you've made so far, and then refactor the system?

However rare it may seem, we were fortunate enough to get this kind of opportunity. We decided to use it (hopefully) wisely and, as a result, agreed that event sourcing might not be the best solution in our case. So we switched from event sourcing based on Kafka Streams to an event-driven system with a database as the data store.

After a short introduction to our product, we'd like to list exactly what was changed, then share our reasoning and the gains from the new approach so that you can draw on them when choosing an architecture for your next system. We'll wrap up with some trade-offs and conclusions.

This article assumes you know the basics of both event sourcing and other event-driven solutions. If you don't, it's better to start with this talk. If you'd like to learn more about Kafka or Kafka Streams, take a look at my first article — Event Sourcing with Kafka Streams.

Background

Due to various circumstances, we were given a chance to reuse a system we had previously developed for a similar purpose but in a different organisational context.

A different context also meant new requirements, so some adjustments were needed. We figured that the time-box we were given for adaptations (before going live) was the only opportunity we would get to make modifications potentially bigger than expected. At the same time, we wanted to save everyone from future trouble caused by an ill-fitting architecture. We also needed good arguments to justify the extra time.

The Product

Without revealing too much, I can share that we're talking about a FinTech product constituting a part of a larger system. Its main purpose is to track users' money flows and handle bank transfers and other financial operations.

Before the makeover, the software utilised event sourcing and used Kafka as the Source of Truth. We heavily used the Kafka Streams library to process, aggregate and store data. The stack was based on:

  • Spring Boot
  • Kafka Streams, Spring Kafka
  • Spring Web, Data, Security
  • PostgreSQL (just for query-only projections)
  • Modular Monolith architecture
  • AWS
  • Protobuf

If you’re interested in details, here’s the article describing how we implemented event sourcing using Kafka Streams.

What we actually changed

When I mentioned that during the adjustments "we switched from event sourcing based on Kafka Streams to an event-driven system with a database as the data store", what does that mean in terms of architecture and technologies?

  • Previously, Kafka was our Source of Truth. All the messages representing changes happening in the system (events) were stored in Kafka indefinitely.
    Now, a good old database stores the current state of entities, and Kafka just helps us propagate information after a change has occurred. We treat Kafka more like a message broker, so the messages are ephemeral (see the sketch after this list).
  • We introduced synchronous REST APIs; previously everything was asynchronous via Kafka
  • We dropped the Kafka Streams library and are using standard consumers and producers, which are sufficient for now
  • We switched from Protobuf to Avro
  • Not a change at all, but we decided to keep implementing the Event-Carried State Transfer pattern (here’s a great talk)
  • For a single use-case, we'll be able to simulate event sourcing: we store the crucial messages in the database and can dump them onto Kafka in case we need to replay them all
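
To make the new shape of the system more concrete, here's a minimal sketch of how a change can be handled now. All names (TransferService, Transfer, TransferRepository, TransferEvent, the transfer-events topic) are illustrative assumptions, not our actual code; the point is the combination of "database as the source of truth, plain producer, event carrying the state":

```java
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical names throughout; this is a sketch of the pattern, not our implementation.
@Service
public class TransferService {

    private final TransferRepository repository;               // Spring Data repository backed by PostgreSQL
    private final KafkaTemplate<String, TransferEvent> kafka;  // plain producer, no Kafka Streams

    public TransferService(TransferRepository repository,
                           KafkaTemplate<String, TransferEvent> kafka) {
        this.repository = repository;
        this.kafka = kafka;
    }

    @Transactional
    public Transfer registerTransfer(Transfer transfer) {
        // The database is now the source of truth for the current state.
        Transfer saved = repository.save(transfer);

        // Kafka only propagates the change. The event carries the full current state
        // (Event-Carried State Transfer), so consumers don't have to call us back.
        kafka.send("transfer-events", saved.getId(), TransferEvent.from(saved));
        return saved;
    }
}
```

Note that writing to the database and publishing to Kafka are two separate operations here; that is exactly the consistency trade-off discussed later, which the Outbox Pattern addresses.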

Rationale

As you can imagine, implementing these changes was non-trivial. However, we concluded it would be beneficial in the long run, both for us as the developers of the system and for the business.

There are multiple reasons for each of the aforementioned changes. Some were forced by new constraints, some were our own decisions. Let me walk you through how we arrived at these particular outcomes.

Scale

The new expectation is to reach a bigger scale sooner than originally assumed. This would mean less time to learn to live with and operate an event-sourced system.

Consequently, the number of events would eventually be much larger, which in turn would mean longer replay times to rebuild the state from events. This could be mitigated by increasing the number of partitions, but changing the number of partitions of an existing topic is not that easy.

Synchronous APIs

A new requirement is to have synchronous REST APIs for some functionalities. As the system was previously fully asynchronous, we'd have to hide the asynchronicity at the REST API level anyway. Again, doable, but it would introduce a lot of unnecessary complexity and potentially take a lot of time to make the mechanism reliable.
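
For illustration, here's a rough sketch of what "hiding" the asynchronicity could look like; all names (TransferController, the transfer-commands and transfer-results topics) are hypothetical and this is a simplification, not what we built. The HTTP thread publishes a command and blocks on a future, which a Kafka listener completes once the matching result event arrives:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of wrapping an asynchronous Kafka flow in a synchronous REST call.
@RestController
public class TransferController {

    private final KafkaTemplate<String, String> kafka;
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    public TransferController(KafkaTemplate<String, String> kafka) {
        this.kafka = kafka;
    }

    @PostMapping("/transfers")
    public String createTransfer(@RequestBody String request) throws Exception {
        String correlationId = UUID.randomUUID().toString();
        CompletableFuture<String> result = new CompletableFuture<>();
        pending.put(correlationId, result);

        kafka.send("transfer-commands", correlationId, request);
        try {
            // The HTTP thread blocks until the asynchronous result shows up or the call times out.
            // Timeouts, retries and lost responses all have to be handled explicitly.
            return result.get(5, TimeUnit.SECONDS);
        } finally {
            pending.remove(correlationId);
        }
    }

    @KafkaListener(topics = "transfer-results")
    public void onResult(ConsumerRecord<String, String> record) {
        CompletableFuture<String> future = pending.remove(record.key());
        if (future != null) {
            future.complete(record.value());
        }
    }
}
```

On top of the timeout and retry handling, the result event has to reach the very instance holding the pending future, which adds even more complexity once you run multiple instances.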

Takeaway: event sourcing is not a good fit for all kinds of problems. You should be careful which domains you try to apply it to. The same goes for asynchronicity.

Backup

Backing up data stored in a database is much easier than backing up Kafka messages and offsets. Many battle-tested solutions are readily available.

Kafka Cluster

Previously, we used AWS MSK for hosting Kafka. Debugging options were practically non-existent and support was relatively slow. Now we use Kafka as a Service as part of our platform and have a dedicated internal team supporting it. However, they weren't too happy about storing our messages indefinitely (the default retention they recommend is under a month). After our experience with MSK, we were more inclined to use the platform-wide cluster. It also lets us share data with other services (otherwise we'd potentially need to connect to multiple Kafka clusters).

Zero-downtime

We had some trouble achieving zero-downtime deployments when using Kafka Streams (it might have improved in newer versions). REST + DB makes this very easy.

Onboarding

Event sourcing is difficult (especially when combined with Kafka Streams). It's difficult to learn the concept well and difficult to explain to others when it comes to implementation. For the sake of our new team members, but also other engineers, a simpler approach seems wiser.

Tooling

We'd be the only team developing an event-sourced system. That would mean even more on our plate in terms of the necessary tooling and techniques, and we're a small team.

Debugging

We learnt that it's harder to reason about event-sourced systems. During a crisis, it may take longer to find out what happened and to recover a service.

GDPR

Anonymising data in a database is less painful than handling GDPR requirements in connection with everlasting Kafka messages.

Trade-offs

When it comes to software engineering, everything is a trade-off. Moving away from event sourcing introduces other challenges.

Consistency

By storing all the data only in Kafka, we were able to use Kafka Transactions to achieve exactly-once semantics and therefore consistency. Now that we store the results of changes in the database and need to publish events about those changes to Kafka, we can't do both in a single transaction. We're thinking of using the Outbox Pattern to work around it (read this piece from my colleague for a brief overview).
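
Here's a minimal sketch of the idea, assuming hypothetical names (OutboxRelay, OutboxEntry, OutboxRepository, the transfer-events topic): the business code inserts an outbox row in the same database transaction as the state change, and a scheduled relay then publishes the committed rows to Kafka.

```java
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical relay; OutboxEntry rows were written in the same DB transaction
// as the state change they describe.
@Component
public class OutboxRelay {

    private final OutboxRepository outbox;
    private final KafkaTemplate<String, String> kafka;

    public OutboxRelay(OutboxRepository outbox, KafkaTemplate<String, String> kafka) {
        this.outbox = outbox;
        this.kafka = kafka;
    }

    @Scheduled(fixedDelay = 500)  // requires @EnableScheduling on a configuration class
    @Transactional
    public void publishPending() {
        for (OutboxEntry entry : outbox.findAllByPublishedFalseOrderByCreatedAt()) {
            // Only committed changes ever end up here, so no "ghost" events are published.
            kafka.send("transfer-events", entry.getAggregateId(), entry.getPayload());
            entry.markPublished();  // flushed by JPA dirty checking when the transaction commits
        }
    }
}
```

This gives at-least-once delivery (a crash between sending and marking the row as published leads to a re-send), so consumers should be idempotent.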

Audit trail

Because we stored all the events indefinitely, we had a complete history of what happened in the system at hand. As a consequence, there was no need for audit trail tables or other audit logs, since no information was ever lost. Now we need to be extra careful not to lose track of important actions happening in the system, as some of them might only leave a trace as ephemeral Kafka messages; the database stores only the current state. Fortunately, there are proven ways around it.
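
One such workaround, sketched with a purely illustrative entity, is to persist an explicit audit record in the same database transaction as the state change, since the events themselves are no longer kept:

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.Table;
import java.time.Instant;

// Illustrative audit entity, saved alongside the state change in one transaction.
@Entity
@Table(name = "audit_log")
public class AuditLogEntry {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String aggregateId;                 // e.g. the transfer id
    private String action;                      // e.g. "TRANSFER_REGISTERED"
    private String details;                     // serialized snapshot of what changed
    private Instant occurredAt = Instant.now();

    protected AuditLogEntry() {
        // required by JPA
    }

    public AuditLogEntry(String aggregateId, String action, String details) {
        this.aggregateId = aggregateId;
        this.action = action;
        this.details = details;
    }
}
```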

New services, historical data

Introducing new services that needed historical data was previously straightforward: they could just consume events from the earliest offset and replay the history. With the new event-driven solution, such cases will be harder to handle.
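
For comparison, this is roughly all it took before: a new service could point a plain consumer at the topic with auto.offset.reset=earliest and rebuild its state from the full history (topic and group names are illustrative).

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

// Sketch of replaying a topic's history in a brand new service.
public class HistoryReplayer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "new-service-replay");
        // A new consumer group starts from the earliest offset,
        // so the whole retained history is replayed.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("transfer-events"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Rebuild local state / projections from each historical event.
                    System.out.printf("Replaying %s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```

Of course, this only works as long as retention keeps the messages around, which is precisely what changes with the new setup.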

Developer excitement

It's always fun to try out a new approach and develop your skills. But as engineers we also need to be pragmatic and constantly re-evaluate our choices. The more traditional solution centred around storing data in a database may seem boring, but it lets us spend more time on business features and less on technical difficulties.

Conclusions

We're not saying that event sourcing is always a bad choice. It's a truly powerful concept. However, you should be wary of the complexity it may add to your project. Our take is that you shouldn't use it to solve all types of problems, but rather apply it to selected business areas (which probably holds for all kinds of technologies and techniques). Kafka Streams might have steepened the learning curve even further for us, although we still believe it can be a great solution for data streaming with Kafka and Java.

Finally — never be afraid to challenge your choices regarding architecture! Some early changes and refactoring, even if they seem a bit costly, might save you lots of time and stress in the future.

Mateusz Jadczyk
DNA Technology

An open-minded full-stack software engineer building products, not just writing code. 👨‍💻 Software Engineer at DNA Technology (dnatechnology.io)