The Case for Database-First Pipelines

Quite a few data architectures involve both a database and Apache Kafka — usually Kafka is used for event-driven apps, streams processing and as a big buffer between systems. The database is used… well, as a database. For analytics, OLTP apps, legacy apps and what have you.

And there is always the question of the order of integration. There are basically 3 options:

Nice view from a hill next to my house. Long walks are best for thinking about architectures.
  • Write to the database first and from the database use change capture (CDC) system like Debezium to write to Kafka.
  • Write to Kafka first and use Connect to write from Kafka to the DB.
  • Write to both.

The third option tends to be terrible because of all the failure scenarios. Typically the database and Kafka diverge within days of launch, leaving everyone in a very painful position.

I usually tell people that if they want to have a successful first Kafka project, they need to get some data for their event-driven app or stream processing jobs, and since databases already have data, CDC is a great place to get started.

But I always said that eventually — when you gain more confidence and experience with Kafka, when you have enough credibility to get other people to rewrite some apps… Eventually you’ll want to write to Kafka first and from Kafka to the DB (if at all).

There is a good reason for that recommendation: Writing from Kafka to a DB is much easier than going from DB to Kafka. Kafka was written as a data pipe and the DB wasn’t.

But I recently ran into a use-case where the long term recommendation is to write to the database first. And I thought it is a good pattern and others may want to know too.

Here’s the (heavily modified) use-case: You are an online store. If you ever used Instacart, imagine that. Users can create an “order” and add items to an order. They can then schedule the order or cancel the order. If the order was canceled, you can’t add items to it. Even if you are logged in on another device that has the order open. Users need to get an error.

The traditional implementation involves a database with referential integrity constraints, also known as foreign keys. In fact, orders and items is the classic example for these constraints. The order table will be the parent, the items table will be the child, and if you try to insert into the child table when the parent doesn’t exist — you’ll get an error. Show the error to the user and you are done.

Note that the DB does quite a lot for you: it enforces serializability, locks, your logical constraints, etc. If the DB is distributed (Vitesse, Cockroach, Spanner, Yugabyte), it does even more.

If you were to go Kafka-first… well, it isn’t impossible. But all those responsibilities now belong to you as a developer. And if you are thinking there may be multiple webservers handling user requests and passing them to Kafka, you have to solve fairly challenging problems.

In the distributed webservers scenario, the best you can do is use stream processing or a database downstream to validate the consistency of the data (consistency in the original ACID meaning), and send the “actually we can’t deliver this item because you cancelled your order” message later. Not a great user experience. If you have a single webserver, you can cache the open orders or something and implement your own validation.

Putting the database first and using CDC to send events to Kafka, and using Kafka to notify other applications (shipping, loyalty points) about the event simplifies the design considerably.

So, next time you have logical constraints on your data, think whether putting the database first will make things simpler. And if you can come up with an even simpler solution — let me know, I always enjoy better architectures!