From databases to topics — an opportunity to reinvent

David Navalho
Marionete
Jan 22, 2021

It’s not uncommon to start a business — any business — with a database. We need to keep our data somewhere after all.

On the first day, data is kept in a couple of databases. Business goes well and grows organically. A new product is launched, and a couple of extra databases are deployed. A few extra columns are added here and there. Maybe we want to start analysing data across multiple tables in multiple databases… and a new database is generated just over there.

Some years go by, and the business keeps growing organically, as does the data. But over time, no one really knows what each column represents anymore. Is that the master table? Is it the other one? Which one do I update? If I update the master table, will the data I need show up in the one I’m consuming? Who owns it? Who owns the other one?

This is indeed simple: I’ve seen drawings spanning multiple duct-taped A3 pages, representing only one business unit!

The above is an over-simplification of a typical scenario. As additional services and databases are added, complexity easily grows exponentially.

Kafka to the rescue?

Log-based solutions for enabling streaming have been exploding in popularity for a while now. They are often taken as an opportunity to both provide new capabilities (streaming) and maybe, just maybe, get some control back over the data.

We start by defining what we want, maybe by identifying a use case that will provide some added business value. Armed with new concepts such as Service Meshes, we ensure services do not communicate directly, but instead through Kafka. And we can use powerful new tools such as NoSQL databases and Kafka Streams, or maybe dump data into a distributed file system for Spark analysis. The future looks bright indeed!
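To make this concrete, here is a minimal sketch of the kind of streaming logic this unlocks, using Kafka Streams. The `orders` and `large-orders` topic names, and the idea of filtering on a flag in the message payload, are purely illustrative assumptions:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class LargeOrderFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "large-order-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read every order event, keep only the large ones, publish them to a new topic.
        // (The "amount_large" flag in the JSON payload is a made-up example.)
        KStream<String, String> orders = builder.stream("orders");
        orders.filter((key, value) -> value.contains("\"amount_large\":true"))
              .to("large-orders");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

A few lines of topology, and a new derived stream exists without touching the service that produces the original events.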

It’s not uncommon to look at Kafka (or your flavour of log-based solution) as the saviour from the organic chaos that was initially generated. The image below is often envisioned: how simple does that look?

New Services, decoupled from each other, with a simplified view over the data

So we start with the new project. On the first day, some data is generated, or brought over from a database, into a set of new topics. The tests go great, it scales easily, and we now have a new streaming use case! A new use case is deployed. Maybe a couple of databases are migrated into Kafka. A new topic is generated over there… wait, haven’t we been here before?

Yes, yes we have. It’s true that Kafka can help streamline and control data, but without proper constraints, migration plans and new methodologies, we are now maintaining both the old and the new, especially during the migration phase. And if we are not careful, we will be left with exactly the same questions we had before! Is that the master topic? Is it the other topic? Which topic do I read data from? Who owns it? Who owns the other one? If I add data to a topic, will the downstream topics get it as well?

So…just the streaming bit? Not quite!

So what should we do? What’s the solution here? Should we only tackle projects that need streaming? No, that’s not the solution. The solution is to take advantage of the tools, but to use them wisely: avoid the mistakes of the past, and get some control back.

Kafka can be the key to unlocking the potential of your company, and also to ensuring proper standards are followed. It’s not a far stretch to envision this: Kafka enables some really good Data Mesh patterns. I heartily recommend this talk by Gwen Shapira on Kafka and Service Mesh: https://www.youtube.com/watch?v=Fi292CqOm8A.

By simply stating that new applications cannot communicate directly, but must instead go through Kafka, you decouple services from each other and allow other teams to actually take advantage of the data being generated by your services. You also now have a critical path everyone must go through, and by controlling the critical path, you control the message, the quality and the architecture.
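As a sketch of what “communicate through Kafka” means in practice: the producing service below publishes to a hypothetical `customer-events` topic and knows nothing about its consumers, while any number of consumer groups can read the same log independently. The topic and group names are assumptions for illustration:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DecoupledServices {
    // The producing service publishes events without knowing who consumes them.
    static void publishCustomerEvent(String customerId, String payload) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("customer-events", customerId, payload));
        }
    }

    // Any number of downstream services can subscribe independently;
    // each consumer group keeps its own position in the same log.
    static void consumeCustomerEvents(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-events"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("%s processed %s -> %s%n", groupId, record.key(), record.value());
                }
            }
        }
    }
}
```

Adding a second consumer (a fraud check, an analytics feed) means adding a new consumer group, not changing the producer.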

This is of course not enough. We are still asking questions about how to control data. No, really: what is the meaning of that “id” field? Because I know I’ve seen a field with the same name on 3 other tables, but they look wildly different! What’s my lineage? What does my data look like?
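One concrete way to answer the “id” question is to attach a documented schema to every topic, for example with Avro (often paired with a schema registry). This is a minimal sketch with invented record and field names; the point is the `doc` attributes, which give each field one authoritative meaning:

```java
import org.apache.avro.Schema;

public class CustomerEventSchema {
    // A documented schema per topic means "what does this field mean?"
    // has exactly one authoritative answer, discoverable by every consumer.
    static final Schema CUSTOMER_EVENT = new Schema.Parser().parse("""
        {
          "type": "record",
          "name": "CustomerEvent",
          "namespace": "com.example.customers",
          "doc": "One event per customer state change. Owned by the Customers team.",
          "fields": [
            {"name": "id", "type": "string",
             "doc": "Customer UUID from the CRM; NOT the order id or the account id."},
            {"name": "event_type", "type": "string",
             "doc": "CREATED, UPDATED or DELETED."},
            {"name": "occurred_at", "type": "long",
             "doc": "Event time, epoch milliseconds, UTC."}
          ]
        }
        """);
}
```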

This is where notions such as Data Governance, Data Quality and Data Catalogues come in. Access control is key. Maybe we even decide to design domains around the data. These are all interesting topics, which can really help move from “I can build a new streaming service with X data” to “my business unit can plug into any topic (that it is authorised to access), quickly explore the data while fully understanding it, and generate new business opportunities”.
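And since access control is key, here is a minimal sketch of granting a team read-only access to a single topic with Kafka’s `AdminClient`. The `User:analytics` principal and `orders` topic are hypothetical, and this assumes a cluster with an authorizer enabled:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class GrantReadAccess {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Allow the analytics team to read the "orders" topic, and nothing else.
            AclBinding readOrders = new AclBinding(
                new ResourcePattern(ResourceType.TOPIC, "orders", PatternType.LITERAL),
                new AccessControlEntry("User:analytics", "*",
                    AclOperation.READ, AclPermissionType.ALLOW));
            admin.createAcls(List.of(readOrders)).all().get();
        }
    }
}
```

Explicit, per-topic grants like this are what turn “anyone can read anything” into an architecture you can reason about.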

Without full knowledge of your data at your fingertips, you will never create the exploration scenario you require to really unlock the full potential of your teams. In future posts, I’ll be tackling some of the strategies (and failures) I’ve seen and applied in the field.
