Schema and Topic Design in Event-Driven Systems (featuring Kafka!)

Daniel Orner
Flipp Engineering
May 1, 2020

In the microservices world, communication between services presents a host of problems — one of them being “how do we ensure that the contract between services doesn’t break as those services change?”

Communication in event-driven land usually happens through some kind of messaging system. Here at Flipp, we’ve standardized on Apache Kafka, a streaming platform with a number of advantages which have seen it play a central role in our architecture. This system will have a number of topics (also known as queues, buses, etc. in other products), each of which holds a number of messages (records, events, etc.).

The way to enforce a communication protocol with services using a messaging system is by using a schema, a strongly typed description of your data. You encode your data against the schema on the producing side, and decode it against the schema on the consuming side. If any data doesn’t match the schema, you can’t even encode it in the first place. You can think of schemas in messaging as analogous to interfaces in object-oriented design.

There are a number of schema formats, including Google’s Protocol Buffers / Protobuf (primarily used for RPC calls) and JSON Schema, which can be used for Ajax requests. In the Kafka world, the “winner” for schema validation and encoding has generally been Apache Avro. This is the format used in a number of Confluent tools, including Kafka Connect and most Kafka topic viewers. (Hot off the press: Confluent just added support for Protobuf and JSON Schema!)

I’d like to explore the different categories of topics, provide some advice and best practices for crafting your schemas, and discuss how to design what goes into the topics that carry those messages.

Note that although I’ll be using Avro in this discussion, almost everything I say can be applied to Protobuf, JSON Schema, Thrift, etc. — they’re all pretty similar. There are also a couple of other streaming products similar to Kafka, but we’ll be using Kafka in all our examples here.

Topic Best Practices

  • Every message in a topic should be encoded with the same schema. This enforces the fact that the topic should represent one type of thing. Mixing types of data in a single topic pollutes the mental model of what that topic means.
  • In most cases, you’ll want to have exactly one producer and one or more consumers. If more services need to send the same kind of data, that’s a clue that your service boundaries are not well-drawn, or that one service should be talking to the other one in some different way instead. This applies to event, entity and response topics (see below).
  • In some cases, you should have the opposite: one or more producers and exactly one consumer. This applies to request topics (see below), ensuring only one system processes each request. You can think of it as a funnel instead of a branch-out.
  • Standardize your topic names to make it a no-brainer when creating new ones. One possible format is {BoundedContext}.{Subdomain}. For example: Orders.OrderCreatedEvent, Customers.ContactInfo, Geography.ZipCode.

Topic Design

One great feature that Kafka has over many other streaming / messaging platforms is the concept of a message key. I go into more detail about it here, but associating a key with each message allows you to compact your topic so only the latest message with a given key exists and all previous messages with that key are automatically deleted. Generally the message key should be some kind of identifying ID (e.g. widget ID, order ID, etc.).

In the rest of this article, I’ll refer to message keys as just plain keys.

Without further ado, let’s dive into the different types of topics and some things you should know about designing and using them.

Event Topics (“this thing happened”)


The important thing to note about event topics is that things can happen multiple times. An order can be edited many times; a product can be added, then added again so that there are two products in the order.

Because of this, event topics are “insert only” — you can’t represent the current state of anything using them. If you want to figure out the current state, you would need to use an event sourcing strategy, which is really only ideal for specialized cases.

Events are great for things like beacons and user actions (which you can run analytics queries on), or for a choreography architecture — triggering actions based on things that happen.

Messages in event topics should:

  • Not have any keys associated with them — each event needs to be unique.
  • Have a retention policy or TTL — events should be automatically deleted after a certain period of time. You should be saving your events into long-term storage as they happen rather than keeping them in the topic.
  • Include all possible data that can act as a dimension when querying: User ID, order ID, customer ID, event type and subtype, etc. Once events have happened they can’t re-happen — and some dimension information may be different later, such as a user changing address — so try to include as much information about it as you can.
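
As a rough sketch, an event message following these guidelines might look like this (the names and fields here are illustrative, not a real Flipp schema):

```json
{
  "type": "record",
  "name": "OrderCreatedEvent",
  "namespace": "orders.events",
  "doc": "Published once whenever an order is created. No message key; the topic has a time-based retention policy.",
  "fields": [
    {"name": "order_id", "type": "long", "doc": "The order that was created."},
    {"name": "user_id", "type": "long", "doc": "The user who created the order."},
    {"name": "customer_id", "type": "long", "doc": "The customer the order belongs to."},
    {"name": "timestamp", "type": "string", "doc": "Human-readable time the event happened."}
  ]
}
```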

Event Types

During design, you may find yourself in a situation where you have a number of events which all look exactly the same and have the same information:

  • Order created: has a user ID, order ID, timestamp
  • Order began processing: Ditto
  • Order completed processing: Ditto

The question is, when do you combine all of these into one topic with an event type, and when do you separate them into multiple topics? You need to ask yourself a few questions to decide this:

  • Are we tracking the state of a single thing? Using a single topic can help determine the final state, and also allows you to have a consistent partition for all related events. Assuming you’re producing your events in the correct order, this means you can also enforce ordering on the consumer side, so that you never process a “completed processing” event before a “began processing” event for the same order.
  • Will we work with these events all together? If we want to combine a set of beacons into a single query (e.g. “get me all actions that happened to an order”), you’ll have an easier time with a single topic. If they’re always separate, it’s cleaner to have different topics.
  • What additional parameters does each event need? If “order processed” has a number of additional fields which would be meaningless for “order created”, that’s a clue that they need to be separated out.
  • You may actually want three topics — one for “order created”, one for “order processed”, and a much simpler one for all “order events”. You could send to the specific topic and the combined topic at the same time, or have a downstream service collate the detailed topics into the combined one.
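
If you do combine them, here’s a hedged sketch of what the shared schema might look like, with an enum distinguishing the event types (all names are assumptions):

```json
{
  "type": "record",
  "name": "OrderEvent",
  "namespace": "orders.events",
  "doc": "All lifecycle events for an order.",
  "fields": [
    {"name": "order_id", "type": "long", "doc": "The order these events relate to."},
    {"name": "user_id", "type": "long", "doc": "The user who owns the order."},
    {
      "name": "event_type",
      "type": {
        "type": "enum",
        "name": "OrderEventType",
        "symbols": ["UNKNOWN", "CREATED", "PROCESSING_STARTED", "PROCESSING_COMPLETED"],
        "default": "UNKNOWN"
      },
      "default": "UNKNOWN",
      "doc": "Which lifecycle step this event represents; UNKNOWN is a sentinel default."
    },
    {"name": "timestamp", "type": "string", "doc": "Human-readable time the event happened."}
  ]
}
```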

Request and Response Topics (“do this thing / this thing is done”)


Request and response topics are similar to event topics. The main difference is in the meaning of the topic.

Events (pub/sub style) contribute to a choreography architecture — each service publishes things that happen and interested downstream services subscribe to those events and react accordingly.

On the other hand, requests and responses are an orchestration strategy — when a service knows something must happen, it explicitly asks the downstream service to perform it. Often there is a single orchestrator service performing a number of requests to different downstream services.

There are pros and cons to both strategies. In general, when service A not only wants service B to do something, but is interested in the results, request/response is a better choice than pub/sub.

Like event topics, requests and responses should not have message keys, and should have a TTL / retention period. The differences are:

  • Events represent actions that happen in a system, and are prime candidates for analytics queries later on — hence my suggestion to include as much information as possible. Requests, on the other hand, should only include whatever information the downstream service needs to do its job.
  • Requests should have some kind of “request ID” (also known as a “correlation ID”) which the downstream service will retain and send back in the response. This ensures that responses can be associated with the original request, and also helps with tracing and debugging.
  • If a service can respond to requests from multiple sources, it’s a good idea to identify the source service with a string or ID of some kind, and pass that back as well. This way the originating service can easily filter out responses that aren’t meant for it. Alternatively, make separate topics for each requesting system (although that makes the requested service more tightly coupled with the requesters).
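
As a rough sketch (the service, topic and field names are assumptions, not prescriptive), a request and its matching response might look like:

```json
{
  "type": "record",
  "name": "ResizeImageRequest",
  "namespace": "images.requests",
  "doc": "Ask the image service to resize an image.",
  "fields": [
    {"name": "request_id", "type": "string", "doc": "Correlation ID; echoed back in the response."},
    {"name": "requesting_service", "type": "string", "doc": "Which service sent the request, so it can filter the responses meant for it."},
    {"name": "image_id", "type": "long", "doc": "The image to resize."},
    {"name": "width", "type": "int", "doc": "Target width in pixels."},
    {"name": "height", "type": "int", "doc": "Target height in pixels."}
  ]
}
```

```json
{
  "type": "record",
  "name": "ResizeImageResponse",
  "namespace": "images.responses",
  "doc": "Sent by the image service when a resize request is done.",
  "fields": [
    {"name": "request_id", "type": "string", "doc": "Copied from the original request."},
    {"name": "requesting_service", "type": "string", "doc": "Copied from the original request."},
    {"name": "success", "type": "boolean", "doc": "Whether the resize succeeded."},
    {"name": "error_message", "type": ["null", "string"], "default": null, "doc": "Populated only on failure; null means there was no error."}
  ]
}
```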

Entity Topics (“here is this thing”)


These topics represent the latest state of a particular entity, with all its current information. I go into more depth here as to how you can use these topics to form a “data backbone” for your architecture. The main point is that you use message keys to identify the entity you’re dealing with, and you ignore (or delete) all messages with the same key except the latest one.

Entity topics can be considered a “public view” into one or more tables in a local database. Exposing this data via Kafka allows other systems to read the data into their own data stores and perform operations or queries on it.

Entity topics can be considered “upserts” — they should be sent whenever:

  • An object is created
  • One of the fields in the topic schema changes on that object
  • The object is deleted (in Kafka, you’d send a null payload with the same message key).

Shameless plug — if you’re using Ruby, Deimos does a great job of managing entity topics!

Entity topics in the world of the data backbone should be created:

  • When a downstream system has need of the data
  • When people need to run queries on that data in a reporting data structure like a data lake (assuming data gets into that lake via your messaging service)

For entity topics, include as little data as possible. Find out what your consumers need and don’t add anything else, even if it’s tempting (and encouraged by tools like Kafka Connect) to just dump your entire database table into your topic. Adding fields is easy; removing them is very hard, as it’s difficult to figure out who is using a field and why.

Most importantly: Do not mirror your database schema in your topics. Your internal schema is used internally. This is a public interface read by people who know nothing about how you work. Use nested records instead of one topic per table; leave out internal fields whose meaning isn’t immediately obvious; ensure your data is clean; and don’t produce more messages than downstream consumers care about.

Sometimes you might need an “internal topic” for use within a set of microservices that you own, as well as an “external topic” for use outside your team. You can either send both of them from the same source, or consider a “sieve” or “gatekeeper” service that intercepts all internal topics, cleans up and sanitizes the data, and produces a set of topics that external teams can understand and use without asking questions.

When do I make a new topic?

One of the common questions when working with entity topics is how much data to include in your topic, particularly when looking at has-many relationships. Do you totally normalize it and have many smaller topics? Or have multiple tables in one big topic?

Here are some rubrics for making this decision — assuming you already have a topic A (representing entity A) and you’re looking at adding information representing entity B, which has some kind of relationship with entity A:

  • If B is interesting on its own, a new topic is probably a good idea. By “interesting”, I mean that there can conceivably be a service that cares only about B but not about A.
  • If the source of B data may change in the future (e.g. you are looking at breaking up a monolith and B would move to a microservice), you should make B its own topic. If A and B are part of the same bounded context and will always live together, it’s likely not necessary.
  • For performance purposes — if adding B to A would make the messages too big, or cause messages to be sent unnecessarily often. Remember that if they share a topic, you’d need to send a new message whenever either A or B changes.

Examples — Translations

In this case, we are using a library like globalize to provide a has-many relationship with domain data — so a retailer might have an English, French, and Spanish name, for example. In our database, we represent this with two tables: retailers and retailer_translations.

Using the rubric above:

  • Translations are not interesting on their own — no one would only want the translations without the retailer.
  • The source is not likely to change.
  • It won’t make a big change to the size of the message or update frequency.

In this case, adding the translations to the existing schema (via a nested array) is the right choice.
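
Under those assumptions, a sketch of what the retailer schema might look like with translations nested as an array of records (the names are made up for illustration):

```json
{
  "type": "record",
  "name": "Retailer",
  "namespace": "retailers.entities",
  "doc": "Current state of a retailer; the retailer ID is also used as the message key.",
  "fields": [
    {"name": "id", "type": "long", "doc": "Retailer ID."},
    {
      "name": "translations",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "RetailerTranslation",
          "fields": [
            {"name": "locale", "type": "string", "doc": "e.g. en, fr, es."},
            {"name": "name", "type": "string", "doc": "Retailer name in this locale."}
          ]
        }
      },
      "default": [],
      "doc": "One entry per supported locale."
    }
  ]
}
```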

The same thinking can often be applied to thumbnails, images or attachments associated with an object.

Examples — Associations

Think about something a bit more meaty, like a customer that has multiple orders. In this case:

  • Some services might care about orders without knowing all the customer details.
  • It’s very likely that the thing that manages orders and the thing that manages customers might break apart in the future.
  • The update frequency might change a lot, and if we’re keeping a full order history, that would balloon the message size.

In this case, it’s a no-brainer to separate them.

Other examples aren’t quite so straightforward, so try and figure out what belongs where using my suggestions above.

Note that sometimes the right solution is a third option — have both topics. With Kafka, you can write joiners with Kafka Streams or KSQL to join two topics (although there are challenges with that approach). You can also have your current database send out both the “enhanced” and “separate” topics, since it’s often easier to join in a relational database than outside it. The main downside to this is that you now have two “sources of truth”, so you need to be extra careful about how you present these topics to your downstream consumers.

Schema Design


A Quick Avro Introduction

Like most schema formats, Avro has both primitive types:

  • Numeric types: int, long, float and double
  • Strings and bytes
  • Booleans
  • Null

…and complex types:

  • Records (a type with predefined, named fields)
  • Arrays (lists of values)
  • Enums (a scalar value from a list of possible values)
  • Maps (key-value stores — a.k.a. dictionaries or hashes)
  • Unions (one of multiple types — this is used to allow nulls)

Avro schemas can be nested by having types inside types (e.g. an array of records where each record has a field called “images” which is an array of strings).
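
For example, the nesting described above could be written like this (a contrived sketch):

```json
{
  "type": "array",
  "items": {
    "type": "record",
    "name": "Product",
    "fields": [
      {"name": "name", "type": "string", "doc": "Product name."},
      {"name": "images", "type": {"type": "array", "items": "string"}, "default": [], "doc": "URLs of product images."}
    ]
  }
}
```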

Schema Evolution

One of the nice things about Avro is the fact that two schemas can be compatible even if they aren’t identical. This means that you can encode a message in version 1 of a schema and decode it with version 2, as long as you haven’t made any breaking changes in your schema in the meantime. Similarly, you can write using a newer version of the schema and read with an older version.

A great tool when using Avro with Kafka is the Confluent Schema Registry. This is a central place to hold your schemas, and allows you to encode and decode against the registry instead of a local schema. This not only shrinks the size of your messages, but also enforces schema evolution — the registry will reject any change to a schema that would break compatibility.
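
For example, adding a field with a default value is a compatible change. In the hedged sketch below (the Product schema and its fields are invented for illustration), a message encoded with version 1 can still be read with version 2 — the reader just fills in the default for the new field — and a version 2 message can be read with version 1, which simply ignores it.

```json
{
  "type": "record",
  "name": "Product",
  "namespace": "products.entities",
  "fields": [
    {"name": "product_id", "type": "long"}
  ]
}
```

```json
{
  "type": "record",
  "name": "Product",
  "namespace": "products.entities",
  "fields": [
    {"name": "product_id", "type": "long"},
    {"name": "weight_grams", "type": "int", "default": 0, "doc": "New in version 2; the default keeps old readers and old messages compatible."}
  ]
}
```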

Breaking Changes

In Avro world, a breaking change means:

  • Adding a field without a default value
  • Removing a field without a default value
  • Changing a field’s type or name
  • Changing an enum’s values if it has no default enum symbol

You can see a theme here — default values are incredibly important as a barrier against incompatibility. In general, all non-essential fields should have default values. Here, non-essential means a field which does not define the basis of the message.

Here’s an example:
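
A minimal sketch of such a schema — the sentinel value here is just an assumption to illustrate the point:

```json
{
  "type": "record",
  "name": "Widget",
  "namespace": "widgets.entities",
  "doc": "Current state of a widget.",
  "fields": [
    {
      "name": "widget_id",
      "type": "long",
      "doc": "The widget's identifying ID. Essential to the message, so no default."
    },
    {
      "name": "color",
      "type": "string",
      "default": "UNKNOWN_COLOR",
      "doc": "Nice-to-have. The sentinel default means 'we know nothing about the color', as opposed to an empty string meaning 'no color was selected'."
    }
  ]
}
```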

In this case, widget_id is a basic field — you’d never describe a widget if it didn’t have one. On the other hand, color is a nice-to-have. Even if every single current consumer wants to know a widget’s color, widgets could stop being categorized by color, or others could lose interest in the color. Therefore, we give color a default value. In practice, very few fields should ever not have a default value.

Note that the default value is not an empty string, but a sentinel value we don’t expect to ever see. This is because the default value represents the value we see when our reader schema doesn’t even have this field — for example, if we decide to remove it. You’re going to want to use a sentinel value when an empty value has meaning. In this case, an empty color might mean that no color was selected, whereas the sentinel value indicates that we don’t know anything about the color. Downstream systems might handle these cases in different ways, so it’s important to differentiate.

On the other hand, if there’s no appreciable difference between the field being missing and the value being empty (for example, some freeform text like a description), feel free to use an empty string or null for a default.

Schema Tips

Here’s a quick list of suggestions for your schemas. Again, this is Avro-centric, but most if not all can be applied to other schema formats.

  • Your schema is your service’s public interface. It’s how other services access your data, events, requests and responses. Design them for your consumers, not for you.
  • You can use your database schema as a starting point, but messages are not database tables. Just because your database stores a field as a particular type does not mean your public schema should define it that way. Consider each field and type realistically — in what format do consumers want it? You might be using ints in your DB to represent a state, timestamps, etc., but downstream consumers will want strings or enums to make it more understandable. Do your consumers really care about the created_at date of a record in your table?
  • Use longs instead of ints when defining identifying IDs (widget ID, order ID, etc.), wherever you can. You never know when your scale of data is going to explode, and changing it can be expensive. Another option is to use UUIDs instead of integer IDs, but that’s not always possible.
  • Add a timestamp — in human readable format — to all your schemas. When printing out your messages to logs, it becomes much easier to trace and debug. Computers can read human format, but humans can’t read UNIX timestamps.
  • Add a message_id to all your schemas, and set it to a UUID. This allows you to trace a message easily — e.g. figure out if service A sent it and service B didn’t receive it. Note that this is different from your identifying IDs — this is for debugging purposes only.
  • Most schema formats have a “doc” field for types allowing you to describe them. Use it. Document the meaning and use of your schema, and every single one of your fields. Don’t assume that because you know what it means that others will (or that you will in six months’ time). Document the possible values for fields, as well as what might cause those values to change.
  • If your field is a boolean type, don’t allow nulls. Just… don’t. They don’t make sense. If you need three values, make it a three-value enum — where two of the values are not “true” and “false”.
  • Allow nulls in your field if and only if null is a valid, meaningful value. If an empty string and null are always treated identically for a field, use a string type. Don’t allow null for any “required” field (i.e. if the field is in the schema, it must always have a value). Don’t add it in just because you forgot to add a “NOT NULL” to the corresponding database column.
  • Use enums for a limited, discrete set of values. Don’t use them for ranges of numbers, and don’t use strings or ints where an enum would do. Remember to add a “default enum symbol” of a sentinel, similar to what you’d use for fields. (Note that earlier versions of Avro had some challenges with enums — the introduction of default symbols has made them flexible enough to use safely.)
  • Avoid unions for any purpose other than allowing nulls. It muddies the waters as to what the meaning of a field actually is. If you find you need a field to be one of multiple completely different types, a rethink is likely in order.
  • If you are using Avro and Kafka, schema-encode your keys as well as your payloads. This makes it much easier for strongly-typed languages like Java to manage your messages.
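
Pulling several of these suggestions together, here’s a hedged sketch of a key schema and a value schema for an entity topic (all the names, fields and enum values are illustrative assumptions, not a real schema):

```json
{
  "type": "record",
  "name": "OrderKey",
  "namespace": "orders.entities",
  "doc": "Message key for the Orders.Order topic.",
  "fields": [
    {"name": "order_id", "type": "long", "doc": "Identifying ID; a long rather than an int in case data volume explodes."}
  ]
}
```

```json
{
  "type": "record",
  "name": "Order",
  "namespace": "orders.entities",
  "doc": "Current state of an order.",
  "fields": [
    {"name": "message_id", "type": "string", "doc": "UUID for tracing this message; not an identifying ID."},
    {"name": "timestamp", "type": "string", "doc": "Human-readable time this message was produced."},
    {"name": "order_id", "type": "long", "doc": "Identifying ID of the order; matches the message key."},
    {
      "name": "status",
      "type": {
        "type": "enum",
        "name": "OrderStatus",
        "symbols": ["UNKNOWN", "CREATED", "PROCESSING", "COMPLETED"],
        "default": "UNKNOWN"
      },
      "default": "UNKNOWN",
      "doc": "Lifecycle status; UNKNOWN is a sentinel for readers whose schema doesn't know this field."
    },
    {"name": "notes", "type": "string", "default": "", "doc": "Freeform description; empty string and missing mean the same thing here, so no null."}
  ]
}
```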

Schemas are powerful and amazingly helpful to keep your data sane. Use them wisely!
