If you’re not using Schemas in Apache Kafka, you’re missing out big time!
Kafka does not look at your data.
Kafka takes bytes as an input and sends bytes as an output. That constraint-free protocol is what makes Kafka powerful.
Obviously, your data has meaning beyond bytes, so your consumers need to parse it and later on interpret it. When all goes well, you’re happy. When it doesn’t, you hit the panic button.
There’s nothing worse than parsing exceptions.
They mainly occur in these two situations:
- The field you’re looking for doesn’t exist anymore
- The type of the field has changed (e.g. what used to be a String is now another type)
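Here is a minimal sketch of both failure modes, using Python's standard `json` module. The `order_total` consumer and the three payloads are hypothetical, made up purely for illustration:

```python
import json


def order_total(raw: bytes) -> int:
    """Hypothetical consumer: expects integer `price_cents` and `shipping_cents`."""
    order = json.loads(raw)
    return order["price_cents"] + order["shipping_cents"]


ok = b'{"price_cents": 500, "shipping_cents": 99}'
field_removed = b'{"price_cents": 500}'                        # producer dropped a field
type_changed = b'{"price_cents": "500", "shipping_cents": 99}'  # int became a string


def try_parse(raw: bytes) -> str:
    """Return the total, or the name of the exception the consumer ran into."""
    try:
        return str(order_total(raw))
    except (KeyError, TypeError) as exc:
        return type(exc).__name__


results = [try_parse(payload) for payload in (ok, field_removed, type_changed)]
print(results)
```

Nothing in the bytes themselves warns the consumer that the producer changed its mind; the error only shows up at read time.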
What are our options to prevent and overcome these issues?
- Catch exceptions on parsing errors. Your code becomes ugly and very hard to maintain. 👎
- Never ever change the data producer, and triple-check that your producer code never forgets to send a field. That’s what most companies do. But after a few key people quit, all your “safeguards” are gone. 👎👎
- Adopt a data format and enforce rules that allow you to perform schema evolution while guaranteeing not to break your downstream applications. 👏 (Sounds too good to be true?)
That data format is Apache Avro. In this blog, I’ll discuss why you need Avro and why it’s very well complemented by the Confluent Schema Registry. And in my online course on Apache Avro, the Confluent Schema Registry and Kafka REST proxy, I go over these concepts in great depth alongside many hands-on examples.
The data formats you know have flaws.
Okay — all data formats have flaws, nothing is perfect. But some are better suited for data streaming than others. If we take a brief look at commonly used data formats (CSV, XML, Relational Databases, JSON), here’s what we can find.
CSV — The Almighty
Probably the worst data format for streaming, all-time favorite of everyone who doesn’t deal with data on a daily basis; CSV is something we all know and have to deal with one day or another.
- Easy to parse… with Excel
- Easy to read… with Excel
- Easy to make sense of… with Excel
- The data types of elements have to be inferred and are not a guarantee
- Parsing becomes tricky when data contains the delimiting character
- Column names (header) may or may not be present in your data
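A small sketch of the delimiter and typing problems, using Python's standard `csv` module. The sample rows are hypothetical:

```python
import csv
import io

unquoted = "Acme, Inc.,42\n"     # the company name contains the delimiter
quoted = '"Acme, Inc.",42\n'     # same row, but the producer quoted the name

# Naive splitting breaks the name in two and yields three fields.
naive_fields = unquoted.strip().split(",")

# A CSV-aware parser only helps if the producer quoted the value correctly.
parsed_fields = next(csv.reader(io.StringIO(quoted)))

# Either way, every value comes back as a string: types have to be inferred.
all_strings = all(isinstance(value, str) for value in parsed_fields)

print(naive_fields, parsed_fields, all_strings)
```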
Verdict: CSV creates more problems than it’ll ever address. You may save on data storage space with it, but you lose in safety. Don’t ever use CSV for data streaming.
XML — The Dinosaur
XML is heavyweight, CPU-intensive to parse and completely outdated, so don’t use it for data streaming. Sure, it has schema support, but unless you take pleasure in dealing with XSD files (please reach out), XML is not worth considering. Additionally, you would have to send the XML schema with each payload, which is very wasteful of resources. Don’t use XML for data streaming!
The relational database format — not really a data format
CREATE TABLE distributors (
    did integer PRIMARY KEY
);
We’re getting somewhere though. Looks kind of nice, has schema support and data validation as a first-class citizen. You can still have runtime parsing errors in your SQL statements if someone in your company drops a column, but hopefully, that won’t happen very often.
- Data is fully typed
- Data fits in a table format
- Data has to be flat
- Data is stored in a database, and data definition, storage, and serialization will be different for each database technology.
- No schema evolution protection mechanism. Evolving a table can break applications
Verdict: Relational databases have a lot of concepts we desire for our streaming needs, but the showstopper is that there’s no “common data serialization format” across databases. You will have to convert the data to another format (like JSON) before inserting it into Kafka. The concept of “Schema” is great though, so we’ll keep that in mind.
JSON — Everyone’s favorite
The JSON data format has grown tremendously in popularity. It is omnipresent in every language, and almost every modern application uses it.
- Data can take any form (arrays, nested elements)
- JSON is a widely accepted format on the web
- JSON can be read by pretty much any language
- JSON can be easily shared over a network
- JSON has no native schema support (JSON schema is not a spec of JSON)
- JSON objects can be quite big in size because of repeated keys
- No comments, metadata, documentation
Verdict: JSON is a popular data choice in Kafka, but it is also the best illustration of how giving your producers too much flexibility and zero constraints lets them change data types and delete fields. If you ever had parsing issues in JSON (the ones I talked about in the beginning), you know what I mean.
As we have seen, all these data formats have advantages and flaws. Their usage may be justified in many cases, but they are not necessarily well suited for data streaming. We’ll see how Avro can make this better. That said, a big reason why all these formats are popular is that they’re human readable. As we’ll see, Avro isn’t, because it’s binary.
Apache Avro — Schemas you can trust
Avro has grown in popularity in the Big Data community. It also has become the favorite Fast-Data serialization format thanks to a big push by Confluent (due to the Confluent Schema Registry).
How does Avro solve our problem?
Schema as a first-class citizen
Similarly to how in a SQL database you can’t add data without creating a table first, one can’t create an Avro object without first providing a schema.
There’s no way around it. A huge chunk of your work will be to define an Avro schema. I’ll try to make things short:
- Avro has support for primitive types (int, long, string, bytes, etc…), complex types (enums, arrays, maps, unions, optionals), logical types (date, timestamp-millis, decimal), and data records (name, namespace). All the types you’ll ever need.
- Avro has support for embedded documentation. Although documentation is optional, in my workflow I will reject any Avro Schema PR (pull request) that does not document every single field, even if obvious. By embedding documentation in the schema, you reduce data interpretation misunderstandings, you allow other teams to know about your data without searching a wiki, and you allow your devs to document your schema where they define it. It’s a win-win for everyone.
- Avro schemas are defined using JSON. Because every developer knows or can easily learn JSON, there’s a very low barrier to entry.
- An Avro object contains the schema and the data. The data without the schema is an invalid Avro object. That’s a big difference compared with, say, CSV or JSON.
- You can make your schemas evolve over time. Apache Avro has a concept of projection which makes evolving schema seamless to the end user.
To illustrate my point, here’s an Avro schema:
Looks good, reads well, doesn’t it?
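As a rough sketch of what such a schema can look like, and of what projection means in practice, here are two versions of a made-up `Customer` record written as Avro schema JSON. The record, its fields, and the `project` helper are all hypothetical; the helper is a hand-rolled simplification of the idea behind Avro schema resolution, not the Avro library's API:

```python
import json

# Hypothetical writer schema (version 1).
CUSTOMER_V1 = json.loads("""
{
  "type": "record",
  "name": "Customer",
  "namespace": "com.example",
  "doc": "A customer of a made-up online shop",
  "fields": [
    {"name": "first_name", "type": "string", "doc": "First name"},
    {"name": "age", "type": "int", "doc": "Age in years"}
  ]
}
""")

# Hypothetical reader schema (version 2): `age` removed, `email` added with a default.
CUSTOMER_V2 = json.loads("""
{
  "type": "record",
  "name": "Customer",
  "namespace": "com.example",
  "doc": "A customer of a made-up online shop",
  "fields": [
    {"name": "first_name", "type": "string", "doc": "First name"},
    {"name": "email", "type": ["null", "string"], "default": null,
     "doc": "Contact email, null when unknown"}
  ]
}
""")


def project(record: dict, reader_schema: dict) -> dict:
    """Simplified projection: keep the reader's fields, fill defaults, drop the rest."""
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return out


old_record = {"first_name": "Ada", "age": 36}   # written with CUSTOMER_V1
new_view = project(old_record, CUSTOMER_V2)     # read with CUSTOMER_V2
print(new_view)
```

The consumer on version 2 can still read data written with version 1 because the added field carries a default. Going the other way fails in this sketch, since `age` has no default in version 1.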
Avro data serialization is efficient in space and light on the CPU, and it can be read from most mainstream languages. You can even apply a compression algorithm such as Snappy on top of it to reduce the size of your payloads further.
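To make the space claim concrete, here is a rough sketch that hand-encodes one record following the binary encoding described in the Avro specification: ints and longs are zigzag-encoded and written as variable-length integers, strings are a length prefix followed by UTF-8 bytes, and a record is simply its fields concatenated in schema order. The record and the helper names are hypothetical, and a real project would use an Avro library rather than this hand-rolled encoder:

```python
import json


def encode_long(n: int) -> bytes:
    """Avro int/long: zigzag-encode, then write 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)          # zigzag, valid for values in the 64-bit range
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # set the continuation bit
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


record = {"first_name": "Ada", "age": 36}

# Field names are not written: the schema already says which field comes first.
avro_bytes = encode_string(record["first_name"]) + encode_long(record["age"])
json_bytes = json.dumps(record).encode("utf-8")

print(len(avro_bytes), len(json_bytes))
```

The binary form is five bytes here because the keys live in the schema rather than in every message, which is exactly the "repeated keys" cost JSON pays.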
Yes, there are a few drawbacks.
- It takes longer in your development cycle to create your first Avro object. To me, that’s a good thing, but still, some developers will reject it because it’s not as straightforward as writing JSON. Regardless, the small up-front investment is typically far outweighed by the time saved later on, when you do not have to troubleshoot data or data format problems. In practice, schemas are used to read or write data far more often than they are defined or updated. So one hour spent agreeing with all stakeholders on the Avro schema is much less effort than one week spent figuring out why your data pipeline suddenly broke overnight.
- Avro is a binary format. In that regard, you cannot just open an Avro file with a text editor and view its content like you would with JSON. I strongly believe viewing an Avro object should be supported by IDEs in the future, but we’re not there yet. To get around it, you can use the avro-tools jar.
- It will take some time to learn Apache Avro. There’s no free lunch.
Protobuf, Thrift, Parquet, ORC, etc…?
In the open source world, there are many different data serialization formats and you may cherish one over another. In that regard, I won’t say X is better than Y. But here’s one fact: at the time of writing, the Confluent Schema Registry only works with Avro. The day it supports Protobuf, I’ll consider using it as a data format. Until then, I’m being a pragmatic developer and going along with Avro.
Apache Avro is a great data format to use for your fast data pipeline. In conjunction with the Schema Registry, you will have a killer combo. If you want to learn some more about Avro, I dedicate 1h30 of content to learning it in my course.
The Confluent Schema Registry — your safeguard
Now that we have found a data serialization format that already solves our data schema challenge, we still don’t know how to solve another big problem. Our Kafka cluster still takes 0s and 1s as inputs, and it doesn’t verify the data payload.
That’s where the Confluent Schema Registry comes in!
Confluent Schema Registry architecture
The Confluent Schema Registry lives outside of, and separately from, your Kafka brokers. It is an additional component that can be set up with any Kafka cluster, whether it is vanilla, Hortonworks, Confluent, or any other provider.
Here’s what the architecture with it looks like:
As you can see, your producers and consumers still talk to Kafka, but now they also talk to your Schema Registry. How does that solve our problems?
The Kafka Avro Serializer
The engineering beauty of this architecture is that now, your Producers use a new Serializer, provided courtesy of Confluent, named the KafkaAvroSerializer. Upon producing Avro data to Kafka, the following will happen (simplified version):
- Your producer will check if the schema is available in the Schema Registry. If not available, it will register and cache it.
- The Schema Registry will verify if the schema is either the same as before or a valid evolution. If not, it will return an exception and the KafkaAvroSerializer will crash your producer. Better safe than sorry.
- If the schema is valid and all checks pass, the producer will only include a reference to the Schema (the Schema ID) in the message sent to Kafka, not the whole schema. The advantage of this is that now, your messages sent to Kafka are much smaller!
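The steps above can be sketched with a toy, in-memory stand-in for the registry. `ToyRegistry`, `IncompatibleSchema`, and the compatibility rule used here (every field added by a new schema must carry a default) are hypothetical simplifications for a single subject, not Confluent's client API. The framing in `frame`, on the other hand, follows Confluent's documented wire format: one magic byte `0`, a 4-byte big-endian schema ID, then the Avro-encoded payload.

```python
import json
import struct


class IncompatibleSchema(Exception):
    """Raised when a new schema is not a valid evolution of the previous one."""


class ToyRegistry:
    """In-memory stand-in for a registry that tracks a single subject."""

    def __init__(self):
        self._ids = {}        # canonical schema JSON -> schema ID (the cache)
        self._latest = None   # most recently registered schema

    def register(self, schema: dict) -> int:
        key = json.dumps(schema, sort_keys=True)
        if key in self._ids:                  # same schema as before: reuse its ID
            return self._ids[key]
        if self._latest is not None:          # otherwise it must be a valid evolution
            old_names = {f["name"] for f in self._latest["fields"]}
            for field in schema["fields"]:
                if field["name"] not in old_names and "default" not in field:
                    raise IncompatibleSchema(
                        f"new field {field['name']!r} has no default")
        self._ids[key] = len(self._ids) + 1
        self._latest = schema
        return self._ids[key]


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix the payload with magic byte 0 and the 4-byte big-endian schema ID."""
    return struct.pack(">bI", 0, schema_id) + avro_payload


v1 = {"type": "record", "name": "Customer",
      "fields": [{"name": "first_name", "type": "string"}]}
v2_ok = {"type": "record", "name": "Customer",
         "fields": [{"name": "first_name", "type": "string"},
                    {"name": "email", "type": ["null", "string"], "default": None}]}
v2_bad = {"type": "record", "name": "Customer",
          "fields": [{"name": "first_name", "type": "string"},
                     {"name": "age", "type": "int"}]}

registry = ToyRegistry()
id_v1 = registry.register(v1)
id_v1_again = registry.register(v1)     # cached, same ID comes back
id_v2 = registry.register(v2_ok)        # valid evolution, new ID

try:
    registry.register(v2_bad)           # rejected: the producer would crash here
    rejected = False
except IncompatibleSchema:
    rejected = True

message = frame(id_v1, b"\x06Ada")      # 5 bytes of header instead of a whole schema
print(id_v1, id_v1_again, id_v2, rejected, message)
```

The header costs five bytes per message regardless of how large the schema is, which is where the size saving comes from.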
As a bonus, Confluent also provides a Serde for Kafka Streams!
And there we go, we have another component performing the schema safeguard at close to zero performance cost, since schemas are cached after the first lookup. In fact, because the messages are now smaller, you will get better throughput, all while getting type and schema safety.
Isn’t this lovely?
The Schema Registry is a very simple concept and provides the missing schema component in Kafka. If you start using it, it will need extra care as it becomes a critical part of your infrastructure. Rest assured though, you can deploy a highly available setup of multiple schema registries just to make sure you can take one down without affecting your overall data pipelines!
Bottom line, you’re missing out if you’re not using the Schema Registry. Combined with Apache Avro, it really starts making your data pipelines safer, faster, better, and fully typed. What’s not to like?
Okay, I’m in — where do I start?
By now I hope to have triggered your curiosity and ensured you will consider using Avro and the Schema Registry in your real-time data pipelines. Here are some links to get started if you’re interested:
- Avro Documentation — good technical read but can be a lot to digest at once
- Confluent Schema Registry Documentation — great to get started. Has quite a few examples and I always refer to it when in doubt
- My Udemy Course about Avro & the Schema Registry — easiest way to get started. Yours truly explains all you need to know in 4 hours.
- I wrote another blog about doing an end to end pipeline that uses Avro and the Schema Registry.
Happy learning! ❤️