
Introduction to Schema Registry in Kafka

Amarpreet Singh
Published in Slalom Technology · 5 min read · Apr 15, 2020

Apache Kafka has been gaining a lot of popularity these past few years thanks to its highly scalable, robust, and fault-tolerant publish-subscribe architecture. Initially developed at LinkedIn in 2010, it is one of the fastest-growing open-source projects and is now used by thousands of organizations. Although this post is about Schema Registry, if you are interested in learning about Kafka architecture, I’d highly recommend reading Kafka: The Definitive Guide. If you need additional motivation to get into Kafka, I’d recommend reading these case studies from The New York Times and Netflix.

Data really powers everything that we do. — Jeff Weiner, CEO of LinkedIn

Why Schema Registry?

Kafka, at its core, only transfers data as bytes. No data verification is performed at the Kafka cluster level; in fact, Kafka doesn’t even know what kind of data it is sending or receiving, whether it is a string or an integer.

Producer sending data in byte format to Kafka Cluster and being consumed by a consumer.
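To make this concrete, here is a minimal producer sketch (the broker address and the "orders" topic are hypothetical) that publishes raw bytes. Kafka accepts whatever bytes it is handed, with no idea what they represent:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class RawBytesProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // A string? An integer? Kafka neither knows nor cares -- it just moves the bytes.
            byte[] payload = "42".getBytes();
            producer.send(new ProducerRecord<>("orders", payload)); // "orders" is a made-up topic
        }
    }
}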

Due to Kafka’s decoupled nature, producers and consumers do not communicate with each other directly; information is transferred via Kafka topics. At the same time, the consumer still needs to know what type of data the producer is sending in order to deserialize it. Imagine the producer starts sending bad data to Kafka, or the data type of a field changes. Your downstream consumers will start breaking. We need a common data format that producers and consumers agree on.

That’s where Schema Registry comes into the picture. It is an application that resides outside of your Kafka cluster and handles the distribution of schemas to producers and consumers, storing a copy of each schema in its local cache.

Schema Registry Architecture

With the schema registry in place, the producer talks to the schema registry before sending data to Kafka and checks whether the schema is available. If the schema isn’t registered yet, the producer registers it and the schema registry caches it. Once the producer has the schema, it serializes the data with it and sends the message to Kafka in binary format, prepended with that schema’s unique ID. When the consumer processes the message, it asks the schema registry for the schema matching the ID it received and uses the same schema to deserialize the data. If there is a schema mismatch, the schema registry throws an error, letting the producer know that it is breaking the schema agreement.
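As an illustration, here is a minimal sketch of a schema-registry-aware producer using Confluent's KafkaAvroSerializer. The broker address, registry URL, topic, and the Customer record are all hypothetical:

import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // hypothetical broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // hypothetical registry URL

        // A tiny inline schema for illustration; in practice it usually lives in an .avsc file.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":"
          + "[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord customer = new GenericData.Record(schema);
        customer.put("id", 1);
        customer.put("name", "Jane Doe");

        // The serializer looks up (or registers) the schema in the registry, then sends the
        // Avro-encoded payload prepended with the schema ID.
        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("customers", "1", customer));
        }
    }
}

On the consumer side, the matching KafkaAvroDeserializer uses the schema ID embedded in each message to fetch the same schema from the registry and decode the payload.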

Data Serialization Formats

Now that we know how the schema registry works, which data serialization format should we use with it? There are a few important points to consider when choosing a format:

  • Whether the serialization format is binary.
  • Whether we can use schemas to enforce strict data structures.

Here is how some common data serialization formats stack up against these considerations:

Comparison of various data serialization formats

AVRO is the winner

Avro is an open-source binary data serialization format that comes from the Hadoop world and has many use cases. It offers rich data structures and code generation for statically typed programming languages like C# and Java.

  • Avro supports primitive types (int, boolean, string, float, etc.) and complex types (enums, arrays, maps, unions, etc.).
  • Avro schemas are defined using JSON.
  • It is very fast.
  • Avro’s embedded documentation saves us from the guessing game of what each field means.
  • We can set default values for fields, which is very useful when we evolve our schemas.

Let’s look at a sample Avro schema file:

Sample AVRO schema
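A simple Customer schema, with the record and field names made up purely for illustration, might look like this:

{
  "type": "record",
  "name": "Customer",
  "namespace": "com.example.avro",
  "doc": "A customer record, used here only as an example.",
  "fields": [
    { "name": "id", "type": "int", "doc": "Unique customer id" },
    { "name": "name", "type": "string", "doc": "Full name" },
    { "name": "email", "type": ["null", "string"], "default": null, "doc": "Optional email address" }
  ]
}

Note how the doc attributes and the default on the optional email field map directly to the bullet points above.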

You can see how easy it is to write an Avro schema. It is also a standard data serialization format for the Confluent Schema Registry.

Schema Evolution

Over time, our Avro schemas will evolve: we will add new fields or update existing ones. As schemas evolve, our downstream consumers should be able to keep consuming messages seamlessly, without triggering a production alert at 3 AM. Schema Registry is built specifically for data evolution: it versions every schema change.

Schema Evolution

When a schema is first created, it gets a unique schema ID and a version number. As the schema evolves, each compatible change receives a new schema ID and the version number is incremented. There are two ways to check whether a schema change is compatible: using a Maven plugin (in Java) or simply making a REST call. The compatibility check compares the schema on the local machine with the schema registered in the schema registry.
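As a sketch of the REST approach, assuming a registry at localhost:8081 and a hypothetical customers-value subject, you can ask the registry whether a candidate schema is compatible with the latest registered version:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CompatibilityCheck {
    public static void main(String[] args) throws Exception {
        // Candidate schema, wrapped as {"schema": "..."} the way the registry's REST API expects.
        String schema = "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":"
                      + "[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}";
        String body = "{\"schema\": \"" + schema.replace("\"", "\\\"") + "\"}";

        // Test the candidate against the latest version registered under the subject.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8081/compatibility/subjects/customers-value/versions/latest"))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // e.g. {"is_compatible":true}
    }
}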

There are various patterns for schema evolution:

Forward Compatibility: update the producer to the V2 version of the schema first, then gradually update the consumers to V2.

Forward Compatibility

Backward Compatibility: update all consumers to the V2 version of the schema first, then update the producer to V2.

Backward Compatibility

Full Compatibility: when schemas are both forward and backward compatible.
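Which of these patterns you can safely follow is governed by the compatibility level configured in the registry (BACKWARD is the default in Confluent Schema Registry). As a sketch, again assuming a registry at localhost:8081 and a hypothetical customers-value subject, the level can be set per subject through the registry's config endpoint:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SetCompatibility {
    public static void main(String[] args) throws Exception {
        // Valid levels include BACKWARD, FORWARD, FULL, NONE, and their *_TRANSITIVE variants.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8081/config/customers-value"))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .PUT(HttpRequest.BodyPublishers.ofString("{\"compatibility\": \"FULL\"}"))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // e.g. {"compatibility":"FULL"}
    }
}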

Conclusion

Schema Registry is a simple concept, but it is really powerful for enforcing data governance within your Kafka architecture. Schemas reside outside of your Kafka cluster; only the schema ID travels with each message, which makes the schema registry a critical component of your infrastructure. If the schema registry becomes unavailable, producers and consumers will break, so it is always a best practice to ensure your schema registry is highly available.


Amarpreet Singh

I'm a Solution Architect and engineering leader based in San Francisco, passionate about exploring new technologies and tackling interesting challenges.