Schema Registry in Kafka: Avro, JSON and Protobuf

The importance of having a structured data schema for messaging-based systems

Ismael Sánchez Chaves
C# Programming

--

Photo by Andrew Measham on Unsplash

Kafka is a distributed, scalable data processing platform that has become one of the most popular tools for data ingestion and processing. It can handle large volumes of data and allows users to process and analyze that data in real time. But to ensure that the data can be processed reliably, it is essential to use a schema to validate its structure and avoid runtime errors. In this article, we will look at why using a schema registry in Kafka is important and perform a trade-off analysis of three common data formats: Avro, JSON, and Protobuf.

Why use a schema registry in Kafka?

When working with data in Kafka, it is important to ensure that it is well formed and structured. If data is stored in Kafka without prior validation, runtime errors may occur that can be costly and difficult to fix. A schema registry provides a way to validate data before it is stored in Kafka.

A schema registry is a tool used to define and validate the structure of the data stored in Kafka. In a schema registry, developers define what the data should look like and how it should be validated. Validating data before it reaches Kafka reduces the possibility of runtime errors.

A schema registry also helps ensure forward and backward compatibility when changes are made to the data structure. When a schema registry is used, data is stored along with schema information that consuming applications can use to interpret it correctly.

Comparison of data formats: Avro, JSON, and Protobuf

Now that we’ve discussed why it’s important to use a schema registry in Kafka, let’s compare three common data formats: Avro, JSON, and Protobuf. Each format has its advantages and disadvantages, and it is important to understand them to determine which data format is best suited for your use case.

Avro

Avro is a data format developed within the Apache project that is widely used with Kafka. Avro has several advantages, including:

  • It allows you to define an explicit schema for data, allowing for more rigorous validation and greater forward and backward compatibility.
  • It is a compact and efficient format for transmitting data, making it suitable for use cases where bandwidth is limited.
  • It has support for complex data types, making it suitable for use cases where complex data structures are needed.

Avro also has some disadvantages. For example, it can be more difficult to implement and use than other formats, especially for developers who are not familiar with it.

JSON

JSON is a lightweight, text-based data format widely used in web and mobile applications. JSON has several advantages, including:

  • It is easy to read and write, making it suitable for use cases where human readability is important.
  • It is a widely used data format supported by many tools and systems.
  • It is a lightweight and simple data format, making it suitable for use cases where simplicity is important.

However, JSON also has some disadvantages. For example, it is not a very efficient format for transmitting large amounts of data, as it requires more bandwidth than more compact formats. In addition, JSON has no native support for complex data types, which limits its usefulness in use cases where complex data structures are needed.

Protobuf

Protobuf is a data format developed by Google that is used in distributed systems. Protobuf has several advantages, including:

  • It is a compact and efficient data format for transmitting data, making it suitable for use cases where bandwidth is limited.
  • It has support for complex data types, making it suitable for use cases where complex data structures are needed.
  • It is easy to implement and use in many different programming languages.

It should be noted that Protobuf also has some disadvantages. For example, it can be more difficult to read and write than other formats, especially for developers who are not familiar with it. Also, Protobuf has no native support for schema validation, which means an additional tool such as a schema registry is needed to validate the data.

Trade-offs between the three data formats

In summary, each data format has its advantages and disadvantages. Avro is suitable for use cases where complex data structures and strong forward and backward compatibility are needed, while JSON is suitable for use cases where simplicity and human readability are important. Protobuf, in turn, is suitable for use cases where bandwidth is limited and complex data structures are needed.

Below is a table summarizing the advantages and disadvantages of each data format:
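Format   | Main advantages                                                                  | Main disadvantages
-------- | -------------------------------------------------------------------------------- | -------------------
Avro     | Explicit schemas, strong forward/backward compatibility, compact, complex types   | Harder to learn and implement
JSON     | Human-readable, widely supported, lightweight and simple                          | Less bandwidth-efficient, no complex data types
Protobuf | Compact and efficient, complex types, easy to use across many languages           | Harder to read and write, needs an external schema registry for validation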

How can it be implemented with C# and .NET?

Let’s start with the simplest, in my opinion, which is JSON.

The code in this article will be simplified for length reasons, but the full code can be found at the following GitHub link.

JSON

For this, we need to add the "Confluent.SchemaRegistry.Serdes.Json" NuGet package.

dotnet add package Confluent.SchemaRegistry.Serdes.Json --version 2.1.1

We also need both a Kafka producer and a consumer. The producer can be built as follows:

var producer = new ProducerBuilder<string, T>(producerConfig)
    .SetValueSerializer(new JsonSerializer<T>(schemaRegistry, jsonSerializerConfig))
    .Build();
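
To give an idea of how this is used, here is a minimal sketch of publishing a message, assuming T is a simple Vehicle POCO and using a hypothetical topic name:

// Hypothetical POCO used as the message value; JsonSerializer<T> derives a JSON Schema from it.
public class Vehicle
{
    public string Registration { get; set; }
    public int Speed { get; set; }
    public string Coordinates { get; set; }
}

// Publish a message to a hypothetical "vehicles-json" topic.
await producer.ProduceAsync("vehicles-json",
    new Message<string, Vehicle>
    {
        Key = "1234-ABC",
        Value = new Vehicle { Registration = "1234-ABC", Speed = 90, Coordinates = "40.4168,-3.7038" }
    });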

And the next code would correspond to the consumer:

var consumer = new ConsumerBuilder<string, T>(consumerConfig)
    .SetKeyDeserializer(Deserializers.Utf8)
    .SetValueDeserializer(new JsonDeserializer<T>().AsSyncOverAsync())
    .SetErrorHandler((_, e) => Console.WriteLine($"Error: {e.Reason}"))
    .Build();
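
A minimal consumption loop could then look like this (the topic name is a placeholder):

// Subscribe to the topic and poll until cancelled.
consumer.Subscribe("vehicles-json");
var cts = new CancellationTokenSource();
try
{
    while (true)
    {
        var result = consumer.Consume(cts.Token); // blocks until a message arrives
        Console.WriteLine($"Received key {result.Message.Key}");
    }
}
catch (OperationCanceledException)
{
    consumer.Close(); // commit offsets and leave the group cleanly
}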

In both cases, we must not forget to create the Schema Registry client, pointing it at the registry URL:

var schemaRegistry = new CachedSchemaRegistryClient(schemaRegistryConfig);
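
For reference, here is a minimal sketch of the configuration objects referenced above; the broker and registry URLs and the consumer group are placeholders for your environment:

using Confluent.Kafka;
using Confluent.SchemaRegistry;
using Confluent.SchemaRegistry.Serdes;

// Placeholder endpoints: point these at your own Kafka broker and Schema Registry.
var schemaRegistryConfig = new SchemaRegistryConfig { Url = "http://localhost:8081" };
var producerConfig = new ProducerConfig { BootstrapServers = "localhost:9092" };
var consumerConfig = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "schema-registry-examples",          // hypothetical consumer group
    AutoOffsetReset = AutoOffsetReset.Earliest
};
var jsonSerializerConfig = new JsonSerializerConfig(); // defaults are usually enough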

If we run the code and publish a few messages, we can see in Confluent Control Center the schema generated for the configured topic.

Image by the author

Protobuf

In the case of Protobuf, things get a little more complicated. Let's start, as before, with the necessary NuGet packages, which are "Confluent.SchemaRegistry.Serdes.Protobuf" and "Grpc.Tools".

dotnet add package Confluent.SchemaRegistry.Serdes.Protobuf --version 2.1.1
dotnet add package Grpc.Tools --version 2.54.0

Adding the Grpc.Tools package gives us the tooling needed to work with .proto files and translate them into C# classes we can use. Our schema definition looks like this:

syntax = "proto3";

message vehicle
{
    string Registration = 1;
    int32 Speed = 2;
    string Coordinates = 3;
}

For the tooling to convert our .proto file into a .cs file, we have to add the following item to the .csproj:

<ItemGroup>
  <Protobuf Include="proto\vehicle.proto" />
</ItemGroup>
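
If the project only needs the generated message classes and no gRPC service stubs, the item can optionally be restricted with the GrpcServices attribute that Grpc.Tools supports:

<ItemGroup>
  <!-- Generate only the message classes, no gRPC client/server stubs -->
  <Protobuf Include="proto\vehicle.proto" GrpcServices="None" />
</ItemGroup>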

When the project is compiled, the code generator produces the .cs file inside the project's obj folder, as shown in the following image.

Image by the author

Next, let's see how the code changes, first for the producer:

var producer = new ProducerBuilder<string, T>(producerConfig)
    .SetValueSerializer(new ProtobufSerializer<T>(schemaRegistry))
    .Build();
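
Note that with Protobuf the value type must be the class generated from the .proto file, since Confluent's Protobuf serializer works with Google.Protobuf message types. Bound to the concrete class, the builder would look like this:

// Vehicle is the class generated from vehicle.proto by Grpc.Tools.
var producer = new ProducerBuilder<string, Vehicle>(producerConfig)
    .SetValueSerializer(new ProtobufSerializer<Vehicle>(schemaRegistry))
    .Build();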

As well as the consumer:

var consumer = new ConsumerBuilder<string, T>(consumerConfig)
    .SetValueDeserializer(new ProtobufDeserializer<T>().AsSyncOverAsync())
    .SetErrorHandler((_, e) => Console.WriteLine($"Error: {e.Reason}"))
    .Build();

Of course, just as we created the Schema Registry client when working with JSON, we have to do the same here.

If we go back into the Control Center, we can see the topic registered with a Protobuf schema:

Image by the author

Avro

To work with Avro, we need to include the "Confluent.SchemaRegistry.Serdes.Avro" NuGet package.

dotnet add package Confluent.SchemaRegistry.Serdes.Avro --version 2.1.1

Once again we need to define a schema, this time in Avro, and use a generator to produce the C# class we will work with.

{
    "namespace": "SchemaRegistryExamples.Avro",
    "name": "Vehicle",
    "type": "record",
    "fields": [
        { "name": "registration", "type": "string" },
        { "name": "speed", "type": "int" },
        { "name": "coordinates", "type": "string" }
    ]
}

In our case, we must install the avrogen tool, if it is not already installed, and execute the following commands:

# To install the tool
dotnet tool install --global Apache.Avro.Tools

# To convert the Avro schema into a C# class
avrogen -s Vehicle.avsc . --namespace "SchemaRegistryExamples.Avro:AvroConsole.Entity"
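
The --namespace argument maps the Avro namespace to a C# namespace, so the generated class ends up as AvroConsole.Entity.Vehicle. Classes generated by avrogen implement ISpecificRecord, which is what the Avro serializer and deserializer work with. A minimal instantiation sketch, assuming the property names follow the field names in the .avsc:

using AvroConsole.Entity;   // namespace chosen by the mapping above

// Build a record to publish; properties mirror the fields defined in Vehicle.avsc.
var vehicle = new Vehicle
{
    registration = "1234-ABC",
    speed = 90,
    coordinates = "40.4168,-3.7038"
};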

Perfect, we now have everything we need to display the producer and consumer code.

For the producer, it is very similar to the rest:

var producer = new ProducerBuilder<string, T>(producerConfig)
    .SetValueSerializer(new AvroSerializer<T>(schemaRegistry))
    .Build();

And for the consumer:

var consumer = new ConsumerBuilder<string, T>(consumerConfig)
    .SetValueDeserializer(new AvroDeserializer<T>(schemaRegistry).AsSyncOverAsync())
    .SetErrorHandler((_, e) => Console.WriteLine($"Error: {e.Reason}"))
    .Build();

As in the previous cases, it is necessary to create the Schema Registry client.

Finally, we go back into the Control Center and see the Avro schema registered for our topic.

Image by the author

Conclusion

The choice of data format will depend on the specific use case and the needs of the system in question. In general, Avro is recommended for systems that must maintain forward and backward compatibility, JSON for use cases where simplicity and human readability are important, and Protobuf for systems that need to send large amounts of data efficiently.

It is important to note that, regardless of the data format used, using a schema registry such as Confluent Schema Registry is a best practice to ensure data validation and compatibility between the different systems interacting in the ecosystem.

The complete code can be found at the following GitHub link.
