Schema evolution is not that complex

Francesco Nobilia
Published in Data rocks
7 min read · Jul 3, 2020
https://stock.adobe.com/images/human-evolution/17468143

When I first started working with Apache Kafka and the Confluent Schema Registry, I did not focus enough on how to structure the Apache Avro schemata I used to represent my Kafka messages.
Kafka seemed fantastic, and I was keen to start playing with it. In just a few hours, I managed to publish and consume my first messages. In a few days, the POC was ready. A few weeks later, the code was production-ready. It was mid-2016, my Kafka solution was running smoothly in production, and I was quite happy with my small success.

At that time, there was no substantial material online about Apache Kafka and the problem of schema evolution (or maybe there was some excellent material that I did not know about back then: I have just realised that one of the masterpieces on data applications, Designing Data-Intensive Applications, was published in January 2016 😵).

A few weeks after releasing my new service using Kafka and Avro schema, the day of the first schema update quickly arrived.

https://media.giphy.com/media/3XDXN8tBv5KkjRQpJz/giphy.gif

Upon registering the latest version of my Avro schema, the Schema Registry returned the following error message:

{
  "error_code": 409,
  "message": "Schema being registered is incompatible with an earlier schema"
}
https://media.giphy.com/media/xT9IgnBQ283DrNSCli/giphy.gif

I started looking into the issue, and that’s how I came across the notion of backward and forward compatibility in the context of schema evolution.

Schema evolution

Schema evolution is a fundamental aspect of data management and, consequently, of data governance. Applications tend to evolve, and together with them, their internal data definitions need to change.
In the context of schemata, the act of changing the schema representation and releasing its new version into the system is called evolution.

When dealing with Kafka, an Avro schema is often used to define the data stored in Kafka records. Like SQL schemata in a database, Avro schemata allow developers to define the shape of the data before any record gets written into the system.
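As a running example for the rest of this post, here is a minimal Avro schema for a hypothetical user-signup event (the record name, namespace and fields are invented for illustration):

{
  "type": "record",
  "name": "UserSignedUp",
  "namespace": "com.example.events",
  "doc": "Hypothetical example schema used throughout this post.",
  "fields": [
    {"name": "user_id", "type": "string", "doc": "Unique identifier of the user."},
    {"name": "signup_timestamp", "type": "long", "doc": "Signup time as epoch milliseconds."}
  ]
}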

As you may already know, or may have guessed from the title of this post, schemata are not immutable. In a schema management system, each schema may have multiple versions. What should be immutable is a specific schema version. For instance, Confluent Schema Registry is built so that when a schema for a specific subject is initially created, it receives a unique id and a version number. After a schema evolution is successfully submitted for the same subject, the Schema Registry assigns a new identifier to the update and bumps the version number. In this way, both the original and the updated version of the schema can still be accessed. To guarantee immutability, neither version should ever be overwritten with a different representation. Both versions need to remain available at all times so that producers and consumers can adopt the latest schema version at different speeds.
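To make this concrete, here is a hedged sketch of how it looks with the Confluent Schema Registry REST API (the subject name and id below are invented). Registering a schema with POST /subjects/user-signups-value/versions returns the id under which the schema is stored:

{ "id": 42 }

After a compatible evolution is accepted for the same subject, GET /subjects/user-signups-value/versions lists both immutable versions:

[1, 2]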

This brilliant feature of supporting multiple concurrent versions for the same schema is what makes schema evolution tricky.

The impact of schema evolution tends to be overlooked until the first production issue. When evolving a schema, we should always make sure that downstream consumers can seamlessly handle data serialised with both the old and the new schema. This awareness ensures a decoupled evolution for producers and consumers and, therefore, faster product iterations while preventing painful production incidents.

So the question is: How can I seamlessly evolve my schemata? 🤔
The answer to this question is: You need to apply the compatibility rule that best fits your use case! 🧐
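If you use Confluent Schema Registry, that rule is enforced per subject through its compatibility configuration: the registry supports levels such as BACKWARD (the default), FORWARD, FULL, their *_TRANSITIVE variants and NONE. As a hedged sketch, the level for a subject can be changed by sending a JSON payload like the following with an HTTP PUT to the /config/<subject> endpoint (the subject name is again invented, e.g. /config/user-signups-value):

{ "compatibility": "FULL" }

The 409 error shown earlier is exactly what the registry returns when a newly submitted schema violates the configured rule.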

Backward and forward compatibility 👩‍🎓

Quoting Wikipedia

Backward compatibility is a property of a system, product, or technology that allows for interoperability with an older legacy system, or with input designed for such a system, especially in telecommunications and computing.

Still quoting Wikipedia

Forward compatibility is a design characteristic that allows a system to accept input intended for a later version of itself. […] A standard supports forward compatibility if a product that complies with earlier versions can “gracefully” process input designed for later versions of the standard, ignoring new parts which it does not understand.

In other words, backward compatibility means that data written with an old schema version can be read with a new schema version, and forward compatibility means that data written with a new schema version can be read with an old one. Note how backward compatibility lets consumers upgrade before producers, while forward compatibility lets producers upgrade before consumers.

A schema that is both backward and forward compatible is said to be fully compatible. Full compatibility holds when old data can be read with the new schema and new data can also be read with the old schema.

Backward and forward compatibility 👩‍💻

We examined the theory; it is now time to start with the practice.

A schema is backward compatible as long as only the following operations are performed while evolving it (an Avro sketch follows the list):
* delete fields
* add optional fields
* add mandatory fields with a meaningful default value
* convert a mandatory field into optional
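For instance, sticking with the hypothetical UserSignedUp schema introduced earlier, the following v2 adds an optional field with a default and is therefore backward compatible: a consumer reading v1 records with this new schema simply fills in the default.

{
  "type": "record",
  "name": "UserSignedUp",
  "namespace": "com.example.events",
  "doc": "Hypothetical v2: adds an optional referral_code, a backward-compatible change.",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "signup_timestamp", "type": "long"},
    {"name": "referral_code", "type": ["null", "string"], "default": null, "doc": "Added in v2; resolves to null when reading v1 data."}
  ]
}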

A schema is forward compatible as long as only the following operations are performed while evolving it (again, a sketch follows the list):
* add fields (optional or mandatory, with or without a default value)
* remove optional fields
* remove mandatory fields associated with a meaningful default value
* convert an optional field into mandatory
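Again on the hypothetical example, a v3 that drops the optional referral_code and adds a mandatory country field without a default is forward compatible: a consumer still on v1 or v2 keeps reading v3 records and simply ignores the field it does not know about. It is, however, not backward compatible, because a v3 reader has no default to fall back on when reading older records.

{
  "type": "record",
  "name": "UserSignedUp",
  "namespace": "com.example.events",
  "doc": "Hypothetical v3: drops the optional referral_code and adds a mandatory field without a default.",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "signup_timestamp", "type": "long"},
    {"name": "country", "type": "string", "doc": "New mandatory field; old readers ignore it, so the change is forward but not backward compatible."}
  ]
}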

It is worth mentioning that converting a mandatory field into an optional one, or the opposite, turning an optional field into a mandatory one, is not always possible because some serialisation frameworks do not support such a conversion.

Full compatibility

Full compatibility is the dream of every developer who tries to plan for the future while preventing headaches down the road. Like every dream, full compatibility does not come for free. The only way to have a fully compatible schema is to make every field optional or to associate a meaningful default value with every mandatory field.
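Continuing with the hypothetical schema, a fully compatible evolution could look like this, with the added field carrying a default:

{
  "type": "record",
  "name": "UserSignedUp",
  "namespace": "com.example.events",
  "doc": "Hypothetical fully compatible evolution: the added field carries a domain-driven default.",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "signup_timestamp", "type": "long"},
    {"name": "signup_channel", "type": "string", "default": "organic", "doc": "Defaults to 'organic', the value the business assumed before the field existed."}
  ]
}

Adding a field with a default is both backward compatible (new readers fill in the default for old records) and forward compatible (old readers ignore the new field), which is exactly what full compatibility requires.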

In this context, a default value should be a piece of valuable business information chosen by the developer according to the domain represented by the schema. Default values used by the serialisation framework for uninitialised fields should be avoided because they can lead to misleading behaviour.
User-definable default values are not supported by every serialisation framework. For instance, among Avro, JSON, Protobuf and Thrift, only Avro and Thrift support them.

Using only optional fields while defining a schema is a valid option (for instance, in Protobuf 3 every field is optional), but it comes at the cost of losing domain-specific cardinality checks. The schema will exclusively provide type validation.

Forward compatibility is not synonymous with extensibility

Forward compatibility is not interchangeable with extensibility. Extensibility indicates that something has been designed so that it is easy to evolve. Forward compatibility could be one of the reasons why a schema is extensible, but that does not mean that every forward-compatible schema leads to an extensible solution.

What if I need to introduce a breaking change?

https://media.giphy.com/media/HP0JqpBxLLGpO/giphy.gif

Life is not perfect, and sometimes breaking changes are unavoidable. If the schema management system admits such a change, the best thing to do is to identify all downstream consumers or upstream producers. After you have identified your stakeholders, the next step is to agree on a plan to handle the upgrade as seamlessly as possible. Sadly, there is no silver bullet for handling breaking changes. The best strategy depends on the use case and the business context.

One way to mitigate breaking changes is to ensure that the schema owner is the producer. By doing that, breaking changes become a problem owned by producers only; a consumer has no chance to introduce one. Having producers own their schemata mitigates the breaking-change issue, but it opens a new problem: how can a producer identify all downstream consumers? This question deserves a detailed answer, and we will cover it in a future post.

Conclusion 🚀

Schema evolution is not that complex. There is a simple ruleset that should be followed to evolve schemata. But even before looking into the evolution ruleset, the very first rule to apply is to think twice while crafting the first iteration of your schema. Being strategic instead of tactical while defining schemata usually goes a long way.

https://media.giphy.com/media/d3mlE7uhX8KFgEmY/giphy.gif

Acknowledgements

Thanks to Iveta Mikoczy for taking the time to review this post.
