Evolving Schemas with Schema Registry

This article explores Schema Registry compatibility modes and how to evolve schemas according to them.

Thiago Cordon
Data Arena
12 min read · Mar 6, 2021

--

Photo by ian dooley on Unsplash

People who work with data know how painful it can be when an unexpected change is made in a data source. When that happens, you can spend hours adjusting downstream processes that are not prepared to deal with such a change, and it can also hurt the quality of the data you are delivering.

Fortunately, we have mechanisms to control these changes and mitigate their impact on downstream processes. In my last article on this topic, I discussed an approach to merging different schemas in Spark; in this article, I will show how to evolve schemas in a compatible way using Confluent Schema Registry.

To demonstrate the concept, I’ll create different schemas and check their compatibility according to the compatibility options available in Confluent Schema Registry.

Let’s consider that we have the following scenario:

Confluent Schema Registry — Image provided by the author.

Some important concepts to clarify:

  • Topic: contains messages and each message is a key-value pair. Both key and value can be serialized as AVRO, JSON, or Protobuf.
  • Schema: is the definition of the data structure. A topic can have schemas for keys, values, or both.
  • Subject: represents a scope in which schemas can evolve. The schema versions are maintained under subjects.
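By default, Schema Registry derives subject names from the topic name (the TopicNameStrategy): a topic’s key schema lives under the subject `<topic>-key` and its value schema under `<topic>-value`. A minimal sketch, using a hypothetical topic name:

```python
def default_subjects(topic: str) -> tuple[str, str]:
    """Subject names produced by Schema Registry's default TopicNameStrategy."""
    return (f"{topic}-key", f"{topic}-value")

# "customer" is an illustrative topic name.
print(default_subjects("customer"))  # ('customer-key', 'customer-value')
```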

The Schema Registry can be configured to validate schemas as producers submit messages; the validation follows a compatibility mode, which is explained later in this article.

The schema validation can be enabled with the following topic property:

  • Enable schema validation for value: confluent.value.schema.validation
  • Enable schema validation for key: confluent.key.schema.validation
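As a sketch of how these properties can be applied when creating a topic programmatically (the config keys are the ones above; the topic name and the guarded AdminClient usage are illustrative and assume the confluent-kafka package and a local broker):

```python
def schema_validation_config(value: bool = True, key: bool = False) -> dict:
    """Topic-level properties that enable broker-side schema validation.

    These properties require Confluent Server; plain Apache Kafka ignores them.
    """
    return {
        "confluent.value.schema.validation": str(value).lower(),
        "confluent.key.schema.validation": str(key).lower(),
    }


if __name__ == "__main__":
    # Assumes confluent-kafka is installed and a broker runs on localhost.
    from confluent_kafka.admin import AdminClient, NewTopic

    admin = AdminClient({"bootstrap.servers": "localhost:9092"})
    topic = NewTopic(
        "customer",  # hypothetical topic name
        num_partitions=1,
        replication_factor=1,
        config=schema_validation_config(),
    )
    admin.create_topics([topic])
```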

This code shows how to configure it when creating a topic via Python, or you can do it via the Confluent Control Center UI.

Schema validation config in Confluent Control Center UI — Image provided by the author.

Transitive x Non-transitive compatibility modes

The Confluent Schema Registry compatibility modes are divided into two groups that you can choose when you set compatibility:

  • Transitive: means that the new schema is checked against all previous schema versions.
  • Non-transitive: means that the new schema is checked against the last schema version only.
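The compatibility mode can be set per subject through Schema Registry’s REST API (`PUT /config/{subject}`). A minimal stdlib-only sketch that builds the request; the base URL and subject name are illustrative:

```python
import json
import urllib.request

VALID_MODES = {
    "BACKWARD", "BACKWARD_TRANSITIVE",
    "FORWARD", "FORWARD_TRANSITIVE",
    "FULL", "FULL_TRANSITIVE", "NONE",
}


def compatibility_request(base_url: str, subject: str, mode: str) -> urllib.request.Request:
    """Build the PUT /config/{subject} request that sets a subject's compatibility mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown compatibility mode: {mode}")
    return urllib.request.Request(
        url=f"{base_url}/config/{subject}",
        data=json.dumps({"compatibility": mode}).encode(),
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Assumes Schema Registry on localhost:8081; subject name is illustrative.
    req = compatibility_request("http://localhost:8081", "customer-value", "BACKWARD_TRANSITIVE")
    print(urllib.request.urlopen(req).read())
```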

Schema compatibility tests

To test the schema compatibility, I will use these schemas. They were created as AVRO schemas and the code can be found here.

Schemas used to test — Image provided by the author.

Two notes about the schemas:

  • avro_schema 2: the default value is assigned to this field when the data being deserialized does not contain it.
  • avro_schema 6: the alias defined in this schema is an alternate name for the field. identifier is the alternate name for the identifier_new column.
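To make these two notes concrete, here is an illustrative reconstruction of the relevant parts of the schemas. The field names follow the article, but the record name, field types, and the default value are assumptions:

```python
import json

# Sketch of avro_schema2: "date" carries a default, so data written without
# that field can still be deserialized with this schema.
avro_schema2 = {
    "type": "record",
    "name": "customer",  # assumed record name
    "fields": [
        {"name": "identifier", "type": "string"},
        {"name": "first_name", "type": "string"},
        # The default is used when the incoming data lacks this field.
        {"name": "date", "type": "string", "default": ""},
    ],
}

# Sketch of avro_schema6: "identifier" is an alternate name, so data written
# with the old field name resolves to identifier_new.
avro_schema6 = {
    "type": "record",
    "name": "customer",
    "fields": [
        {"name": "identifier_new", "type": "string", "aliases": ["identifier"]},
    ],
}

print(json.dumps(avro_schema2, indent=2))
```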

To check the compatibility between the schemas, I created a Python application that tests the schemas against a list of compatibility modes. The schemas are passed to the check_schema_compatibility function as a dictionary where the key is the schema name and the value is the AVRO schema (more details here).

For non-transitive compatibility modes, the application compares the first schema in the dictionary against each of the other schemas. For transitive compatibility modes, the application tries to evolve the schemas in the sequence in which they appear in the schema dictionary.
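Schema Registry exposes a compatibility-test endpoint (`POST /compatibility/subjects/{subject}/versions/latest`) that such a checker can call for each candidate schema. A stdlib-only sketch of the idea, not the article’s actual application; names are illustrative:

```python
import json
import urllib.error
import urllib.request


def compat_request(base_url: str, subject: str, schema: dict) -> urllib.request.Request:
    """Request for Schema Registry's compatibility-test endpoint.

    The schema is JSON-encoded twice: once as the Avro schema itself,
    once inside the request envelope, as the API expects.
    """
    return urllib.request.Request(
        url=f"{base_url}/compatibility/subjects/{subject}/versions/latest",
        data=json.dumps({"schema": json.dumps(schema)}).encode(),
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )


def check_schema_compatibility(base_url: str, subject: str, schemas: dict) -> dict:
    """Report, per schema name, whether the registry accepts the schema
    under the subject's currently configured compatibility mode."""
    results = {}
    for name, schema in schemas.items():
        try:
            with urllib.request.urlopen(compat_request(base_url, subject, schema)) as resp:
                results[name] = json.load(resp).get("is_compatible", False)
        except urllib.error.HTTPError:
            results[name] = False
    return results
```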

To run the application, follow these steps:

  • Start the Confluent Docker containers — among other services, it will start Kafka, Schema Registry, ZooKeeper, and the Confluent Control Center UI, available at http://localhost:9021/, where you can manage the Kafka cluster.
  • Wait a few minutes until all the services are running. You can check the services using docker ps; the output should resemble the following image. If a service has not started, run docker-compose up -d again.
  • Build the Python image which will be used to run the application.
  • Then, run the application using this image and connect it to the network used by the Confluent containers. Define the topic name in the topic_name parameter and a list of compatibility modes to be tested in the compatibility_type_list parameter.
  • The output will show the compatibility check according to a compatibility mode.

Backward compatibility mode

BACKWARD compatibility means that consumers using the new schema can read data produced with the last schema version, but compatibility with versions before the last one is not assured.

In this compatibility mode, the consumer schema should be upgraded first.

Backward compatibility — Image provided by the author.

Running our tests with BACKWARD compatibility mode, we have the following output:

  • avro_schema1 x avro_schema2: Compatible because the avro_schema2 has a default value in the date column. Consumers with the schema avro_schema2 can read the avro_schema1.
  • avro_schema1 x avro_schema3: Not compatible because the avro_schema3 doesn’t have a default value in the date column.
  • avro_schema1 x avro_schema4: Not compatible because the column identifier changed from string to int.
  • avro_schema1 x avro_schema5: Not compatible because consumers with the schema avro_schema5 will not find the column identifier_new in avro_schema1.
  • avro_schema1 x avro_schema6: Compatible because the column identifier_new has an alias to the column identifier. Although the first_name is missing in the avro_schema6, it is backward compatible because consumers with the schema avro_schema6 can read the messages generated with the avro_schema1 schema — the consumer will just ignore that field.
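The rules behind these results come from Avro schema resolution: when reading, the consumer applies its own (reader) schema to the writer’s data, ignoring extra writer fields, falling back to defaults for missing ones, and honoring aliases. A toy model of that resolution — not Schema Registry’s actual implementation, and with illustrative field values:

```python
def resolve(record: dict, reader_fields: list) -> dict:
    """Toy Avro-style resolution for the backward case: a consumer applies
    its reader schema to data written with an older writer schema."""
    out = {}
    for f in reader_fields:
        # Honor aliases: accept an old name if the current one is absent.
        for name in [f["name"]] + f.get("aliases", []):
            if name in record:
                out[f["name"]] = record[name]
                break
        else:
            if "default" not in f:
                raise ValueError(f"cannot resolve field {f['name']!r}")
            out[f["name"]] = f["default"]
    return out


# Data produced with avro_schema1 (identifier, first_name):
old_record = {"identifier": "42", "first_name": "Ada"}

# Consumer upgraded to avro_schema2 — works because "date" has a default:
schema2_fields = [
    {"name": "identifier", "type": "string"},
    {"name": "first_name", "type": "string"},
    {"name": "date", "type": "string", "default": ""},
]
print(resolve(old_record, schema2_fields))

# Consumer upgraded to avro_schema6 — works because of the alias;
# first_name is simply ignored:
schema6_fields = [
    {"name": "identifier_new", "type": "string", "aliases": ["identifier"]},
]
print(resolve(old_record, schema6_fields))
```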

Backward_transitive compatibility mode

The BACKWARD_TRANSITIVE compatibility means that consumers using the new schema can read data produced by all previous schema versions.

In this compatibility mode, the consumer schema should be upgraded first.

Backward_transitive compatibility — Image provided by the author.

Running our tests with BACKWARD_TRANSITIVE compatibility mode, we have the following output:

  • avro_schema2: Compatible with avro_schema1 because of the default in the date field.
  • avro_schema3: Not compatible with previous schemas avro_schema1 and avro_schema2 because the field date doesn’t exist in the avro_schema1. As it’s not compatible, it’ll not be registered in the Schema Registry.
  • avro_schema4: Not compatible with previous schemas avro_schema1 and avro_schema2 because the identifier column has a different data type — int instead of string.
  • avro_schema5: Not compatible with previous schemas because of the column identifier_new which is not present in the previously registered schemas avro_schema1 and avro_schema2.
  • avro_schema6: Compatible because the column identifier_new has an alias to the column identifier. Although the columns first_name and date are missing in the avro_schema6, it is backward_transitive compatible because consumers with schema avro_schema6 can read the messages generated with the schemas avro_schema1 and avro_schema2 — the consumer will just ignore those fields.

Forward compatibility mode

FORWARD compatibility means that consumers using the last schema can read data produced with the new schema version, but compatibility with versions before the last one is not assured.

In this compatibility mode, the producer schema should be upgraded first.

Forward compatibility — Image provided by the author.

Running our tests with FORWARD compatibility mode, we have the following output:

  • avro_schema1 x avro_schema2: Compatible because consumers with the avro_schema1 can read the avro_schema2 even with a new column — it will just ignore the new column.
  • avro_schema1 x avro_schema3: Compatible. Even though the avro_schema3 has a new column (date), the consumer with the avro_schema1 can read the avro_schema3 — it will just ignore the new column.
  • avro_schema1 x avro_schema4: Not compatible because the column identifier changed from string to int.
  • avro_schema1 x avro_schema5: Not compatible because the column identifier in avro_schema1 is not present in avro_schema5.
  • avro_schema1 x avro_schema6: Not compatible because the column first_name in avro_schema1 is not present in avro_schema6.
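The forward direction is the mirror image: an old consumer reads newer data by keeping only the fields its own schema knows about and ignoring anything new. A toy illustration, with field values that are illustrative only:

```python
def project(record: dict, reader_field_names: list) -> dict:
    """Toy forward-compatibility check: an old consumer keeps only the
    fields its schema knows about and ignores new ones. Fails if a field
    the old consumer requires is missing from the newer data."""
    missing = [n for n in reader_field_names if n not in record]
    if missing:
        raise ValueError(f"fields missing for old consumer: {missing}")
    return {n: record[n] for n in reader_field_names}


# Data produced with avro_schema3 (adds a "date" column, no default):
new_record = {"identifier": "42", "first_name": "Ada", "date": "2021-03-06"}

# A consumer still on avro_schema1 simply drops the unknown column:
print(project(new_record, ["identifier", "first_name"]))
```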

Forward_transitive compatibility mode

The FORWARD_TRANSITIVE compatibility means that consumers using the last schema can read data produced by the new schema and all previous schema versions.

In this compatibility mode, the producer schema should be upgraded first.

Forward_transitive compatibility — Image provided by the author.

Running our tests with FORWARD_TRANSITIVE compatibility mode, we have the following output:

  • avro_schema2: Compatible because the consumer with the avro_schema1 can read the avro_schema2 even with a new column — it will just ignore the new column.
  • avro_schema3: Compatible because a consumer with the avro_schema2 can read the avro_schema1 (the date column has a default value) and can also read the avro_schema3, which has the same columns as avro_schema2.
  • avro_schema4: Not compatible because the data type of the column identifier is different (int instead of string) and the date column is missing.
  • avro_schema5: Not compatible because the columns identifier and date are missing.
  • avro_schema6: Not compatible because the columns identifier, first_name, and date are missing.

Full compatibility mode

FULL compatibility means that a consumer with the new schema can read data produced by the last schema and a consumer with the last schema can also read data produced by the new schema. Fully compatible schemas are both backward and forward compatible, but compatibility with versions before the last one is not assured.

In this compatibility mode, the schema upgrade can be done in any order (consumer or producer).

Full compatibility — Image provided by the author.

Running our tests with FULL compatibility mode, we have the following output:

  • avro_schema1 x avro_schema2: Compatible because it is forward compatible (consumers with the avro_schema1 can read the avro_schema2 even with a new column — it will just ignore the new column) and backward compatible (consumers with the avro_schema2 can read the avro_schema1 because the avro_schema2 has a default value in the date column).
  • avro_schema1 x avro_schema3: Not compatible. Although it’s forward compatible (consumers with the avro_schema1 can read the avro_schema3 ignoring the date column), it’s not backward compatible (consumers with the avro_schema3 cannot read data from avro_schema1 because avro_schema1 doesn’t have the date column).
  • avro_schema1 x avro_schema4: Not compatible. It’s not forward compatible (consumers with the avro_schema1 cannot read the avro_schema4 because the identifier column has different datatypes) nor backward compatible (consumers with the avro_schema4 cannot read the avro_schema1 because the identifier column has different datatypes).
  • avro_schema1 x avro_schema5: Not compatible. It’s not forward compatible (consumers with the avro_schema1 cannot read the avro_schema5 because the column identifier doesn’t exist in this schema) nor backward compatible (consumers with the avro_schema5 cannot read the avro_schema1 because the column identifier_new doesn’t exist in this schema).
  • avro_schema1 x avro_schema6: Not compatible. It’s not forward compatible (consumers with the avro_schema1 cannot read the avro_schema6 because the column first_name doesn’t exist in this schema) but it’s backward compatible (consumers with the avro_schema6 can read the avro_schema1 because the column identifier_new has an alias to the column identifier and although the first_name is missing in the avro_schema6, it works because the consumer will just ignore that field).

Full_transitive compatibility mode

The FULL_TRANSITIVE compatibility has the same rules as the FULL compatibility except that the new schema needs to be compatible with all schema versions.

In this compatibility mode, the schema upgrade can be done in any order (consumer or producer).

Full_transitive compatibility — Image provided by the author.

Running our tests with FULL_TRANSITIVE compatibility mode, we have the following output:

  • avro_schema2: Compatible because it is forward_transitive (consumers with the avro_schema1 can read the avro_schema2 even with a new column — it will just ignore the new column) and backward_transitive compatible (consumers with the avro_schema2 can read the avro_schema1 because the avro_schema2 has a default value in date column).
  • avro_schema3: Not compatible. Although it’s forward_transitive compatible (a consumer with the avro_schema2 can read the avro_schema1, since the date column has a default value, and can also read the avro_schema3, which has the same columns as avro_schema2), it’s not backward_transitive compatible (consumers with the avro_schema3 cannot read data from avro_schema1 because the field date doesn’t exist in this schema).
  • avro_schema4: Not compatible. It’s not forward_transitive compatible (consumers with avro_schema2 or avro_schema1 cannot read the avro_schema4 because the data type of the column identifier is different) nor backward_transitive compatible (consumers with avro_schema4 cannot read previous schemas avro_schema1 and avro_schema2 because the identifier column has a different data type — int instead of string).
  • avro_schema5: Not compatible. It’s not forward_transitive compatible (consumers with avro_schema2 or avro_schema1 cannot read the avro_schema5 because the column identifier is missing) nor backward_transitive compatible (consumers with avro_schema5 will not find the column identifier_new in avro_schema1 and avro_schema2).
  • avro_schema6: Not compatible. It’s not forward_transitive compatible (consumers with avro_schema2 or avro_schema1 cannot read the avro_schema6 because the columns identifier and first_name are missing), but it’s backward_transitive compatible (consumers with avro_schema6 can read the avro_schema2 and avro_schema1 because in avro_schema6 the column identifier_new has an alias to the column identifier; the columns first_name and date would be ignored by the consumer).

Final considerations

As shown in this article, the Schema Registry provides an automated way of controlling changes in schemas, assuring compatibility and providing an organized way of evolving schemas with less impact on downstream processes.

The compatibility modes explained here can be summarized in the following table.

Compatibility modes summary — Image provided by the author.

Please, feel free to reach out if you have other ideas on how to solve this problem or if you have an interesting use case to share. We are stronger together. 😃
