Data Schema Management
What, Why and How.
Change is the only constant. Event data schema management is important, yet often ignored. This post discusses the components of event data and how to manage changes to it.
Why Data Schema Management?
In a system built on the CQRS (Command Query Responsibility Segregation) pattern, data producers and consumers are different processors mediated by a shared Event Store. Consumers need assurance that the events they consume from the Store have the schema they expect. After the initial schema is defined, applications may need to evolve it over time. When this happens, it’s critical that downstream consumers can seamlessly handle data encoded with both the old and the new schema.
A classic example comes from an Uber presentation in which one of the producers changed the data type of “created_at” from String to Float, and the consumer suffered.
Gwen Shapira summarizes the importance of schemas as follows:
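To make the failure mode concrete, here is a minimal sketch of such a break. Only the “created_at” field and its String-to-Float change come from the example above; the event shape, the ISO-8601 string format, and the function name are assumptions made for illustration.

```python
from datetime import datetime

def parse_created_at(event: dict) -> datetime:
    """Consumer written against the original schema: created_at is an ISO-8601 string."""
    return datetime.fromisoformat(event["created_at"])

# Hypothetical events: the second producer switched created_at to epoch seconds.
old_event = {"id": 1, "created_at": "2017-03-01T12:00:00+00:00"}
new_event = {"id": 2, "created_at": 1488369600.0}

results = []
for event in (old_event, new_event):
    try:
        results.append(parse_created_at(event))
    except TypeError as exc:
        # The consumer only discovers the schema change at runtime.
        results.append(f"consumer failed: {exc}")

print(results)
```

Nothing in the producer's code path fails here; the error surfaces only in the consumer, which is why the schema has to be managed as a shared contract.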
What is Schema Evolution?
When a data format or schema changes, a corresponding change to the consuming clients needs to happen. However, there are times when the client code cannot change instantaneously, or its owners choose not to change it. This means that old and new versions of the code, and old and new data formats, may all coexist in the system at the same time. For the system to continue running smoothly, we need to maintain compatibility in both directions:
Backward compatibility: a reader using a newer schema version should be able to read data written with an older version.
Forward compatibility: a reader using an older schema version should be able to read data from producers that are using a newer version.
Full compatibility: both backward and forward compatibility apply.
Introducing Apache Avro:
Apache Avro is a binary encoding format developed within the Apache Hadoop project. It has the following characteristics:
- It uses JSON for defining schemas
- It serializes data in a compact binary format (see Appendix 3 for the size savings with Avro)
- Avro binary encoding is often more compact than other formats like MessagePack, Thrift and Protocol Buffers
- It is programming language agnostic
- Avro assumes that the schema is present when reading and writing files, usually by embedding the schema (or a pointer to the schema) in the files themselves
- Unlike Thrift or Protocol Buffers, Avro has no tag numbers in the schema
Here’s an example of an Avro schema:
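As an illustration, the sketch below defines a small record schema and hand-encodes one record following the binary encoding rules in the Avro specification (zigzag variable-length longs, length-prefixed UTF-8 strings, fields concatenated in schema order). The schema itself, including the record and field names, is hypothetical, and the encoder handles only the two primitive types used here; a real application would use an Avro library.

```python
import json

# Hypothetical schema for illustration; record and field names are assumptions.
USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "namespace": "example.events",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": "long"},
    ],
}

def encode_long(n: int) -> bytes:
    """Avro long: zigzag-map the sign, then emit 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

ENCODERS = {"long": encode_long, "string": encode_string}

def encode_record(schema: dict, record: dict) -> bytes:
    """Concatenate field values in schema order; no field names or tags are written."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]]) for field in schema["fields"]
    )

record = {"name": "Martin", "favorite_number": 1337}
avro_bytes = encode_record(USER_SCHEMA, record)
json_bytes = json.dumps(record).encode("utf-8")

print(json.dumps(USER_SCHEMA, indent=2))
print(avro_bytes.hex(), len(avro_bytes), "bytes as Avro vs", len(json_bytes), "bytes as JSON")
```

Because the payload carries neither field names nor tag numbers, it can only be decoded by someone who has the schema it was written with, which is why the Writer's schema has to travel with the data or be retrievable.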
How Does Avro Handle Schema Evolution?
It starts with the Writer’s schema, which is contained in the Avro container file. When an application wants to decode some Avro data, it expects the data to be in some schema (known as the Reader’s schema), which may be generated from a version of the schema at application build time.
The key idea with Avro Schema Evolution is that the Reader’s schema doesn’t have to be the same as the Writer’s schema; the two just need to be compatible! When data is read (i.e. decoded), the Avro library resolves the differences by looking at the Writer’s schema and the Reader’s schema side by side and translating the data from the Writer’s schema into the Reader’s schema.
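The translation step can be sketched for the simplest case, a flat record whose fields are matched by name. This is a simplified illustration of the record rules in the Avro specification, not the library's implementation: it ignores aliases, type promotion, and nested types, and the schemas and the function name are hypothetical.

```python
def resolve_record(datum: dict, writer_schema: dict, reader_schema: dict) -> dict:
    """Translate a decoded record from the writer's schema into the reader's schema.

    Simplified: flat records only, fields matched by name.
    """
    writer_fields = {field["name"] for field in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Present in both schemas: take the written value.
            result[name] = datum[name]
        elif "default" in field:
            # Only the reader knows this field: fall back to its default.
            result[name] = field["default"]
        else:
            raise ValueError(f"reader field {name!r} was not written and has no default")
    # Fields that only the writer knows are dropped.
    return result

# Hypothetical schemas: the reader dropped "name" and added an optional "email".
WRITER = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
]}
READER = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"], "default": None},
]}

resolved = resolve_record({"id": 7, "name": "Ann"}, WRITER, READER)
print(resolved)
```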
What are the Schema Compatibility Rules?
The Apache Avro web site details how differences between the Writer’s and Reader’s schemas are resolved. Gwen Shapira offers simpler compatibility tips in her Simplify Governance of Streaming Data presentation:
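As a rough illustration of how such rules can be checked mechanically, the sketch below tests one rule from the Avro specification for record fields: a reader can cope with a field the writer did not write only if the reader's schema gives that field a default. It is deliberately stricter than Avro about types (it requires identical types and ignores promotions such as int to long), and all function names and schemas are hypothetical.

```python
def can_read(reader_schema: dict, writer_schema: dict) -> bool:
    """True if the reader can decode the writer's records (flat records only)."""
    writer_types = {f["name"]: f["type"] for f in writer_schema["fields"]}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_types:
            if writer_types[name] != field["type"]:
                return False  # simplification: no type promotion allowed
        elif "default" not in field:
            return False  # missing from the data and nothing to fall back on
    return True

def is_backward_compatible(new: dict, old: dict) -> bool:
    """New readers can read data written with the old schema."""
    return can_read(reader_schema=new, writer_schema=old)

def is_forward_compatible(new: dict, old: dict) -> bool:
    """Old readers can read data written with the new schema."""
    return can_read(reader_schema=old, writer_schema=new)

def record(*fields: dict) -> dict:
    return {"type": "record", "name": "Event", "fields": list(fields)}

ID = {"name": "id", "type": "long"}
CREATED = {"name": "created_at", "type": "string"}
old = record(ID, CREATED)

candidates = {
    "add field with default": record(ID, CREATED, {"name": "email", "type": ["null", "string"], "default": None}),
    "add field without default": record(ID, CREATED, {"name": "email", "type": "string"}),
    "remove field without default": record(ID),
    "change created_at to double": record(ID, {"name": "created_at", "type": "double"}),
}

for label, new in candidates.items():
    print(label, is_backward_compatible(new, old), is_forward_compatible(new, old))
```

Under these simplified rules, adding a field with a default is the only change in the list that is compatible in both directions.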
How to Enforce Compatibility Rules? Introducing Schema Registry
We now know that if a schema evolves in a compatible manner, applications will be resilient to schema changes. This brings us to the next question: how can we enforce that schema evolution complies with the compatibility rules?
The Confluent team open sourced the Schema Registry. It provides a RESTful interface for storing and retrieving Avro schemas. It stores a versioned history of all schemas, provides multiple compatibility settings, and allows schemas to evolve according to the configured compatibility setting (i.e. forward compatible, backward compatible, fully compatible, or NONE). Additionally, it provides serializers that plug into Kafka clients and handle schema storage and retrieval for Kafka messages sent in the Avro format.
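The sketch below shows how the registry's REST interface might be driven using only the Python standard library. The endpoint paths and content type follow the Schema Registry REST API as I understand it; the base URL, the subject name, and the schema are assumptions for illustration. The requests are only built here, not sent, so the example runs without a registry.

```python
import json
import urllib.request

CONTENT_TYPE = "application/vnd.schemaregistry.v1+json"

def _request(method: str, url: str, payload: dict) -> urllib.request.Request:
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": CONTENT_TYPE},
        method=method,
    )

def set_compatibility_request(base_url: str, subject: str, level: str):
    """Configure the compatibility setting for one subject, e.g. BACKWARD or FULL."""
    return _request("PUT", f"{base_url}/config/{subject}", {"compatibility": level})

def check_compatibility_request(base_url: str, subject: str, schema: dict):
    """Ask whether a schema is compatible with the latest registered version."""
    url = f"{base_url}/compatibility/subjects/{subject}/versions/latest"
    return _request("POST", url, {"schema": json.dumps(schema)})

def register_request(base_url: str, subject: str, schema: dict):
    """Register a new schema version; the schema travels as a JSON-encoded string."""
    return _request("POST", f"{base_url}/subjects/{subject}/versions", {"schema": json.dumps(schema)})

def send(request: urllib.request.Request) -> dict:
    """Send a request to a running registry and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

# Assumed values: a local registry and a hypothetical subject and schema.
BASE_URL = "http://localhost:8081"
SUBJECT = "orders-value"
SCHEMA = {"type": "record", "name": "Order", "fields": [{"name": "id", "type": "long"}]}

for req in (
    set_compatibility_request(BASE_URL, SUBJECT, "BACKWARD"),
    check_compatibility_request(BASE_URL, SUBJECT, SCHEMA),
    register_request(BASE_URL, SUBJECT, SCHEMA),
):
    print(req.get_method(), req.full_url)
```

With a registry running, passing each request to `send` would apply the setting, run the check, and register the schema; registration of an incompatible schema is rejected by the registry, which is what enforces the rules.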
Together with an open source Web UI from Landoop, it makes managing schema changes intuitive.
The figure below depicts the data production flow with Confluent Schema Registry.
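One detail of this flow worth spelling out is how a message refers to its schema. As documented for Confluent's Avro serializer, each Kafka message is prefixed with a magic byte (0) and the 4-byte big-endian ID that the registry assigned to the schema, followed by the Avro binary payload; the consumer reads the ID, fetches the Writer's schema from the registry, and decodes. The sketch below shows only that framing. The in-memory registry is a hypothetical stand-in, and the payload is placeholder bytes rather than the output of a real Avro encoder.

```python
import json
import struct

MAGIC_BYTE = 0  # marks the Confluent wire format

class InMemoryRegistry:
    """Hypothetical stand-in for Schema Registry: one integer ID per distinct schema."""

    def __init__(self):
        self._ids = {}

    def register(self, schema: dict) -> int:
        key = json.dumps(schema, sort_keys=True)
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix the payload with the magic byte and a 4-byte big-endian schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes) -> tuple:
    """Split a framed message into (schema_id, avro_payload)."""
    if len(message) < 5 or message[0] != MAGIC_BYTE:
        raise ValueError("not a Schema Registry framed message")
    (schema_id,) = struct.unpack(">I", message[1:5])
    return schema_id, message[5:]

registry = InMemoryRegistry()
schema = {"type": "record", "name": "User", "fields": [{"name": "name", "type": "string"}]}
payload = b"\x0cMartin"  # placeholder for an Avro-encoded record

# Producer side: obtain the schema ID, then frame the payload.
message = frame(registry.register(schema), payload)

# Consumer side: recover the ID, which it would use to look up the writer's schema.
schema_id, body = unframe(message)
print(schema_id, body)
```

Sending a 4-byte ID instead of the full schema keeps each message small while still letting any consumer find the exact schema the data was written with.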
What about JSON Schema and Data Validation?
JSON Schema is a specification for defining the structure of JSON data. It is usually used for documentation and validation. Like an Avro schema document, a JSON Schema document is written in JSON. However, this is where the similarity ends. Specifically, here are the differences between the two:
- JSON Schema does not assist with data serialization and deserialization the way Avro does
- At the time of writing, JSON Schema is not understood by Confluent Schema Registry for schema storage and enforcement of evolution rules
- A JSON Schema document can specify expected JSON data values (e.g. by regular expression) so that tools can be built to validate JSON data values; an Avro schema document, on the other hand, is intended for schema validation, not data validation
Avro Enhancement to Support Data Validation
As suggested in a Stack Overflow discussion, we can add data validation to Avro via “logicalType” annotations. Here is an example:
With regular expression validation added to Avro schema documents, we can build tools that validate data values during data production, in addition to the regular Avro schema check. A submission is rejected if either validation fails:
In this post, we discussed the importance of data schema management. Apache Avro not only reduces data storage size and serialization/deserialization time, but also offers a strong notion of forward and backward compatibility. By enhancing Avro with regular expression data validation, and combining it with open source products like Confluent Schema Registry and the associated Landoop UI, we can build applications that are resilient to change.
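A minimal sketch of such a validation tool follows. It relies on the Avro specification's rule that implementations ignore unknown logical types and fall back to the underlying type, so extra attributes on a field's type do not affect serialization. The logical type name “email”, the “pattern” attribute, the schema, and the validator are all hypothetical; they are not part of Avro and not necessarily what the original discussion proposed.

```python
import re

# Hypothetical schema: "logicalType" and "pattern" are custom annotations.
SIGNUP_SCHEMA = {
    "type": "record",
    "name": "Signup",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "email", "type": {
            "type": "string",
            "logicalType": "email",
            "pattern": r"[^@\s]+@[^@\s]+\.[a-z]+",
        }},
    ],
}

# Only the primitive types used in this sketch.
PY_TYPES = {"string": str, "long": int}

def validate(schema: dict, record: dict) -> list:
    """Return a list of error messages; an empty list means the record is accepted."""
    errors = []
    for field in schema["fields"]:
        name, ftype = field["name"], field["type"]
        spec = ftype if isinstance(ftype, dict) else {"type": ftype}
        value = record.get(name)
        # Step 1: the regular schema check on the underlying type.
        if not isinstance(value, PY_TYPES[spec["type"]]):
            errors.append(f"{name}: expected {spec['type']}")
            continue
        # Step 2: the added value check, applied only where a pattern is declared.
        pattern = spec.get("pattern")
        if pattern and not re.fullmatch(pattern, value):
            errors.append(f"{name}: value does not match {pattern}")
    return errors

good = {"user_id": 42, "email": "user@example.com"}
bad_value = {"user_id": 42, "email": "not-an-email"}
bad_type = {"user_id": "42", "email": "user@example.com"}

for candidate in (good, bad_value, bad_type):
    problems = validate(SIGNUP_SCHEMA, candidate)
    print("accepted" if not problems else f"rejected: {problems}")
```

A producer would call `validate` before serializing and reject the submission when the returned list is non-empty.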