EXPEDIA GROUP TECHNOLOGY — DATA
Practical Schema Evolution with Avro
Beyond the specification — applying schema evolution to real-world situations
I’ve been fortunate enough to be involved in the design and implementation of a number of large-scale data platforms used by thousands of customers, be they analysts, engineers, or scientists. A common trait shared by these platforms is that they use Apache Avro to provide strong schema-on-write data contracts. Importantly, Avro also offers customers the ability to safely and confidently evolve their data model definitions. After all, we should expect the shape of data to change over time.
However, in my experience, our customers often struggled to engage with Avro’s schema evolution strategies and rules. I empathise completely; in spite of my 5+ years of experience with Apache Avro, I still find myself walking through the specification to figure out exactly what behaviour I should expect. Asking schema compatibility questions ahead of time can help us avoid getting caught up in evolutionary dead-ends — a place where one is tempted by breaking changes. Fear not — escape is always possible.
With these complexities in mind, I’ve compiled a comprehensive set of Avro evolution Q&As based on our platform customer feedback. My intention is to create a practical, user-focused resource that serves as an accessible companion to the official Avro specification.
I’ve organised the questions into a number of categories:
- Compatibility modes (`BACKWARD`, `FORWARD`, `FULL`, `*_TRANSITIVE`)
- Record mutation (`field`, `alias`)
- Field mutation (`null`, `default`)
- Type promotion (`int`, `long`, `float`, `double`, `string`, `bytes`)
- Type extension (`union`, `enum`)
- Collection types (`array`, `map`)
- Other
Please reach out for corrections and additions.
Compatibility modes
Compatibility modes describe sets of guarantees between schemas; declaring whether or not one schema can read a record written by another. They are not explicitly declared in the core Avro project but are instead defined in the popular open-source Schema Registry project created by Confluent. These incredibly useful schema concepts are now applicable to many schema systems — not just Avro. They are used throughout this document.
The specific semantics of each compatibility mode are already described in detail elsewhere. However, practical information on when to apply the different compatibility modes is lacking, so I’ll try to remedy that here.
When should I use transitive compatibility modes?
Almost certainly always; transitivity is required when consumers must be compatible with all prior or future versions of a schema. In long-lived, large-scale data systems this is very likely to be the case.
There may be niche circumstances where data retention or validity periods are short enough that transitivity is not a concern, because consumers only ever have one prior/future schema in context at any given time. I’ve not experienced those.
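To make the difference concrete, here is a minimal Python sketch (my own illustration, not part of any Avro or Schema Registry API) of which (reader, writer) version pairs a `BACKWARD`-style check covers under each mode:

```python
def backward_check_pairs(latest: int, transitive: bool) -> list[tuple[int, int]]:
    """(reader, writer) version pairs validated when registering version `latest`.

    Non-transitive: the new schema is checked only against its predecessor.
    Transitive: the new schema is checked against every earlier version.
    """
    if transitive:
        return [(latest, writer) for writer in range(1, latest)]
    return [(latest, latest - 1)] if latest > 1 else []
```

Under `BACKWARD`, registering version 4 checks it only against version 3; under `BACKWARD_TRANSITIVE`, it is checked against versions 1, 2, and 3 — hence the stronger guarantee for long-lived data.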
Which compatibility mode should I use?
In the context of long-lived and large-scale data systems you have three main choices:
- When you need to access a long tail of historical data: data-at-rest systems typically access records written with multiple versions of a schema. Therefore `BACKWARD_TRANSITIVE` is the preferred compatibility mode for systems that involve batch consumers, such as data lakes.
- When you do not own/control your streaming consumers: streaming systems typically benefit from `FORWARD_TRANSITIVE` compatibility when consumers are highly decoupled from producers; consumers continue to operate independently of changes made to the producer’s schema. If you do not own, control, or have the ability to influence the development life-cycle of your consumers, then this is essential.
- If you are subject to both of the above situations, then `FULL_TRANSITIVE` is recommended.
When must I update my consumer’s schema?
That depends on the compatibility mode:
- `BACKWARD` compatibility mode: in practice this should be done before, or at the same time as, updating the producer’s schema. Technically, the schemas of all consumers must at the very least be updated prior to reading any records produced with a new schema. A `BACKWARD`-compatible schema can read only records written with earlier versions of the schema, not new and as yet unseen versions.
- `FORWARD` or `FULL` compatibility mode: there is no requirement to update a given consumer’s schema; it will always be able to read records written with new and as yet unseen versions of the schema.
When must I update my producer’s schema?
That depends on the compatibility mode:
- `FORWARD` compatibility mode: you must update the schema of all of your producers before updating those of any consumers, as a producer has the potential to write records that cannot be read by earlier versions of the schema.
- `BACKWARD` or `FULL` compatibility mode: there is no requirement to update a given producer’s schema; it will always write records that are compatible with later versions of the schema.
Record mutation
Actions that pertain to `record` types.
Can I add a new field?
Only if the new field declares a `default` or you are using a `FORWARD` compatibility mode.
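As an illustration (the record and field names are hypothetical), here are two versions of a record schema, expressed as Python dicts mirroring the Avro JSON, where the added field declares a `default` so that readers on the new schema can still process old records:

```python
user_v1 = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
]}

# v2 adds a field WITH a default; a reader using v2 substitutes
# "unknown" when it reads records written with v1.
user_v2 = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "country", "type": "string", "default": "unknown"},
]}
```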
Can I remove an existing field?
Only if the field has a `default` or you are using a `BACKWARD` compatibility mode. If relying on a `default` in a `TRANSITIVE` mode, the default must have been declared since the very first version of the schema, so that all potential consumers are able to substitute a value when reading new records that no longer contain the field.
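A sketch of why the default matters (field names are hypothetical): a reader schema that still declares the removed field substitutes the default when reading new records:

```python
# Reader schema fields; "nickname" has carried a default since version 1.
reader_fields = [
    {"name": "id", "type": "long"},
    {"name": "nickname", "type": "string", "default": ""},
]

# A record written with a newer schema that dropped "nickname".
new_record = {"id": 42}

# Resolution: take the written value if present, else the reader's default.
resolved = {f["name"]: new_record.get(f["name"], f.get("default"))
            for f in reader_fields}
# resolved == {"id": 42, "nickname": ""}
```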
Can I rename an existing field?
Almost. You can always achieve something similar to a rename with an `alias`, using any compatibility mode.
Field mutation
Actions that pertain to the fields of `record` types.
Can I add a default value to an existing field?
This is always supported.
Can I remove a default from an existing field?
This is permitted under all non-`TRANSITIVE` compatibility modes (whose use is atypical). It is also always permitted under the more typical `FORWARD_TRANSITIVE` compatibility mode.
You cannot remove a `default` from a field under `BACKWARD_TRANSITIVE` or `FULL_TRANSITIVE` modes unless the field has existed without interruption since the very first version of the schema. If at some time a version of the schema enabled a producer to write a record without the field, then the field must forevermore declare a `default` value to act as a substitute in this instance.
Can I make an existing field nullable?
Only with `BACKWARD` compatibility modes.
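One common way to do this (the field name is hypothetical) is to widen the field’s type to a union with `null` first and a `null` default, which keeps the change `BACKWARD` compatible:

```python
field_v1 = {"name": "email", "type": "string"}

# The union places "null" first so the field's default (None/null) matches
# the first branch, as the Avro specification requires for union defaults.
field_v2 = {"name": "email", "type": ["null", "string"], "default": None}
```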
Type promotion
Actions that pertain to primitive types.
Can I promote a numeric type?
If you are using a `BACKWARD` compatibility mode then the following promotions are allowed; otherwise promotion is not permitted:
`int` → `long` → `float` → `double`
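The promotion chain can be sketched as a small helper (my own illustration, not a library function):

```python
# Widening order for Avro's numeric primitives.
PROMOTION_ORDER = ["int", "long", "float", "double"]

def can_promote(writer_type: str, reader_type: str) -> bool:
    """True if a value written as writer_type can be read as reader_type."""
    return (writer_type in PROMOTION_ORDER
            and reader_type in PROMOTION_ORDER
            and PROMOTION_ORDER.index(writer_type) <= PROMOTION_ORDER.index(reader_type))
```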
Can I demote a numeric type?
If you are using a `FORWARD` compatibility mode then the following demotions are allowed; otherwise demotion is not permitted:
`double` → `float` → `long` → `int`
Can I change `string` to `bytes` or vice versa?
This is always supported as the two types are interchangeable.
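The reason this works is that Avro serialises `string` values as UTF-8 bytes on the wire, so both types share an encoding:

```python
text = "schema evolution"
raw = text.encode("utf-8")          # what a `bytes` reader receives
assert raw.decode("utf-8") == text  # what a `string` reader recovers
```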
Type extension
Actions that pertain to other extensible types.
Can I make a type a union or add another type to an existing union?
Only with `BACKWARD` compatibility modes.
Can I remove a type from an existing union?
No, this is never permitted.
Can I add another value to an existing enum?
This is possible with `BACKWARD` compatibility modes, with careful consideration. Please familiarise yourself with the safety considerations when using enums in Avro schemas.
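One of those considerations: since Avro 1.9, an `enum` may declare a `default` symbol that readers substitute when a newer writer schema introduces a symbol they do not know. A sketch of that resolution behaviour (symbol names are hypothetical):

```python
# Reader's enum schema, mirroring the Avro JSON.
colour = {
    "type": "enum",
    "name": "Colour",
    "symbols": ["RED", "GREEN", "BLUE", "UNKNOWN"],
    "default": "UNKNOWN",
}

def resolve_symbol(written: str, reader_enum: dict) -> str:
    """Fall back to the reader's default for symbols it has never seen."""
    if written in reader_enum["symbols"]:
        return written
    return reader_enum["default"]
```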
Can I remove a value from an existing enum?
No, this is never permitted.
Collection types
Actions that pertain to collection types.
Can I evolve the elements of an array?
It all depends on the element type of the `array`; to know for sure, determine whether the element type can be evolved in the manner that you require.
Can I evolve the values of a map?
It all depends on the value type of the `map`; to know for sure, determine whether the value type can be evolved in the manner that you require.
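In other words, a collection evolves exactly as its element or value type does. A sketch applying the numeric promotion rule to an `array` (my own helper, considering only identical types and numeric promotion, not the full resolution rules):

```python
PROMOTION_ORDER = ["int", "long", "float", "double"]

def array_compatible(writer: dict, reader: dict) -> bool:
    """True if the reader array can read the writer array."""
    w, r = writer["items"], reader["items"]
    if w == r:
        return True
    return (w in PROMOTION_ORDER and r in PROMOTION_ORDER
            and PROMOTION_ORDER.index(w) < PROMOTION_ORDER.index(r))
```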
Other questions
If you want to know whether you can maintain compatibility across a change not described above, you probably can’t. But try interpreting the specification and applying it to your use case.
If you found this article useful or use Apache Avro in your projects, check out these posts on Avro enum usages and how to deal with breaking changes.