EXPEDIA GROUP TECHNOLOGY — DATA

Practical Schema Evolution with Avro

Beyond the specification — applying schema evolution to real-world situations

Elliot West
Expedia Group Technology


Photograph of a Biston betularia peppered moth.
Biston betularia — photo by Chiswick Chap — CC SA

I’ve been fortunate enough to be involved with the design and implementation of a number of large-scale data platforms that are used by thousands of customers, be they analysts, engineers, or scientists. A common trait shared by these platforms is that they used Apache Avro to provide strong schema-on-write data contracts. Importantly, Avro also offers the ability for customers to safely and confidently evolve their data model definitions. After all, we should expect the shape of data to change over time.

However, in my experience, our customers often struggled to engage with Avro’s schema evolution strategies and rules. I empathise completely; in spite of my 5+ years of experience with Apache Avro, I still find myself walking through the specification to figure out exactly what behaviour I should expect. Asking schema compatibility questions ahead of time can help us avoid getting caught up in evolutionary dead-ends — a place where one is tempted by breaking changes. Fear not — escape is always possible.

With these complexities in mind, I’ve compiled a comprehensive set of Avro evolution Q&As based on our platform customer feedback. My intention is to create a practical, user-focused resource that serves as an accessible companion to the official Avro specification.

I’ve organised the questions into a number of categories:

  • Compatibility modes (BACKWARD, FORWARD, FULL, *_TRANSITIVE)
  • Record mutation (field, alias)
  • Field mutation (null, default)
  • Type promotion (int, long, float, double, string, bytes)
  • Type extension (union, enum)
  • Collection types (array, map)
  • Other

Please reach out for corrections and additions.

Compatibility modes

Compatibility modes describe sets of guarantees between schemas, declaring whether or not one schema can read a record written by another. They are not explicitly declared in the core Avro project but are instead defined in the popular open-source Schema Registry project created by Confluent. These incredibly useful schema concepts are now applicable to many schema systems, not just Avro. They are used throughout this document.

The specific semantics of each compatibility mode are already described in detail elsewhere. However, practical information regarding when to apply different compatibility modes is lacking, so I’ll try to remedy that here.

When should I use transitive compatibility modes?

Almost always; transitivity is required when consumers must be compatible with all prior or future versions of a schema. In long-lived, large-scale data systems this is very likely to be the case.

There may be niche circumstances where data retention or validity periods are small enough that transitivity is not a concern, because consumers only ever encounter one prior or future schema at any given time. I’ve not experienced those.
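The difference between the two families of modes can be sketched in a few lines of Python. This is a toy model rather than anything from Avro or the Schema Registry: a schema is reduced to a hypothetical dict of field name to "declares a default", and `can_read` is an assumed, simplified stand-in for real schema resolution.

```python
# Toy model (an assumption for illustration): a schema is a dict mapping
# field name -> True if the field declares a default, False otherwise.

def can_read(reader, writer):
    """A reader can decode a writer's records if every reader field is
    either present in the writer's schema or declares a default."""
    return all(name in writer or has_default
               for name, has_default in reader.items())

def is_backward(versions):
    """Non-transitive: the newest schema must read its immediate predecessor."""
    return len(versions) < 2 or can_read(versions[-1], versions[-2])

def is_backward_transitive(versions):
    """Transitive: the newest schema must read every earlier version."""
    newest = versions[-1]
    return all(can_read(newest, older) for older in versions[:-1])

v1 = {"id": False}                   # original schema
v2 = {"id": False, "email": True}    # adds "email" with a default
v3 = {"id": False, "email": False}   # drops the default from "email"
history = [v1, v2, v3]

print(is_backward(history))             # True: v3 can read v2
print(is_backward_transitive(history))  # False: v3 cannot read v1
```

The third version passes the pairwise check yet cannot read records written with the first, which is exactly the situation a long-lived data set exposes.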

Which compatibility mode should I use?

In the context of long-lived and large-scale data systems you have three main choices:

  • When you need to access a long tail of historical data — Data-at-rest systems typically access records written with multiple versions of a schema. Therefore BACKWARD_TRANSITIVE is the preferred compatibility mode for systems that involve batch consumers, such as data lakes.
  • When you do not own/control your streaming consumers — Streaming systems typically benefit from FORWARD_TRANSITIVE compatibility when consumers are highly decoupled from producers; consumers continue to operate independently of changes made to the producer’s schema. If you do not own, control, or have the ability to influence the development life-cycle of your consumers then this is essential.
  • If you are subject to both of the above situations then FULL_TRANSITIVE is recommended.

When must I update my consumer’s schema?

That depends on the compatibility mode:

  • BACKWARD compatibility mode: In practice this should be done before, or at the same time as, updating the producer’s schema. Technically, the schema of all consumers must at the very least be updated prior to reading any records produced with a new schema. A BACKWARD compatible schema can read only records written with earlier versions of the schema, not new and as yet unseen versions.
  • FORWARD or FULL compatibility mode: There is no requirement to update a given consumer’s schema; it will always be able to read records written with new and as yet unseen versions of the schema.

When must I update my producer’s schema?

That depends on the compatibility mode:

  • FORWARD compatibility: You must update the schema of all of your producers before updating those of any consumers, as a producer still using the old schema has the potential to write records that cannot be read by later versions of the schema.
  • BACKWARD or FULL compatibility: There is no requirement to update a given producer’s schema; it will always write records that are compatible with later versions of the schema.
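As a quick aide-memoire, the two answers above collapse into a small lookup. The function name `upgrade_first` and its return values are my own invention for illustration; nothing like this exists in Avro or the Schema Registry.

```python
# Which side of the pipeline must adopt a new schema version first.
# Names and return values here are hypothetical, chosen for illustration.
UPGRADE_FIRST = {
    "BACKWARD": "consumers",  # readers must know the new schema before new data arrives
    "FORWARD": "producers",   # old readers can already cope with new data
    "FULL": "either",         # both directions are guaranteed
}

def upgrade_first(mode):
    """Return which side to upgrade first for a compatibility mode.
    Transitive variants follow the same ordering as their base mode."""
    base = mode.upper().replace("_TRANSITIVE", "")
    return UPGRADE_FIRST[base]

print(upgrade_first("BACKWARD_TRANSITIVE"))  # consumers
print(upgrade_first("FORWARD"))              # producers
print(upgrade_first("FULL_TRANSITIVE"))      # either
```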

Record mutation

Actions that pertain to record types.

Can I add a new field?

Only if the new field declares a default or you are using a FORWARD compatibility mode.

Can I remove an existing field?

Only if the field has a default or you are using a BACKWARD compatibility mode. If relying on a default in a TRANSITIVE mode, the default must have been declared since the very first version of the schema so that all potential consumers are able to substitute a value on the reading of new records that no longer contain the field.
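Both answers follow from how a reader projects a writer's record onto its own fields. The sketch below is a simplified, assumed model of that step written against plain dicts; `read_record` is a hypothetical helper, not an Avro API, and it ignores types entirely.

```python
def read_record(datum, reader_fields):
    """Project a record decoded with the writer's schema onto the reader's
    fields, substituting defaults for fields the writer did not supply.
    Writer fields that the reader does not declare are simply ignored."""
    result = {}
    for field in reader_fields:
        name = field["name"]
        if name in datum:
            result[name] = datum[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError("no value or default for field %r" % name)
    return result

old_fields = [{"name": "id", "type": "long"}]
new_fields = old_fields + [{"name": "email", "type": "string", "default": ""}]
strict_fields = old_fields + [{"name": "email", "type": "string"}]  # no default

old_record = {"id": 1}
new_record = {"id": 2, "email": "a@example.com"}

# New reader, old data: the default fills the gap (BACKWARD).
print(read_record(old_record, new_fields))  # {'id': 1, 'email': ''}
# Old reader, new data: the unknown field is dropped (FORWARD).
print(read_record(new_record, old_fields))  # {'id': 2}
# New reader without a default, old data: resolution fails.
try:
    read_record(old_record, strict_fields)
except ValueError as error:
    print(error)  # no value or default for field 'email'
```

Removing a field is the mirror image: swap the roles of `old_fields` and `new_fields` and the same two reads show which direction needs the default.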

Can I rename an existing field?

Almost. You can always achieve something similar to a rename with an alias using any compatibility mode.
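To make the alias mechanism concrete, here is a minimal sketch of the lookup a reader performs. It assumes the alias is declared on the reading side's field, which is how I understand the specification; `read_with_aliases` and the field names are hypothetical.

```python
def read_with_aliases(datum, reader_fields):
    """Resolve each reader field by its name first, then by any aliases,
    falling back to the field's default if the writer supplied neither."""
    result = {}
    for field in reader_fields:
        names = [field["name"]] + field.get("aliases", [])
        match = next((n for n in names if n in datum), None)
        if match is not None:
            result[field["name"]] = datum[match]
        elif "default" in field:
            result[field["name"]] = field["default"]
        else:
            raise ValueError("cannot resolve field %r" % field["name"])
    return result

# The field was written as "surname"; the reader now calls it "family_name".
reader_fields = [
    {"name": "family_name", "type": "string", "aliases": ["surname"]},
]

print(read_with_aliases({"surname": "Smith"}, reader_fields))
# {'family_name': 'Smith'}
print(read_with_aliases({"family_name": "Jones"}, reader_fields))
# {'family_name': 'Jones'}
```

The record on disk is never rewritten; only the name the reader sees changes, which is why this is "almost" a rename.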

Field mutation

Actions that pertain to the fields of record types.

Can I add a default value to an existing field?

This is always supported.

Can I remove a default from an existing field?

This is permitted for all non-TRANSITIVE compatibility modes (whose use is atypical). It is also always permitted in the more typical FORWARD_TRANSITIVE compatibility mode.

You cannot remove a default from a field in BACKWARD_TRANSITIVE or FULL_TRANSITIVE modes unless the field has existed without interruption since the very first version of the schema. If at some time, a version of the schema enabled a producer to write a record without the field, then the field must forevermore declare a default value to act as a substitute in this instance.

Can I make an existing field nullable?

Only with BACKWARD compatibility modes.

Type promotion

Actions that pertain to primitive types.

Can I promote a numeric type?

If you are using a BACKWARD compatibility mode, then the following promotions are allowed; otherwise promotion is not permitted:

int → long → float → double

Can I demote a numeric type?

If you are using a FORWARD compatibility mode, then the following demotions are allowed; otherwise demotion is not permitted:

double → float → long → int

Can I change string to bytes or vice-versa?

This is always supported as the two types are interchangeable.
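The three answers in this section can be derived from a single table of which written types a reader may widen. The sketch below encodes the promotions listed in the Avro specification as I understand them; `readable_as` and `change_allowed` are hypothetical helpers, and the check covers a single old/new pair rather than a full version history.

```python
# writer type -> reader types it may be promoted to on read
WIDENS_TO = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}

def readable_as(writer_type, reader_type):
    """True if data written as writer_type can be read as reader_type."""
    return (writer_type == reader_type
            or reader_type in WIDENS_TO.get(writer_type, set()))

def change_allowed(old_type, new_type, mode):
    """Check one type change against a compatibility mode."""
    base = mode.upper().replace("_TRANSITIVE", "")
    backward = readable_as(old_type, new_type)  # new schema reads old data
    forward = readable_as(new_type, old_type)   # old schema reads new data
    if base == "BACKWARD":
        return backward
    if base == "FORWARD":
        return forward
    if base == "FULL":
        return backward and forward
    raise ValueError("unknown compatibility mode: %r" % mode)

print(change_allowed("int", "long", "BACKWARD"))     # True: promotion
print(change_allowed("int", "long", "FORWARD"))      # False
print(change_allowed("double", "float", "FORWARD"))  # True: demotion
print(change_allowed("string", "bytes", "FULL"))     # True: interchangeable
```

Promotion and demotion are the same rule seen from opposite ends: whichever schema does the reading must hold the wider type.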

Type extension

Actions that pertain to other extensible types.

Can I make a type a union or add another type to an existing union?

Only with BACKWARD compatibility modes.
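A rough way to see why widening a union is a BACKWARD-only change: every branch the writer might emit must be one the reader knows. The sketch below is a deliberately simplified static check of my own (exact branch matches only, no promotion); `branches` and `can_read` are hypothetical names.

```python
def branches(schema_type):
    """Treat a non-union type as a union with a single branch."""
    return schema_type if isinstance(schema_type, list) else [schema_type]

def can_read(reader_type, writer_type):
    """Every branch the writer may emit must exist in the reader's type."""
    reader_branches = branches(reader_type)
    return all(branch in reader_branches for branch in branches(writer_type))

old_type = "string"
new_type = ["null", "string"]          # the field becomes a nullable union
wider_type = ["null", "string", "long"]  # another branch is added later

print(can_read(new_type, old_type))    # True: new schema reads old data
print(can_read(old_type, new_type))    # False: old schema may meet a null
print(can_read(wider_type, new_type))  # True: new schema reads old data
print(can_read(new_type, wider_type))  # False: old schema may meet a long
```

Making an existing field nullable is the first case here, which is why it carries the same BACKWARD-only restriction.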

Can I remove a type from an existing union?

No, this is never permitted.

Can I add another value to an existing enum?

This is possible with BACKWARD compatibility modes, given careful consideration. Please familiarise yourself with “Safety considerations when using enums in Avro schemas”.
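The care is needed because of what happens when an older reader meets a symbol it has never seen. The sketch below models the resolution rule as I understand it from the specification: an unknown symbol resolves to the reader's enum default if one is declared, and is an error otherwise. `read_enum` and the symbol names are hypothetical.

```python
def read_enum(symbol, reader_symbols, default=None):
    """Resolve a written enum symbol against the reader's enum."""
    if symbol in reader_symbols:
        return symbol
    if default is not None:
        return default  # reader-declared fallback for unknown symbols
    raise ValueError("unknown enum symbol %r" % symbol)

old_symbols = ["RED", "GREEN"]
new_symbols = ["RED", "GREEN", "BLUE"]      # a value is added
safe_symbols = ["UNKNOWN", "RED", "GREEN"]  # old reader that declared a default

# New reader, old data: every old symbol is still known.
print(read_enum("GREEN", new_symbols))                     # GREEN
# Old reader with a default, new data: falls back.
print(read_enum("BLUE", safe_symbols, default="UNKNOWN"))  # UNKNOWN
# Old reader without a default, new data: fails.
try:
    read_enum("BLUE", old_symbols)
except ValueError as error:
    print(error)  # unknown enum symbol 'BLUE'
```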

Can I remove a value from an existing enum?

No, this is never permitted.

Collection types

Actions that pertain to collection types.

Can I evolve the elements of an array?

It all depends on the element type of the array; to know for sure, determine whether the element type can be evolved in the manner that you require.

Can I evolve the values of a map?

It all depends on the value type of the map; to know for sure, determine whether the value type can be evolved in the manner that you require.
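In other words, the question is answered by recursing into the collection and asking it again of the element or value type. A minimal sketch, assuming schemas expressed as plain Python values (a string for a primitive, a dict for an array or map) and handling only numeric and string/bytes promotion at the leaves; `can_read` is a hypothetical helper:

```python
# (writer type, reader type) pairs where the reader may widen what was written
PROMOTABLE = {
    ("int", "long"), ("int", "float"), ("int", "double"),
    ("long", "float"), ("long", "double"), ("float", "double"),
    ("string", "bytes"), ("bytes", "string"),
}

def can_read(reader, writer):
    """True if data written with `writer` can be read with `reader`."""
    if isinstance(reader, dict) and isinstance(writer, dict):
        if reader["type"] != writer["type"]:
            return False
        if reader["type"] == "array":
            return can_read(reader["items"], writer["items"])
        if reader["type"] == "map":
            return can_read(reader["values"], writer["values"])
        return False  # other complex types are out of scope for this sketch
    if isinstance(reader, str) and isinstance(writer, str):
        return reader == writer or (writer, reader) in PROMOTABLE
    return False

ints = {"type": "array", "items": "int"}
longs = {"type": "array", "items": "long"}
nested_old = {"type": "map", "values": {"type": "array", "items": "float"}}
nested_new = {"type": "map", "values": {"type": "array", "items": "double"}}

print(can_read(longs, ints))            # True: elements promote int -> long
print(can_read(ints, longs))            # False
print(can_read(nested_new, nested_old)) # True: rule applies at any depth
```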

Other questions

If you want to know if you can maintain compatibility across changes not described above, you probably can’t. But try interpreting the specification and applying it to your use case.

If you found this article useful or use Apache Avro in your projects, check out these posts on Avro enum usages and how to deal with breaking changes.
