EXPEDIA GROUP TECHNOLOGY — DATA

Handling Incompatible Schema Changes with Avro

What should you do when you need to make a breaking change to your data model?

Elliot West
Expedia Group Technology

Apache Avro has a notion of schema compatibility that lets us determine whether a schema is compatible with one or more earlier or later schemas under a given compatibility constraint. That we can have compatible changes necessarily implies that we can also have incompatible ones. In such cases, what can we do to achieve these breaking changes while minimising disruption to consumers, be they stream or batch?

A breaking change means a carefully orchestrated migration and the disruption that comes with it. I therefore suggest that breaking changes should be avoided whenever possible, even if that means the desired end-state schema can only be achieved with some compromise. It turns out that, depending on the compatibility mode, it is possible to achieve at least a functionally equivalent schema, if not something that resembles the desired state, through a sequence of managed compatible changes.

This article demonstrates how an example incompatible change can be implemented as a sequence of compatible changes, with varying degrees of success.

A breaking change

Suppose we have a business requirement to change a string field containing a composite full name into a field holding a record that encapsulates the separate name elements. The transition from string to record is clearly a breaking change:

Current state

record Person {
  string name; // example: "Joan Smith"
}

Desired end-state

record Person {
  Name name;
}
record Name {
  string first_name; // example: "Joan"
  string last_name; // example: "Smith"
}
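
To make the incompatibility concrete, here is a minimal Python sketch of the rule that decides whether a reader can decode a writer's data. It is an illustration only: `can_read` is a hypothetical helper, not part of the Avro library, and it deliberately ignores type promotion, unions and aliases. Schemas are written in Avro's JSON form as Python dicts.

```python
# Simplified model of Avro schema resolution (hypothetical helper, not the Avro library).
# A schema is either a primitive type name ("string") or a record dict in Avro's JSON form.

def can_read(reader, writer):
    """Return True if data written with `writer` can be decoded with `reader`."""
    if isinstance(reader, str) or isinstance(writer, str):
        # Primitives must match exactly in this sketch (no type promotion).
        return reader == writer
    if reader["name"] != writer["name"]:
        return False
    writer_fields = {f["name"]: f for f in writer["fields"]}
    for field in reader["fields"]:
        written = writer_fields.get(field["name"])
        if written is None:
            # Field absent from the data: the reader needs a default to fill it in.
            if "default" not in field:
                return False
        elif not can_read(field["type"], written["type"]):
            return False
    return True

NAME = {"type": "record", "name": "Name", "fields": [
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": "string"},
]}
current = {"type": "record", "name": "Person", "fields": [
    {"name": "name", "type": "string"},
]}
desired = {"type": "record", "name": "Person", "fields": [
    {"name": "name", "type": NAME},
]}

backward_ok = can_read(desired, current)  # new readers, old data
forward_ok = can_read(current, desired)   # old readers, new data
print(backward_ok, forward_ok)  # False False
```

Under this simplified model the change fails in both directions, so no compatibility mode would accept it as a single step.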

Compatible change sequence

We now describe the steps required to arrive at a schema that is functionally equivalent to the desired end-state, if not aesthetically so. By avoiding a breaking change we can minimise the interruptions to consumers that would otherwise be caused by migrations between incompatible versions of a dataset.

Step 1 — Add a default

Fields that have a default can be removed, so we add a default now to allow the field to be removed later. Choose a default value that can hold no current meaning in consumer systems and that can later be used to identify the field as deprecated. Continue to populate the field with data for your consumers.

Note: this step is not required for BACKWARDS or BACKWARDS_TRANSITIVE, where fields may be removed without defaults.

record Person {
  string name = "<DEPRECATED>"; // example: "Joan Smith"
}
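
The reason the default matters can be sketched in a few lines of Python. This is a simplified model, not the Avro library: `read_record` is a hypothetical helper, and the datum is modelled as a dict keyed by field name, whereas real Avro resolves the binary data against the writer's schema.

```python
# Hypothetical sketch of how a reader fills in a default for a field missing from the data.
# Models Avro's record resolution for flat records only; not the Avro library.

MISSING = object()

def read_record(reader_fields, datum):
    """Project a decoded datum (dict) onto the reader's fields, applying defaults."""
    result = {}
    for field in reader_fields:
        value = datum.get(field["name"], MISSING)
        if value is MISSING:
            if "default" not in field:
                raise ValueError(f"no value and no default for field {field['name']!r}")
            value = field["default"]
        result[field["name"]] = value
    return result

# Reader schema after step 1: `name` carries the sentinel default.
step1_fields = [{"name": "name", "type": "string", "default": "<DEPRECATED>"}]
# Reader schema before step 1: no default.
original_fields = [{"name": "name", "type": "string"}]

# A record written by a future producer that has dropped `name`.
future_datum = {"person_name": {"first_name": "Joan", "last_name": "Smith"}}

print(read_record(step1_fields, future_datum))  # {'name': '<DEPRECATED>'}
try:
    read_record(original_fields, future_datum)
except ValueError as err:
    print("original reader fails:", err)
```

A consumer still on the step 1 schema keeps working after the field is dropped and sees the sentinel value, while a consumer on the original schema cannot decode the record at all.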

Step 2 — Introduce the new field (possibly with a default)

We now introduce the field we want in our end state. We cannot use the desired field name yet, however, because it is still taken by the old field. Additionally, for compatibility modes other than FORWARDS or FORWARDS_TRANSITIVE we must provide a default value. The producer should populate both fields, the existing one and the new one, with valid data. Now tell all consumers that they should start using the new field. When they are all doing so, you can move on to the next step.

Note: If using FORWARDS_TRANSITIVE or FULL_TRANSITIVE, this is the best outcome you can expect.

record Person {
  string name = "<DEPRECATED>"; // example: "Joan Smith"
  Name person_name = {"first_name":"<NOT_IN_USE>","last_name":"<NOT_IN_USE>"};
}
record Name {
  string first_name; // example: "Joan"
  string last_name; // example: "Smith"
}
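
During this step the producer writes both fields and consumers migrate at their own pace. The Python sketch below shows one way that could look. All function names are hypothetical, records are plain dicts rather than Avro objects, and the rule for splitting a full name is an assumption made purely for illustration.

```python
# Hypothetical producer and consumer helpers for the dual-write period in step 2.
NOT_IN_USE = {"first_name": "<NOT_IN_USE>", "last_name": "<NOT_IN_USE>"}

def split_full_name(full_name):
    """Naive split on the last space; an assumed rule, real names need more care."""
    first, _, last = full_name.strip().rpartition(" ")
    if not first:
        # Single-token name: treat it as the first name.
        first, last = last, ""
    return {"first_name": first, "last_name": last}

def to_person(full_name):
    """Producer side: populate both the old and the new field with valid data."""
    return {"name": full_name, "person_name": split_full_name(full_name)}

def display_name(person):
    """Consumer side: prefer the new field, fall back while it holds the sentinel."""
    new = person.get("person_name", NOT_IN_USE)
    if new != NOT_IN_USE:
        return f"{new['first_name']} {new['last_name']}".strip()
    return person["name"]

person = to_person("Joan Smith")
print(person["person_name"])  # {'first_name': 'Joan', 'last_name': 'Smith'}
print(display_name(person))   # Joan Smith
```

The sentinel default lets a consumer tell a record written before the new field existed apart from one where the producer populated it.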

Step 3 — Remove old field

Because our old field has a default and no consumers are using it any more, we can now remove it.

Note: If using BACKWARDS_TRANSITIVE, this is the best outcome you can expect.

record Person {
  Name person_name = {"first_name":"<NOT_IN_USE>","last_name":"<NOT_IN_USE>"};
}
record Name {
  string first_name; // example: "Joan"
  string last_name; // example: "Smith"
}

Step 4 — Remove default

Now we can remove the default from the new field. Note that this step does not apply to FORWARDS, as the field can be declared in step 2 without a default.

record Person {
  Name person_name;
}
record Name {
  string first_name; // example: "Joan"
  string last_name; // example: "Smith"
}

Step 5 — Rename field

And finally, we can effectively rename it by providing an alias with the desired name:

record Person {
  @aliases(["name"])
  Name person_name;
}
record Name {
  string first_name; // example: "Joan"
  string last_name; // example: "Smith"
}
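
Aliases take effect when a reader matches its fields against the writer's. The Python sketch below illustrates the idea. `find_writer_field` is a hypothetical helper, not the Avro library, and trying the field's own name before its aliases is an assumption of this sketch.

```python
# Hypothetical sketch of alias-based field matching during schema resolution.

def find_writer_field(reader_field, writer_fields):
    """Match a reader field to a writer field by name first, then by the reader's aliases."""
    by_name = {f["name"]: f for f in writer_fields}
    for candidate in [reader_field["name"], *reader_field.get("aliases", [])]:
        if candidate in by_name:
            return by_name[candidate]
    return None

# The step 5 reader field: named `person_name`, aliased to `name`.
reader_field = {"name": "person_name", "aliases": ["name"], "type": "Name"}

# A writer that uses the interim name, and one that uses the desired name.
interim_writer = [{"name": "person_name", "type": "Name"}]
renamed_writer = [{"name": "name", "type": "Name"}]

print(find_writer_field(reader_field, interim_writer)["name"])  # person_name
print(find_writer_field(reader_field, renamed_writer)["name"])  # name
```

Either writer resolves to the same reader field, which is what makes the alias behave like a rename from the reader's point of view.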

Compatibility Modes

This section summarises the step sequences that can be applied in each compatibility mode, and the best achievable outcome schema in each case. Note that while the final schema may not be as succinct as the desired end-state schema, we have avoided a great deal of the disruption that an incompatible change would otherwise have caused.

[Figure: Key for symbols used in transitions table]
[Figure: Transitions by compatibility mode]
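
How far the sequence can progress in each mode can also be worked out with a small Python sketch. It is a simplified model rather than a real schema registry check: each schema is reduced to a mapping from field name to whether the field has a default, field types are ignored because no step changes one, and every step is applied in every mode even where the article marks it as optional. The helper names are hypothetical.

```python
# Hypothetical, simplified model of Avro compatibility checks; not the Avro library.
# A schema here is just a dict: field name -> True if the field declares a default.
# Field types are ignored because no step in the sequence changes a field's type.

HISTORY = [
    {"name": False},                       # current state
    {"name": True},                        # step 1: add a default
    {"name": True, "person_name": True},   # step 2: add the new field
    {"person_name": True},                 # step 3: remove the old field
    {"person_name": False},                # step 4: remove the default
    {"person_name": False},                # step 5: add an alias (no effect on these checks)
]

def can_read(reader, writer):
    """A reader can decode the writer's data if every reader field is written or defaulted."""
    return all(name in writer or has_default for name, has_default in reader.items())

def compatible(new, old, mode):
    backward = can_read(new, old)  # new schema reads old data
    forward = can_read(old, new)   # old schema reads new data
    if mode.startswith("BACKWARDS"):
        return backward
    if mode.startswith("FORWARDS"):
        return forward
    return backward and forward    # FULL and FULL_TRANSITIVE

def last_reachable_step(mode):
    """Walk the sequence and return the last step that passes the mode's check."""
    for step in range(1, len(HISTORY)):
        # Transitive modes compare against every earlier version, others only the previous one.
        earlier = HISTORY[:step] if mode.endswith("_TRANSITIVE") else [HISTORY[step - 1]]
        if not all(compatible(HISTORY[step], old, mode) for old in earlier):
            return step - 1
    return len(HISTORY) - 1

MODES = ["BACKWARDS", "BACKWARDS_TRANSITIVE", "FORWARDS",
         "FORWARDS_TRANSITIVE", "FULL", "FULL_TRANSITIVE"]
results = {mode: last_reachable_step(mode) for mode in MODES}
for mode, step in results.items():
    print(f"{mode}: step {step}")
```

On this model the non-transitive modes reach step 5, BACKWARDS_TRANSITIVE stops at step 3, and FORWARDS_TRANSITIVE and FULL_TRANSITIVE stop at step 2, which agrees with the notes attached to the individual steps.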

Final results

These are the best achievable outcomes available for each compatibility mode.

Backwards

record Person {
  @aliases(["name"])
  Name person_name;
}
record Name {
  string first_name; // example: "Joan"
  string last_name; // example: "Smith"
}

Backwards Transitive

record Person {
  Name person_name = {"first_name":"<NOT_IN_USE>","last_name":"<NOT_IN_USE>"};
}
record Name {
  string first_name; // example: "Joan"
  string last_name; // example: "Smith"
}

Forwards

record Person {
  @aliases(["name"])
  Name person_name;
}
record Name {
  string first_name; // example: "Joan"
  string last_name; // example: "Smith"
}

Forwards Transitive

record Person {
  string name = "<DEPRECATED>"; // example: "Joan Smith"
  Name person_name = {"first_name":"<NOT_IN_USE>","last_name":"<NOT_IN_USE>"};
}
record Name {
  string first_name; // example: "Joan"
  string last_name; // example: "Smith"
}

Full

record Person {
  @aliases(["name"])
  Name person_name;
}
record Name {
  string first_name; // example: "Joan"
  string last_name; // example: "Smith"
}

Full Transitive

record Person {
  string name = "<DEPRECATED>"; // example: "Joan Smith"
  Name person_name = {"first_name":"<NOT_IN_USE>","last_name":"<NOT_IN_USE>"};
}
record Name {
  string first_name; // example: "Joan"
  string last_name; // example: "Smith"
}

If you found this article useful, or you use Apache Avro in your projects, check out my post on the topic of Avro enums and other commonly asked Avro questions.
