EXPEDIA GROUP TECHNOLOGY — DATA

Safety Considerations When Using Enums in Avro Schemas

The use of enums is usually a good practice, but this is not always the case when using Apache Avro

Elliot West
Expedia Group Technology

--

Photo by Oleg Magni from Pexels

Apache Avro is commonly used in both batch and real-time data systems to describe extensible and dependable data schemas. Avro enables the creation of data systems in which data types can evolve without impacting producers or consumers. However, it is not without its issues. One particular pain point concerns the enum data type — it clearly offers great utility for sets of known values with low cardinality, but it can cause issues if not applied with due consideration.

Fortunately, later versions of Avro have introduced features to address these challenges; in fact, this is a solved issue if you are using Avro 1.9.0 or later everywhere. The difficulties you may experience with enums will vary considerably depending on the versions of Avro you are using, the rate of change of symbols in your enums, and the number of consumers that use the schemas containing those enums.

This article describes the key challenges that enums present, and how you can deal with them in both earlier and later versions of Avro.

TL;DR

If you just require concise guidance on enum usage without any explanation — scroll to the bottom of this article.

The problem

Avro is much loved because of its schema compatibility features. However, it turns out that use of the enum type can erode the utility of these features. As an analogy, let’s first consider the flexibility of Avro’s record field. As a user, I can define a field, and as long as I provide a default value, I can be certain that:

  • Older records not containing the field will be readable by a schema that declares the field — the reader will see the default value (aka backwards compatibility)
  • Older readers consuming events containing a value for the new field will simply ignore it (aka forwards compatibility; see the sketch after this list)
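To make this concrete, here is a minimal sketch of both directions using the fastavro Python library (pip install fastavro); the Order schema and its field names are invented for illustration:

import io
from fastavro import parse_schema, schemaless_reader, schemaless_writer

v1 = parse_schema({
    "type": "record", "name": "Order",
    "fields": [{"name": "id", "type": "string"}],
})
v2 = parse_schema({
    "type": "record", "name": "Order",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "channel", "type": "string", "default": "web"},
    ],
})

# Backwards compatibility: a v1 record read with the v2 schema
# yields the field default for the missing field.
buf = io.BytesIO()
schemaless_writer(buf, v1, {"id": "o-1"})
buf.seek(0)
print(schemaless_reader(buf, v1, v2))  # {'id': 'o-1', 'channel': 'web'}

# Forwards compatibility: a v2 record read with the v1 schema
# simply skips the unknown field.
buf = io.BytesIO()
schemaless_writer(buf, v2, {"id": "o-2", "channel": "app"})
buf.seek(0)
print(schemaless_reader(buf, v2, v1))  # {'id': 'o-2'}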

Now, an enum is a little like a record in that it is also extensible — we can add symbols to it much as we can add fields to a record. So what happens when we do that? Let’s assume we add a new symbol to an enum that is the type of a field in a record.

  • Older records not containing the new symbol will be readable by the schema that contains the new symbol — the set of all possible symbols in the old records is a subset of the set of symbols defined in the latest schema.
  • Older readers consuming events containing the new symbol will fail — they have not seen the symbol before and do not know how to handle it!

Extending enums breaks forwards compatibility. What this means in practice is that your consumers will not be able to consume any events containing the new symbol until they’ve had their schema definition updated.
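Here is a sketch of that failure mode, again using fastavro with invented names. The writer’s enum has gained a symbol "c", while the reader still uses the old schema and declares no fallback:

import io
from fastavro import parse_schema, schemaless_reader, schemaless_writer

def wrapper_schema(symbols):
    # A record with a single enum-typed field; no symbol default.
    return parse_schema({
        "type": "record", "name": "Wrapper",
        "fields": [{
            "name": "my_field",
            "type": {"type": "enum", "name": "MyEnum", "symbols": symbols},
        }],
    })

old = wrapper_schema(["a", "b"])
new = wrapper_schema(["a", "b", "c"])

buf = io.BytesIO()
schemaless_writer(buf, new, {"my_field": "c"})
buf.seek(0)
try:
    schemaless_reader(buf, new, old)
except Exception as e:
    # The reader cannot resolve "c" against its own symbol list.
    print(f"read failed: {e}")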

Worst case scenario

Imagine that you are an organisation with thousands of schemas. You decide, for reasons of consistency and to enable platform-level functionality, to include a common header record in every single record that is produced in the org. You design a common header type that can be imported into the schemas of all of the data applications in your org. Perhaps you include a trace ID, a timestamp, and other fields. These are probably not useful to business consumers’ applications (which might care more about the domain-specific fields), but are of great value to platform systems. Now, imagine that in that header you include a source field that takes one of three values. Sensibly, you encode this as an enum.
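Such a header might look something like this hypothetical sketch, expressed as a fastavro-style Python dict (the names are illustrative, not an actual production schema):

common_header = {
    "type": "record",
    "name": "EventHeader",
    "fields": [
        {"name": "trace_id", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {
            # The problematic field: a closed set of known sources.
            "name": "source",
            "type": {
                "type": "enum",
                "name": "Source",
                "symbols": ["WEB", "MOBILE", "PARTNER"],
            },
        },
    ],
}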

This all works very well until you need to add a new source. Doing so introduces a new symbol to the enum and necessitates a release of the common header type, and of all the schemas that import that type (no small task). When we start producing events containing the new symbol, we start to see problems. Applications that have not updated their schemas start rejecting records that contain the new symbol — even though the field containing the symbol is never used by the application. The applications do not recognise the symbol and cannot selectively deserialise individual fields within the record — they must unpack it in its entirety, including the common header, even if the application does not use it.

Clearly we need to update the applications to use the latest schema; then they’ll understand the new symbol and be able to deserialise the events. So what is the set of applications that we need to update? It includes any application that consumes records containing the common header — and our ambition was to include the header in every record. In short, it is every consumer in the org — that’s a lot of applications!

How later versions of Avro fix the problem

Avro added the idea of a default symbol in 1.9.0. This is much like a default field value: it provides the reader with a fallback symbol to use if it encounters a symbol that it does not recognise, and it is complementary to field defaults. With this primitive, we get everything we need to achieve forwards compatibility with extensible enums — using Avro 1.9.0+ and adding default symbols fixes the problems associated with extending enums. Consider the example below:

{
  "type": "record",
  "name": "MyRecord",
  "fields": [
    {
      "name": "my_field",
      "type": {
        "type": "enum",
        "name": "MyEnum",
        "symbols": [
          "a",
          "b",
          "Unknown"
        ],
        /*
         * Symbol default - for forwards compatibility -
         * new in Avro 1.9.0
         */
        "default": "Unknown"
      },
      /*
       * Field default - for backwards compatibility
       */
      "default": "Unknown"
    }
  ]
}

Here we specify two defaults:

  • A field default — informs a reader to use the default value if the field is not present in the record. Also informs older readers that they can skip this field if it is not present in their schema but is present in the event.
  • Since 1.9.0: A symbol default — informs the reader to use the default symbol if they have read a symbol that they do not recognise.

Note that a symbol default can be added retrospectively, and is also tolerated (but ignored) by earlier versions of Avro. Therefore, in cases where we really have to use enums, we should adopt the practice of adding symbol defaults to our schemas so that they become useful as our systems converge on Avro 1.9.0+.
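Here is a sketch of the happy path, assuming a fastavro version that implements the Avro 1.9 enum-default resolution rule. The writer’s enum has gained a symbol "c"; the reader, still on the old schema, falls back to "Unknown" instead of failing:

import io
from fastavro import parse_schema, schemaless_reader, schemaless_writer

def record_schema(symbols):
    # As in the example above: a symbol default plus a field default.
    return parse_schema({
        "type": "record", "name": "MyRecord",
        "fields": [{
            "name": "my_field",
            "type": {
                "type": "enum", "name": "MyEnum",
                "symbols": symbols, "default": "Unknown",
            },
            "default": "Unknown",
        }],
    })

old = record_schema(["a", "b", "Unknown"])
new = record_schema(["a", "b", "c", "Unknown"])

buf = io.BytesIO()
schemaless_writer(buf, new, {"my_field": "c"})
buf.seek(0)
print(schemaless_reader(buf, new, old))  # {'my_field': 'Unknown'}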

Guidance

If you are not able to use Avro 1.9.0+ everywhere in your pipelines, here are some guidelines on how and when to use enums in Avro schemas, and some alternatives.

When you should avoid enums

  • You are creating a widely used common type — use a string instead.
  • You are unaware of the complete set of consumers who will need to read your type — use a string instead (see the sketch below).
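For instance, the source field from the earlier header could be modelled as a plain string; a hypothetical sketch, with validation of allowed values moving into application code:

# A string field never rejects unseen values, so adding a new source
# cannot break old readers. The trade-off is that consumers must
# validate the value themselves.
source_field = {"name": "source", "type": "string", "default": "unknown"}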

When you can use enums

  • You have full control of all of your consumers, and can orchestrate the update of their schemas when needed.
  • You are confident that your type will only ever touch systems using Avro 1.9.0+, and you understand the default symbol feature.
  • You aren’t sure which Avro versions are in use, but are confident that your enum has known fixed cardinality, and will never have new symbols added.
  • You aren’t sure which Avro versions are in use, but are confident that you will never require forwards compatibility (landing data to batch only for example).

If you found this article useful or use Apache Avro in your projects, look out for two upcoming articles: one describing how to handle those inevitable situations where you wish to evolve your schemas in an incompatible way, and one covering commonly asked Avro questions.
