Enums in distributed systems
Enums. Every software engineer has seen them. Maybe it was in your introductory programming class. Maybe it was at your first software engineering job. Or perhaps you saw them when reading an API spec or working on an open-source project. Properly-used, enums are a powerful tool for creating code that is readable while remaining resource-efficient, and they are available in almost every statically-typed language. They also eliminate an important class of programming errors caused by divergent definitions of types that draw their values from an enumerated domain of values.
In a monolithic application with an atomic deployment profile, there are few downsides to modeling a defined set of values of a particular type as an enum. But in a distributed system — the kind of system Conductor designs, deploys, and operates at large scale — enums present unique challenges in both software development and deployment. In this post, we’ll describe the kinds of development and deployment we think are ideal; point out where enumerated types can get in the way of fast and safe iteration; and suggest alternatives for situations where the downsides of enums may outweigh the benefits. In the process, we also hope to be able to save coastal cities around the world from fire-breathing monsters, because, yes, monster early-warning systems and enums may not mix as well as you might think…
Suppose that you have a system composed of a number of independently-deployable components and you want to roll out changes to the system. What options do you have, and what are their costs and benefits? (One example of this situation is a relational database and a separate application that executes queries against it, but even two processes on the same machine that need to speak the same protocol require an answer to this question.)
The simplest way of doing this might be a “migrate the world” deployment, where all the components are taken offline and all are simultaneously upgraded to the most recent version of your codebase. This has the advantage of simplicity, but it has a lot of downsides as well: an entire vertical slice of your system will be offline for some period of time, and rollback might not be straightforward. These characteristics can, in turn, lead to less frequent “kitchen sink” releases, which encourage large system changes that are less likely to be smoothly deployed — an unattractive, self-reinforcing cycle.
An alternative is to ensure that system components can be iteratively deployed so that subcomponents can be released in nearly any order. This is attractive precisely because it enables no-downtime deployments with easy rollback and independent deployment schedules. To enable this approach, a software team needs to make changes that are both backward- and forward- compatible wherever possible until deployment processes can guarantee that all dependencies have been successfully migrated. In the example of a standard relational database and application, that might include standard techniques like replacing an “ALTER COLUMN” migration with three separate migrations: ADDing a new column and migrating existing data to it, modifying the application to use the new column, and then DROPing the old column. The same idea of splitting a breaking change into several backward- and/or forward- compatible changes can be applied to many of the technologies in Conductor’s stack, from HBase to browser JavaScript. (While NoSQL stores claim to be schema-less, the reality is that code accessing them expects data to be in a particular “shape,” so changing that shape on disk often requires changing client code even if no explicit datastore migration is required.)
We run distributed systems across hundreds of machines at Conductor, so a “stop the world” deployment is highly unattractive to us. We certainly don’t want to take down any part of our stack intentionally unless absolutely necessary, so it’s very important to us to choose technologies and development approaches that allow incremental development and deployment in a distributed scenario.
Two important technologies in our toolbox that help with making changes forward- and backward- compatible are Thrift and Protocol Buffers. Both of these technologies allow for strongly-typed but simple definition of datatypes and services and include efficient on-the-wire encodings that can be read and written by a variety of languages. They are amenable to being stored in NoSQL stores as varied as HBase and MongoDB. And as importantly, these technologies are explicitly designed to support backward- and forward- compatibility — they are designed with distributed systems in mind (not surprising, considering that they originated at Facebook and Google, respectively). Both suggest a set of guidelines for how to make changes to datatype definitions so that these changes don’t break upstream or downstream systems. There are some very specific details for each of these technologies (for Thrift, see Section 5.3 of the original specification; similar details can be found for Protocol Buffers), but to summarize the most important rules:
- Don’t remove “required” fields
- Don’t add “required” fields
- Don’t change the types, field identifiers, or names of any existing fields.
The first two are pretty straightforward — consumers get confused when a field they’ve been told will always be there is removed or they’re told to expect a new field but receive input from an older client that doesn’t include that field. The third one seems straightforward: after all, you’d expect that if a client was expecting the fourth field of a message to be an encoded integer but it instead contained an encoded string that the client would get confused. But that last rule has a very specific and non-obvious edge case around enums, a native datatype that both Protocol Buffers and Thrift support.
To see how adding a value to an enum can have unanticipated effects in distributed systems, imagine that researchers have recently discovered a new dangerous creature type that they’ve been modeling as an enum:
Version 1 (Pre-discovery):
enum DangerousCreatureType { FRANKENSTEINS_MONSTER = 1, YETI = 2, LOCH_NESS_MONSTER = 3 }
Version 2 (Post-discovery):
enum DangerousCreatureType { FRANKENSTEINS_MONSTER = 1, YETI = 2, LOCH_NESS_MONSTER = 3, GODZILLA = 4 }
These may look like the same type, right? After all, they’re enums with the same name — and aren’t enums just integers under the hood? And for these reasons, your compiler will likely not flag the use of one definition of the type itself where it was expecting the other. But in fact, they are (in most important ways) different types. Most importantly, they have different data domains — values that are acceptable given the type. By adding a value, we’ve effectively changed the type of the enum.
This may be easier to see if you consider two different systems running two different versions of the code. Suppose the Tokyo Civil Defense system is running V1 of the code and waits expectantly for messages broadcast by sensors all around the city. As part of standard maintenance, the ocean sensors were just migrated to V2 of the code. One fine day those monitors start writing records that mention a large fire-breathing lizard sighted off the coast of Japan. The ocean sensor team is very happy — due to their upgrade, the sensor network can now broadcast messages that warn of Godzilla’s approach. But what about the poor V1 Civil Defense system? It’s unclear what it should do with such a message. Should it drop it? Throw an exception? Totally crash? Any of these are possible because the V1 system has received a message that doesn’t match its expectations — it knows nothing about Godzilla. And so a backwards-incompatible change inadvertently introduced by those researchers has imperiled tens of millions of lives.
While we don’t save the world from mythical creatures at Conductor, we do try to model a lot of datatypes that help marketing organizations understand and act on their natural search presence. Sometimes, we model information like the kinds of items one might see on search results as enumerated types in our Protobuf and Thrift definitions. From a data modeling perspective, this is attractive for all the reasons that using a enum is in a single monolithic application. But it’s had some important implications on our deployment approach whenever we want to add new search result types — something that can happen either when we change our analysis of search results or when search engines actually change what they display. For the reasons we’ve discussed above, those kinds of changes have required us to follow a different deployment mechanism. In these cases, we must deploy consumers of the changed records before we deploy code that produces the new records. This ensures that consumers won’t be exposed to records from a different enum domain than the one they were expecting.
We think our experience has some implications for other organizations that are considering modeling certain bits of data as enums in distributed systems. We suggest development organizations consider at least the following approaches when it comes to enumerated types in distributed systems:
- Accept consumer-first deployment. If you can easily deploy downstream consumers of enumerated types before producers of the new values are deployed, you can avoid the problem of unreadable messages. This may not actually be a difficult approach for many organizations and doesn’t necessarily require any customer-visible downtime.
- Be willing and able to drop or rewrite unreadable messages. If your client code can be robust to unexpected enum values and you’re willing to drop or rewrite messages with unknown enum values when they reach a consumer, this may be an approach for you that preserves more deployment flexibility. Be aware, though, that with some technologies — including Protobufs and Thrift — this may not be easy to control. These technologies may encode and decode messages in generated or library code that you may not necessarily be able to change; when this code is presented with unexpected enum values, it may fail or throw exceptions that impede graceful recovery.
- Treat enum value changes just like type changes and apply the “compatibility shuffle.” If you wanted to change the type of any other field in a backward-compatible way, you could follow the standard “shuffle” of adding a new field with the new type, modifying consumers, and then removing the old field. In the case of enums, this would mean first defining a completely new enum with the new values and having producers dual-write to the new and old fields, then modifying client code to read only from this new enum type, and finally removing the original enum type. This can be unwieldy and can lead to some ugly names for your enum types (like *_V20), but it follows standard practices for compatibility in incremental deployments.
- Restrict the cases modeled as enums. All this suggests that enum values are painful to change. It may be best to use enums for either natural phenomena that are unlikely to change or very stable business entities. It may be that for less stable business entities — even those best modeled as an enum at any specific point in time — alternative modeling is the best tradeoff given the deployment challenges enum changes can generate. For some cases, it may be possible to model the enumerated values as constants in a namespace without explicitly gathering them into a enum.
We don’t know if any of these changes would’ve helped the Tokyo Civil Defense system, but we hope they help you in creating scalable systems that are easy to modify and understand. Feel free to share your own stories with distributed systems and datatypes in the comments below. We look forward to learning along with you.