Enums in distributed systems

Enums. Every software engineer has seen them. Maybe it was in your introductory programming class. Maybe it was at your first software engineering job. Or perhaps you saw them when reading an API spec or working on an open-source project. Properly used, enums are a powerful tool for writing code that is readable yet resource-efficient, and they are available in almost every statically-typed language. They also eliminate an important class of programming errors caused by divergent definitions of types whose values come from a fixed, enumerated domain.

In a monolithic application with an atomic deployment profile, there are few downsides to modeling a defined set of values of a particular type as an enum. But in a distributed system — the kind of system Conductor designs, deploys, and operates at large scale — enums present unique challenges in both software development and deployment. In this post, we’ll describe the kinds of development and deployment we think are ideal; point out where enumerated types can get in the way of fast and safe iteration; and suggest alternatives for situations where the downsides of enums may outweigh the benefits. In the process, we also hope to be able to save coastal cities around the world from fire-breathing monsters, because, yes, monster early-warning systems and enums may not mix as well as you might think…

Suppose that you have a system composed of a number of independently-deployable components and you want to roll out changes to the system. What options do you have, and what are their costs and benefits? (One example of this situation is a relational database and a separate application that executes queries against it, but even two processes on the same machine that need to speak the same protocol require an answer to this question.)

The simplest way of doing this might be a “migrate the world” deployment, where all the components are taken offline and all are simultaneously upgraded to the most recent version of your codebase. This has the advantage of simplicity, but it has a lot of downsides as well: an entire vertical slice of your system will be offline for some period of time, and rollback might not be straightforward. These characteristics can, in turn, lead to less frequent “kitchen sink” releases, which encourage large system changes that are less likely to be smoothly deployed — an unattractive, self-reinforcing cycle.

An alternative is to ensure that system components can be iteratively deployed so that subcomponents can be released in nearly any order. This is attractive precisely because it enables no-downtime deployments with easy rollback and independent deployment schedules. To enable this approach, a software team needs to make changes that are both backward- and forward-compatible wherever possible, until deployment processes can guarantee that all dependencies have been successfully migrated. In the example of a standard relational database and application, that might include standard techniques like replacing an “ALTER COLUMN” migration with three separate steps: ADDing a new column and migrating existing data to it, modifying the application to use the new column, and then DROPping the old column. The same idea of splitting a breaking change into several backward- and/or forward-compatible changes can be applied to many of the technologies in Conductor’s stack, from HBase to browser JavaScript. (While NoSQL stores claim to be schema-less, the reality is that code accessing them expects data to be in a particular “shape,” so changing that shape on disk often requires changing client code even if no explicit datastore migration is required.)
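As a concrete illustration, here is a minimal sketch of that three-step migration using plain JDBC. The table, columns, and connection URL are hypothetical, and in practice each phase would ship as its own release rather than running in one program:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ThreePhaseMigration {
    static final String URL = "jdbc:postgresql://localhost/monitoring";

    // Release 1: ADD the new column and backfill it. Old code keeps reading
    // and writing the old column, so this change is backward-compatible.
    static void phaseOne() throws Exception {
        try (Connection conn = DriverManager.getConnection(URL);
             Statement stmt = conn.createStatement()) {
            stmt.execute("ALTER TABLE sightings ADD COLUMN location_text VARCHAR(255)");
            stmt.execute("UPDATE sightings SET location_text = CAST(location_code AS VARCHAR)");
        }
    }

    // Release 2 is an application change, not a schema change: deploy code
    // that reads and writes location_text instead of location_code.

    // Release 3: once every reader and writer has been migrated, it is safe
    // to DROP the old column.
    static void phaseThree() throws Exception {
        try (Connection conn = DriverManager.getConnection(URL);
             Statement stmt = conn.createStatement()) {
            stmt.execute("ALTER TABLE sightings DROP COLUMN location_code");
        }
    }
}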

We run distributed systems across hundreds of machines at Conductor, so a “migrate the world” deployment is highly unattractive to us. We don’t want to take down any part of our stack unless absolutely necessary, so it’s very important to us to choose technologies and development approaches that allow incremental development and deployment in a distributed scenario.

Two important technologies in our toolbox that help with making changes forward- and backward-compatible are Thrift and Protocol Buffers. Both allow for strongly-typed but simple definition of datatypes and services, and both include efficient on-the-wire encodings that can be read and written by a variety of languages. They are amenable to being stored in NoSQL stores as varied as HBase and MongoDB. Just as importantly, these technologies are explicitly designed to support backward- and forward-compatibility; they were built with distributed systems in mind (not surprising, considering that they originated at Facebook and Google, respectively). Both suggest a set of guidelines for making changes to datatype definitions so that those changes don’t break upstream or downstream systems. There are some very specific details for each technology (for Thrift, see Section 5.3 of the original specification; similar details can be found for Protocol Buffers), but to summarize the most important rules:

  1. Don’t remove “required” fields
  2. Don’t add new “required” fields
  3. Don’t change the type of an existing field

The first two are pretty straightforward: consumers get confused when a field they’ve been told will always be there is removed, or when they’re told to expect a new field but receive input from an older client that doesn’t include it. The third also seems intuitive: after all, you’d expect that if a client was expecting the fourth field of a message to be an encoded integer but instead found an encoded string, it would get confused. But that last rule has a very specific and non-obvious edge case around enums, a native datatype that both Protocol Buffers and Thrift support.
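To make that third rule concrete before we get to the enum edge case, here is a toy illustration of a type change gone wrong. This is deliberately not real Thrift or Protocol Buffers encoding (both tag fields with type information on the wire); it just shows why bytes written as one type can’t safely be read as another:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class TypeChangeDemo {
    public static void main(String[] args) {
        // A V2 producer now writes field 4 as a string.
        byte[] field4 = "YETI".getBytes(StandardCharsets.UTF_8);

        // A V1 consumer still believes field 4 is a 32-bit integer and
        // decodes the same bytes accordingly, yielding a meaningless number.
        int decoded = ByteBuffer.wrap(field4).getInt();
        System.out.println("V1 consumer decoded field 4 as: " + decoded);
    }
}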

To see how adding a value to an enum can have unanticipated effects in distributed systems, imagine that researchers who model dangerous creature types as an enum have recently discovered a new one:

Version 1 (Pre-discovery):

enum DangerousCreatureType { FRANKENSTEINS_MONSTER = 1, YETI = 2, LOCH_NESS_MONSTER = 3 }

Version 2 (Post-discovery):

enum DangerousCreatureType { FRANKENSTEINS_MONSTER = 1, YETI = 2, LOCH_NESS_MONSTER = 3, GODZILLA = 4 }

These may look like the same type, right? After all, they’re enums with the same name, and aren’t enums just integers under the hood? For those reasons, your compiler will likely not flag code that uses one definition of the type where the other was expected. But in fact, they are (in most important ways) different types. Most importantly, they have different data domains: the sets of values that are acceptable for the type. By adding a value, we’ve effectively changed the type of the enum.

This may be easier to see if you consider two different systems running two different versions of the code. Suppose the Tokyo Civil Defense system is running V1 of the code and waits expectantly for messages broadcast by sensors all around the city. As part of standard maintenance, the ocean sensors were just migrated to V2 of the code. One fine day, those sensors start writing records that mention a large fire-breathing lizard sighted off the coast of Japan. The ocean sensor team is very happy: thanks to their upgrade, the sensor network can now broadcast messages that warn of Godzilla’s approach. But what about the poor V1 Civil Defense system? It’s unclear what it should do with such a message. Should it drop it? Throw an exception? Crash outright? Any of these are possible, because the V1 system has received a message that doesn’t match its expectations; it knows nothing about Godzilla. And so a backwards-incompatible change inadvertently introduced by those researchers has imperiled tens of millions of lives.
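To see one of those failure modes concretely, here is a minimal, self-contained sketch. The findByValue lookup mirrors the method the Thrift compiler generates for Java enums, which returns null for wire values it does not recognize; everything else here is illustrative:

public class SensorFeedDemo {
    enum DangerousCreatureType {
        FRANKENSTEINS_MONSTER(1), YETI(2), LOCH_NESS_MONSTER(3);

        private final int value;
        DangerousCreatureType(int value) { this.value = value; }

        // Thrift-style lookup: unknown wire values map to null, not an error.
        static DangerousCreatureType findByValue(int value) {
            for (DangerousCreatureType t : values()) {
                if (t.value == value) return t;
            }
            return null;
        }
    }

    public static void main(String[] args) {
        int wireValue = 4; // GODZILLA, encoded by a V2 ocean sensor
        DangerousCreatureType creature = DangerousCreatureType.findByValue(wireValue);

        // creature is null here, so the next line throws NullPointerException:
        // one of the failure modes described above.
        System.out.println("Sound the alarm for: " + creature.name());
    }
}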

While we don’t save the world from mythical creatures at Conductor, we do model a lot of datatypes that help marketing organizations understand and act on their natural search presence. Sometimes we model information, like the kinds of items one might see on search results, as enumerated types in our Protobuf and Thrift definitions. From a data modeling perspective, this is attractive for all the reasons that using an enum is attractive in a single monolithic application. But it has had some important implications for our deployment approach whenever we want to add new search result types, something that can happen either when we change our analysis of search results or when search engines actually change what they display. For the reasons discussed above, those kinds of changes have required us to follow a different deployment mechanism: we must deploy consumers of the changed records before we deploy the code that produces the new records. This ensures that consumers won’t be exposed to records from a different enum domain than the one they were expecting.

We think our experience has some implications for other organizations that are considering modeling certain bits of data as enums in distributed systems. We suggest development teams consider at least the following approaches:

  1. Accept consumer-first deployment. If you can easily deploy downstream consumers of enumerated types before producers of the new values are deployed, you can avoid the problem of unreadable messages. This may not actually be a difficult approach for many organizations, and it doesn’t necessarily require any customer-visible downtime. (A defensive check that pairs well with this approach is sketched below.)
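Even with consumer-first deployment in place, a consumer can guard against ordering mistakes by handling unmapped wire values explicitly instead of crashing. A minimal sketch, reusing the hypothetical DangerousCreatureType enum (and its Thrift-style findByValue) from the example above; the handler name and its behavior are ours, not from any real system:

static void onSensorRecord(int wireValue) {
    DangerousCreatureType creature = DangerousCreatureType.findByValue(wireValue);
    if (creature == null) {
        // A producer newer than this consumer sent a value we don't know
        // about. Logging and skipping (or dead-lettering) the record is
        // safer than throwing and taking the whole consumer down.
        System.err.println("Unknown creature type on the wire: " + wireValue);
        return;
    }
    System.out.println("Dispatching alert for " + creature);
}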

We don’t know if any of these approaches would’ve helped the Tokyo Civil Defense system, but we hope they help you in creating scalable systems that are easy to modify and understand. Feel free to share your own stories about distributed systems and datatypes in the comments below. We look forward to learning along with you.
