PlanGrid is a part of Autodesk’s Construction Cloud. We are constantly introducing new features, which means we rely heavily on feature flags. As a startup, we built a home-grown feature flag system. It worked well at first, but its functionality and scalability were limited; we eventually outgrew it and migrated to LaunchDarkly. In this post, we’ll describe how we made that transition seamless, decreased our internal maintenance workload, and achieved faster and more consistent response times for our applications in the bargain.
Background & Motivation
To set the stage for our migration journey, we should talk briefly about the beginning. PlanGrid was founded in 2012. In the early startup years, infrastructure and systems were constantly evolving. One of those evolutions was the introduction of feature flags to manage phased rollouts of functionality and provide user- or project-level entitlement to specific features. We needed something basic, cheap, and quick, so we built our own feature flag management system, “Flipper,” which we’ll describe a bit more later. Fast forward several years, and Flipper had experienced significant growing pains, to the point where a near-complete rewrite had even been started (and subsequently scrapped due to performance concerns). As an organization, we had outgrown our homemade system. Then, as luck would have it, we became part of Autodesk, and our new parent company had an existing relationship with LaunchDarkly.
Flipper itself was a relatively simple system (in theory). Feature flag data was stored in a Redis data store. There was no user interface in the traditional sense; rather, users interacted with Flipper via a dedicated Slack channel. A bot listened for commands and relayed the relevant instructions to the backend system.
A Flask web application provided the API that our Slack bot and application code used to interact with flags behind the scenes. A shared Python library provided a wrapper around the REST API so applications could call, for example,
flipper.is_active('feature_name', 'user_id') and get back a True/False response.
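The internals of that shared library aren’t shown in this post, but its shape was roughly the following. This is a minimal, hypothetical sketch: the class name and the injected `fetch` callable (which stands in for the HTTP call to the Flask backend) are assumptions, not the actual library code.

```python
from typing import Callable


class FlipperClient:
    """Thin wrapper around a feature flag HTTP API (hypothetical sketch).

    `fetch` stands in for the REST call to the Flask backend; it takes a
    flag name and a user id and returns the raw flag state.
    """

    def __init__(self, fetch: Callable[[str, str], bool]):
        self._fetch = fetch

    def is_active(self, feature_name: str, user_id: str) -> bool:
        # Fail closed: if the backend errors out, treat the flag as off.
        try:
            return bool(self._fetch(feature_name, user_id))
        except Exception:
            return False


# Usage with a stubbed backend in place of the real HTTP transport:
flags = {("new_dashboard", "user-42"): True}
flipper = FlipperClient(lambda name, uid: flags.get((name, uid), False))
print(flipper.is_active("new_dashboard", "user-42"))  # → True
```

Injecting the transport rather than hard-coding it is also what made the later backend swap possible without touching application code.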
The Good: For users (our developers, support, and even sales staff), Flipper was simple to use. Everyone loved the Slack integration. We’re all on Slack constantly, so having our feature flags at our fingertips was great!
The Bad: As mentioned above, there was no UI. While the Slack integration was nice and convenient, some things aren’t all that amenable to a text-only query/response mode of interaction. As the list of flags grew, asking Flipper to show you all flags for a given environment resulted in a “wall of text” (eventually, so big that it couldn’t fit in a Slack message). There was no way to visualize things like flag activity. Additionally, Flipper didn’t do a lot of logging. One could look back through Slack history to see who changed what and when, but as our organization grew, this became increasingly difficult.
The Ugly: With no UI, as the list of flags grew lengthy, people quickly learned that asking Flipper for a list of flags was not going to be very helpful. So, generally speaking, nobody asked for a list of flags anymore. The adage “out of sight, out of mind” rings true: flags piled up and were rarely removed. By the time we migrated off of Flipper, we had several hundred feature flags, the vast majority of which were dormant or defunct.
We hold a bi-annual “Hackweek,” during which everyone converges on our San Francisco HQ, drops whatever they’re working on, and, well… hacks. It’s a time to try out wild new ideas or to scratch at things that have been itching in the backs of our brains for months. Flipper itself had been built during an earlier Hackweek, and it was during another Hackweek several years later that the migration from Flipper to LaunchDarkly began.
As a starting point, a team including some of the original engineers who built Flipper constructed a Proof-of-Concept integration that would allow our existing Python library to optionally talk to LaunchDarkly, abstracting the code responsible for Redis I/O into a pair of “data adapter” classes. Essentially, in this PoC, we were using LaunchDarkly as a data store for feature flags.
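The actual adapter classes aren’t public, but the idea can be sketched as follows. Everything here is an assumption for illustration: the interface and class names are hypothetical, and in-memory dicts stand in for the real Redis I/O and LaunchDarkly REST calls.

```python
from abc import ABC, abstractmethod


class FlagDataAdapter(ABC):
    """Common interface both storage backends implement (hypothetical sketch)."""

    @abstractmethod
    def get_flag(self, name: str) -> dict: ...

    @abstractmethod
    def set_flag(self, name: str, state: dict) -> None: ...


class RedisAdapter(FlagDataAdapter):
    """In the real system this wrapped Redis I/O; a dict stands in here."""

    def __init__(self):
        self._store = {}

    def get_flag(self, name):
        return self._store[name]

    def set_flag(self, name, state):
        self._store[name] = state


class LaunchDarklyAdapter(FlagDataAdapter):
    """In the PoC, LaunchDarkly was effectively a data store; a dict
    stands in for calls to its API here."""

    def __init__(self):
        self._store = {}

    def get_flag(self, name):
        return self._store[name]

    def set_flag(self, name, state):
        self._store[name] = state
```

Because both adapters satisfy the same interface, the library above them never needs to know which backend it is talking to.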
If you’re already familiar with LaunchDarkly, you might be thinking, “What? That’s silly! Why would you take a feature-rich system with a full web UI and use it as a glorified database?” As it turns out, this was actually one of the keys to our success in our migration. We had a couple of dozen distinct applications within our ecosystem that referenced our Python library for Flipper. We knew it would be folly to expect all of those teams to rip-and-replace feature flag management code in a timely manner and then flip a switch to have everyone start using the LaunchDarkly web UI in perfect synchronicity. Using our existing library as a shim between flag requests and LaunchDarkly allowed us to restrict early changes to shared library code and common settings.
One of the biggest challenges we faced was ensuring continuity. Developers were frequently adjusting flags, and user-facing code was requesting flags thousands of times per second. We needed to make the migration transparent and could not tolerate downtime, so we used a cautious, phased approach. While the initial PoC included a simple switch to route traffic to either LaunchDarkly or Redis, we needed to be able to transition and test applications one at a time for an orderly migration. To accomplish this, we used environment variables as rudimentary “feature flags,” which allowed a single engineer to manage the migration across multiple applications. Using this approach, we gradually introduced changes through the following phases:
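The per-application switch can be sketched like this. The variable name `FLIPPER_BACKEND` is hypothetical; the real point is that one engineer could flip a single application between backends by changing its environment, without a coordinated code change.

```python
import os


def choose_backend(adapters: dict):
    """Pick a flag storage backend per application via an environment
    variable — a rudimentary 'feature flag' for the migration itself.

    FLIPPER_BACKEND is a hypothetical variable name; defaulting to the
    legacy backend keeps untouched applications working unchanged.
    """
    name = os.environ.get("FLIPPER_BACKEND", "redis")
    return adapters[name]


# Usage: the adapters dict would hold real adapter instances; strings
# stand in here.
adapters = {"redis": "redis-adapter", "launchdarkly": "ld-adapter"}
os.environ["FLIPPER_BACKEND"] = "launchdarkly"
print(choose_backend(adapters))  # → ld-adapter
```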
In the first phase, we introduced dual-writes so that any changes to feature flag states would be written to both LaunchDarkly and Redis systems. Once all applications had been enabled for this phase, we ran ad hoc scripts that did a one-time mass copy of all existing flag data from Flipper to LaunchDarkly. We then performed periodic comparisons of all flag data across both data stores to ensure they remained consistent over the next few weeks.
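The three pieces of that phase (dual-write, one-time backfill, periodic comparison) can be sketched as below. This is illustrative only: `Store` is an in-memory stand-in for either Redis or LaunchDarkly, and the function names are our own, not the actual scripts.

```python
class Store:
    """In-memory stand-in for a flag store (Redis or LaunchDarkly)."""

    def __init__(self):
        self._flags = {}

    def set_flag(self, name, state):
        self._flags[name] = state

    def get_flag(self, name):
        return self._flags.get(name)


def dual_write(name, state, redis_store, ld_store):
    # Phase 1: every flag change lands in both systems.
    redis_store.set_flag(name, state)
    ld_store.set_flag(name, state)


def backfill(names, redis_store, ld_store):
    # One-time mass copy of all existing flag data from Flipper to LaunchDarkly.
    for name in names:
        ld_store.set_flag(name, redis_store.get_flag(name))


def diverged(names, redis_store, ld_store):
    # Periodic consistency check run over the following weeks:
    # returns the flags whose state differs between the two stores.
    return [n for n in names
            if redis_store.get_flag(n) != ld_store.get_flag(n)]


# Usage:
redis_store, ld_store = Store(), Store()
redis_store.set_flag("new_dashboard", {"enabled": True})
backfill(["new_dashboard"], redis_store, ld_store)
dual_write("beta_search", {"enabled": False}, redis_store, ld_store)
print(diverged(["new_dashboard", "beta_search"], redis_store, ld_store))  # → []
```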
The next phase of our rollout was a dual-read regime. During this time, dual-writes continued as in the previous phase, but now reads were first routed to LaunchDarkly. If any error occurred, the library would fall back to read and return the value stored in Redis while logging an alert to notify our development team that this had occurred.
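The dual-read fallback logic amounts to a try/except around the primary read. A minimal sketch, with hypothetical stand-in classes for a failing LaunchDarkly client and the legacy Redis store:

```python
import logging

logger = logging.getLogger("flag_migration")


class FailingStore:
    """Stand-in for a LaunchDarkly client experiencing an outage."""

    def get_flag(self, name):
        raise ConnectionError("simulated LaunchDarkly error")


class DictStore:
    """Stand-in for the legacy Redis-backed store."""

    def __init__(self, flags):
        self._flags = flags

    def get_flag(self, name):
        return self._flags[name]


def read_flag(name, ld_store, redis_store):
    """Phase 2: route reads to LaunchDarkly first; on any error, fall
    back to the value in Redis. In production, the log line raised an
    alert to notify the development team."""
    try:
        return ld_store.get_flag(name)
    except Exception:
        logger.warning("LaunchDarkly read failed for %r; using Redis", name)
        return redis_store.get_flag(name)


# Usage: LaunchDarkly is down, so the Redis value is served.
redis = DictStore({"new_dashboard": True})
print(read_flag("new_dashboard", FailingStore(), redis))  # → True
```

Because dual-writes kept both stores consistent, the fallback value was safe to serve during an outage.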
Notably, one of the key applications migrated was our Flipper Slack handler, so throughout the migrations, users were able to continue managing their flags via that familiar interface.
After running in dual-write, dual-read mode long enough to convince ourselves that things were working as expected, we entered our penultimate phase. The only change in this phase was behavioral: we started encouraging people to use the LaunchDarkly UI to manage flags. This was the point of no return, as we now allowed data between our legacy system and LaunchDarkly to diverge. Users could still “opt-out” by using the Flipper Slack channel, which would still dual-write, but LaunchDarkly had now officially become our source of truth for reading feature flag data.
The final phase was removing the ability to “opt-out,” making the Slack channel read-only, requiring all flag changes to be made via LaunchDarkly’s UI, and removing the dual-write and dual-read “features” from our library, making it purely a LaunchDarkly shim. Moving forward, applications are gradually being updated to use LaunchDarkly’s SDK directly, after which we will be able to deprecate our shim library entirely.
Overall, this migration process took a few months to complete. We were able to orchestrate the orderly migration of dozens of applications through the approach described above, with minimal disruption to our developers who rely on feature flags, and with zero downtime.
So we’ve pretty well covered the “what.” Let’s talk about the “why.” What did we get out of this?
A significant benefit is that we reduced our development burden. We no longer have to maintain and upgrade a home-grown feature flag system. Similarly, the overhead associated with the maintenance of the underlying infrastructure was reduced.
We are now also able to tap into the increased functionality of LaunchDarkly. As we’ve described above, our Flipper system was simple. That made it easy to use but also meant we had limited functionality, with only basic percentage-based rollouts and individual ID targeting. By migrating to LaunchDarkly, we have gained access to richer targeting options and a lot more information about how our flags are being used. It’s now easy to see when a flag has been fully enabled (and should be removed) instead of just letting flags pile up forever.
One particularly striking benefit has been in the area of performance. We can best illustrate this with a few graphs captured around the time we flipped the switch to start reading flags from LaunchDarkly on our largest web services. Each of these graphs represents approximately the same timeframe, with a vertical marker indicating when we made the switch. All axes are linear.
For a frame of reference, let’s first look at the overall throughput for that service. This is indicative of normal weekday traffic:
The first notable change we can see is the raw volume of Redis requests that were previously being generated by Flipper. This graph covers approximately the same time span as the throughput graph above:
…and the amount of time they were consuming:
Last but not least, overall latency for the endpoint in our middleware service that was handling feature flag requests saw a corresponding improvement. The average response time dropped to about half of what it was previously. Especially striking here are the more consistent response times seen in the 99th and 95th percentile lines. Our system is now much less susceptible to heavy user load driving many feature flag evaluations.
Now that we have fully transitioned our backend services, we are looking at ongoing improvements such as having our mobile client applications interface directly with LaunchDarkly, rather than going through our backend services as “middleware.” This will further reduce the load on our services and enable the use of features like richer targeting options.
We are also excited about monitoring flags more effectively and using that sort of information to maintain a significantly less cluttered inventory of flags.
About the Author
Rick Riensche is a Senior Software Engineer in Autodesk Construction Solutions’ Backend Platform Team. He joined Autodesk in February 2019, following a long tenure at a National Laboratory where he served as a Research Scientist and Software Engineer. He holds a Ph.D. in Computer Science from Washington State University (GO COUGS!), and his numerous hobbies include Python, feature flags, long walks on the beach, electric guitar, and video games.