Five Uses of the Knife†: A Software Architecture Study with LaunchDarkly, AWS Lambda, and Apollo GraphQL

Ken Shih
Published in Making Meetup · Apr 21, 2021


Every other Friday at Meetup we have an hour-long “Engineering Collective” meeting with a two-part agenda:

  1. We go over ADRs (Architecture Decision Records)
  2. People reserve 5–10 minute slots to share work or introduce discussion topics

Someone brought up their addition of a feature flag to our GraphQL edge API using LaunchDarkly’s Node client SDK.

A Staff Engineer pointed out that, in his recent work optimizing the performance of our AWS Lambda-based Apollo GraphQL server, he had seen this use of the library add significant latency (note: these problems were not due to the library itself, which is well documented, but to how, in some cases, we had implemented it).

He and his team had recently brought p95 latency of the graph from 450ms down to 275ms through heroic efforts in Q1. While that was 75ms shy of our OKR and still far from our eventual goal of sub-100ms, it was an extremely impressive feat, especially since the graph was one of our hottest spots of change by multiple teams in Q1 and latency regressions were almost daily occurrences.

I noticed that the ensuing discussion beautifully played out four distinct, typical software engineering approaches, so I thought I’d capture it as an interesting scene to ponder…

The Points

A Senior Director of Engineering pointed out that we should remove flags once we finish testing in production or A/B testing, so that any latency problems would be temporary and acceptable for features that aren’t yet hardened.

I, a Staff Engineer, noted that we had seen implementation problems in the past, and that we should wrap bare use of LaunchDarkly in a library to handle the initialization and startup problems we’d seen in other areas of the codebase that use LaunchDarkly.

The Staff Engineer mentioned earlier suggested using a distributed cache, so that we would pay the startup penalty only infrequently, instead of on every underlying Lambda task creation (which AWS manages automatically, with varying cardinality).

Another Staff Engineer pointed out that the LaunchDarkly client already caches flags, but that it loads all the feature flags on initialization, and that startup time can be significant. He suggested we consider moving our server from Lambda to ECS: instead of cycling Lambda processes frequently, long-running service tasks would pay the cost only once, at initialization, and be more predictable afterward.

The Approaches

One management approach: best practice and governance.

One evolutionary approach: erect an interface, then strangle & DRY.

One domain approach: optimize your tier for its characteristic and concrete uses.

One platform approach: structure your architecture for resiliency; in this case, scale vertically.

The Pros & Cons

The management approach

Pros:

  • We have too many flags; keeping them cleaned up makes the system easier to understand.
  • The best code is no code, so clean up after yourself. After your feature is hardened in production, you’re just left with the good stuff. And while the feature is under development, there’s a built-in “warning” signal in the code that other engineers can read and understand that something is in progress.
  • Doing this approach doesn’t limit doing the other approaches!

Cons:

  • The practice can be difficult to establish and maintain. Without linting, sufficient promotion of the practice, or sufficient governance, the advantages above are simply lost.
  • Feature flags are a tool, and not just a tool for A/B testing features. Part of a graceful-failure or bulkheading strategy, or an on-call playbook under critical conditions, might be to turn off a feature. For that class of use case you actually want to keep the feature flag around, so a uniform “delete it” policy limits the utility of feature flags themselves (a sketch of such a flag follows this list).
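
For illustration, here is a minimal sketch of such a permanent operational shutoff valve, assuming the launchdarkly-node-server-sdk package (the flag key, helper, and function names are all hypothetical):

```typescript
import { LDClient } from 'launchdarkly-node-server-sdk';

// Hypothetical downstream call, declared so the sketch stands alone.
declare function fetchRecommendationsFromService(userKey: string): Promise<string[]>;

// A long-lived "bulkhead" flag: not an A/B test, but an operational shutoff
// valve that an on-call engineer can flip under critical conditions.
export async function getRecommendations(
  ldClient: LDClient,
  userKey: string
): Promise<string[]> {
  const enabled = await ldClient.variation(
    'recommendations-enabled', // hypothetical flag key
    { key: userKey },
    true // fail open: keep serving if LaunchDarkly is unreachable
  );
  if (!enabled) {
    return []; // graceful degradation: shed the expensive subsystem
  }
  return fetchRecommendationsFromService(userKey);
}
```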

The evolutionary approach

Pros:

  • Naive first tries always have problems, as do well-intentioned second and third tries. Interface orientation lets you improve the system with low coordination cost. You manage your improvement behind a Strategy Pattern, so even though your implementation can dramatically improve, your clients don’t have to change a thing to get the benefits of that improvement (a sketch follows this list).
  • Doing this approach doesn’t limit doing the other approaches!
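
A minimal sketch of what erecting that interface might look like, assuming the launchdarkly-node-server-sdk package (the module layout and names are hypothetical):

```typescript
// flags.ts -- a hypothetical wrapper module; callers never import the SDK.
import * as LaunchDarkly from 'launchdarkly-node-server-sdk';

// The interface clients code against. The implementation behind it can be
// replaced (cached, stubbed for tests, etc.) without touching any caller.
export interface FeatureFlags {
  isEnabled(flagKey: string, userKey: string, defaultValue: boolean): Promise<boolean>;
}

class LaunchDarklyFlags implements FeatureFlags {
  private client: LaunchDarkly.LDClient;

  constructor(sdkKey: string) {
    // One client per process; initialization kicks off immediately.
    this.client = LaunchDarkly.init(sdkKey);
  }

  async isEnabled(flagKey: string, userKey: string, defaultValue: boolean): Promise<boolean> {
    // Centralizes the initialization handling every caller used to repeat.
    await this.client.waitForInitialization();
    return this.client.variation(flagKey, { key: userKey }, defaultValue);
  }
}

// Single shared instance for the whole server.
export const flags: FeatureFlags = new LaunchDarklyFlags(process.env.LD_SDK_KEY!);
```

Callers depend only on FeatureFlags, so a cached, batched, or test-stubbed implementation can be strangled in behind the same interface later.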

Cons:

The domain approach

Pros:

  • Often the source-of-truth of the data is not very concerned with performance. For example, a tax or accounting microservice would need to prioritize correctness and transactionality over performance, but a federated GraphQL layer exposing tax data needs to be highly performant. The two domains have different design goals. For GraphQL, a cache and/or index is an essential architectural tool for delivering on SLOs regardless of how the source-of-truth system is designed (a toy sketch follows this list).
  • Doing this approach doesn’t limit doing the other approaches!
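
As a toy illustration of that separation, a resolver in the GraphQL tier could read tax data through a small TTL cache instead of hitting the source-of-truth service on every request (the fetchTaxRecord call, the 60-second staleness budget, and the schema are all made up):

```typescript
// Hypothetical TTL cache in front of a slow, correctness-focused service.
type Entry<T> = { value: T; expiresAt: number };

class TtlCache<T> {
  private entries = new Map<string, Entry<T>>();
  constructor(private ttlMs: number) {}

  async get(key: string, load: () => Promise<T>): Promise<T> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value;
    const value = await load(); // cache miss: consult the source of truth
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}

declare function fetchTaxRecord(id: string): Promise<unknown>; // hypothetical service call

const taxCache = new TtlCache<unknown>(60_000); // 60s staleness budget

// The GraphQL tier meets its latency SLO without redesigning the tax service.
const resolvers = {
  Query: {
    taxRecord: (_: unknown, args: { id: string }) =>
      taxCache.get(args.id, () => fetchTaxRecord(args.id)),
  },
};
```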

Cons:

  • Though simple in construction, caches are surprisingly difficult to get right. Classic abuses confuse caches with indexes, or operational data with reporting data. Invalidation, over- or under-caching, and liveness issues can be subtle or, worse, obvious.
  • Caches can hide problems with the underlying data if you’re not disciplined about keeping business logic inside the source-of-truth.

The platform approach

Pros:

  • Authors in the system can simply write more naive code and the system operates with fewer problems.
  • Vertical scaling is easy to reason about and has a kind of architectural simplicity, since you don’t have to coordinate various subsystems. New systems especially can benefit from vertical scaling, so over-architecting can be kept in check in the spirit of YAGNI.
  • This may be a more cost-effective approach?
  • Doing this approach doesn’t limit doing the other approaches!

Cons:

  • Hiding things behind a magic, powerful server can be very problematic. Monolithic memory use can be as bad as any other monolithic coupling, hiding resource use to the point of unmanageability. That is, poorly encapsulated, concern-separated bundles become overly entangled in their shared resource pools. A tragedy of the commons in memory use can occur as easily as over-coupling in library use or in poorly modularized code.
  • ECS is harder to operationalize and maintain than Lambda, so lower AWS cost MIGHT be replaced with higher complexity and maintenance cost, ultimately COGS. A long-running process means your system has to be that much more well-factored in its memory use and resource consumption than a process that dies every XX minutes. Lambda forces some architectural discipline simply by its constraints, similar to how Node’s design encourages horizontal scaling.
  • ECS is simply more flexible than Lambda, and sometimes having fewer constraints means, well, not enough constraints.

The Solution is just part of the Story

In truth, the “solution” works “best” when all these concerns are addressed in concert and each in sufficient measure.

Occam’s Razor in one of these dimensions does not always mean the razor applies to the system or problem as a whole.

For example, to solve our “problem,” perhaps we could simply initialize the LaunchDarkly client during Lambda function initialization. That way, the cost of initialization is paid BEFORE a user attempts to fetch data, and no additional runtime tax is paid during the individual request-response cycle for this feature.
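
Concretely, that might look something like the following minimal sketch, assuming the launchdarkly-node-server-sdk package (the handler shape and flag key are made up):

```typescript
import * as LaunchDarkly from 'launchdarkly-node-server-sdk';

// Module scope runs once per Lambda execution environment, during the
// initialization phase, before this instance serves its first request.
const ldClient = LaunchDarkly.init(process.env.LD_SDK_KEY!);
const ldReady = ldClient.waitForInitialization().catch((err) => {
  console.error('LaunchDarkly failed to initialize; falling back to defaults', err);
});

export const handler = async (event: { userKey: string }) => {
  // Usually settled long before traffic arrives; awaiting an already
  // resolved promise adds effectively no per-request latency.
  await ldReady;
  const enabled = await ldClient.variation(
    'new-resolver-path', // hypothetical flag key
    { key: event.userKey },
    false // safe default if initialization failed
  );
  return { flagEnabled: enabled };
};
```

Because AWS reuses execution environments across invocations, the module-scope initialization cost is amortized over many requests instead of being paid inside each one.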

This did not require the management approach! But we should clean up test flags & keep permanent bulkhead shutoff valves where needed.

This did not require the evolutionary approach! But we should access LaunchDarkly through, at the very least, an interface function on the server, so that we use the server’s already-initialized client instead of initializing our own in every use case.

This did not require the domain approach! But, certainly, we will need a cache for other purposes.

This did not require the platform approach! We didn’t need to migrate to ECS just to optimize this problem, but this does not mean managing operating expenses won’t make us migrate later.

While it did not concretely require any of the discussed approaches, it was a remarkable illustration of a symposium. We exposed each of our biases and tendencies like characters in a play, and each of us communicated progressive suggestions to the engineering organization in one tight scene. I just loved what everyone brought to the table and how we’ll build a better system together because of it. The bi-weekly Engineering Collective reminds me, once again, that writing software is a human endeavor, Community Matters, diversity matters, and when we work together, we are better together.

I wonder what David Mamet thinks? †

† — David Mamet wrote a book called Three Uses of the Knife, from which I filched the title of this blog article.
