Why you should invert your feature flags in tests

Simon Labute
Samsara R&D
6 min read · Jun 14, 2023


I had my code working, gated the change behind a feature flag, and updated the tests I knew of with the feature flag on and off. The code reviewer had signed off — Ship it. Despite my confident adherence to best practices, when we rolled out the flag, it turned out I’d shipped a bug. Even worse, there were tests that could have caught it. 😱

When developing software at scale, we’ll frequently deploy new code behind feature flags, and default them to “off” in production to have a controlled rollout. For testing, conventional wisdom says to have automated tests run as close to production as possible, so most testing frameworks set feature flags to off by default as well, which seems sane enough. In practice, though, doing this makes it hard to know which tests are relevant and progressively degrades your test suite’s coverage, and you should do the exact opposite.

I know, I know, it feels wrong — that’s not what customers experience! But let me walk you through the issues with the conventional wisdom as a codebase grows over time, how we’ve benefited from inverting feature flags in tests at Samsara, and why you should too.

👀 The conventional wisdom

Most of the time, feature flags are defaulted to “off” in production as you’re working on the feature before shipping it to customers. Most test frameworks assume it’s best to use the same default so that the testing environment reflects production.
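To make that setup concrete, here’s a minimal sketch in Go (the flag store, flag name, and generateReport function are hypothetical stand-ins, not any real framework’s API) of a test that simply inherits the production default of “off”:

```go
package dogfood

import "testing"

// Hypothetical stand-ins for this sketch: a flag store seeded with the same
// initial value production uses ("off" for a brand-new flag), and a report
// function whose behavior is gated on that flag.
var flagDefaults = map[string]bool{
	"new_pet_food_report": false, // mirrors today's production default
}

func generateReport(useNewPath bool) string {
	if useNewPath {
		return "report-v2" // new, feature-flagged code path
	}
	return "report-v1" // old code path customers see today
}

func TestPetFoodReport(t *testing.T) {
	// Because the flag defaults to off, this test (and every downstream test
	// like it) quietly keeps exercising only the old code path.
	if got := generateReport(flagDefaults["new_pet_food_report"]); got != "report-v1" {
		t.Fatalf("unexpected report: %q", got)
	}
}
```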

In this world, the idealized workflow goes something like this:

  1. Introduce the feature flag.
  2. Update relevant tests to cover both flag variations, ensuring features still work with the new code.
  3. Release the feature.
  4. Quickly follow up by removing the flag from the codebase, and now all tests in the codebase use the new code path.

That sounds great on paper, but it rarely unfolds that way despite our best intentions. In practice, we often see something closer to this:

  1. Introduce the feature flag.
  2. Update relevant tests. Um, which tests are even relevant to update? I maybe know the ones my team owns, but I definitely don’t know about all downstream consumers of my code. I guess I’ll just do a cursory check for any mention of my feature in these files?
  3. Release the feature. This takes a bit longer than we originally hoped!
  4. Remove the flag. Now I have to try to remember to remove the feature flag, but this isn’t exactly a top priority, so it might stick around for a year or two.

I want to home in on the problem in step 2, which I don’t think gets as much airtime as it deserves (I know I didn’t appreciate it before working at a larger company).

Some idealistic code review from me that only works if the code author knows what all the relevant tests are.

At Samsara, we have tens of thousands of tests ranging from unit to integration tests, spanning across the boundaries of our many microservices. It’s simply not practical to audit every test that your new feature-flagged code might impact. It’s fine to say in a workflow doc somewhere or code review comment to “make sure to update all relevant tests,” but how are engineers supposed to do that in practice?

With feature flags defaulted to off, the DogFood team doesn’t know to make sure PetFoodReport still works

To highlight the problem, imagine merging a logic change without any feature flag. Your regression suite runs in CI and, hopefully, catches any unintended behavioral changes impacting downstream consumers before they can deploy.

Now try again with a feature flag, and those same regression tests pass! You’ve effectively opted out of your entire regression suite except for the tests you specifically remembered to modify. This is entirely contrary to the point of a regression suite in the first place, which is to catch things you didn’t think of!

🔀 Flip the flags

The solution is pretty simple: have tests pretend we didn’t feature flag our code. Rather than inherit the same initial default being used in production, we can default them to their intended rollout state of “on.”
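Here’s a minimal sketch of that idea, continuing the hypothetical names from the earlier snippet (none of this is Samsara’s actual framework): a test helper that seeds flags with their intended rollout state instead of production’s initial “off”.

```go
// newTestFlags is a hypothetical helper a test framework might expose.
// Instead of copying production's initial "off" value, it seeds each flag
// with its intended rollout state of "on"; individual tests can still
// override specific flags when they need the old behavior.
func newTestFlags(overrides map[string]bool) map[string]bool {
	flags := map[string]bool{
		"new_pet_food_report": true, // intended rollout state, not today's production value
	}
	for name, value := range overrides {
		flags[name] = value
	}
	return flags
}

func TestPetFoodReport_DefaultsToNewPath(t *testing.T) {
	flags := newTestFlags(nil)
	// Downstream tests now exercise the new code path automatically, so a
	// breaking change surfaces in CI instead of at rollout time.
	if got := generateReport(flags["new_pet_food_report"]); got != "report-v2" {
		t.Fatalf("unexpected report: %q", got)
	}
}
```

A test that genuinely needs the old behavior can pass an explicit override, which makes that choice visible in the diff rather than silently inherited from a framework default.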

Recall from above that the main drawback of defaulting flags to off is not knowing which tests would break if the flags were on. This single change addresses exactly that.

With the feature flag defaulted to on, downstream tests fail clearly in CI.

An engineer who pushes their code to CI will now get a report of all the breaking tests and can proactively decide to:

  1. Add coverage for old and new variations.
  2. Update the test to account for new behavior.
  3. Set the flag to “off” (this probably isn’t the right thing to do, given it creates tech debt for deprecating the flag).

Notably, we still expect engineers to test both variations where appropriate. For any test whose expected behavior has changed, the engineer must decide how/whether to update the test and discuss the implications with relevant downstream teams via code review.
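Where the author decides both paths matter, a table-driven test (again reusing the hypothetical helpers from the sketches above) covers them explicitly:

```go
func TestPetFoodReport_BothFlagStates(t *testing.T) {
	cases := []struct {
		name   string
		flagOn bool
		want   string
	}{
		{"flag off: old path still supported", false, "report-v1"},
		{"flag on: intended rollout state", true, "report-v2"},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			// Override the default for this subtest only, so the choice to
			// cover both variations is explicit in the test itself.
			flags := newTestFlags(map[string]bool{"new_pet_food_report": tc.flagOn})
			if got := generateReport(flags["new_pet_food_report"]); got != tc.want {
				t.Fatalf("got %q, want %q", got, tc.want)
			}
		})
	}
}
```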

We hire engineers to make good tradeoffs that leverage their experience, and testing is no different. With a combinatorial number of possible scenarios to cover, you’re necessarily choosing a stopping point somewhere and leveraging your intuition about which code can be reasoned about independently. Our intent in flipping flag defaults isn’t to force engineers to write more tests but to ensure their tradeoffs are intentional rather than inadvertent.

Our workflow now looks closer to this:

  1. Introduce the flag.
  2. Update relevant tests. Now you know which ones to update because they fail in CI!
  3. Release the feature.
  4. Remove the flag. Yay, this is really easy since there’s no backlog of tests that still need updating! At most, you’re deleting coverage for the removed path.

This also means that all new tests use the forward-facing functionality. Even if you have a lot of feature flags lingering in production (maybe they’re in an almost-fully-rolled-out state for quite a while, or just not removed from the codebase), your regression suite won’t slowly degrade in quality. If you default flags to off, every new test exercises more and more of the old code paths your customers are no longer running. I bet that’s not your intent!

🔚 Yes, even in end-to-end tests

I’ve seen advice that says to run integration or end-to-end tests with feature flag values that match the current production state. That seems to work in simple cases, but it needs a bit more nuance.

For instance, what do you do when the flag is halfway rolled out? When should you update the test itself if a behavior change is expected?

We’ve experienced that exact issue at Samsara in our end-to-end tests. Projects ran with flag values from production, happily passing deploy checks so long as the flag wasn’t yet rolled out to the entities we use for end-to-end testing. Then, as soon as we’d change the production config to roll out the flag fully, our test suite would fail, and someone would get paged to go in and evaluate if there was an outage or if we just needed to fix a test.

We solved this with the same approach: we just flip the flags on during end-to-end tests by default, and engineers update the tests when they introduce the functionality in the first place.
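As a rough sketch of the idea (the setup function and flag name are hypothetical, not our actual end-to-end harness), the suite pins flags to their intended rollout state instead of fetching whatever production currently says:

```go
// setupEndToEndFlags returns the flag values the end-to-end suite runs with.
// Rather than reading the live production config (which changes out from
// under the suite mid-rollout and pages someone when it does), each flag is
// pinned to "on" when the feature is introduced, and the affected tests are
// updated at that same moment.
func setupEndToEndFlags() map[string]bool {
	return map[string]bool{
		"new_pet_food_report": true, // pinned, independent of production rollout percentage
	}
}
```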

✅ Takeaway

I mentioned earlier that the conventional thinking is to give a newly created feature flag the same default in tests as it has for customers. The guiding principle is to maximize how closely the code paths exercised in tests match the code paths running in production.

I think that’s a great principle, but in practice that overlap is actually maximized over time by defaulting flags in tests to the inverse of their initial production “off” value.

At Samsara, we made this change across our testing frameworks and have seen good results, with direct examples of PR authors finding breaking changes they didn’t know about pre-merge, contacting the owning team, and not shipping bugs. Alas, we also had bugs before making this change that could have been caught had we built this sooner — but hey, at least we can prevent the next ones!

It’s rewarding to build infra where you can see the impact: saving our colleagues time resolving bugs and keeping a high bar for the products we ship to customers. If this resonates with you, Samsara could be a great fit. We’re growing — check out our open roles.

I’d like to give a shoutout to Ethan Goldberg for his contributions to this infrastructure & post.
