Using Feature Flags for product experimentation
Question: What is a Feature Flag?
Answer: It is a logic toggle in your code whose value is consistent and persistent for a given user, used to decide whether to execute some code for that user. Typically they'll be used for three reasons:
- Killswitch on a risky feature/logic
- Progressive rollout
- Split testing (measuring a change's impact, the focus of this article)
If it's still confusing, don't worry: I'll break down both common mistakes and best practices.
There are many Feature Flag providers out there, but why wouldn't you just build your own? They're just a random number generator, right? There's more happening behind the scenes that makes it worth using a third-party provider like LaunchDarkly or ConfigCat. We'll get to that later, but first: why bother?
Getting better signal ‘in-product’.
Joining Eucalyptus was exciting for many reasons; one key reason was the highly successful and rigorously scientific approach to acquisition and ad buying: the first exposure any customer has to the ecosystem of products that Euc has to offer.
The Awareness and Acquisition stages of the funnel leverage SaaS products and dashboards (Facebook Ads, GA, Hotjar etc.). Only two of the six stages in the key funnel (Awareness, Acquisition, Activation, Reactivation, Retention, Referral) are covered by these sophisticated tools. I took the opportunity to expand that rigorous approach to experimentation and measurement to what happens once the customer goes beyond the acquisition stage, once they are in the domain of our product itself.
Why do we split test?
Split testing allows both the old and new experiences to exist side-by-side, typically in a 50/50 split. This controls for external factors: both groups of users are subject to the same confounding factors, cancelling their impact out.
For example, if the metric was 24% before the test, and during the test the control drops to 20% while the variation drops only to 22% (due to some external factor), our experiment has actually driven an uplift of 2 points absolute, or 10% relative (20% -> 22%). But without the split test, a before-and-after look at the metrics would show a 24% -> 22% drop (~8% relative).
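The arithmetic above can be sketched in a few lines; the rates are the hypothetical ones from the example:

```typescript
// Hypothetical rates from the example above.
const before = 0.24;    // metric before the test
const control = 0.20;   // control cohort during the test
const variation = 0.22; // variation cohort during the test

// The split test isolates the change's true effect:
const absoluteUplift = variation - control;             // +2 points
const relativeUplift = (variation - control) / control; // +10%

// A naive before-and-after comparison misreads it as a drop:
const naiveBeforeAfter = (variation - before) / before; // roughly -8%
```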
Every person is different, across many dimensions of biases, needs and wants. It's almost impossible to draw a conclusion about a whole population (customer base) from the actions of a single customer. But sampling enough people from a population lets you get close to certain of how the entire population will act on average, assuming the sampling is random (a topic for a whole other article).
“We don’t have time” — Why experimentation matters:
Completing an experiment takes time. It takes time to develop (sometimes longer than just building the new feature, due to the complexity of supporting both old and new functionality in the test), and time to run. The length of the experiment is a function of the primary metric's uplift and the volume of daily users seeing that feature: the larger the uplift or the larger the volume of users, the less time it takes.
On low-volume parts of your product experience it can take weeks to get a statistically significant result on optimisation work (you'll often need thousands of users in each cohort to measure a few percent of change at a 95% confidence level). This leads many people to decide there is no time to run an experiment, and that instead we should just roll it out, because "we're very confident".
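As a rough illustration of why low-volume experiences take weeks, here's a back-of-the-envelope calculation using Lehr's rule of thumb (roughly 95% confidence, 80% power). The function names and numbers are my own; use a proper power calculator for real tests:

```typescript
// Lehr's rule of thumb for comparing two proportions:
// n per cohort ≈ 16 * p(1 - p) / delta^2, where p is the expected rate
// and delta is the smallest absolute uplift you want to detect.
function sampleSizePerCohort(baselineRate: number, absoluteUplift: number): number {
  const p = baselineRate + absoluteUplift / 2; // midpoint of the two rates
  return Math.ceil((16 * p * (1 - p)) / (absoluteUplift * absoluteUplift));
}

function daysToRun(
  baselineRate: number,
  absoluteUplift: number,
  usersPerDayPerCohort: number,
): number {
  return Math.ceil(sampleSizePerCohort(baselineRate, absoluteUplift) / usersPerDayPerCohort);
}

// Detecting a 2-point lift on a 24% baseline needs ~7,500 users per cohort;
// at 500 users per day per cohort, that's just over two weeks.
const usersNeeded = sampleSizePerCohort(0.24, 0.02);
const days = daysToRun(0.24, 0.02, 500);
```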
The issue with this is that it does not control for any external factors, which have been especially pronounced this year. Beyond regular seasonality we've had a wild ride: from week to week, customers' financial and lifestyle situations have changed rapidly across entire populations, which can lead to similarly pronounced swings in metrics.
So you roll out a feature and the metric goes down, but you have no idea whether it would have gone down more, and your change actually had a positive effect shielding against an even more severe drop. If you're rolling out features multiple times a day like we are, the chance of this timing coincidence is higher than you'd think.
You call a crisis meeting to understand what’s happened. The change is blamed as the obvious culprit. A lot of new commits have merged and deployed via CI/CD so you work to pull the old code from the git history and piece it back together which takes a few days.
You deploy the new “old” code.
The metric drops further. So you've burned an extra week of dev time and are worse off than before, twice over. Confused and frustrated, you must now decide whether to reimplement the change.
This can also happen the other way: you wrongly attribute an uplift to a change and then double down on it falsely. Either way, you're running worse than blind; you're misinformed.
If you’re a startup, you are always against the clock of your runway slipping away. It can be tempting to believe you may not have time to run it as a test. But the thing is, you don’t have time NOT to run it as a test. Slow is smooth, smooth is fast.
All of this is to say that prioritising a split test makes sense in most circumstances because it allows you to say for certain if the change had a positive or negative impact regardless of macro trends and seasonality.
Through the looking glass
There are three parts to any experiment:
- The code of the new feature itself that will be experimented on.
- The feature flag allowing you to show different users different features.
- The code for measuring the impact of the change (typically an analytics or event system).
We’re focusing on (2) in this article but we will cover the rest in time.
Building a Feature Flag Client
When I built the Feature Flag client for our products, I wanted to avoid each product developing different practices, which breeds complexity and confusion, so I bundled as much of the complex setup logic into the client as possible.
Two main things were included: Identifying the user with the FF provider and firing an “Exposure Event”.
As mentioned before, Feature Flags are persistent: if the same user visits the same page, they should see the same functionality, to avoid polluting your results by randomly showing the same user different experiences. This means you need to identify the user so the provider can return the same value; typically your own unique UserId for that user works well, as it won't ever change.
The Feature Flag provider will give you an SDK where you only need to provide the Flag Key (defined by you when you create the flag in the provider's web app), the user identifier (UserId), and a default value in case the fetch fails. By abstracting the provider SDK behind a Singleton Adapter pattern, swapping providers in future becomes a very small surface-area change.
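A minimal sketch of that client, assuming a generic provider interface; ProviderSdk, track, and every name here are hypothetical, not any specific vendor's API:

```typescript
// Hypothetical provider interface; a real vendor SDK would sit behind this.
interface ProviderSdk {
  getValue(flagKey: string, userId: string, defaultValue: string): Promise<string>;
}

type TrackFn = (event: string, props: Record<string, string>) => void;

class FeatureFlagClient {
  private static instance: FeatureFlagClient | undefined;

  private constructor(
    private sdk: ProviderSdk,
    private userId: string,
    private track: TrackFn,
  ) {}

  // Identify the user once at setup; every call site reuses the singleton.
  static init(sdk: ProviderSdk, userId: string, track: TrackFn): FeatureFlagClient {
    if (!FeatureFlagClient.instance) {
      FeatureFlagClient.instance = new FeatureFlagClient(sdk, userId, track);
    }
    return FeatureFlagClient.instance;
  }

  // Resolve a flag and automatically fire the exposure event.
  async getValue(flagKey: string, defaultValue: string): Promise<string> {
    const value = await this.sdk.getValue(flagKey, this.userId, defaultValue);
    this.track("feature_flag_exposure", {
      flagKey,
      flagValue: value,
      userId: this.userId,
    });
    return value;
  }
}
```

Swapping providers then only means writing a new ProviderSdk adapter; none of the call sites change.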
I also included an analytics event that fires automatically with the Flag Key, the resolved value of the user's cohort (Flag Value), and the UserId. Whenever a user is exposed to a feature we can use that event as the "anchor point": left-join all conversion analytics events onto it, then group and count by the cohort value in the exposure event to compare performance.
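In toy form, that analysis looks something like this (the event shapes and names are my own invention):

```typescript
// Exposure events anchor the analysis; conversions are left-joined by UserId.
interface ExposureEvent { userId: string; flagValue: string }
interface ConversionEvent { userId: string }

function conversionRateByCohort(
  exposures: ExposureEvent[],
  conversions: ConversionEvent[],
): Record<string, number> {
  const converted = new Set(conversions.map((c) => c.userId));
  const cohorts: Record<string, { users: number; converted: number }> = {};
  // Real pipelines would dedupe repeated exposures per user; omitted for brevity.
  for (const e of exposures) {
    if (!cohorts[e.flagValue]) cohorts[e.flagValue] = { users: 0, converted: 0 };
    cohorts[e.flagValue].users += 1;
    if (converted.has(e.userId)) cohorts[e.flagValue].converted += 1;
  }
  const rates: Record<string, number> = {};
  for (const [name, c] of Object.entries(cohorts)) {
    rates[name] = c.converted / c.users;
  }
  return rates;
}
```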
By doing this work upfront we're now able to expand the Growth Engineering team, and for the most part they will not need to sweat the details of setting up an experiment, because the logistics are handled by simply calling featureFlagClient.getValue('<flagKey>', '<defaultValue>').
Multivariate Feature Flags
A common mistake when doing an experiment is using a boolean feature flag. This is fine for a progressive rollout or a killswitch but when you want to accurately measure the impact of your change a multivariate flag is a must.
The reason is as follows:
With a boolean feature flag, all users move from False to True over time until you're at a 50/50 rollout. This means you will have fired exposure events with value False for a given user before you fire one with value True for the same user, complicating your analysis unnecessarily.
With multivariate however you can have a ‘not-enrolled’ cohort alongside a ‘variation’ and ‘control’ cohort. People start 100% in ‘not-enrolled’ and then you roll out:
- 0% V, 0% C, 100% NE →
- 10% V, 10% C, 80% NE →
- 25% V, 25% C, 50% NE →
- 50% V, 50% C 0% NE
Now you can rest easy knowing that any exposure event for a given user is valid and can be trusted as their "true cohort", because no user ever crosses from variation to control or vice versa.
Disclaimer: It is possible to use a boolean feature flag, but you need to keep track of exactly when you hit 50/50 and then exclude any exposure events from before that time: a bit messy, and risky if any miscommunication happens.
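One way to picture how cohorts stay stable during such a rollout: hash each UserId into a fixed bucket, then grow the variation range from one end of the bucket space and the control range from the other, so users only ever move out of 'not-enrolled' and never cross cohorts. This is a sketch of the idea, not any provider's actual algorithm:

```typescript
// Deterministic bucket 0-99 for a user: same input, same bucket, every time.
function stableBucket(userId: string): number {
  let hash = 0;
  for (const ch of userId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % 100;
}

type Cohort = "variation" | "control" | "not-enrolled";

// rolloutPct goes 0 -> 50 over time. The variation range grows up from
// bucket 0; the control range grows down from bucket 99. A user already in
// variation or control stays there as rolloutPct increases.
function assignCohort(userId: string, rolloutPct: number): Cohort {
  const bucket = stableBucket(userId);
  if (bucket < rolloutPct) return "variation";
  if (bucket >= 100 - rolloutPct) return "control";
  return "not-enrolled";
}
```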
Consistency is Key
Using a provider is a valuable use of resources; they handle hidden complexity for you:
- Ensuring your cohorts stay properly balanced over time.
- Employing random sampling to avoid time- or geography-based biasing of your cohorts.
- Rolling back a flag excludes the most recently enrolled users first, and re-rolling it out a bit later re-enrolls those same users first (last out, first back in). This minimises the pollution of your results.
- Ensuring a given user resolves the FFs the same way every time, assuming nothing else has changed (you haven't enrolled or unenrolled them).
- Letting your flags interact (in some providers), e.g. having two flags resolve the same way for a given user to align the experiences, or precluding a user from one experiment if they've been part of another; something non-trivial to do at scale on your own.
- Managing all the states of the FFs in one place with a nice UI that non-engineering staff, such as support, can use. They can roll back a harmful change without needing a code merge or deploy, minimising customer impact.
- Targeting individual users for given flags, to include or exclude them.
- Maintaining an audit log, which is one of those things (like insurance) you only wish you had when it's too late.
All of this allows you to decouple flag management from flag implementation.
Picking a provider
There are quite a few good feature flag providers on the market, each looking to differentiate itself in one way or another. Some are good for small startups with a small budget; others are vertically integrated with huge marketing tool stacks and aim more for enterprise. We currently use ConfigCat due to its generous free tier and the tight-knit community around it.
Its functionality is limited compared to something like LaunchDarkly, but because we have a telehealth business model, our users do not engage with the web app frequently on a week-to-week basis, so paying per MAU doesn't make sense. ConfigCat charges per flag resolution above a free cap. You'll need to think about your user behaviour and pick the pricing model that suits you.
Overall, you'll need to invest a little time to set up Feature Flags in your product, and as a growth team it'll then be on you to establish best practices. At Atlassian, all teams began to learn these from the growth team, and feature flags spread rapidly.
Just don’t forget to clean them up after you get your experiment results!