A Framework for Metric Selection in Experimentation — The Goldilocks Zone

Julian Aylward
Gousto Engineering & Data
5 min read · Aug 25, 2022
Photo by Javier Miranda on Unsplash

Not too big, not too small, but just right. We all know the story of Goldilocks and the three bears. In astronomy, the Goldilocks zone refers to planets that orbit a star at a distance that is “just right” to support life as we know it, and both the story of Goldilocks and the Goldilocks zone have a lot of parallels with the quest of product analysts, in this world and perhaps beyond, for the right primary metric to use in experimentation (A/B testing).

You could be forgiven for wondering why this is a big deal. Can’t you just pick the metric you want to increase and stop faffing about? How hard can it be, right?

It turns out it’s not as simple as all that: we are intrinsically walking a tightrope between “measuring the thing we actually want to move” and measuring the most appropriate primary metric for our experiments, which may not be the same thing.

Let’s take a practical example from Gousto, where experimentation is a core part of our strategy and philosophy. The “thing we actually want to move” is often Customer Lifetime Value (CLV) or profit (EBITDA), but these metrics are too far removed from, and too insensitive to, the change we are normally making. On the other hand, say we are adding a new module to our app homepage. If we pick a metric too close to the change (e.g. percentage of customers interacting with the module, or click-through rate), we’ll obviously see an increase in that metric, but you’d be pretty hard pressed to argue that this obviously means we had increased CLV and should therefore deem the experiment to have been successful.

So, back to the Goldilocks zone. Measuring the thing we’re changing is “too small”, measuring the thing we actually want to move is “too big”, so we need to find something in between that is “just right”. As we ramp up experimentation across an increasing number of squads at Gousto, the metrics might differ, but many of the principles remain the same. One size (or metric) does not fit all needs, but similarly a pick ’n’ mix approach to metrics will end in a mess. Inspired by this great article by Curtis Stanier, I’ve attempted to codify a set of principles, or framework, for metric selection that allows us to explore and evolve our primary metrics at Gousto while also retaining trust and avoiding a metric soup:

Sensitivity and MDEs: The metric must be sensitive enough to achieve significant results at the magnitude of the experiments being run in a given domain over an acceptable time period (likely 2–4 weeks). How small does your Minimum Detectable Effect (MDE) need to be to achieve significant results (positive or negative) at an acceptable significance level, allowing you to learn and iterate fast? Reducing your MDE to an arbitrarily small effect size is always a trade-off with experiment runtime; it might not make commercial sense and in many cases simply won’t be possible!
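
As a rough illustration of that trade-off, here is a minimal sketch in Python using statsmodels; the baseline conversion rate and weekly traffic figures are assumptions for illustration, not real Gousto numbers:

```python
# A rough sketch of the MDE vs runtime trade-off using statsmodels.
# Baseline conversion rate and weekly traffic are illustrative, not real Gousto figures.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10           # assumed baseline conversion rate
weekly_visitors = 10_000  # assumed traffic per variant per week

for relative_mde in (0.10, 0.05, 0.02):  # 10%, 5% and 2% relative uplift
    target = baseline * (1 + relative_mde)
    effect = proportion_effectsize(target, baseline)
    n_per_variant = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
    )
    weeks = n_per_variant / weekly_visitors
    print(f"MDE {relative_mde:.0%}: ~{n_per_variant:,.0f} customers per variant (~{weeks:.1f} weeks)")
```

With these illustrative numbers, halving the MDE roughly quadruples the required sample size, which is exactly why chasing an arbitrarily small effect quickly stops making commercial sense.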

Low(er) variance: Variance plays a key role in the statistical power of a test, and therefore in the sample size required to detect a given effect size. The sample size required for a given power scales with the variance of the metric (the square of its standard deviation), which is easily demonstrated by Lehr’s rule of thumb. A metric like Sales Value Per Customer might have much higher variance than, say, Orders Per Customer.
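
To see the effect of variance, here is a quick back-of-the-envelope comparison using Lehr’s rule of thumb (n ≈ 16σ²/Δ² per group for 80% power at a 5% significance level); the means and standard deviations below are made up for illustration:

```python
# Lehr's rule of thumb: n ≈ 16 * sigma^2 / delta^2 per group (80% power, alpha = 0.05).
# Means and standard deviations below are purely illustrative.
def lehr_sample_size(sigma: float, delta: float) -> float:
    """Approximate per-group sample size to detect an absolute difference `delta`."""
    return 16 * sigma ** 2 / delta ** 2

# Hypothetical metrics: same 2% relative uplift, very different variance.
metrics = {
    "Orders Per Customer":      {"mean": 2.0,  "sigma": 1.0},   # lower variance
    "Sales Value Per Customer": {"mean": 60.0, "sigma": 45.0},  # higher variance
}

for name, m in metrics.items():
    delta = 0.02 * m["mean"]  # absolute uplift corresponding to +2% relative
    print(f"{name}: ~{lehr_sample_size(m['sigma'], delta):,.0f} customers per group")
```

For the same relative uplift, the noisier revenue-style metric needs more than twice as many customers per group in this sketch.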

Statistically Sound: A previous definition of our Menu Conversion Rate violated the Assumption of Independence which can seriously increase your false-positive rate (type 1 error rate). Normally distributed data and similar variance between test groups may also be beneficial, especially at lower sample sizes.
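
One common way independence gets violated is when the unit of analysis (e.g. sessions or orders) is not the unit of randomisation (customers). The toy simulation below is illustrative only, not our actual Menu Conversion Rate definition, and shows how that kind of clustering inflates the type 1 error rate well above the nominal 5%:

```python
# Toy simulation: randomise by customer, but (wrongly) analyse at the session level.
# Purely illustrative; this is not the real Menu Conversion Rate definition.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_customers, sessions_per_customer, n_sims = 500, 10, 2000
false_positives = 0

for _ in range(n_sims):
    # Under the null hypothesis both groups are drawn from the same distribution:
    # each customer has their own conversion propensity, shared by all their sessions.
    p_a = rng.beta(2, 8, n_customers)
    p_b = rng.beta(2, 8, n_customers)
    sessions_a = rng.binomial(1, np.repeat(p_a, sessions_per_customer))
    sessions_b = rng.binomial(1, np.repeat(p_b, sessions_per_customer))
    # A session-level t-test ignores that sessions from one customer are correlated,
    # so it underestimates the standard error of the difference.
    _, p_value = stats.ttest_ind(sessions_a, sessions_b)
    false_positives += p_value < 0.05

print(f"Observed type 1 error rate: {false_positives / n_sims:.1%} (nominal 5%)")
```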

Causal Relationship with North Star Metric: A good primary metric must be causally related to our North Star Metric (the thing we actually want to move), giving us the confidence that optimising “locally” will deliver the right results “globally”. This might sound simple, but the reality probably involves a lot of analytical grunt work, back-testing, validation and so on. Finding correlation is relatively simple; proving a causal relationship is harder.
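
A simplified sketch of what that grunt work can look like: back-test across previously concluded experiments and check whether the measured lift in a candidate proxy metric tracks the measured lift in the North Star. All data and column names below are hypothetical:

```python
# Back-testing sketch: does movement in a candidate proxy metric track movement
# in the North Star across past experiments? All data and column names are hypothetical.
import numpy as np
import pandas as pd

past_experiments = pd.DataFrame({
    "experiment": ["exp_01", "exp_02", "exp_03", "exp_04", "exp_05"],
    "proxy_lift_pct": [1.2, -0.4, 2.1, 0.3, -1.0],       # e.g. orders per customer
    "north_star_lift_pct": [0.8, -0.2, 1.5, 0.1, -0.7],  # e.g. CLV, measured much later
})

correlation = past_experiments["proxy_lift_pct"].corr(past_experiments["north_star_lift_pct"])
slope, intercept = np.polyfit(past_experiments["proxy_lift_pct"],
                              past_experiments["north_star_lift_pct"], deg=1)
print(f"Correlation across experiments: {correlation:.2f}")
print(f"Estimated pass-through: {slope:.2f} points of North Star lift per point of proxy lift")
# Correlation like this is only a starting point; a causal claim still needs holdouts,
# longer-term validation and domain knowledge.
```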

Prefer Leading to Lagging: Some metrics will take longer than others to “mature”. For example, in fashion retailing, returns run at ~30% of gross sales. Needing to wait 30 days after your A/B test concludes for your net sales (including returns) to “mature” will significantly slow down your end-to-end experiment lifecycle and “learning rate”. In some cases it might be unavoidable, but consider whether the pros of higher accuracy outweigh the cost of a reduced ability to learn and iterate fast.

As Universal as Possible: We want the minimum number of metrics to “select” from as our primary KPIs. Metrics should be thoroughly tested and understood. A pick ’n’ mix approach to KPIs is problematic for multiple reasons, namely trust, consistency, continuity, and the statistical bear trap of “hunting” for the metric or metrics that (conveniently) confirm your hypothesis (the multiple comparison problem).
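
To put a number on that bear trap, here is a quick calculation of the family-wise error rate when testing several candidate metrics at α = 0.05 each, alongside the Bonferroni-corrected per-metric threshold:

```python
# Family-wise error rate when "hunting" across k independent metrics at alpha = 0.05,
# and the Bonferroni-corrected per-metric threshold that keeps it near 5%.
alpha = 0.05
for k in (1, 3, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:>2} metrics: P(at least one false positive) = {fwer:.1%}, "
          f"Bonferroni threshold = {alpha / k:.4f}")
```

With ten candidate metrics, the chance of at least one spurious “win” is already around 40%, which is exactly why a small, well-understood set of primary KPIs beats a metric soup.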

Resistant to Change: Some metrics might be highly dependent on the user interface, which is much more likely to evolve at a faster rate than the core proposition of your company. For example, at Gousto our recommendations are getting pretty impressive, but we’ll likely continue to have a menu section allowing customers to select their recipes for some time yet. The number of recipes is constantly increasing, necessitating changes to the UI and new features to help customers navigate the increased choice. A metric that is overly tied to the UI (e.g. number of categories browsed) is more prone to change over time in a manner that may cause problems or confusion. We cannot pre-empt everything, but we should try to select “change tolerant” metrics where possible.

While this doesn’t give you a magic recipe for finding the perfect metric, hopefully it does provide a set of criteria, or a framework, against which to assess potential primary metrics for experimentation, take your stakeholders on the journey with you, and dispel any myth that finding a metric is just a case of picking one out of a hat.

You might have heard of Facebook’s “aha moment”, or Netflix’s 50 Movies in the first two months, which, through extensive analysis and testing, enabled both companies to confidently invest millions, or billions, of dollars in optimising against these proxy metrics. For those prepared to do their homework, the prize can be significant!

Check out more stories from Gousto and make sure to follow us here
