Proxy metrics and Gousto’s “Aha” moment(s)

Julian Aylward
Gousto Engineering & Data
8 min read · Dec 4, 2023

Eureka!

We’ve all heard the story about experiment proxy metrics at Facebook and their now famous “aha” moment. The legend goes something like this: step one, sign up customers; step two, get new users to connect with 10 friends in 7 days; step three, hit 1 billion users and buy a big bag to put the money in.

Given the meteoric rise of Facebook, clearly they were doing something right (remember Myspace, anyone?). Experimentation certainly had a significant role to play in that success, but proxy metrics are not a silver bullet. Building a mature experimentation capability is hard, and there is much more to it than “just” finding your “aha” moment proxy metric.

It can also be a journey, requiring a significant amount of analysis, backtesting of previously run experiments (the more the better) and “field testing” of your new metric. As this post on proxy metrics at Netflix by Gibson Biddle describes, the process often takes months or even years, not weeks. So be warned: it might be less “eureka” while sitting in the bathtub one day, and more “buckle up, we are on a journey” than Facebook’s almost legendary “aha” moment might have you believe.

Lots of other big tech companies and platforms have reported similar “aha” moments to Facebook’s. For example, Twitter’s and LinkedIn’s claimed “aha” moments are essentially analogues of Facebook’s. And while Netflix’s best-known proxy metric case study relates to their recommendations (the percentage of users that rate more than 50 movies in a month), they have had plenty more proxy metrics in use, dependent on the squad or domain. Again, this might be a surprise to many who expect to find one proxy metric to rule them all.

At Gousto we’ve also found that proxy metrics have played a big part in the success of some of our squads. We haven’t (yet) found a grand unifying proxy metric that we can boldly claim to be our “aha” moment, more like several proxy metrics that have massively contributed to the success of specific squads and domains. More like “ah” moments if you must.

What is the point of proxy metrics if it’s a lot of effort?

Per my previous article here, the thing you ultimately want to measure and move is often totally useless as a primary metric for experimentation. Too “big” and hard to move, too “small” and tied to the UI of your product, or simply something that doesn’t actually correlate with or causally link to the thing you want to move (your north star metric). A proxy metric needs to be in the Goldilocks zone: not too big, not too small, but just right.

This is critical to achieving a healthy significance rate (positive or negative) and giving the squad the ability to learn from their tests and deliver wins as a result. No matter how big your sample sizes are, you can’t measure the impact of rolling out your new feature on your stock price or (net) profit via A/B testing!

Setting off on the journey

We defined a metric selection framework to help us evaluate potential proxy metrics, as well as to give confidence to the wider business that we weren’t just playing metric pick ’n’ mix. We then set off on our proxy metric journey, knowing what success would look like and the tools we would need, but not knowing where the journey would take us or how long it would take.

After years of rapid growth and scaling up in our tech team, we had the following situation:

  • Front end and experiment data in a Redshift database that was creaking
  • No standardised experiment dataset, no SQL code from previous experiment analysis stored in GitHub (or anywhere for that matter 😨), meaning backtesting was difficult if not impossible
  • A squad focused on our recipe recommendations who didn’t know if their work was delivering any value and struggled to get any significant results from experiments
  • A squad focused on improving our menu design and UX who were struggling to get any significant results

The real world of business is never as clear-cut as many people imagine, even in companies you think have really got their s**t together, and Gousto is no exception. Despite various challenges, we didn’t have the luxury of time. With multiple squads who had no idea if their work was delivering any value, the risk of doing nothing was greater than the risk of doing something.

It’s as much about the journey as the destination

Knowing that we needed to get moving, we focused on two potential proxy metrics:

  • User Menu Conversion Rate (UMCVR) to measure the impact of changes to the menu UI and UX, essentially a user-scoped measure of conversion on our menu
  • Basket Match (BM) to measure the impact of changes to our recommendations algorithm: the percentage of the recipes a customer purchases that come from the top N ranked recipes they see on the menu. For example, if I add 4 recipes to my basket and 2 of them are among my top N recipes by rank, I have a basket match of 50% (see the sketch below)
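
To make the definition concrete, here is a minimal sketch of how Basket Match could be computed for a single order. It illustrates the definition above, not Gousto’s actual implementation, and the recipe IDs are made up.

```python
def basket_match(basket: list[str], ranked_recipes: list[str], top_n: int) -> float:
    """Share of the recipes in a customer's basket that sit in their top-N ranked recipes."""
    if not basket:
        return 0.0
    top_recipes = set(ranked_recipes[:top_n])
    matched = sum(1 for recipe in basket if recipe in top_recipes)
    return matched / len(basket)

# The example from the text: 2 of the 4 recipes added are in the customer's top N -> 50%
ranking = ["r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"]
print(basket_match(["r1", "r3", "r9", "r12"], ranking, top_n=4))  # 0.5
```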

Basket Match has added appeal as a proxy metric in that it allows us to run offline simulations. If we could prove a causal link to Average Orders Per User (AOPU), we could simulate changes to our recommendations algorithm using historic data, with the aim of (retrospectively) maximising Basket Match before launching a test to validate any improvements.
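
As a rough sketch of what such an offline simulation might look like, the snippet below re-ranks historic menus with a candidate algorithm and recomputes average Basket Match, reusing the basket_match helper from the sketch above. The order structure and candidate_ranker are hypothetical stand-ins, not Gousto’s production code.

```python
from statistics import mean

def simulate_basket_match(orders, candidate_ranker, top_n=4):
    """Average Basket Match had `candidate_ranker` produced the rankings customers saw."""
    scores = []
    for order in orders:
        # Each order is assumed to carry the customer id, the menu shown and the final basket
        ranking = candidate_ranker(order["customer_id"], order["menu"])
        scores.append(basket_match(order["basket"], ranking, top_n))  # helper defined earlier
    return mean(scores)

# Illustrative usage: compare a candidate against the rankings customers actually saw
# baseline = mean(basket_match(o["basket"], o["shown_ranking"], 4) for o in orders)
# uplift   = simulate_basket_match(orders, new_ranker) - baseline
```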

For UMCVR, we manually ran a handful of backtests (horribly painful at the time) to assess the impact on win rate and MDEs (minimum detectable effects), and rushed it out the door. We were careful to clearly communicate that this metric was being “field tested” and would be reviewed as we gathered more data points to assess its performance.
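
The backtests themselves were manual, but the idea is simple: re-analyse stored experiments with the candidate metric as the primary metric and see how often it reaches significance. The sketch below shows the shape of that loop, with a Welch t-test as a simple stand-in for the analysis; the experiment data structure is hypothetical.

```python
from scipy import stats

def backtest_win_rate(experiments, alpha=0.05):
    """Share of past experiments where UMCVR differs significantly between arms."""
    significant = 0
    for exp in experiments:
        # exp["control"] / exp["variant"]: per-user menu conversion flags (0 or 1)
        _, p_value = stats.ttest_ind(exp["variant"], exp["control"], equal_var=False)
        significant += p_value < alpha
    return significant / len(experiments)
```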

For Basket Match, we ran a “backoff” test for our recommendations algorithm, with the control featuring randomly ranked recipes, in the hope of finally getting a significant result and, in doing so, demonstrating that recommendations were actually adding value. The result came back significant and positive for both Basket Match and AOPU, from which we can easily estimate financial benefit. Success!
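
A minimal sketch of that kind of readout is below: compare the variant (the recommendations ranking) against a control shown randomly ranked recipes, on both metrics. The arrays are randomly generated placeholders, not Gousto’s results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
control = {"aopu": rng.poisson(2.0, 40_000), "basket_match": rng.beta(3.0, 7.0, 40_000)}
variant = {"aopu": rng.poisson(2.05, 40_000), "basket_match": rng.beta(3.4, 6.6, 40_000)}

for metric in ("aopu", "basket_match"):
    _, p_value = stats.ttest_ind(variant[metric], control[metric], equal_var=False)
    lift = variant[metric].mean() - control[metric].mean()
    print(f"{metric}: lift = {lift:.4f}, p = {p_value:.4g}")
```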

Well, success of sorts… One data point is not enough to prove a causal relationship between our proxy metric Basket Match and AOPU. But it was already enough to get us going.

  • We now knew that recommendations (and by extension the team) were delivering value
  • We actually had two data points, not one: the AOPU and Basket Match for the control and variant respectively
  • We stuck a log function through the two data points, assuming diminishing returns as basket match increases, creating a sensible, if as-yet-unconfirmed, relationship between AOPU and Basket Match (see the sketch after this list)
  • We tested further big and bold upgrades to our recommendations algorithm with the aim of getting additional significant results for AOPU that would confirm the causal relationship between the two metrics
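
For the curious, here is a worked sketch of drawing that log curve through two points. The Basket Match and AOPU values are made up for illustration; only the method of assuming AOPU = a + b·ln(Basket Match), which flattens off as Basket Match rises, mirrors what we did.

```python
import math

# (basket_match, AOPU) for control and variant: placeholder values, not real results
bm_control, aopu_control = 0.30, 2.00
bm_variant, aopu_variant = 0.38, 2.06

# Two points pin down the two parameters of AOPU = a + b * ln(basket_match)
b = (aopu_variant - aopu_control) / (math.log(bm_variant) - math.log(bm_control))
a = aopu_control - b * math.log(bm_control)

def predicted_aopu(basket_match_rate: float) -> float:
    """Extrapolated AOPU, with diminishing returns as basket match increases."""
    return a + b * math.log(basket_match_rate)

print(predicted_aopu(0.45))  # a rough, as-yet-unconfirmed extrapolation
```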

Backtesting and proving causality

For Basket Match, we accumulated additional results from testing over time that have allowed us to demonstrate with high confidence the causal relationship between Basket Match and AOPU. Each time we get an additional result that is significant on both metrics, we adjust our modelled log function.

This means we can now simulate the benefit of updates offline with confidence, deploy them as an A/B test to validate the simulated impact, then roll out and bank the benefit, even for relatively small movements in Basket Match, given the improved sensitivity of this metric over AOPU.
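
In practice that boils down to two steps: refit the log relationship whenever a new result is significant on both metrics, then use the fitted curve to translate a simulated Basket Match movement into an estimated AOPU uplift. The sketch below shows one way to do this with scipy; all the data points are placeholders, not our actual results.

```python
import numpy as np
from scipy.optimize import curve_fit

def log_model(basket_match, a, b):
    return a + b * np.log(basket_match)

# (basket_match, AOPU) pairs from experiments significant on both metrics (placeholders)
bm_observed = np.array([0.30, 0.38, 0.41, 0.47])
aopu_observed = np.array([2.00, 2.06, 2.08, 2.11])

(a, b), _ = curve_fit(log_model, bm_observed, aopu_observed)

# Suppose an offline simulation says a candidate algorithm moves Basket Match 0.47 -> 0.50
estimated_uplift = log_model(0.50, a, b) - log_model(0.47, a, b)
print(f"Estimated AOPU uplift: {estimated_uplift:.4f}")
```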

The story for UMCVR was not so straightforward, and ultimately it turned out to be a dead end, but not a waste of time!

As I mentioned, proxy metrics are not a walk in the park. The more historic experiments you have to backtest and the more mature your experimentation capability, the easier it is to identify potential metrics and validate their performance. In the months after launching UMCVR as a “field test”, we also made a number of improvements that standardised our experimentation analysis process and improved our ability to backtest, significantly streamlining the process. Over time, this has led to us building a fully automated Experiment Analysis Tool (EAT), which monitors experiment health, calculates top-line metrics and writes out clean, standardised data that speeds up post-test analysis, backtesting and other meta-analysis.

So, with improved backtesting capabilities, more analytical firepower dedicated to the project and learnings from our journey so far, we started again from first principles:

  • Build a bunch of metrics that might make a suitable proxy metric for Average Orders Per User
  • Test for correlation (not causation at this stage) with AOPU (see the sketch after this list)
  • Assess their suitability as a proxy metric against our Metric Selection Framework
  • Backtest the most promising metrics to look for evidence of a causal relationship
  • Run new tests with our proxy metric as the primary metric, and confirm the causal relationship identified via backtesting persists “in the field”
  • Formally adopt the new metric
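
The correlation check in the second step can be as simple as comparing each candidate metric’s movement against AOPU’s movement across the backtested experiments. A minimal sketch is below; the candidate names and lift numbers are made up for illustration.

```python
from scipy.stats import pearsonr

# Relative lift per backtested experiment (one entry per experiment): made-up numbers
aopu_lift = [0.006, -0.002, 0.012, 0.001, 0.009]
candidates = {
    "average_first_adds": [0.010, -0.004, 0.021, 0.003, 0.015],
    "umcvr":              [0.008,  0.006, 0.002, 0.001, 0.004],
}

for name, lifts in candidates.items():
    r, p_value = pearsonr(lifts, aopu_lift)
    print(f"{name}: r = {r:.2f}, p = {p_value:.3f}")
```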

Following this process, the metric we identified was Average First Adds (AFA).

AFA is more sensitive than AOPU due to lower variance (the sample size needed to detect a given effect scales with a metric’s variance) and a higher baseline of “conversions” (more customers add at least one item to their basket than successfully check out), leading to an improved significance rate for teams using the metric. Analysis of recent experiments with AFA as the primary metric shows a strong causal relationship with AOPU.
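
A back-of-the-envelope sketch of why that matters is below: the standard sample size formula for a two-arm test shows how many users per arm you need to detect the same 1% relative lift on each metric. The baseline means and variances are illustrative assumptions, not Gousto’s figures.

```python
from scipy.stats import norm

def sample_size_per_arm(variance: float, abs_mde: float, alpha=0.05, power=0.8) -> float:
    """Approximate users per arm to detect an absolute effect of `abs_mde`."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * variance * (z / abs_mde) ** 2

# Illustrative baselines and variances (orders per user vs. a first-add rate)
aopu_mean, aopu_var = 2.0, 4.0
afa_mean, afa_var = 0.80, 0.16

print(sample_size_per_arm(aopu_var, 0.01 * aopu_mean))  # roughly 157k users per arm
print(sample_size_per_arm(afa_var, 0.01 * afa_mean))    # roughly 39k users per arm
```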

AFA has proven to be a powerful new tool in our experimentation toolbox for the relevant teams, and has unlocked significant additional wins and learnings.

So no grand unifying metric then?

Sadly not. Perhaps the social media companies have a simpler objective, which might lead to a more unifying metric for success.

  • Increase Active Users
  • Increase Session Time
  • Increase advertising $$$

The drivers of Gousto’s overall success are a bit more complex:

  • Sign up users
  • Retain users
  • Increase Orders per User
  • Increase Average Order Value
  • Increase Profitability Per Order

Perhaps we just haven’t found it yet, in the same way that physicists are still searching for their Grand Unified Theory.

While it sounds tempting and glamorous (at least to experimentation nerds like me) to find a grand unifying metric, what really matters is finding trustworthy proxy metrics that enable one or more squads or domains to achieve a healthy win rate and learn and iterate as a result.

As I mentioned earlier, while proxy metrics can be really impactful, they are not a silver bullet. The more experiments you are running and the better your ability to backtest historic experiments, the easier your proxy metric journey will be, which highlights the benefits of the other aspects that constitute a mature experimentation programme.

Summary:

  • Proxy metrics are a powerful tool but not a silver bullet
  • Focus on delivering value rather than fixating on finding some grand unifying metric
  • The journey is just as valuable as the destination: what criteria do you need your proxy metric to meet? Effective backtesting requires standardised experiment data and analysis of the results
  • Sometimes the risk of doing nothing is greater than the risk of doing something
  • It’s vital you take your wider stakeholders with you
  • Correlation is great, but it’s causation we need to demonstrate
  • It takes more than one experiment to demonstrate causation. You cannot confidently demonstrate the relationship from one or two data points, although you might be able to guesstimate it
  • Bonus points for a metric you can simulate offline

Credit to Hugo Fernandes for the end-to-end development and roll-out of Average First Adds.
