How Meta tests products with strong network effects

Analytics at Meta
15 min read · Mar 14, 2024

I’m a member of a team that’s been applying cluster experimentation to products with strong network effects, such as chat and calling, since 2018. Today, I’d like to give an overview of the challenges we face in these highly interactive domains, and how one solution, cluster experiments, has become our go-to method for addressing them. We’ll discuss the essential trade-off teams must make between test power and cluster purity, and how statistical and practical test power can differ when running cluster experiments. Finally, the post ends with some advice for practitioners, based on the results of running several thousand such experiments over the past 5 years.

Case study: dropped-call redials

In a real experiment, one where test users were given a redial button after dropped calls, an increase in engagement detected when the test was small became non-significant as the experiment was scaled up. This sort of thing isn’t uncommon: novelty effects and underpowered tests regularly see promising early results turn neutral, prompting rework or rollback.

However, this time something was off: the total number of calls, across both the test and control users in the experiment, showed an increasing discontinuity as the experiment rolled out to more and more users.

This finding gave us a strong reason to believe there was an effect, even if our A/B test, as designed, was no longer detecting one. Our leading hypothesis was that test user engagement was somehow spilling over to the control group, causing control users’ call counts and call durations to rise along with the test group’s. If true, the increasing similarity between test and control engagement as the experiment grew could be causing the detected effect sizes to shrink faster than the confidence intervals that contained them.

Test interference

A/B tests are used to measure a causal difference between two groups of users, one with a feature, and one without. But what happens generally when these two groups can interfere with one another, as in the case of real-time calling experiences? And, specifically, how does this affect our measurement of their differences?

Experience shows two categories of problems emerge:

  1. Under-treatment: Experiments that rely on users being in the same treatment group will contain very few of the interactions you’re trying to test. For example, we may decide that a new calling feature will only be supported on calls where all participants are in the same treatment group. In this case, test users can be thought of as receiving a lower “dosage” than the experimenter expected or required. Teams intuitively run larger and larger experiments to counteract this effect.
  2. Spillover: Experiments where test and control users can affect each other can cause misinterpretation of an experiment’s results. For example, let’s say we allow test users to use a feature, even when they’re talking to a control peer: as test user behavior changes the number of messages sent in these threads, control user engagement is also very likely to change, shrinking our measurement of their differences, and simultaneously increasing both test and control variances. Put differently, control users can be thought of as also receiving a “dosage” of the test experience. In contrast to the under-treatment case, and as we’ll explore below, this problem actually worsens as teams run larger experiments.

Analyzing experiment interactions

Let’s reconsider our calling experiment, where 5% of all users are assigned to the test group, and 5% to control. As each user has been randomly and independently assigned to treatment, we can easily compute the likelihood that both members on a two-person call will end up in the same treatment group:

The under-treatment case is immediately apparent: only 2.6% of test user calls are with other test users. This doesn’t seem at all like the experience an experimenter intended, and we’re very unlikely to find a significant difference between test and control users when over 97% of test user interactions happen with people receiving the status quo experience. Note that the control-to-control and control-to-production experience is identical, so this issue doesn’t affect our control group.
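
As a rough sketch of that arithmetic (my own illustration, assuming users are assigned independently and calls involve exactly two people), the breakdown can be computed directly:

```python
# Sketch: call-level interaction breakdown under independent user randomization.
# Assumes two-person calls between randomly paired users; numbers are illustrative.

def call_breakdown(p_test: float, p_control: float) -> dict:
    """Among calls that involve at least one test user, which group is the other caller in?"""
    p_prod = 1.0 - p_test - p_control
    test_test = p_test * p_test               # both callers in test
    test_control = 2 * p_test * p_control     # one test caller, one control caller
    test_prod = 2 * p_test * p_prod           # one test caller, one production caller
    any_test = test_test + test_control + test_prod
    return {
        "test-test": test_test / any_test,
        "test-control": test_control / any_test,
        "test-production": test_prod / any_test,
    }

# A 10% experiment split evenly: 5% test, 5% control.
print(call_breakdown(0.05, 0.05))
# ≈ {'test-test': 0.026, 'test-control': 0.051, 'test-production': 0.923}

# Full allocation (50% test, 50% control): two-thirds of the calls involving a
# test user also involve a control user.
print(call_breakdown(0.50, 0.50))
# ≈ {'test-test': 0.333, 'test-control': 0.667, 'test-production': 0.0}
```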

What’s less apparent in this analysis of a single experiment size is the influence spillover will have on our experiment as a test grows larger:

No matter the size of the experiment, it’s always twice as likely that a call involving a test user will be with a control user, rather than with another test user.

At very large test sizes, where every user has been allocated to either test or control, ⅔ of all redials initiated by test users will spill over to control users, causing the differences we measure between the two groups to shrink. Outcomes like these can cause us to miss out on shipping good experiences, or miss bad product experience changes until after we’ve already shipped them to production.

A well-designed experiment must not only be capable of introducing an effect, but that effect must also be contained in order to be measurable. What we really need to do is isolate the test and control users, enabling us to better compare test interactions against control interactions.

Enter cluster experiments

Cluster experiments are our method for isolating test and control groups, and specifically their interactions, from one another. The procedure is pretty simple:

  • Users are assigned to clusters based on who they interact with the most
  • We then randomize users to treatment based on which cluster they belong to, so that all users in the same cluster will belong to the same treatment group

We found this method of randomization-by-cluster and exposure-by-user worked best at Meta, as it enabled cluster experiments to reuse all of the existing user-level metrics teams already used to measure themselves. Likewise, as it’s only a tweak to our randomization procedure, teams could make the switch to cluster experiments without making any changes to their existing app or service experiment configurations.
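
As a minimal sketch of randomization-by-cluster with exposure-by-user (the cluster map, salt, and bucket count below are illustrative, not our production setup):

```python
import hashlib

# Illustrative user -> cluster map; in practice this would come from a
# precomputed clustering served out of a key-value store.
USER_TO_CLUSTER = {"alice": 17, "bob": 17, "carol": 942}

def bucket(key: str, salt: str, buckets: int = 10_000) -> float:
    """Deterministically hash a key into [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
    return (int(digest, 16) % buckets) / buckets

def assign(user_id: str, experiment: str,
           test_share: float = 0.05, control_share: float = 0.05) -> str:
    """Randomize by cluster, expose by user: every user in a cluster gets the
    same group, but exposure logging and metrics stay at the user level."""
    cluster_id = USER_TO_CLUSTER.get(user_id)
    if cluster_id is None:
        return "unassigned"                      # e.g. a new user with no cluster yet
    b = bucket(str(cluster_id), salt=experiment)
    if b < test_share:
        return "test"
    if b < test_share + control_share:
        return "control"
    return "production"

# alice and bob share a cluster, so they always land in the same group.
print(assign("alice", "redial_button"), assign("bob", "redial_button"))
```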

And notice how clustering solves both of our problems: By isolating users, we increase how often users in the same treatment interact, while decreasing how often users in different treatments do:

  • Under-treatment cases are dramatically reduced, by boosting the number of threads where both users are in the same treatment group, and
  • Spillover cases are reduced, as control and test users are less likely to interact with each other.

Building clusters

Famously, Meta refers to the global web of connections between people as “the graph.” But one big graph of users makes for bad experiments (more on this later). We’ll need to break this graph down into smaller sub-graphs of users in order to establish test and control samples.

In our experience, modularity-maximizing graph-partitioning algorithms, such as Louvain, tend to produce the best clusters for experimentation. New products, where groups of users already exist within disconnected sub-graphs, may be able to use these natural clusters without modification.
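
A clustering pass might look something like the sketch below, here using networkx’s Louvain implementation with made-up interaction weights (our production pipeline differs):

```python
# Sketch: build experiment clusters from an interaction graph with Louvain.
# Edge weights (e.g. calls or messages between two users) and the resolution
# parameter are illustrative.
import networkx as nx
from networkx.algorithms.community import louvain_communities

G = nx.Graph()
G.add_weighted_edges_from([
    ("alice", "bob", 25),      # 25 interactions between alice and bob
    ("bob", "carol", 3),
    ("dave", "erin", 40),
    ("erin", "frank", 12),
])

# Higher resolution tends to produce more, smaller communities (clusters).
communities = louvain_communities(G, weight="weight", resolution=1.0, seed=42)

user_to_cluster = {user: i for i, members in enumerate(communities)
                   for user in members}
print(user_to_cluster)
```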

Evaluating clusters

The number one question we get from teams is what it means to have the “right” clustering. While a complete answer depends on how susceptible the experiment’s design is to spillover, we use the following metrics when selecting graph-partitioning hyperparameters:

  • Purity: what percent of relationships (graph edges) and what percent of their value (edge weight), are contained within the clusters?
  • Intra-cluster correlation coefficient (ICC): how correlated are metric values for users within the same cluster?
  • Coverage: what percentage of users, and their interactions, will be assigned a cluster?
  • Cardinality: how many clusters are there?
  • Variability: what’s the distribution of cluster sizes?

Of these measures, purity and ICC are the most important. Purity determines how many within-treatment vs. between-treatment interactions will occur, while ICC determines the reduction in power we take when we randomize by cluster.
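
For reference, here’s one way to compute those two metrics (a sketch using common definitions: purity as the within-cluster share of edge weight, and the ICC via a one-way ANOVA estimator; these may differ from our internal implementations):

```python
import numpy as np

def weight_purity(edges, user_to_cluster):
    """Share of total edge weight whose endpoints fall in the same cluster."""
    within = total = 0.0
    for u, v, w in edges:
        total += w
        cu, cv = user_to_cluster.get(u), user_to_cluster.get(v)
        if cu is not None and cu == cv:
            within += w
    return within / total

def icc_anova(values_by_cluster):
    """One-way ANOVA estimate of the intra-cluster correlation coefficient.
    `values_by_cluster` is a list of per-cluster arrays of a user-level metric."""
    groups = [np.asarray(c, dtype=float) for c in values_by_cluster]
    sizes = np.array([len(g) for g in groups], dtype=float)
    n, k = sizes.sum(), len(groups)
    grand_mean = np.concatenate(groups).mean()
    cluster_means = np.array([g.mean() for g in groups])
    ms_between = np.sum(sizes * (cluster_means - grand_mean) ** 2) / (k - 1)
    ms_within = sum(np.sum((g - m) ** 2) for g, m in zip(groups, cluster_means)) / (n - k)
    n0 = (n - np.sum(sizes ** 2) / n) / (k - 1)   # adjusted mean cluster size
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)

edges = [("alice", "bob", 25), ("bob", "carol", 3), ("dave", "erin", 40)]
clusters = {"alice": 0, "bob": 0, "carol": 1, "dave": 2, "erin": 2}
print(weight_purity(edges, clusters))                    # 65/68 ≈ 0.96
print(icc_anova([[10, 10, 10], [3, 4, 2], [7, 8, 9]]))   # high within-cluster correlation
```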

How cluster purity affects group interactions

Returning to our previous example of a 10% A/B test, let’s compare user-randomization to cluster experimentation across clusters of different purity.

Many people have a lightbulb moment when they realize that the 0% purity case — a special case of cluster experiments where every user is in their own cluster — is just classic user-randomized experimentation in disguise.

Think about it: if everyone’s in their own cluster, every interaction must be with a different cluster, which is exactly the same as not using clusters at all!

Armed with this detail, notice that since all curves are either monotonically increasing or decreasing, user-randomization (the 0% purity case) represents the worst possible isolation. This means literally any clustering that increases interaction isolation will be an improvement over not running a cluster experiment. In fact, even a 1% experiment using clusters with just 1% purity will show a 3x improvement in test-to-test interactions relative to a user-randomized test.

We can see how cluster purity affects experiments of any size when we sweep from 1–100% (dotted lines show our user-randomized baseline):

Notice how the percentage of same-group interactions is, by definition, strictly larger than the cluster purity, and grows with the experiment size. Nascent products, ones with sparse graphs, have even been known to have purities near 100%. When those products rely on network effects to grow, cluster experimentation unlocks the ability to run A/B tests for the teams supporting them.
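
A back-of-the-envelope way to see this (an approximation that treats out-of-cluster peers as if they were user-randomized, not our exact model):

```python
# Approximate share of a test user's interactions that stay within the test
# group, assuming interactions that leave the cluster look like user
# randomization. `test_share` is the fraction of all users allocated to test.

def same_group_share(purity: float, test_share: float) -> float:
    return purity + (1.0 - purity) * test_share

# 1% experiment split evenly (0.5% of users in test), clusters with 1% purity:
clustered = same_group_share(0.01, 0.005)   # ≈ 0.015
baseline  = same_group_share(0.00, 0.005)   # = 0.005, i.e. plain user randomization
print(clustered / baseline)                 # ≈ 3x more test-to-test interactions
```

Because purity enters the expression additively, the same-group share can never fall below the purity itself, and it rises as the experiment’s allocation grows.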

How clustering affects effective sample size

When compared to classic user-randomized tests, cluster experiments introduce an additional source of variance into our calculations: the intra-cluster correlation coefficient, or ICC.

A foundational A/B testing assumption is that users are randomized to treatment independently of one another. But remember, we’ve intentionally clustered users together, based on who they interact with the most, and then exposed them to the same treatment! Even if the clusters are independent of one another, the people within those clusters are definitely not acting independently.

When compared to traditional A/B testing assumptions, a nonzero ICC can be thought of as effectively shrinking the experiment’s sample size. For a given cluster, this shrinkage is quantified by 1 + (cluster size - 1) * ICC.

Consider this example: if everyone in a 4-person cluster always calls each other, and everyone has exactly the same number of calls (say, 10), this cluster will have an ICC of 1.0 (i.e. their calling behavior is perfectly correlated):

In this example, we have a cluster size of 4 and an ICC of 1.0, so 1 + (4 - 1) * 1.0 = 4: the variance is understated by a factor of 4. Since confidence intervals are calculated from the standard deviation, and the standard deviation is the square root of the variance, our interval widths would be off by a factor of √4 = 2. Put differently, without accounting for the ICC, our confidence intervals would be only half as wide as necessary at our chosen alpha level, making us very likely to report false positives.
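
In code, the same design-effect arithmetic looks like this (a sketch; the average cluster size and ICC in the second example are illustrative):

```python
import math

def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Variance inflation from randomizing by cluster: 1 + (m - 1) * ICC."""
    return 1.0 + (avg_cluster_size - 1.0) * icc

def effective_sample_size(n_users: int, avg_cluster_size: float, icc: float) -> float:
    return n_users / design_effect(avg_cluster_size, icc)

# The 4-person, perfectly correlated cluster above:
deff = design_effect(4, 1.0)
print(deff, math.sqrt(deff))    # 4.0, 2.0 -> confidence intervals must be 2x wider

# A more typical (illustrative) case: clusters of ~8 users, ICC around 0.02.
print(effective_sample_size(1_000_000, 8, 0.02))   # ≈ 877,000 effective users
```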

Let’s try another example: what if rather than clustering people based on who they interact with the most, we instead randomly assign users to clusters, and then randomly assign those clusters to treatment:

In this case users are effectively assigned to treatment independently, which is another way of saying that we expect their metric values within a cluster to be uncorrelated. With an ICC of zero, 1 + (cluster size - 1) * ICC = 1, meaning we don’t underestimate our variance at all. This is exactly what we’d expect, as this method is statistically equivalent to classic user-randomization.

The effective sample size of a test can be as low as the number of clusters exposed (the ICC = 1.0 case) and as high as the number of users exposed (the ICC = 0 case). Since clusters are rarely fully connected, and users even more rarely have identical metric values, clustering in practice tends to impose a small but meaningful cost in test power.

Trading off power and purity

Because the lowest-purity (0%) clustering, the one where every interaction crosses cluster boundaries, is the one where every user sits in a cluster by themselves, we can consider traditional user-randomization a special case of cluster experimentation in which every user belongs to their own cluster.

Likewise, in a densely-connected network, the highest-possible purity, 100%, would be achieved only when every user is in the same cluster.

Taken together, we have a power-purity continuum we can use to consider the right trade-off for our experiment:

  • At one extreme, we have user-randomization. Purity is zero, but statistical power is at its highest. We can likely choose this option if our test won’t be affected by spillover — when we’re experimenting on products without network effects.
  • At the other extreme, we have a mega-cluster. Purity is 100%, but without a control group counterfactual, statistical power is effectively zero. Tests like this aren’t unheard of: testing in a full country, and constructing a counterfactual from a forecast or a neighboring country, was the most common way of incorporating network effects into our launch estimates in the days before cluster experiments.
  • The space in-between is our frontier. A common approach is to generate a set of cluster candidates by way of hyperparameter sweep, evaluating each for the between-group interactions and minimum detectable effect (MDE) sizes required by the experiment.

As the ICC varies dramatically depending on the specifics of the product and the metrics being evaluated, we strongly suggest all experimenters run an interaction analysis, as done in Analyzing experiment interactions above, as well as a power analysis, before iterating with their product teams. In our experience, diminishing returns in either direction quickly kick in, yielding a Goldilocks zone with a much smaller range of trade-offs to consider.
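
One simple way to fold the design effect into that power analysis (a sketch using a standard two-sample minimum-detectable-effect formula; the inputs are illustrative):

```python
# Sketch: inflate a classic two-sample MDE by the square root of the design effect.
from scipy.stats import norm

def mde(n_per_group: int, sd: float, avg_cluster_size: float, icc: float,
        alpha: float = 0.05, power: float = 0.80) -> float:
    deff = 1.0 + (avg_cluster_size - 1.0) * icc
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * sd * (2.0 * deff / n_per_group) ** 0.5

# Illustrative inputs: 500k users per group, metric SD of 3.0 calls per week.
print(mde(500_000, 3.0, avg_cluster_size=1,  icc=0.00))   # user-randomized baseline
print(mde(500_000, 3.0, avg_cluster_size=10, icc=0.03))   # cluster-randomized
```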

Another way to think about the power-purity trade-off is one of statistical power vs. practical power. If statistical power is the probability of detecting an effect when there is one, practical power also considers the likelihood of the test setup introducing an effect that can be detected.

Put differently, you may be able to achieve smaller confidence intervals by using lower-purity clusters, but ask yourself: is that low-purity experiment the one I’m trying to run, and do I even trust its results for decision-making in my business?

Next steps for your team

We’ve provided a detailed writeup in our paper, Network experimentation at scale, which includes a more complete discussion of how we build clusters, randomize clusters to treatment, compute an unbiased estimate for the variance, and the tweaks we’ve applied to our A/B testing framework to make usage seamless for experimenters.

Author: Tyrone Palmer

A special thank you to Brian Karrer, Liang Shi, Monica Bhole, Matt Goldman, Charlie Gelman, Mikael Kontugan, Feng Sun, Patrick Larson, Andrei Bakhayev, and to the hundreds of people who have run a cluster experiment at Meta. This work would have been impossible without you.

Appendix: Notes for Practitioners

As promised, here’s some hard-won, practical advice for those running cluster experiments for the first time.

Cluster decay: In the experimentation context, clusters are essentially a prediction of who will communicate in the future. As time goes on, we tend to see cluster purity decay as a result, though this decay tends to be asymptotic, and cluster experiments have proven useful even when run as long as a year. The edge weights you feed into a clustering algorithm will have the largest impact on its construction, and a little extra effort to avoid overfitting, or to incorporate additional signals predictive of future user interactions, can go a long way in extending the runway of your clusters.

The new-user problem: New users won’t have a cluster, meaning they won’t, by default, be exposed to your experiment. Depending on the mix of new vs. existing user interactions that show up in your metrics, as well as the maturity stage of your product, simple or complex approaches to assigning new users to clusters may be appropriate (including ignoring this problem entirely).

Mixed experiments: Cluster experiments sound great on paper, but how do you prove they’re an improvement? One method we’ve developed, called mixed experimentation, involves running a two-armed test, each arm with its own control group, randomizing users to treatment by cluster-id in the first arm and by user-id in the second, and then comparing the results. Many PMs have been won over by this approach, but equally important is that teams have avoided the overhead of running cluster experiments when they’re not a fit for their product.

Updating clusters: While clusters based on more recent data tend to perform better than ones computed long ago, interactions between pairs of users may change quickly or slowly. Depending on your product’s network, it’s usually not necessary to compute new clusters every day, or even every month. An active experiment usually shouldn’t have its clusters updated, as users may be reassigned to new clusters, wreaking havoc on your experiment. Likewise, network experiments that must remain mutually exclusive from each other, and that therefore share a clustering, will need to coordinate their cleanup phases.

Integrating with existing A/B testing tools: Users are typically assigned to a treatment group based on the hash of their user-id, or some other unique identifier. Cluster-randomization is easily enabled by randomizing on users’ cluster-ids instead, typically achieved through a lookup in a low-latency, scalable key-value store. Exposure logs should likewise be modified to include the cluster-id, allowing calculation of the ICC when computing effect sizes and confidence intervals.
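
One simple, valid way to respect the clustering at analysis time, given exposure logs that carry the cluster-id (a sketch only; our paper describes a different, more general variance estimator), is to collapse user-level metrics to cluster means and compare clusters rather than users:

```python
# Sketch: analyze a cluster-randomized test by aggregating exposure-logged,
# user-level metrics to cluster means. Column names and values are illustrative.
import pandas as pd
from scipy.stats import ttest_ind

exposures = pd.DataFrame({
    "user_id":    ["alice", "bob", "carol", "dave", "erin", "frank", "grace", "heidi"],
    "cluster_id": [17, 17, 23, 23, 51, 51, 60, 60],
    "group":      ["test", "test", "test", "test",
                   "control", "control", "control", "control"],
    "calls":      [12, 9, 8, 11, 7, 5, 6, 8],
})

cluster_means = (exposures
                 .groupby(["cluster_id", "group"], as_index=False)["calls"]
                 .mean())
test = cluster_means.loc[cluster_means.group == "test", "calls"]
control = cluster_means.loc[cluster_means.group == "control", "calls"]

# Welch t-test on cluster means: the unit of analysis matches the unit of randomization.
print(ttest_ind(test, control, equal_var=False))
```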

Dealing with “whale” clusters: Modularity-maximizing algorithms tend to produce a nice mixture of small clusters, depending on the connectedness of the input graph. However, almost every clustering results in a few whale clusters many orders of magnitude larger than the average cluster size. It’s easy to end up with huge imbalances in the number of users in a treatment group when one of these “whales” is exposed, and we usually recommend that teams either exclude them from their tests or break them down in data post-processing.

Reusing clusters for network analysis: Beyond experimentation, network clusters have been a great source of insights for Meta, giving teams an opportunity to visualize how groups of people use our products every day, and highlighting the ways in which groups tend to grow or shrink over time.

Other kinds of experiments to consider running instead: If you don’t need to maintain a consistent experience for users over the course of your experiment, e.g. if you’re running infrastructure experiments where the treatment is generally invisible to the end users, you might consider randomizing edges to treatment instead of users. For example, randomizing messaging experiments on the chat thread would allow you to ensure all users within a single thread have the same experience, without a need to use clusters at all.

Respecting user privacy: Depending on the sensitivity of a product experience, and user expectations of privacy, it may not be ethical or legal to cluster users based on their past activity, or to use those clusters to target new experiences. At Meta, product teams regularly conduct privacy reviews before new features enter development, which includes an assessment of how and where tests will be run. We also consider whether it’s even necessary to build clusters from user interactions: users of Marketplace, for example, are more likely to connect with someone in their same region than with another random person in the graph, a fact we can use to compare outcomes across geographies.

Why not just vary the treatment group sizes?

You may be wondering if our problem could be solved by simply varying the proportion of users assigned to the test group, i.e. running an experiment where the test and control groups aren’t the same size.

If we allocate 10% of all users to our experiment, and vary the proportion of users allocated to the test group from 0.1 to 0.9, we get the following breakdown of test user interactions:

Notice that when the test group’s proportion is 0.5, i.e. 50% of users are allocated to test and 50% to control, we get the same result as outlined in the first section of this post. As the experiment includes only 10% of users, the likelihood of a test user interacting with a production user is fixed at 90%.
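
Here’s that sweep as a quick sketch, looking at it from the perspective of a single test user’s randomly chosen peer (illustrative only):

```python
# Sketch: keep the experiment at 10% of users, vary the share allocated to test,
# and look at which group a test user's random peer falls into.

def peer_shares(total_allocation: float, test_fraction: float) -> dict:
    p_test = total_allocation * test_fraction
    p_control = total_allocation * (1 - test_fraction)
    return {"test": p_test, "control": p_control, "production": 1 - total_allocation}

for f in (0.1, 0.5, 0.9):
    s = peer_shares(0.10, f)
    print(f"test fraction {f:.1f}: peer in test {s['test']:.0%}, "
          f"control {s['control']:.0%}, production {s['production']:.0%}")
# test fraction 0.1: peer in test 1%, control 9%, production 90%
# test fraction 0.5: peer in test 5%, control 5%, production 90%
# test fraction 0.9: peer in test 9%, control 1%, production 90%
```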

By taking this approach, it’s indeed possible to increase the odds that test users interact more often, while also reducing the odds of test-to-control interactions. However, doing so also means accepting a control group so small that test power may soon become a concern. Also note that this approach, at best, provides only a doubling in test-to-test interactions, while offering no improvement to the isolation between test and production users. And, if you think about it, any varying of the size of the test and control groups that leads to an improvement would benefit cluster experiments as well.
