Estimating the long-run value we give to our users through experiment meta-analysis

Analytics at Meta
Meta Analytics Blog
9 min read · Feb 16, 2022


We are part of the Facebook Feed Analytics team, and we want to share methodologies we've developed at Meta for creating long-term value for the people who use our products and technologies.

The crux of the data science and data engineering work we do at Meta Analytics on the Feed team is estimating how best to rank content for our users so that they get the most value out of our product (an overview of feed ranking here and here). Naturally, people who use our technologies might leave a comment or “like” a piece of content, but is that enough to optimize for? We instead believe in optimizing for long-run value, not for immediate, in-the-moment engagement. This helps ensure that people who use Facebook make connections and have a valuable experience.

But how do we know that seeing a piece of content at a given time delivers more value in the long run to someone who uses Facebook than seeing a different piece of content? Maybe one post is about an event, and once our user finds out about it they'll be happy, attend in a month, and have fun. Another post could be a dog picture that the user “likes” while scrolling News Feed. Which one is more valuable for the user in the long run, and how can we estimate this? One way we try to get at this problem is through surveys, such as directly asking our users whether a post is worth their time or whether an interaction is valuable. However, survey volume is limited and surveys have imperfections of their own, so we also use statistics to estimate what provides people with long-run value.

To use the power of statistics, we turn to A/B testing, which, with thoughtful analysis, can help us learn causal relationships (A/B testing is a standard industry practice). The first step in our analysis is to run a set of experiments (these are carefully thought-through experiments using aggregated data). In one experiment, for instance, some of the people who use Facebook are shown a bit more content from professional publishers; in another, some other people are shown more content from groups; in yet another, more content is shown from friends. When running this methodology, the list of experiments should cover every type of content that exists in your ecosystem, i.e., for every major content type, make sure there is an experiment that increases the distribution of that content type. The best approach is to keep the content definitions broad (e.g., your experiment should increase ‘content from friends’ and not ‘dog pictures from friends’) so that you can cover the content types comprehensively without needing too many experiments. The experiments don't necessarily need to be about what users see in their feeds; they can be different variations of the product experience too.

So we run this set of experiments and measure our outcome of interest (i.e., the outcome capturing user value) in each of them after they have run for a long time. We can run our experiments for multiple months to make sure we capture longer-run effects. We then use these experiments to learn causal relationships between the types of content our users see and how much each type contributes to long-run value. Once we know these relationships, we can easily decide what kind of content we should rank higher for our users.

Once our experiments are finished, we analyze them through an experiment meta-analysis method. Call the outcome metric movement Y₁% in the first experiment (i.e., user value increased by Y₁% in the first experiment), Y₂% in the second one, …, and Yₙ% in the n-th experiment. This outcome is typically some behavioral indicator of user value or user satisfaction. Note that Y₁, Y₂, …, Yₙ are expressed in % terms and not absolute terms (i.e., user value increased by Y₁ percent on average, not by Y₁ units).

Next, in each of the experiments we measure how much the content composition has shifted; composition naturally shifts because that is exactly what our experiments change. If you see 10% more friend content, you will see a bit less of other content as a result, so it's not just one type of composition that shifts but several. In terms of notation, assume the first experiment showed x₁₁% more friend content (where the first subscript 1 stands for the first content type, friend content, and the second 1 stands for the first experiment), x₂₁% more group content, and x₃₁% more other content; the second experiment showed x₁₂%, x₂₂%, x₃₂%; and so on, where naturally some x's could be negative. Now we can run a simple linear regression on the data aggregated up to the % per treatment level:

Yᵢ ~ coeff₁*x₁ᵢ + coeff₂*x₂ᵢ + … + coeffₖ*xₖᵢ,

where each coeffⱼ is estimated as the elasticity of increasing content type j with respect to the outcome variable, and k is the number of content types whose distribution we increase or decrease in the experiments.
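To make this concrete, here is a minimal sketch of what the meta-analysis could look like in Python, assuming the experiment-level results have already been aggregated into a small table with one row per experiment. All names and numbers below are made up for illustration; they are not our production data or code.

```python
import pandas as pd
import statsmodels.api as sm

# One row per long-running experiment; Y and the x's are the measured
# % deltas of test vs. control. All values here are made up for illustration.
experiments = pd.DataFrame({
    "value_outcome_pct":  [0.80, 0.30, -0.20, 0.50, 0.10],   # Y_i
    "friend_content_pct": [10.0, -1.5, -2.0,  4.0, -0.5],    # x_1i
    "group_content_pct":  [-2.0,  8.0, -1.0,  3.0, -1.0],    # x_2i
    "other_content_pct":  [-3.0, -2.5,  6.0, -4.0,  5.0],    # x_3i
})

X = experiments[["friend_content_pct", "group_content_pct", "other_content_pct"]]
y = experiments["value_outcome_pct"]

# No intercept here, on the assumption that a 0% composition shift should
# imply a 0% outcome shift; whether to add one is a modeling choice.
fit = sm.OLS(y, X).fit()
print(fit.params)    # the estimated elasticities coeff_1 .. coeff_k
print(fit.pvalues)   # which elasticities are statistically significant
```

With only five experiments and three content types this is of course just a toy; in practice you want many more experiments than content types, as discussed below.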

In terms of interpretation, if coeffⱼ is statistically significantly positive, then showing more of content type j to users (holding all other content types constant) results in positive user value, so we should show more of this type of content. More generally, any future feed experiment, even one not run long enough to observe its long-run user value, can be evaluated simply by calculating coeff₁*x₁ᵢ + coeff₂*x₂ᵢ + … + coeffₖ*xₖᵢ. If this sum is positive, the experiment would likely lead to user value if we launched it to everyone who uses feed; if it's negative, it would likely lead to a drop in user value.
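Continuing the sketch above, evaluating a new, shorter experiment then amounts to a dot product between its measured composition shifts and the estimated elasticities (again, the numbers are hypothetical):

```python
# Measured composition shifts of a new, shorter experiment (hypothetical).
new_experiment_shift = pd.Series({
    "friend_content_pct":  3.0,
    "group_content_pct":  -1.0,
    "other_content_pct":  -0.5,
})

predicted_value_pct = (fit.params * new_experiment_shift).sum()
print(f"Predicted long-run value impact: {predicted_value_pct:+.2f}%")
# Positive: launching would likely add user value; negative: likely a drop.
```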

Note that at the minimum we need n = k, i.e., we need as many experiments as independent variables (content types), and the more experiments, the better the estimates: so k << n is ideal. To create new experiments to learn from, we can vary the treatment strengths (e.g., in one experiment we increase the distribution of friend content by 10%, in another by 20%), or run them at different times. We need to be careful and thoughtful, though, about which experiments we include in our study so that our interpretation can capture causal effects. To increase the number of experiments and hence statistical power, there is some more advanced methodology the team can use, but in practice we found that a smaller set of clean experiments works well, where every type of content is increased or decreased in at least one of your experiments. To reduce noise in our regressions, we advise estimating as few parameters as possible (i.e., reducing k to the bare minimum). A simple linear regression is a good baseline to anchor on before you turn to more complex versions. If a higher number of experiments is available, more advanced ML methods, including proper cross-validation, can also be employed, but be careful not to rely on many under-powered (e.g., barely statistically significant) experiments, to avoid learning spurious relationships (see the practitioner's appendix for more details).

Note that the above linear model assumes constant elasticities, i.e., if seeing 1% more of some type of content leads to a y% change in the outcome variable, we assume that seeing 2% more leads to a 2*y% change. We've observed this constant elasticity to hold at Facebook and to scale across users of different activity levels, but it should be verified in practice by anyone embarking on a similar project (e.g., by running experiments of different strengths). The interested experimenter can analogously relax the linearity assumption of the regression above, but we have found this assumption to hold well in the Facebook app context too.
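As a hedged illustration, one simple way to check the constant-elasticity assumption is to run the same treatment at two strengths and compare the implied elasticities (the numbers below are hypothetical):

```python
# Same treatment at two strengths; similar Y/x ratios support the
# constant-elasticity assumption. Values are hypothetical.
weak   = {"friend_content_pct": 10.0, "value_outcome_pct": 0.40}
strong = {"friend_content_pct": 20.0, "value_outcome_pct": 0.78}

elasticity_weak   = weak["value_outcome_pct"]   / weak["friend_content_pct"]
elasticity_strong = strong["value_outcome_pct"] / strong["friend_content_pct"]
print(elasticity_weak, elasticity_strong)   # 0.040 vs. 0.039: roughly constant
```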

Also note that we can compare the coefficients: for instance, if coeff₁/coeff₂ = 2, then 1% of content of type 1 is on average roughly 2x as valuable toward outcome Yᵢ as 1% of content of type 2. Notice, however, that this is not in absolute terms. E.g., if the average user has 10x as many posts of type 1 as of type 2, then 1% of type 1 contains 10x as many posts, so on average each piece of content of type 2 is roughly 10/2 = 5 times as valuable as a piece of content of type 1. Translating units into absolutes like this then allows you to derive a weighted metric, expressed in absolute terms, that can be used to evaluate any standard new experiment the team runs. This metric can act as a surrogate of the long-run outcome.
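To make the unit conversion concrete, here is a small sketch of how the % elasticities could be translated into per-post weights and a weighted surrogate metric, continuing the earlier code. `avg_posts_per_user` is an assumed input (the average number of posts of each type a user sees over the measurement window), and all numbers are illustrative:

```python
# Average number of posts of each type the average user sees in the window
# (hypothetical numbers; note type 1 has 10x the volume of type 2).
avg_posts_per_user = pd.Series({
    "friend_content_pct": 200.0,
    "group_content_pct":   20.0,
    "other_content_pct":  100.0,
})

# coeff_j is the value of +1% of type j, i.e. of 0.01 * avg_posts_per_user[j]
# extra posts, so a per-post weight is proportional to coeff_j / volume_j.
per_post_weight = fit.params / avg_posts_per_user

def weighted_surrogate(delta_posts_per_user: pd.Series) -> float:
    """Surrogate long-run value of an experiment, given its absolute change
    in posts seen per user by content type (a hypothetical helper)."""
    return float((per_post_weight * delta_posts_per_user).sum())
```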

Finally, as usual, it's important to validate the methodology the team has derived. If you are trying out the methodology, the fact that you have created a proxy of a long-run outcome, even with high model fit, doesn't mean that directly optimizing for and increasing that proxy will indeed increase the long-run outcome. Our recommendation is to verify the surrogate directly in an experiment wherever possible: run an experiment that increases the distribution of some content type in the ratio predicted by the coefficients, and verify that the weighted surrogate derived in the previous paragraph increases statistically significantly. Afterward, the team can keep this experiment running and verify that the long-run gains in Yᵢ materialize as the metric predicted. A good cadence for rerunning the experiments is once every six months.

Practitioner's appendix

A few questions might arise if you are looking to try this for your project. We’ll try to advise you based on what we think works best:

  • How many independent variables (i.e., x's from above) should you include? Our advice would be to keep the number small in the spirit of sparsity, though try to define them so that they are broad and cover the main channels through which your experiments create user value. The more parameters you estimate, the less robust your coefficients will be, and since you will likely want to retest the estimated relationship periodically (e.g., every 6 months is a good cadence), keeping fewer factors will help keep the maintenance load lower. Also, if you end up adding a lot of uncorrelated components, you will likely increase your error bars, making your proxy very hard to move statistically significantly in a normal-sized experiment.
  • Should I only include statistically significant experiments in my analysis? Our advice in general is not to do this. Were you to do so, you'd introduce bias, since you'd be restricting the dataset for your regression based on the outcome variable. Naturally, under-powered experiments are less useful than highly statistically significant ones. It is worth running the largest experiment with the strongest condition you can choose without hurting user experience, because the more precise your estimates (i.e., really low p-values in your experiments), the less likely you are to run into spurious relationships. In fact, with high-ish p-value experiments (i.e., where your experiment is not stat sig or barely stat sig at 95% confidence), you might start to worry about the user-level correlation between x's and Y, which tends to be positive and can lead to learning non-causal effects in your meta-analysis model.
  • What can I do if I'm only able to run mild, barely statistically significant experiments? Ideally you would have a handful of experiments with comprehensive content types covering your whole ecosystem, which you can learn from and define your proxy on. But if, due to practical constraints, this is not possible, there are some solutions to make sure you are not learning spurious effects. First, you can try to shrink your experiment outcomes based on historical experimentation data, or you can split your experiments into random halves, measuring the x's on one half of treatment vs. control and the Y's on the other half (a minimal sketch of this split-half idea follows this list). A more pragmatic way, though, is to run large experiments with strong treatment conditions (as long as you don't hurt user experience). More advanced methodology is detailed in academic work here.
  • All this is about global averages; how can we personalize? Our methodology as outlined above naturally works on overall averages, but as a more advanced variant we can easily build a more personalized version too, where, for instance, we restrict our analysis to a subset of users, defined by country, region, or an individual-level characteristic such as friend count, and we run our analysis and build our metric for that specific sub-population. We do suggest, though, that any practitioner first start with the global analysis.
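As referenced in the third question above, here is a minimal sketch of the split-half idea for a single experiment, assuming hypothetical user-level DataFrames `test_users` and `control_users` that contain per-user composition and outcome columns:

```python
import numpy as np
import pandas as pd

def split_half_deltas(test_users: pd.DataFrame,
                      control_users: pd.DataFrame,
                      comp_cols: list,
                      outcome_col: str,
                      seed: int = 0):
    """Measure the composition deltas (x's) on one random half of the test
    users and the outcome delta (Y) on the other half, so that user-level
    correlation between x and Y cannot leak into the meta-analysis."""
    rng = np.random.default_rng(seed)
    in_half_a = rng.random(len(test_users)) < 0.5
    half_a, half_b = test_users[in_half_a], test_users[~in_half_a]

    # % composition shifts, measured on half A only
    x = 100 * (half_a[comp_cols].mean() / control_users[comp_cols].mean() - 1)
    # % outcome shift, measured on half B only
    y = 100 * (half_b[outcome_col].mean() / control_users[outcome_col].mean() - 1)
    return x, y
```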

Authors: Akos Lada, Dan Day, Yi Wang, Yuwen Zhang

