How to Measure the Effects of Recommenders

Luke Thorburn, Jonathan Stray, Priyanjana Bengani

Social media platforms and their recommender systems have been claimed to increase political polarization, reduce attention spans, worsen mental health, and to have many other effects. Often there are compelling stories, anecdotes, or correlations that support these claims. For example, The Wall Street Journal recently published quotes from several teenage girls who all attribute eating disorders to content recommendations they received on TikTok. It’s hard to deny such personal testimony, but what is the scale and impact of this problem? Evidence that would support wider, more generalizable claims of causality is very hard to come by. For most of the harms above, the extent to which they are caused by algorithmic effects remains a difficult open question.

In part, this ongoing uncertainty is due to the limited access that researchers have to platform data and experimentation. This constrains the types of studies that can be conducted and, hence, the quality of evidence available. Consequently the findings that do get published are usually non-causal, and can be inconclusive, inconsistent, or in the worst cases, simply false.

The risk of low-quality evidence is that we waste time and effort worrying about non-issues, at the expense of those that are real. This is not an abstract worry. Concerns about the effects of screen time on youth, based mostly on observational studies, are now thought to have been largely overblown. Concerns that access to search engines causes our memories to deteriorate were prompted by a single 2011 paper studying 28 Ivy League undergraduate students that later failed to replicate. Concerns about a “backfire effect” (in which fact checks cause people to believe false information more strongly) were prompted by a single 2010 study that used mock newspaper articles and had under 200 participants. Subsequent studies have failed to observe the effect, even under “theoretically favorable conditions.” There have been similar panics that TV causes aggression or that violent video games lead to real-world crime, but these concerns are generally not supported by the evidence.

There are genuine questions that are really important, but there’s a kind of opportunity cost that is missed here. There’s so much focus on sweeping claims that aren’t actionable, or unfounded claims we can contradict with data, that are crowding out the harms we can demonstrate, and the things we can test, that could make social media better. … We’re years into this, and we’re still having an uninformed conversation about social media.

Brendan Nyhan, quoted in How Harmful Is Social Media? (2022)

But we know how to collect better evidence. In this post we step through the methodological options available to researchers at three levels: (1) without access to platforms, (2) with access to platform data, and (3) with the ability to conduct on-platform experiments. For each level, we describe the types of studies available to researchers, and the limits of what they can teach us. Ultimately, we argue that researchers need to be able to conduct experiments natively on social media platforms if we are to understand the causal effects of recommender systems on society.

A summary of the study types available to researchers at different levels of platform access, along with their major limitations.

Without access

Without access to platforms, researchers have two broad study types available to them: simulations and off-platform experiments.

In simulation studies, researchers assume a formal model of user behavior and use it to explore what outcomes we can expect across multiple counterfactual scenarios. Simulations are frequently used to investigate the social dynamics of polarization and filter bubbles, as well as phenomena like information cascades and the effectiveness of platform interventions such as user bans or “inoculation” against misinformation. By using simulations, researchers have perfect control over which scenarios they are comparing, and can compare variables (such as the true internal opinions of each person) that would not be measurable in the real world.
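As a minimal sketch of what such a simulation involves, the Python below implements a bounded-confidence opinion model, a common toy model in the opinion dynamics literature. It is included purely for illustration and is not the model used in any particular study mentioned here. Every name and parameter value (`simulate_opinions`, `tolerance`, `rate`) is an assumption. Agents hold an opinion between 0 and 1, and two randomly chosen agents move toward each other only if their opinions are already closer than `tolerance`.

```python
import random


def simulate_opinions(n_agents=200, steps=20000, tolerance=0.2, rate=0.5, seed=0):
    """Toy bounded-confidence model (hypothetical parameters throughout)."""
    rng = random.Random(seed)
    # Each agent starts with a "true internal opinion" drawn uniformly from [0, 1).
    opinions = [rng.random() for _ in range(n_agents)]
    for _ in range(steps):
        i, j = rng.sample(range(n_agents), 2)  # pick two distinct agents
        gap = opinions[j] - opinions[i]
        if abs(gap) < tolerance:
            # Both agents move toward each other; rate=0.5 means they meet halfway.
            opinions[i] += rate * gap
            opinions[j] -= rate * gap
    return opinions


def opinion_spread(opinions):
    """Distance between the most extreme opinions left in the population."""
    return max(opinions) - min(opinions)


# Two counterfactual scenarios that differ only in how open-minded agents are.
open_minded = simulate_opinions(tolerance=1.0)
closed_minded = simulate_opinions(tolerance=0.1)
print(opinion_spread(open_minded), opinion_spread(closed_minded))
```

In this sketch the open-minded population should collapse to essentially one shared opinion, while the closed-minded population stays spread across the opinion range. The comparison is perfectly controlled, but it only tells us what follows from the assumed behavioral rule, not whether real users follow that rule.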

The other option available to researchers without platform access is off-platform experiments. One variant is a deprivation study, where participants are incentivized not to use social media. A good example is this experiment in which participants were paid to stop using Facebook for four weeks in the lead-up to the 2018 US midterm election, and were found to have less factual news knowledge but also to be less polarized and to feel better. Other strategies that don’t require access to the platform include asking people to change their behavior, such as by subscribing to a news source from the other side of politics, or building toy recommenders (artificial, self-contained social media platforms in which to conduct research). One of the more ambitious projects of this sort aims to build an actual news delivery app to provide researchers with a research environment.

Limited Ecological Validity

All simulation studies, and many off-platform experiments, have limited ecological validity: their findings may not reflect the real world. All simulations depend on simple models of human behavior, and we cannot be certain that these models accurately reflect what real people would do. At most, simulations can tell us what might be, not what is. Similarly, most off-platform experiments are either not conducted in a platform-like environment, or are conducted in artificial environments that differ in crucial ways from real-world social media platforms. One review found that

…only a small proportion of the studies … about 18.1% … attempted to incorporate (parts of) the experience of social media use in their studies. In some of these cases, study design elements were less than ideal from an ecological validity point of view. For example, in some studies, participants were overtly restricted in the ways they could behave, for instance, by prohibiting participants to share posts on Facebook … Other studies clearly manipulated expectations, for instance, by telling participants to expect comments on their posts from coparticipants (thus rendering the fact that participants felt bad when these comments remained absent not particularly surprising)

To generate findings that have a stronger case for ecological validity, researchers need access to platform data.

With platform data

With access to platform data researchers can do all of the above, plus conduct observational studies.

Observational studies involve statistical analysis of real-world platform data, but do not involve any intervention or experimentation on the part of the researchers. If the relevant data is accessible, observational studies can answer descriptive questions like “How common are diet-related posts on Instagram?” or “Are conspiracy videos on YouTube becoming more popular?”. They can also reveal correlations, such as that certain aspects of Instagram use correlate with poor mental health, or that the number of moral-emotional words in Tweets correlates with how far they spread (the so-called “moral contagion” phenomenon).

Correlation is not causation though, because of the risk of confounders — variables that affect both the “treatment” and the “outcome”. Does social media use cause depression, or does depression cause social media use, or does some third confounding variable (say, other life events) cause both? Correlative studies cannot tell us if recommender systems are causing harm, or merely reflecting harm that would have existed anyway. Further, there is a risk of spurious correlations, as demonstrated in the debate over the moral contagion phenomenon: one analysis suggested that the number of moral-emotional words in Tweets was no more predictive of spread than the number of Xs, Ys and Zs.
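To make the confounding problem concrete, here is a small sketch using entirely synthetic data. All numbers and names (`stressed`, `hours`, `score`) are invented for illustration. In this made-up world, stressful life events raise both social media use and depression scores, while social media use has no effect on depression at all.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient, written out to avoid dependencies."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_users(n=5000, seed=1):
    """Synthetic users: stress drives both usage and depression scores."""
    rng = random.Random(seed)
    users = []
    for _ in range(n):
        stressed = 1 if rng.random() < 0.5 else 0      # the confounder
        hours = rng.gauss(2.0 + 2.0 * stressed, 1.0)   # daily social media use
        score = rng.gauss(10.0 + 5.0 * stressed, 2.0)  # depression score; ignores hours
        users.append((stressed, hours, score))
    return users


users = simulate_users()
overall = pearson([h for _, h, _ in users], [d for _, _, d in users])
within_stressed = pearson(
    [h for s, h, _ in users if s == 1], [d for s, _, d in users if s == 1]
)
within_unstressed = pearson(
    [h for s, h, _ in users if s == 0], [d for s, _, d in users if s == 0]
)
print(overall, within_stressed, within_unstressed)
```

Across all users the correlation between use and depression is clearly positive, yet within each stress group it is close to zero. An analyst who could not observe stress would see only the first number.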

In some cases, it is possible to detect causal relationships in observational datasets, even though no experiment has been conducted. For example, say we are interested in the effect that a change in the recommender algorithm had on depression rates. We would like to estimate this by looking at the difference in depression rates among users immediately before and after the new algorithm was introduced, but something else could have happened at the same time — perhaps some depressing news event. To account for this, we can compare the change in depression rates for folks who used the product versus those who didn’t. This is called the differences-in-differences method.
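The arithmetic behind this comparison is short enough to write out. The sketch below uses invented depression scores for four hypothetical groups. The function name `diff_in_diff` and all the numbers are assumptions for illustration. The estimate is only credible if, without the algorithm change, users and non-users would have followed parallel trends.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change among the treated group minus change among the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical depression scores, before and after the algorithm change.
users_before = [10, 12, 11, 13]
users_after = [14, 16, 15, 17]
nonusers_before = [9, 10, 11, 10]
nonusers_after = [12, 13, 14, 13]

naive_change = mean(users_after) - mean(users_before)
estimate = diff_in_diff(users_before, users_after, nonusers_before, nonusers_after)
print(naive_change, estimate)
```

With these made-up numbers, scores among users rise by 4.0 points, but scores among non-users also rise by 3.0 points, presumably because of the shared news event. The differences-in-differences estimate attributes only the remaining 1.0 point to the algorithm change.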

Other methods for learning from such “natural experiments” include regression discontinuity methods and instrumental variable designs. If all possible confounders are known and measurable, then “controlling” for them by using a subclassification, matching, or propensity score method, or by including them as covariates in a model, can be sufficient to identify causal effects. If some confounders are unknown or unmeasurable (as is usually the case), then it may be possible to create a simulated control group (a synthetic control) and estimate causal effects from longitudinal data (specifically, panel data). Several studies have investigated the effects of digital media using these techniques.
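As a rough sketch of subclassification, the simplest of these adjustment methods, the code below builds a synthetic dataset in which a single measured confounder (`stress`) affects both heavy social media use and the outcome. The data-generating numbers, including the true effect of 1.0, are invented for illustration. The method compares heavy and light users separately within each stress level, then averages those comparisons weighted by the size of each level.

```python
import random
from statistics import mean


def simulate(n=20000, seed=2):
    """Synthetic rows of (stress, heavy_use, outcome); true effect of use is 1.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        stress = 1 if rng.random() < 0.5 else 0
        # Stressed people are far more likely to be heavy users.
        heavy_use = 1 if rng.random() < (0.8 if stress else 0.2) else 0
        outcome = rng.gauss(10.0 + 5.0 * stress + 1.0 * heavy_use, 2.0)
        rows.append((stress, heavy_use, outcome))
    return rows


def naive_difference(rows):
    """Heavy users minus light users, ignoring the confounder."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def subclassified_difference(rows):
    """Weighted average of the within-stratum differences."""
    estimate = 0.0
    for level in {s for s, _, _ in rows}:
        stratum = [(t, y) for s, t, y in rows if s == level]
        treated = [y for t, y in stratum if t == 1]
        control = [y for t, y in stratum if t == 0]
        weight = len(stratum) / len(rows)
        estimate += weight * (mean(treated) - mean(control))
    return estimate


rows = simulate()
print(naive_difference(rows), subclassified_difference(rows))
```

In this synthetic setup the naive comparison is roughly four times too large, while the subclassified estimate lands near the true effect. That only works because the one confounder was measured, which is the assumption that rarely holds in practice.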

Limited Data Scope & Quality

In practice, the data available for observational studies differs considerably depending on the platform. Some data just can’t be accessed, some data is only accessible ephemerally, and some data is — out of necessity — self-reported rather than objectively measured. For example, one review found 81.9% of studies into social media use relied on self-reported data. But we know that self-reported use of social media generally does not reflect actual use. Some self reporting methods are better than others: three that have recently been advocated for are stimulated recall, experience sampling, and diary studies. Still, none are as accurate as the objectively measured behavioral data that platforms collect.

Platform data that tracks individuals over time is rare but especially important because it gives researchers the ability to correlate on-platform behavior with off-platform outcomes, which might be measured using surveys.

In a complex and rapidly changing world, social science needs as many time series as possible. As someone who has spent a lot of time studying Twitter, there is a glaring hole at the center of this literature. … I would trade *almost all of the research ever published about Twitter* for a high-quality representative panel survey of Twitter users with trace data from their accounts matched with their survey responses.

— Kevin Munger, In Favor of Quantitative Description (2020)

The best example we currently have of a study linking on-platform behavior with other data was conducted internally by Instagram and leaked among the Facebook papers in September 2021. This was not the widely covered “one in three girls” focus group study, which isn’t much help for inferring cause, but a different study. It surveyed 100,000 Instagram users across 9 countries on the degree to which Instagram contributes to “social comparison”: how viewing other people’s posts on Instagram makes people feel about themselves, whether positive (inspired) or negative (demoralized). These survey responses were linked with individual-level platform data about what posts each user had seen. This allowed measurements of how social comparison is correlated with factors including different categories of post, the presence of like counts, celebrity content and the use of filters. There was only one controlled experiment (which involved hiding like counts), but even mere correlations help with formulating and testing causal hypotheses. Giving independent researchers the ability to connect on- and off-platform data in this way would be a significant step towards better understanding the effects of recommenders.

There are other limitations on what granting researchers access to platform data can achieve. There need to be appropriate privacy constraints on what data anyone (both platform employees and external researchers) can look at. In particular, the content of private messages will in most cases be off limits. There is also some data that no one (including platforms) has access to. For example, evaluating the accuracy of statements (whether they are true or false, or somewhere in between) is very labor intensive, and difficult to do at scale. Studies that require truth assessments for large numbers of items are likely impractical.

Limited Causal Inference

Above, we mentioned several strategies that can be used to estimate causal effects from observational data. However, this evidence is often weaker, or rests on more assumptions, than evidence from experimental data.

For example, say we want to know whether diet videos on TikTok are harming the mental health of teenage girls, and have access to all the (observational) platform data that exists. This dataset probably doesn’t include all variables that affect mental health, so we can’t rely on subclassification, matching, or propensity scoring. The effect of these unobserved variables is likely to vary with time, so we can’t rely on longitudinal panel data. Natural experiments arise less frequently on platforms like TikTok because the algorithm is pretty much the same across jurisdictions, and is very responsive to user behavior, so it can be difficult to find so-called “exogenous” sources of variation that are not influenced by the choices of users.

Thus, to assess causality, we really need to conduct an experiment: that is, to randomize the degree to which TikTok users are exposed to diet videos, and measure how this affects their mental health. Such on-platform experiments are required to answer many of the most important questions we have about the effects of recommenders.
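The reason randomization helps can be shown with a short synthetic sketch. Everything in it is hypothetical, including the effect size and the `stress` variable. Exposure is assigned by coin flip, so it is independent of every other cause of the outcome, including causes the analyst never observes. A plain difference in means then estimates the causal effect.

```python
import random
from statistics import mean, variance


def run_experiment(n=20000, true_effect=1.0, seed=3):
    """Simulated randomized experiment with an unobserved cause of the outcome."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        stress = 1 if rng.random() < 0.5 else 0  # never seen by the analyst
        exposed = rng.random() < 0.5             # coin flip, independent of stress
        outcome = rng.gauss(
            10.0 + 5.0 * stress + (true_effect if exposed else 0.0), 2.0
        )
        (treated if exposed else control).append(outcome)
    effect = mean(treated) - mean(control)
    # Standard error of a difference between two independent sample means.
    std_error = (
        variance(treated) / len(treated) + variance(control) / len(control)
    ) ** 0.5
    return effect, std_error


effect, std_error = run_experiment()
print(effect, std_error)
```

No adjustment for `stress` is needed here, because the coin flip balances it across the two groups on average. In a real study the hard parts are the ones this sketch skips: measuring mental health well, and deciding whether the randomization is ethical.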

With on-platform experiments

At this level of access, researchers could do all of the above, plus run on-platform experiments capable of detecting many causal relationships that are of interest.

For example, researchers could assign users to different recommender algorithms and assess the degree to which each causes or mitigates affective political polarization. They could randomly show different users various political ads or interventions like fact checks, and see whether they cause changes in future media consumption patterns. As in the Instagram study, they could investigate what types of content cause improvements or deteriorations in the mental health of users. Experiments are the gold standard for accumulating causal knowledge, and it is important that at least some independent researchers have input into what on-platform experiments are run, without being subject to veto by commercially motivated platforms.

Limited Ability to Generalize

That said, even with the cooperation of platforms, experiments may not always be appropriate. Well-designed large-scale experiments, or randomized controlled trials (RCTs), can provide causal information, but are not invulnerable to confounders and must be designed and interpreted carefully. Compared to observational studies, they are also relatively costly, and the results can be highly context specific. Taking the results of a well-designed study of Twitter in the US and assuming they hold for Facebook in India would be like “designing and executing a moon landing … then sending the same ship to Mars with triple the fuel and assuming things will work out.”

In practice, this means that results from one platform may not generalize to others, that results from one jurisdiction may not generalize to other places, and that results from one point in time may not hold in future — “the internet changes everything, very quickly.” The cost of running experiments may not be justified if the resulting knowledge is brittle.

Quantitative description is cheap, and much of the cost is fixed. In contrast, causal knowledge is expensive and much of the cost is marginal. The marginal cost of updating [a political] database and the [some modeled] scores for each session of Congress is much lower than the fixed cost of creating those models in the first place. In contrast, the marginal cost of re-running a Twitter RCT every time Twitter’s userbase or platform policies change is very high.

— Kevin Munger, In Favor of Quantitative Description (2020)

Limited Isolation of Cause/Effect

The social media context presents some additional subtle issues that complicate experimental design and data interpretation.

One is the question of which component of a social media ecosystem is responsible for a particular harm. If we were able to prove that social media use increases the risk of radicalization, how could we tell if that harm is due to the recommendation algorithms, the user interface, or the behavior of other users? It is very difficult to manipulate these variables in isolation. When you change the algorithm or the interface, user behavior also changes, often strategically, in ways that are not anticipated. An A/B test might capture the immediate effects, but not the long-term changes to “equilibrium” behavior that would be observable after, say, 6 months.

Network effects also complicate experiments. Say a platform wants to evaluate the impact of upranking comment threads with long back-and-forth conversations. You might be one of the users selected to be part of the trial, but most of your friends won’t be. The new conversations you are involved in may peter out because your friends are less likely to be prompted to continue them. This dampens the observed effect of the new ranking algorithm, so the trial underestimates the impact it would have if rolled out to everyone. Such second-order network effects are difficult to avoid without changing things for everyone in an interconnected social network, which is one reason why platforms often roll out changes one country at a time.
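A toy simulation can illustrate the size of this bias. The model below is an assumption made for illustration, not a description of any real platform. Each user has a random set of friends, and a treated user benefits from the new ranking only in proportion to the share of their friends who are also treated. The value `full_effect` is the hypothetical benefit if everyone were treated.

```python
import random


def ab_test_estimate(share_treated, n_users=5000, n_friends=10,
                     full_effect=2.0, seed=4):
    """Treated-minus-control difference in a trial with network spillovers."""
    rng = random.Random(seed)
    treated = [rng.random() < share_treated for _ in range(n_users)]
    treated_outcomes, control_outcomes = [], []
    for user in range(n_users):
        # Random friend list; it may occasionally include the user themselves,
        # which does not matter for this illustration.
        friends = rng.sample(range(n_users), n_friends)
        share_friends_treated = sum(treated[f] for f in friends) / n_friends
        outcome = rng.gauss(5.0, 1.0)  # baseline conversation activity
        if treated[user]:
            # Assumed rule: the benefit scales with how many friends are treated.
            outcome += full_effect * share_friends_treated
        (treated_outcomes if treated[user] else control_outcomes).append(outcome)
    return (sum(treated_outcomes) / len(treated_outcomes)
            - sum(control_outcomes) / len(control_outcomes))


small_trial = ab_test_estimate(share_treated=0.1)
large_trial = ab_test_estimate(share_treated=0.9)
print(small_trial, large_trial)
```

Under this assumed rule, a trial covering 10% of users measures only a small fraction of `full_effect`, while a trial covering 90% gets much closer. This is the intuition behind rolling out changes to a whole country at once, where most of a user's friends receive the same treatment.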

Limited by Ethics Considerations

There are also ethical considerations when conducting experiments with human research subjects. Online platforms routinely conduct A/B tests with no external oversight, but academic studies are subject to ethics review by university ethics committees, which may prevent some large scale studies from being conducted if there is a reasonable expectation that they may disadvantage or harm some users. Based on such concerns, one study found that “people often approve of untested policies or treatments (A or B) being universally implemented but disapprove of randomized experiments (A/B tests) to determine which of those policies or treatments is superior” — even if conducting the experiment would help make everyone better off in the long run.

The current scarcity of causal evidence does not mean we shouldn’t take action to mitigate the risks that platforms cause harms. There is a history of industries, from tobacco to fossil fuels, using uncertainty to deflect responsibility and avoid taking action. The precautionary principle applies: in many cases it is probably reasonable to make policy changes based on correlational studies while still evaluating causality. Yet we don’t really understand if and how recommender algorithms harm teen mental health, increase the prevalence of false beliefs, or exacerbate polarization. If we are to understand the effects of recommenders then — be it through platform-initiated transparency measures or regulation — researchers will need access to platform data, and the ability to run on-platform experiments.

Luke Thorburn was supported in part by UK Research and Innovation [grant number EP/S023356/1], in the UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence (safeandtrustedai.org), King’s College London.

Understanding Recommenders is a research-driven effort to demystify recommender systems and their impact on society. A project of the Center for Human-Compatible AI at the University of California, Berkeley.
