Experiments are the Best Kind of Transparency

Luke Thorburn, Jonathan Stray, Priyanjana Bengani
Understanding Recommenders · Jun 9, 2024

Many of the big questions that regulators of online platforms care about are questions of causality. Do current recommender systems harm mental health? Do they make politics more divisive? Do they undermine journalism? If so, by how much? We can get some insight into these questions from outside platforms, but the most reliable answers come from experiments conducted on-platform, with real content, real interfaces, real users, and real patterns of use.

For now, only platform employees (and their occasional academic collaborators) have the ability to conduct such experiments, and the results of experiments that are conducted — hundreds every week, for large platforms — are rarely made public. This lack of access to strong causal evidence about the impacts of platform design decisions makes it difficult for academics to conduct public interest research, and difficult for policymakers to design effective, evidence-based regulation.

In this piece we argue that for those who play a role in governing recommenders, the most useful kind of transparency is experiments, because they provide the strongest evidence for (1) whether intervention is warranted and (2) which interventions will be effective. They also (3) level up policy debates and (4) facilitate the regulation of outcomes in a technology-agnostic way. We review how non-experimental forms of transparency fall short, assess the limitations of current proposals (both under Europe’s Digital Services Act, which is already law, and the Platform Accountability and Transparency Act that is stalled in the US Senate) with respect to experimental access, and sketch what a regulatory framework for allowing experimentation could look like.

Why experiments?

Causal evidence is important

Causal evidence is important for regulators, first, because it can tell us whether intervention is justified. There are widespread concerns that recommenders are implicated in political polarization, poor mental health, the decline of local news, and many other problems. But if recommenders are not significantly contributing to such problems — and if building recommenders differently would not help — why intervene? Attempts to regulate in the absence of causal evidence risk placing unjustifiable state-enforced constraints on speech and other human activities, and spending valuable resources (including the time and attention of regulators) in ways that ultimately don’t improve the outcomes we care about.

Second, causal evidence is important because it can tell us which specific alternatives — which alternative algorithms, which evaluation metrics, which changes to the user interface — would be better. As a regulatory domain, online platforms are particularly amenable to evidence collection. All aspects of their design can be experimentally manipulated, and most established platforms will already have infrastructure and processes in place to perform such A/B tests. In principle, this makes it possible to test the impact of interventions and write evidence-based design standards for social platforms in a way that is not possible in other domains, and can help avoid encouraging alternatives that don’t actually address the problem.
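To make this concrete, here is a minimal sketch, in Python, of how the result of a simple two-arm on-platform experiment might be analyzed. The metric and all numbers are hypothetical and simulated; real platform experiments involve much more careful design (power analysis, cluster randomization, multiple outcomes), but the core logic is just a comparison between randomized groups.

```python
import numpy as np
from scipy import stats

# Hypothetical data: a wellbeing survey score from users randomly assigned to
# the current ranking algorithm (control) or a modified ranker (treatment).
rng = np.random.default_rng(0)
control = rng.normal(loc=6.0, scale=2.0, size=5000)    # current ranker
treatment = rng.normal(loc=6.1, scale=2.0, size=5000)  # modified ranker

# Estimated causal effect: difference in mean outcome between the two arms.
effect = treatment.mean() - control.mean()

# Two-sample test and a rough 95% confidence interval for the difference.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
se = np.sqrt(treatment.var(ddof=1) / len(treatment)
             + control.var(ddof=1) / len(control))
ci = (effect - 1.96 * se, effect + 1.96 * se)

print(f"estimated effect: {effect:.3f}, "
      f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f}), p = {p_value:.3f}")
```

Because assignment is randomized, the difference in means is an unbiased estimate of the causal effect of the design change on that metric — exactly the kind of quantity regulators would want to see, and exactly what is hard to obtain from outside a platform.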

Third, causal evidence enables better policy discourse. For example, consider the ongoing and deeply felt debates over how social media impacts the mental health of young people. There is some causal evidence available, but it is often not as definitive or granular as could be provided by on-platform experiments (see, e.g., Instagram’s Project Daisy), and this leaves space for interpretation. Some are concerned that we fail young people by unnecessarily waiting for stronger evidence, and others are equally concerned that we fail young people by mistakenly or simplistically concluding that social media is the primary driver of declines in mental health. Access to stronger causal evidence could help resolve contentious claims, and shift the mainstream discussion from “whether or not to blame social media” to the details of the causal dynamics at play.

Fourth, causal evidence is important because it would allow regulators to address the impacts of design choices, and not the design choices themselves. While technology-specific design standards provide helpful guidance, it can be difficult to write laws that dictate what people can build in a way that (a) adapts to technological innovation and (b) avoids banning harmless uses of the same technology. For these reasons, it might be better to regulate outcomes. But to do this, regulators need regular, reliable access to evidence about the causal impacts of platform design choices on a standardized set of outcomes.

To answer causal questions, you need experimental access

There are ways to get some causal evidence without running on-platform experiments, but they all have significant limitations. Approaches that academics and civil society researchers commonly use are:

  • Mock platforms — Sometimes researchers build their own model social media apps and pay people to use them. This allows the researcher to get evidence about how design choices on this mock platform affect users. But there will always be question marks over the extent to which behavior on the mock platform reflects behavior on real apps, especially because niche platforms don’t experience the network effects of major players.
  • Browser extensions — Sometimes researchers pay people to install a browser extension that changes how a platform works and collects data on how the user responds to this change. But this method can’t be used for mobile apps, which by some estimates account for more than 80% of social media use, and the possible interventions are limited by the data available, the defensive measures many platforms have implemented to make scraping difficult, and the need to implement changes without substantial delays during page load.
  • User controls — Sometimes platforms give users control over their recommender. For example, X (formerly Twitter) allows users to switch between an engagement-based feed and chronological feed, and this makes it possible to compare the content that is surfaced by each algorithm. But this approach can only be used to study alternatives that are already made available by a platform.
  • Observational data — In some situations it is possible to extract causal evidence from observational data, using statistical methods that exploit natural experiments or carefully simulate a control group where one didn’t exist (a minimal sketch of one such method follows this list). These methods have been used in the social media context, but they are limited to questions for which the data has a particular structure, and the answers they provide are never quite as convincing as those from an actual experiment: the ways these methods account for confounding variables are only proxies for the true randomization an experiment would provide.
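By way of illustration, here is a minimal sketch (with made-up numbers) of one common observational technique: a difference-in-differences estimate around a “natural experiment,” such as a feed change being rolled out in one region but not another. The regions, metric, and values below are hypothetical. The estimate is only credible under strong assumptions — most importantly, that the two groups would have followed parallel trends absent the change — which is precisely why it is weaker than on-platform randomization.

```python
import pandas as pd

# Hypothetical panel: average daily minutes of use per user, before and after
# a platform rolled out a feed change in region A but not region B.
df = pd.DataFrame({
    "region":  ["A", "A", "B", "B"],
    "period":  ["before", "after", "before", "after"],
    "minutes": [42.0, 47.0, 40.0, 41.5],
})

means = df.set_index(["region", "period"])["minutes"]

# Change over time in the region that got the feed change...
change_treated = means["A", "after"] - means["A", "before"]
# ...minus the change over the same period in the comparison region, which
# stands in for what would have happened to region A without the change.
change_control = means["B", "after"] - means["B", "before"]

did_estimate = change_treated - change_control
print(f"difference-in-differences estimate: {did_estimate:.1f} minutes/day")
```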

Access to data alone is not very helpful for understanding causality. To get reliable information about what causes what, researchers really need access to run on-platform experiments.

We don’t yet have experimental access

Currently, there are no reliable ways for academics, regulators, or civil society researchers to run experiments on platforms.

There are some examples of platforms voluntarily collaborating with academics to perform on-platform experiments. Such collaborations are valuable, but infrequent. They are built on serendipitous relationships between academics and platform employees, often fall through because they are not a priority for platforms, and fundamentally leave the power with platforms to decide which experiments they will let researchers run. This has been called “independence by permission,” which is not true independence.

Are there legal avenues to require platforms to provide experimental access? In the EU, the Digital Services Act (DSA) has two related provisions. Article 37 requires platforms to subject themselves to annual independent audits. These audits, according to additional guidance provided by the EU, may require the auditing organization to perform “tests,” the stated definition for which includes “experiments.” While the scope of these experiments is limited to assessing compliance with the DSA, the definition of “systemic risks” (for which the DSA requires risk assessment and mitigation) is very broad, and arguably includes most of the experiments that would be of interest. That said, the extent to which the word “experiments” in the delegated act is interpreted to include actual randomized controlled trials, and the extent to which auditors make use of them, remains to be seen. (We encourage those involved in the implementation of the DSA to push for experimental access via this route.)

Article 40 of the DSA requires platforms to make data available for third-party researchers, and to some extent it remains to be seen exactly what forms of data access this will enable. But it seems unlikely that Article 40 would allow researchers to request data that a platform would need to conduct an experiment in order to generate.

In the US, the proposed Platform Accountability and Transparency Act (PATA) would, if enacted, require platforms to cooperate with any “qualified research project” that has been approved by the National Science Foundation via a newly-established review process. However, there is no mention of experimentation in the draft text, and passing the bill does not appear to be a priority for Congress in the near term.

What about past experiments?

While there isn’t yet a reliable route to conduct on-platform experiments, the data access provisions under Article 40 of the DSA probably do allow for compelled release of the results from experiments that a platform has already run. Release of such experiment results is also required by the Prohibiting Social Media Manipulation Act, which recently passed in Minnesota and takes effect on July 1, 2025.

This is potentially a rich source of insight. Large platforms run hundreds of experiments every week, and seeing the results of these could significantly contribute to policy debates. For example, the documents leaked by Facebook whistleblower Frances Haugen mention experiments testing the impact of engagement-based ranking relative to a chronological feed, and the impact of removing most platform integrity measures. There are very few public experiments of comparable scale or validity to inform debates over recommender design.

Whether the Minnesota law survives legal challenges, and the extent to which Article 40 will be used in practice to access such results, however, remains to be seen. It’s difficult for researchers to know what data to ask for without knowing which experiments have been run. Further, there is a problem of perverse incentives: without additional safeguards, compelling the disclosure of experimental results might discourage platforms from evaluating their impact on society, out of fear that the results of such experiments would become public. For these reasons, at least some power to initiate experiments should lie outside platforms.

Implementation

So how could regulation facilitate third-party experiments?

Sketch of a law

A new law could look similar to PATA — which would establish a formal approval process for qualified researchers to access platform data — but with an explicit focus on experimental access. It would need to answer a number of key questions, including:

  • Who decides which experiments are run? — This could be an existing scientific body (e.g., the NSF). Proposals for experiments would also need to undergo ordinary academic ethics review processes, which usually require participants to explicitly consent to participate in research. This differs from the way platforms normally conduct in-house experiments, where they rely on the blanket permissions granted in platform terms of service.
  • Who pays? — This could be existing academic funders, the platforms, or some combination thereof.
  • Who does the work of actually conducting the experiment? — This is likely to be some combination of external researchers (who are framing the research) and platform employees (who can directly work with the internal systems of a platform).

Challenges

Any legal requirement for experimental access would need to responsibly navigate concerns around privacy and research ethics, but these are solvable problems. There are standard ethics review processes in place at most academic institutions, and there are many concrete ways to mitigate privacy risks, both technical (e.g., differential privacy) and procedural (e.g., remote query execution).
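As one illustration of the technical side, here is a minimal sketch of the standard Laplace mechanism from differential privacy, which could be applied to aggregate experimental results (for instance, a count of users in one arm who reported a harm) before release to external researchers. The query, count, and epsilon value are hypothetical choices for illustration, not a recommendation.

```python
import numpy as np

def dp_count(true_count: int, epsilon: float, rng: np.random.Generator) -> float:
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    Adding or removing a single user changes a count by at most 1 (the
    sensitivity), so noise drawn from Laplace(scale = 1 / epsilon) is enough
    to mask any individual's contribution to the released statistic.
    """
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

rng = np.random.default_rng(42)
# Hypothetical aggregate from an experiment: number of treatment-arm users
# who reported a negative experience in a follow-up survey.
print(dp_count(true_count=1_340, epsilon=1.0, rng=rng))
```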

A more fundamental challenge relates to free speech. In the US, compelling a platform to change the order in which it ranks content for some users would potentially violate the First Amendment, even if such a change were in the service of a temporary, public-interest experiment designed by academics without government influence. Platforms have already challenged less-interventional transparency measures on constitutional grounds, with some success. The US legal system can enforce private contracts, so if a platform is challenged in court for causing an outcome it previously told users it wouldn’t cause, it might be in its self-interest to produce experimental evidence on the matter. But in general, direct legal mandates for experimental access might only be possible in other jurisdictions.

Conclusion

There are many different kinds of transparency for recommenders, which all serve different needs, so when calling for transparency it’s important to be specific about the intended audience and how the type of transparency requested will be helpful for them. For regulators and all those who play a role in governing recommenders, the most useful kind of transparency is the ability to run experiments on platforms, because experiments provide strong causal evidence to clarify whether intervention is justified, and if so, which specific alternative algorithmic design choices would be better. Here, we have focused on the context of recommender systems, but the same basic point is also true of generative AI.

Thank you to Nathaniel Persily for helpful guidance on the scope of PATA and the DSA. Any errors remain those of the authors.

Luke Thorburn was supported in part by UK Research and Innovation [grant number EP/S023356/1], in the UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence (safeandtrustedai.org), King’s College London.
