The Obligation To Experiment

Tech companies should test the effects of their products on our safety and civil liberties. We should also test them ourselves.

In the 1980s, controlled impact studies by NASA and the FAA tested ideas for airline passenger safety

(Allan Ko, Merry Mou, and J. Nathan Matias all contributed equally)

When Instagram banned hashtags promoting eating disorders in April 2012, it seemed like a victory for civil society and public health. Two years after its creation, the photo-sharing site was being used by peer support communities that encouraged anorexia and self-harm as a “lifestyle choice.” When journalists and a major UK charity drew public attention to the problem, Instagram took action, trying to disrupt these communities by making them unsearchable. But the ban may have made the problem worse. According to a study by researchers at Georgia Tech four years later, Instagram’s actions drove the conversation underground and may sometimes have increased it. Communities that evaded Instagram’s intervention received 15% more comments and 30% more likes after the ban.

In “The Visible Hand” Jennifer Doleac and Luke Stein found that buyers trusted sellers less and paid less if the product included a glimpse of black skin

The history of social technology is littered with good ideas that failed for years before anyone noticed. Wikipedia suffered a six-year decline in participation before discovering evidence that its vandalism detection systems were turning people away. It took fifteen years after the creation of sites like Craigslist before researchers documented widespread racial discrimination in online classifieds. And downvoting systems were a standard design feature of online platforms for seventeen years before researchers observed that each downvote makes people behave worse, at least in political news discussions. Those are only the stories we know; many important questions about the outcomes of software for fairness and wellbeing have never been answered in public.

The paradox of Instagram’s non-success is that online platforms are leaders in large-scale social experimentation. In 2013, Microsoft researcher Ron Kohavi estimated that the Bing search engine was conducting nearly 300 experiments per day. Dillon Reisman’s Pessimizely project shows how news sites are often conducting several experiments at once every time we view their sites. These experiments rarely look at issues of public importance, and when they do, the results are very rarely made accessible to the public. Many social computing researchers have stories about phantom studies that companies prevented them from publishing when PR or legal staff objected. Failures like those at Instagram, Wikipedia, and online classifieds are directly related to this reluctance toward public knowledge.

By early 2013, Bing was running up to 300 experiments per day, according to Ron Kohavi

Last year, the bioethicist Michelle Meyer argued that illusions about experiments can prevent the public from supporting or allowing research on new technologies and policies (see Meyer’s academic article). Meyer was writing about randomized trials, methods that estimate the potential outcomes of an intervention by comparing outcomes between randomly assigned individuals or groups. Meyer argues that debates about research ethics often ignore the human risks of putting powerful things into the world without responsibly studying their effects. Many disagree with Meyer’s arguments about how best to establish public trust in experimental research. Yet when studies are done safely, transparently, and accountably, field experiments help us evaluate interventions that we hope will achieve public benefits, while also estimating their potential harms.
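To make the idea of a randomized trial concrete, here is a minimal sketch in Python. It is an illustration under invented assumptions, not a method taken from Meyer or from any study mentioned here: the function names, the 30% baseline rate, and the 10-point treatment effect are all hypothetical. The sketch randomly splits a population into two groups, applies the intervention to one, and estimates the effect as the difference in average outcomes.

```python
import random
from statistics import mean


def hypothetical_outcome(is_treated, rng):
    """Invented data-generating process, for illustration only.

    Each person has a 30% chance of the outcome; the (hypothetical)
    intervention raises that chance by 10 percentage points.
    """
    probability = 0.30 + (0.10 if is_treated else 0.0)
    return 1 if rng.random() < probability else 0


def run_randomized_trial(n_people, seed=0):
    """Randomly assign people to two groups and compare average outcomes."""
    rng = random.Random(seed)
    people = list(range(n_people))
    rng.shuffle(people)  # random assignment: order decides the group
    half = n_people // 2
    treated, control = people[:half], people[half:]
    treated_outcomes = [hypothetical_outcome(True, rng) for _ in treated]
    control_outcomes = [hypothetical_outcome(False, rng) for _ in control]
    # Difference in means estimates the average effect of the intervention.
    return mean(treated_outcomes) - mean(control_outcomes)


estimated_effect = run_randomized_trial(20000, seed=42)
print(f"Estimated effect: {estimated_effect:.3f}")
```

Because assignment is random, the two groups are alike on average in every other respect, so the difference in outcomes can be attributed to the intervention rather than to who chose to receive it.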

Might some risks and benefits be so important that we face a moral obligation to experiment? In our discussion group this summer, the three of us set out to imagine the characteristics of this obligation. In this post, we set aside questions about how experiments should be conducted (methods, process ethics, legal codes, etc.), in order to focus on the nature of the possible obligation we’re considering. We hope this starts a conversation and we welcome your thoughts, feedback, and further references.

Who Has The Obligation To Experiment?

The obligation to experiment should vary by a platform or service’s ability to manifest significant risks to its users. For example, the obligation to experiment might apply if an entity:

  • legally collects (or has the ability to collect) extensive data about people
  • attempts to influence people’s behavior
  • functions as a key node in infrastructure, through its ubiquity in public life, or through substantial market saturation
  • is a common carrier, or is broadly understood and expected to uphold objectivity, neutrality, or public goods

Organizations and platforms that operate at larger scales should be more subject to this obligation. When a service mediates the life experiences of millions or billions of people, even small risks can add up. Examples of types of platforms that satisfy these conditions might include (but certainly are not limited to): messaging applications (Facebook Messenger, WeChat, Snapchat, Slack), public and semi-public forums (Wikimedia, Twitter, Facebook, reddit, YouTube), marketplaces and markets (app stores, Etsy, Amazon, Tinder, Uber, Airbnb), search engines (Google, DuckDuckGo), network providers (ISPs), operating systems (Apple iOS, Ubuntu, Android), and Internet of Things applications (Nest, Fitbit).

Even as we encourage institutions to take this obligation seriously, the rest of us can also conduct our own studies. The history of consumer testing began with independent testing of food and drug safety. Independent audits of discrimination on online platforms are becoming more common. Software is making this independent research more accessible: Nathan and Merry’s CivilServant project supports users of online platforms to conduct experiments independently of platform operators. In some online contexts like volunteer moderation or mental health support, the scale or impact of users’ power may be large enough for the obligation to apply to us as well.

What Does The Obligation To Experiment Apply To?

How do you know if you have a moral obligation to test and publish the potential outcomes of something you’re doing in the world? Here we focus on the risks and benefits of an intervention. If the risks or promises are greater, the obligation is greater. In the following list, we outline categories of risk, along with actual technologies and policies that involve those risks. Risks to life include significant and often irreversible harms, while risks to liberty include societal values of rights and fairness.

Risks To Life

Risks to physical health

Devices that claim accurate health measurements should be obligated to experimentally validate their claims. For example, lawsuits against Fitbit refer to the harms from inaccurate sleep tracking and heart rate measurements. Devices can also have unanticipated physical health effects, such as the effects of VR headsets on children’s eyesight. And beyond physical devices, mobile applications such as the fertility tracking app Glow have presented inconclusive research showing positive correlations between using the application and conceiving faster and getting pregnant. That’s not enough; they should conduct a proper clinical trial. As Uber and Lyft partner with hospitals to improve access to healthcare, their services should be evaluated as closely as any other hospital-provided transportation service.

Risks to mental health and personal well-being

We have already mentioned Instagram and Tumblr’s suicide prevention and mental health support features, where an untested intervention (forbidding certain hashtags) had a potentially detrimental effect (fracturing at-risk communities into separate hashtags, failing to prevent their growth, and making them harder to help). Facebook has partnered with suicide prevention hotlines to implement suicide prevention and reporting tools. High-stakes interventions such as these admirably try to address serious risks to millions of people’s mental health and well-being. These mental health projects should be evaluated systematically, especially given recent progress on designing randomized trials on the effects of suicide prevention initiatives (Wasserman 2004, Mann et al 2005, Wasserman 2015). Other systems that engage with these risks are tools for addressing harassment and hate speech, as well as apps that claim to relieve stress and anxiety.

Risks to public safety

Features and platforms which pose risks to public safety also bear the obligation to experiment. This broad category of risk also sometimes overlaps with risks to individual physical and mental health. Examples might include the police-tracking feature in the Waze navigation platform; Waze claims that “most users tend to drive more carefully when they believe law enforcement is nearby,” while policemen criticize the feature for placing police officers’ lives in danger and potentially aiding criminal activity. Claims such as these are excellent hypotheses for experimentation.

Waze controversially shows police locations. The company could easily test their effect on driver behavior.

Other platform features involving public safety are Uber’s Driving Safety services, which “measure indicators of unsafe driving and help driver partners stay safe on the road,” and autopilot technology being developed at Tesla, Uber, and Google. Disaster/emergency relief services such as Facebook Safety Check and Twitter Lifeline offer connection and support during crises, but do they make a difference? Simply offering a feel-good feature doesn’t guarantee its effectiveness. How could they be improved?

National security risks

Last February, Twitter reported that it had “suspended over 125,000 accounts for threatening or promoting terrorist acts, primarily related to ISIS,” as well as taking actions to inhibit extremist speech in cooperation with law enforcement agencies, counterterrorism organizations, and national and multinational governmental entities. “We have already seen results,” Twitter writes, “including an increase in account suspensions and [terrorist] activity shifting off Twitter.” But even those who agree with mass-suspension of accounts have to take Twitter’s word for it, despite evidence to the contrary from Instagram and Tumblr’s well-documented failures at trying to disrupt risky conversations on their platforms. Other platforms have adopted preventive measures, like Jigsaw’s use of Google ads to dissuade people from joining extremist groups. A growing body of causal research offers actionable knowledge on how to prevent hate crimes and reduce prejudice. When making sense of their approach to extremist activity, companies should learn from that research and add their own studies.

Risks to Liberties

Risks to civil liberties

In the twenty years since Barlow’s 1996 Declaration of the Independence of Cyberspace, ideas of utopias of free and open speech online have been complicated by issues of content moderation, search and curation, and intellectual property. Platforms are continually negotiating and renegotiating what constitutes acceptable (and unacceptable) speech. Because large, centralized platforms manage so much of our online communications, their policy decisions influence civil liberties worldwide. Yet across over 40 years of research, we found fewer than five published, causal field studies on anything related to moderation practices online.

Content moderation and online harassment both represent risks to civic engagement and the political process. More than a quarter of Americans say “they have at some point decided not to post something online for fear of attracting harassment,” according to a 2016 Data and Society report. Knowledge of mass surveillance also has a chilling effect on people’s information seeking for political knowledge. Some have suggested that social platforms’ “I voted” features could be used selectively to favor certain candidates. In a series of randomized trials in India in 2014, researchers recently showed how search engines could be used to influence elections. While it may be implausible to conduct randomized trials on complicated risks like mass surveillance, platforms do hold the required data for “natural experiments” that allow answers to these questions.

Internet users and communities also have good reason to experiment, especially when platforms don’t prioritize public knowledge on online safety. Across the web, large numbers of bystanders and volunteer moderators make millions of decisions about behavior in online communities. Two of us, Nathan and Merry, are developing CivilServant, software that supports communities to conduct their own experiments on moderation. Our first experiment, with a 13.5 million subscriber community, found that posting rules of participation led newcomers to be 7.3 percentage points more likely to follow the rules; the intervention also increased the incidence rate of newcomer participation by 38.1% on average. Our goal is to support any online community to test the questions that matter to them.
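The two figures above are reported in different units, and the difference matters when reading experiment results. A change in percentage points is an absolute difference between two rates, while a percent change is relative to the starting rate. The short Python sketch below shows the arithmetic; the 50% baseline is invented purely for illustration and is not a number from the CivilServant study.

```python
def percentage_point_change(baseline_rate, new_rate):
    """Absolute difference between two rates, in percentage points."""
    return (new_rate - baseline_rate) * 100


def relative_percent_change(baseline_rate, new_rate):
    """Change relative to the baseline rate, in percent."""
    return (new_rate - baseline_rate) / baseline_rate * 100


# Hypothetical numbers, not taken from the study: suppose 50% of newcomers
# followed the rules without the intervention and 57.3% did with it.
hypothetical_baseline = 0.50
hypothetical_new = 0.573

points = percentage_point_change(hypothetical_baseline, hypothetical_new)
relative = relative_percent_change(hypothetical_baseline, hypothetical_new)
print(f"{points:.1f} percentage points, {relative:.1f}% relative change")
```

Under this invented baseline, the same 7.3 percentage point difference would be a relative increase of about 14.6%; with a lower baseline the relative figure would be larger.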

With the CivilServant software, online communities will be able to design and run their own experiments. All results are open to community discussion and are published to an open archive of community findings.

Risks to fairness and equality

Experimentation has long been a basic tool for auditing discrimination and developing fair social systems (see Nathan’s post for a more thorough review of the role of experiments in understanding discrimination and fairness). And good ideas for creating fairness should be tested. Nathan, Sarah Szalavitz, and Ethan Zuckerman did test our ideas in a recently-published study on supporting gender equality among journalists on Twitter. Concerns of fairness also extend to technology hardware. More than one experiment has shown that people with different biological sexes experience VR technology differently and that shorter people, often women, are more likely to experience motion sickness in VR.

Who Is Owed the Obligation to Experiment?

Many experiment results go unreported, leading to false confidence in results.

Since the public should be the main beneficiary of any obligation to experiment, studies under this obligation should be public. Since experiment results include probabilities rather than certainties, there are also good mathematical reasons to expect publication. The public can be exposed to grave risks when researchers withhold the results of some trials but not others. By publishing all results, we gain a less biased picture of the effect of an idea in the world.
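A small simulation can illustrate why selective reporting distorts the picture. The Python sketch below is an illustration under invented assumptions, not an analysis of any real set of studies: it simulates many small experiments on an intervention whose true effect is 0.1 (in arbitrary units), then compares the average of all estimates with the average of only the estimates that cross a conventional threshold of statistical significance. The function names and every parameter are hypothetical.

```python
import random
from statistics import mean


def simulate_study(true_effect, n_per_arm, rng):
    """Simulate one two-arm study; return (estimate, standard_error)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_arm)]
    estimate = mean(treated) - mean(control)
    # Each arm is simulated with variance 1, so the standard error is known.
    standard_error = (2.0 / n_per_arm) ** 0.5
    return estimate, standard_error


def compare_reporting(true_effect=0.1, n_per_arm=50, n_studies=2000, seed=1):
    """Average effect across all studies vs. only 'significant' ones."""
    rng = random.Random(seed)
    studies = [simulate_study(true_effect, n_per_arm, rng)
               for _ in range(n_studies)]
    all_estimates = [est for est, _ in studies]
    # Assumed filter: only positive, statistically significant results
    # (z > 1.96) get reported.
    reported = [est for est, se in studies if est / se > 1.96]
    return mean(all_estimates), mean(reported), len(reported)


all_mean, reported_mean, n_reported = compare_reporting()
print(f"All {2000} studies: mean effect {all_mean:.2f}")
print(f"Only the {n_reported} reported studies: mean effect {reported_mean:.2f}")
```

In this sketch the full set of studies averages close to the true effect, while the reported subset overstates it several times over, because only the luckiest draws pass the filter.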

Can the Obligation to Experiment Speed Up Innovation?

Some might worry that an obligation to experiment might slow down the software industry’s pace of innovation. In some cases, like the failure of blood testing startup Theranos, more careful progress could have prevented serious risks. But for many technologies, remember that experiments are easier, faster, and more plentiful than in medicine. Through philosophies like lean startups and test-driven development, the practice of validating hypotheses is already common in the software industry. Think less of clinical trials and more of Bing.com’s 300 experiments per day. Some companies reportedly conduct so many experiments that they are creating datamining tools to make sense of their plentiful experiment results.

Experiments can speed up innovation by helping us discard received wisdom about interventions that just aren’t working. For years, it seemed that Instagram was doing a good thing by banning eating disorder hashtags. Similarly, downvoting buttons seemed smart, and many people begged Facebook to add a “dislike” button. Thanks to causal research, we now have reason to doubt our faith in those systems and look for new innovations.

The Longwood Gardens plant wall. In 1989, NASA researchers found that some plants can filter harmful gasses. In a series of followup studies, UT Sydney researchers replicated those findings, discovering that as few as 3 plants from specific species can filter some pollutants in the absence of air conditioning. The UTS team have also tested the effects of plant walls on VOC levels. Citizen scientists at the Public Laboratory for Science and Technology are also testing DIY plant-based air filters.

Next Steps for the Obligation to Experiment

If you want to see a world where the public receives reliable information on the risks and benefits of social technologies, here are six areas where more work could make a difference.

  • We need a public interest research ecosystem that evaluates the effects of social technologies on life and liberty. Other fields have equivalent institutions: Underwriters Laboratories evaluates electronic products for physical safety; the Poverty Action Lab tests international development ideas; the Chicago Urban Labs and the UK What Works Centres support towns and cities to test a wide range of social policies. EGAP is a network of researchers who conduct tests on governance and policy. Yet there is no equivalent public interest entity supporting causal research on the risks and benefits of social technology. If you want to help imagine and create this public interest ecosystem, contact us (email).
  • We need clearer thinking and governance of research ethics. The recent report by Jacob Metcalf, Emily F. Keller, and danah boyd with Perspectives on Big Data, Ethics, and Society summarizes these debates. As they point out, substantial work is needed from policymakers, academics, industry, educators, and funders to create and uphold ethical practices. Academic networks routinely update recommendations on ethical research decisionmaking. Desposato’s new book on experiment ethics and Salganik’s book on the practice of social research online offer helpful contributions to this ongoing conversation.
  • We need to reshape technology criticism to move beyond promotion and moral panics, as Sara M. Watson argues in her remarkable article, “Toward a Constructive Technology Criticism.” When addressing substantial public risks, critics should come to expect systematic testing and constructively call for tests where none exist.
  • We need experiment methods that create useful public knowledge. With CivilServant, we are developing infrastructures for community-led experiments that we hope will increase public knowledge on social technologies by 100x or more. Others are developing qualitative methods within field experiments. At the What Works Global Summit this year, organizations shared online platforms that make findings readable to the public by combining the results of many studies. At the Conference on Digital Experimentation, researchers shared new methods to detect differences in how people are affected by the same intervention. And now that machine learning systems are running their own experiments, we need new ways to evaluate the outcomes of artificial experimenters.
  • Enlightened companies need to unleash their own staff’s remarkable capacity to enhance their contribution to the common good. Tech companies can gain the benefits of positioning themselves as learners iterating toward public goods when they conduct public interest experiments on their products. Many already employ some of the leading researchers of our time, people who are leading recent transformations in the scale and usefulness of experiments. With ethical processes for conducting and sharing public research, companies can gain public approval and advantages over competitors.
  • Governments need to think beyond privacy and the uses of data to consider the outcomes in people’s lives. As they do so, regulators should create conditions where evaluations of technology outcomes are common. In the US, regulators should remove legal restrictions on public-interest audits of platform algorithms. Finally, governments should fund more public interest research on the effects of social technologies.

Summary of Our Argument

By outlining an obligation to experiment, we argue two basic things:

  • where a common use of technology poses substantial risk or benefit to life or liberty, it should be tested systematically
  • the results of those tests should be public knowledge

These two expectations are easier to meet than at any previous time in history. When Donald Campbell advocated for policy evaluation in 1969 and Archie Cochrane argued for randomized trials in health in 1972, experiments were rare in their fields. If you’re interested in discussing ways to do the same for social technologies, contact us (email).

Experiments offer a powerful lens for estimating the effects of the technologies and policies we put into the world, especially when paired with ethnography. Yet even in contexts of plentiful experimentation, we need guidance on where to direct that lens. Whether you care about mitigating harms or ensuring benefits, we hope the idea of the obligation to experiment offers clarity on who should experiment, when to experiment, and why this growing experimental knowledge should be in the public domain.