Datasheets for Datasets help ML engineers notice and understand ethical issues in training data

Karen Boyd
Published in ACM CSCW
Sep 19, 2021 · 5 min read

In 2015, Google Photos labelled some Black users’ photos as images of “gorillas.” In 2016, software that estimates defendants’ likelihood to re-offend (to help set sentences and bail) was found to disproportionately label Black defendants as “high risk.” Journalists and researchers have verified disparate impact in machine learning (ML) models’ outcomes and accuracy on the basis of age, gender, race, and the intersection of gender and race, to name a few.

Bias often starts with training data: they can be accidentally poorly balanced (containing too few examples of Black people’s faces, for example) or they can reflect real-world prejudice. An algorithm can’t tell the difference between “a longer nose-looking thing tends to correlate with the label ‘dog,’ and a smaller, shallower face is more likely to be ‘cat’” and “the phrase ‘boy scouts’ predicts an interview, but when a resume contains ‘girl scouts’ it tends to go in the ‘no’ pile.”
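One concrete way to catch the first kind of problem is to tabulate how a dataset’s examples are distributed across demographic groups before training anything. Here is a minimal sketch using pandas; the column names and values are hypothetical, not drawn from any dataset discussed in this post.

```python
import pandas as pd

# Hypothetical metadata for a face dataset; the columns and values are
# illustrative only, not taken from the dataset used in the study.
metadata = pd.DataFrame({
    "image_id": ["img_001", "img_002", "img_003", "img_004"],
    "skin_tone": ["light", "light", "light", "dark"],
    "perceived_gender": ["male", "male", "female", "male"],
})

# Tabulate the share of examples in each group. Heavily skewed
# proportions are a warning sign that a model trained on these data
# may perform worse on under-represented groups.
for column in ["skin_tone", "perceived_gender"]:
    print(metadata[column].value_counts(normalize=True))
```

Skewed proportions don’t prove a model will be biased, but they flag exactly the kind of imbalance that documentation like a Datasheet is meant to surface.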

Training data are the (usually very large) datasets that ML algorithms use to learn patterns. The model that results from this training uses the patterns it learned to, for example, label an image as containing a dog or a cat, a particular person of interest, or a member of the Uighur community in China. Training datasets also raise privacy concerns: datasets that were claimed to be anonymized have been re-identified, revealing people’s movie ratings, mobile location traces, and even genomic data.

Bias, privacy, and harmful use are just a few of the potential ethical problems accompanying training data for machine learning algorithms. Along with researchers, ML engineers themselves are concerned and see the need for technical and work practice interventions that defend against the quickly-evolving and often-hidden ethical problems in training data.

In this paper, I tested one such intervention, called Datasheets for Datasets. Datasheets are files accompanying datasets, designed in part to help ML engineers notice potential ethical issues in unfamiliar training data by documenting the dataset’s context. Unlike similar interventions, Datasheets have three important features: they focus on training data (rather than already trained models), they are general purpose (usable for any data type and any ML technique), and they are written in plain language. This means that 1) they intervene in training data, fairly early in the algorithm development process; 2) they can be used by a wide variety of firms and taught in introductory ML classes to students who will go on to work just about anywhere; and 3) they can be understood and evaluated by non-expert stakeholders, like managers, users, citizens, auditors, and policy-makers.
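To make that concrete, here is a rough sketch of the kinds of questions a datasheet answers, condensed into a small Python structure. The section names roughly follow the categories in the Datasheets for Datasets proposal, but the wording is my own shorthand rather than the official template; a real datasheet is a plain-language document, not code.

```python
# A condensed, illustrative datasheet outline; the real Datasheets for
# Datasets template is a longer set of plain-language questions grouped
# by the stages of a dataset's life cycle.
datasheet_questions = {
    "motivation": "Why was the dataset created, and by whom?",
    "composition": "What do the instances represent, and are any groups "
                   "over- or under-represented?",
    "collection_process": "How were the data gathered, and did the people "
                          "involved consent?",
    "preprocessing": "What cleaning or labeling was done, and is the raw "
                     "data still available?",
    "uses": "What has the dataset been used for, and what uses would be "
            "inappropriate?",
    "distribution": "How is the dataset shared, and under what terms?",
    "maintenance": "Who maintains the dataset, and how are errors reported?",
}

# Print a human-readable outline of the datasheet.
for section, question in datasheet_questions.items():
    print(f"{section}: {question}")
```

Because the answers are written in plain language, an engineer who never met the data collectors can still read the composition and collection sections and spot problems before training.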

I wanted to know whether Datasheets for Datasets achieve one of their most ambitious goals: do they help engineers notice potential ethical issues in data they have never seen before? And once they notice the ethical issue, does the Datasheet help them build an understanding of the ethical issue that will help them decide what to do next?

To answer these questions, I gave 23 ML engineers a problem statement describing a chain jewelry store that wanted to catch thieves, a set of images of human faces, and 30 minutes to think aloud as they developed an approach to solve the problem. Eleven participants were also given a Datasheet that explained the context of the data and explicitly named one of several potential ethical problems. In the Datasheet, I acknowledged that the dataset included more people with light skin than dark skin and more people who appeared to be male than people who appeared to be female, non-binary, or with unclear gender presentation. The Datasheet did not mention any of the many other potential problems, for example, threats to the privacy of people in the images or in the jewelry store.

I asked participants to think aloud as they worked for 30 minutes, then interviewed them. I used Ethical Sensitivity, a framework borrowed from studies of other professions and recently adapted for use with technologists, to observe three key activities:

  1. Recognition: Do participants notice an ethical issue at all? If they do, when? I used the think aloud transcripts to identify the first mention of a potential ethical issue: what they said, when they said it, and what they had on their screen. At the very end of the interview, I also asked participants whether they had noticed an ethical issue: I wanted to capture recognition that happened even if participants didn’t say it aloud (perhaps because they didn’t think it was relevant to the task they were supposed to be doing).
  2. Particularization: How did participants build an understanding of the ethical issue (if they did so)? I looked for activities like seeking information about or reflecting on the situation, its stakeholders, or its potential consequences; their own responsibility, options, and resources; and the relationship between the ethical problem and their task at hand. I noted what participants said while they particularized, what information they sought out, and from where.
  3. Judgment: Did participants make a decision about how to move forward? I didn’t expect many people to decide what to do within 30 minutes of seeing the data for the first time, but I recorded the comment, time stamp, and justification if they offered them.

What I found:

  • Despite my initial concern that participants would not read it, all but one of the participants who were given a Datasheet opened and referred to it.
  • Overall, participants were ethically sensitive enough to notice ethical problems in the dataset, including some that I did not intentionally plant (Yay!). Participants mentioned discrimination stemming from the demographics of the training dataset (15), the high stakes in facial recognition (particularly for false positives) (7), privacy and consent in the training data (9), privacy and consent in data collected from the store (5), other privacy concerns (2), unconscious bias among law enforcement or security personnel (1), and justice implications of predicting crime and acting on those predictions (i.e., “Pre-Crime”) (1).
  • Participants who were given Datasheets were more likely to mention potential ethical problems in the dataset while they were working. This could mean that participants who didn’t mention problems until directly asked were just as ethically sensitive, but didn’t believe that ethics were relevant to the task at hand. If the Datasheet signals to ML engineers that ethics are relevant to their work and encourages them to bring it up, it is still achieving its aims.
  • Participants with Datasheets relied on them heavily while particularizing: they spent more than half of their particularizing time with the Datasheet on the screen.

This work has implications for future research:

  • Think aloud was useful for identifying recognition (if not the *moment* of recognition in all cases) and particularization. Particularization is especially difficult for traditional methods to measure, and think aloud offers detailed data in context about how particularization unfolds.
  • The ethical sensitivity framework can be used to test interventions that claim to help people recognize ethical issues, understand them, or make ethical decisions.
  • A similar method could be used to develop, evaluate, and improve ethics training for technologists of all stripes.

The headline findings of this paper are encouraging for the users of Datasheets and for the framework of ethical sensitivity for studying technologists’ work practices. For more detail, I encourage you to read the paper and feel free to reach out to me! I’ll present this paper at ACM CSCW 2021.
