Sometimes when I tell people I research machine learning and its societal consequences, I get enthusiastic nods followed up by “right, because of biased data.” If the conversation is going to continue, this is the point where I backtrack and try to unpack what the other person is thinking: What do you mean by “biased data”? Where does data come from, and how does it become “biased,” anyway?
In fact, almost every time I discuss new ideas or existing work with people, there is inevitably a clarification period in which everyone tries to figure out exactly what problem we (or anyone, really) are trying to solve. Even when I read papers on my own, an extended, somewhat confused monologue runs through my head: What assumptions about the data and desired outcomes are the authors making? What is it about the data or modeling here that’s actually problematic? In what domains is this relevant?
To be fair, it’s not surprising that a crisp lexicon for discussing these topics doesn’t exist. They are fundamentally multidisciplinary problems, bringing with them familiar terms from many different fields and an ad hoc shared vocabulary. Different applications of automated systems have a huge range of contexts, objectives and incentives. And the recent surge in interest and publications (both academic and journalistic), while very exciting and necessary, brings with it growing pains.
When talking to others, I’ve noticed both the overuse of “suitcase” words like “bias” (which, as Lipton and Steinhardt describe, have “no universally agreed-upon meaning” and may reference “disjoint methods and desiderata”) and a lack of more specific, useful terms at the right level of abstraction.
The problem with language like this is that language matters. The right terminology forms a mental framework, making it that much easier to identify problems, communicate, and make progress. The absence of such a framework, on the other hand, can be actively harmful, encouraging one-size-fits-all fixes for “bias,” or making it difficult to see the commonalities and ways forward in existing work.
An illustrative scenario
An engineer building a smiling-detection system observes that the system has a higher false negative rate for women. Over the next week, she collects many more images of women, so that the proportions of men and women are now equal, and is happy to see the performance on the female subset improve.
Meanwhile, her co-worker has a dataset of job candidates and human-assigned ratings, and wants to build an algorithm for predicting the suitability of a candidate. He notices that women are much less likely to be predicted as suitable candidates than men. Inspired by his colleague’s success, he collects many more samples of women, but is dismayed to see that his model’s behavior does not change.
Why did this happen? The sources of the disparate performance in the two cases were different. In the first case, it arose because of a lack of data on women, and introducing more data solved the issue. In the second case, the use of a proxy label (human assessment of quality) in place of the true label (actual qualification) allowed the model to discriminate by gender, and collecting more labelled data from the same distribution did not help.
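To make the first engineer’s observation concrete, here is a minimal sketch of how a per-group false negative rate might be computed. The function name, the toy labels, and the group names are all invented for illustration; they do not come from any real system.

```python
from collections import defaultdict


def false_negative_rate_by_group(y_true, y_pred, groups):
    """Return {group: FN / (FN + TP)}, counted over examples whose true label is 1."""
    positives = defaultdict(int)  # true positives + false negatives per group
    misses = defaultdict(int)     # false negatives per group
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            if pred == 0:
                misses[group] += 1
    # Groups with no positive examples are left out rather than dividing by zero.
    return {g: misses[g] / positives[g] for g in positives}


# Hypothetical toy data: 1 = "smiling", 0 = "not smiling".
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
groups = ["men"] * 5 + ["women"] * 5

rates = false_negative_rate_by_group(y_true, y_pred, groups)
print(rates)
```

A check like this only surfaces the consequence. It cannot by itself say whether the source is too little data for one group or a problematic label, which is exactly why the two engineers needed different fixes.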
Data is the product of a process
Data and models and systems are not just unchanging numbers on a screen. They’re the result of a complex process that starts with years of historical context and involves a series of choices and norms, from data measurement to model evaluation to human interpretation.
There are plenty of places in this process for things to go wrong. In the data generation process, convenience might result in a skewed representation of the true population, or historical discrimination might be present in labels that will be considered “ground truth.” As ML practitioners, we might make unproven assumptions about the homogeneity of the data, or choose a set of variables too simplistic to model the outcome we really want, or not check the variance in performance on different subpopulations as we iterate.
Each of these things can lead to unwanted consequences, but the sources of those consequences are different. Viewing a problem only in terms of its consequence can mask an underlying problem in the data generation process, and may mean overlooking more direct ways to address it.
The source of a downstream consequence matters
Identifying a problem’s source requires careful application-specific analysis. Who decided what features and labels to use, and how many, and how carefully to measure them? Do they carry unwanted associations in this particular domain? Who collected the data, and from whom did they collect it? Over what period, and from where exactly was it collected? What assumptions about the data are inherent in the model choice? What data was the model benchmarked against, and how did that data even come to be used as a benchmark?
Deciding on a “fairness rule” and satisfying it makes it easy to avoid engaging with these questions, instead adjusting outcomes to make a particular consequence, like uneven false positive rates, disappear. This is not to say that such an adjustment is never useful, but that arriving at this solution without considering the source of the problem avoids important questions about how and why an automated system exists, and relies on global assumptions about what may or may not be “fair.”
Five potential sources of harm
What follows is a framework for thinking about sources of downstream harms in an automated system: how they arise, how they fit into a typical ML pipeline, and some examples. (For more on what I mean by “harm,” I highly recommend checking out the framework presented by Kate Crawford in her 2017 talk.)
For me, this has been an increasingly useful framework to have when interpreting existing work, thinking about new directions, and communicating effectively with others. I imagine future papers being able to state the problem(s) they address in clear, shared terminology, making their framing and assumptions more understandable.
These potential sources of harm arise at different points in an ML pipeline:
- Historical bias arises when there is a misalignment between the world as it is and the values or objectives to be encoded and propagated in a model. It is a normative concern with the state of the world, and exists even given perfect sampling and feature selection.
- Representation bias arises while defining and sampling a development population. It occurs when the development population under-represents some part of the final population, and the model subsequently performs worse for that part.
- Measurement bias arises when choosing and measuring the particular features and labels of interest. Features considered to be relevant to the outcome are chosen, but these can be incomplete or contain group- or input-dependent noise. In many cases, the choice of a single label to create a classification task may be an oversimplification that measures the true outcome of interest more accurately for certain groups than for others.
- Evaluation bias occurs during model iteration and evaluation, when the testing or external benchmark populations do not equally represent the various parts of the final population. Evaluation bias can also arise from the use of performance metrics that are not granular or comprehensive enough.
- Aggregation bias arises when flawed assumptions about the population affect model definition. In many applications, the population of interest is heterogeneous and a single model is unlikely to suit all subgroups.
More detail on how each of these arises, with examples and more background, is in this paper.
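As a rough illustration of evaluation bias, the sketch below shows how a single aggregate accuracy depends on who is in the benchmark. The per-group accuracies and population shares are made-up numbers, and `overall_accuracy` is a hypothetical helper written for this example.

```python
def overall_accuracy(per_group_accuracy, group_share):
    """Aggregate accuracy as the share-weighted mean of per-group accuracies."""
    if abs(sum(group_share.values()) - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1")
    return sum(per_group_accuracy[g] * group_share[g] for g in group_share)


# Hypothetical numbers: the model is much weaker on group_b.
per_group_accuracy = {"group_a": 0.95, "group_b": 0.70}

# The benchmark is dominated by group_a; the final population is not.
benchmark_share = {"group_a": 0.9, "group_b": 0.1}
final_population_share = {"group_a": 0.5, "group_b": 0.5}

benchmark_acc = overall_accuracy(per_group_accuracy, benchmark_share)
final_acc = overall_accuracy(per_group_accuracy, final_population_share)
print(f"benchmark: {benchmark_acc:.3f}, final population: {final_acc:.3f}")
```

With these toy numbers the benchmark reports 0.925 while the same model achieves 0.825 on the final population, and neither aggregate reveals the 0.70 for group_b. Reporting the per-group figures alongside the aggregate is one way to make the metric more granular.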
For an ML practitioner, knowledge of an application can and should inform the identification of bias sources. Issues that arise in image recognition, for example, are often related to representation or evaluation bias, since large publicly available image datasets frequently do not equally represent the entire space of images that we care about. In data that is affected by human decision-makers, we often see human decisions used as proxies, introducing measurement bias. For example, “arrested” is used as a proxy for “crime,” or “pain medication prescribed by doctor” is used as a proxy for “patient’s pain.” Identifying aggregation bias usually requires an understanding of meaningful groups and a reason to think they are distributed differently. Medical applications, for example, often risk aggregation bias because patients with similar underlying conditions present and progress in different ways. Recognizing historical bias requires a retrospective understanding of the application and data generation process over time.
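Aggregation bias in particular can be shown with a deliberately extreme toy example. In the sketch below, two invented groups have opposite relationships between a feature and an outcome, so a single pooled line fits neither. The data, the group names, and the helper functions are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the fitted line and the observed outcomes."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
group_a = [1.0 + 2.0 * x for x in xs]  # outcome rises with the feature
group_b = [9.0 - 2.0 * x for x in xs]  # outcome falls with the feature

# One model for everyone versus one model fitted to group_a alone.
pooled = fit_line(xs + xs, group_a + group_b)
separate_a = fit_line(xs, group_a)

pooled_error_a = mean_squared_error(xs, group_a, *pooled)
separate_error_a = mean_squared_error(xs, group_a, *separate_a)
print(pooled, pooled_error_a, separate_error_a)
```

Here the pooled fit is a flat line that ignores the feature entirely, while the group-specific fit recovers the trend. Real data is rarely this clean, and fitting separate models is only one possible response; the point is that the pooled model’s poor fit comes from an assumption of homogeneity, not from a lack of data.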
The aim of this framework is to help people understand “bias” in ML at the right level of abstraction to facilitate more productive communication and development of solutions. Terms such as “training data bias” are too broad to be useful, and context-specific fixes don’t have the shared terminology to generalize and communicate the problem to a wider audience.
By framing sources of harm through the data generation process, we hope to encourage application-appropriate solutions rather than relying on broad notions of what is fair. Fairness is not one-size-fits-all; knowledge of an application and engagement with its stakeholders should inform the identification of these sources.
We also wanted to illustrate that there are important choices being made throughout the larger data generation and ML pipeline that extend far beyond model building. In practice, ML is an iterative process with a long and complicated feedback loop; we should be conscious that problems can manifest at any point, from initial problem framing to evaluating models and benchmarking them against each other.
Thanks for reading! If you’re interested, here’s a link to the paper with more details.