What We Talk About When We Talk About Bias (A guide for everyone)

Ayesha Bajwa
31 min read · Aug 17, 2018


Adapted from my paper for the Ethics and Governance of AI class at the MIT Media Lab. Written for readers who have concerns about bias in artificial intelligence (AI) systems but don’t have the relevant technical background. However, it’s not too watered-down. Don’t worry, there’s a glossary and many links!

While bias is most often introduced in data and observed in predictive, real-world applications, it usually surfaces at multiple stages in the development pipeline. We attempt to cover the underlying concepts and vocabulary in order to have a meaningful discussion about various perspectives on bias. In this post, we introduce AI systems from a traditional machine learning (ML) perspective in a fair amount of detail. We provide examples and discussion of how data bias, model bias, learning algorithm bias, system bias, and human-in-the-loop bias are observed in ML systems. Hopefully, we’ll begin to bridge the technical basics of ML with the larger societal concerns of bias in the process.

Introduction

The field of machine learning (ML) — which is a subfield of artificial intelligence (AI) — is enormously empirical, with even the most experienced researchers often lacking a satisfactory explanation for why a particular technique does or does not work well. Approaches that work well in practice on datasets, as measured by error or accuracy values, can become field-wide standards without rigorous justification. An xkcd comic provides a succinct illustration of this philosophy.

The xkcd comic referenced above. (Source)

ML papers sometimes produce their own explanations of why particular techniques work, without further evidence from the field. An example is the phenomenon of “internal covariate shift,” a problem both hypothesized and solved by the authors of the original paper on batch normalization, a now-common technique used for training neural networks. While this phenomenon is barely discussed beyond that paper, batch normalization is a technique that many researchers employ — because it works. The theme of using whatever works well, without having explained or justified it, portends dangerous consequences when automated systems get it wrong.

Due to its potential for high impact and profit, deep learning moves at a fast pace and is especially pervasive in industry labs, though it impacts many areas of science and technology. This competitiveness demands a high degree of operational secrecy, as evidenced by the research divisions of large companies, which pay their top researchers enormous salaries but externally publish only a fraction of their results (and almost never share their datasets). The increasing availability of big data held by private companies underpins the current AI revolution in industry. However, these big datasets can also be a primary source of bias.

From Moritz Hardt’s Fair ML course. (Source)

Once neglected by the research mainstream, discussions of bias and fairness in ML have become a new norm in both research and industry. Moritz Hardt, ML researcher and UC Berkeley professor, teaches a Fairness in Machine Learning course and has written clear and comprehensible blog posts on the subject. Kate Crawford, a principal researcher at Microsoft Research and NYU who focuses on the social implications of AI, was a 2017 keynote speaker at NIPS, a top-tier ML conference. Researchers predict that the European Union’s General Data Protection Regulation (GDPR), passed in 2016, will raise the standards for industry by effectively creating a “right to explanation” for individuals who have been subject to algorithmic decisions. A 2014 report by the White House details findings about bias in algorithmically-generated models, such as “redlining in the digital economy,” and expresses concern that most “research was not able to determine exactly why a racially biased result occurred.” The tech and AI giants Amazon, Apple, Microsoft, DeepMind, Google, Facebook, and IBM are the founding partners of the Partnership on AI to Benefit People and Society, a consortium dedicated to establishing best practices for AI and increasing transparency. Google and DeepMind have established internal ethics committees with an AI focus, though little information on them is publicly available.

The increasing interest in bias and fairness in machine learning by the research community, companies, public interests, and governments suggests that bias is more than a technical problem. However, some level of technical understanding is helpful in understanding how bias arises.

Before delving into a discussion of bias, we first establish basic ML concepts. There are two common types of machine learning, supervised and unsupervised learning. We focus on supervised learning, which requires as input a training dataset of input data examples with associated labels; a typical example is a dataset of animal photos, each labeled with a class such as “cat”. We want to create a model that maps inputs to output predictions. A supervised learning algorithm learns a model from data by starting with some generic model type and optimizing the model parameters in order to minimize the training error on the training dataset — this can also be viewed as maximizing the model’s accuracy. It does so by minimizing an objective function, which is generally based on the error between the actual labels and the predicted values.

We can visualize this optimization by imagining the model parameters incrementally changing in the appropriate directions, based on each example the current model sees as input, in order for the model to generate the appropriate output (the model sees the labels in training). Optimization algorithms such as gradient descent simply push the parameters in the right directions, decreasing error. Intuitively, we should minimize the training error because we want to achieve the correct predictions on the training data as an indication that the model is working, producing expected results for data it has seen. For example, when training a deep neural network (DNN) to distinguish images with cats from images without cats, we understand that our model will probably not work well until it is able to correctly distinguish cat and non-cat images in the training dataset. Once the model achieves some reasonable accuracy, we can try it out on a new, test dataset — which is simply a dataset of the same type, with the same categories of associated labels as the training set. However, the trained model is now trying to predict the labels of the test dataset.
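
As a concrete (if toy) illustration, the following sketch fits a linear model to synthetic data by gradient descent on the mean squared training error; the data, learning rate, and step count are arbitrary choices made purely for illustration.

```python
import numpy as np

# A minimal sketch of supervised learning: fit a linear model y ≈ w*x + b
# by gradient descent on the mean squared training error. The data is
# synthetic and purely illustrative.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)              # input examples
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, 200)   # labels (true relationship plus noise)

w, b = 0.0, 0.0       # model parameters, to be learned
learning_rate = 0.1   # chosen by hand, not learned

for step in range(500):
    predictions = w * x + b
    error = predictions - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Push the parameters in the direction that decreases the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f}, training MSE={np.mean(error**2):.4f}")
```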

In this post, we break down the issue of bias by considering the stages at which it may be introduced and observed. These stages fall along a pipeline of how many ML systems are constructed. The premise is to learn patterns from data, so we begin with data bias (Section 2), which we argue is essential to describe precisely, as it is usually the stage at which bias is introduced. We move upwards to model bias (Section 3) and learning algorithm bias (Section 4), which are very closely tied. This concludes the core of any ML development pipeline. We further consider system bias (Section 5) in the malicious intent and emergent behavior cases. Finally, we consider how human behaviors can bias human-in-the-loop systems (Section 6); this discussion is crucial due to the increasing deployment of ML systems in real-world situations. We put forth the opinion that, while bias is most often introduced in data and observed in predictive, real-world applications, it affects multiple stages in the pipeline.

We don’t provide a formal definition of bias or claim to cover all cases. We simply lay out a framework of stages at which it’s useful to consider bias. Loosely, we think of bias as the ways in which a system that appears to function correctly may fall short of designer, legal, or social expectations and definitions of fairness. These expectations and definitions are subjective and need to be formally defined by societies and governments in the long term.

Section 1: Case Scenarios

We define a pipeline to illustrate various stages at which bias may occur in the following sections. Consider three case scenarios that we will trace through the first part of the pipeline (data bias through learner bias):

  • ProPublica’s findings that Northpointe’s COMPAS, used for predicting criminal risk scores, appears to be “biased against blacks.”
  • A toy example about mortality in the ICU.
  • An ML parable popularized by Eliezer Yudkowsky about the US Army’s failed attempt to train a neural network to distinguish camouflaged enemy tanks in forest settings.

Section 2: Data Bias

Since machine learning problems are usually formulated as learning a predictive model from real data, it’s natural to find real-world biases reflected in automated systems. We place a huge amount of blind faith in the quality of datasets when training data-hungry models.

One can train a good algorithm on skewed data and obtain a biased model — in fact, this outcome would be expected. The opaque nature of many state-of-the-art ML systems means that detecting, explaining, or correcting this bias is difficult.

A high-profile societal example of bias in automated systems is ProPublica’s work on criminal justice. Julia Angwin and a team of ProPublica reporters found that the risk assessment scores’ false positive and false negative rates are biased against black individuals and favorable to white individuals. Though this phenomenon is often termed “machine bias” (as in the article title) or “algorithmic bias,” there is a clear aspect of data bias surrounding the issue; critics point out that blacks are arrested more often than whites, even when committing crimes at the same rate. Law professor Ifeoma Ajunwa, who testified about big data implications before the Equal Employment Opportunity Commission, argues that the number of convictions a person has is “not a neutral variable.”

The implications of data bias are enormous. Data bias means that, regardless of the particular model used, any statistical prediction model relying on pattern recognition will exhibit the same kind of biases present in the training data.

We encounter similar issues in domains where data is uneven or sparse, and we must pay attention to baselines. Consider a toy example in which we want to predict mortality in a particular ICU. Assume the baseline mortality rate is 7%. If we train a model which gives us 93% accuracy on a test set, we should immediately be suspicious. Any trained model which always predicts survival in this ICU achieves 93% accuracy, but is nonetheless useless. Keep in mind that 100% training accuracy is not necessarily desirable either, due to overfitting (see Section 3: Model Bias). If the data is uneven, as it is in this case where 93% of the samples in a sufficiently large dataset will be associated with the survival label and 7% will be associated with the death label, it’s common to observe this form of bias. When there are large imbalances in labels, baselines are key.
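
A minimal sketch of this baseline trap, using made-up labels drawn at the assumed 7% mortality rate: a “model” that always predicts survival reaches roughly 93% accuracy while identifying none of the deaths.

```python
import numpy as np

# Toy illustration of the ICU baseline problem (assumed 7% mortality rate):
# a "model" that always predicts survival scores ~93% accuracy but has no
# predictive value for the outcome we actually care about.
rng = np.random.default_rng(0)
labels = rng.random(10_000) < 0.07        # True = died, roughly 7% of patients
always_survive = np.zeros_like(labels)    # predict "survived" for everyone

accuracy = np.mean(always_survive == labels)
recall_on_deaths = always_survive[labels].mean()  # fraction of actual deaths flagged

print(f"accuracy: {accuracy:.2%}")                  # about 93%, matching the baseline
print(f"deaths correctly identified: {recall_on_deaths:.2%}")  # 0%
```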

Moritz Hardt writes, “the lesson is that statistical patterns that apply to the majority might be invalid within a minority group,” and when standard approaches are used at the population-level, the learned outcomes are those that best represent the majority. This disparity easily translates into unfairness for minority groups, since predictions for a majority class will be more accurate than predictions for any minority class. The below example from Hardt’s blog post illustrates how a model may be worse for a minority group. Imbalances in the data result in worse predictive power for examples with a non-majority label, as in the ICU example.

This trained Gaussian mixture model, shown with solid and dashed ellipses, achieves a poor fit on the minority data (fewer points) in the bottom left corner compared to the majority data (more points) at the top right. From Moritz Hardt’s blog post (Source).

While we may assign particular labels to a training dataset, the labeled attributes might not be what is actually learned by a model. A classic ML parable about army tanks illustrates a common pitfall of the requirement for similar training and test datasets, reminding us that we often lack insight into how a model is learning to associate input data with particular labels. According to the story, the US army developed neural network models for detecting camouflaged enemy tanks. The trained model correctly distinguished the 50 photos of camouflaged enemy tanks from the 50 photos of the forest, and the performance on the held-out test data — also 50 photos of camouflaged enemy tanks and 50 of the forest — was also highly accurate. However, after the model was shipped to the Pentagon, it was rejected as no better than humans at detecting enemy tanks. The catch: in both the training and test datasets (which were split from the same original dataset), photos of tanks were taken on cloudy days, while forest photos were from sunny days. The model had learned to distinguish sunny from cloudy days, unsurprisingly giving poor performance when employed to detect enemy tanks.

Large technology companies have come under scrutiny for the bias discovered in their products. For instance, Amazon was accused of discrimination and redlining when rolling out its same-day service, despite the fact that it does not use racial or ethnic data in its maps. While it is often illegal to use race as an algorithmic consideration, companies such as Amazon find themselves in a bind just by using seemingly reasonable metrics, such as ZIP codes with a high concentration of Prime members. Apparently benign metrics can function as proxies for race or socioeconomic status, reinforcing inequality of access to products or services in disadvantaged areas.

A breakdown of current data interrogation standards from the Dataset Nutrition Label team. (Source)

A recent survey conducted by an interdisciplinary team of the Assembly program at the MIT Media Lab and the Berkman Klein Center for Internet & Society at Harvard found that in practice, data scientists as a whole lack consistent standards, with existing standards being highly volatile and domain dependent. Out of the 34 individuals surveyed, just under half (44%) reported that their organizations lacked best practice standards for data analysis, while approximately a quarter reported that their organizations have best practice standards. This team developed the concept of a Dataset Nutrition Label, a tool which gives data scientists some standard information — both qualitative and quantitative — about the fitness and viability of the data, what sorts of biases it may contain, and what sorts of application domains might be reasonable. This project highlights a lack of rigor with respect to choosing the right dataset for a particular task. Standards such as the Dataset Nutrition Label aim to drive accountability and mitigate harm, but widespread adoption is required and even then they don’t completely prevent bias. A common problem with data bias is that, while introduced at the data level, its effects are often unclear until further down the ML development pipeline.

Section 3: Model Bias

While data bias usually leads to problems further down the pipeline, the model is another stage at which bias may become apparent or amplified. Even with unbiased data, models may be wrong or unfair in a variety of ways. Since real-world data is rarely unbiased, the particular type of model evaluation can surface varying symptoms of bias.

One common type of model bias is overfitting, the bane of Intro ML students. Overfitting occurs when a model is fit so specifically to a training dataset that it gives poor performance on the test dataset. Even with the important assumption that the training and test datasets are drawn from the same distribution or generated in the same way, overfitting is dangerous.

A simple 2D example of polynomial overfitting.

Overfitting is usually avoided by constraining model complexity using regularization. In the language of mathematical optimization, regularization introduces a new constraint: when the learning algorithm tweaks the parameters to minimize the error, it must do so while satisfying the regularization constraint. This usually increases the training error slightly, but decreases the test error due to an improved ability to generalize. We show an overfitted model (orange) contrasted with a regularized one (green) as a visual example.
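
As a rough illustration of this tradeoff (not the exact figure described above), the sketch below fits a high-degree polynomial to a small synthetic dataset with and without an L2 (ridge) penalty and compares training and test error; the degrees, penalty strength, and data are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy 1D data: a noisy quadratic relationship.
rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(-1, 1, 15)).reshape(-1, 1)
y_train = x_train.ravel() ** 2 + rng.normal(0, 0.1, 15)
x_test = np.linspace(-1, 1, 100).reshape(-1, 1)
y_test = x_test.ravel() ** 2

# A high-degree polynomial with no regularization can chase the noise
# (overfitting); the same polynomial with an L2 penalty (ridge) is pulled
# toward smaller coefficients and tends to generalize better.
overfit = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=0.1))

for name, model in [("unregularized", overfit), ("ridge", regularized)]:
    model.fit(x_train, y_train)
    train_err = np.mean((model.predict(x_train) - y_train) ** 2)
    test_err = np.mean((model.predict(x_test) - y_test) ** 2)
    print(f"{name:>14}: train MSE={train_err:.4f}, test MSE={test_err:.4f}")
```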

Addressing overfitting with regularization is largely an art, not a science, and certain types of regularization approaches (especially ones that induce sparsity) may not be suitable for unevenly labeled data. Overfitting and regularization are examples of how standard ML issues and techniques, when used in societal contexts, may provide varying fairness and accuracy for different groups.

In our toy ICU and enemy tank scenarios, the models are wrong in simple ways. In the ICU example, the fact that the model always predicts one class is suspicious. Looking at the baselines, it becomes obvious that a 93% accuracy is not predictive. In the camouflaged enemy tanks example, the near-perfect training and test accuracy is fishy, and the datasets provided should be far more varied for a system expected to function in the real-world scenario of camouflaged enemy tank detection. Both the ICU and enemy tank model flaws may be caught by researchers with minimal probing and evaluation of the models.

Symptoms of bias vary depending on how we evaluate a model. Consider the varying evaluations of Northpointe’s COMPAS. Cathy O’Neil, data scientist and author of Weapons of Math Destruction, says “Northpointe answers the question of how accurate it is for white people and black people, but it does not ask or care about the question of how inaccurate it is for white people and black people: How many times are you mislabeling somebody as high-risk?”— and how does this affect how we judge fairness? Matthias Spielkamp, a founder of Berlin-based nonprofit AlgorithmWatch, points out in an MIT Technology Review article that the ProPublica reporters and the rebuttal by Northpointe don’t actually contradict each other, given varying definitions of fairness. While the ProPublica study compared false positive and false negative rates for blacks and whites respectively, finding them unfavorable to blacks and favorable to whites, the Northpointe study compared the positive predictive value across races and found the model to be consistent. While both types of evaluation may be technically valid, the “correct” one to use depends on societal values and definitions of fairness. Until these definitions are made explicitly legal, regulated, and enforced, there is no incentive to perform the appropriate type of model evaluation.
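
The following sketch, on entirely invented labels and predictions rather than COMPAS data, computes both kinds of metrics per group (false positive and false negative rates on one side, positive predictive value on the other) to show that the two evaluations answer different questions and need not agree.

```python
import numpy as np

def group_metrics(y_true, y_pred):
    """False positive rate, false negative rate, and positive predictive value."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return {
        "FPR": fp / (fp + tn),  # the kind of rate a ProPublica-style analysis compares
        "FNR": fn / (fn + tp),
        "PPV": tp / (tp + fp),  # the kind of rate a calibration-style analysis compares
    }

# Entirely made-up labels and predictions for two groups, just to show that
# the two evaluations measure different things and can disagree.
rng = np.random.default_rng(2)
for group, base_rate, fpr_knob in [("group A", 0.3, 0.1), ("group B", 0.5, 0.3)]:
    y_true = (rng.random(5000) < base_rate).astype(int)
    # A crude synthetic classifier: catches 70% of positives, and its
    # false-positive behavior differs by group.
    y_pred = np.where(y_true == 1,
                      rng.random(5000) < 0.7,
                      rng.random(5000) < fpr_knob).astype(int)
    print(group, {k: round(v, 2) for k, v in group_metrics(y_true, y_pred).items()})
```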

Unsupervised ML approaches, though free of labels, are not necessarily free of bias. Researchers at Boston University and Microsoft Research have demonstrated that word embeddings, a popular technique for representing text as numerical vectors, often exhibit high gender bias when trained on large, real-world datasets such as Google News. Since word embeddings are commonly used by ML researchers, there is a risk that gender bias will not only be reflected but also amplified in applications where word embeddings are used. After quantifying these biases, the researchers devised a sequence of steps to neutralize and “de-bias” word embedding algorithms, with techniques such as ensuring gender-neutrality of known gender-neutral words and enforcing distance constraints between particular embeddings. Their approach successfully retains the usefulness of word embeddings while decreasing the gender bias they exhibit. Practical research endeavors such as this are critical to ensure prolonged traction and interest in the issues of bias and fairness within research communities.
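
A toy sketch of the underlying idea: define a gender direction from a word pair and measure how strongly supposedly neutral words project onto it. The vectors below are invented for illustration (real analyses use pretrained embeddings such as word2vec trained on Google News), and the neutralize step is a simplified version of one operation from the de-biasing literature.

```python
import numpy as np

# Made-up 3D "embeddings" standing in for real pretrained word vectors.
embeddings = {
    "he":         np.array([ 0.8,  0.1,  0.3]),
    "she":        np.array([-0.7,  0.2,  0.3]),
    "programmer": np.array([ 0.5,  0.6, -0.1]),  # hypothetical vectors
    "homemaker":  np.array([-0.6,  0.5,  0.0]),
    "table":      np.array([ 0.0, -0.3,  0.9]),
}

# Define a gender direction from a paired difference and normalize it.
gender_direction = embeddings["he"] - embeddings["she"]
gender_direction /= np.linalg.norm(gender_direction)

for word in ["programmer", "homemaker", "table"]:
    v = embeddings[word] / np.linalg.norm(embeddings[word])
    bias = float(v @ gender_direction)  # signed projection onto the gender axis
    print(f"{word:>10}: {bias:+.2f}")

def neutralize(vec, direction):
    # Remove the component along a unit-norm direction, a simplified version
    # of the "neutralize" step used for known gender-neutral words.
    return vec - (vec @ direction) * direction
```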

Joy Buolamwini of the MIT Media Lab discusses her personal experiences with bias as a black woman whose face often went unrecognized by facial recognition software that had no trouble with white faces. Buolamwini argues that code libraries — which improve efficiency by allowing developers to build on previous work instead of re-implementing it — can contribute to a “coded gaze,” reflecting a lack of inclusion in the technology sector as well as in training datasets. The solution is largely to solve the data bias; by promoting the utilization and widespread adoption of more diverse training datasets, Buolamwini thinks the problem of the “coded gaze” can be solved. In her examples, addressing data bias fixes the model bias as well.

Another key point from ML basics is that we generally learn parameters, not entire models. When we train a deep neural network for a particular task, we must somewhat arbitrarily choose a network architecture upfront. Examples of chosen, as opposed to learned, architectural parameters for a neural network may include the number of layers, the types of each layer (e.g. fully-connected, convolutional), the number of nodes in each layer, and the activation functions (e.g. ReLU, Sigmoid, Softmax) to use. Fixed parameters that are not modified during training are called hyperparameters. Other parameters which are not architectural but are also not learned from data include the learning rate or step size, model constants, and initialization values. Sometimes, these are chosen with the help of a validation dataset, which functions as a pre-test dataset that a model produced from a training dataset might be immediately tested on for tweaking of hyperparameters. The objective function must also be chosen well. The metric of success is nearly always accuracy; bias and fairness are not usually considered because of the enormous trust placed in the veracity of the data.
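
A small sketch of this distinction, using scikit-learn on synthetic data: the architecture, activation, and learning rate are hyperparameters fixed by the developer and tuned against a validation split, while the network weights are the parameters actually learned from the training data. The specific values tried here are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic classification data, split into training and validation sets.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best = None
for hidden in [(16,), (64, 64)]:     # architectural hyperparameters (chosen, not learned)
    for lr in [1e-2, 1e-3]:          # learning rate (chosen, not learned)
        model = MLPClassifier(hidden_layer_sizes=hidden, activation="relu",
                              learning_rate_init=lr, max_iter=500, random_state=0)
        model.fit(X_train, y_train)        # the weights in model.coefs_ ARE learned
        score = model.score(X_val, y_val)  # validation data used to pick hyperparameters
        if best is None or score > best[0]:
            best = (score, hidden, lr)

print(f"best validation accuracy {best[0]:.3f} with layers={best[1]}, lr={best[2]}")
```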

Those who study machine learning distinguish estimation error and structural error. Formally, estimation error occurs when the parameters of a hypothesis cannot be estimated well based on the training data, while structural error occurs when the hypothesis class corresponding to a particular model (e.g. a neural network architecture) cannot produce a specific hypothesis (e.g. the network architecture with specific learned parameters) that performs well on test data. Estimation error decreases with larger training datasets, as more training data allows the model to refine the parameters and perform better on test data. However, even when training data is increased (and overfitting is prevented), a model may still suffer from structural error — it may simply be the wrong sort of model for the problem, no matter how we tweak the parameters.

Model evaluation in the context of societal applications can be difficult when models have varying levels of interpretability. While simple models (e.g. linear regression) have fairly high interpretability and explainability, the large and complex state-of-the-art neural networks that power many modern AI systems are opaque in comparison — they are black-box models. There has been a prioritization of accuracy over interpretability and explainability within the research community that makes black-box models acceptable. However, many agree that only explainable models should be used in certain applications; complex deep neural networks, for example, are not commonly used in criminal justice.

While complexity and inexplicability are clear hazards to fairness, simplicity does not guarantee a bias-free model. The purportedly race-neutral model PredPol, for forecasting crime as an aid to predictive policing, is based on just three data characteristics: the type, location, and date and time of previously committed crimes. It is still found to be biased under certain evaluations. The criminal justice researchers Kristian Lum and William Isaac respond to criticism that their analysis is based on drug crime data, which is not the type of crime the software is typically used to forecast. They write that PredPol “has been used by some jurisdictions to allocate resources to police other crime types that suffer from similar statistical bias induced by the highly discretionary nature of enforcement.” As Cathy O’Neil and others have pointed out, the method of evaluation matters just as much as the model, so evaluating models on data sufficiently similar to that used in practice is critical to any analysis. And, as in the case of PredPol, not all models are used as the designers intended.

Section 4: Learner Bias

We use learning algorithms, or learners, to obtain trained models from data. The types of models we consider for an ML problem generally have associated learning algorithms to optimize the model parameters, so it is somewhat artificial to separate this section from the one on model bias. The dataset, model, and learning algorithm form the core of an ML development pipeline.

Learning algorithms take in training data and produce the tuned parameters of the given model. To learn the weight vector and offset for a linear classifier, we may use the Perceptron algorithm. For a complex neural network architecture, we use gradient descent and backpropagation to learn the weight and bias parameters of the network from the training data. The pairing between models and their associated learning algorithms implies that learning algorithm bias is closely tied to the issue of model bias. The common phrase “algorithmic bias” can be a misnomer, since it usually refers not to the learning algorithm but to the test application of a model with learned parameters. People who talk about “algorithmic bias” are often really talking about data bias emerging in a model after training.
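
For concreteness, here is a minimal Perceptron sketch on synthetic, linearly separable data: for each misclassified example, the learning algorithm nudges the weight vector and offset toward classifying it correctly.

```python
import numpy as np

# Synthetic, linearly separable data with labels in {-1, +1}.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 2 * X[:, 1] > 0, 1, -1)  # labels from a known separator

w = np.zeros(2)  # weight vector, learned
b = 0.0          # offset, learned

for _ in range(20):                 # passes over the training data
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:  # example misclassified (or on the boundary)
            w += yi * xi            # the perceptron update
            b += yi

train_acc = np.mean(np.sign(X @ w + b) == y)
print(f"learned w={w.round(2)}, b={b:.2f}, training accuracy={train_acc:.2%}")
```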

Problems of model bias are relevant to learning algorithms as well, including the blind faith in data as well as the potentially arbitrary nature of the hypothesis class, or type of model (and associated learning algorithm), chosen for a particular problem. A learning algorithm that works well is expected to tweak a model’s parameters to replicate statistical patterns present in the data — bias included. As Moritz Hardt writes, protected classes such as race and gender can easily be “latent in the observed attributes,” so “nothing prevents the learning algorithm from discovering these encodings.” It is possible for a learning algorithm that sees no data about race to — picking up on statistical correlations — produce a model that, when evaluated, is found to produce worse outcomes for those of minority races.
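
A synthetic illustration of this point (all features, group sizes, and coefficients are invented): the protected attribute is withheld from the model, but a correlated proxy feature lets a plain logistic regression reproduce the disparity present in the historical labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 20_000
group = rng.integers(0, 2, n)           # protected attribute, NOT given to the model
proxy = group + rng.normal(0, 0.3, n)   # an observed feature correlated with group
other = rng.normal(0, 1, n)             # an unrelated, legitimate feature
# Historical outcomes that are themselves biased against group 1.
outcome = (other + 0.8 - 1.5 * group + rng.normal(0, 1, n) > 0).astype(int)

# The model never sees `group`, only the proxy and the legitimate feature.
features = np.column_stack([proxy, other])
model = LogisticRegression(max_iter=1000).fit(features, outcome)
preds = model.predict(features)

for g in (0, 1):
    print(f"group {g}: favorable prediction rate = {preds[group == g].mean():.2%}")
```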

A study on emotion recognition in images attempts to correct for some bias in data imbalances by “(1) identifying the set(s) of minority classes, (2) developing specialized learners that address the minority class via special focus on the class, and (3) developing a specialized learner that combines signals from both the minority and majority class models,” focusing on the learning algorithms used. The researchers combine a generalized learning algorithm with specialized learners for the particular minority classes they identify in a conscious effort to combat bias.
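
A loose sketch of the general idea (not the study’s actual method), on synthetic imbalanced data: train one general classifier, train a second classifier with extra weight on the minority class, and average their predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data where class 1 is a small minority.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

general = LogisticRegression(max_iter=1000).fit(X, y)
specialized = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 10}).fit(X, y)

# Combine signals from the general and specialized learners by averaging
# their predicted probabilities for the minority class.
combined_prob = (general.predict_proba(X)[:, 1] + specialized.predict_proba(X)[:, 1]) / 2
combined_pred = (combined_prob > 0.5).astype(int)

minority = y == 1
print(f"general recall on minority class:  {general.predict(X)[minority].mean():.2%}")
print(f"combined recall on minority class: {combined_pred[minority].mean():.2%}")
```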

General considerations and limitations of models apply to learning algorithms as well. We consider the issue of robustness, which is a question of perturbation and stability. A robust model will give similar results for similar inputs, and a robust learning algorithm will output similar models when fed similar input data. Models and learning algorithms that are not robust in this manner may be more susceptible to bias.

An input image and visualization of features detected at locations throughout the image. From “The Building Blocks of Interpretability” (Source)

The issues of interpretability and explainability are relevant to learning algorithms, as they are for models; they currently enjoy a huge focus in the ML research community. The European Union’s GDPR is expected to become a strong policy incentive for this work, since it effectively creates a “right to explanation” when algorithmic decisions are used. Distill, an online publication venue that aims to make advances in ML more transparent and accessible, is further evidence that the field is pivoting in this direction. In one article, researchers break down learned features of a neural network along the hidden layers for a variety of input images in an interactive environment. We show above one of their examples of an input image alongside a visualization of features detected at a layer early in the network, with individual component images scaled by intensity; this representation makes it clear that edges in the images are strongly detected. Distill focuses on interactive interfaces that make the learning process easier to follow and interpret. With increased interpretability and explainability in learning algorithms and models, bias can become clearer at the time of training — since we have a perspective on the implicit features being learned — instead of after model deployment.

Section 5: System Bias

Context and evaluation are important in identifying bias. While “system” is an overloaded term, we consider two important types of bias which may be observed when ML systems are deployed at a large scale in the real world. First, there are the malicious cases in which an existing model — whose inner workings may or may not be opaque to users — is manipulated by an actor who has spent a long time studying its inputs and outputs in order to produce their desired output. Second, there is the emergent case in which we observe network effects arising from simple human behaviors — an obvious example is clickbait.

Targeted adversarial attacks on large-scale ML systems are a common form of malicious behavior. A team of researchers writes that many such general vulnerabilities come from “unintended and harmful behavior that may emerge from machine learning systems when we specify the wrong objective function, are not careful about the learning process, or commit other machine learning-related implementation errors” — that is, from accidents that increase system susceptibility to adversaries. However, even state-of-the-art open-source models (which presumably don’t suffer from simple implementation errors or thoughtlessness) show a remarkably high susceptibility to targeted adversarial attacks. In one case, an MIT study demonstrated that Google’s InceptionV3 classifier was fooled into thinking that a 3D printed turtle, covered with a carefully engineered adversarial pattern, was actually a rifle.
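
As a sketch of how simple such attacks can be in their most basic form, the fast gradient sign method (FGSM) perturbs an input in the direction that increases the classifier’s loss. This is a simple untargeted variant, not the 3D-turtle attack itself; the image and label tensors are assumed to be supplied separately.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# A pretrained ImageNet classifier used only for illustration.
model = models.resnet18(weights="IMAGENET1K_V1").eval()

def fgsm(image, label, epsilon=0.01):
    """Return image + epsilon * sign(gradient of the loss w.r.t. the image)."""
    image = image.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # A small step in the direction that increases the loss the fastest.
    return (image + epsilon * image.grad.sign()).detach()

# `image` is assumed to be a normalized (1, 3, 224, 224) tensor and `label`
# a (1,) tensor with the true class index:
# adversarial = fgsm(image, label)
# print(model(adversarial).argmax())  # often differs from the original prediction
```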

On social platforms, human users click on and share clickbait more often, and content ranking algorithms usually favor content that is more clicked and shared. A recent study on the spread of fake news on Twitter found that false rumors spread faster than true news; the top 1% of false rumors spread to between 1,000 and 100,000 people, while true stories rarely reached more than 1,000 people. Platforms such as Facebook have responded to the fake news problem by modifying their ranking algorithms to prioritize content posted by friends and family over content originally shared by news platforms on social media. The finding that high-engagement, often highly polarizing content takes priority in social networks may have as much to do with human behavior as with the ranking algorithms of the social media platforms themselves.

Other systems can surface bias in the results of algorithms that find patterns in large amounts of data. Collaborative filtering is a technique commonly used in recommender systems, such as suggested purchases on Amazon based on your purchase history and what “people like you” bought. Even the simplest recommender systems could surface problems of bias, especially since recommendations can change behavior, yet evaluating a recommendation algorithm on a historical dataset would not reflect these changes. Once we acknowledge that bias may be an issue in recommender systems, the immediate question of how to fairly correct for it arises. It is unclear what legal or regulatory implications there might be for a biased recommendation system, since suggesting products to purchase is not an obvious harm — contrast recommendations with, for example, Harvard professor Latanya Sweeney’s findings that online advertisement delivery could show emergent racism when fed black-sounding names.
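
A minimal item-based collaborative filtering sketch on an invented ratings matrix: recommend the unrated item whose rating pattern is most similar (by cosine similarity) to the items a user has already rated.

```python
import numpy as np

# A tiny, invented user-by-item ratings matrix (0 = unrated).
ratings = np.array([
    # item0 item1 item2 item3
    [5, 4, 0, 1],   # user 0
    [4, 5, 0, 0],   # user 1
    [0, 1, 5, 4],   # user 2
    [0, 0, 4, 5],   # user 3
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Item-item similarity based on how users rated each pair of items.
n_items = ratings.shape[1]
item_sim = np.array([[cosine(ratings[:, i], ratings[:, j]) for j in range(n_items)]
                     for i in range(n_items)])

user = 0
scores = ratings[user] @ item_sim     # predicted affinity for each item
scores[ratings[user] > 0] = -np.inf   # don't re-recommend items already rated
print("recommend item", int(np.argmax(scores)))
```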

While the phrase “system bias” is overloaded, we should recognize two prevalent cases: the adversarial susceptibility of large systems and emergent behavioral trends, both of which can lead to biased outcomes.

Section 6: Human-In-The-Loop Bias

A simple case of human-in-the-loop bias can occur when a human, such as a judge, a doctor, or a human safety driver in a self-driving vehicle, acts in conjunction with an automated system. While it is clear that people may hold their own unconscious biases or overt prejudices, it is far more complicated to consider the interaction between human actors and automated systems acting in unison.

A common human-in-the-loop system is the safety driver that companies such as Waymo and Uber use for their autonomous vehicle (AV) testing. In the case of the first pedestrian fatality involving an AV, a self-driving Uber vehicle hit a pedestrian crossing a section of road at night while the safety driver was distracted and looking away from the road. According to some experts, the safety driver would have been able to intervene and potentially save the pedestrian had they not been looking away from the road seconds before. The safety driver’s own confidence that the AV would behave as expected was a form of bias that subverted the safety driver’s purpose in the human-in-the-loop system, leading to a fatal outcome. This example also suggests that in human-in-the-loop systems, it is difficult to ensure that the human actor behaves as intended. Human actors are likely to adapt their behavior over time after interacting with a system.

Clinical contexts for human-in-the-loop systems are growing, though many say AI is not likely to replace doctors in the near future. Clinicians have been found to hold cognitive biases that affect how they provide care. Implicit biases have been found along racial lines in clinical settings, with white people favored, but the exact clinical implications of this finding have not been characterized. While decision support tools (DSTs) are being adopted in order to mitigate human error and bias, they don’t entirely prevent bias in clinical outcomes, especially when clinicians defer to the tool’s judgment in close cases. This deference to DSTs is termed automation bias, occurring in situations in which “users tend to accept computer output without sufficient thought.”

We have already considered criminal justice as an important area for detecting and preventing automated bias. While risk scores “seem scientific, an injection of computational rationality into a criminal justice system riddled with discrimination and inefficiency,” various analyses have found them to be flawed or biased against particular populations. However, human judges are susceptible to various forms of bias as well. Given the imperfections of current risk scoring systems, it’s easy to imagine that human bias coupled with bias in a risk score, through confirmation bias, could lead to even more biased outcomes — especially when trends in enforcement, which lead to biased datasets, are reflected in judicial biases.

We may also consider the interactions of ML systems with entire communities and societies when considering human-in-the-loop bias. For example, when Google Photos in 2015 began tagging black people as “gorillas” in photos, the public outcry on social media was an indication of a cultural bias — in this case, a strong reaction to an offensive historical context. We can be almost certain that Google’s engineers did not use training data in which black people were labeled as gorillas or develop a model of racist historical contexts to inject into their systems. Due to the lack of interpretability and explainability in how image classification models learn, there was no easy fix for Google, so it simply removed “gorillas” (as well as “chimp” and “monkey”) as potential tags. This mislabeling anecdote is one specific instance that raised an outcry among a much wider variety of one-off misclassification errors. If we consider users of Google Photos to be human-in-the-loop moderators, most errors don’t elicit strong responses, and so most errors are treated as tolerable and not worth fixing. At the time of this writing, Google Photos remains “blind” to gorillas — the original problem was never properly solved, and the patch fix remains.

MIT Media Lab professor Iyad Rahwan’s visualization of a human-in-the-loop system. (Source)

Human-in-the-loop bias is an important consideration as AI systems continue to pervade everyday life. A key takeaway is that, in human-in-the-loop systems, the weak point in terms of predicting overall system behavior is often the human. Human-in-the-loop systems (which can extend to society-in-the-loop systems) make us consider the tradeoff between removing potential human biases and leaving decisions to automated systems that may lack explanatory capabilities. A standing assumption is that a good tradeoff between the two will enable humans to make better decisions in fields such as criminal justice and medicine.

Consequences and Future Directions

Kate Crawford defines two types of “harms” in her discussions on AI bias. The first is “allocative harm,” which occurs when a system is biased against particular populations; racial bias is one example. The second is “representational harm,” occurring when biased systems model the world in a manner that preferences certain viewpoints; an example is searching the Internet for images of “CEO” and seeing only images of men as a result. While both these examples likely stem from data bias and are detectable at multiple pipeline stages, allocative harm has historically been the focus of the research community and regulatory groups. We have discussed both types of harms in previous sections. In many anecdotes, allocative and representational harms could have been prevented either by developers or by regulators, at multiple stages along the development pipeline.

There are a number of initiatives to address data bias, including the previously mentioned Dataset Nutrition Label project, Joy Buolamwini’s advocacy work, and a growing awareness within industry and academia that data, though powerful, should not be blindly trusted. Research on de-biasing word embeddings and on combining demographic-specific learning algorithms are key examples of how the values of fairness and representation can be instilled within research communities to mitigate both allocative and representational harms, while providing usable tools and standards to do so.

Model and learning algorithm bias become easier to detect as we are able to better inspect and explain the learning process. Work on explainability and interpretability in the ML community within both industry and academia (often with collaborations between the two) is driven by both the promise of increasing the accuracy of models and the incentive to explain system behavior in applied settings. Explaining system effects and human-in-the-loop behaviors can help build more robust systems and decrease bias in the output. The evolution of Facebook’s content ranking algorithm is one example of addressing bias in a real system interacting with user populations. The idea of society-in-the-loop systems is also compelling, but for the most part, we’re still working towards the infrastructure and standards that would make that possible.

Ultimately, the goal of researchers, developers, and regulators should be to detect and combat bias along multiple stages of the development pipeline. While testing the output of a system or product for “algorithmic accountability” is a good start, this ideal implies a need for industry to test for bias at various points in the development process instead of at the output alone. It implies questioning the canonical blind faith in data, particularly in large datasets. It will be important to be vigilant about bias from all angles, with a special focus on building inclusive and representative datasets, thorough and appropriate model evaluation, an informative legal and regulatory framework that incentivizes fairness, and a comprehensive perspective on the real-world contexts in which AI systems are applied.

Acknowledgements

Thanks to Amy Zhang and Natalie Saltiel for their feedback, and to Kasia Chmielinski for providing a summary of survey data.

Glossary

Activation Function — The function associated with a neural network node that specifies the node output given the input(s). Generally used to introduce non-linearity into models.

Artificial Intelligence (AI) — Intelligence demonstrated by automated systems. A very broad field; machine learning (ML) is one part of it.

Backpropagation — The calculus-based technique used in gradient descent algorithms to optimize the parameters of neural network models.

Batch Normalization — An empirical technique that, in practice, improves the performance and stability of neural networks by normalizing the inputs at each layer of the network.

Black-Box — In general, systems for which we know or understand only the inputs and outputs. In ML, black-box models are models for which we lack explainability. Deep neural networks are the typical example.

Classification — The ML problem of identifying what category a particular input belongs to by learning function mappings. Binary classification involves distinguishing between two classes, while multiclass classification involves distinguishing between a set of classes.

Collaborative Filtering — A technique used by recommender systems such as online vendors (Amazon, Netflix). Can be item-based or user-based.

Convolutional Neural Network (CNN) — A subset of neural networks that use convolution (or cross-correlation) operations. They are commonly used in image classification because they preserve the 2D structure of images.

Data — The data used in ML problems may be labeled, where each data point consists of an input and the known corresponding output, or unlabeled, where each data point is just the input because there is no specific corresponding output or it is not known.

Deep Learning — A currently very active subfield of machine learning that includes models such as CNNs, RNNs, and other deep neural network (DNN) architectures.

Deep Neural Network (DNN) — Neural networks with multiple hidden layers, or layers between the input and output layers.

Estimation Error — Occurs when the parameters of a hypothesis class cannot be estimated well from the training data.

Features — Specific attributes or measurable properties of input data that are used to train models via learning algorithms. The process of choosing features must be done carefully in order to achieve high accuracy.

Feedforward — A neural network architecture in which there are no cycles; information flows from the input layer to the output layer. In a fully-connected feedforward network such as the Multilayer Perceptron, every node in a layer is connected to every node in the previous and next layers. Contrast with convolutional architectures.

Gradient Descent — A calculus-based approach to optimizing model parameters by finding error minima. Many learning algorithms are based on gradient descent.

Hidden Layers — Layers between input and output layers in a neural network. Neural networks with multiple hidden layers are by definition DNNs.

Hyperparameters — Hyperparameters of a model are not learned, but are set beforehand and do not change during training. They are often shared across similar models.

Hypothesis Class — The set of models (learned functions) that can be realized by a learning algorithm, which tunes the model parameters.

Labels — Correct outputs or output classes corresponding to particular inputs. Labeled data is used in supervised learning.

Learning Algorithm — The algorithm that can be applied to data to produce a model belonging to the hypothesis class(es) associated with that learning algorithm.

Learning Rate — The hyperparameter controlling the speed of the updates in learning algorithms such as gradient descent.

Machine Learning (ML) — A subfield of artificial intelligence (AI) that is premised on learning patterns from data. Closely tied to data science and statistics. (Note: can be synonymous with deep learning in modern contexts).

Model — A particular instance of a hypothesis class, learned from data by a learning algorithm and used to predict outputs for given inputs.

Neural Network — A generally complex model represented by an architecture of nodes or neurons, usually with multiple layers. The learned parameters of such a model are weights and biases. While neural networks generally have input and output layers, deep neural networks also contain hidden layers.

Node — A unit in a neural network. Also called a neuron. Each node has an associated activation function.

Objective Function — The objective of the optimization problem in any ML formulation. It is usually an error to minimize, subject to some regularization constraints.

Overfitting — The phenomenon in which a particular dataset is fit too closely by a model, resulting in poor generalizability for that model. Mitigated with regularization.

Parameters — The parameters of a model are learned during training as they are modified by the learning algorithm. They are usually initialized to something strategic that will allow training to proceed smoothly.

Perceptron — A simple, linear classifier and associated learning algorithm. Analogous to a neural network with linear activation functions and just one node or neuron.

Recurrent Neural Network (RNN) — A type of neural network commonly used to model and generate sequential data.

Regularization — The usual method to address overfitting by introducing additional optimization constraints. Mathematically, a common form of regularization is to penalize the norm of the parameter (weight) vector.

Rectified Linear Unit (ReLU) — A simple, common activation function (including its corresponding node) that outputs zero for negative inputs and outputs positive inputs without modification.

Robustness — Informally, the stability of the learning algorithm or model output when the input is perturbed. Robustness is important for consistency and reliability.

Semi-supervised Learning — A form of supervised ML in which unlabeled examples are also leveraged. Contrast to supervised learning and unsupervised learning.

Sigmoid — A common activation function. Useful because it is differentiable (one can take the derivative) at every point, allowing for optimization through backpropagation.

Softmax — A function used to represent categorical distributions in classification, usually applied at the output layer of neural network classifiers. Also called the normalized exponential function.

Sparsity — The statistical concept of data that is mostly empty or zero. Intuitively, the opposite of dense. Useful because it can be used to reduce the complexity of very high-dimensional data.

State-Of-The-Art — In general, the cutting edge technology. In ML, state-of-the-art usually refers to the models and learning algorithms which achieve the highest accuracy.

Step Size — Same as learning rate.

Stochastic Gradient Descent (SGD) — The form of gradient descent commonly used in practice, which leverages randomization for faster learning.

Structural Error — Occurs when the hypothesis class corresponding to a particular model cannot produce a specific hypothesis that performs well on test data, regardless of how the parameters are tuned.

Supervised Learning — The form of ML in which known input-output pairs are used to learn the model, or function mapping between input-output pairs, for a particular problem formulation. Contrast to unsupervised learning.

Training Data — The data fed to a learning algorithm in order to determine a model by tuning parameters. May be labeled or unlabeled data. Contrast to test data.

Training Error — The error rate, i.e. the proportion of the training data that the trained model predicts incorrectly. Contrast to test error.

Test Data— The data input to a trained model that the model has not yet seen. Used to evaluate the model’s performance and generalizability to new data. Contrast to training data.

Test Error — The error rate, i.e. the proportion of the test data that the trained model predicts incorrectly. Contrast to training error.

Unsupervised Learning— The form of ML in which a function is inferred to describe hidden structure from unlabeled data, or data inputs for which the corresponding output is not known upfront. Contrast to supervised learning.

Validation Data — Data input to a model that is neither the training nor the test data. This dataset is often used to manually tweak hyperparameters. Contrast to training data and test data.

Validation Error — The error rate, i.e. the proportion of the validation data that the trained model predicts incorrectly. Contrast to training error and test error.

Word Embeddings — A category of feature-learning techniques in the ML subfield of natural language processing (NLP). Used to map text into high-dimensional vectors.
