Our Approach to Machine Learning Bias, Part 2

Published in Sentropy Technologies · 13 min read · Jul 1, 2020

This is the second blog post of a two-part series about Sentropy’s approach to addressing bias in machine learning classification tasks. You can read Part 1 here.

Disclaimer: this post references obscene language pertaining to hate speech.

By Cindy Wang

Introduction

Machine learning systems are only as good as the human assumptions that go into creating them. Biases in the data collection or modeling processes can manifest themselves downstream as incorrect predictions, which for critical applications like abusive language detection can have serious real-life consequences.

One of our biggest challenges is figuring out how to combat that bias. As machine learning practitioners, we ask ourselves the question, “How can we make sure that our model behaves fairly across all groups of users?” The answer lies not in removing humans from the loop, but rather in using human intuition to explicitly teach models how to make less biased predictions.

Our approach has two main components. The first is bias detection, which involves measuring the degree to which our models are biased and identifying which specific groups that bias impacts. To this end, we implemented a set of bias metrics that measure a classifier’s performance across different subpopulations of the test data, and we used them to evaluate our models. We found that two of our classifiers did indeed exhibit significant unintended bias that we hoped to minimize.

This leads us to the second component, bias mitigation, which includes making changes to our data collection and modeling processes to actively reduce bias. To combat the unintended bias we found in our models, we took the following steps:

  • Data augmentation: We added more data to our training and test sets to improve coverage for terms that disproportionately appear in abusive examples but are not actually abusive terms (e.g., “gay” or “Muslim”).
  • Modified objective function: We added a term to the objective function that specifically targets bias. Similar to how a regularization term penalizes a model for overfitting, our modification penalizes a model if its predictions are biased across different subgroups.
  • Slice-based model architecture: We modified our model architecture to commit extra learning capacity to important subsets, or slices, of the training data.

Below, we’ll walk you through the detection and mitigation elements of our approach in detail.

A measuring stick for bias

Before making any modifications to our training data or models, it was important to establish a reliable evaluation method and metrics that we could track across iterations. Concretely, the type of machine learning bias we focused on was the skewing of model predictions based on identity-related content within the text.

We decided to use a set of metrics called bias AUCs proposed by Google Jigsaw because they capture different types of unintended bias, are straightforward to compute, and are threshold agnostic. For a specific identity, the bias AUCs are given by computing the AUC scores for three different subsets of the test data:

  1. Subgroup AUC: Restrict the test set to examples that mention the identity. A low value means that the model does poorly at distinguishing abusive and non-abusive examples that mention this identity.
  2. BPSN (Background Positive, Subgroup Negative) AUC: Restrict the test set to non-abusive examples that mention the identity and abusive examples that do not. A low value suggests that the model’s scores skew higher than they should for examples mentioning this identity.
  3. BNSP (Background Negative, Subgroup Positive) AUC: Restrict the test set to abusive examples that mention the identity and non-abusive examples that do not. A low value suggests that the model’s scores skew lower than they should for examples mentioning this identity.

This gave us three bias AUCs for each identity subgroup, each sensitive to a different type of identity-related bias. We’ll show how we used various bias mitigation tactics to significantly improve these metrics with respect to over a dozen different identities. Further discussion of the bias AUCs can be found in Jigsaw’s paper and Kaggle challenge.
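To make these definitions concrete, here is a minimal sketch of how the three bias AUCs could be computed for a single identity using scikit-learn, assuming binary abuse labels, real-valued model scores, and a boolean mask marking the examples that mention the identity. The function and variable names are illustrative, not our internal API.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bias_aucs(labels, scores, in_subgroup):
    """Compute subgroup, BPSN, and BNSP AUCs for one identity subgroup."""
    labels = np.asarray(labels)                        # 1 = abusive, 0 = non-abusive
    scores = np.asarray(scores)                        # real-valued classifier scores
    in_subgroup = np.asarray(in_subgroup, dtype=bool)  # True if the example mentions the identity

    # 1. Subgroup: only examples that mention the identity.
    subgroup = in_subgroup
    # 2. BPSN: non-abusive subgroup examples + abusive background examples.
    bpsn = (in_subgroup & (labels == 0)) | (~in_subgroup & (labels == 1))
    # 3. BNSP: abusive subgroup examples + non-abusive background examples.
    bnsp = (in_subgroup & (labels == 1)) | (~in_subgroup & (labels == 0))

    return {
        "subgroup_auc": roc_auc_score(labels[subgroup], scores[subgroup]),
        "bpsn_auc": roc_auc_score(labels[bpsn], scores[bpsn]),
        "bnsp_auc": roc_auc_score(labels[bnsp], scores[bnsp]),
    }
```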

Labeling identity mentions

In order for the bias AUCs to be an effective set of metrics, our test set needed to have reliable abusive language labels and coverage across a range of subgroups. Furthermore, we needed to label every example with the subgroup(s) mentioned within it. Using the identities provided in Jigsaw’s Kaggle dataset as a starting point, we labeled the following identity attributes (this list is growing as we continue to make improvements to our own classification models):

ability (people with disabilities)
asian
atheist
black
christian
conservative
jewish
latino
lgbtq_plus
liberal
muslim
political_group_other
religion_other
south_asian
white
women

With over a dozen subgroups, we faced the challenge of adding subgroup labels in a scalable yet accurate way. Our solution was to combine an automated keyword approach with human review. For each of the above identities, we manually curated a lexicon of terms that could be used to refer to the identity in question. To improve coverage of possible mentions, we then used unsupervised learning (based on the method described in this paper) to induce additional terms that are semantically similar to the existing terms.

As opposed to pure human curation, this method can discover variations of existing terms as well as newly emerging or domain-specific usages. The resulting expanded lexicons contained up to hundreds of terms, and allowed us to assign subgroup labels to every example using a simple keyword match.
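To illustrate the labeling step, here is a minimal sketch of the keyword match, assuming each identity maps to its expanded lexicon. The abridged lexicons and function name below are hypothetical placeholders, not our production lexicons.

```python
import re

# Hypothetical, heavily abridged lexicons; the real ones contain up to
# hundreds of curated and automatically induced terms per identity.
LEXICONS = {
    "muslim": {"muslim", "muslims", "islam", "islamic"},
    "women": {"woman", "women", "girl", "girls"},
}

def subgroup_labels(text):
    """Return the set of identity subgroups mentioned in the text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return {identity for identity, terms in LEXICONS.items() if tokens & terms}

# subgroup_labels("Muslim women deserve respect") -> {"muslim", "women"}
```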

Our bias mitigation strategies

Using the bias metrics described above, we were able to identify that two of our classifiers (identity attack and white supremacist extremism) exhibited unintended bias. For better context, here is how we define each one:

Identity Attack: Statements containing verbal attacks, threats, or hatred that are directed at people based on a shared identity such as gender, race, nationality, sexual orientation, etc.

White Supremacist Extremism: Content seeking to revive and implement the ideology of white supremacists. These ideologies can be generalized into three categories which often overlap with each other: Neo-Nazism, White Racial Supremacy, and White Cultural Supremacy.

With this in mind, we’ll walk through the steps of our approach to bias mitigation and how we applied them to our identity attack and white supremacist extremism classifiers.

1. Augmenting the training and test data

As we mentioned in our previous blog post, biases in data are often the source of bias in a machine learning model’s downstream predictions. We observed that the initial distribution of our data was affecting our models in two ways:

  1. Examples mentioning certain identities were disproportionately likely to be abusive. For instance, in the training set for our identity attack classifier, only about 11% of examples were abusive overall, but 36% of examples mentioning the Asian identity were abusive. In practice, this meant that the model learned to incorrectly associate terms referencing the Asian identity with the identity attack label.
  2. Our existing test set was, at first, not comprehensive enough to compute meaningful bias metrics for all subgroups. We had collected our original test set using random sampling, combined with heuristics to boost the number of abusive examples (e.g., gathering additional samples from communities with high rates of abusive content) — this method is sometimes referred to as boosted random sampling. Because of the random nature of our data collection process, the resulting test set contained few, if any, examples mentioning some of the rarer subgroups. This is a common problem and one with serious consequences. As leading algorithmic fairness researchers have advocated, model evaluation should cover diverse use cases, including those that are more challenging or less frequently occurring.

We addressed both problems by augmenting our dataset with additional labeled data. Specifically, we used the lexicons we curated for each identity to do targeted sampling within subgroups, then manually labeled these new examples and split them across our training and test sets. This process allowed the model to see a more diverse set of abusive and non-abusive examples mentioning the identity terms on which it was previously biased, helping it avoid making the kinds of over-generalizations that result in biased predictions.
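As a rough sketch, targeted sampling with the lexicons might look something like the following, reusing the hypothetical subgroup_labels matcher from earlier; the sample size and function name are illustrative.

```python
import random

def sample_for_subgroup(corpus, identity, n=500, seed=0):
    """Sample up to n unlabeled texts that mention the given identity."""
    candidates = [text for text in corpus if identity in subgroup_labels(text)]
    random.Random(seed).shuffle(candidates)
    # These candidates are then sent to human annotators for abuse labeling
    # and split across the training and test sets.
    return candidates[:n]
```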

The following tables illustrate the likelihood of identity attack abuse in examples mentioning a given identity. We can see that after data augmentation, the percentage of abusive examples within individual subgroups more closely matched that of the overall dataset (where subgroup = all).

Percentage of abusive examples by identity subgroup in the training (left) and test (right) datasets. Cell colors denote percentage relative to the percentage of abuse in the overall dataset (blue is more abusive, orange less abusive).

In terms of evaluation, this method also improved coverage of several previously unrepresented identities in the test set. As shown in the graph below, we added several hundred examples mentioning each identity subgroup, resulting in a robust test set for meaningfully measuring bias across all 16 identities.

Counts of test set examples by subgroup.

Importantly, our sampling approach for data augmentation — which involves filtering by specific identity terms — did result in a data distribution that is unrepresentative of reality. Our goal here was not to represent the source data, which reflects existing social biases (e.g., internet users mention the term “gay” much more frequently in abusive contexts than not). Instead, our goals were to (i) actively train our system to avoid such biases and (ii) comprehensively evaluate our model on categories of inputs that might otherwise go ignored.

2. Modifying the objective function

The second bias mitigation strategy we employed was a bias-aware loss function. Unlike data augmentation, which is an implicit way of reducing bias via input data, this method explicitly trains the model to make less biased predictions.

Most often when we train neural networks, we are solving an optimization problem and want to minimize a loss function. For a typical classification task, the loss represents the inaccuracy of the classifier’s predictions. To give a concrete example, our abusive language classifier has high loss if it predicts a high score for a piece of text that is not actually abusive. During training, the loss at each step informs how we update the model via backpropagation.

However, the loss function is not limited to a single objective. For instance, a loss function that contains a regularization term has the effect of penalizing model complexity in addition to penalizing inaccurate predictions. Since our goal is to minimize bias, we added bias-aware terms to the loss function in order to penalize the model for making biased predictions. The resulting bias-aware loss is defined in the figure below:

An illustration of the bias-aware loss terms. 𝑓 represents an arbitrary loss function and is the same for all four terms. The bias-aware terms are calculated separately for each subgroup, then combined using a generalized mean.

Bias-aware loss incorporates three additional terms on top of the standard loss: subgroup loss, BPSN loss, and BNSP loss. It uses the same basic loss function (cross-entropy loss in our case) for each term, but calculates it over three different subsets of the data, analogous to the three bias AUCs. Making this modification to our training process allowed us to explicitly tell the classifier to pay more attention to errors related to any of the three types of bias captured by the bias-aware loss terms.
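As a rough PyTorch sketch (not our exact implementation), a bias-aware loss of this shape could look like the following, using per-example binary cross-entropy as the base loss 𝑓 and a generalized power mean to combine the per-subgroup terms. The bias weight and the exponent p are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def power_mean(values, p=2.0):
    """Generalized mean; larger p weights the worst (highest-loss) subgroups more."""
    v = torch.stack(values)
    return v.pow(p).mean().pow(1.0 / p)

def bias_aware_loss(logits, labels, subgroup_masks, lambda_bias=0.25):
    """
    logits:         (N,) raw model outputs
    labels:         (N,) binary abuse labels
    subgroup_masks: list of (N,) boolean masks, one per identity subgroup
    """
    per_example = F.binary_cross_entropy_with_logits(
        logits, labels.float(), reduction="none")
    overall = per_example.mean()

    pos, neg = labels.bool(), ~labels.bool()
    subgroup_terms, bpsn_terms, bnsp_terms = [], [], []
    for mask in subgroup_masks:
        bpsn = (mask & neg) | (~mask & pos)  # background positive, subgroup negative
        bnsp = (mask & pos) | (~mask & neg)  # background negative, subgroup positive
        for subset, terms in ((mask, subgroup_terms),
                              (bpsn, bpsn_terms),
                              (bnsp, bnsp_terms)):
            if subset.any():
                terms.append(per_example[subset].mean())

    bias_terms = [power_mean(t) for t in (subgroup_terms, bpsn_terms, bnsp_terms) if t]
    if not bias_terms:
        return overall
    return overall + lambda_bias * sum(bias_terms)
```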

3. Using intuition to define critical data slices

The third and final method in our approach involves using slice-based learning to improve performance on critical subsets of data. Slice-based learning is a paradigm based on two main ideas:

  1. Some predictions are more important than others. In our domain, this might apply to imminent threats of physical violence, or examples mentioning the names of frequently targeted identities. Since machine learning systems often optimize for global metrics, performance on these critical data subsets (or slices) may be poor even if overall metrics appear satisfactory, especially if the slices comprise only a small percentage of the overall data.
  2. We can use human-defined heuristic functions to specify important data slices to a machine learning model.

We applied this paradigm to address common error categories, which we identified using standard error analysis and domain knowledge. Such categories might include short examples, or examples from certain user communities. In terms of bias mitigation, we were particularly interested in addressing errors related to identity mentions (e.g., examples that refer to women). We then encoded this human intuition directly into our models by specifying important data slices using heuristic slicing functions. Simplified examples of slicing functions we defined are shown below.
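The sketch below shows what such slicing functions might look like using the open-source Snorkel library's slicing API; the specific heuristics and field names are hypothetical stand-ins rather than our exact functions.

```python
from snorkel.slicing import slicing_function

@slicing_function()
def short_text(x):
    # Very short comments are a common error-prone slice.
    return len(x.text.split()) < 5

@slicing_function()
def mentions_women(x):
    # Examples that mention the "women" identity subgroup (toy lexicon).
    return bool({"woman", "women", "girl", "girls"} & set(x.text.lower().split()))
```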

Implementing the slice-based framework involved replacing the final classification layer in our existing model architecture with a slice-based classification module. (An in-depth description of the slice-based learning architecture can be found here.) At a high level, the slice-based components allow us to commit additional model capacity to important or error-prone subsets of the data. In addition to training a standard prediction model, we learn separate representations for each slice, which are then combined using an attention mechanism into a final, slice-aware prediction.
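As a conceptual sketch only (not our production module), a slice-aware classification head along these lines might look like the following in PyTorch, with per-slice "expert" representations combined by attention over per-slice membership indicators; all dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class SliceAwareHead(nn.Module):
    def __init__(self, hidden_dim, num_slices):
        super().__init__()
        # One "expert" representation and one membership indicator per slice.
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_slices)])
        self.indicators = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_slices)])
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, h):
        # h: (batch, hidden_dim) representation from the shared encoder.
        expert_reps = torch.stack([e(h) for e in self.experts], dim=1)       # (B, S, H)
        indicator_logits = torch.cat([i(h) for i in self.indicators], dim=1)  # (B, S)
        attn = torch.softmax(indicator_logits, dim=1).unsqueeze(-1)           # (B, S, 1)
        slice_aware = (attn * expert_reps).sum(dim=1)                         # (B, H)
        return self.classifier(slice_aware)                                   # (B, 1) logit
```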

Putting it all together

After applying these three bias mitigation techniques (data augmentation, bias-aware loss, and slice-based learning) to our identity attack and white supremacist extremism classifiers, we observed significant improvement in bias AUC scores. The following table shows the techniques that we applied to each classifier.

For the identity attack model, bias-aware loss improved neither the bias metrics nor standard classification metrics, so we excluded it to avoid adding unnecessary complexity to the model. For white supremacist extremism, the relatively narrow class definition means that some subgroups, especially the ones not already represented in the data, never appear in any abusive examples. Therefore, our data augmentation method, which is targeted towards adding more non-abusive examples to reduce false positives, is less relevant here.

The tables below show the bias AUC scores for both classifiers. Across the board, the scores either stayed the same, in areas where the models were less biased to begin with, or went up, by as much as 4% for identity attack and as much as 22% for white supremacist extremism.

Bias AUCs evaluated on different identity subgroups. Only subgroups with greater than n examples in the test set are shown: n=500 for identity attack and n=200 for white supremacist extremism. Cell colors denote the difference between the original and mitigated scores (darker colors denote greater improvement).

For identity attack, we improved BPSN AUC for all subgroups by 1–4%, meaning we were able to reduce false positives mentioning these identities. For white supremacist extremism, we saw dramatic improvements of 15–25% in BPSN AUC for the black, Jewish, and Muslim subgroups, which are frequently mentioned in white supremacist hate speech, as well as large improvements of 5% or more for most other subgroups. Significantly, we also saw an improvement of 8% in BPSN AUC for the white subgroup; even though this identity is not a target of white supremacist extremism, it is frequently mentioned in white supremacist language. On the false negative side, we also improved BNSP AUC for the Asian subgroup by over 8%, meaning we were able to reduce false negatives mentioning this identity. Overall, the scores show that we were able to significantly mitigate bias in the areas with the worst errors without harming existing performance.

If we plot the distribution of real-valued classifier scores, we can visualize the model’s ability to distinguish between abusive and non-abusive examples. The figure below shows score distributions of the white supremacist extremism classifier for the overall dataset and for the Jewish subgroup. We can see that before bias mitigation, the model tended to incorrectly predict high scores for non-abusive examples that mention the Jewish identity. After bias mitigation, the scores of the abusive and non-abusive examples have much less overlap.

Score distributions for the white supremacist extremism classifier, before and after applying bias mitigation methods.
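For reference, a comparison like this can be reproduced with a few lines of matplotlib, assuming the classifier scores have already been split by gold label; the names below are illustrative.

```python
import matplotlib.pyplot as plt

def plot_score_distributions(abusive_scores, non_abusive_scores, title):
    """Overlay the score histograms of abusive vs. non-abusive examples."""
    plt.hist(non_abusive_scores, bins=50, alpha=0.6, density=True, label="non-abusive")
    plt.hist(abusive_scores, bins=50, alpha=0.6, density=True, label="abusive")
    plt.xlabel("classifier score")
    plt.ylabel("density")
    plt.title(title)
    plt.legend()
    plt.show()
```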

Finally, we need to ask whether each of the bias mitigation methods we applied is necessary for the final performance of our models. To evaluate this, we trained versions of our mitigated models with each of the bias mitigation elements removed. The tables below show the results of our ablation study.

Bias AUCs evaluated on ablated versions of the bias-mitigated classifiers. The Δ columns show the change in AUC after removing the corresponding model elements. Cell colors denote the impact of removing the element (magenta means AUC went down, green means AUC went up).

For identity attack, removing either of the two bias mitigation methods we applied resulted in decreased bias AUC scores across the board. In the white supremacist extremism model, we saw more tradeoffs between BNSP and BPSN AUC, especially when ablating slice-based learning. For this classifier, we observed that our bias mitigation techniques had both positive and negative effects on the metrics, but were most effective at addressing the type of bias captured by BPSN AUC. This metric is where the original scores were lowest, and represents the type of false positive bias that can hurt frequently-targeted subgroups the most. To give a concrete example of the harm posed by false positive errors, consider the following sentences:

  • fuck jews and heil hitler (abusive — white supremacist extremism)
  • i got attacked as a “kike jew” all day yesterday. (non-abusive)

Both sentences mention the Jewish identity, but the first speaker is expressing abusive intent while the second is speaking self-referentially. If the two comments are moderated with the same action (e.g. removal), the second speaker may be unfairly silenced on top of already being victimized.

Finally, it’s important to note that though this type of ablation is useful for observing how specific methods affect different areas of classifier performance, the effects of various methods do not stack linearly. When combined, we saw that our approach reduced bias across all the subgroups we observed.

What’s next?

By making changes to our data collection, modeling, and monitoring processes, we were able to measure bias and successfully mitigate it in our models. It’s important to point out that there’s no catch-all algorithmic solution for bias, and that careful human intervention is critical for identifying the worst errors and ensuring model fairness for the most vulnerable groups of users.

The approach we describe here specifically pertains to bias stemming from identity mentions. Though this is one of the most visible types of bias that can appear in an abusive language detection model, we are aware that there are many more sources of possible bias, including, but not limited to, authorship bias, community bias, and dialect bias (especially pernicious in many academic datasets is bias against African American Vernacular English).

We hope sharing the steps we’ve taken so far can offer some insight into our models, as well as facilitate the efforts of other teams working on similar problems in machine learning bias. As we continue working to improve our models, we’re constantly thinking about ways to build more fair and inclusive machine learning systems. Please reach out if you’d like to give us any feedback!
