Unintended bias and names of frequently targeted groups

When we created Conversation AI, the research group behind Perspective, we established five core values at the center of our work. One of these values is inclusivity: diverse points of view make discussions better. We want Perspective to be a tool that helps communities bring more people into conversations.

Shortly after our public launch, some people experimenting with the public demo on our website noticed errors and started providing helpful feedback. One particularly concerning pattern was that sentences referencing regularly targeted identity groups returned higher toxicity scores (e.g. see this table from @jessamyn for examples of some of the worst errors). We wanted to share details on why these types of errors occur, the steps we’ve taken to identify and mitigate them, and some of our efforts to advance the state of scientific research in this area.

Why are these errors happening?

Identity terms for more frequently targeted groups (e.g. words like “black”, “muslim”, “feminist”, “woman”, and “gay”) often receive higher scores because comments about those groups are over-represented in abusive and toxic comments. Unfortunately, the data we used to train Perspective exhibits that same trend: the names of targeted groups appear far more often in abusive comments. For example, in many forums it is unfortunately common to use the word “gay” as an insult, or to attack a commenter for being gay, but it is much rarer for the word to appear in a positive, affirming statement (e.g. “I am a proud gay man”). When training data contains these comments, machine learning models adopt the biases that exist in the underlying distributions, picking up negative connotations as they go. When there is insufficient diversity in the data, the models can over-generalize and make these kinds of errors.
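This skew can be seen in a toy example. The comments and counts below are invented for illustration: when a neutral term appears mostly in toxic comments in the training data, the empirical probability of toxicity given the term is high, and a naive model will learn the term itself as a toxicity signal.

```python
# Toy illustration (invented data) of the distribution skew described above:
# the term "gay" appears far more often in toxic comments than in
# non-toxic ones, so P(toxic | term) looks high even though the term is neutral.
comments = (
    [("being gay is bad", 1)] * 8 +       # toxic uses of the term: common
    [("I am a proud gay man", 0)] * 2 +   # affirming uses: rare
    [("nice weather today", 0)] * 90      # everything else
)

# Empirical P(toxic | comment contains the term).
with_term = [(text, label) for text, label in comments if "gay" in text]
p_toxic_given_term = sum(label for _, label in with_term) / len(with_term)
print(p_toxic_given_term)  # 0.8
```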

Not all false positives are equal

False positives — inappropriately high toxicity scores — related to identity have the potential to directly undermine the goals and values at the core of this project. Before we launched, our team and collaborators expected that biases would likely come from word embeddings or from biases in the human annotators creating the training data. This motivated us to work with frequently targeted groups to collect data sets of examples of the abuse they had experienced, as well as examples of discussions that are particularly important to their communities, which we used to test models before releasing them. None of these tests highlighted the unintended biases that we found after opening the API for public testing using the demo — specifically, that the distribution of the names of the groups themselves was a source of bias. Outside of our demo, most use cases we know of were not impacted by these errors, because scores are used to identify comments above a much higher threshold than the unintended bias affects (e.g. a system might flag scores > 0.95 for human review, and ignore all lower scores). That said, the model’s errors are real. As a team we aim to develop tools that enable new people to enter and grow better discussions online — to do that, we can’t be content learning from only the biased, abusive discussions that are easiest to find.
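The thresholding pattern described above can be sketched in a few lines. The 0.95 cutoff comes from the example in the text; the function name and sample scores are illustrative.

```python
# Sketch of threshold-based moderation: only comments scoring above a high
# cutoff are surfaced for human review, so moderately inflated scores on
# lower-scoring comments never reach moderators.
FLAG_THRESHOLD = 0.95  # example cutoff from the text

def comments_to_review(scored_comments, threshold=FLAG_THRESHOLD):
    """Return only the comments whose toxicity score exceeds the threshold."""
    return [text for text, score in scored_comments if score > threshold]

scored = [
    ("You are an idiot.", 0.97),
    ("I am a proud gay man", 0.60),  # inflated by unintended bias, but below cutoff
    ("Great point, thanks!", 0.05),
]
print(comments_to_review(scored))  # ['You are an idiot.']
```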

What have we been doing to mitigate the issue?

Since making models available via the Perspective API, we’ve released five new versions of our toxicity model, with much of the work focused on mitigating bias. Here are a few highlights of the work that has been a part of these updates:

1) Learning from feedback and errors

Perspective allows developers both to score comments and to submit their own suggested scores. For example, when a user clicks “Seems wrong” on our demo page, the API is sent a new suggested score which can be used to retrain models in the future. People’s entries on our demo page have already been one of the largest sources of debiasing training data, particularly for short entries mentioning identity terms (thank you!). To other teams developing similar technologies: we can’t recommend highly enough having a public demo to support getting this kind of feedback.
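The two interactions described above map onto two request bodies. This sketch builds them as plain dictionaries rather than making live calls; the field names follow the shapes documented for the v1alpha1 Perspective API (`comments:analyze` to score a comment, `comments:suggestscore` to submit feedback), but treat the details here as illustrative rather than authoritative.

```python
import json

def analyze_request(text):
    """Request body asking the API to score a comment for TOXICITY."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }

def suggest_score_request(text, suggested_score):
    """Request body submitting a corrected score, e.g. after 'Seems wrong'."""
    return {
        "comment": {"text": text},
        "attributeScores": {
            "TOXICITY": {"summaryScore": {"value": suggested_score}},
        },
    }

# A user flags an inflated score and suggests the comment is not toxic.
print(json.dumps(suggest_score_request("I am a proud gay man", 0.0), indent=2))
```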

2) Developing new ways to improve models by balancing their training data

We developed new ways to balance the training data so that the model sees enough toxic and non-toxic examples containing identity terms in such a way that it can more effectively learn to distinguish toxic from non-toxic uses. You can learn more about this in our paper published at the AI, Ethics, and Society Conference.

3) New bias evaluation methods and open source code

We have also been creating and publishing new tools to measure unintended bias, based on a metric called Pinned AUC, which can be used to identify and quantify unintended bias against frequently targeted groups in machine learning text classifiers. You can explore this yourself using this interactive Python notebook exercise.
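The core idea of Pinned AUC is to evaluate AUC on a “pinned” set that mixes a subgroup’s examples with an equal-sized sample of the full dataset, so the metric reflects how the model ranks subgroup comments relative to everything else. A minimal sketch, using the Mann-Whitney estimate of AUC and assuming each example is a (score, label) pair:

```python
import random

def auc(labels, scores):
    """Probability that a random positive outscores a random negative
    (Mann-Whitney estimate of AUC; ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pinned_auc(subgroup, full_data, rng=None):
    """Pin a subgroup's (score, label) examples to an equal-sized random
    sample of the full dataset and compute AUC on the combined set."""
    rng = rng or random.Random(0)
    pinned = list(subgroup) + rng.sample(full_data, len(subgroup))
    labels = [y for _, y in pinned]
    scores = [s for s, _ in pinned]
    return auc(labels, scores)
```

A low Pinned AUC for one identity subgroup, relative to others, indicates the model mis-ranks that subgroup’s comments against the general population of comments.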

4) Building open datasets and partnerships for new ones

We are partnering with advocacy groups to find conversations that can help build datasets relevant to underrepresented groups, and supporting community members who are building experiences dedicated to finding and reporting errors.

If you have data that you would like to contribute to this research, you can submit it via our example submission form, directly through the API, or as individual examples at g.co/projectrespect.

In building tools to help online communities, we can’t support the conversations we want if we only learn from the conversations we already have. Thank you for your feedback, your corrections, and your support as we continue to work on mitigating unintended biases in machine learning and improving Perspective’s ability to help people create open and inclusive forums online.

Authors: Lucy Vasserman, John Li, CJ Adams, Lucas Dixon