Increasing Transparency in Perspective’s Machine Learning Models

Transparency is one of the five core values we set out for Perspective and the Conversation AI research initiative when we launched. We’ve honored that value by including a demo of our model on our website, and publishing research papers, datasets, and open source models. Today we’re excited to take another step to increase transparency by sharing our first version of a Model Card for Perspective API’s TOXICITY model.

What are Model Cards?

The concept of a Model Card is being introduced this week at the Fairness, Accountability, and Transparency (FAT*) Conference in Model Cards for Model Reporting by M. Mitchell et al., a collaborative paper between researchers across Google and outside the company, including Jigsaw’s Conversation AI team. Model Cards are short documents that accompany publicly available machine learning models to share information that those impacted by the model should know, such as evaluation results, intended usage, and insight into model training processes. Building on the growing research on algorithmic fairness and unintended bias in machine learning systems, the paper recommends that evaluations be disaggregated across different demographic groups and communities, meaning that performance metrics are computed and published independently for different identity-based subsets of the evaluation data. Publishing per-identity evaluations enables developers to understand how model scores and performance might vary between identity groups, allowing them to make informed decisions about how, where, and whether to apply the model.

Our Model Card

Today we’re publishing an initial version of a Model Card for Perspective API’s TOXICITY model. You can see the model card in our API documentation, but we’ll walk you through some highlights here.

Model Basics

The Model Card begins with the basics: the model architecture, its training data, and the labeling schema. Soon, we’ll also expand this section with more extensive insight into the training datasets via links to full “Datasheets,” as introduced by T. Gebru et al. in Datasheets for Datasets. Next, the card details intended usage and usages to avoid. For example, we note that Perspective API models are recommended for human-assisted moderation and author feedback, but they are not intended for fully automated moderation, nor are they intended to make judgments about specific individuals. This information helps developers use Perspective in the ways it can be most effective, robust to errors, and fair to those impacted by the model.

Evaluation Data

Next, the Model Card shows detailed evaluation results. For our evaluation, we show an overall result, as well as results broken down by specific identities to measure unintended bias.

Overall Toxicity Evaluation Data: For the overall result, we use the held-out test set associated with the training set for the specific model version. Note that this means each new model version is likely to have a different training and test set, so overall results are not directly comparable across model versions.

Unintended Bias Evaluation Data: For the unintended bias evaluation, we use a synthetically generated test set in which we swap a range of identity terms into template sentences, both toxic and non-toxic, and then present results grouped by identity term (see our paper for more on synthetically generated test data). Note that this evaluation looks only at the identity terms present in the text, not at the identities of comment authors or readers.
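To make the template approach concrete, here’s a rough sketch of how such a synthetic test set could be assembled. The identity terms, templates, and function name below are illustrative placeholders, not the actual data or code behind our evaluation.

```python
# Illustrative sketch: build a synthetic bias-evaluation set by swapping
# identity terms into toxic and non-toxic template sentences.
# These templates and terms are placeholders, not the real evaluation data.

IDENTITY_TERMS = ["gay", "straight", "lesbian", "black", "white", "cis"]

TEMPLATES = [
    ("I am a <identity> person, ask me anything.", 0),  # non-toxic
    ("Being <identity> is wonderful.", 0),              # non-toxic
    ("All <identity> people are disgusting.", 1),       # toxic
    ("I hate every <identity> person I know.", 1),      # toxic
]

def build_synthetic_eval_set():
    """Swap each identity term into each template, keeping the template's label."""
    rows = []
    for term in IDENTITY_TERMS:
        for template, label in TEMPLATES:
            rows.append({
                "text": template.replace("<identity>", term),
                "label": label,     # 1 = toxic, 0 = non-toxic
                "identity": term,   # used to group evaluation results by identity
            })
    return rows

examples = build_synthetic_eval_set()
print(len(examples), "synthetic examples;", examples[0])
```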

Metrics

We use the ROC-AUC metric, which for the toxicity model measures the likelihood that toxic examples receive higher scores than non-toxic examples within a given test set. We chose ROC-AUC because it is threshold-agnostic, meaning it does not require selecting a toxicity threshold in order to evaluate, and because it is robust to the large imbalances between toxic and non-toxic data that are common in online forum data.
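To make the metric concrete, here’s a minimal sketch of the overall ROC-AUC computation using scikit-learn; the labels and scores are placeholder values standing in for a real held-out test set and real model output.

```python
# Sketch of the overall ROC-AUC evaluation with placeholder data.
from sklearn.metrics import roc_auc_score

labels = [1, 0, 1, 0, 0, 1, 0, 0]                          # 1 = toxic, 0 = non-toxic
scores = [0.91, 0.12, 0.78, 0.35, 0.08, 0.66, 0.40, 0.22]  # model toxicity scores

# ROC-AUC: the probability that a randomly chosen toxic example receives a
# higher score than a randomly chosen non-toxic one. It needs no toxicity
# threshold and is robust to class imbalance.
overall_auc = roc_auc_score(labels, scores)
print(f"Overall ROC-AUC: {overall_auc:.3f}")
```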

For unintended bias evaluation, we calculate three separate ROC-AUC results for each identity. Each result captures a different type of unintended bias, and each is calculated by restricting the data set to a different subset, as sketched in the code after the footnote below:

Subgroup AUC*: Here, we restrict the data set to only the examples that mention the specific identity subgroup. A low value in this metric means the model does a poor job of distinguishing between toxic and non-toxic comments that mention the identity.

BPSN (Background Positive, Subgroup Negative) AUC*: Here, we restrict the test set to the non-toxic examples that mention the identity and the toxic examples that do not. A low value in this metric means that the model confuses non-toxic examples that mention the identity with toxic examples that do not, likely meaning that the model predicts higher toxicity scores than it should for non-toxic examples mentioning the identity.

BNSP (Background Negative, Subgroup Positive) AUC*: Here, we restrict the test set to the toxic examples that mention the identity and the non-toxic examples that do not. A low value here means that the model confuses toxic examples that mention the identity with non-toxic examples that do not, likely meaning that the model predicts lower toxicity scores than it should for toxic examples mentioning the identity.

* See our tutorial for more details on these metrics.
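Here’s the small, self-contained sketch referenced above of how these three restricted AUCs can be computed. The rows, scores, and helper names are illustrative placeholders rather than our actual evaluation code; in practice, each row would come from the synthetic test set and be scored by the model.

```python
# Sketch of Subgroup, BPSN, and BNSP AUC, each an ROC-AUC over a restricted
# subset of the test set. Labels: 1 = toxic, 0 = non-toxic.
from sklearn.metrics import roc_auc_score

rows = [  # placeholder test examples with made-up model scores
    {"label": 0, "identity": "gay",      "score": 0.55},
    {"label": 1, "identity": "gay",      "score": 0.85},
    {"label": 0, "identity": "straight", "score": 0.10},
    {"label": 1, "identity": "straight", "score": 0.80},
    {"label": 0, "identity": "black",    "score": 0.40},
    {"label": 1, "identity": "black",    "score": 0.90},
    {"label": 0, "identity": "white",    "score": 0.15},
    {"label": 1, "identity": "white",    "score": 0.75},
]

def _auc(subset):
    return roc_auc_score([r["label"] for r in subset],
                         [r["score"] for r in subset])

def subgroup_auc(identity):
    # Only examples that mention the identity.
    return _auc([r for r in rows if r["identity"] == identity])

def bpsn_auc(identity):
    # Non-toxic examples that mention the identity, plus toxic ones that don't.
    return _auc([r for r in rows
                 if (r["identity"] == identity) == (r["label"] == 0)])

def bnsp_auc(identity):
    # Toxic examples that mention the identity, plus non-toxic ones that don't.
    return _auc([r for r in rows
                 if (r["identity"] == identity) == (r["label"] == 1)])

for term in ["gay", "straight"]:
    print(term, subgroup_auc(term), bpsn_auc(term), bnsp_auc(term))
```

A low value from any of these functions points to the corresponding kind of unintended bias described above.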

Unintended Bias Evaluation Results

The two charts below show these three metrics for a selection of identity terms around sexual orientation, gender, and race. On the left is Perspective’s initial model, TOXICITY@1, launched in February 2017; on the right is the latest model, TOXICITY@6, launched in August 2018.

When we launched TOXICITY@1, some people noticed problems with the model scores for certain identity words, and those issues are now clearly visible in the evaluation results for TOXICITY@1. We see low BPSN AUC values for the identity terms homosexual, gay, and lesbian, and to a lesser extent, black and white. This indicates that the TOXICITY@1 model tended to give non-toxic comments containing these words toxicity scores comparable to those of truly toxic comments without them, essentially mistaking these identity words for toxicity and creating false positives (the inspiration for our blog’s name!). We also see low BNSP AUC values for the identity terms straight and cis, indicating the opposite association: the model falsely associated these words with non-toxicity, giving comments containing them toxicity scores as low as those of truly non-toxic comments.

Since launching TOXICITY@1, we’ve continued investing in research on how to identify, measure, and mitigate these unintended biases in our models. We can see those improvements in the results for our current model TOXICITY@6, where performance for all identities is high.

We’re proud of this progress, but we’re clearly not done. The identity terms gay, homosexual, and black still show the lowest BPSN AUC values, indicating that the model still has a slight false association between these words and toxicity. Also, the identity terms shown here are only a small subset of the 50 identity terms in our Model Card, and even that larger set is just the beginning. In addition, these evaluations are all run on synthetically generated test data; we’re working on ways to expand to more realistic data, so stay tuned for that.

When we originally launched Perspective, the metrics and test data we needed to build this initial Model Card didn’t yet exist; if they had, we would have been able to find and mitigate unintended bias in our models sooner. We’re committed to constantly improving our models and to having open, transparent conversations about model performance with those impacted by them. Going forward, we’re excited to see how the concept of Model Cards grows and develops into another piece of that conversation as we continue to grow and develop our models. See our current Model Cards for the TOXICITY models here, including results for more identity terms and intersectional results. If you have any questions, feedback, or additional things you’d like to see in the Model Card, please reach out to us here.

— Lucy Vasserman, Software Engineer at Jigsaw, and John Cassidy, Design Lead at Jigsaw
