Toxicity and Tone Are Not the Same Thing: analyzing Perspective API, the new Google API on toxicity.

Google’s Jigsaw group today launched Perspective API, an API meant to create better conversations. Its first rollout, and its first machine learning model, is designed to rate conversations as ‘toxic’ or not. I’m flattening what the API could do, but Google describes it as, “…The API uses machine learning models to score the perceived impact a comment might have on a conversation. Developers and publishers can use this score to give realtime feedback to commenters or help moderators do their job, or allow readers to more easily find relevant information, as illustrated in two experiments below...” Effectively: can you lighten moderators’ workload when they are looking at traumatic content, and can you help surface good content? The first is easier to do, and spaces like the Coral Project are creating products and software to mitigate harassment in commenting sections, and they are doing it well. What’s harder is creating or surfacing a good conversation.
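To make the moderator use case concrete, here is a minimal sketch of how a publisher might triage comments once each one has a score between 0 and 1. It does not call Perspective at all: the scores are passed in, and the function name, the `Decision` type, and both thresholds are placeholders invented for illustration, not anything Google documents.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Decision(NamedTuple):
    text: str
    score: float
    action: str  # "publish", "review", or "hide"


def triage(
    scored_comments: Iterable[Tuple[str, float]],
    review_at: float = 0.5,  # hypothetical threshold, not a recommended value
    hide_at: float = 0.9,    # hypothetical threshold, not a recommended value
) -> List[Decision]:
    """Sort (text, score) pairs into publish / review / hide decisions."""
    decisions = []
    for text, score in scored_comments:
        if score >= hide_at:
            action = "hide"
        elif score >= review_at:
            action = "review"
        else:
            action = "publish"
        decisions.append(Decision(text, score, action))
    # Highest scores first, so a moderator sees the likeliest problems at the top.
    return sorted(decisions, key=lambda d: d.score, reverse=True)
```

Everything interesting is hidden in the score itself. If the score tracks tone rather than toxicity, a queue like this will dutifully surface angry comments and wave polite hate speech through.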

But let’s go back to bad conversations. What makes a conversation ‘bad’? In Google’s mind, it’s a matter of toxicity. Toxicity, I imagine, ranges from negative words and sentiments to violent threats, racism, misogyny, etc. (you get the point). But to create a machine learning API to rate toxicity, you have to teach it toxicity, and you need to generate a corpus of toxic words. Jigsaw may have used four different data sets, as they list four different partners on their website for Perspective API. Three of those data sets are from newspapers, and what I assume are their commenting sections: The Guardian, The Economist, and The New York Times. Why do I assume that? Because Jigsaw has said they are working with The New York Times and its moderators specifically on toxicity and abuse, to train a bot to understand abuse. The final data set uses Wikimedia’s talk pages, with Wikimedia editors and volunteers rating the exchanges and fights within those pages on an abuse scale.

The tl;dr is that Jigsaw rolled out an API to rate the toxicity of words and sentences based on four data sets that feature highly specific and, quite frankly, narrow conversation types: online arguments in commenting sections or debates over facts, all probably from English-language corpora only.

I decided to play around with the API:

i mean, i cant even.
This is “have a good day” in Farsi, “I love you” in Chinese, and “I speak Spanish” in Spanish

As a friend noted on a Slack channel, “…it seems like short input is much more highly likely to be rated as toxic…” And the foreign languages that were not Latin-based seemed to be rated as toxic. I wonder if the system is throwing errors. If shorter comments are rated as toxic, then a machine learning system would rate single words as more toxic, in varying degrees. I wonder what the non-Latin alphabets were translated to. Were emojis or any numerical systems rated as ‘more toxic’ than English, and is that why 我爱你 has a 36% toxicity rating?
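The short-input hunch is testable by anyone who writes down the scores the demo shows them. A minimal sketch, assuming you have collected (text, score) pairs by hand: group them by length and compare the average score per group. The function name and the bucket size are made up for illustration, and length is counted in characters, which is a crude measure across scripts (three Chinese characters can be a whole sentence).

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_score_by_length(
    samples: Iterable[Tuple[str, float]],
    bucket_size: int = 20,  # hypothetical bucket width, in characters
) -> Dict[int, float]:
    """Average the scores of (text, score) pairs, grouped by text length.

    Keys are the starting length of each bucket: 0 covers texts of
    0-19 characters, 20 covers 20-39 characters, and so on.
    """
    buckets = defaultdict(list)
    for text, score in samples:
        bucket_start = (len(text) // bucket_size) * bucket_size
        buckets[bucket_start].append(score)
    return {start: mean(scores) for start, scores in sorted(buckets.items())}
```

If the averages fall steadily as the buckets get longer, even for harmless text, that would support the idea that the model is reacting to length rather than content. A handful of hand-typed examples is not evidence either way, so this only becomes meaningful with a reasonably large and varied sample.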

Look, this is all conjecture, because I cannot access the API…yet. But I do know there are a bunch of errors in here, regardless. This was trained on specific data sets that only feature a certain kind of conversation: debates over articles. But how are the following not toxic? Why weren’t certain sentences or sentiments rated as ‘misogynistic’, ‘racist’, or ‘traumatic’? The API seems to be rating not toxicity but tone, or a metric for bad tone in a conversation. Tone and toxicity are very, very different things. Toxic conversation can be innocuous and insidious; tone is a lot more obvious. Tone is more closely aligned with sentiment, and sentiment includes the emotions of anger, disgust, fear, joy, and sadness; but toxicity can be inherited racism, and a form of unconscious bias. This seems to be more an analysis of tone, drawn from data on reported harassment in English comments from four specific sites, and less an assessment of what harassment and hate speech could be.

This isn’t a great data set, because of how narrow it is, and because of who may have labeled the ground truth for this machine learning model. Wikimedia editors and newspaper moderators have purviews specific to their websites and their brands; they are good indicators of the community standards of their sites, not of societal standards. That matters all the more because people fighting in the commenting sections or debating Wikipedia additions are already in a heightened emotional state: they are ready to debate or fight because they are receiving criticism. If this is trained on reports from these sites, then we know the data corpus is built from moments that were marked as harassment or extreme escalation. This is a data set of agitation and anger, and perhaps not really of toxicity.

The above is known as the 14 most hateful words, and it is a Neo-Nazi slogan.
Look, this is just a dick thing to say. But how toxic is ‘dickishness’ in Google’s API?

Look, language is hard, and so is conversation. What can be harder is proving intention, and allowing for open conversation within moderated spaces. What does that mean? Well, it means that inherently ‘unequal’ views can exist together. Try this moderation quiz to see what I mean. The quiz has an example of someone not supporting gay marriage, but the comment is allowed because it’s a religious view. But where does racism fall within that? And how much “more” toxic is homophobia than racism? How should an API rate that choice?

EDIT (An addition): I do have to add a follow-up. I think it is beyond commendable that Google, and in particular Jigsaw, is trying to analyze language and figure out toxicity, especially within commenting sections. What they’ve done so far in this field is more than anyone else has done, especially on abusive interactions. That being said, all of their data to train APIs comes from commenting sections, and their abuse models are specifically from data sets that are constrained by what regular and new commenters are saying to each other about a specific topic or domain (articles). Again, it’s highly specific and contextualized data. I still think that this is commendable research because I. Read. The. Comments. And I love commenting sections. But nowhere on their website for Perspective does it mention that this is a beta or an alpha API, or that it is potentially wrong, or that it needs a lot of training from other people to make it not racist and not toxic. That kind of disclosure needs to be on a public-facing site, especially one that allows for human input. If Perspective needs to be trained to be accurate, then let people know that they are doing that very necessary labor, and that they are doing that labor for free. It’s a slippery slope when it comes to unsupervised machine learning and chat bots: models need to be trained frequently, and they are trained from interaction. Let’s not forget Tay, and what we learned from it: how large-scale unstructured human training, and 4chan, can destroy your experiments.

//disclosure: I’m working on a fellowship to study hate speech and mitigate harassment. So the last question is one I am actively trying to answer, as an ethnographer and machine learning researcher.