10 New Languages for Perspective API

Published in

Jigsaw

Dec 9, 2021

Jigsaw has been working to protect voices online since our inception six years ago. Core to this effort is our work to support safe online discourse without fear of harassment, regardless of dialect or language of origin. To that end, we’ve launched Perspective API’s toxicity detection in ten new languages, expanding the range of conversations we can help facilitate. This work was made possible by emerging innovations in language models, the machine learning algorithms that recognize and generate human language based on text datasets.

Perspective is used by hundreds of platforms around the world to moderate comments posted by their users, including Reddit, The New York Times, Wall Street Journal, Le Monde, El País, Disqus, Coral and OpenWeb. Our publishing and platform partners help us test and improve the performance of our algorithms by contributing data, such as examples of toxic comments they encounter online, and by sharing feedback on how the technology can be improved. Over 400 partners use Perspective daily, calling our API more than 600 million times a day, and until now have been able to use the tool in English, French, German, Italian, Portuguese, Russian, and Spanish. The ten new languages we’ve added span the globe: Arabic, Chinese (Simplified), Czech, Dutch, Indonesian, Japanese, Korean, Polish, Hindi, and Hinglish (a mix of English and Hindi transliterated using Latin characters).
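For readers who want to try one of the new languages, here is a minimal sketch of what a request body might look like. It only builds the JSON payload and sends nothing. The endpoint and field names follow the public Comment Analyzer API as we understand it, and the language codes in `NEW_LANGUAGES` (in particular `hi-Latn` for Hinglish) are assumptions that should be checked against the Perspective Developers website. The helper `build_analyze_request` is a hypothetical name used for illustration.

```python
import json

# Endpoint of the Comment Analyzer API; confirm the current version
# on the Perspective Developers website before relying on it.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

# Assumed language codes for the ten new languages (not taken from this post).
NEW_LANGUAGES = {
    "Arabic": "ar",
    "Chinese (Simplified)": "zh",
    "Czech": "cs",
    "Dutch": "nl",
    "Indonesian": "id",
    "Japanese": "ja",
    "Korean": "ko",
    "Polish": "pl",
    "Hindi": "hi",
    "Hinglish": "hi-Latn",
}


def build_analyze_request(text, language_code):
    """Build the JSON body for a TOXICITY request (hypothetical helper)."""
    return {
        "comment": {"text": text},
        "languages": [language_code],
        "requestedAttributes": {"TOXICITY": {}},
    }


body = build_analyze_request("Dit is een voorbeeld.", NEW_LANGUAGES["Dutch"])
print(json.dumps(body, ensure_ascii=False, indent=2))
```

A client would POST this body to `ANALYZE_URL` with its own API key. The HTTP call is left out here so the sketch runs without credentials.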

In particular, we partnered with Al Jazeera on developing the Arabic language model. “We are thrilled to have partnered with Jigsaw to tackle one of the biggest societal challenges of our time — solving the problem of online toxicity and the weaponization of social media,” says David Hostetter, Al Jazeera Digital CTO. “We will be leveraging these technologies across our own brands to ensure that our standards and best practices are sustained and to help uphold our mission to be the voice of the voiceless across the globe.”

Earlier iterations of Perspective relied on convolutional neural networks, which were state of the art at the time. These models were not large enough to handle more than one language at a time, so our engineers had to build one model per language. Pre-trained language models are a more recent innovation that can process larger datasets and therefore handle multiple languages at once. They require less data for each individual language, because concepts learned in one language contribute to concepts in others, even when the languages do not stem from the same root language.

In initial tests, we soon realized that we couldn’t provide the benefits of these models to our users: they were too large and too slow to serve in our API. They were also computationally expensive, requiring more time, memory, and high-powered computers, which threatened to make Perspective inoperable, especially for real-time use cases such as giving immediate feedback to authors whose comments might be perceived as toxic.
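To make the real-time use case concrete, here is a rough sketch of how a client might turn a score into author feedback. It assumes the response nests the score under `attributeScores.TOXICITY.summaryScore.value`, which matches the public API as we understand it. The sample response, the function names, and the 0.8 threshold are all hypothetical and chosen only for illustration.

```python
def toxicity_score(response):
    """Extract the summary TOXICITY score (0.0 to 1.0) from a parsed response."""
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


def should_nudge_author(response, threshold=0.8):
    """Return True when the comment scores at or above the threshold.

    The default threshold is an illustrative assumption, not a recommendation.
    """
    return toxicity_score(response) >= threshold


# Hand-written sample response, not real API output.
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.91, "type": "PROBABILITY"}}
    }
}

if should_nudge_author(sample_response):
    print("Your comment may be perceived as toxic. Edit before posting?")
```

Because the check runs while the author is still typing, the scoring call has to return quickly, which is why serving latency mattered so much here.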

Two emerging innovations from our collaborators across Google changed the playing field for us. First, further advancements in pre-trained large language models like Charformer, which are able to generalize concepts without being restricted to rigid vocabularies, increased speed and reduced computation costs. Second, innovations in serving technology and access to new computing hardware within Google made it possible to serve much larger pre-trained language models.

In collaboration with teams at Google Research, we were able to build a new and improved model architecture using Charformer, which allowed us to serve bigger models directly on new serving technology that delivers these models fast enough to meet our API clients’ needs. In addition to making ten new languages possible, Charformer models also outperformed our previous models on more complex types of conversation, such as comments containing negation, identity terms, or adversarial misspellings.
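One way to see why vocabulary-free input helps with adversarial misspellings is to compare a fixed word vocabulary with raw UTF-8 bytes. The sketch below is not Charformer, which as we understand it learns its own subword groupings from byte-level input. It only illustrates the input-side difference, using a toy vocabulary and hypothetical helper names.

```python
# Toy fixed vocabulary; any word outside it collapses to <unk>.
VOCAB = {"<unk>": 0, "you": 1, "are": 2, "an": 3, "idiot": 4}


def word_ids(text):
    """Map whitespace-separated words to vocabulary ids, using <unk> for misses."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def byte_ids(text):
    """Map text to its UTF-8 byte values; no vocabulary is needed."""
    return list(text.encode("utf-8"))


def shared_positions(a, b):
    """Count positions where two equal-length id sequences agree."""
    return sum(x == y for x, y in zip(a, b))


clean, misspelled = "idiot", "id1ot"

# The fixed vocabulary loses the misspelled word entirely.
print(word_ids("you are an " + clean))       # [1, 2, 3, 4]
print(word_ids("you are an " + misspelled))  # [1, 2, 3, 0]

# At the byte level, four of the five positions still match.
print(shared_positions(byte_ids(clean), byte_ids(misspelled)))  # 4
```

With a fixed vocabulary the misspelled word carries no signal at all. At the byte level most of the original sequence survives, which gives a model something to generalize from.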

We look forward to transitioning more of our existing languages onto the Charformer model and making even more languages available in the future, to enable as many users as possible to benefit from these new innovations in conversation technology.

More information about testing and implementing our new models can be found on the Perspective Developers website.

Authors: Tin Acosta, Alyssa Lees, Daniel Borkan, Jeffrey Sorensen, Alyssa Chvasta, Roelle Thorpe, Lucy Vasserman

Written by Jigsaw