Our Approach to Machine Learning Bias, Part 1

This is the first of two parts to Sentropy’s approach to addressing bias in machine learning classification tasks. You can read Part 2 here.

By Cindy Wang

An introduction to bias

At Sentropy, our goal is to protect online communities by detecting abusive language. There are countless hurdles that make this problem difficult. To name a few, users might employ language that is difficult to interpret out of context, or bad actors could intentionally alter their vocabulary to avoid detection. But one obstacle to effective moderation is embedded in the detection system itself: machine learning bias. If left unchecked, unintended bias can act as a gatekeeper, making it more difficult for already vulnerable populations to participate in online communities. In this post, we’ll describe how we think about algorithmic bias at Sentropy, what we believe has been missing from recent conversations around bias, and how we architected our machine learning systems to approach this problem head-on. To do all this, we leaned on existing academic research and collaborated with domain experts in internet culture and the social sciences.

There can be serious consequences when biased systems are responsible for automated decision making. Take the ProPublica study in 2016 that analyzed the controversial COMPAS recidivism prediction tool, showing that it was nearly twice as likely to mislabel black defendants as “high risk” for reoffending, compared to white defendants. Similarly, a 2018 study led by MIT Media Lab researcher Joy Buolamwini found that commercial gender classification systems sold by IBM, Microsoft, and Face++ performed significantly worse on darker-skinned female faces than lighter-skinned male faces. In 2018, Reuters reported that Amazon had scrapped an internal recruiting tool that downgraded scores for resumes listing women-related keywords and all-women’s colleges. This is all to say, machine learning bias can have severe real-world ramifications.

In the last few years, the machine learning community has come a long way in our conversations around bias. Nonprofits like the Algorithmic Justice League are solely dedicated to fighting algorithmic bias, and an interdisciplinary academic community has pushed for research into fairness, accountability, and transparency in socio-technical systems. Within the industry, Facebook, IBM, Microsoft, and Google have all introduced tools that attempt to identify biased systems and datasets, and some of these tools have even been open-sourced for public use.

Unfortunately, many widely used machine learning systems continue to exhibit unintended bias. In the domain of abusive language detection, machine learning bias can not only reduce accuracy but also amplify the discrimination that these systems are designed to prevent. As recently as 2019, a survey of abusive language detection models trained on datasets collected from Twitter showed evidence of systemic racial bias across the board. A related study found this was likely due to the humans labeling the data having implicit biases related to African American Vernacular English (AAVE). In the same vein, Perspective API, a set of toxicity detection models released by Google’s Jigsaw incubator, was found to make biased predictions for comments containing certain identity terms like “gay” or “Muslim.” The Perspective team has since made progress in addressing these issues with new training and evaluation methods, and their contributions have helped form a foundation for our team at Sentropy to detect and mitigate bias in our models.

Common misconceptions about bias

As you might have noticed in the examples above, bias enters machine learning systems as a result of human assumptions. In particular, for complex systems that might involve layers of black-box operations and massive amounts of training data, bias can creep in at many stages of the process.

Much of the conversation around machine learning bias rightly focuses on imbalances in the training data. Adding training data is an effective fix when the existing data doesn’t match reality, as when a facial recognition system performs worse on darker-skinned people because the training set is mostly composed of lighter-skinned faces. But balancing training data isn’t always practical at scale, especially for tasks like abusive language detection, where abusive examples form a tiny percentage of the available data. What’s more, adding data can actually amplify bias if the labels reflect existing prejudices: Amazon’s recruiting tool, which was trained on gender-biased hiring decisions, is an example of this in action. A more proactive approach is necessary to create a system that explicitly removes unwanted bias.

Another source of potential bias is the selection of data attributes that will be used as inputs into the model. For a recruiting tool, an attribute might be the applicant’s gender, education level, or resume keywords. Unfortunately, removing explicitly biased attributes like gender or race from the model’s consideration often fails to entirely remove bias. Due to societal biases still present in the labels, the model could simply learn to focus on other correlated attributes, like implicitly gendered words in the case of Amazon’s recruiting tool, or employment status in the case of the COMPAS prison recidivism tool.

One additional misconception is that all biases are bad. In fact, bias stems from the assumptions of people creating machine learning systems. Any such system is necessarily biased due to human subjectivity in defining the problem, collecting the training data, and selecting the training method. When we talk about unintended bias, we are referring specifically to bias that ends up negatively impacting specific demographic groups. This is what we have tried to combat in our models at Sentropy.

Consider, for example, the following two sentences:

  • I am a man.
  • I am a woman.

Clearly, in an isolated context, neither of these sentences is exhibiting hateful or abusive intent. If an abusive language model predicts that one of these sentences is abusive while the other isn’t, then it’s showing unintended gender bias. This unintended bias is the kind that Google’s Perspective team and others have called out in their models and the kind that we’ve taken specific steps to minimize in ours.
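One way to make this kind of check concrete is to fill a neutral template with different identity terms and compare the scores. The snippet below is a minimal sketch of that idea, not our production tooling: the `score_gap` helper, the template, the term list, and the deliberately biased `toy_score` function are all made up for illustration.

```python
from typing import Callable, Dict, List


def score_gap(score: Callable[[str], float], template: str, terms: List[str]) -> Dict[str, float]:
    """Fill the same template with each identity term and report how far
    each term's score sits from the mean score across all terms."""
    scores = {t: score(template.format(term=t)) for t in terms}
    mean = sum(scores.values()) / len(scores)
    return {t: s - mean for t, s in scores.items()}


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a real classifier. It is biased on purpose
    so that the probe has something to flag."""
    return 0.8 if "woman" in text else 0.1


gaps = score_gap(toy_score, "I am a {term}.", ["man", "woman"])
for term, gap in gaps.items():
    print(f"{term}: {gap:+.2f}")
```

An unbiased model should produce gaps close to zero for a neutral template like this one. Large gaps in either direction are a sign that the identity term alone is driving the prediction.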

Sentropy’s model

What follows is an explanation of our process. We share a more technical deep-dive in Part 2.

We know that we can’t eliminate bias from our models entirely and that many existing strategies for bias mitigation are insufficient on their own. To help reduce bias, we’ve built bias considerations into every level of our process, including steps that proactively train our models to be unbiased across different demographic groups.

At Sentropy, we’ve built a set of abusive language classifiers that take text content and output a real-valued score from 0 to 1, where a higher score means we’re more confident that the text contains abuse, like threats or hate speech. To avoid bias issues like the ones we’ve discussed, here are the key steps we took to differentiate our process:

1. We expanded our datasets.

Perspective API is both trained and evaluated on Wikipedia user comments. Though large in size, this dataset is not necessarily representative of the wide range of linguistic forms and levels of toxicity seen on the internet. To address this issue, we created new training and test sets for our models using an array of data from several different domains, including a diverse set of Reddit communities (or subreddits), which greatly expands our coverage of different kinds of language.

Another data problem was that certain identity terms (for instance, the word “gay”) mostly appeared in the data as insults, despite not actually being pejorative. This caused our models to learn biased associations of these terms with abusive language. To mitigate this, we added data so that the abusive/non-abusive distribution of these terms more closely matched that of the overall dataset.
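To give a feel for the rebalancing step, here is a rough sketch that estimates how many non-abusive examples containing a term would need to be added for that term's abusive rate to drop to the overall rate. The function name, the whitespace tokenization, and the toy data are assumptions for illustration. The calculation also treats the overall rate as fixed, which is only approximately true once new examples are added.

```python
from typing import List, Tuple

Example = Tuple[str, bool]  # (text, is_abusive)


def non_abusive_needed(examples: List[Example], term: str) -> int:
    """Number of non-abusive examples containing `term` to add so that the
    term's abusive rate is no higher than the dataset's overall abusive rate.
    Simplification: the overall rate is held fixed while adding data."""
    total = len(examples)
    total_abusive = sum(1 for _, abusive in examples if abusive)
    with_term = [ex for ex in examples if term in ex[0].lower().split()]
    term_abusive = sum(1 for _, abusive in with_term if abusive)
    if total_abusive == 0 or not with_term:
        return 0
    # Want term_abusive / (len(with_term) + k) <= total_abusive / total.
    # Integer ceiling division avoids floating-point rounding surprises.
    target_size = -(-term_abusive * total // total_abusive)
    return max(0, target_size - len(with_term))


# Toy data: the term shows up mostly in abusive examples.
data: List[Example] = (
    [("gay <insult>", True)] * 2
    + [("my gay friend got married", False)]
    + [("nice weather today", False)] * 7
)
print(non_abusive_needed(data, "gay"))
```

In this toy set, two of the three examples containing the term are abusive, against an overall abusive rate of two in ten, so the gap has to be closed by adding benign uses of the term.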

2. We homed in on labeling with human curation and lexicon expansion.

We used a combination of unsupervised learning and human curation to expand our lexicons of identity terms for various demographic subgroups (e.g., black, white, Asian, Jewish, LGBTQ+, people with disabilities, etc.). These lexicons were then used to label the examples in our training and test sets that contained mentions of subgroup terms. We used these labels both during training, to explicitly train our models to avoid subgroup-based bias, and while testing, to evaluate the degree of bias within each of these subgroups.
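As a rough illustration of lexicon-based subgroup labeling, the sketch below tags a piece of text with every subgroup whose lexicon it mentions. The lexicons here are tiny, hypothetical placeholders, and the simple word matching is an assumption. Real lexicons would be far larger and curated by people with the relevant expertise.

```python
import re
from typing import Dict, List, Set

# Hypothetical, deliberately tiny lexicons for illustration only.
LEXICONS: Dict[str, Set[str]] = {
    "lgbtq": {"gay", "lesbian", "trans"},
    "jewish": {"jewish", "jew"},
    "disability": {"deaf", "blind"},
}


def subgroup_labels(text: str) -> List[str]:
    """Return the sorted names of all subgroups whose lexicon terms
    appear as whole words in the text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(name for name, terms in LEXICONS.items() if tokens & terms)


print(subgroup_labels("My gay Jewish neighbor"))
print(subgroup_labels("hello world"))
```

These labels say nothing about whether the text is abusive. They only record which subgroups are mentioned, so that training and evaluation can treat those examples separately.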

3. We added bias mitigation to the training objective.

A common way to train a machine learning model is to tell it to minimize a loss function, which is a measure of how wrong the model’s predictions are. Instead of a standard loss function, we trained our models using a multi-part loss that requires them to pay extra attention to subgroups of the data that are frequent subjects of bias. This way, we can explicitly tell our models to be unbiased as they are learning.
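To make the idea of a multi-part loss more tangible, here is one simple form it could take: a standard binary cross-entropy over all examples plus a weighted cross-entropy over only the examples that mention a subgroup. This is a sketch under our own assumptions, not the exact loss we use. The function names and the `subgroup_weight` parameter are hypothetical.

```python
import math
from typing import List


def bce(labels: List[int], scores: List[float]) -> float:
    """Mean binary cross-entropy between 0/1 labels and scores in [0, 1]."""
    eps = 1e-7  # clamp scores so log() never sees 0
    total = 0.0
    for y, p in zip(labels, scores):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def multi_part_loss(labels: List[int], scores: List[float],
                    in_subgroup: List[bool], subgroup_weight: float = 1.0) -> float:
    """Overall loss plus an extra, weighted loss term computed only on
    examples that mention a subgroup."""
    overall = bce(labels, scores)
    sub = [(y, p) for y, p, m in zip(labels, scores, in_subgroup) if m]
    if not sub:
        return overall
    sub_loss = bce([y for y, _ in sub], [p for _, p in sub])
    return overall + subgroup_weight * sub_loss


labels = [1, 0, 0, 1]
scores = [0.9, 0.2, 0.7, 0.8]
in_subgroup = [False, False, True, False]  # third example is a benign subgroup mention
print(multi_part_loss(labels, scores, in_subgroup))
```

Here the only subgroup example is a non-abusive message that received a high score, so the extra term penalizes exactly the kind of false positive that unintended bias tends to produce.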

4. We incorporated “slice” aware training.

We also trained our models using slice-based learning, a paradigm that allows us to improve performance on critical subsets of the data. One such critical subset (or “slice”) might be short messages, or messages that mention the LGBTQ+ subgroup. Slice-based learning divides the training data into the specified slices and commits additional model capacity to them.
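Slices are typically specified with small heuristic functions that decide whether an example belongs to a slice. The sketch below shows only that specification step, not the extra model capacity that slice-based learning then assigns to each slice. The function names, the three-word threshold, and the term list are illustrative assumptions.

```python
from typing import Callable, Dict, List

SlicingFunction = Callable[[str], bool]


def is_short(text: str) -> bool:
    """Slice of short messages; the 3-word cutoff is a made-up threshold."""
    return len(text.split()) <= 3


def mentions_lgbtq(text: str) -> bool:
    """Slice of messages mentioning a (hypothetical, tiny) LGBTQ+ term list."""
    terms = {"gay", "lesbian", "trans"}
    return any(tok.strip(".,!?") in terms for tok in text.lower().split())


SLICES: Dict[str, SlicingFunction] = {"short": is_short, "lgbtq": mentions_lgbtq}


def slice_membership(texts: List[str]) -> Dict[str, List[bool]]:
    """For each slice, a True/False flag per input text."""
    return {name: [fn(t) for t in texts] for name, fn in SLICES.items()}


print(slice_membership(["ok", "my gay friend got married yesterday"]))
```

Slices can overlap, and a single message can belong to several at once or to none.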

5. We continuously monitor and iterate.

Finally, we implemented the bias metrics put forth by the Perspective API team in a recent paper. In addition to tracking standard metrics of classifier performance like accuracy, precision, recall, and F1 score, we use these bias metrics to continuously monitor the levels of unintended bias in our models as we iterate on them.
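As an illustration, the sketch below assumes the metrics in question are AUC-based: an AUC computed within a subgroup, plus two cross-comparisons between the subgroup and the rest of the data (the "background"). The metric names and the pairwise AUC computation are our simplified reading and should be checked against the paper. The toy labels and scores are made up.

```python
from typing import Dict, List


def auc(pos_scores: List[float], neg_scores: List[float]) -> float:
    """Probability that a random positive outscores a random negative
    (ties count as half). Returns NaN if either list is empty."""
    if not pos_scores or not neg_scores:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def bias_aucs(labels: List[int], scores: List[float], in_subgroup: List[bool]) -> Dict[str, float]:
    """Subgroup AUC, plus background-positive/subgroup-negative (BPSN) and
    background-negative/subgroup-positive (BNSP) AUCs."""
    rows = list(zip(labels, scores, in_subgroup))
    sub_pos = [s for y, s, m in rows if m and y == 1]
    sub_neg = [s for y, s, m in rows if m and y == 0]
    bg_pos = [s for y, s, m in rows if not m and y == 1]
    bg_neg = [s for y, s, m in rows if not m and y == 0]
    return {
        "subgroup_auc": auc(sub_pos, sub_neg),
        "bpsn_auc": auc(bg_pos, sub_neg),  # low: benign subgroup text scored too high
        "bnsp_auc": auc(sub_pos, bg_neg),  # low: abusive subgroup text scored too low
    }


metrics = bias_aucs(labels=[1, 0, 1, 0],
                    scores=[0.9, 0.7, 0.6, 0.1],
                    in_subgroup=[True, True, False, False])
print(metrics)
```

In this toy example the model ranks examples correctly within the subgroup, but a benign subgroup mention scores higher than an abusive background message. That shows up as a low BPSN AUC even though the subgroup AUC looks perfect.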

What does this all mean?

After implementing each of the steps above, our model not only achieves higher accuracy but also exhibits less bias across different demographic subgroups. The many considerations we’ve baked into our approach (a more diverse dataset, fine-grained labeling, a more robust loss function, and new training approaches) all work together to reduce bias.

We’re sharing our approach with the intent of transparency, along with the hope that it will help bootstrap new research and conversations around algorithmic bias. This work is absolutely critical to making sure our tools are effective for all users, especially those who are most vulnerable to abusive language online. Otherwise, we put their safety at risk.


We all deserve a better internet.


We all deserve a better internet. Sentropy helps platforms of every size protect their users and their brands from abuse and malicious content.

Sentropy Technologies