Bias in NLP

A Quick Look

Márcio Duarte
12 min read · May 25, 2022

Although it is a term most of us are comfortable with, bias is not trivial to define and comprehend, especially in the context of machine learning. Bias¹, an article in a health journal, defines it as:

Bias is the lack of internal validity or incorrect assessment of the association between an exposure and an effect in the target population in which the statistic estimated has an expectation that does not equal the true value.

Sociology seems to put it more simply²:

Bias is a prejudice in favor of or against a person, group, or thing that is considered to be unfair.

When they refer to a statistic whose “expectation does not equal the true value” or to a “prejudice that is considered to be unfair”, both definitions focus on one important aspect: bias appears when unjust assumptions are made.

Even the most objective and fair-minded people demonstrate bias. Research demonstrates that people who value their objectivity and fairness are paradoxically particularly likely to fall prey to bias, in part because they are not on guard against subtle bias.²

Bias in the Real World

Public Models

Modern NLP is based on transformer models: deep learning models, introduced in 2017, that revolutionized the field. The pace of development is such that their number of parameters seems to grow exponentially. Here we will look at results from two of the best-known models: BERT, developed by Google, and GPT-3, developed by OpenAI.

When we asked BERT to fill the mask in "This man/woman works as a [MASK].", it returned:

  • man — carpenter, lawyer, farmer, businessman, and doctor;
  • woman — nurse, maid, teacher, waitress, and even prostitute.

The Hugging Face course even has a chapter dedicated to this problem.
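This probe is easy to reproduce with the Hugging Face transformers library. The sketch below assumes the bert-base-uncased checkpoint (the article does not specify which BERT variant was used), so the exact professions returned may vary.

```python
from transformers import pipeline

# Minimal sketch of the fill-mask probe described above. The checkpoint
# (bert-base-uncased) is an assumption; rankings may differ across
# model versions and library releases.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for subject in ("man", "woman"):
    predictions = unmasker(f"This {subject} works as a [MASK].")
    print(subject, "->", [p["token_str"] for p in predictions])
```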

For GPT-3, an NLP model with over 175 billion machine learning parameters³, problematic tendencies were found⁴:

  • professions requiring higher levels of education (e.g., banker, professor emeritus) were heavily male leaning, while professions such as midwife, nurse, receptionist, and housekeeper were heavily female leaning;
  • Asian people had consistently high sentiment, while Black people had consistently low sentiment;
  • words such as violent, terrorism, and terrorist were associated with Islam at a higher rate than other religions; when a phrase containing the word Muslim was given, GPT-3 created sentences associating Muslims with shooting, bombs, murder, or violence.

Twitter

Twitter images were usually cropped to improve consistency in the size of photos in the timeline and to allow the user to see more Tweets at a glance⁵. For this, Twitter used a saliency algorithm.

The saliency algorithm works by estimating what a person might want to see first within a picture. Saliency models are trained on how the human eye looks at a picture as a method of prioritizing what’s likely to be most important to most people.

Twitter’s saliency algorithm

In 2020, a university employee noticed that when he posted two photos — one of himself and one of a colleague — Twitter’s preview consistently showed the white man over the Black man, no matter which photo was added to the tweet first. Other users discovered that the pattern held for images of former US President Barack Obama and Senator Mitch McConnell, or for stock images of businessmen of different racial backgrounds. When both were in the same image, the preview crop appeared to favor the white faces, hiding the Black faces until users clicked through to the full photo⁶.

Twitter Image Crop in 2020⁷

Twitter has since fixed the issue by applying a simpler algorithm that just crops the image around its center. Although not an NLP problem, this shows that bias arises from similar causes, and with similar effects, across the whole field of machine learning.

Problems in Current Models and Approaches

One of the problems with NLP regarding bias is that, given the multitude of data available, it is relatively easy to build an accurate language model. Developers often blindly aim to improve this or that metric (be it F1, accuracy, etc.), and for those results the quantity of the data matters far more than its quality. For example, it would be easy to scrape all Wikipedia pages and build a model from them. The dataset is immense and, if well implemented, the model would return great results. But this is an irresponsible approach for a model meant to be deployed. A language model only extrapolates from the views of whoever wrote the dataset it was fed. Building a model on Wikipedia pages without studying who wrote those pages is irresponsible. Most datasets have some built-in bias, and in many cases it is benign. It becomes problematic when this bias negatively affects certain groups or disproportionately advantages others. On biased datasets, statistical models will overfit to the linguistic signals that are particular to the dominant group.

Many data sets in production are even created from long-established news sources (e.g., the Wall Street Journal or the Frankfurter Rundschau from the 1980s through the 1990s), a very codified domain predominantly produced by a small, homogeneous sample: typically white, middle-aged, educated, upper-middle-class men. Many syntactic analysis tools (taggers and parsers) are still trained on this newswire data, so modern syntactic tools expect everyone to speak like journalists from the 1980s. It should come as no surprise that most people today do not: language has evolved since then. NLP is therefore unprepared to cope with this demographic variation.⁸

In classification models, other problems arise. These models are trained on datasets of sentences labeled by annotators. There is often little care in choosing who these annotators are and, even when there is, their demographic representation is often ignored. It is not uncommon for language models to classify sentences written by women or African Americans as more aggressive. This usually happens because annotators rate the utterances of different ethnic groups differently and mistake innocuous banter for hate speech, simply because they are unfamiliar with the communication norms of the original speakers.

Bias is, however, a great force of friction in today’s societies, so one should expect a boom in papers and solutions for these problems. Bias is being studied in the context of the workplace, the media, and beyond, and bias in technology has not been left behind. Over the last decade, dozens of academic works focused on studying bias and proposing ideas to mitigate it have flourished. But these are not perfect either. According to Su Lin Blodgett et al., the motivations of these papers are often vague, inconsistent, and lacking in normative reasoning, and their proposed quantitative techniques for measuring or mitigating bias are poorly matched to those motivations and do not engage with the relevant literature outside of NLP.⁹

A Systematic Approach to Mitigate These Problems

The undoubted reason for this study is the will to mitigate, as broadly as possible, the effects of bias in new language models. For that, a set of technical measures has to be standardized. But that is not all. Computer science is no longer a purely technical field, because software is a key part of today’s societies, and bias is a symptom of failing to account for that.

Technical Approaches to Mitigate Bias

Tony Sun et al. give several approaches to technically mitigate bias¹⁰. From these, we selected the two most straightforward:

  • Debiasing the Training Corpora
  • Debiasing by Adjusting Algorithms

For the training corpora, that is, the data, we can debias using three relevant techniques: Data Augmentation, Tagging, and Bias Fine-Tuning.

Data Augmentation — oftentimes a dataset has a disproportionate number of references to one of the biased classes. Taking gender as an example, it is not uncommon for references to men to outnumber references to women by a large margin. To mitigate this, one can create an augmented dataset identical to the original one but biased towards the opposite gender. For simplicity, taking only two genders into account, this would be done by replacing every instance of one gender with the other in every row of the dataset. The training is then done on the union of the original and gender-swapped datasets. This is not without drawbacks: data augmentation at least doubles the size of the training set, which can increase training time by a factor specific to the task at hand. Also, blindly gender-swapping can make uncommon sentences far more common — for example, gender-swapping “she gave birth” to “he gave birth”.
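As a toy illustration of the idea, the sketch below swaps a tiny, hand-written list of gendered words. This is only a sketch under simplifying assumptions: a real augmentation pipeline needs much richer word lists, part-of-speech information for ambiguous pronouns, and checks for nonsensical swaps like the one above.

```python
# Toy gender-swap augmentation. The swap list and whitespace tokenization
# are deliberately simplistic; "his"/"her" are skipped because swapping
# them correctly requires part-of-speech information.
SWAPS = {
    "he": "she", "she": "he",
    "man": "woman", "woman": "man",
    "men": "women", "women": "men",
}

def gender_swap(sentence: str) -> str:
    tokens = sentence.lower().split()
    return " ".join(SWAPS.get(tok, tok) for tok in tokens)

corpus = ["he works as a carpenter .", "she works as a nurse ."]
augmented_corpus = corpus + [gender_swap(s) for s in corpus]
print(augmented_corpus)
# Training is then run on the union of the original and swapped data.
```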

Tagging — let’s take a real problem: current machine translation models guess the speaker’s gender from stereotypes, defaulting to male a disproportionate amount of the time (e.g., in English-Portuguese, “I’m a doctor” translates to “Eu sou um doutor” (male), while “I’m a nurse” translates to “Eu sou uma enfermeira” (female)). To mitigate this, we add tags at the beginning of our training data, such that “I’m happy” becomes “<MALE> I’m happy” or “<FEMALE> I’m happy”, depending on who the speaker is. This tag can be part of the text itself (in a generation task) or simply another attribute of the data (in a classification task). The goal is to preserve the gender of the source so that the model can produce more accurate translations. A machine translation model could then return two (or more) values for a translation, one for each relevant gender. For example, in English-Portuguese, “I’m blind” would translate to both “Eu sou cego” and “Eu sou cega”, indicating the gender of each sentence.
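The sketch below shows what such tagging could look like on a toy parallel corpus. The <MALE>/<FEMALE> tags follow the example above, while the record layout and field names are assumptions made for illustration.

```python
# Toy speaker-gender tagging for a translation corpus. Only the
# <MALE>/<FEMALE> tags come from the text; the record layout is
# an illustrative assumption.
parallel_corpus = [
    {"speaker": "FEMALE", "src": "I'm a doctor", "tgt": "Eu sou doutora"},
    {"speaker": "MALE",   "src": "I'm a nurse",  "tgt": "Eu sou enfermeiro"},
]

tagged_corpus = [
    {"src": f"<{example['speaker']}> {example['src']}", "tgt": example["tgt"]}
    for example in parallel_corpus
]

for example in tagged_corpus:
    print(example["src"], "->", example["tgt"])
# At inference time, prepending either tag to a source sentence asks the
# trained model for the corresponding gendered translation.
```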

Bias Fine-Tuning — unbiased data sets for a given task may be scarce, but unbiased data sets may exist for a related task. Bias fine-tuning uses transfer learning from an unbiased data set to ensure that a model contains minimal bias before fine-tuning it on a more biased data set used to train for the target task directly.
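In code, the idea boils down to two sequential training stages. The outline below is only a schematic sketch: the datasets and the training loop are hypothetical placeholders, not a real training script.

```python
# Schematic outline of bias fine-tuning as two-stage transfer learning.
# `train` stands in for a real training loop (e.g., with PyTorch or
# transformers.Trainer); both datasets are hypothetical placeholders.
def train(model, dataset, epochs=1):
    for _ in range(epochs):
        for example in dataset:
            pass  # forward pass, loss, backward pass, optimizer step
    return model

model = object()              # placeholder for a pretrained language model
unbiased_related_data = []    # curated, balanced corpus for a related task
biased_target_data = []       # target-task corpus, which may carry bias

# Stage 1: transfer learning on the unbiased, related-task data.
model = train(model, unbiased_related_data)
# Stage 2: fine-tune on the (possibly biased) target-task data.
model = train(model, biased_target_data)
```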

One can also work on debiasing through the algorithms themselves. This can be done by applying hard constraints to the model’s outputs. An NLP model risks amplifying bias by making predictions that exacerbate biases present in the training set. For instance, if 80% of the references to “secretary” are female in a training set, and a model trained on that set predicts 90% of the references to “secretary” in a test set to be female, then that model amplifies bias. To combat this, constraints can be hardcoded into the results. These constraints can be set in two ways (a small sketch after the list illustrates the first):

  • in regard to the training set — if the predictions amplify the bias in the training set, one can hard-set a quota at the percentage already present in the training set;
  • in regard to social values — for example, if a model has to classify the gender of a given sentence and the only indication is the word “doctor”, we make sure that the probability assigned to each gender is the same.
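The sketch below illustrates the first constraint on the “secretary” example above: predictions are post-processed so that the predicted gender ratio does not exceed the ratio observed in the training set. The flipping rule and the hard-coded numbers are simplifications for illustration; published approaches use constrained inference rather than post-hoc label flipping.

```python
# Naive illustration of capping predicted gender ratios at the ratio
# seen in the training set ("secretary" example above).
from collections import Counter

train_female_ratio = 0.80                       # 80% of "secretary" references in training are female
predictions = ["female"] * 90 + ["male"] * 10   # model predicts 90% female on the test set

pred_female_ratio = predictions.count("female") / len(predictions)

if pred_female_ratio > train_female_ratio:
    excess = round((pred_female_ratio - train_female_ratio) * len(predictions))
    # Flip `excess` "female" predictions to "male"; a real system would
    # flip the lowest-confidence ones, using scores omitted here.
    for i, label in enumerate(predictions):
        if excess == 0:
            break
        if label == "female":
            predictions[i] = "male"
            excess -= 1

print(Counter(predictions))   # Counter({'female': 80, 'male': 20})
```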

We strongly recommend checking the references for a deeper dive into this.

Structural Recommendations for New Models

Still, a technical approach is not enough. These are the kind of problems that cannot be solved by exact sciences and lines of code alone; technological advances by themselves won’t fix them. For this, Su Lin Blodgett et al. built a set of recommendations for new works in the field.⁹ You will find some of these recommendations in the next paragraphs, alongside some of our own.

Start with groundwork analyzing bias in the relevant literature outside of NLP, exploring the relationships between language and social hierarchies, and treat representational harms as harmful in their own right. This matters because real-world media is already biased, and those biases will be reflected in the NLP system. In U.S.-based sociolinguistics and beyond, there is a longstanding history of challenging deficit views of linguistic and cultural practices associated with racialized and socioeconomically marginalized populations.¹¹ For example, in the U.S., the portrayal of non-white speakers’ language varieties and practices as linguistically deficient helped to justify violent European colonialism, and today continues to justify enduring racial hierarchies by maintaining views of non-white speakers as lacking the language required for complex thinking processes and successful engagement in the global economy. In practice, one should ask the following questions:

  • Is the bias in my dataset relevant or is it noise?
  • Should I keep said bias?
  • Where is my dataset from? Is this source reliable in terms of bias?

Provide explicit statements of why system behaviors described as biased are harmful, in what ways, and to whom. Bias is a broad term that may affect different communities and populations in multiple ways. For example, there are systems that read résumés to automatically screen job candidates. If such a system uses the historical data of a company that did not enforce gender quotas, it will discriminate against women: more men than women will be hired, and this has a real-world impact. Another possible example is language models based on Twitter data. If tweets containing features associated with AAE (African-American English) are scored as more offensive than tweets without these features, then this might

  1. yield negative perceptions of AAE;
  2. result in disproportionate removal of tweets containing these features, impeding participation in online platforms and reducing the space available online in which speakers can use AAE freely;
  3. and cause AAE speakers to incur additional costs if they have to change their language practices to avoid negative perceptions or tweet removal.

To solve these problems, one should ask the following questions:

  • What communities are possibly being harmed?
  • What is the real impact of this bias?

Involve diversity in every step of the product. For this:

  • data annotators should at least be representative of the data they are annotating;
  • one should apply bias criteria to model evaluation, that is, if a given model demonstrates higher levels of bias than others, it should be penalized;
  • one should develop technology products for all communities.

These recommendations may seem too obvious to be of much help, but not everyone is willing to follow them. For example, in the technology industry, speakers of AAE are often not considered consumers who matter. In Race After Technology, Ruha Benjamin recounts an anecdote from a former Apple employee, part of a team that developed speech recognition for Siri. As they worked to improve Siri’s ability to comprehend different dialects, the employee asked his boss why they were not considering African-American English. “To this, his boss responded, ‘Well, Apple products are for the premium market’”.¹² The reality, of course, is that speakers of AAE tend not to represent the “premium market” precisely because of institutions and policies that help to maintain racial hierarchies by systematically denying them the opportunities to develop wealth that are available to white Americans — an exclusion that is reproduced in technology by countless decisions like the one described above.

Conclusions

Software engineers and computer scientists live in an era in which they can play God. Software has quickly taken root as a central piece of our lives, and it has never been easier to create new machine learning (and, more specifically, NLP) models that work really well mathematically. But since our societies have not developed to be perfect, to put it mildly, these models tend to follow the same path. It is therefore our responsibility to fix the problems that we ourselves are exacerbating.

[1] (n.d.). Bias | Journal of Epidemiology & Community Health. Retrieved May 9, 2022, from https://jech.bmj.com/content/58/8/635

[2] (n.d.). A Survey on Bias in Deep NLP | HTML — MDPI. Retrieved May 9, 2022, from https://www.mdpi.com/2076-3417/11/7/3184/htm

[3] (n.d.). What is GPT-3? Everything You Need to Know — TechTarget. Retrieved May 10, 2022, from https://www.techtarget.com/searchenterpriseai/definition/GPT-3

[4] (n.d.). A Survey on Bias in Deep NLP | HTML — MDPI. Retrieved May 10, 2022, from https://www.mdpi.com/2076-3417/11/7/3184/htm

[5] (2021, May 19). Sharing learnings about our image cropping algorithm — Twitter Blog. Retrieved May 10, 2022, from https://blog.twitter.com/engineering/en_us/topics/insights/2021/sharing-learnings-about-our-image-cropping-algorithm

[6] (2021, May 20). Twitter finds racial bias in image-cropping AI — BBC News. Retrieved May 10, 2022, from https://www.bbc.com/news/technology-57192898

[7] (2020, September 21). Twitter’s Photo Cropping Algorithm Draws Heat for Possible Racial …. Retrieved May 25, 2022, from https://petapixel.com/2020/09/21/twitter-photo-algorithm-draws-heat-for-possible-racial-bias/

[8] (2021, August 20). Five sources of bias in natural language processing — Hovy — 2021. Retrieved May 11, 2022, from https://compass.onlinelibrary.wiley.com/doi/full/10.1111/lnc3.12432

[9] (2020, May 29). Language (Technology) is Power: A Critical Survey of “Bias” in NLP. Retrieved May 11, 2022, from https://arxiv.org/abs/2005.14050

[10] (2019, June 21). Mitigating Gender Bias in Natural Language Processing — arXiv. Retrieved May 17, 2022, from https://arxiv.org/abs/1906.08976

[11] (2017, September 11). Unsettling race and language: Toward a raciolinguistic perspective. Retrieved May 12, 2022, from https://www.cambridge.org/core/journals/language-in-society/article/unsettling-race-and-language-toward-a-raciolinguistic-perspective/30FFC5253F465905D75CDFF1C1363AE3

[12] (2019, December 20). The Times Literary Supplement — December 20, 2019 — Exact Editions. Retrieved May 12, 2022, from https://reader.exacteditions.com/issues/85779/spread/3
