Diversity in AI is not your problem, it’s hers

Nov 11, 2019

Update: in September 2020, a paper by Alex (Carmen) Morrison and myself based on this research was accepted to the 2020 Conference on Empirical Methods in Natural Language Processing:

Monarch, Robert (Munro) and Alex (Carmen) Morrison. 2020. Detecting Independent Pronoun Bias with Partially-Synthetic Data Generation. The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Presentation for the EMNLP 2020 paper

See the video and paper above for the version of this article that passed peer review. Only 16.7% of short papers submitted to EMNLP 2020 were accepted, so we are grateful to the reviewers of this conference for accepting our paper.

The original article from November 2019 is below. Please note that this research is no longer featured in my book, Human-in-the-Loop Machine Learning. Because this work was accepted as a conference paper, and due to other world events since then, I have replaced it in the book with a disaster response-related example.

I came to a shocking conclusion while writing about diversity for my book on machine learning: diversity in Artificial Intelligence is not your problem, it’s hers.

I mean, of course, that the problem is with the English pronoun “hers”. There is a bias against “hers” in most major AI systems today, and the source of the bias is the perfect metaphor for bias in AI more broadly. As you might remember from high school, “hers” is a pronoun. Each word in a sentence belongs to one of a small number of categories: nouns, pronouns, adjectives, verbs, adverbs, etc. One common building block in many AI applications is to identify the right category for each word in raw text.

Today, “hers” is not recognized as a pronoun by the most widely used technologies for Natural Language Processing (NLP), including (alphabetically) Amazon Comprehend, Google Natural Language API, and the Stanford Parser. I discovered this recently and you can see more in this video:

Examples of “hers” missed as a pronoun

The video shows that in the sentence “the car is hers”, Amazon and Google classify “hers” as a noun and the Stanford parser classifies “hers” as an adjective. They don’t make the same mistake with the sentence “the car is his”, correctly identifying “his” as a pronoun.

These demos are all free to use online, so you can test them yourself. (If you are reading this article several months after it was published, hopefully you will find that these technologies no longer make these errors!)

Any technology that extracts information from text needs to know about pronouns. We use pronouns more than actual names in our sentences! Here’s an example of what this can look like with different pronouns:

Different sentence structures that convey the same information. The green arrows indicate possession, and the purple arrows indicate who the pronoun refers to.

If we want to extract information as simple as Cameron possessing a car, we first need to map “Cameron” to the right pronoun and then we need to map the pronoun to the “ownership” relationship in the different ways that it can be expressed. If we miss any of the pronouns, we don’t get to capture that information. We use pronouns much more frequently than the entities they refer to, so this is a big gap.

This graphic is also a good snapshot of where AI is today. Only the most recent research has shown that you can do both the pronoun mapping and possession mapping in a single machine learning system, but in industry these are almost all still separate systems. We are obviously a long way from AI that can deeply understand languages.

The technologies also got the pronoun “mine” wrong in some contexts like “the car is mine”. So, I guess this article could have also been titled:

Diversity is not her problem, it’s mine.

Because the problem is mine, I’ve also created the solution. I talk about the solution later in this article after going into more depth about the causes. This article is an excerpt from an upcoming chapter in my book, Human-in-the-Loop Machine Learning, which goes into much more technical detail about diversity problems in AI and their solutions.

The “hers” error is a widespread bias that I found in almost every major Natural Language Processing library and product today. I shared these three out of familiarity: I led Amazon Comprehend, I was part of the Stanford NLP Group that created their parser, and I‘ve been a launch partner for Google Cloud’s AI products.

Why is “hers” not recognized as a pronoun?

There are five reasons why this error occurs in the major technologies: the algorithms are trained on data with gender imbalances; the algorithms are trained on narrow genres of data; the datasets are not correctly labeled; domain experts were not consulted; and there are underlying linguistic differences between Masculine and Feminine pronouns in English.

In spite of what sensationalist media coverage about “Bias in AI” tells you, “algorithm bias” is rarely the cause of bias. Other potential causes that did not produce the errors here are an inherent bias in the language and the unconscious bias of people building the algorithms.

Before the causes are covered in more detail, you’ll need to revisit your high school English Grammar classes, where you learned that sentences are made up of constituents like Subjects, Verbs, and Objects. The only extra category that you need to know about in this article is the Possessive (eg: “his car”, “her car”, “Cameron’s car”). These categories are especially meaningful for pronouns in English because they determine which pronoun we use.

The Singular Personal Pronouns in English form a reliable pattern. Each pronoun fills a particular grammatical position in a sentence, as in these examples for some of the English pronouns:

Singular Personal Pronouns in English (ordered alphabetically by Object). Each column can be the correct pronoun for each example sentence, showing how English pronouns pattern with grammatical categories.

When you speak English, you use all these different pronouns in the correct grammatical positions in the sentence without thinking about it. For example, when you talk about yourself, you use one of “I”, “me”, “my”, “mine”, or “myself” depending on whether the pronoun is the Subject, Object, Dependent Possessive, Independent Possessive, or Reflexive.

The highlighted examples show where some pronouns double-up in English: we use “her” for both the Object and Dependent Possessive; we use “his” for both the Dependent and Independent Possessive; and we use “you” for both the Subject and Object Second Person. Only the last column, the Reflexive, has no overlap with the other columns. The Reflexive has “[Subj]” because the correct reflexive depends on the Subject of the sentence. This makes the Reflexive extra-interesting, linguistically, but it is not relevant to this article.
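To make the pattern concrete, here is a minimal Python sketch that encodes four rows of the table as plain data and finds the forms that double up. The row labels and the function name `double_ups` are illustrative choices, not taken from any library or from the datasets discussed here.

```python
from collections import defaultdict

# Grammatical categories, in the same order as the columns described above.
CATEGORIES = (
    "subject",
    "object",
    "dependent_possessive",
    "independent_possessive",
    "reflexive",
)

# A subset of the singular personal pronouns (row labels are illustrative).
PARADIGM = {
    "first_person": ("I", "me", "my", "mine", "myself"),
    "second_person": ("you", "you", "your", "yours", "yourself"),
    "feminine": ("she", "her", "her", "hers", "herself"),
    "masculine": ("he", "him", "his", "his", "himself"),
}


def double_ups(paradigm):
    """Return, per row, the forms that fill more than one category."""
    result = {}
    for row, forms in paradigm.items():
        categories_by_form = defaultdict(list)
        for category, form in zip(CATEGORIES, forms):
            categories_by_form[form].append(category)
        shared = {
            form: cats
            for form, cats in categories_by_form.items()
            if len(cats) > 1
        }
        if shared:
            result[row] = shared
    return result


for row, shared in double_ups(PARADIGM).items():
    print(row, shared)
```

The output flags “you”, “her”, and “his”, and nothing for the First Person row, which matches the double-ups described above.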

If you hadn’t thought about this pattern of pronouns in English before, you’re not the only one:

“I never noticed that him/his and her/hers aren’t the same grammatical patterns!”
- Native English speaker with a PhD in Linguistics from one of the world’s top Universities

Just like this person (who will remain nameless), if you are a native speaker of English, you might not have realized that the Masculine and Feminine pronouns aren’t (grammatically) the same as each other before reading this article.

The fascinating truth is that your brain already knew the difference! How often do you accidentally use “her” instead of “hers” or “his” instead of “he” in a sentence? Probably never. Every time you construct a brand new sentence, you always use the correct column above, that is, the correct grammatical category of pronoun. You might not refer to someone with the pronoun of their choice, but then your error is in the row above, while you are always in the correct column.

So, you do encode Masculine, Feminine and Gender Neutral pronouns differently at a subconscious level. In case you’re saying to yourself that you would have noticed the difference between him/his and her/hers, but you had never focused on it before, then look again at the examples with Cameron’s cars:

See it now? The different patterns for her/hers and him/his were right in front of you, but because you generally don’t use pronouns in conscious thought, you probably did not notice it.

The differences can be right in front of you, but you don’t notice until it is pointed out.

Here are 5 causes and 3 non-causes for the “hers” error in technologies today:

Cause of bias #1: The algorithms are trained on data with gender imbalances

All major machine learning technologies are trained from human-labeled datasets. For the task we are looking at here, these datasets are created by humans labeling words as a noun, verb, pronoun, etc, and then the machine learning algorithms learn from those labeled examples.

So, the source of the data matters a lot. The majority of examples in popular datasets are drawn from news articles. I checked the two most well-known datasets for English: the Penn Treebank and Universal Dependencies. Masculine pronouns are 3x more frequent than Feminine pronouns in the Penn Treebank and 4x more frequent in Universal Dependencies.
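A count like this is simple to reproduce. The sketch below is one rough way to do it over raw sentences. It is not the script behind the numbers above, the three example sentences are made up, and a real check would read tokens from the treebank files rather than splitting on whitespace.

```python
from collections import Counter

MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}


def gender_pronoun_ratio(sentences):
    """Count gendered pronouns and return (counts, masculine/feminine ratio).

    The ratio is None when no feminine pronoun is found.
    """
    counts = Counter()
    for sentence in sentences:
        for token in sentence.lower().split():
            # Crude tokenization: strip surrounding punctuation only.
            word = token.strip(".,;:!?\"'()")
            if word in MASCULINE:
                counts["masculine"] += 1
            elif word in FEMININE:
                counts["feminine"] += 1
    if counts["feminine"] == 0:
        return counts, None
    return counts, counts["masculine"] / counts["feminine"]


# Made-up sentences, for illustration only.
toy_sentences = [
    "He said his team won.",
    "She agreed.",
    "The board thanked him.",
]
counts, ratio = gender_pronoun_ratio(toy_sentences)
print(dict(counts), ratio)
```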

This means that there is a bias in news articles reporting about men more than about women and this bias is carried over to the datasets that are the labeled examples for the major NLP algorithms.

Cause of bias #2: The algorithms are trained on narrow genres of data

News articles are a very narrow genre. They rarely use the Independent Possessive pronouns. This means that instead of writing “hers was fast”, a journalist will favor “her car was fast”, even if it was obvious that “hers” referred to a car.

In fact, in the portion of the Penn Treebank dataset that I looked at, the Independent Possessive does not appear even once for either “hers” or “his”!

How often do you say “it’s yours”, “it’s mine”, “is that hers?”, or “is that his?” while sitting around a table with people? You use these words all the time in your daily speech when what you are referring to is clear, but these types of pronouns are almost completely absent in news articles.

This problem is known as “domain dependence” because in this case the datasets have mostly been limited to the domain of news articles. Domain Dependence is one of the biggest problems in Machine Learning. It is just as true for Computer Vision examples as for language: if you train a Machine Learning model on a narrow genre/domain of data, it will struggle with accuracy outside of those examples.

Cause of bias #3: The datasets are not correctly labeled

The Universal Dependencies datasets only have three examples of “hers” in total and none are fully labeled as Independent Possessive pronouns. Even if they were labeled correctly, there might not have been enough for the machine learning algorithms to correctly learn the “hers” pronoun.

Cause of bias #4: Domain experts were not consulted

This is a weaker probable cause, but it is worth highlighting. The error could have been found sooner if the right domain experts were consulted.

For field linguists, their main job is identifying how a given language divides up its grammar into categories like Subject, Object and Possessive. So, you probably missed the differences with Cameron’s cars, but a trained field linguist would have been looking for exactly this if employed to do so.

There are many linguists who work in NLP, but most aren’t trained to study languages holistically like field linguists. Emily Bender and Batya Friedman from the University of Washington recently recommended that AI practitioners adopt more practices from field linguists to be more transparent about datasets, and I recommend that everyone follows their lead.

Cause of bias #5: Linguistic differences between Masculine and Feminine pronouns in English

The last cause of bias is a completely arbitrary one: the fact that “her/hers” is a distinction that patterns differently to “him/his”:

In the existing datasets there are 100s of examples of the Dependent Possessive “his” as in “his car”. So, the NLP systems can learn that “his” is a pronoun in the Dependent context and then guess correctly because it’s the same word in the Independent context. This isn’t possible for “her/hers” with the different spellings.
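A deliberately crude sketch shows the mechanism. The tagger below just memorizes the most frequent tag for each spelling, which is far simpler than the systems discussed in this article, and the three training sentences are made up. The point is only that a spelling seen in one context carries over to another, while an unseen spelling falls back to a default guess (assumed here to be NOUN).

```python
from collections import Counter, defaultdict


def train_lookup_tagger(tagged_sentences):
    """Memorize the most frequent tag for each lowercased word."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word.lower()][tag] += 1
    return {
        word: counts.most_common(1)[0][0]
        for word, counts in tag_counts.items()
    }


def tag(model, words, default="NOUN"):
    """Tag each word by lookup; unseen words get the default tag."""
    return [(word, model.get(word.lower(), default)) for word in words]


# Made-up training data: only the Dependent Possessive appears,
# as in news-style text ("his car", "her car").
TOY_TRAINING = [
    [("his", "PRON"), ("car", "NOUN"), ("is", "AUX"), ("fast", "ADJ")],
    [("her", "PRON"), ("car", "NOUN"), ("is", "AUX"), ("fast", "ADJ")],
    [("the", "DET"), ("car", "NOUN"), ("is", "AUX"), ("new", "ADJ")],
]

model = train_lookup_tagger(TOY_TRAINING)
print(tag(model, ["the", "car", "is", "his"]))   # "his" was seen: PRON
print(tag(model, ["the", "car", "is", "hers"]))  # "hers" was never seen: NOUN
```

Real taggers also use context and sub-word information, so this overstates how blunt the effect is, but the asymmetry between the shared spelling “his” and the separate spellings “her/hers” is the same.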

This might be the most important lesson to learn here: harmless differences in human speech can become biases in machine learning. Causes 1, 2, 3 & 4 could have been absent, but a linguistic difference that is not inherently biased could still result in a biased machine learning model.

Not a cause of bias #1: inherent bias in the language

The different patterns for him/his and her/hers in English are most likely for phonological reasons that were not the result of inherent gender bias. For example, it’s likely that there used to be a “his’s” but the double “s” was awkward to pronounce and got lost over time. This is really common across languages. If you are interested in learning more about pronouns in English and how they’ve changed, I recommend this recent Lexicon Valley podcast by John McWhorter.

While it didn’t contribute to bias here, it is a fair question to ask as inherent gender bias can occur in languages. For example, the equivalent of “them” in Spanish roughly translates to “hims”. When there is a group of people of multiple or unknown genders, the Masculine word is used. This does reflect historic gender inequalities that still exist today and are encoded directly into the language. It’s one reason why the “Latinx” movement exists: to replace the gendered “Latino” and “Latina”. See How Does Grammatical Gender Affect Noun Representations in Gender-Marking Languages? by Hila Gonen, Yova Kementchedjhieva, and Yoav Goldberg for more approaches to resolving bias in languages that encode gender (German and Italian in their paper).

However, the overwhelming majority of ways that languages vary at the grammatical level has little to do with the culture of the societies that speak those languages. That is the case for the independent possessive pronouns in English.

Not a cause of bias #2: inherent bias in the algorithms

In current machine learning, there are very few ways that the algorithm itself can be biased. This is one of the most misleading things that you commonly read about AI: that the algorithms themselves are biased.

Algorithms take in the data they are trained on. There is no bias in the algorithms themselves, except (arguably) ignoring some rare data points that they should probably pay attention to. Algorithms are simply trying to fit that data to make similar predictions to what humans have labeled in the example training data.

Not a cause of bias #3: the people building the algorithms coded their unconscious biases into the programs

Because machine learning no longer uses hand-coded rules, there aren’t many ways that people can encode their bias into the technologies.

The only way that an error could have occurred would be in the annotation of the data. That is, if the people creating the datasets were less consistent in how they annotated different genders of pronouns. This isn’t the case here. “Hers” and the Independent “his” are annotated consistently in the datasets.

It wasn’t overlooked in the annotation guidelines, either. In fact, “hers” appears in two examples in the Universal Dependencies guidelines and four more times in translations from other languages (https://universaldependencies.org/u/overview/simple-syntax.html).

There was also a workshop focused on gender at the most recent global conference on Natural Language Processing, which shows that gender and pronouns are not overlooked in the NLP research community. The biggest focus was a shared research project focused on pronouns in English. The organizers confirmed with me that they didn’t include Possessives in their study, so the problems in this article didn’t come up.

There are many ways that people who work in technology might encode their unconscious biases into the applications that they are building, but that was not the case here.

The solution: open data with the right examples

There is little point in complaining about bias in AI if you can’t suggest a solution.

I decided to fix the problem directly, with both data and new Machine Learning innovations. I created a new dataset annotated with many examples of “hers” and I added it to the Universal Dependencies collection, an open source project that releases new data every six months. The dataset will be in the next release in just a few weeks’ time (November 2019):

UniversalDependencies/UD_English-Pronouns (github.com): UD English-Pronouns is a dataset created to make pronoun identification more accurate and with a more balanced…

Other pronouns

I also noticed that the singular “they” did not exist in any of the major datasets. This is because of all the same reasons as “hers”, and probably also because some people still cling to the mistaken belief that “they” was historically only plural.

For examples that should be unambiguously singular, like “the shadow is theirs”, the main technologies today all incorrectly classify “theirs” as plural. (It’s possible that multiple people could cast a shadow, but that shouldn’t be the default interpretation!)

So, I also included examples in the new dataset that are unambiguous examples of singular “theirs”, which should now allow those technologies to consider the singular interpretation of “theirs”.

Tip: When you see someone upset that “they” is both plural and singular, ask them if they are also upset that “his” is both a Dependent and Independent Possessive.

There are many other pronouns that are 100% valid in different varieties of English. I have personally seen: “ya/yez” a singular/plural distinction for “you” in Australian English that is especially common among speakers of Aboriginal and Torres Strait Islander languages that have a singular/plural distinction in those languages; “yous” and “y’all” as a plural for “you” in UK and US English; and “e” in Sierra Leonian English, borrowed from Sierra Leonian Krio, as a gender-neutral pronoun used in place of “she/her/he/him” and singular “they/them”. Similar to the gender-neutral Sierra Leonian “e”, many people have suggested adopting a gender-neutral “ze” in English. For a longer (but still non-exhaustive) list of pronouns that are used in English, I recommend looking at Pronoun Island.

To support any other variation, I wrote the code so that the dataset can be easily extended to other pronouns. I also included “mine” in this dataset, because of the errors I found. Although there’s not an obvious gender bias with “mine”, the systems will still be more accurate when they know that “mine” is a pronoun and not a hole in the ground (“gold mine”). I’m making the code free and open source and it doesn’t require programming knowledge to edit the code to add more pronouns.
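As an illustration of why this kind of extension can be cheap, here is a minimal template-filling sketch. It is not the code released with the dataset: the templates, the item list, and the function name `generate_sentences` are hypothetical, and the real dataset also carries full grammatical annotations, which this sketch omits.

```python
from itertools import product

# Hypothetical sentence templates with slots for an item and a pronoun.
TEMPLATES = [
    "the {item} is {pronoun}",
    "that {item} was {pronoun}",
]

# Independent Possessive pronouns mentioned in this article.
# Adding another pronoun is a one-line change to this list.
INDEPENDENT_POSSESSIVES = ["hers", "his", "mine", "theirs"]

ITEMS = ["car", "shadow"]


def generate_sentences(templates, items, pronouns):
    """Fill every template with every item/pronoun combination."""
    return [
        template.format(item=item, pronoun=pronoun)
        for template, item, pronoun in product(templates, items, pronouns)
    ]


sentences = generate_sentences(TEMPLATES, ITEMS, INDEPENDENT_POSSESSIVES)
print(len(sentences))
print(sentences[:4])
```

Because every pronoun appears in exactly the same set of sentences, the generated data is balanced across pronouns by construction.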

Easier and harder problems to solve

Sometimes, what seems like a difficult problem can be easy to fix, and this can be true for bias in AI. It took me just one day to create enough data to train on so that any grammatical parser in the future will correctly understand “hers” as a pronoun and that “theirs” can be singular or plural.

Most problems with bias in AI are the same. They can be hard to detect, but when they are correctly diagnosed the solutions are easy and often come down to creating the right training data. 90% of the problems I’ve seen in Machine Learning have been this easy to fix. The organizers of the workshop focused on gender in NLP said that most of the solutions to the pronoun task at the workshop were relatively simple, too.

But, some problems are much, much harder to solve. This is the case for the most popular technologies for pre-trained models today.

Gender bias in pre-trained models like BERT

I have only solved one dimension of the “hers” problem with the pronoun dataset. If an application using machine learning needs to predict the frequency of each word, then you would still have a problem. Machine learning building blocks like Google’s BERT system are one example of where this problem can occur. One core piece of BERT’s architecture is trying to predict which word can occur in a sentence, trained over large amounts of raw data.

The downside of this aspect of BERT is that the raw frequency of words matters. We don’t need datasets with an equal number of “hers” and (independent) “his” to have a fair machine learning system that identifies pronouns. However, we probably do need an equal number of each pronoun type to avoid bias in pre-trained models like BERT.

I looked at possessive pronoun bias in BERT using a method adapted from this recent paper: Quantifying Social Biases in Contextual Word Representations by Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. To summarize the method, I measured bias by seeing if a sentence like “the car is his” is preferred over “the car is hers” within BERT:

Step 1: Using the new English Pronouns dataset that I created, I extracted BERT’s predictions for what items other than “cars” are the most likely to be possessed. For example, BERT was asked to guess what the blank word would be in 50 sentences like “the ___ is theirs”. (For the technical among you, I used Monte Carlo Sampling to generate multiple items for each sentence and kept generating new items until Good-Turing estimates indicated that additional new items were unlikely.)

Step 2: The first step resulted in a little over 100 items. This included concrete items like “camera” and “world” and abstract items like “night” and “instincts”. Because these items were all predicted to be the most likely item in a given context, it means that we can be confident that they aren’t low-frequency items that will make BERT produce erroneous results. I manually removed about 10 examples that didn’t make sense in the sentences or were a plural where a singular also existed.

Step 3: With the list of 104 items from Step 2, I generated sentences that use each item to predict the pronoun in the sentence. For example, BERT was asked to guess what the blank word would be in 1000s of sentences like “the camera is ___”, “the world is ___”, “the night is ___”, etc. I measured whether “hers” or “his” was a higher prediction by BERT, and by what probability.
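The stopping rule mentioned in Step 1 can be sketched as follows. The simple Good-Turing estimate of the chance that the next sample is a new item is the number of items seen exactly once divided by the total number of samples. The threshold, the minimum sample count, and the stand-in sampler below are assumptions for illustration; the article does not state the exact values or procedure used, and in the real setting the sampler would draw a word from BERT’s predicted distribution for the blank.

```python
import random
from collections import Counter


def unseen_mass(counts):
    """Good-Turing estimate of the probability that the next sample is new:
    (number of items seen exactly once) / (total number of samples)."""
    total = sum(counts.values())
    if total == 0:
        return 1.0
    singletons = sum(1 for count in counts.values() if count == 1)
    return singletons / total


def sample_until_saturated(sampler, threshold=0.05, min_samples=20,
                           max_samples=10_000):
    """Keep sampling until new items are estimated to be unlikely.

    threshold, min_samples and max_samples are assumed values.
    """
    counts = Counter()
    for n in range(1, max_samples + 1):
        counts[sampler()] += 1
        if n >= min_samples and unseen_mass(counts) < threshold:
            break
    return counts


# Stand-in sampler over a tiny made-up vocabulary (not BERT output).
rng = random.Random(0)
vocabulary = ["car", "house", "world", "night"]
counts = sample_until_saturated(lambda: rng.choice(vocabulary))
print(sorted(counts), unseen_mass(counts))
```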

Of the 104 items, only one item, “mom”, was preferred for “hers” over “his”:

Objects that are “his” in BERT:

action, answer, baby, back, best, blood, bodies, body, box, boy, business, camera, car, city, clothes, crew, customers, deal, dealer, door, drawers, drivers, drugs, engines, everything, eye, face, family, father, first, fish, floor, friends, front, girl, glass, goods, hair, hand, head, heart, horses, house, innocence, instincts, island, jewelry, job, junk, kid, land, last, leg, life, likes, lot, men, mess, minds, money, mother, name, night, one, paint, painting, parents, party, past, people, place, pleasure, pockets, power, product, rest, room, same, scent, sheriff, ship, shit, shoes, shop, soul, streets, stuff, sun, sword, table, team, things, tires, town, toys, tracks, two, water, way, wheels, windows, work, world

Objects that are “hers” in BERT:

mom

Here’s the ordered breakdown:

The BERT preference for “hers” and (independent) “his”. The numbers are ratios. For example, “mom” is 7.4 times more likely to make BERT predict “hers” than “his” when averaged across all contexts, and “money” is 23.0 times more likely to make BERT predict “his” than “hers”.
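One way to compute a ratio like this is sketched below. It assumes the averaging is done over the per-context probabilities before dividing, which is a guess at the procedure rather than a description of it. The `score` argument is a hypothetical function that returns a model’s probability for a candidate word in the blank; here it is replaced by a lookup table of placeholder numbers, not by BERT, and the numbers are not results.

```python
def preference_ratio(item, contexts, score):
    """Return (preferred_pronoun, ratio) for one item.

    score(sentence_with_blank, candidate) must return a probability.
    """
    sentences = [context.format(item=item) for context in contexts]
    mean_hers = sum(score(s, "hers") for s in sentences) / len(sentences)
    mean_his = sum(score(s, "his") for s in sentences) / len(sentences)
    if mean_his >= mean_hers:
        ratio = mean_his / mean_hers if mean_hers > 0 else float("inf")
        return "his", ratio
    ratio = mean_hers / mean_his if mean_his > 0 else float("inf")
    return "hers", ratio


# Hypothetical contexts; "___" marks the blank to be predicted.
CONTEXTS = ["the {item} is ___", "that {item} was ___"]

# Placeholder probabilities standing in for a masked language model.
TOY_PROBS = {
    ("the money is ___", "his"): 0.5,
    ("the money is ___", "hers"): 0.125,
    ("that money was ___", "his"): 0.25,
    ("that money was ___", "hers"): 0.0625,
}


def toy_score(sentence, candidate):
    return TOY_PROBS[(sentence, candidate)]


print(preference_ratio("money", CONTEXTS, toy_score))
```

Swapping `toy_score` for a function that queries a real masked language model would leave the rest of the sketch unchanged.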

The world, and almost everything in it, are “his” according to BERT. The outlier, “mom”, is probably because BERT is erroneously applying gender to the entire sentence, even though a “mom” could be “his” or “hers” equally in the real world. This gendering of the entire sentence is probably the case for “ship” almost being “hers”, too: because (in English) we refer to ships with female pronouns.

The most worrying example is that “action” is almost 70 times more likely to be “his” than “hers”. There is a good chance that this reflects the inherent bias in agency ascribed to different genders in the language that BERT was trained on.

One factor in the bias could be that “his” is both types of possessives, compared to “her/hers”, which would make “his” more frequent even in balanced data. This could result in “his” being more likely to be predicted by BERT than “hers” simply because BERT doesn’t explicitly capture that the same word can be in multiple linguistic categories. If so, this would be an example of how a non-biased linguistic difference can become a biased one if we are not careful about how machine learning models are trained on that data.

Even if we try to bias BERT towards “hers” and “theirs”, it prefers “his” most of the time.

Even when BERT is queried to predict the items most likely to be “hers” or “theirs”, those items are still predicted to be “his” more often. The diagram above shows this. That is, even if we try to bias BERT towards the most “hers” and “theirs” items, the preference is still for “his” most of the time.

This BERT analysis was my starting point for the “hers” error that I discovered. The plan was to help models like BERT with a method where the items with the biggest bias could be found programmatically and then new sentences created that replaced pronouns in order to counter that bias. This technique is a combination of what’s known as generating adversarial examples and data augmentation. They are simple but surprisingly effective techniques.

But it wasn’t possible to implement data augmentation without knowing which category of pronoun was being used for the ambiguous “his” and “her” examples. And it wasn’t possible to know which category of pronoun was being used, because the existing technologies could not accurately identify all the pronouns.
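The dependency is easy to see in a small sketch of pronoun swapping. The swap table and the function name `swap_gender` below are illustrative, not the implementation planned for BERT. Each token carries a grammatical category, and a token whose category is unknown (`None`) is left alone because the correct replacement cannot be chosen.

```python
# (word, category) -> replacement. The category is what disambiguates
# "his" (her / hers) and "her" (his / him).
SWAP = {
    ("he", "subject"): "she",
    ("she", "subject"): "he",
    ("him", "object"): "her",
    ("her", "object"): "him",
    ("his", "dependent_possessive"): "her",
    ("her", "dependent_possessive"): "his",
    ("his", "independent_possessive"): "hers",
    ("hers", "independent_possessive"): "his",
}


def swap_gender(tokens):
    """Swap gendered pronouns in a list of (word, category) pairs.

    Tokens without a matching (word, category) entry are kept unchanged.
    Capitalization is ignored to keep the sketch short.
    """
    return [SWAP.get((word.lower(), category), word) for word, category in tokens]


independent = [("the", None), ("car", None), ("is", None),
               ("his", "independent_possessive")]
dependent = [("his", "dependent_possessive"), ("car", None),
             ("is", None), ("fast", None)]
unknown = [("the", None), ("car", None), ("is", None), ("his", None)]

print(" ".join(swap_gender(independent)))  # the car is hers
print(" ".join(swap_gender(dependent)))    # her car is fast
print(" ".join(swap_gender(unknown)))      # the car is his (cannot swap)
```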

The global impact of BERT

Google announced a week ago that BERT will be used in 10% of searches.

I’m sure BERT’s creators appreciate the problems. For example, one of the authors of BERT, Kristina Toutanova, even talked about the domain dependence problem of the Penn Treebank in her 2005 PhD dissertation. These are problems that we have been working on in the machine learning community for some time and there aren’t always easy answers.

It took me one day to create a dataset that solves the pronoun identification problem for “hers”. BERT’s gender bias will not be fixed as quickly: I think the new dataset will significantly reduce it, but it is only one step towards solving it. I’m not yet sure how much of the singular “theirs” problem will be solved, but probably very little at most.

The distinction is between needing the pronouns to be represented in a dataset for a machine learning model to be fair and the harder task of needing that dataset to be representative for a machine learning model to be fair. That is, we only need the pronouns to be well-represented for a model built on Universal Dependencies to be fair. But we need representative data for machine learning systems like BERT to be fair. The finer details of the represented/representative distinction aren’t important for this article, but I go into more detail in my book if you are interested in learning more.

It is hard in almost every other language

I have no particular expertise in gender bias in AI. That’s encouraging, because it means that you don’t need expertise to solve problems with bias in AI. That doesn’t mean that you can just rush in: I spent more time reading about gender bias and consulting experts than implementing the solution. I recognize my privilege in getting a computer science and linguistics education and my access to people creating the most widely used NLP technologies.

My expertise and main area of passion is making AI fair and equally accurate for any language in which someone chooses to interact with technology.

English is a privileged language, with large volumes of data available and many existing technologies. If we are overlooking one of only eight gendered pronouns in English, then we are overlooking a lot more in other languages. If you use BERT-via-a-Google-search for “Top NLP Conferences”, the first result should be this image from my PhD:

Spellings for odwala (“patient”) in Chichewa and their translations into English

My PhD focused on messages sent in health-care and disaster response contexts in low resource languages. Many years later it is still very much an unsolved problem.

English’s pronoun system is one of the simplest in the world. English also has one of the simplest noun systems, with only three forms: a Singular/Plural distinction and a Possessive distinction (’s as in Cameron’s). Most languages look more like Chichewa, where there are 40 different spellings for “patient” in sentences that translate to just two spellings in English, “patient” and “patients”.

The errors for “hers” and singular “theirs” in the technologies today were in part because “hers” and “theirs” were rare in the data. For most of the world’s languages, most word forms will be rare. There can be dozens of pronouns in a language and sometimes thousands of different noun forms. (Although one thing you don’t have to worry about in Chichewa is the gender of pronouns: like the majority of languages in Africa, Chichewa does not have gendered pronouns.)

So, the problem with “hers” and “theirs” is a good metaphor for the kinds of biases that we encounter much more frequently in other languages. Some differences will reflect societal bias and some will not. In some cases, the machine learning models will cancel out that bias and in other cases, the machine learning models will amplify that bias. The best way to solve the problem is to carefully construct the right datasets that allow the machine learning algorithms to understand all the variations in a language.

Most of the work going into Universal Dependencies datasets today focuses on linguistic diversity, with more than 85 languages in the collection and 20 more being added in the November 2019 release. It makes for a very exciting collection: Universal Dependencies is easily the most important project tackling bias in AI today.

I found the “hers” pronoun problem while looking for an example for my book, where I argue that diversity is the responsibility of everyone building machine learning models and that it starts with the data.

I chose what should be one of the easiest bias problems to solve in AI: the most well-studied bias (gender), for the most well-studied language (English), for the most well-studied way that we express gender (pronouns). I was surprised to find that we haven’t solved it: one of the easiest problems to solve in AI was plain to see in the most popular technologies, but no-one had acted to fix that bias. My book was always going to have a focus on addressing bias in data, and it will be a larger part of the book now that it’s clear how far AI is from fairness.

Robert Munro

November 2019

This article was picked up by the New York Times and was their lead story for the business section on November 11th 2019: https://www.nytimes.com/2019/11/11/technology/artificial-intelligence-bias.html

Acknowledgments

Thanks to Alex Morrison aka Alex U. Inn [they/them] for feedback on the entire article and especially for insights on non-gendered pronouns!

Thanks to Christopher Manning, head of Stanford’s AI lab and NLP group, for pointing out labeling as a source of error to me when we talked about this briefly!

Thanks to Crystal and Katie for letting me use their child Cameron’s name in my examples!

Thanks to Emily Bender at the University of Washington for extensive feedback on the article!

Thanks to Kellie Webster and Will Radford, co-organizers of the First Workshop on Gender Bias in Natural Language Processing, for feedback and insight about that workshop!

All errors or omissions are not hers, his, or theirs, they are mine.