The theme of this year’s NAACL, which ended last week, was data bias and privacy, topics of great social consequence. On the former, many exciting contributions were presented on gender bias in NLP models and datasets.
The keynote “Data as a Mirror of Society: Lessons from the Emerging Science of Fairness in Machine Learning” by Arvind Narayanan (Princeton) was the first, but not the last, to bring the issue of gender bias in our NLP systems to the forefront of the discussion. One of Arvind’s slides was a new classic:
In the slide, Google Translate initially (on the left) surfaced an undesirable translation from a “genderless” language (Turkish) to a (partially) gender-marking language (English), then updated the algorithm (on the right; https://www.blog.google/products/translate/reducing-gender-bias-google-translate/) to offer two alternatives. You may have noticed that I put “genderless” in quotes. No language is genuinely “genderless” in the sense that its speakers do not talk, think about, or react to gender as a social construct. When I say “genderless” here, I mean languages that do not obligatorily encode gender grammatically, say, as affixes on pronouns, adjectives, nouns, etc. When we talk about gender in this grammatical way, we often label it “grammatical gender”.
Of course, grammatical gender might also reflect something about society, and gender translation failures are pernicious in this respect, since they can reinforce negative gender stereotyping. Many of NLP’s finest are working on it (Bordia & Bowman 2019, Gonen & Goldberg 2019, Qian et al. 2019, Zhao et al. 2019, Rudinger et al. 2017, just to name a few recent examples), and more should jump in! My two cents: we should involve a few more sociologists of gender here. Gender, as a sociopolitical category, clearly isn’t monolithic. Nor does it consist of exactly two or three “natural” classes (“natural” is a term which itself has been discussed at length in the morphological literature in linguistics: it is, at best, vastly incorrect, and at worst, socially harmful). Linguists have long explored the relationship between gender and society as mediated by language (Corbett 1991 et seq. for an overview, Burnett & Bonami 2019 for a recent, nice example), and we are becoming ever more motivated to investigate it. Towards more nuanced, interdisciplinary discussions of gender and grammar, some linguists recently held a conference at Queen’s University in Canada to discuss “nonbinary gender (usage, users, and user experiences) in language” (THEY2019), some of which may be of interest to NLP-ers.
In NLP though, much of the gender work I know of presupposes a relationship between grammatical gender, which is present on nouns in the lexicon of gender-marking languages, and sociopolitical or cultural gender, which can tell us something about culture and society. Here, I will make a very conscious attempt, to the extent that it is possible, to split these two. In this post, I focus on the grammatical “piece” of gender, and situate it within a larger, more general typology of grammatical systems. In short, I want to remind everyone that translation mismatches (Kameyama et al. 1991) are a widespread and well-recognized grammatical issue in machine translation. Gender mismatches are only one type. Let’s return to our first Turkish example, reprinted below:
Let’s dig into this a little more. In the original translation on the left, cultural biases become built into the translation: the sentence was translated into English using “he”, which suggests the model encoded the often-false assumption that since the person being referred to is a doctor, that person must be male, despite there being no explicit grammatical evidence for that. Perhaps this is a fact about the statistics of the corpora used to train the system? Open question.
While Google Translate rightly addressed this issue for Turkish, unfortunately, lower-resource languages like Indonesian still default to the masculine gender form, even in single-sentence translations with only one pronoun: the gender-underspecified pronoun dia and the suffix -nya both get translated as masculine, as of yesterday (p.c. Clara Vania, see this link: https://translate.google.com/#view=home&op=translate&sl=id&tl=en&text=Ibunya%20tinggal%20di%20New%20York%0ADia%20tinggal%20di%20New%20York). That should give you a hint of how challenging this problem still is.
In my view, such translation failures stem from a few considerations:
(1) translation setting where the source language doesn’t encode a grammatical specification that the target requires, and
(2) contextual information is insufficient for a good *guess*, so
(3) a model outputs a bad or biased *guess*
I will call situations like the one described in (1) translation mismatches. These are necessary, but not sufficient, conditions for biased gender translation failures. Stereotypes and bias in our culture(s) feed off underlying translation mismatches; at step (3), society-level negative gender stereotyping acts as another necessary, but insufficient, condition for biased guessing.
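To make condition (1) concrete, here is a minimal sketch of how one might flag a potential translation mismatch from a table of which grammatical specifications each language obligatorily encodes. The feature table below is a toy assumption of mine for illustration, not real typological data:

```python
# Toy table of obligatorily-encoded grammatical features per language.
# These values are illustrative simplifications, NOT typological fact
# (e.g., real languages are only partially "tenseless" or "genderless").
OBLIGATORY_FEATURES = {
    "turkish":  {"tense", "evidentiality"},
    "english":  {"tense", "pronoun_gender"},
    "mandarin": set(),
    "japanese": {"honorifics"},
}

def mismatched_features(source: str, target: str) -> set:
    """Features the target requires but the source never encodes:
    exactly condition (1), where a model is forced to *guess*."""
    return OBLIGATORY_FEATURES[target] - OBLIGATORY_FEATURES[source]
```

For instance, `mismatched_features("turkish", "english")` would return `{"pronoun_gender"}`: the Turkish source gives no grammatical evidence for the gender English pronouns demand.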
As a linguist, I am most qualified to talk about (1); although it’s arguably the least pressing of the three, I will focus on it here. Perhaps, if we could find a way to better handle the general, grammatical contribution to translation failures, we might find new ideas for generalizing and improving translations. Maybe this would have useful consequences for translation mismatches with genuine societal impact, like for gender.
Why should we think harder about the grammatical factors that lead to translation mismatches? Well, they are very widespread in language (I’m clearly not the first to notice them), so doing well on them will be important if we want quality translations. Let me give you a few more examples to motivate this.
Just as Turkish can be called “genderless”, other languages, like Mandarin Chinese, can be thought of as (at least partially) “tenseless” (Lin 2010, Sybesma 2007, Matthewson 2006). More specifically, Mandarin Chinese possesses fewer overt linguistic markers straightforwardly specifying future or past tense meanings. Tense meanings in Mandarin Chinese are often encoded differently than in English: a Mandarin Chinese sentence may include temporal adverbials like “yesterday”, which imply that the event happened in the past, or grammatically complex aspectual markers might allow us to infer past tense meaning. Given this, if we translate a “tenseless” Mandarin Chinese sentence into English, the model will have to *guess* which tense is appropriate, for the sole grammatical reason that English morphologically requires tense on the main verb, but Mandarin Chinese doesn’t.
First, let’s find an example where English is correctly translated into “tenseless” Mandarin Chinese:
Here, two English sentences that differ in tense collapse into a single Mandarin Chinese translation; we have thus verified Mandarin Chinese does not overtly mark tense (for this kind of sentence). Let’s switch direction, to evoke the translation mismatch. We can then ask: If we translate from a Mandarin Chinese “tenseless” sentence to English, which tense will Google Translate select?
Ah, we get present tense!
Now, guessing present tense is obviously not as socially problematic as guessing an unspecified doctor will be male, but it does have a similar grammatical profile. Mandarin Chinese, like many languages, doesn’t encode tense as an inflectional morpheme, like English does (with -ed), so the translation from Chinese to English must make a *guess*, likely based on the frequency of the present tense interpretation for sentences like this (i.e., out of context, such a sentence is likely to be interpreted as present tense). To leave this example on a complicated note, I quote Yoav Goldberg, “human language is magnificent” and complex; if you try this example on other verbs, you might get better/different results, because, in the end Mandarin Chinese is only partly “tenseless”. The example is still illustrative.
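One crude way to do better than a frequency-based default is to look for the temporal adverbials and aspectual markers mentioned above. The sketch below is my own toy heuristic, not how any real MT system works; the cue words are real Mandarin items, but reducing tense inference to a substring check is a drastic simplification:

```python
# Toy heuristic for guessing an English tense for a Mandarin sentence.
# Cue words are genuine Mandarin lexical items; the rule itself is a
# deliberate oversimplification for illustration.
PAST_CUES = ["昨天", "去年", "了"]    # "yesterday", "last year", and the
                                      # perfective marker 了, which often
                                      # (but not always!) supports a past reading
FUTURE_CUES = ["明天", "明年", "会"]  # "tomorrow", "next year", and modal 会,
                                      # which can also mean "know how to"

def guess_tense(sentence: str) -> str:
    if any(cue in sentence for cue in PAST_CUES):
        return "past"
    if any(cue in sentence for cue in FUTURE_CUES):
        return "future"
    # No overt cue: fall back on the statistically likely reading,
    # which is exactly the *guess* discussed in the text.
    return "present"
```

Even this toy version shows why Mandarin is only *partially* “tenseless”: plenty of sentences carry cues that license a tense inference, and plenty don’t.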
We started with tense and gender because they are fairly familiar grammatical systems to most English speakers. But, let’s move beyond that. For our next example, let’s take on “evidentiality”. Grammatical evidentiality requires the language to grammatically encode the source of information. For example, evidential languages have affixes on verbs which mean things like “purportedly”, “I saw X”, or “I heard that X”. Evidentiality is a very important grammatical system, particularly for Native American languages like Apache, or South American languages like Quechua, but even languages as “exotic” as German can be seen as (at least partially) evidential-marking (Diewald & Smirnova 2010). In fact, 237 out of 418 languages listed in WALS under the discussion of grammatical evidentiality have some grammatical encoding of this information (also see this list). Evidential affixes in evidential-marking languages are just as obligatory as English -ed or will when speaking about past or future events, and just as comfortable to speakers of those languages as -ed is to us. In this way, we can think of English as an “evidential-less” language.
So, how does our Google Translate do? If we translate a translation mismatch in evidentiality, what happens? Which evidential-specification does Google Translate collapse to? See below:
In the first line, the Turkish encodes the idea of direct experience, providing a meaning like, say, “Ahmet is here and I saw him myself”, whereas the second sentence encodes indirect evidence supporting the claim that “Ahmet is here” (maybe your friend told you he was here). Of course, evidentiality, like tense and gender, is very linguistically complex, interacting with many other grammatical systems. Setting aside social impact for a second and focusing on grammar: this example is a bit more grammatically worrying than the gender example, because the system not only fails to notice the ambiguity, but also fails to encode any evidential meaning whatsoever!
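The Turkish contrast here rides on the two past-tense paradigms: the witnessed past in -di and the reported/inferential past in -miş. A sketch of recovering that evidential meaning, so an English rendering could at least add a gloss like “reportedly”, might look like the following. The suffix forms are real (Turkish vowel harmony yields the variants listed), but matching them with a bare string check is my simplification; real morphological analysis is much subtler:

```python
# Sketch: recover the Turkish evidential distinction that English drops.
# Vowel harmony gives -di/-dı/-du/-dü (and -ti/... after voiceless
# consonants) for the witnessed past, and -miş/-mış/-muş/-müş for the
# reported/inferential past. A suffix check is a toy approximation.
def evidential_gloss(verb: str) -> str:
    if verb.endswith(("miş", "mış", "muş", "müş")):
        return "reportedly / apparently (indirect evidence)"
    if verb.endswith(("di", "dı", "du", "dü", "ti", "tı", "tu", "tü")):
        return "witnessed (direct evidence)"
    return "no past evidential marker found"
```

So “geldi” (“came”, witnessed) and “gelmiş” (“came”, reportedly) would get different glosses, instead of both collapsing to a bare English past tense.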
Let’s take one last example: honorifics in Japanese. Honorifics are noun affixes, somewhat like Ms. or Dr. in English, which express social alignment between interlocutors, often marking politeness; in Japanese, they are suffixes. Three frequently occurring ones are -san (which is pretty underspecified, and can be attached to regular people’s last names), -sama (which refers to someone of higher rank or an older person, showing a higher degree of politeness or “deference”) and -shi (which is used in formal writing, often for someone you’ve never met, like a visiting professor). Honorifics are required in Japanese (and Korean, and many other languages), and cover a wide range of social functions.
Above, three Japanese sentences are collapsed in one translation — in which, incidentally, the subject is again spuriously male (fyi: Japanese is another “genderless” language). See how deep gender translation issues go? There’s nothing about “arriving somewhere tomorrow” in Japanese that feels particularly masculine to me…
But, what about translating in the other direction? This will induce the translation mismatch.
Which honorific should we use in the sentence “Mr. Yamamoto will come tomorrow”? Uttered out of the blue, any honorific suffix is possible in principle, depending on social relationships between interlocutors. We’d need contextual information (another, related issue) to know which one should be preferred. Perhaps Google Translate should adopt the same strategy it did for Turkish gender and surface all the options? If so, we would see (at least) three Japanese sentences with different honorific suffixes, not just one. That would mean surfacing many potential alternate translations to the user.
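A surface-all-the-options strategy for honorifics could be sketched as below. The suffix glosses follow the descriptions above (-san さん, -sama 様, -shi 氏); the function itself is a hypothetical illustration, not anything Google Translate actually does:

```python
# Sketch: instead of guessing one honorific, enumerate the alternatives
# a faithful translation of English "Mr./Ms. <surname>" might surface.
# Glosses follow the three suffixes discussed in the text.
HONORIFICS = {
    "さん": "-san: neutral / underspecified",
    "様":   "-sama: deferential (higher rank or older person)",
    "氏":   "-shi: formal written register",
}

def honorific_alternatives(surname_ja: str) -> list:
    """All candidate honorific renderings for a Japanese surname."""
    return [f"{surname_ja}{suffix}  ({gloss})"
            for suffix, gloss in HONORIFICS.items()]
```

Calling `honorific_alternatives("山本")` (Yamamoto) would yield three candidates, mirroring the “(at least) three Japanese sentences” above.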
So, now we’ve seen three new examples of where translations fail. They are very well controlled: in each example of a translation mismatch, we’re translating only a single sentence involving only one grammatical system. In reality, pairs of languages may differ across numerous grammatical systems, even within the same sentence. We could have wanted to translate Turkish sentences with both “genderless” pronouns and evidentiality into English, and then we would multiply the potential translation alternatives. These three examples are just the ones I could think up off the top of my head post-NAACL (as I write this, I’m reminded of others to try: we could look at translation mismatches in case, determiners and definiteness, question particles, clusivity, numeral classifiers, and this list could definitely go on and on).
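The multiplication of alternatives is literal: each underspecified feature contributes its own set of options, and the faithful candidates are their cross product. A small sketch (the option sets are illustrative assumptions of mine):

```python
from itertools import product

# When several grammatical features are underspecified at once, the
# faithful translation alternatives are the cross product of the
# per-feature options. Option sets below are illustrative only.
def enumerate_alternatives(feature_options: dict) -> list:
    """Return one dict of feature choices per alternative translation."""
    names = list(feature_options)
    return [dict(zip(names, combo))
            for combo in product(*feature_options.values())]

# Hypothetical Turkish → English sentence that is both "genderless"
# and evidential: 3 pronoun options × 2 evidential glosses = 6 candidates.
turkish_to_english = {
    "pronoun": ["he", "she", "they"],
    "evidential_gloss": ["(witnessed)", "(reportedly)"],
}
```

With just two underspecified features we already get six candidate translations; a few more features and surfacing every option to the user stops being practical, which is exactly the scaling worry discussed below.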
Here, I’ve only shown translation mismatches where at most three sentences collapse to a single translation (i.e., for Japanese honorifics), but I’ve no doubt there are examples where many sentences will spuriously collapse to one translation. There are undoubtedly many, many more of these examples to be had, for every pair of languages we might hope to translate. Translation mismatches most certainly have implications for all our multilingual models, especially for commercially important tasks in our increasingly globalized world (e.g., will our Japanese personal assistant/translator use the right honorific in translating a business message? If not, it could be socially awkward or even overtly offensive!).
What are we going to do? To make our translations better across mismatches in general, the most straightforward option is to have humans hand-annotate all language pairs, then surface multiple possible options to the consumer at translation time (as Google Translate currently does for Turkish gender). This would give us the most faithful translations in low context situations of the kind discussed here. But, this is likely very hard to scale up (given many possible spurious collapses).
Another option is to beef up our neural machine translation systems. Many NLPers pursue this option. [Edit: I recently came across Moryossef et al. 2019, which “injects” the missing morphological information; this might be a fruitful, nay even scalable, future direction!]
Yet a third option is to work towards better understanding the grammar underlying the mismatches. If we understand grammatical phenomena (such as the role of affixal morphology) better, we might even find ways to determine, in a data-driven way, when translation mismatches are present for arbitrary language pairs. Although it is more removed from the incredibly important work of handling socially harmful gender bias in our translation systems, I shamelessly (since I work on lexical semantics of gender in NLP) and optimistically advocate for this indirect approach as another, potentially complementary direction of attack.
It’s a long shot, but hey, understanding more about grammar might let us be lucky enough to uncover new ways to ameliorate the negative gender biases that we are all rightly worrying about.
Bordia, S., & Bowman, S. R. (2019). Identifying and reducing gender bias in word-level language models.
Burnett, H., & Bonami, O. (2019). Linguistic prescription, ideological structure, and the actuation of linguistic changes: Grammatical gender in French parliamentary debates. Language in Society, 48(1), 65–93.
Corbett, G. G. (1991). Gender. Cambridge: Cambridge University Press.
Diewald, G., & Smirnova, E. (2010). Evidentiality in German: Linguistic realization and regularities in grammaticalization (Vol. 228). Walter de Gruyter.
Gonen, H., & Goldberg, Y. (2019). Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them.
Kameyama, M., Ochitani, R., Peters, S. (1991). Resolving Translation Mismatches With Information Flow. In Proceedings of the 29th annual meeting on Association for Computational Linguistics (pp. 193–200). Association for Computational Linguistics.
Lin, J. W. (2010). A tenseless analysis of Mandarin Chinese revisited: A response to Sybesma 2007. Linguistic Inquiry, 41(2), 305–329.
Matthewson, L. (2006). Temporal semantics in a superficially tenseless language. Linguistics and Philosophy, 29(6), 673–713.
Qian, Y., Muaz, U., Zhang, B., & Hyun, J. W. (2019). Reducing Gender Bias in Word-Level Language Models with a Gender-Equalizing Loss Function.
Rudinger, R., May, C., & Van Durme, B. (2017). Social bias in elicited natural language inferences. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (pp. 74–79).
Sybesma, R. (2007). Whether we tense-agree overtly or not. Linguistic Inquiry, 38(3), 580–587.
Zhao, J., Wang, T., Yatskar, M., Cotterell, R., Ordonez, V., & Chang, K. W. (2019). Gender bias in contextualized word embeddings.
TY for comments: Hagen Blix, Ryan Cotterell, Hila Gonen, and Clara Vania