Evaluating Gender Bias in Machine Translation

Does machine translation get gender?

Gabriel Stanovsky
6 min read · Jun 4, 2019

We find that popular machine translation (MT) services are prone to make gender-biased errors when translating from English to various target languages, thus echoing and amplifying societal stereotypes. Data and code are available in our GitHub repository. Find more details in our ACL 19 paper.

Translating gender

To be accurate, a translation must, among other things, preserve the gender of entities when moving from the source text to the target translation.

An example translation of a Winogender-style sentence. A stereotypical assignment of gender roles changes the doctor’s translated gender.

To translate gender correctly, we first need to understand how gender is conveyed in the source language. For example, in languages with no grammatical gender (such as English), gender is often conveyed through the use of pronouns. In the English sentence above, we understand that the doctor is a woman¹ by associating the pronoun “her” with the doctor, rather than the nurse.

In addition, we need to use the right tools for conveying gender in the target language. In our example, Spanish encodes masculine and feminine gender at the word level, using morphological inflections both for nouns (“doctor-a”) and for gendered articles (“la”). Consequently, in Spanish we have to commit to a gender for each of the two entities in our example sentence. Here, assigning either masculine or feminine grammatical gender to the nurse (whose gender is left undefined in the English sentence) is valid, while an accurate translation must use a feminine inflection for the doctor.

Getting the Gender Right in Machine Translation

Users of popular industrial machine translation services (such as Google Translate) have noticed gender-biased translations, influenced by societal stereotypes and norms.

“manager” is translated as male — “El gerente”. “receptionist” is translated as female — “la recepcionista”. (Screenshots taken 6/2/19.)

Without additional context, such examples may be stereotypical, yet not inherently wrong. Also, they’re perhaps not that surprising, given that MT models are trained on real-world text, where, unfortunately, managers still tend to be men, while receptionists are predominantly women.

What about instances where gender isn’t ambiguous?
To estimate the extent to which current state-of-the-art MT models are capable of inferring gender from context, we use two recent test sets: Winogender (Rudinger et al., 2018) and WinoBias (Zhao et al., 2018), with a total of about 3.9K sentences. Both of these follow the Winograd schema: each instance is an English sentence which describes a scenario with entities who are identified by their role (e.g., “the doctor” and “the nurse” in our example above) and a pronoun (“her” in this example), which needs to be correctly resolved to one of the entities (“the doctor” in this case).

Although these datasets were originally designed to test gender bias in coreference resolution, we find that they also fit our needs: each sentence contains an entity whose gender is unambiguous and comes with a ground-truth gender annotation. Furthermore, the test sets are evenly split between stereotypical and non-stereotypical gender-role assignments, following U.S. labor force statistics (e.g., a female nurse versus a female doctor), enabling us to compare gender-translation performance between the two subsets.
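As a hedged illustration of this evaluation split (the instance fields and the helper below are ours for exposition, not the exact schema of the paper’s repository), accuracy can be reported separately for the stereotypical and non-stereotypical subsets:

```python
# Illustrative sketch of the evaluation split described above; the field names
# and helper are assumptions, not the schema of the paper's GitHub repository.
from dataclasses import dataclass
from typing import List

@dataclass
class Instance:
    sentence: str        # English source sentence (Winogender / WinoBias style)
    entity: str          # role whose gender is unambiguous, e.g. "doctor"
    gold_gender: str     # "male" or "female", resolved from the pronoun
    stereotypical: bool  # whether the gold gender matches U.S. labor statistics

def accuracy_by_subset(instances: List[Instance], predicted: List[str]) -> dict:
    """Compare the gender predicted from each translation against the gold
    gender, reporting accuracy separately for the two subsets."""
    counts = {True: [0, 0], False: [0, 0]}  # subset -> [correct, total]
    for inst, pred in zip(instances, predicted):
        counts[inst.stereotypical][0] += int(pred == inst.gold_gender)
        counts[inst.stereotypical][1] += 1
    return {
        "stereotypical": counts[True][0] / max(counts[True][1], 1),
        "non-stereotypical": counts[False][0] / max(counts[False][1], 1),
    }
```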

Automatic evaluation
Having an automatic method for assessing an MT system’s gender-prediction accuracy makes it easy to re-assess future MT models. To achieve this, we test automatic translations of our test sets into languages which encode gender with morphological markers (as we’ve seen Spanish does), and which have automatic morphological analyzers capable of predicting a word’s gender with high accuracy² (a minimal code sketch of this gender-extraction step appears at the end of this section). This requirement doesn’t limit us much, as a diverse array of languages meets it:

  • Romance languages: Spanish (ES), French (FR), and Italian (IT), all of which
    have noun-determiner gender agreement and spaCy morphological analysis support.
  • Slavic languages (Cyrillic alphabet): Russian (RU) and Ukrainian (UK), for which
    we use the morphological analyzer developed by Korobov, 2015.
  • Semitic languages: Hebrew (HE), for which we use the morphological analyzer developed by Adler and Elhadad, 2006, and Arabic (AR), where grammatical gender can be easily identified via the ta marbuta (ة) suffix, which uniquely indicates feminine inflection.
  • Germanic languages: German (DE), for which we use the morphological analyzer developed by Altinok, 2018.
Google’s translation of our running example. Despite context, “doctor” is translated as a male — “El médico”.
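To make the gender-extraction step concrete, here is a simplified sketch for Spanish using spaCy’s morphological features. This is not the paper’s actual pipeline: the paper aligns source and target entities with fast_align, whereas the sketch locates the entity with a naive prefix match, and the es_core_news_sm model name is an assumption for illustration.

```python
# Hedged sketch of reading a translated entity's grammatical gender in Spanish.
# NOT the paper's pipeline: entity alignment via fast_align is replaced here by
# a naive prefix match on the surface form.
# Requires spaCy 3.x and: python -m spacy download es_core_news_sm
import spacy

nlp = spacy.load("es_core_news_sm")

def translated_entity_gender(translation: str, entity_stem: str) -> str:
    """Return the grammatical gender ('Masc', 'Fem', or 'unknown') of the first
    token whose surface form starts with entity_stem (e.g. 'doctor' matches
    both 'doctor' and 'doctora')."""
    for token in nlp(translation):
        if token.text.lower().startswith(entity_stem.lower()):
            gender = token.morph.get("Gender")  # e.g. ['Fem'] or []
            return gender[0] if gender else "unknown"
    return "unknown"

# The feminine translation of our running example vs. a masculine one.
print(translated_entity_gender("La doctora le pidió a la enfermera que la ayudara.", "doctor"))  # expected: Fem
print(translated_entity_gender("El médico le pidió a la enfermera que lo ayudara.", "médico"))   # expected: Masc
```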

Results

Running four commercial systems through our evaluation method shows a clear trend.

Gender prediction performance of the four evaluated systems: Google Translate, Microsoft Translator, Amazon Translate, and SYSTRAN.

All tested MT systems do much better when translating the stereotypical role assignments, indicating that they rely on biases in their training data more than on the relevant context. Here are a few interesting examples:

Examples of Google Translate’s output for different sentences in the test corpora. Words in blue, red, and orange indicate male, female and neutral entities, respectively.

Fighting bias with bias

What happens when we pit system biases against each other?

We tested whether we can steer the translation by prepending the adjectives “handsome” and “pretty” to male and female entities, respectively. For example, our running example is converted to:

“the pretty doctor asked the nurse to help her in the operation.”

While “doctor” evidently biases towards a male translation, “pretty” may tug the translation towards a female inflection. Our results show that this change improved performance in some languages, significantly reducing bias in Spanish, Russian, and Ukrainian.
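As a rough sketch of this intervention (the helper below is ours and only handles the simple “the <role>” pattern from our examples; it is not the paper’s exact preprocessing), the adjective is injected into the English source before it is sent to the MT system:

```python
# Hedged sketch of the "fighting bias with bias" intervention: prepend a
# stereotypically gendered adjective to the entity whose gold gender is known,
# before sending the English sentence to the MT system.
ADJECTIVE = {"male": "handsome", "female": "pretty"}

def inject_adjective(sentence: str, entity: str, gold_gender: str) -> str:
    """Turn e.g. 'the doctor' into 'the pretty doctor' for a female entity,
    modifying only the first mention of the entity."""
    return sentence.replace(f"the {entity}",
                            f"the {ADJECTIVE[gold_gender]} {entity}", 1)

source = "the doctor asked the nurse to help her in the operation."
print(inject_adjective(source, "doctor", "female"))
# -> "the pretty doctor asked the nurse to help her in the operation."
```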

Revisiting our example, adding “pretty” as a stereotypical adjective produces a feminine inflection in the Spanish output — “La linda doctora”.

Limitations

While our work presents the first large-scale evaluation of gender bias in MT, it still suffers from certain limitations.

First, our evaluation relies on synthetic English source-side examples. On the one hand, this allows for a controlled experimental environment; on the other hand, it might introduce artificial biases into our data and evaluation. Ideally, our data would be augmented with natural sentences from many source languages, all annotated with ground-truth entity gender.

Second, like many medium-sized test sets, our evaluation serves only as a proxy estimate of gender bias in MT, and would probably be easy to overfit. A larger annotated corpus could provide a better signal for training.

Conclusion

Like many other learned systems, machine translation seems prone to exploit statistical biases in its training data rather than rely on more meaningful context cues. In turn, this echoes and amplifies our own social biases, especially within a prominent and widely used NLP service such as MT. While solving such problems remains challenging, identifying that gender bias exists in MT (and quantifying its extent) is a step in the right direction. We hope that future work will use our evaluation protocol as a first stepping stone towards more gender-balanced MT models.

[1] For the sake of this large-scale automated study, we adopt an admittedly over-simplified view of gender as binary.

[2] We employ fast_align (Dyer et al., 2013) to align between English and the target language to find the translated entity.
