Technology: Machine Learning

When Men Are the Default

What we can learn from Wikipedia’s biases

OpenSexism
4 min read · Aug 17, 2022
“Notable Person" by Dalle-Mini (all men) Showing gender bias on Wikipedia and downstream technologies
“Notable Person” as conceived by DALL-E Mini

Have you ever wondered what a notable person looks like? Or why nine out of nine of the notable people in the image above, generated by DALL-E Mini, are men? Ever since Laura Mulvey described the male gaze in her essay “Visual Pleasure and Narrative Cinema,” we’ve used this lens to better understand how women are depicted (or not) in art. But whose gaze are we following when we look at computer-generated works?

Looking at Wikipedia provides some insight, both because it’s often used to train machine-learning tools and because it’s been well studied, particularly in regard to gender.

I have a collection of studies, interviews and essays that look at gender on Wikipedia — 119 pieces as of this writing. We know that the majority of Wikipedia editors are men, for example. We know that the share of men’s biographies on Wikipedia is just over 80%. We know about structural biases — the fact that women are less central in the knowledge network. And citation biases — that women writers are under-cited. And we’ve looked at the bias in the language used to describe individuals.
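If you want a rough check on figures like that 80 percent share, Wikidata (which records a “sex or gender” statement, property P21, for most English Wikipedia biographies) can be queried directly. The sketch below is only illustrative: the endpoint and properties are real, but a count over every biography may time out on the public endpoint, in which case you would need to sample or paginate.

```python
# Illustrative sketch: tally English Wikipedia biographies by gender via Wikidata.
# P31 = instance of, Q5 = human, P21 = sex or gender. The full count may time out
# on the public endpoint; treat this as a starting point, not a robust pipeline.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?genderLabel (COUNT(DISTINCT ?person) AS ?count) WHERE {
  ?article schema:about ?person ;
           schema:isPartOf <https://en.wikipedia.org/> .   # has an enwiki page
  ?person wdt:P31 wd:Q5 ;        # instance of: human
          wdt:P21 ?gender .      # sex or gender
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?genderLabel
ORDER BY DESC(?count)
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "gender-gap-sketch/0.1"},
    timeout=120,
)
response.raise_for_status()

rows = response.json()["results"]["bindings"]
total = sum(int(r["count"]["value"]) for r in rows)
for r in rows:
    count = int(r["count"]["value"])
    print(f'{r["genderLabel"]["value"]:<20} {count:>9}  {count / total:.1%}')
```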

How does this all affect the tools trained on this content?

The researchers I’ve read express concern. Langrock and González-Bailón, who look at feminist interventions on Wikipedia, have found that current efforts to correct known gender gaps by creating biography pages for notable women are not, on their own, enough to correct gender bias on the site. Women’s pages remain less central in the network — with fewer incoming links and infoboxes — and are therefore less visible and harder to find. Links to women’s biographies account for only 7 percent of the links to humans on Wikipedia’s 100 Level 2 Vital articles (which include pages such as Climate, Astronomy, and Business), for example.

Noting that Wikipedia content is used to train machine-learning models, among other downstream applications, the paper’s authors write:

Inequalities within the structural properties of Wikipedia — the infobox and the hyperlink network — can have profound effects beyond the platform…the gendered inequities we identify can have large effects for information-seeking behavior across a range of digital platforms and devices.
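As a toy illustration of the kind of structural measure Langrock and González-Bailón describe (not their actual dataset or method), here is how the share of links into women’s biographies, and a simple visibility proxy such as in-degree, could be computed from any link graph with gender labels. The pages and labels below are invented for the example.

```python
# Toy link graph: topic pages linking to biography pages, with gender labels
# on the biographies. Not real Wikipedia data; purely illustrative.
import networkx as nx

links = [
    ("Astronomy", "Edwin Hubble"),
    ("Astronomy", "Galileo Galilei"),
    ("Astronomy", "Vera Rubin"),
    ("Physics", "Albert Einstein"),
    ("Physics", "Isaac Newton"),
    ("Chemistry", "Marie Curie"),
    ("Chemistry", "Antoine Lavoisier"),
]
gender = {
    "Edwin Hubble": "man", "Galileo Galilei": "man", "Vera Rubin": "woman",
    "Albert Einstein": "man", "Isaac Newton": "man",
    "Marie Curie": "woman", "Antoine Lavoisier": "man",
}

G = nx.DiGraph(links)

# Share of links into biographies that point to women.
bio_links = [(u, v) for u, v in G.edges if v in gender]
to_women = sum(1 for _, v in bio_links if gender[v] == "woman")
print(f"Links to biographies: {len(bio_links)}, "
      f"to women: {to_women} ({to_women / len(bio_links):.0%})")

# In-degree (incoming links) is one simple proxy for visibility in the network.
for name in gender:
    print(f"{name:<20} in-degree = {G.in_degree(name)}")
```

On real data, the graph would come from the article dumps or the Wikipedia API, and the gender labels from Wikidata.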

Nicholas Vincent and Brent Hecht, who looked at Wikipedia and search engines, found that Wikipedia links appear in 81–84% of search result pages for common queries, and are also prevalent in Knowledge panels, which are particularly visible on result screens. Wikipedia is also used behind the scenes to help build the search engine knowledge graphs, which — among other things — help the technology understand relationships between things.

“Wikipedia content has a huge impact well beyond the wikipedia.org website,” the authors write, noting that a “particularly significant implication is that the biases of Wikipedia content will impact search results.”

Recently, Oskar van der Wal et al. trained an LSTM on Wikipedia and studied its parameters as they changed over training in order to better understand “how language models come to be biased in the first place.” One of the interesting things they observe is the development of a ‘gender unit’ that is “strongly driven by female tokens, whereas male tokens dominate the development of gender information that is distributed across all other dimensions.”
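Their analysis probes the internals of a trained LSTM, but the basic intuition, that gender information can concentrate along particular directions of a learned representation, is easy to sketch. The vectors below are made up for illustration and stand in for real embeddings; this is not the authors’ model or method.

```python
# Simplified sketch of probing for a "gender direction" in a representation:
# define the direction from a paired word (he - she) and project other words
# onto it. Vectors are invented 4-dimensional stand-ins for real embeddings.
import numpy as np

vectors = {
    "he":        np.array([ 1.0, 0.2, 0.1, 0.0]),
    "she":       np.array([-1.0, 0.3, 0.1, 0.0]),
    "engineer":  np.array([ 0.4, 0.9, 0.2, 0.1]),
    "scientist": np.array([ 0.3, 0.8, 0.3, 0.2]),
    "nurse":     np.array([-0.5, 0.7, 0.2, 0.1]),
}

# Unit vector separating the male anchor word from the female one.
gender_direction = vectors["he"] - vectors["she"]
gender_direction /= np.linalg.norm(gender_direction)

# Positive projections lean toward "he", negative toward "she".
for word in ("engineer", "scientist", "nurse"):
    score = float(vectors[word] @ gender_direction)
    print(f"{word:<10} gender projection = {score:+.2f}")
```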

I’m reminded of ‘engineers’ and ‘women engineers’, ‘scientists’ and ‘women scientists’, a world in which men are the default for everything other than, well, being a woman.

Today, a Google search for “scientists” from my laptop returns a related search for “15 scientists and their inventions” that expands into a featured list of 29 guys and Marie Curie. Why are we still featuring a panel like this? Why did the search engine make these particular connections?

Gender bias is prevalent in Wikipedia. We can see and study it there. And maybe, by making it more visible, we can correct it — both on Wikipedia and in the applications that learn from online content. Because it’s 2022. There’s really no place for sexism.


Works Cited

Baltz, Samuel. “Reducing Bias in Wikipedia’s Coverage of Political Scientists.” PS: Political Science & Politics 55, no. 2 (2022): 439–444.

Brun, Natalie Bolón, Sofia Kypraiou, Natalia Gullón Altés, and Irene Petlacalco Barrios. “Wikigender: A Machine Learning Model to Detect Gender Bias in Wikipedia.”

Langrock, Isabelle, and Sandra González-Bailón. “The Gender Divide in Wikipedia: Quantifying and Assessing the Impact of Two Feminist Interventions.” Journal of Communication 72, no. 3 (2022): 297–321.

Mulvey, Laura. “Visual Pleasure and Narrative Cinema.” In Visual and Other Pleasures, 14–26. London: Palgrave Macmillan, 1989.

Van der Wal, Oskar, Jaap Jumelet, Katrin Schulz, and Willem Zuidema. “The Birth of Bias: A Case Study on the Evolution of Gender Bias in an English Language Model.” arXiv preprint arXiv:2207.10245 (2022).

Vincent, Nicholas, and Brent Hecht. “A Deeper Investigation of the Importance of Wikipedia Links to Search Engine Results.” Proceedings of the ACM on Human-Computer Interaction 5, no. CSCW1 (2021): 1–15.

WikiProject Women in Red: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Women_in_Red
