Trans on Wikipedia: the code is binary, but the people aren’t

Sophie Hurwitz
Dec 13, 2020

A CS234 project by Kathryn Swint, Sarah Pardo, and Sophie Hurwitz

In recent years, thinkpiece writers and social media users have spoken at length about how the biases of the people who create the platforms we use every day can influence how those platforms work. On Wikipedia, for example, the way gender is represented tends to map onto the biases of the site’s editors: as of 2008, a study of all Wikipedia editors (‘wikipedians’) concluded that 84% of the editors of the English-language Wikipedia were men. While cis-women-centered ‘feminist wiki edit-o-thons’ have become relatively commonplace as an attempt to fix this problem, almost no work has focused specifically on how trans people are portrayed on Wikipedia.

While multiple studies have shown that women’s Wikipedia biographies spend a higher percentage of their word count than men’s on “female” gendered features and accomplishments (such as family work, spouses, and personal life), there are no similar studies for transgender people. So my friends and I decided to study this topic for our CS class, which led us down a rabbit hole of gender, pronouns, code, and general crowdsourced madness. Our initial research question became: how are trans people portrayed on Wikipedia, and how can we analyze that portrayal on a larger, data-driven scale to improve the representation of transgender and nonbinary individuals?

Data Collection

To do this, we first had to collect a sample dataset, so we gathered biographies from a Wikipedia list of non-binary people and from a separate list of transgender people. We needed samples of cisgender men’s and women’s biographies for comparison, so we used lists of male jazz singers, male mixed martial artists, female rock singers, and female members of parliament in the UK. (These lists were picked more or less at random; a larger study could likely have done a better job of finding representative samples of cisgender people.)

Our sample dataset ended up including 1200 Wikipedia pages: 442 cisgender men, 398 cisgender women, 182 nonbinary people, and 317 transgender men and women.

Notably, the transgender list did not have gender identity labels readily accessible the way the nonbinary list did: the nonbinary list was structured more like a table that included each person’s specific gender identity, while the transgender list had no such structure. While Wikipedia’s coverage of trans people in general was very limited, its coverage of binary trans people was comparatively extensive. So that’s one interesting conclusion we can draw immediately: where trans people are seen on Wikipedia, they tend to be seen as binary. (In the world outside Wikipedia, however, nonbinary people make up a significant proportion of the transgender population: 35%, according to the 2015 U.S. Transgender Survey.)

After collecting the pages we wanted to use, we analyzed the text of these Wikipedia pages by “features”: we created lists of “women features,” “men features,” “nonbinary features,” “cis features,” and “trans features,” and measured the frequency of each type of word (generally, pronouns and possessive words) on each biography page so we could compare the proportions. All of the code used here can be found on this page: https://github.com/sarah-pardo/wikigenderanalysis
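To make the idea of measuring “features” concrete, here is a minimal sketch of how per-page feature proportions could be computed. It is not the code from the repository linked above: the word lists in `FEATURES`, the `feature_proportions` helper, and the sample sentence are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical feature lists: the exact word lists used in the project are
# not reproduced here, so these are illustrative assumptions only.
FEATURES = {
    "women": {"she", "her", "hers", "herself"},
    "men": {"he", "him", "his", "himself"},
    "nonbinary": {"they", "them", "their", "theirs", "themself"},
    "cis": {"cis", "cisgender"},
    "trans": {"transgender", "transsexual"},
}


def feature_proportions(text):
    """Return, per feature group, the share of the page's words in that group."""
    # Lowercase the text and keep only alphabetic runs as words.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words)
    if total == 0:
        return {group: 0.0 for group in FEATURES}
    return {
        group: sum(counts[w] for w in vocab) / total
        for group, vocab in FEATURES.items()
    }


# Made-up sample text, not taken from any real Wikipedia page.
sample = "She released her first album in 2010. Her band toured widely."
proportions = feature_proportions(sample)
print(proportions)
```

One limitation of a simple word count like this is that it cannot tell a singular “they” from a plural one, so the “nonbinary” share on any page will also include ordinary plural uses.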

Preliminary Findings

Here are some conclusions from that study: first of all, cisgender features (in this study, the words “cis” and “cisgender”) are very rarely mentioned across the entire population we studied, despite most of that population being cisgender. This reinforces the idea that cisgender is the default state, a sort of “invisible normal.”

With regard to the biography pages of nonbinary people in particular, we had some especially interesting takeaways. The prevalence of “nonbinary features” (use of they/them pronouns or neopronouns in particular) tended to be negatively correlated with page length, which may be because editors feel more comfortable simply repeating someone’s last name than using the singular “they” or neopronouns. This could also be related to an overall trend in which nonbinary people simply aren’t written about as much.

Linear regressions displaying gendered features against page length.

We calculated linear regression models for the proportion of each group of gendered features against page length in characters. On transgender people’s pages, female features are negatively correlated with page length, while male features are positively correlated. Transgender features (in this study, uses of the words “transgender” and “transsexual”) are also negatively correlated with page length.
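For readers who want to see what a regression like this involves, here is a minimal sketch of an ordinary least squares fit of feature proportion against page length. The `linear_fit` helper is hypothetical, and the numbers are invented toy data, not results from our dataset; the sign of the fitted slope is what indicates a positive or negative relationship.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented toy data, NOT results from the study: page length in characters
# and the proportion of words on that page that are "trans features".
page_lengths = [500, 1000, 2000, 4000]
trans_feature_share = [0.040, 0.030, 0.020, 0.005]

slope, intercept = linear_fit(page_lengths, trans_feature_share)
print(f"slope per character: {slope:.2e}")
```

With this toy data the slope is negative, which is the pattern described above: the longer the page, the smaller the share of its words that are trans features.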

This is likely because many trans people’s pages are what Wikipedia calls “stubs”: pages that often just state that the “notable individual” is trans, and then add a sentence or two on what that person has done to make them notable. So the negative correlation between the frequency of “trans features” and page length may simply be because a longer page allows editors to write about more than just a trans person’s gender.

Conclusion and next steps

As trans activist Tiq Milan put it, there is a paradox inherent in transgender representation: “the more trans people are seen, the more they are violated.” This is known as the visibility paradox: as groups like transgender people are represented in the media more than ever before, much of that representation is going to be stereotypical, and some of it may even encourage violence against trans people.

If we make that representation as accurate and sensitive as it can be, though, we can minimize the violence of it. And since Wikipedia is a crowdsourced platform, it’s one that we actually have some control over. Thanks to Wikipedia’s huge popularity, we can introduce more nuanced trans narratives to a wide audience.

We found several ways to do this in our research. One is reformatting lists such as Wikipedia’s “list of transgender people” into tables, making the data easier to work with. We can also run edit-o-thons focused directly on bulking out the pages of trans people and on undoing malicious edits made by anti-transgender groups. We found one such edit last week on the profile of Lynda Cash, the first trans person known to have served in Britain’s Royal Navy: all of her pronouns had been changed to “he,” and the change was labeled a “typo correction” so that it would not come up as a major edit.

Another way we can help is by expanding the biographies of trans people. A volunteer group called the LGBT Wikiproject, which is responsible for much of the data on trans people we were able to use in the first place, does a lot of work on that. Many articles on trans people are currently stubs, or are articles about those people’s murders, for example, rather than about the individuals themselves. This not only makes it harder to collect data on the treatment of trans and nonbinary people on Wikipedia, but also helps perpetuate biases against them. One example of an article in this category is the one on Leelah Alcorn, which is titled “Suicide of Leelah Alcorn” rather than simply bearing her name.

As activist and actress Laverne Cox put it, trans representation on places like Wikipedia often amounts to barely more than a missing dataset: “we are not basically included in so many levels of data. And so when we’re not counted, we can often be discounted.” As trans and nonbinary people and allies, it is important that we use our data skills for good, and that we make sure that where we are represented, it is done correctly and humanely.
