Jan 12, 2017 · 3 min read

Gender identity is a fascinating and emotional topic, but from a linguistic perspective it’s a statistical one too! For a recent presentation at Hack && Tell, I wanted to explore the personal concept of gender that was imprinted into Wikipedia. I say personal since I’d like to make it about you, the reader. To do so, let’s examine the gender of your name according to Wikipedia.

Do names have a gender? Is that gender a binary? Do you think that John is a masculine name and Nell is a feminine one? What about the androgynous Pat? The numerical evidence suggests that gender in fact is continuous and, given the proper caveats, we can quantify the “genderness” of a name.

Computational natural language processing has a distributional hypothesis, summed up nicely in almost every NLP paper by a quote from Firth (1957):

You shall know a word by the company it keeps.

That is, we can figure out how words are related by looking at the co-occurrence patterns found in any large corpus of text. There are many excellent posts on this topic, so let’s get to the heart of the question. What is the gender of my name?

The key insight we need is this: we can project a name onto a gender direction in the distributional space. There are two great papers on this:

In our case, let’s map the names onto the direction that the words “he” → “she” trace out. Instead of removing the gender component, we can use it as a measuring stick. Define gender as a variable t by parameterizing the line so that t=0 is completely “he” and t=1.0 is completely “she”. For each name, compute both the closest approach of the name’s vector to the gender line and the corresponding parameterized point, e.g. t=gender(name). Given a large list of names, we now have a quantitative gender continuum!
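To make the projection concrete, here is a minimal sketch of the calculation using NumPy. The function name `gender_parameter` and the toy two-dimensional vectors are my own illustrative assumptions; a real model would supply 300-dimensional vectors for “he”, “she”, and each name.

```python
import numpy as np

def gender_parameter(name_vec, he_vec, she_vec):
    """Project name_vec onto the line through he_vec (t=0) and she_vec (t=1).

    Returns (t, distance): the parameterized point on the line and the
    closest approach from the name to that line.
    """
    direction = she_vec - he_vec
    # Standard point-to-line projection: t is the scaled dot product.
    t = np.dot(name_vec - he_vec, direction) / np.dot(direction, direction)
    closest = he_vec + t * direction
    distance = np.linalg.norm(name_vec - closest)
    return float(t), float(distance)

# Toy vectors (hypothetical, for illustration only).
he = np.array([0.0, 0.0])
she = np.array([2.0, 0.0])
name = np.array([1.0, 1.0])
t, distance = gender_parameter(name, he, she)
```

Note that t is not clamped to [0, 1], so a name can land “beyond” either anchor word if its vector projects past the end of the segment.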

I’ve also explored this idea before in a silly little project called transorthogonal-linguistics (demo). In it, I looked at words along the path for t in [0,1] using user-supplied beginning and end words. In this case, however, we are going to anchor our definitions to “he” and “she” and define gender to be the point along this line. Other definitions are certainly possible, and a more robust method might use the largest principal component of a few gender analogies.
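As a rough illustration of that alternative, the sketch below estimates a direction from several word pairs by taking the dominant singular vector of their difference vectors. This is not what was used for the results here; the function name `gender_direction`, the choice to skip centering, and the toy vectors are all assumptions made for the example.

```python
import numpy as np

def gender_direction(pairs):
    """Estimate a shared direction from (masculine_vec, feminine_vec) pairs.

    Stacks the difference vectors and returns the first right-singular
    vector, i.e. the largest principal component of the uncentered differences.
    """
    diffs = np.array([np.asarray(f, dtype=float) - np.asarray(m, dtype=float)
                      for m, f in pairs])
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    # SVD leaves the sign arbitrary; orient it from masculine to feminine.
    if np.dot(direction, diffs.mean(axis=0)) < 0:
        direction = -direction
    return direction

# Toy stand-ins for pairs such as (he, she), (man, woman), (king, queen).
toy_pairs = [
    ([0.0, 0.0], [1.0, 0.1]),
    ([1.0, 1.0], [2.0, 0.9]),
    ([0.0, 2.0], [1.1, 2.0]),
]
direction = gender_direction(toy_pairs)
```

The returned vector has unit length, so names could be scored by a plain dot product against it instead of the two-anchor parameterization.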

Methodology:

Download Wikipedia and a large list of first names. Parse the wiki XML and remove all markup so only text remains. Train a 300-dimensional distributional text embedding model with either word2vec or fastText. Trace a line between two anchor words, like “he” and “she”, and compute the parameterized value for each name.
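The last step can be sketched as follows, assuming the trained vectors have been exported in the plain-text format that word2vec and fastText can write (a header line with the word count and dimension, then one word and its values per line). The function names, the lowercased tokens, and the tiny inline “file” are hypothetical stand-ins, not the exact pipeline used for this post.

```python
import io
import numpy as np

def load_text_vectors(handle, vocabulary):
    """Read text-format vectors, keeping only the words in `vocabulary`."""
    vectors = {}
    header = handle.readline().split()  # "<word count> <dimension>"
    dim = int(header[1])
    for line in handle:
        parts = line.rstrip().split(" ")
        if parts[0] in vocabulary and len(parts) == dim + 1:
            vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

def rank_names(vectors, names, start="he", end="she"):
    """Score each known name along the start->end line, sorted by t."""
    origin = vectors[start]
    direction = vectors[end] - origin
    scale = np.dot(direction, direction)
    scores = {}
    for name in names:
        if name in vectors:  # skip names missing from the corpus
            scores[name] = float(np.dot(vectors[name] - origin, direction) / scale)
    return sorted(scores.items(), key=lambda item: item[1])

# A toy 2-dimensional "model file" standing in for the real 300-d vectors.
toy_file = io.StringIO(
    "4 2\nhe 0.0 0.0\nshe 1.0 0.0\njohn 0.1 0.5\nnell 0.9 -0.3\n"
)
names = ["john", "nell", "pat"]
vectors = load_text_vectors(toy_file, set(names) | {"he", "she"})
ranking = rank_names(vectors, names)
```

Names at the front of `ranking` sit closest to “he” and names at the end closest to “she”; any name absent from the vocabulary (here the made-up missing entry “pat”) is simply skipped.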

One final thing to remember is that all of this is in the context of a single corpus, Wikipedia. Wikipedia is an encyclopedia and as such contains notable figures throughout history. Any model trained on it reflects a historical perspective and not necessarily a modern one. For the aspiring data scientist out there, it would be interesting to see which names change in a different corpus, like the modern news embedding from Google.

Results:

Top male names: Jovan, Wilford, Newton, Maurice, Emmanuel, Joseph

Top female names: Jasmine, Opal, Liza, Vanessa, Natalie, Lily

The complete list of all the data can be found on this GitHub gist. A quick embed of all the names is also below (the search doesn’t work on Medium, but if you follow the links you should be able to search and download). Let me know if you found this interesting or if you’d like to see any follow-ups!
