If you’re trying to add a male/female column to a spreadsheet of notable people, Wikipedia may offer a quick fix.

Wikipedia doesn’t include gender in its API, but a simple heuristic can fill the gap. Counting the number of masculine and feminine pronouns on a person’s Wikipedia page turns out to be a surprisingly accurate way to guess.

Here is a link to a Jupyter notebook with these tests.

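import wikipedia  # pip install wikipedia

def genderGuess(row, threshold=0):
    male = ['he', 'his', 'him']
    female = ['her', 'hers', 'she']
    try:
        name = row['name']
        # Fetch the article text and lowercase it so 'He' matches 'he'
        cont = wikipedia.page(name).content.lower()
        words = cont.split(' ')
        he = len([w for w in words if w in male])
        she = len([w for w in words if w in female])
        # Percent difference between the counts, relative to their mean
        percDif = abs(he - she) / ((he + she) / 2)
        if percDif < threshold:
            return '?'  # does not pass the threshold
        return 'F' if she >= he else 'M'
    except Exception:
        return 'e'  # error (ambiguous or missing page, or no pronouns found)

# df is a pandas DataFrame with a 'name' column
df['guess'] = df.apply(genderGuess, axis=1)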

After scraping the top thousand actors and actresses from IMDb, I had a CSV of notable people and their genders. Applying this method to the list, I was surprised to get 100 percent accuracy. There was one hiccup: one actor’s name is “Common,” which naturally matched multiple Wikipedia articles and triggered an error.
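To make the accuracy figure concrete, here is a minimal sketch of how guesses could be scored against known labels. The `score` function and the toy lists are hypothetical and not taken from the notebook. It assumes that errors ('e') and abstentions ('?') are reported separately instead of being counted as wrong answers.

```python
def score(guesses, labels):
    """Return (accuracy over decided rows, number of undecided rows)."""
    # Keep only rows where the heuristic actually committed to 'M' or 'F'
    decided = [(g, l) for g, l in zip(guesses, labels) if g in ('M', 'F')]
    skipped = len(guesses) - len(decided)
    if not decided:
        return None, skipped
    correct = sum(1 for g, l in decided if g == l)
    return correct / len(decided), skipped

# Toy data for illustration only
guesses = ['M', 'F', 'e', 'F']
labels = ['M', 'F', 'M', 'F']
accuracy, skipped = score(guesses, labels)
print(accuracy, skipped)  # 1.0 1
```

Reporting the skipped count next to the accuracy keeps cases like the ambiguous “Common” lookup visible instead of hiding them in the percentage.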

There are definitely a few ways to improve this function. For example, since the function splits on spaces, a pronoun attached to a punctuation mark is not counted, as with “him.” in the sentence “The dog liked him.” Assuming, however, that male and female pronouns are equally subject to this flaw, it should not bias the result much.
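One possible fix for the punctuation issue, sketched here as an assumption and not as part of the original function, is to pull out runs of letters with a regular expression instead of splitting on spaces. The `count_pronouns` helper is a hypothetical name.

```python
import re

MALE = {'he', 'his', 'him'}
FEMALE = {'her', 'hers', 'she'}

def count_pronouns(text):
    """Count masculine and feminine pronouns, ignoring adjacent punctuation."""
    # [a-z]+ extracts runs of letters, so "him." yields the token "him"
    words = re.findall(r"[a-z]+", text.lower())
    he = sum(1 for w in words if w in MALE)
    she = sum(1 for w in words if w in FEMALE)
    return he, she

sentence = "The dog liked him."
print(sentence.lower().split(' '))  # ['the', 'dog', 'liked', 'him.']
print(count_pronouns(sentence))     # (1, 0)
```

Splitting on spaces leaves the token `'him.'`, which matches nothing in the pronoun list, while the regex version counts it.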

I’ve also included a threshold option, which sets a minimum percent difference between the pronoun counts. This can come in handy if you are especially wary of guessing gender incorrectly. If the percent difference between the counts of male and female pronouns is below the given threshold, the function returns a question mark that the user can then replace manually.
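As a rough illustration of how the threshold behaves, the sketch below isolates the percent-difference formula used in the function: the absolute difference of the counts divided by their mean. The value is 0 when the counts are equal and 2 when one of them is zero. The helper names and the pronoun counts are made up for the example, and treating a page with no pronouns as undecided is an assumption (the original function returns 'e' in that case).

```python
def percent_difference(he, she):
    """Absolute difference of the counts divided by their mean."""
    return abs(he - she) / ((he + she) / 2)

def decide(he, she, threshold=0):
    if he + she == 0:
        return '?'  # assumption: no pronouns at all counts as undecided
    if percent_difference(he, she) < threshold:
        return '?'  # counts are too close to call
    return 'F' if she >= he else 'M'

print(percent_difference(40, 10))     # 1.2
print(decide(40, 10, threshold=0.5))  # M
print(decide(12, 10, threshold=0.5))  # ?
```

A higher threshold trades coverage for caution: more rows come back as '?', but the rows that do get a label rest on a clearer pronoun majority.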

Questions? Feedback? Feel free to email me. See more projects on my website.

Kevin McElwee 🏳️‍🌈

Written by

Machine learning engineer and data journalist in DC. Learn about me and my projects at kevinrmcelwee.ml
