Guessing gender with Wikipedia
If you’re trying to add a male/female column to a spreadsheet of notable people, Wikipedia may offer a quick fix.
Wikipedia doesn’t include gender in its API, but a simple heuristic gets around that: count the masculine and feminine pronouns on a person’s Wikipedia page and guess whichever gender appears more often. This seems to be a surprisingly accurate method.
Here is a link to a Jupyter notebook with these tests.
import wikipedia  # the "wikipedia" package from PyPI

def genderGuess(row, threshold=0):
    male = ['he', 'his', 'him']
    female = ['her', 'hers', 'she']
    try:
        name = row['name']
        cont = wikipedia.page(name).content.lower()
        words = cont.split(' ')
        he = len([w for w in words if w in male])
        she = len([w for w in words if w in female])
        # percent difference relative to the mean of the two counts
        percDif = abs(he - she) / ((he + she) / 2)
        if percDif < threshold:
            return '?'  # does not pass the threshold
        return 'F' if she >= he else 'M'
    except Exception:
        return 'e'  # error (page lookup failed, or no pronouns were found)

# df is a pandas DataFrame with a 'name' column
df['guess'] = df.apply(genderGuess, axis=1)
To test this, I scraped the top thousand actors and actresses from IMDb, which gave me a CSV of notable people and their genders. Applying the method to that list, I was surprised to get 100 percent accuracy. There was one hiccup: one actor’s name is “Common”, which matches multiple Wikipedia articles and triggered an error.
There are definitely a few ways to improve this function. For example, since the function splits on spaces, a pronoun attached to a punctuation mark is not counted, as with “him.” in the sentence “The dog liked him.” Assuming, however, that male and female pronouns are affected by this flaw about equally, the effect should roughly cancel out and we can ignore it.
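If you would rather not rely on that assumption, one possible fix is to pull out runs of letters with a regular expression instead of splitting on spaces. This is only a sketch: the count_pronouns helper and the MALE and FEMALE sets are names made up for illustration, not part of the original notebook.

```python
import re
from collections import Counter

# Same pronoun lists as genderGuess, stored as sets (illustrative names).
MALE = {"he", "his", "him"}
FEMALE = {"she", "her", "hers"}

def count_pronouns(text):
    """Return (male_count, female_count) for a block of text."""
    # Extract runs of letters so that "him." and "(she" count as "him" and "she".
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    he = sum(counts[w] for w in MALE)
    she = sum(counts[w] for w in FEMALE)
    return he, she

sample = "The dog liked him. She said it was hers, not his!"
print(count_pronouns(sample))
```

On the sample sentence, splitting on spaces would only catch “she”, because “him.”, “hers,” and “his!” all carry punctuation. The regex version counts all four pronouns.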
I’ve also included a threshold option, which sets a minimum percent difference between the pronoun counts. This can come in handy if you are being especially wary of incorrectly guessing gender. If the percent difference between the counts of male and female pronouns is below the given threshold, it’ll return a question mark that the user can then replace manually.
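To make the threshold concrete, here is a small sketch of the same percent-difference arithmetic applied to pronoun counts directly, with no Wikipedia lookup. The perc_diff and classify names are hypothetical, and returning a question mark when no pronouns are found is my assumption (the function above would report an error in that case).

```python
def perc_diff(he, she):
    """Percent difference relative to the mean of the two counts (0 to 2)."""
    total = he + she
    if total == 0:
        return 0.0  # assumption: no pronouns means no measurable difference
    return abs(he - she) / (total / 2)

def classify(he, she, threshold=0.0):
    """Return 'M', 'F', or '?' when the counts are too close to call."""
    if he + she == 0 or perc_diff(he, she) < threshold:
        return "?"
    return "F" if she >= he else "M"

# 60 masculine vs. 40 feminine pronouns: |60 - 40| / 50 = 0.4
print(perc_diff(60, 40))
print(classify(60, 40, threshold=0.25))  # 0.4 is above 0.25, so a guess is made
print(classify(60, 40, threshold=0.5))   # 0.4 is below 0.5, so it is left for manual review
```

Because the difference is divided by the mean of the two counts, the value runs from 0 (equal counts) to 2 (only one gender’s pronouns appear), so a threshold above 2 would mark every row with a question mark.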