How Computers See Gender

An Evaluation of Gender Classification in Commercial Facial Analysis and Image Labeling Services

Morgan Klaus Scheuerman
Sep 6, 2019


This blog post summarizes a paper about how gender is represented in commercially available facial analysis computer vision systems. This paper will be presented at the 22nd ACM Conference on Computer-Supported Cooperative Work and Social Computing, a top venue for social computing scholarship. It will also be published in the journal Proceedings of the ACM (PACM).

Have you ever thought about how you, as an individual, are seen by your technology? What does it think of you? How does it classify you? What labels make up who you are? Many technologies are doing this as we speak — making simple determinations about the humans they come into contact with, bucketing them into terms like “woman,” “female,” “age 21–24.”

As AI increasingly intersects with human life, characteristics of human identity are made computable. A spectrum of human characteristics, such as age, race, and gender, is being embedded into algorithmic computer systems in new and unique ways: without “user” input. One domain that explicitly characterizes humans is computer vision. Perhaps the most obvious example is automated facial analysis technology (FA), an umbrella term for computer vision methods designed to categorize aspects of the human face.

Alongside the growing commercial focus on building FA systems are growing ethical concerns about what values are being embedded into those systems. One of those concerns has been how gender is encoded into FA. What is gender to these systems and how is it being labelled and categorized? Whose gender is recognized and how is that communicated? How do these computer vision systems see gender?

These systems often present their outputs as objective. In reality, those outputs are laden with value choices, abstract decisions, and subjective notions of gendered presentation. Different systems use different language to define gender, and they often classify the same face differently. In pilot testing my own photographs across different FA systems, I found myself misclassified about half of the time.

Figure 1. A selfie I took of myself, classified on the left correctly as “male” (Microsoft) and on the right incorrectly as “female” (IBM). IBM did get one thing right — I am a “cat fancier.”

Misclassification and its consequences have been a concern for individuals who exist outside of cisnormative gender expectations — binary trans people, non-binary people, gender non-conforming people, and so on. Our study uncovered the discursive values inscribed into computer vision systems, to better understand how often these types of misclassifications might happen to vulnerable gender groups, as well as how these (mis)classifications understand gender at an infrastructural level.

To do this, we conducted a two-phase study to better understand the current state of gender in commercially available FA systems — systems available for public purchase right now. The computer vision services we analyzed bundled their features in different ways, but the features we analyzed can broadly be understood as falling into two categories — facial analysis and image labeling:

  • Facial analysis employs specific feature detection functionality trained for faces. FA is trained to classify all faces by gender into pre-determined categories.
  • Image labeling (called “tagging” on some platforms) provides a set of labels for objects detected in the image (e.g., young lady (heroine), soul patch facial hair). In contrast to the consistent data schema provided by FA, the specific labels and how many are included vary, depending on what was detected in the image.

Table 1. The set of facial analysis and image labeling companies (and the service name, if it is different) whose documentation we analyzed. The “Gender Classifier Terms” column represents the language used to describe gender classification in the service, revealing the “view” each service has taken on defining gender. The “Probability Score” column indicates whether the gender classifier includes a probability score, which allowed us to assess how confident the gender classification was. Bolded names represent the services we studied in-depth during Phase II.
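
To make the difference between the two feature categories concrete, here is a minimal Python sketch. The class names, field names, and confidence values are hypothetical and invented for illustration; they do not reproduce any vendor's actual API response. The sketch only shows the contrast described above: facial analysis returns a fixed schema with one gender value from a closed set, while image labeling returns a list of labels whose length and content depend on the image.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: a closed, binary category set, as found in the services studied.
GENDER_CATEGORIES = ("male", "female")


@dataclass
class FaceAnalysisResult:
    """Hypothetical fixed schema: every face gets exactly one gender value."""
    gender: str
    confidence: float  # probability score, where a service provides one


@dataclass
class Label:
    """Hypothetical single label returned by an image labeling feature."""
    name: str
    confidence: float


@dataclass
class ImageLabelingResult:
    """Hypothetical open-ended schema: labels vary from image to image."""
    labels: List[Label] = field(default_factory=list)


def fits_schema(result: FaceAnalysisResult) -> bool:
    """Return True if the result can be expressed in the closed category set."""
    return result.gender in GENDER_CATEGORIES and 0.0 <= result.confidence <= 1.0


# Illustrative values only; the confidence numbers are made up.
face = FaceAnalysisResult(gender="male", confidence=0.97)
tags = ImageLabelingResult(labels=[
    Label("young lady (heroine)", 0.62),
    Label("soul patch facial hair", 0.55),
])

print(fits_schema(face))
print(fits_schema(FaceAnalysisResult(gender="non-binary", confidence=0.90)))
print(len(tags.labels))
```

The second check fails by construction: a schema with only two permitted values has no way to represent a gender outside them, whatever the image shows.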

In our analysis, we found that FA systems, on average, performed best on images of (presumably cisgender) women. They performed worst, on average, on images of transgender men. None of the systems we analyzed could correctly classify images of non-binary, agender, or genderqueer people. This was because none of the systems could understand gender beyond a binary “male” or “female” construct. We also found that binary gender is reinforced through the labels provided through image labeling functionality.

Table 2. The True Positive Rate (TPR) for each gender (woman, man, trans woman, trans man, agender, genderqueer, non-binary) across the face calls of each of the facial analysis services we analyzed. The TPR represents the rate at which the FA classification correctly identified the self-labeled gender of the person in the image.
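
For readers who want to see how a per-group TPR can be computed, here is a minimal Python sketch. The records are toy data invented for illustration and are not results from our study. The mapping from self-labeled gender to the classifier output counted as correct is also an assumption made for this sketch; a group with no matching output is mapped to `None`.

```python
from collections import defaultdict

# Assumption: which classifier output counts as correct for each
# self-labeled gender. None means the classifier has no matching output.
EXPECTED_OUTPUT = {
    "woman": "female",
    "man": "male",
    "trans woman": "female",
    "trans man": "male",
    "non-binary": None,
}


def true_positive_rate_by_group(records, expected_output):
    """Compute TPR per self-labeled gender.

    records: iterable of (self_labeled_gender, predicted_label) pairs.
    Returns a dict mapping each gender to correct / total for that group.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for gender, predicted in records:
        totals[gender] += 1
        expected = expected_output.get(gender)
        if expected is not None and predicted == expected:
            correct[gender] += 1
    return {gender: correct[gender] / totals[gender] for gender in totals}


# Toy records, invented for illustration only.
toy_records = [
    ("woman", "female"),
    ("woman", "female"),
    ("man", "male"),
    ("man", "female"),
    ("trans man", "female"),
    ("trans man", "male"),
    ("non-binary", "male"),
    ("non-binary", "female"),
]

rates = true_positive_rate_by_group(toy_records, EXPECTED_OUTPUT)
for gender, rate in rates.items():
    print(f"{gender}: {rate:.2f}")
```

In this toy data the rate for the non-binary group is zero regardless of what the classifier returns, because a classifier limited to “male” and “female” has no output that could match.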

Algorithmic infrastructures, like all databases that contain human characteristics, shape whose identity exists to a system. We show how FA and image labeling systems collapse gender into a singular, binary worldview: presentation equals gender. Trained on face data that represents a very specific gender presentation, these systems erase the possibility of fitting outside of a classificatory binary.

We use our findings to prompt discussion on how complex human identities are classified by simplistic technologies. We also question how this constrains third parties who purchase API packages that propagate specific worldviews, like the ones we found in this study.

The reality is, our systems are categorizing us, though they aren’t necessarily doing the best job. Researchers and designers need to consider how to ethically and equitably represent human diversity in technical infrastructures — not just improving gender error rates. Through the choices we make when constructing and annotating datasets, designing infrastructure architectures, and coding computational tasks, we have the ability to include and exclude, validate and invalidate certain groups of people. In our paper, we present design and policy considerations for creating more inclusive, equitable, and diverse algorithms — algorithms that recognize a richer array of human characteristics, rather than infer simplistic and constrained ones.


Morgan Klaus Scheuerman, Jacob M. Paul, Jed R. Brubaker. How Computers See Gender: An Evaluation of Gender Classification in Commercial Facial Analysis and Image Labeling Services. 2019. CSCW. ACM, Austin, TX. 33 pages.