Do algorithms reveal sexual orientation or just expose our stereotypes?
A study claiming that artificial intelligence can infer sexual orientation from facial images caused a media uproar in the fall of 2017. The Economist featured this work on the cover of its September 9th issue, while two major LGBTQ organizations, the Human Rights Campaign and GLAAD, immediately labeled it “junk science”. Michal Kosinski, who co-authored the study with fellow researcher Yilun Wang, initially expressed surprise, calling the critiques “knee-jerk” reactions. He then proceeded to make even bolder claims: that such AI algorithms will soon be able to measure the intelligence, political orientation, and criminal inclinations of people from their facial images alone.
Kosinski’s controversial claims are nothing new. Last year, two computer scientists from China posted a non-peer-reviewed paper online in which they argued that their AI algorithm correctly categorizes “criminals” with nearly 90% accuracy from a government ID photo alone. Technology startups had also begun to crop up, claiming that they could profile people’s character from their facial images. These developments prompted the three of us to collaborate earlier in the year on a Medium essay, Physiognomy’s New Clothes, to confront claims that AI face recognition reveals deep character traits. We described how the junk science of physiognomy has roots going back to antiquity, with practitioners in every era resurrecting beliefs based on prejudice using the new methodology of the age. In the 19th century this included anthropology and psychology; in the 20th, genetics and statistical analysis; and in the 21st, artificial intelligence.
In late 2016, the paper motivating our physiognomy essay seemed well outside the mainstream in tech and academia, but as in other areas of discourse, what recently felt like a fringe position must now be addressed head on. Kosinski is a faculty member of Stanford’s Graduate School of Business, and this new study has been accepted for publication in the respected Journal of Personality and Social Psychology. Much of the ensuing scrutiny has focused on ethics, implicitly assuming that the science is valid. We will focus on the science.
The authors trained and tested their “sexual orientation detector” using 35,326 images from public profiles on a US dating website. Composite images of the lesbian, gay, and straight men and women in the sample reveal a great deal about the information available to the algorithm:
Clearly there are differences between these four composite faces. Wang and Kosinski assert that the key differences are in physiognomy, meaning that a sexual orientation tends to go along with a characteristic facial structure. However, we can immediately see that some of these differences are more superficial. For example, the “average” straight woman appears to wear eyeshadow, while the “average” lesbian does not. Glasses are clearly visible on the gay man, and to a lesser extent on the lesbian, while they seem absent in the heterosexual composites. Might it be the case that the algorithm’s ability to detect orientation has little to do with facial structure, but is due rather to patterns in grooming, presentation and lifestyle?
We conducted a survey of 8,000 Americans using Amazon’s Mechanical Turk crowdsourcing platform to see if we could independently confirm these patterns, asking 77 yes/no questions such as “Do you wear eyeshadow?”, “Do you wear glasses?”, and “Do you have a beard?”, as well as questions about gender and sexual orientation. The results show that lesbians indeed use eyeshadow much less than straight women do, that both same-sex attracted men and women wear glasses more, and that young opposite-sex attracted men are considerably more likely to have prominent facial hair than their same-sex attracted peers.
Breaking down the answers by the age of the respondent can provide a richer and clearer view of the data than any single statistic. In the following figures, we show the proportion of women who answer “yes” to “Do you ever use makeup?” (top) and “Do you wear eyeshadow?” (bottom), averaged over 6-year age intervals:
The blue curves represent strictly opposite-sex attracted women (a nearly identical set to those who answered “yes” to “Are you heterosexual or straight?”); the cyan curve represents women who answer “yes” to either or both of “Are you sexually attracted to women?” and “Are you romantically attracted to women?”; and the red curve represents women who answer “yes” to “Are you homosexual, gay or lesbian?”.  The shaded regions around each curve show 68% confidence intervals.  The patterns revealed here are intuitive; it won’t be breaking news to most that straight women tend to wear more makeup and eyeshadow than same-sex attracted and (even more so) lesbian-identifying women. On the other hand these curves also show us how often these stereotypes are violated.
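The binning and intervals behind these plots are straightforward to reproduce. Below is a minimal sketch, on synthetic stand-in data (the survey responses themselves aren’t reproduced here), of computing the proportion of “yes” answers per 6-year age bin along with a 68% interval, i.e. ±1 standard error under a normal approximation:

```python
import numpy as np

def binned_proportion(ages, answers, bin_width=6):
    """Proportion answering 'yes' per age bin, with a 68% (+/- 1 SE) band."""
    bins = np.arange(ages.min(), ages.max() + bin_width, bin_width)
    results = []
    for lo in bins[:-1]:
        mask = (ages >= lo) & (ages < lo + bin_width)
        n = mask.sum()
        if n == 0:
            continue
        p = answers[mask].mean()
        se = np.sqrt(p * (1 - p) / n)  # normal-approximation standard error
        results.append((lo, p, p - se, p + se))
    return results

# Synthetic illustration: 1,000 respondents aged 18-65, 60% base rate of "yes"
rng = np.random.default_rng(0)
ages = rng.integers(18, 66, size=1000)
answers = rng.random(1000) < 0.6
for lo, p, ci_lo, ci_hi in binned_proportion(ages, answers):
    print(f"ages {lo}-{lo + 5}: {p:.2f} ({ci_lo:.2f}-{ci_hi:.2f})")
```

Because the standard error shrinks with the square root of the bin count, the bands are naturally wider for smaller subpopulations.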
That same-sex attracted men of most ages wear glasses significantly more than exclusively opposite-sex attracted men do might be a bit less obvious, but this trend is equally clear: 
A proponent of physiognomy might be tempted to guess that this is somehow related to differences in visual acuity between these populations of men. However, asking the question “Do you like how you look in glasses?” reveals that this is likely more of a stylistic choice:
Same-sex attracted women also report wearing glasses more, as well as liking how they look in glasses more, across a range of ages:
One can also see how opposite-sex attracted women under the age of 40 wear contact lenses significantly more than same-sex attracted women, despite reporting that they have a vision defect at roughly the same rate, further illustrating how the difference is driven by an aesthetic preference: 
Similar analysis shows that young same-sex attracted men are much less likely to have hairy faces than opposite-sex attracted men (“serious facial hair” in our plots is defined as answering “yes” to having a goatee, beard, or moustache, but “no” to stubble). Overall, opposite-sex attracted men in our sample are 35% more likely to have serious facial hair than same-sex attracted men, and for men under the age of 31 (who are overrepresented on dating websites), this rises to 75%.
Wang and Kosinski speculate in their paper that the faintness of the beard and moustache in their gay male composite might be connected with prenatal underexposure to androgens (male hormones), resulting in a feminizing effect, hence sparser facial hair. The fact that we see a cohort of same-sex attracted men in their 40s who have just as much facial hair as opposite-sex attracted men suggests a different story, in which fashion trends and cultural norms play the dominant role in choices about facial hair among men, not differing exposure to hormones early in development.
The authors of the paper additionally note that the heterosexual male composite appears to have darker skin than the other three composites. Our survey confirms that opposite-sex attracted men consistently self-report having a tan face (“Yes” to “Is your face tan?”) slightly more often than same-sex attracted men:
Once again Wang and Kosinski reach for a hormonal explanation, writing: “While the brightness of the facial image might be driven by many factors, previous research found that testosterone stimulates melanocyte structure and function leading to a darker skin”. However, a simpler answer is suggested by the responses to the question “Do you work outdoors?”:
Overall, opposite-sex attracted men are 29% more likely to work outdoors, and among men under 31, this rises to 39%. Previous research has found that increased exposure to sunlight leads to darker skin! 
None of these results prove that there is no physiological basis for sexual orientation; in fact ample evidence shows us that orientation runs much deeper than a choice or a “lifestyle”. In a critique aimed in part at fraudulent “conversion therapy” programs, United States Surgeon General David Satcher wrote in a 2001 report, “Sexual orientation is usually determined by adolescence, if not earlier […], and there is no valid scientific evidence that sexual orientation can be changed”. It follows that if we dig deeply enough into human physiology and neuroscience we will eventually find reliable correlates and maybe even the origins of sexual orientation. In our survey we also find some evidence of outwardly visible correlates of orientation that are not cultural: perhaps most strikingly, very tall women are overrepresented among lesbian-identifying respondents.  However, while this is interesting, it’s very far from a good predictor of women’s sexual orientation. Makeup and eyeshadow do much better.
The way Wang and Kosinski measure the efficacy of their “AI gaydar” is equivalent to choosing a straight and a gay or lesbian face image, both from data “held out” during the training process, and asking how often the algorithm correctly guesses which is which. 50% performance would be no better than random chance. For women, guessing that the taller of the two is the lesbian achieves only 51% accuracy — barely above random chance. This is because, despite the statistically meaningful overrepresentation of tall women among the lesbian population, the great majority of lesbians are not unusually tall.
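This pairwise measure is equivalent to the AUC (area under the ROC curve) of whatever score is used to rank the two faces. A minimal sketch, using height as the score on synthetic data (the mean shift, spread, and sample sizes are illustrative assumptions, not the survey’s numbers):

```python
import random

def pairwise_accuracy(scores_pos, scores_neg):
    """Paper-style evaluation: over all (lesbian, straight) pairs, how often
    does the higher score pick out the lesbian? Ties count as a coin flip
    (0.5 in expectation). This is equivalent to the AUC of the score."""
    wins = 0.0
    for a in scores_pos:
        for b in scores_neg:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Synthetic heights in inches, with a small assumed mean shift for
# lesbian-identifying women.
rng = random.Random(0)
straight_heights = [rng.gauss(64.0, 2.8) for _ in range(2000)]
lesbian_heights = [rng.gauss(64.2, 2.8) for _ in range(200)]
acc = pairwise_accuracy(lesbian_heights, straight_heights)
print(f"{acc:.3f}")  # typically very close to 0.5
```

A small shift in the mean of a broadly overlapping distribution buys only a sliver of pairwise accuracy, which is exactly why height is such a weak predictor despite the real overrepresentation of tall women among lesbians.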
By contrast, the performance measures in the paper, 81% for gay men and 71% for lesbian women, seem impressive.  Consider, however, that we can achieve comparable results with trivial models based only on a handful of yes/no survey questions about presentation. For example, for pairs of women, one of whom is lesbian, the following not-exactly-superhuman algorithm is on average 63% accurate: if neither or both women wear eyeshadow, flip a coin; otherwise guess that the one who wears eyeshadow is straight, and the other lesbian. Adding six more yes/no questions about presentation (“Do you ever use makeup?”, “Do you have long hair?”, “Do you have short hair?”, “Do you ever use colored lipstick?”, “Do you like how you look in glasses?”, and “Do you work outdoors?”) as additional signals raises the performance to 70%.  Given how many more details about presentation are available in a face image, 71% performance no longer seems so impressive.
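The one-question eyeshadow rule described above fits in a few lines. In this sketch the eyeshadow base rates are assumptions for illustration (not the survey’s exact numbers), so the simulated accuracy is indicative only:

```python
import random

def guess_lesbian(a_wears_eyeshadow, b_wears_eyeshadow, rng):
    """Given two women, exactly one of whom is lesbian, guess which one.
    If exactly one wears eyeshadow, guess she is the straight one;
    otherwise flip a coin. Returns 0 to pick woman A, 1 to pick woman B."""
    if a_wears_eyeshadow == b_wears_eyeshadow:
        return rng.randint(0, 1)            # no signal: coin flip
    return 0 if b_wears_eyeshadow else 1    # non-wearer guessed lesbian

# Simulated pairs; the base rates below are assumptions for illustration.
rng = random.Random(1)
trials, correct = 20_000, 0
for _ in range(trials):
    lesbian_es = rng.random() < 0.25    # assumed eyeshadow rate, lesbians
    straight_es = rng.random() < 0.60   # assumed rate, straight women
    # Woman A is always the lesbian in this simulation.
    correct += guess_lesbian(lesbian_es, straight_es, rng) == 0
print(correct / trials)  # expected ~0.675 under these assumed rates
```

Adding more yes/no presentation questions, as in the seven-question model above, simply means feeding more such weak binary signals into a classifier instead of using one of them alone.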
Several studies, including a recent one in the Journal of Sex Research, have shown that human judges’ “gaydar” is no more reliable than a coin flip when the judgement is based on pictures taken under well-controlled conditions (head pose, lighting, glasses, makeup, etc.). It’s better than chance if these variables are not controlled for, because a person’s presentation — especially if that person is out — involves social signaling. We signal our orientation and many other kinds of status, presumably in order to attract the kind of attention we want and to fit in with people like us. 
Wang and Kosinski argue against this interpretation on the grounds that their algorithm works on Facebook selfies of openly gay men as well as dating website selfies. The issue, however, is not whether the images come from a dating website or Facebook, but whether they are self-posted or taken under standardized conditions. Most people present themselves in ways that have been calibrated over many years of media consumption, observing others, looking in the mirror, and gauging social reactions. In one of the earliest “gaydar” studies using social media, participants could categorize gay men with about 58% accuracy; but when the researchers used Facebook images of gay and heterosexual men posted by their friends (still far from a perfect control), the accuracy dropped to 52%.
If subtle biases in image quality, expression, and grooming can be picked up on by humans, these biases can also be detected by an AI algorithm. While Wang and Kosinski acknowledge grooming and style, they believe that the chief differences between their composite images relate to face shape, arguing that gay men’s faces are more “feminine” (narrower jaws, longer noses, larger foreheads) while lesbian faces are more “masculine” (larger jaws, shorter noses, smaller foreheads). As with less facial hair on gay men and darker skin on straight men, they suggest that the mechanism is gender-atypical hormonal exposure during development. This echoes a widely discredited 19th century model of homosexuality, “sexual inversion”.
More likely, heterosexual men tend to take selfies from slightly below, which will have the apparent effect of enlarging the chin, shortening the nose, shrinking the forehead, and attenuating the smile (see our selfies below). This view emphasizes dominance — or, perhaps more benignly, an expectation that the viewer will be shorter. On the other hand, as a wedding photographer notes in her blog, “when you shoot from above, your eyes look bigger, which is generally attractive — especially for women.” This may be a heteronormative assessment.
When a face is photographed from below, the nostrils are prominent, while higher shooting angles de-emphasize and eventually conceal them altogether. Looking again at the composite images, we can see that the heterosexual male face has more pronounced dark spots corresponding to the nostrils than the gay male, while the opposite is true for the female faces. This is consistent with a pattern of heterosexual men on average shooting from below, heterosexual women from above as the wedding photographer suggests, and gay men and lesbian women from directly in front. A similar pattern is evident in the eyebrows: shooting from above makes them look more V-shaped, but their apparent shape becomes flatter, and eventually caret-shaped (^) as the camera is lowered. Shooting from below also makes the outer corners of the eyes appear lower. In short, the changes in the average positions of facial landmarks are consistent with what we would expect to see from differing selfie angles.
The ambiguity between shooting angle and the real physical sizes of facial features is hard to fully disentangle from a two-dimensional image, both for a human viewer and for an algorithm. Although the authors are using face recognition technology designed to try to cancel out all effects of head pose, lighting, grooming, and other variables not intrinsic to the face, we can confirm that this doesn’t work perfectly; that’s why multiple distinct images of a person help when grouping photos by subject in Google Photos, and why a person may initially appear in more than one group.
Tom White, a researcher at Victoria University in New Zealand, has experimented with the same facial recognition engine Kosinski and Wang use (VGG Face), and has found that its output varies systematically based on variables like smiling and head pose. When he trains a classifier based on VGG Face’s output to distinguish a happy expression from a neutral one, it gets the answer right 92% of the time — which is significant, given that the heterosexual female composite has a much more pronounced smile. Changes in head pose might be even more reliably detectable; for 576 test images, a classifier is able to pick out the ones facing to the right with 100% accuracy.
In summary, we have shown how the obvious differences between lesbian or gay and straight faces in selfies relate to grooming, presentation, and lifestyle — that is, differences in culture, not in facial structure. These differences include:
- Makeup and eyeshadow
- Glasses and contact lenses
- Facial hair
- Selfie angle
- Amount of sun exposure
We’ve demonstrated that just a handful of yes/no questions about these variables can do nearly as good a job at guessing orientation as supposedly sophisticated facial recognition AI. Further, the current generation of facial recognition remains sensitive to head pose and facial expression. Therefore — at least at this point — it’s hard to credit the notion that this AI is in some way superhuman at “outing” us based on subtle but unalterable details of our facial structure.
This doesn’t negate the privacy concerns the authors and various commentators have raised, but it emphasizes that such concerns relate less to AI per se than to mass surveillance, which is troubling regardless of the technologies used (even when, as in the days of the Stasi in East Germany, these were nothing but paper files and audiotapes). Like computers or the internal combustion engine, AI is a general-purpose technology that can be used to automate a great many tasks, including ones that should not be undertaken in the first place.
We are hopeful about the confluence of new, powerful AI technologies with social science, but not because we believe in reviving the 19th century research program of inferring people’s inner character from their outer appearance. Rather, we believe AI is an essential tool for understanding patterns in human culture and behavior. It can expose stereotypes inherent in everyday language. It can reveal uncomfortable truths, as in Google’s work with the Geena Davis Institute, where our face gender classifier established that men are seen and heard nearly twice as often as women in Hollywood movies (yet female-led films outperform others at the box office!). Making social progress and holding ourselves to account is more difficult without such hard evidence, even when it only confirms our suspicions.
About the authors
Two of us (Margaret Mitchell and Blaise Agüera y Arcas) are research scientists specializing in machine learning and AI at Google; Agüera y Arcas leads a team whose work includes deep learning applied to face recognition, which powers face grouping in Google Photos. Alex Todorov is a professor in the Psychology Department at Princeton, where he directs the social perception lab. He is the author of Face Value: The Irresistible Influence of First Impressions.
Notes

This wording is based on several large national surveys, which we were able to use to sanity-check our numbers. About 6% of respondents identified as “homosexual, gay or lesbian” and 85% as “heterosexual”. About 4% (of all genders) were exclusively same-sex attracted. Of the men, 10% were either sexually or romantically same-sex attracted, and of the women, 20%. Just under 1% of respondents were trans, and about 2% identified with both or neither of the pronouns “she” and “he”. These numbers are broadly consistent with other surveys, especially when considered as a function of age. The Mechanical Turk population skews somewhat younger than the overall population of the US, and consistent with other studies, our data show that younger people are far more likely to identify non-heteronormatively.
 These are wider for same-sex attracted and lesbian women because they are minority populations, resulting in a larger sampling error. The same holds for older people in our sample.
 For the remainder of the plots we stick to opposite-sex attracted and same-sex attracted, as the counts are higher and the error bars therefore smaller; these categories are also somewhat less culturally freighted, since they rely on questions about attraction rather than identity. As with eyeshadow and makeup, the effects are similar and often even larger when comparing heterosexual-identifying with lesbian- or gay-identifying people.
 Although we didn’t test this explicitly, slightly different rates of laser correction surgery seem a likely cause of the small but growing disparity between opposite-sex attracted and same-sex attracted women who answer “yes” to the vision defect questions as they age.
 This finding may prompt the further question, “Why do more opposite-sex attracted men work outdoors?” This is not addressed by any of our survey questions, but hopefully the other evidence presented here will discourage an essentialist assumption such as “straight men are just more outdoorsy” without the evidence of a controlled study that can support the leap from correlation to cause. Such explanations are a form of logical fallacy sometimes called a just-so story: “an unverifiable narrative explanation for a cultural practice”.
 Of the 253 lesbian-identified women in the sample, 5, or 2%, were over six feet, and 25, or 10%, were over 5’9”. Out of 3,333 heterosexual women (women who answered “yes” to “Are you heterosexual or straight?”), only 16, or 0.5%, were over six feet, and 152, or 5%, were over 5’9”.
 They note that these figures rise to 91% for men and 83% for women if 5 images are considered.
These results are based on the simplest possible machine learning technique, a linear classifier. The classifier is trained on a randomly chosen 70% of the data, with the remaining 30% of the data held out for testing. Over 500 repetitions of this procedure, the accuracy is 69.53% ± 2.98%. With the same number of repetitions and holdout, basing the decision on height alone gives an accuracy of 51.08% ± 3.27%, and basing it on eyeshadow alone yields 62.96% ± 2.39%.
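This train/test procedure can be sketched in pure NumPy. The survey responses aren’t reproduced here, so the per-question “yes” rates below are illustrative stand-ins, and the model is a plain least-squares linear classifier rather than the exact setup used for the figures above:

```python
import numpy as np

def split_accuracy(X, y, rng, test_frac=0.30):
    """Train a least-squares linear classifier on a random 70% of the
    data and report accuracy on the held-out 30%."""
    n = len(y)
    idx = rng.permutation(n)
    cut = int(n * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]
    Xb = np.hstack([X, np.ones((n, 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(Xb[tr], 2.0 * y[tr] - 1.0, rcond=None)
    pred = (Xb[te] @ w) > 0
    return (pred == y[te]).mean()

# Synthetic stand-in for the survey: 7 yes/no presentation answers whose
# "yes" rates differ by class (rates are illustrative, not the survey's).
rng = np.random.default_rng(0)
n = 2000
y = rng.random(n) < 0.5  # half lesbian, half straight
rates_straight = np.array([.60, .70, .50, .30, .50, .40, .20])
rates_lesbian = np.array([.25, .45, .30, .50, .30, .60, .30])
X = (rng.random((n, 7)) <
     np.where(y[:, None], rates_lesbian, rates_straight)).astype(float)

accs = [split_accuracy(X, y, rng) for _ in range(500)]
print(f"{np.mean(accs):.3f} ± {np.std(accs):.3f}")
```

As in the footnote, repeating the random split many times gives both a mean accuracy and a spread, which is how the ± figures above were obtained.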
 A longstanding body of work, e.g. Goffman’s The Presentation of Self in Everyday Life (1959) and Jones and Pittman’s Toward a General Theory of Strategic Self-Presentation (1982), delves more deeply into why we present ourselves the way we do, both for instrumental reasons (status, power, attraction) and because our presentation informs and is informed by how we conceive of our social selves.