Is that a boy or a girl?

Exploring a neural network’s construction of gender

I’ve always been curious about what makes someone “look” male or female, probably because I’m female but have never looked conventionally feminine. I was a tomboy as a child and remained one as an adult, and I’m also tall, with unruly hair that’s easiest to keep short. So strangers often assume that I’m male: in restaurants and on planes, I’m regularly addressed as “sir”.

People who know me well are usually surprised that anyone could think I was male. But I don’t find it that surprising — we don’t tend to really look closely at strangers, and just make broad assumptions about them based on their outlines. Children are often an exception — they will scrutinize me for a while and then ask their embarrassed parents, “is that a boy or a girl?”

Knowing that there has been huge progress in recent years in using machine learning to classify images, I got curious: could I train a model to classify photos of people according to their gender? What “rules” would it learn for making the decision? And how would it classify me?

What did I do?

This started out as a fun personal project, and it led me in a lot of interesting directions, including classifying celebrity images and my first-ever purchase of a long blonde wig. But it ended up teaching me something very serious.

Before I get to all of that, I just need to give you some quick background on how I did this. This was my first machine learning project, so, taking the simplest possible approach, I followed this tutorial to retrain an existing neural network model to classify new types of images. To do that, you need to have clear categories, plus a large set of example photos from each category that are labelled accordingly, so that the model can learn from those examples.
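The tutorial's code is not reproduced in this article, so the following is only a rough sketch of the general idea. It assumes that "retraining" means keeping the existing network fixed and fitting a new final classifier on the feature vectors it produces, which is one common form of this technique. The function names and the two-number toy features are invented for illustration; a real network would output far more numbers per image.

```python
import math

def sigmoid(z):
    """Squash a score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_final_layer(features, labels, epochs=500, learning_rate=0.5):
    """Fit a logistic-regression 'final layer' on fixed feature vectors.

    features: list of equal-length lists of floats (from the frozen network)
    labels:   list of 0/1 category labels, one per feature vector
    """
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of the log loss w.r.t. the pre-sigmoid score
            weights = [w - learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

def predict(weights, bias, x):
    """Probability that feature vector x belongs to category 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Made-up features and labels, purely so the sketch runs end to end.
toy_features = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.7]]
toy_labels = [1, 1, 0, 0]

weights, bias = train_final_layer(toy_features, toy_labels)
for x, y in zip(toy_features, toy_labels):
    print(y, round(predict(weights, bias, x), 3))
```

Only the small final layer is learned here, which is why this approach can work with far fewer labelled examples than training a whole network from scratch.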

Immediately, this raises a number of questions: would the categories refer to assigned sex at birth, gender identity (male/female), or gender expression (masculine/feminine)? What about people who are gender non-conforming, or transgender, or non-binary? Would the label for each photo be based on the person’s own assessment, or someone else’s? Would the photos be of the whole body, taking factors like height into account, or more focused on the face?

In practice, there are not very many available datasets for this kind of project, so in order to start somewhere, I had to settle for having these questions answered for me. I found the Adience dataset, which consists of faces cropped from Flickr photos, with binary labels (“male” / “female”) assigned by researchers, based on looking at the photos. So the model that I trained was trying to predict the judgement criteria of those researchers. See the Appendix for more details on the dataset and how I adapted it for my purposes.

After going through the retraining process described in the tutorial, I ended up with a model that was 86.2% accurate at classifying “male” vs “female” photos from the dataset — based on a test set of examples that the model had not previously seen.

This level of accuracy means that the classification will be incorrect for about 1-in-7 faces. I decided that this was good enough for me to move ahead with some initial exploration (if it can get as high as that, it must have learned something interesting) but of course everything else in this article should be considered with this accuracy level in mind.
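To make the 1-in-7 figure concrete, here is the arithmetic as a small Python sketch. Only the 86.2% figure comes from the result above; the `accuracy` helper and the toy labels are made up for illustration and are not from the project's code or data.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Toy labels, invented for this example (not from the Adience data).
toy_actual = ["female", "male", "male", "female", "male", "female", "male"]
toy_predicted = ["female", "male", "female", "female", "male", "female", "male"]
toy_accuracy = accuracy(toy_predicted, toy_actual)  # 6 of 7 match

# The reported test accuracy, converted to an error rate.
reported_accuracy = 0.862
error_rate = 1 - reported_accuracy  # about 0.138
faces_per_error = 1 / error_rate    # about 7.2, i.e. roughly 1 in 7

print(f"toy accuracy: {toy_accuracy:.3f}")
print(f"error rate: {error_rate:.1%}, about 1 in {faces_per_error:.1f} faces")
```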

How did it classify me?

I took the first photo of me that I found on my laptop, and cropped and resized it so it was similar to the training images (it’s the leftmost image below). The result: the model classified me as male, with a probability of 98.9% — or 1.1% chance of being female.

So what was it about my picture that made me “look” male, and which characteristics would make the model more likely to predict that I’m female? Neural networks are somewhat notorious for their lack of transparency, so it can be hard to get an understanding of why they classify something in a particular way.

To start with, I dug out some old photos of myself with different looks — in particular, with different hair lengths, since that was my first guess at what would make me look more “feminine”:

Four different looks, and how the model classified them (% probability of being female): 1.1%, 12.3%, 55.2%, 87.6%

The outcome? The model’s estimated probability of me being female went up as my hair got longer, with the photo on the right coming out at 87.6% probability of being female.

A sort-of controlled experiment

Those photos differ in many other ways, not just the hair length: my clothing, my age, my glasses, my position, the background, the lighting conditions. I was interested in whether I could isolate characteristics from each other, and I wondered: could I design something like a controlled experiment, to test a bunch of different characteristics at once?

I decided to try 5 variables:

  • Smiling or not
  • Wearing glasses or not
  • Wearing red lipstick or not
  • Wearing a long blonde wig or not (ideally, since I have dark hair, I should have used a long dark-haired wig, to try to isolate the “long hair” and “blonde hair” factors from each other… but I admit, I just couldn’t resist the slight ridiculousness of the blonde wig when I saw it in the shop)
  • Wearing either a plain black t-shirt or a plain black tank top (“vest” in the UK) with thin shoulder straps, showing more of the skin around my neck and shoulders

… and made 2⁵ = 32 photos of myself with every combination of those variables, cropped in the same way as the photos in the original dataset. I tried to keep everything else as constant as I could between photos.
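For anyone who wants to plan a similar shoot, the full set of combinations can be listed with a few lines of Python. This is just a sketch: the variable names are shorthand invented here, and `tank_top: False` stands for the plain black t-shirt.

```python
from itertools import product

# The five on/off variables from the list above (names are my own shorthand).
VARIABLES = ["smiling", "glasses", "lipstick", "wig", "tank_top"]

def all_conditions(variables):
    """Return one dict per photo, covering every on/off combination."""
    return [
        dict(zip(variables, values))
        for values in product([False, True], repeat=len(variables))
    ]

conditions = all_conditions(VARIABLES)
print(len(conditions))  # 32

# Print the first few as a checklist for the photo session.
for condition in conditions[:3]:
    print(condition)
```

Because every combination appears exactly once, each variable is "on" in half of the photos (16) and "off" in the other half, which is what makes the comparisons between groups fair.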

Here are those 32 photos, ordered from left to right by the predicted probability that I was female; each photo’s vertical position reflects the score itself (highest at the top).

They fall into three main groups:

  • In the top right, with at least 98% probability of being female, are all of the 16 photos where I was wearing the long blonde wig. Basically, as soon as I put that wig on, any other factors were irrelevant.
  • In the bottom left, with the lowest probability of being female, are the 8 photos where I wore a plain black t-shirt and no wig (just my usual short dark hair). For these, the model predicted I was between 2% and 25% likely to be female. In the two highest-scoring images in this group, I am both smiling and wearing lipstick.
  • The remaining 8 photos in the middle are those in which I did not wear the wig, but wore the tank top. For these 8 photos, the model predicted I was between 82% and 96% likely to be female (82–89% if wearing glasses and 91–96% if not).

To summarize: as with my old photos, long hair made a huge difference (though here the wig also changed my hair color, so length and color can’t be separated). Wearing the tank top did not have as big an effect as the wig, but was enough by itself to tip the balance towards female, so it seems that neckline is also very important to the model in classifying female vs male (or maybe just the amount of skin visible?). Smiling, wearing glasses, and wearing lipstick only seemed to make a small difference, and only in specific combinations, not by themselves.
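One simple way to put numbers on a summary like this would be to compute, for each variable, the average score across the photos where it was "on" minus the average across the photos where it was "off". The sketch below shows that calculation; it is not the analysis used for this article. In particular, `toy_score` is a made-up stand-in for the model, with numbers invented so that the example runs, and its output should not be read as real results.

```python
from itertools import product
from statistics import mean

VARIABLES = ["smiling", "glasses", "lipstick", "wig", "tank_top"]

def toy_score(condition):
    """Made-up stand-in for the model's P(female); NOT real results."""
    score = 0.10
    if condition["wig"]:
        score += 0.60
    if condition["tank_top"]:
        score += 0.25
    if condition["lipstick"]:
        score += 0.03
    return score

def main_effects(conditions, scores, variables):
    """Average score with each variable on, minus the average with it off."""
    effects = {}
    for name in variables:
        on = [s for c, s in zip(conditions, scores) if c[name]]
        off = [s for c, s in zip(conditions, scores) if not c[name]]
        effects[name] = mean(on) - mean(off)
    return effects

# Every on/off combination of the five variables (32 in total).
conditions = [dict(zip(VARIABLES, values))
              for values in product([False, True], repeat=len(VARIABLES))]
scores = [toy_score(c) for c in conditions]

effects = main_effects(conditions, scores, VARIABLES)
for name, effect in sorted(effects.items(), key=lambda item: -item[1]):
    print(f"{name}: {effect:+.2f}")
```

A single average like this hides interactions between variables, such as one factor swamping all the others once it is present, so it is best read alongside the grouped results rather than instead of them.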

Analyzing misclassifications

As a more conventional way of trying to understand what the model had learned, I also spent a while staring at the misclassified images from the original Adience dataset (the 1-in-7 test images that were incorrectly categorized) to try to identify things they had in common.

I don’t feel comfortable showing the misclassified faces here, given that the original dataset came from Flickr, and the people involved probably have no idea that they are part of a set of training data for gender classification. So I decided to instead use images of public figures as examples, choosing celebrities who somewhat resemble the people whose original photos were misclassified.

Note that these examples should not be interpreted as evidence that these celebrities “look like” someone of the other gender — these particular photos are just some of the 1-in-7 cases that the model gets wrong, and the purpose of studying them is to try to understand the criteria that the model is using to make classifications.

Some examples of celebrity photos that the model misclassified as female:

Shaun White (99.9% F), Harry Styles (96.1% F), Jon Bon Jovi (94.6% F), Michael Cera (63.9% F) — image credits at end

Some examples of celebrity photos that the model misclassified as male:

Judi Dench (96.3% M), Madonna (76.4% M), Annie Lennox (73.4% M), Rachel Maddow (67.6% M) — image credits at end

As with the photos of me, the most obvious factor was hair length: the model often misclassified men with long hair and women with short hair (or even hair pulled back into a ponytail or under a hat). But, again, hair length was not always the determining factor: some men with long hair were correctly classified as male, and some women with short or tied-back hair were correctly classified as female. So the model did not simply learn to distinguish long hair from short hair.

As before, clothing also seemed to play a role — for example, Madonna is wearing a shirt and tie above, while none of the misclassified males are. It’s possible that some facial characteristics were a factor too: some of the misclassified Adience images were younger men with narrower and less square faces, and some of the misclassified females were older, with wider or more square faces. Many older women also have short hair, so it is hard to separate these factors from each other.

Perhaps hair and clothing are the biggest factors because they tend to dominate the number of pixels in the image, whereas other possible factors (like make-up) involve smaller details of the face. So maybe the model is approximating what humans do when looking at someone from a distance — judging gender based on the most visible signifiers like hair and clothing.

Testing the boundaries

Next, I tried a few images that should in theory be confusing, to see how the model handled them. First, some famous examples of actors playing the other gender (like Dustin Hoffman in “Tootsie”) — but the model was easily won over by these disguises. It was even convinced by Freddie Mercury in Queen’s “I Want To Break Free” video, despite his obvious moustache, assessing two different photos of him as 87.0% and 99.3% likely to be female.

This suggests that the model’s representation of gender is still very simple, if it can be so easily fooled by a disguise. It would probably improve with a much larger set of training examples: all I had were a few hundred examples of each category, which is not really enough when there is so much variation within a category. In the case of Freddie Mercury, the training set probably did not include enough examples of males with moustaches for the model to associate them strongly with the “male” category. More generally, it’s impossible for such a small dataset to be fully representative of the diverse range of skin colors and styles of dress that exist in the world. (For instance, most of the faces in this limited dataset are white.)

Finally, I also tested the model on some photos of people who are known for not being confined by gender norms around appearance — like Tilda Swinton and Eddie Izzard. Depending on how they were dressed and photographed, the model would often rate them with appropriately ambiguous scores closer to 50–50.

Tilda Swinton (53.9% M) and Eddie Izzard (57.5% F) — image credits at end

What did I learn?

On the technical side, I really wanted better tools for understanding and debugging a neural network model like this. I knew about the “neural networks are a black box” problem in theory, but I’ve gained a much deeper appreciation for it after spending time staring at sets of images and trying to figure out what they have in common. Next, I would love to try generative techniques to get a deeper understanding of what is going on inside a neural network. For example, these helped uncover that one model’s concept of a “dumbbell” depended on a hand being attached to the dumbbell, because that was the case in all of the examples it was trained on.

Did I learn anything about gender? Obviously, I don’t need a neural network to tell me that having longer hair or wearing different clothing would make me look more “feminine”. But it’s fascinating to me that it was apparently able to pick up these “rules” from only a few hundred examples, given that it had no preconceived notion of gender and learned the rules entirely from the examples it saw. So if it were shown more examples of women with short hair and men with long hair, it would presumably give less weight to hair length as a gender signal. And maybe if, in our everyday lives, we saw more women with short hair, and more men with long hair, we would also be less likely to assume someone’s gender based on their hair length.

But what about people who are gender non-conforming, non-binary, or transgender? I read some academic papers about gender classification when doing background reading for this project, and didn’t see any of them mention gender non-conformity or gender transition — and yet these things immediately call into question the whole idea of “accuracy” in gender classification. For example, should non-binary people be in a separate category? Or should models try to predict where each person would put themselves on a scale (requiring the collection of new datasets)? How do transgender people affect what the models are learning about gender?

Finally, this project started out as a fun personal curiosity, but the most important thing I have learned from it is very serious: looking at the misclassifications helped me reflect on the risk that a gender classification model could be misinterpreted and misused. For example, maybe after reading this article, you wonder if it would be fun to create a new app that can automatically rate how “masculine” or “feminine” a person looks in a photo. But then imagine it being used by an insecure teenager, or by a high school bully who applies it to all of the photos in the class yearbook. In most of the world, gender norms are rigid, and come with very strong pressure to conform — and people who do not conform (or are even perceived not to) face harassment and threats to their personal safety.

As machine learning systems become easier to use, I hope that people from a broader set of backgrounds will begin to explore their possible uses (and misuses), raising our collective awareness of the ethical implications of this technology.


Thank you to the researchers who prepared the Adience dataset that I used for training data:

Eran Eidinger, Roee Enbar, and Tal Hassner, Age and Gender Estimation of Unfiltered Faces, Transactions on Information Forensics and Security (IEEE-TIFS), special issue on Facial Biometrics in the Wild, Volume 9, Issue 12, pages 2170–2179, Dec. 2014

See the Appendix for more details on the dataset and how I adapted it for this project.

For feedback on drafts of this article, thank you to: Anne Aula, Rachel De Wachter, Marnie Florin, Liz Hickok, Gregory Kossinets, Yelena Nakhimovsky, and Jens Riegelsberger.

Image credits (all images via Wikimedia Commons, cropped from originals):