Has artificial intelligence really become racist and sexist? Not so fast.

Last week, VICE’s Motherboard published an article titled “It’s too late — We’ve already taught AI to be racist and sexist”, featuring my paper (pdf) on stereotypes and biases in the Flickr30K dataset. The article starts off on a rather dramatic note:

They say that kids aren’t born sexist or racist — hate is taught. Artificial intelligence is the same way, and humans are fabulous teachers.

Sounds like artificial intelligence research has messed up pretty badly, right? Well, no, not really. But I do think we need to have a serious discussion about hidden biases in the data we use to train artificially intelligent systems. (And others agree: there’s already a movement for fairness, accountability and transparency in machine learning.)

What my research was about

I work in the field of automatic image description. In other words: getting computers to tell you what an image is about. This is a very difficult task, because the computer needs to learn which parts of an image are relevant, how those parts relate to each other, and how to describe that relation in one simple sentence. To get machines to learn this, you need a lot of data.

Recent years have seen the development of several very large datasets for computer vision research. ImageNet, for example, has over 14,000,000 images that span over 20,000 object categories. The categorization of these images has been done through crowdsourcing: cutting the task up into bite-size pieces, and having ‘Workers’ on online marketplaces like Mechanical Turk or CrowdFlower carry out these ‘Human Intelligence Tasks’ in exchange for a small amount of money. (There’s been a lot of debate about the ethics of crowdsourcing, but covering it in full goes beyond the scope of this article.)

The largest collections of image descriptions are the Flickr30K and Microsoft COCO datasets. (The latter has a cool online interface where you can explore the data.) These datasets consist of tens of thousands of images, for which researchers have crowdsourced five descriptions per image. Each of these five descriptions has been provided by a different Worker, and all Workers are based in the USA. Here’s an example from the MS COCO dataset (where Workers have also annotated the most important entities with hand-drawn polygon shapes):

This picture has the following descriptions:

  • Several athletes are sitting on the bench inside a gym.
  • Several male athletes waiting for their turn in the game.
  • Several men sitting on a blue bench in a gymnasium
  • Teammates sit on the bench watching and waiting.
  • Men in blue and black uniforms sit on a bench at a game.

There is quite a bit of variation in these descriptions: the people in the picture are described as athletes, teammates, and men, and some descriptions also specify their gender or clothing. Some Workers mention the location (in the gym), whereas others pay more attention to the event (the game). When you have to describe an image in one sentence, you have to choose what to recount and what to leave out. I have looked at the choices people make in their descriptions, and the assumptions that they make about the depicted situations. (For example, that the athletes are waiting for their turn.)

Assumptions, expectations, and stereotypes

Camiel Beukeboom wrote an overview article in which he shows how stereotypes and expectations modulate our linguistic behavior. People tend to mark what they feel is unexpected or atypical for a particular social group. For example, with an adjective: female nurse; male surgeon; African-American businessman. Other times people might use negations to indicate that something deviates from the (perceived) norm, e.g. “John is eating pie without utensils” (the barbarian!) or “Tina is riding her bike without using her hands” (the daredevil!). In sum: the way you talk reflects the way you perceive the world.

(Note that a single use of an adjective like female or African-American doesn’t mean someone is biased. We can only say that there is a bias when there is a systematic difference in how frequently people from comparable groups are marked.)
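To make “systematic difference” a bit more concrete, here is a minimal Python sketch of the comparison. The records are invented for illustration (they are not counts from Flickr30K), and the function name `marking_rates` and the group labels are hypothetical. It only computes, per group, the fraction of descriptions in which the group is explicitly marked.

```python
from collections import defaultdict


def marking_rates(records):
    """Return, per group, the fraction of descriptions that explicitly mark the group.

    records: iterable of (group, is_marked) pairs, one pair per description.
    """
    totals = defaultdict(int)
    marked = defaultdict(int)
    for group, is_marked in records:
        totals[group] += 1
        if is_marked:
            marked[group] += 1
    return {group: marked[group] / totals[group] for group in totals}


# Hypothetical toy annotations, NOT real Flickr30K data.
toy_records = (
    [("group_a", True)] * 3 + [("group_a", False)] * 7
    + [("group_b", True)] * 1 + [("group_b", False)] * 9
)

rates = marking_rates(toy_records)
```

A single marked description barely moves these rates. Only a consistent gap between comparable groups, over many images, would point to a bias, and a real analysis would also need to test whether such a gap could be due to chance.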

I’ve applied the theory outlined by Beukeboom to the Flickr30K data that is used to train automatic image description systems. If we can show that these descriptions are biased, then computers might learn to produce biased descriptions, too. (It remains to be shown whether machines will actually learn to be biased, though.) For that purpose, I manually inspected a large part of the data, and found two things:

  1. There are indeed differences in how people are marked. Babies with a dark skin color are more frequently called ‘black’ or ‘African-American’ than Caucasian babies are called ‘white’. This has to do with the fact that White is treated as the default in the USA. I don’t think it is OK for a computer system to take one group as the standard, so it seems to me that we need to take action here. (Unfortunately, the authors of the Flickr30K dataset didn’t collect any demographic data about the participants. As such, it is hard to say what drives the biases I’ve observed. We do know that about 70% of the crowdworkers on Mechanical Turk in general are White, but we can only guess about the Workers who provided the descriptions. I think it’s safe to say that they were probably mostly White.)
  2. People will always try to interpret an image, even if they are asked not to speculate about its contents (as the guidelines asked the Workers). I do not think this is a conscious process; people just want to contextualize their experiences. If no context is given, they will add one. This can range from relatively innocent cases where women sitting with little children are referred to as ‘the mother’, to situations where Asian-looking people are referred to as ‘Chinese’ or ‘Japanese’. (I haven’t seen any cases where people were marked as being American.) People with a dark skin color are also frequently described as ‘African-American’, even if it’s unclear whether the person is from the USA at all. In general, all pictures are interpreted in an American context. Things that do not look American may be marked as Other.

From these two observations, we can draw the conclusion that the Flickr30K corpus cannot readily be used to train automatic image description systems, if those systems are to be used in a production environment. There is simply no guarantee that image descriptions will be unbiased. At the moment, this isn’t really an issue because image description systems are simply not good enough to learn all the fine distinctions that are made in the description data. Current output is still relatively generic. But as technology improves, we should be aware that there is bias in the data, and we should work to prevent systems from being biased.

The road to a solution: acknowledging perspectives

The main issue is that computers are bad at separating fact from interpretation. The language that people use to describe images is colored by their perception of those images. By itself, that isn’t a bad thing. But we should be aware that these biases exist.

Once we’ve acknowledged the fact that people describe the world from their own point of view, it is clear that perspective is a variable that we need to control. This starts with the way we gather our data. I believe that crowdsourcing should be seen as a psychological experiment rather than rote data collection, and that we should talk about Participants instead of Workers. And, as in all experiments, we should control for variables like age, gender, and ethnicity.

So what kind of experiments should be done? I think we should move away from monocultural data, and carry out image description experiments in more parts of the world. For English data, you could think about setting up crowdsourcing experiments for participants from India, Hong Kong, the United Kingdom, the Philippines, and Australia. By gathering data in different socio-cultural environments, you can see which parts of the descriptions stay the same (the facts), and which parts of the descriptions vary (the contextualization of the facts). We need to treat these as separate sources of information.
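As a rough illustration of that last step, here is a minimal Python sketch. It assumes each description has already been reduced to a set of content words, and that there is at least one region. The function name `split_shared_and_varying`, the region labels and the word sets are all hypothetical, not collected data. Plain word overlap is only a crude stand-in for “the facts”: a real comparison would have to deal with synonyms, paraphrases and translation.

```python
def split_shared_and_varying(words_by_region):
    """Split words into those used in every region and those used only in some.

    words_by_region: dict mapping a region name to a set of content words
    taken from that region's descriptions of the same image.
    """
    word_sets = list(words_by_region.values())
    # Words that occur in the descriptions from every region.
    shared = set.intersection(*word_sets)
    # Per region, the words that are not shared by all regions.
    varying = {
        region: words - shared for region, words in words_by_region.items()
    }
    return shared, varying


# Hypothetical descriptions of one image, reduced to content words by hand.
toy_words = {
    "region_1": {"men", "bench", "gym", "waiting"},
    "region_2": {"men", "bench", "teammates"},
    "region_3": {"men", "bench", "game"},
}

shared, varying = split_shared_and_varying(toy_words)
```

In this toy case, the words every region uses end up in `shared`, and whatever a region adds on top of that ends up in `varying`, which is where the contextualization would show up.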

(I am also really happy that Elliott et al. 2016 collected German image descriptions for the Flickr30K images. This is a good first step to multilingual and multicultural data.)

There are still huge challenges in the field of automatic image description, but as long as we acknowledge that there is bias in the data, and deal with that bias in an appropriate manner, we don’t need to worry about machines becoming racist or sexist.

Emiel van Miltenburg is a PhD student at the Vrije Universiteit Amsterdam, under the supervision of Piek Vossen.