The Emperor’s automagical suit

An experiment on bias and underperformance in image recognition AI


by Ujué Agudo and Karlos G. Liberal, members of Bikolabs, Biko’s laboratory

In May of this year, Science magazine wondered whether we have built an image of Artificial Intelligence that does not correspond to its actual performance.

Let's consider the case of object recognition AI. What do image recognition algorithms "see" when they "look at us"? Do they really identify, as assumed, what we do, the context we are in and the person we apparently are? Is their interpretation of the image neutral and free from bias? How accurate are the software products offered on the market as quick, easy-to-apply AI solutions?

We have analysed the labelling and description of images of men and women (some of them carrying objects that have historically been associated with gender stereotypes) using several products on the market and some pre-trained AI models. The results (dire photo descriptions, different labelling for men and women, objects that become invisible depending on gender…) led us to add a couple of questions to those initially posed: Is image recognition AI really performing as efficiently as advertised? Or are we mistaking for magic what is not even a well-executed trick?

This is a story (and an experiment) about automagical models that do not adequately perform the task for which they were created and whose performance is even biased. The Emperor has no clothes and it’s time for the whole town to point their fingers at him.

The magical breadth of AI

For some time now, despite our faith in the supposed superpowers of AI, we have had to accept that it is neither as objective nor as neutral as we wanted to believe. All too often it is ineffective, discriminatory, sexist and racist. This is hardly surprising: humans are present throughout the AI lifecycle and can transfer our subjectivities and biases to the machine at every phase of the process.

The risk of human bias in the different stages of the Machine Learning life cycle, published by Catherina Xu and Tulsee Doshi at https://ai.googleblog.com/2019/12/fairness-indicators-scalable.html

As awareness of this subject has grown, so has the effort to identify these biases and devise ways to alleviate them, from research on mitigation practices and tools to the proposal of a new scientific discipline that studies the behaviour of algorithms.

Although these are necessary strategies for addressing the problem of AI biases, they all take AI's deployment for granted as an inevitable response to a problem, skipping the prior step of determining whether the problem at hand is actually a suitable domain for AI (at a practical, concrete level, not only a theoretical one).

Take the case of object recognition. It has been deemed a suitable field for AI, and so we find commercial products on the market such as Amazon Rekognition, as well as dedicated deep neural network models and architectures such as Img2txt or MobileNet. These AIs identify objects in images or videos and tag or describe them. To what end? Currently, the main areas of implementation are visual product search in e-commerce and the like, the automated organization and tagging of photographs in image repositories, content moderation based on that tagging, and accessibility. In the future, these AIs are expected to enable better Virtual and Augmented Reality experiences, facial recognition, image classification in medicine and autonomous driving, all of these with guarantees.

Image recognition is conceived as a very limited implementation of AI, adjusted to its current capabilities (Weak or Narrow AI) and focused on solving a specific type of problem. That is very far from the siren songs of Artificial General Intelligence (AGI), whose aspiration is to match or exceed human intelligence and which echoes in our ears from time to time, as happened recently with the launch of GPT-3.

However, even when we are aware of AI's limitations, as in this case, we tend to throw it at complex problems on the basis of extreme simplifications, as when we expect an AI to recommend not only which movie to watch tonight, but also which babysitter to hire for our children or which areas of the city to patrol to prevent crime.

For decades we have been unable to answer complex questions such as what the purpose of punishment in criminal justice is, and yet we expect a technology to predict a person's future criminal activity and thereby determine the conditions of their parole, or we have an AI grade the academic performance of students who could not properly complete the course because of a pandemic.

Compared to these areas of implementation, the object recognition task may seem simple, but are we certain that properly identifying the elements of the world is a defined and delimited task, and therefore perfect for AI? Let's look at it closely, because as Kate Crawford and Trevor Paglen of the AI Now Institute point out, "images do not describe themselves", so "automated image interpretation is an intrinsically social and political project".

Take a recent experiment in which various object recognition systems, such as Google Cloud Vision, Amazon Rekognition and IBM Watson, were tested and their accuracy was found to be 15% higher when analysing everyday photographs of objects (such as hand soap) taken in the US than when the photographs were taken in places like Somalia or Burkina Faso. Representing the world by classifying the universe of objects that make up our reality sounds, to say the least, complex.

But the current state of AI image recognition is itself complex and turbulent. On the one hand, the market offers easy-to-use products under the AI brand (such as Amazon Rekognition or Google Cloud Vision) with an almost generic purpose, ranging from identifying objects and flagging inappropriate content to analysing facial features or capturing motion.

On the other hand, we have a whole new industry in development, where pure research mixes with neural network models and architectures (such as Img2txt or MobileNet) trained specifically for dedicated tasks and uses, but which are also being used as pre-trained models for broader purposes through tools like RunwayML.

Depending on their entry point, someone interested in using these AI solutions might conclude that using AI is as specific and complex as designing and training a convolutional neural network on their own dataset, or quite the opposite: as generic and simple as buying a product off the shelf or using an online tool. The risk of generic solutions in domains that are not simple to understand is that they create the feeling of a magic trick.

In fact, as Donald Norman says in his book “The Design of Future Things”, “If the intelligent, automatic devices worked perfectly, we would be happy. If they really were completely reliable, we wouldn’t have to know how they work: automagic would then be just fine (…) However, we get stuck in the in-between world of automatic devices we don’t understand and that don’t work as expected, not doing the task we wish to have done, well, then our lives are not made easier, and certainly not more enjoyable ”.

The world contemplated by an AI is very limited

In image recognition, an AI finds and identifies objects and associates them with the tags and categories on which it has been trained. Most pre-trained models use a limited set of categories: although ImageNet, the dataset on which many of them are based, originally had 21,000 categories, most of these models use only 1,000 of them.

Although reducing the analysis of the visual universe to just 1,000 labels may seem too simplistic, the sector does not regard it as a problem, since it assumes that pre-trained models, as we have noted, will be retrained for the specific task they are going to perform (for example, distinguishing pizzas from burgers in photos) rather than used to assess the whole realm of objects in the world.
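For readers who wonder what that retraining looks like in practice, here is a minimal transfer-learning sketch, assuming TensorFlow/Keras, its bundled MobileNetV2 ImageNet weights and a hypothetical local folder of pizza and burger photos. It is not the pipeline of any product mentioned in this article, only an illustration of reusing a pre-trained model for a narrow task.

```python
# Minimal transfer-learning sketch: reuse MobileNetV2's ImageNet features
# and train only a small head to tell pizza from burgers.
# Assumes TensorFlow 2.x and a hypothetical folder "food_photos/" with one
# subfolder per class ("pizza/", "burger/").
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "food_photos", image_size=(224, 224), batch_size=32, label_mode="binary")

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # keep the pre-trained features frozen

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # pizza vs. burger
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```

The point of the sketch is the design choice itself: the 1,000 generic labels are thrown away and the network is only asked about the two categories it will actually face.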

However, if we do not know which and how many categories the model we are going to use contemplates (information that appears in the technical documentation of some models, but not all), then whenever we apply the AI to categories it was never trained on, its performance will seem very poor. This is what happens when we use a network like MobileNet in its generic form through RunwayML or ml5.js, for example: MobileNet does not include a "Person" category among its 1,000 classes, so when it is used to analyse photos of people it returns awkward results.

Photo 1, with “abaya” as the main tag. Photo 2 labelled as “chain mail”
Photo 3 labelled “strainer”. Photo 4 as “wig”
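To make the previous point concrete, this is a minimal sketch of the kind of generic classification described above, assuming the MobileNet weights bundled with TensorFlow/Keras (the same 1,000 ImageNet classes, with no "Person" label); the file name is a hypothetical placeholder, not one of our test photos.

```python
# Minimal sketch: classify a photo with a generic pre-trained MobileNet.
# Its 1,000 ImageNet classes contain no "Person" label, so photos of people
# get mapped to whatever nearby class fits best ("abaya", "swab", "wig"...).
import numpy as np
from tensorflow.keras.applications.mobilenet import (
    MobileNet, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

model = MobileNet(weights="imagenet")  # pre-trained, 1,000-class head

def top_labels(path, k=3):
    """Return the top-k (class_id, label, score) tuples for a local image."""
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return decode_predictions(model.predict(x), top=k)[0]

print(top_labels("person_with_broom.jpg"))  # hypothetical file name
```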

If, on the contrary, the model does take the category "Person" into account, what emerges are not poor results but other problems, such as sexist and racial biases… The presence of inappropriate labels in the "Person" category (including racist insults and misogynistic terms) was documented by Crawford and Paglen in their article "Excavating AI". As the authors relate, ImageNet's tags, which are based on WordNet, the well-known semantic structure developed at Princeton University, include terms such as "pervert, spinster, call girl, streetwalker, stud, wanton, loser or fucker."

ImageNet tag tree screenshot posted by Kate Crawford and Trevor Paglen in their article Excavating AI

To expose this situation they developed an art project, ImageNet Roulette, which received media coverage and led ImageNet to remove up to 600,000 photographs from that category. The project itself was also taken down at that time.

At Bikolabs we wanted to recreate this story using the full ImageNet category system with the old Full ImageNet Network model, which does include the original 21,000 categories (among them the "Person" category). And what we found were things like this:

Knowing which categories a current AI takes into account is not always easy, as we have noted. Some commercial products, such as Amazon Rekognition, do not disclose how many categories they use or how detailed or generic those categories are, which makes it difficult to calibrate expectations about their effectiveness or to detect possible biases in the categorization carried out during training.

Flying blind with training datasets

Discovering not only the criteria by which neural networks classify images but also which datasets (and which labels) they were trained on is an even harder mission, given the opacity of the AI and of the companies that own it. How did they obtain their training dataset? Did they use their own graphic material, or did they scrape it from platforms like Flickr? The latter is the case of the IBM dataset created to train neural networks with almost a million images, in which the owners of the photographs were not informed that the company was using them.

Beyond the source of the raw images, it is also important to know which images have been used to exemplify each category, since these are what allow the AI to identify and tag objects properly later on. Whether those images are appropriate for the label they represent can be crucial. A counterexample is the set of images for the label "girl" in the aforementioned ImageNet dataset, where the photographs are, to put it mildly, not very appropriate for the label in question…

… And not very different from those represented by the label “woman”.

It may be that this lack of coherence between the labels and the images of these categories is the reason why, as we pointed out from Bikolabs a few months ago, Google Images generically labels photographs of women as "girl" regardless of their age and appearance, while being much richer in labels for photographs of men.

Tagging of photos of women in Google Images as “girl”
Tagging of photos of men in Google Images as “gentleman”, “spokesperson” or “surfer hair”

This poor performance could also be due to a mismatch between the focus of the training images and that of the analysed photographs. An example: if we look at the category “broom” in ImageNet, we find that most training photos for this tag lack context of use. The focus of the photographs is on the object alone.

When a model that does include the broom category, such as MobileNet, is used to analyse photographs where the object appears in its context of use, the AI has trouble distinguishing it from "similar" objects such as harps, swabs or crutches.

Photo 1 labelled “harp”. Photo 2 as “crutch”
Photo 3 labelled “picket fence”. Photo 4 as “swab”

And since we have identified tagging as a key part of this work, do we know who tagged the photos for these pre-trained models? Was the task performed by humans or by another AI? In the case of Amazon, for example, the company has both an "army" of humans doing this kind of work (through the Amazon Mechanical Turk service, which has been used, for instance, to build the COCO dataset) and auto-labelling products (such as the Amazon SageMaker Ground Truth service). But we cannot know which of these options, or others, the company chose to label its dataset.

Performance in question

Poor object recognition results do not always stem from a poor choice of training images. Sometimes we simply place unrealistic expectations on performance.

Take the case of Img2txt, which returns a textual description of an image instead of individual labels, having been trained to do so on 20,000 images, each described with a short sentence. If we take those training photographs and use them to evaluate the model, we find that the discrepancies between the training descriptions and the returned results are wide.

The richness of the photo descriptions in the training dataset ("a blond tourist and a local woman on a train with red seats", in the text file) is lost in the result returned by the model ("a couple of women standing next to each other").

We expect results to improve once the technology matures and image processing models are combined with good natural language processing models. In the meantime, however, we need to be aware of the level of precision these AIs can currently deliver before blindly deploying them.
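To give a rough sense of the size of that gap, here is a small, self-contained sketch that scores the overlap between the reference caption and the model output quoted above, using a simple word-level Jaccard measure. Standard captioning metrics such as BLEU or CIDEr are more principled; this is only illustrative.

```python
# Illustrative only: word-level Jaccard overlap between a reference caption
# from the training set and the caption returned by the model.

def caption_overlap(reference: str, generated: str) -> float:
    """Jaccard similarity between the word sets of two captions (0 to 1)."""
    ref, gen = set(reference.lower().split()), set(generated.lower().split())
    return len(ref & gen) / len(ref | gen) if ref | gen else 0.0

reference = "a blond tourist and a local woman on a train with red seats"
generated = "a couple of women standing next to each other"

print(f"overlap: {caption_overlap(reference, generated):.2f}")  # ~0.05: almost nothing shared
```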

Biased results

In addition to the aforementioned problems, in the course of this analysis of object recognition AI performance we found that, at times, the results appeared to show gender bias. The alarm went off while we were playing around with CamFind, a shopping-oriented mobile visual search app: when we removed an object from a photo, its description changed significantly.

Original image
Description of the image when processed through CamFind, literally “blue and black woman’s dress” in the original photo with the broom, and “men’s polo shirt with blue and black stripes” if we delete the broom

For this reason, we set out to analyse photographs of people carrying historically stereotyped objects, using Amazon Rekognition and ImageNet, and to contrast the results with "doubles" of the same photographs in which the person's apparent gender had been changed. Our goal was for the comparison to be made between photographs that were as similar as possible (all edited images are shown to the right of the original).

What we found with Amazon Rekognition, just as with CamFind, was that photos were labelled differently based on apparent gender. In this case, either an object was detected in one photograph but not in its matching pair, or it was confused with another object, or the service returned labels that seemed tied to the apparent gender of the person photographed (a minimal sketch of how to run such a label comparison appears after the examples below).

While in the original photo the “power drill” object is not detected, it is in the photo edited with FaceApp
In neither of the two photographs is the power drill detected, but the edited one is associated with the label "cleaning"
The “hammer” object is only “detected” in the edited photo on the right
In the original photo, the helmet is detected and associated with related professions, while in the edited photo it is not detected
In the original photo, the object is mistaken for a blow dryer, while in the edited one, it is not
Activity is detected in the original photo, “cleaning”, but not in the modified one. In addition, the labels on the possible profession of the photographed person change between both images: “Nurse” vs. “Worker” and “Student”
Objects and activity are not detected in the original photo, but in the edited one: “cleaning”
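For anyone who wants to reproduce this kind of side-by-side comparison, here is a minimal sketch that queries Amazon Rekognition's detect_labels for an original photo and its edited "double" and diffs the returned label sets. It assumes configured AWS credentials and the boto3 client; the file names are hypothetical placeholders, not our actual dataset.

```python
# Minimal sketch of the side-by-side comparison: ask Amazon Rekognition for
# labels on the original photo and on its gender-swapped "double", then diff
# the two label sets. Assumes AWS credentials are configured for boto3.
import boto3

rekognition = boto3.client("rekognition")

def labels_for(path: str, min_confidence: float = 70.0) -> set[str]:
    """Return the set of label names Rekognition assigns to a local image."""
    with open(path, "rb") as f:
        response = rekognition.detect_labels(
            Image={"Bytes": f.read()}, MinConfidence=min_confidence)
    return {label["Name"] for label in response["Labels"]}

original = labels_for("person_with_drill.jpg")         # hypothetical file
edited = labels_for("person_with_drill_faceapp.jpg")   # hypothetical file

print("only in original:", original - edited)
print("only in edited:  ", edited - original)
```

Differences in these two sets, for photographs that are otherwise nearly identical, are exactly the kind of asymmetry shown in the examples above.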

On the other hand, analysing the photographs with Img2txt, we find that the descriptions also change according to the apparent gender, but, on this occasion, the results that it returns are awkward in both cases. Some examples:

The AI "sees" a Wii game controller or a banana instead of a power drill or a spray bottle
A pair of scissors or a cell phone instead of a gavel or a hammer
A snowboard or suitcase instead of a tool box
A tennis racket or a frisbee instead of a broom
A frisbee or a teddy bear instead of cleaning supplies

In addition, in Amazon Rekognition we found that when the gender of the person in the photograph was identified (binary), men were assigned only one label (Man), while women were assigned more (Woman, Female, Girl). As we said before, knowing the list of categories and their hierarchy would be very useful in these cases.

In photographs where the person appears to be a woman, gender labels are more abundant

In view of these results, initiatives such as Google AI's decision to stop showing gender in image tagging do not seem to completely solve the bias problem.

Conclusions

It is estimated that the "Global Visual Search Market" will exceed $14.7 billion by 2023. Meanwhile, and perhaps because of this very expectation, the market remains unwilling to give up the magical halo surrounding AI, dressing it in the emperor's new clothes.

At Bikolabs we think that the magical, idyllic image of AI we have been sold does not really benefit the sector at all. Another AI winter may be unlikely, but it is possible, as Jeffrey P. Bigham points out, that what comes instead is an AI autumn: one in which we reap the harvest of well-spent efforts applying AI to well-defined problems, while the empty hype that still props up the market deflates.

It is time to point out that the emperor is naked and to highlight the limitations and inefficiencies of a narrow AI that has been aimed too broadly.
