DALL-E 2: Recent Research Shows the Flaw in AI-Generated Art

Ayush Jain
4 min read · Aug 11, 2022


DALL-E 2 is all the buzz in the AI industry. People on the waiting list are trying to get their hands on the product. What does it mean for the creative industry moving forward?

The application of AI in the creative industry has everybody talking. AI can now write scripts, make movies, produce music, and generate images, to name only a few examples. The rise of AI in the creative disciplines has prompted several news articles and think-pieces to ask: will it replace creative jobs in the future?

The question of AI replacing our jobs has been discussed for ages. But the acclaim surrounding DALL-E 2 has brought the conversation back to the forefront. DALL-E 2 is “a new AI system that can create realistic images and art from a description in natural language.” The model understands input phrased the way we generally describe images in everyday language. So a prompt as simple as ‘a robot-cat wearing cool glasses, gazing at a supernova’ or ‘an astronaut riding a horse’ is enough for the AI to generate a matching image.

Photo from OpenAI
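
For readers curious to try this kind of prompting programmatically, here is a minimal sketch using OpenAI's Python client. The article is about the waitlisted web product, so the client setup, model name, and parameters below are assumptions for illustration only.

```python
# Minimal sketch: generating an image from a text prompt with OpenAI's Python
# client. The client, model name, and parameters are illustrative assumptions;
# the article itself refers to the waitlisted web product.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-2",
    prompt="a robot-cat wearing cool glasses, gazing at a supernova",
    n=1,
    size="1024x1024",
)

print(response.data[0].url)  # URL of the generated image
```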

However, a recent paper offers an interesting look at how the DALL-E 2 model works. The two scientists tested the image-generation model on two types of relational prompts: physical relations (e.g. ‘X on Y’, ‘X behind Y’, ‘X in front of Y’) and agent relations (e.g. ‘X helping Y’, ‘X touching Y’, ‘X pushing Y’). To give a simple example:

Instead of a prompt like ‘a donkey and an octopus are playing a game [or] the donkey is holding a rope on one end, the octopus is holding onto the other’…we use ‘a box on a knife’.
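
To make the setup concrete, the relational prompts can be thought of as templates filled with object pairs. The sketch below is purely illustrative: the relations are the ones listed above, but the object pairs are hypothetical stand-ins, not the paper's exact stimuli.

```python
# Illustrative sketch: assembling relational prompts from templates.
# The relations come from the article; the object pairs are hypothetical
# examples, not the exact stimuli used in the paper.
physical_relations = ["{x} on {y}", "{x} behind {y}", "{x} in front of {y}"]
agent_relations = ["{x} helping {y}", "{x} touching {y}", "{x} pushing {y}"]

object_pairs = [("a box", "a knife"), ("a spoon", "a cup"), ("a monkey", "an iguana")]

prompts = [
    template.format(x=x, y=y)
    for template in physical_relations + agent_relations
    for x, y in object_pairs
]

print(prompts[0])  # "a box on a knife"
```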

For their study, the researchers recruited 169 participants. The participants were shown the AI-generated results for ten descriptions and were asked to judge, based on their intuition, whether each image matched the text describing it.

The results showed that participants generally saw little agreement between the text and the image for both physical and agent relations: they felt the AI-generated images did not match the given descriptions. More importantly, the researchers observed variation even among cases with relatively high agreement. For example, they write:

The prompt ‘child touching a bowl’ generated 87% [80.1, 93] agreement on average, while ’a monkey touching an iguana’ generated 11% [5.3, 19.7] agreement on average.

Image from https://arxiv.org/pdf/2208.00005.pdf
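
The bracketed numbers are confidence intervals around the mean agreement. As a rough sketch of how such an estimate can be computed from binary yes/no judgments, here is a simple bootstrap; the judgments below are made-up placeholders, not the paper's data.

```python
# Rough sketch: mean agreement and a 95% bootstrap confidence interval from
# binary participant judgments (1 = "image matches the text", 0 = "does not").
# The judgments are made-up placeholders, not the paper's data.
import random

judgments = [1] * 87 + [0] * 13  # e.g. 87 of 100 participants agreed

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05):
    means = []
    for _ in range(n_resamples):
        sample = [random.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

mean = sum(judgments) / len(judgments)
lo, hi = bootstrap_ci(judgments)
print(f"{mean:.0%} agreement [{lo:.1%}, {hi:.1%}]")
```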

The reason for this, they note, is that there was sufficient training data for the algorithm to create images of ‘a child touching a bowl.’ You wouldn’t find the same for ‘a monkey touching an iguana.’ The researchers observed the same pattern with two other prompts, ‘a spoon in a cup’ and ‘a cup on a spoon’: the system produced far better results for the former than for the latter. Why, though?

According to them, it was not because the AI better understood the syntax of the sentence ‘a spoon in a cup’. Rather, the correct results are simply an “effect of training images involving spoon and cup.”

The words ‘child’ and ‘bowl’ are more likely to appear together in a particular setting in the training data than ‘monkey’ and ‘iguana.’ The same goes for ‘spoon’ and ‘cup’: the two words are likely to appear in only one particular arrangement, ‘a spoon in a cup,’ and not vice versa.
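
One way to see this intuition is to count how often word pairs co-occur in image captions. The toy sketch below uses invented captions, not DALL-E 2's actual training data, but it shows why ‘spoon’ and ‘cup’ end up strongly associated while ‘monkey’ and ‘iguana’ do not.

```python
# Toy sketch: counting how often word pairs co-occur in captions.
# The captions are invented examples, not DALL-E 2's training data.
from collections import Counter
from itertools import combinations

captions = [
    "a child touching a bowl on the table",
    "a spoon in a cup of coffee",
    "a spoon resting in a cup",
    "a monkey sitting in a tree",
    "an iguana on a rock",
]

pair_counts = Counter()
for caption in captions:
    words = set(caption.lower().split())
    for pair in combinations(sorted(words), 2):
        pair_counts[pair] += 1

print(pair_counts[("cup", "spoon")])      # 2: the pair co-occurs repeatedly
print(pair_counts[("iguana", "monkey")])  # 0: the pair never co-occurs
```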

The AI bases much of its image generation on real-life instances, and you wouldn’t find real-life cases of ‘a cup on a spoon.’ The system generates predictable results based on its training data, not because it “understands” the relations between words in a sentence. Language, on the other hand, structures the world around us; it gives sense and meaning to the world. How we use language does not depend entirely on the world’s factual reality: even when we can’t see a cup on a spoon, we can conceive of it.

I have tried to explain a similar thing, albeit in a slightly different context, in one of my previous posts. I argued there that the ability of machine learning models to perform scientific research is highly contingent on the training data they use. The ‘scientificity’ of such research rests on how accurately the machine can predict a particular outcome. That is starkly different from the human approach to scientific research, which constructs abstract relations between objects to ‘understand’ how something works. That is, in formulating a hypothesis, we create a particular relationality between things. Take, for example, a sentence as simple as ‘the earth revolves around the sun.’

Language is central to how humans perceive and understand the world, whether for scientific acumen or creative capability. There are intuitive and counter-intuitive tendencies in us, and language is, subconsciously, their origin. While AI can glue and stitch words together, it cannot comprehensively understand the relations between them.

Hence, the question of whether artificial intelligence will replace the creative industry is somewhat misplaced. Yes, it will automate a lot of things, and in many creative fields it will reduce the hours needed to get things done. However, it is unlikely that AI will ‘replace’ humans, at least in the foreseeable future.
