AI visual anagrams are a hit! Rotate Marilyn Monroe 180° and she becomes Einstein; NVIDIA senior AI scientist: the coolest diffusion model in a while

Piyush C. Lamsoge
8 min read · Dec 4, 2023


An AI-painted Marilyn Monroe, rotated 180°, turns into Einstein?!

This is the kind of diffusion-model optical illusion that has exploded on social media recently. Just give the AI two different prompts and it will draw one for you!

It works even across completely different subjects. For example, a man, after color inversion, is magically transformed into a woman:

Even words can be flipped into new words: "happy" and "holiday" swap with a single rotation:

These come from a new University of Michigan study on "visual anagrams". As soon as the paper was published, it took off on Hacker News, climbing to nearly 800 points.

Jim Fan, senior AI scientist at NVIDIA, praised:

This is the coolest diffusion model I’ve seen in a while!

Some netizens marveled:

This reminds me of my time working on fractal compression. I’ve always thought of it as pure art.

Bear in mind that painting a picture that reveals a new subject when rotated, inverted, or deformed demands a real command of color, shape, and space from the artist.

Now even AI can pull off this effect. How does it work, and does it really look as good as advertised?

We tried it out and explored the principles behind it.

You can try it directly in Colab

We used the model to draw a low-poly piece that reads as a mountain in one orientation and as a city skyline in the other.

For comparison, we asked ChatGPT (DALL·E 3) to attempt the same thing; aside from higher resolution, its result had no real advantage.

The examples the authors themselves showcase are richer and more striking.

A snowy mountain peak rotates 90 degrees and becomes a horse; a dining table, seen from another angle, becomes a waterfall…

The most impressive is the picture below: viewed from four directions (up, down, left, and right), it shows different content in each.

(Here is a test for readers. Can you tell what these four animals are?)

Starting from the rabbit, each 90-degree counterclockwise rotation reveals, in order, a bird, a giraffe, and a teddy bear.

The two pictures below don't change in all four directions, but they still show distinct content in three of them.

Beyond rotation, it can also cut an image into jigsaw pieces and reassemble them into new content, even down to the pixel level.

The styles vary widely too: watercolor, oil painting, ink wash, line drawing… everything is on the menu.

So where can you play with this model?

To let more netizens experience this new toy, the authors have prepared a Colab notebook.

However, Colab's free-tier T4 is quite slow, and even a V100 occasionally runs out of memory; an A100 is needed to run it reliably.

The author himself says that if anyone gets it working on the free tier, please let him know right away.

Back to the walkthrough: after running the first code cell, you fill in a Hugging Face token (the notebook provides the link where you can get one).

You also need to accept the license agreement on the DeepFloyd project page before you can proceed.

With that done, run the three setup cells in order to finish deploying the environment, roughly along the lines sketched below.
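A minimal sketch of what that setup typically amounts to; the package list and calls here are our assumption based on standard diffusers workflows, not the notebook's exact cells:

```python
# Install dependencies (DeepFloyd IF's T5 text encoder needs sentencepiece).
!pip install diffusers transformers accelerate safetensors sentencepiece

from huggingface_hub import login

# Token from https://huggingface.co/settings/tokens; you must also have
# accepted the DeepFloyd IF license on its Hugging Face page first.
login(token="hf_...")  # placeholder, not a real token
```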

Note that the authors haven't built a graphical interface yet; choosing an effect and changing the prompts both require editing the code by hand.

The notebook ships with three effects. Uncomment the one you want to use (remove the hash sign in front of that line) and comment out the rest (add a hash sign).

These three are not the full list; other effects can be swapped in by editing the code, as in the sketch below. The specific supported effects are as follows:
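For illustration, here is a hedged sketch of what that selection looks like, using the `get_views` helper from the authors' visual_anagrams repo; treat the exact view names as assumptions:

```python
# Keep exactly one `views` line uncommented; each pair is
# (how you see it first, what it becomes after the transform).
from visual_anagrams.views import get_views

views = get_views(['identity', 'rotate_180'])   # 180° rotation illusion
# views = get_views(['identity', 'flip'])       # vertical flip illusion
# views = get_views(['identity', 'negate'])     # color-inversion illusion
```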

Once you've made the change, run that cell. Next come the prompts, which are edited the same way:

After editing and running that, you reach the generation step, where you can also adjust the number of inference steps and the guidance scale.

Note that you must first run the image_64 cell to generate the small image, and only then run the following cell to upscale it into the large image; otherwise an error is raised.
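A hedged sketch of that generation order, using the stage-1/stage-2 sampler names from the authors' repo; argument names and values here are illustrative:

```python
# Stage 1: generate the 64x64 base image across all views at once.
image_64 = sample_stage_1(
    stage_1, prompt_embeds, negative_prompt_embeds, views,
    num_inference_steps=30,  # the "number of inference steps" knob
    guidance_scale=10.0,     # the "guidance scale" knob
)

# Stage 2: upsample to 256x256. Running this before stage 1 fails,
# because it takes image_64 as input.
image_256 = sample_stage_2(
    stage_2, image_64, prompt_embeds, negative_prompt_embeds, views,
    num_inference_steps=30,
    guidance_scale=10.0,
)
```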

In short, our takeaway from trying it out is that this model is fairly demanding about prompts.

The authors are aware of this too, and offer some prompting tips:

(Machine-translated, for reference only.)

So, how did the research team achieve these effects?

“Blending” multi-view image noise

First, let's look at the authors' key idea for generating these optical illusions.

To make one image show different content, matching different prompts, under different views, the authors use a "noise averaging" trick that blends the two views together during generation.

Simply put, a diffusion model (DDPM) is trained to "break down and rebuild" images: noise is gradually added to an image, the model learns to remove it, and a new image is then generated by denoising a pure "noise map":
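For reference, in standard DDPM notation (Ho et al., 2020) the forward process mixes the image x_0 with Gaussian noise, and the network is trained to predict that noise:

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I),
\qquad \hat{\epsilon} = \epsilon_\theta(x_t, t)
```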

So, to make one image resolve into different pictures, for different prompts, before and after the transformation, you have to modify the diffusion model's denoising process.

Roughly speaking, at each denoising step the model estimates a "noise map" for both the original view and the transformed view, each paired with its own prompt, and the two estimates are aligned and averaged into a single new noise map.

The image denoised with this averaged noise map can then be transformed to present the desired visual effect.
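In pseudocode form, one denoising step of this noise averaging looks roughly like the following; this is a sketch of the paper's idea, and `model` and the view objects are stand-ins, not the authors' exact API:

```python
import torch

def multiview_noise_estimate(model, x_t, t, views, prompt_embeds):
    """Estimate noise under each view/prompt pair, map each estimate
    back to the canonical orientation, and average. The scheduler then
    uses the averaged estimate to step from x_t to x_{t-1}."""
    estimates = []
    for view, embeds in zip(views, prompt_embeds):
        x_view = view.apply(x_t)            # transform image into this view
        eps = model(x_view, t, embeds)      # predict noise for this view's prompt
        estimates.append(view.invert(eps))  # undo the transform on the estimate
    return torch.stack(estimates).mean(dim=0)  # the blended "noise map"
```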

Of course, the transformation itself must be orthogonal: the rotations, flips, jigsaw-style fragmentation and reassembly, or color inversions we see in the demos all qualify.
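The reason is that an orthogonal transform maps i.i.d. Gaussian noise to i.i.d. Gaussian noise, so the denoising network sees statistically valid input in every view. A quick sanity check of this, under our own illustrative setup:

```python
import torch

# A 180° rotation just permutes pixels (an orthogonal transform),
# so Gaussian noise is statistically unchanged by it.
noise = torch.randn(3, 64, 64)
rotated = torch.rot90(noise, k=2, dims=(-2, -1))
print(noise.mean().item(), noise.std().item())
print(rotated.mean().item(), rotated.std().item())  # identical statistics
```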

The choice of diffusion model matters as well.

Specifically, this paper uses DeepFloyd IF to achieve optical illusion image generation.

DeepFloyd IF is a pixel-space diffusion model: unlike latent diffusion models, it operates directly on pixels rather than in a latent space or other intermediate representation.

This also allows it to better process local information of the image, which is especially helpful when generating low-resolution images.
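For concreteness, the two pixel-space stages can be loaded with diffusers like this; the model IDs are DeepFloyd's public Hugging Face ones, and fp16 is our assumption to keep GPU memory manageable:

```python
import torch
from diffusers import DiffusionPipeline

# Stage I generates 64x64 pixels directly; Stage II upsamples to 256x256.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", variant="fp16", torch_dtype=torch.float16)
```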

In this way, the image can end up showing an optical illusion effect.

To evaluate the method, the authors built their own dataset of 50 image transformation pairs using GPT-3.5.

Specifically, they had GPT-3.5 randomly generate an image style (oil painting, street art, and so on) and two subjects (say, an old man and a snowy mountain), then handed these to the model to generate transformation paintings.

Here are some of the randomly generated transformations:

They also used CIFAR-10 to compare image generation across different models:

They then evaluated with CLIP, and the results showed the post-transformation images match their prompts about as well as the pre-transformation ones:
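As a hedged sketch of that kind of CLIP check, with the model ID and scoring details being illustrative rather than the paper's exact protocol:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment(image: Image.Image, prompts: list[str]) -> torch.Tensor:
    """Return a probability over prompts for one view of an anagram;
    each view should put the most mass on its own prompt."""
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.softmax(dim=-1)

# Usage: clip_alignment(view_image, ["an oil painting of an old man",
#                                    "an oil painting of a snowy mountain"])
```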

The authors also tested how finely an image could be "shattered" into blocks and reassembled before quality broke down.

It turns out that from 8×8 up to 64×64 blocks, the shattered-and-reassembled images still look good:

Netizens were "very impressed" by this series of transformations, especially the one where a man transforms into a woman:

I’ve watched it about 10 times.

Some netizens already want to turn it into wall art, or display it on an e-ink screen:

However, some professional photographers think AI-generated images still fall short at this stage:

If you look closely, you will find that the details cannot withstand scrutiny. A keen eye can always spot the bad, but the public doesn’t care.

So, what do you think of this series of optical illusion images generated by AI? Where else can it be used?
