Audio-visual Expression Using Image Generation AI — Development Case Study at MUTEK.JP

Ryosuke Nakajima
Qosmo Lab
7 min read · Jan 11, 2023

I was in charge of visuals for Nao Tokui — Emergent Rhythm (AI Generative Live Set) at MUTEK.JP on Thursday, 12/8.

“Emergent Rhythm” is an improvisational performance built entirely from the sounds generated in real time by AI. It’s a generative live set that uses Spectrogram GAN and Neutone, developed by Qosmo, for sound generation, taming and riding those sounds throughout the performance. For a more detailed description of the sound side, please see Nao’s forthcoming article. This article focuses mainly on the technical aspects of the visual side of the performance.

➝ Japanese version here

Below are a few shots from the performance.

All of the images displayed on screen during the performance were generated using AI. In total, about one million images were generated (including many produced by frame interpolation, which will be described later). The main model used was Stable Diffusion.

Stable Diffusion is, in a nutshell, a text-to-image AI model. The concept of generating images from text is not particularly new, and it has been widely researched as the text-to-image task. Starting with the naive approach of injecting a text-embedding vector as the condition of a conditional GAN, there has been plenty of research, including StackGAN (2017), which generated high-resolution images (256x256 at the time) by stacking GANs in multiple stages, and AttnGAN (2017), which introduced the now-common attention mechanism. Also, since the advent of CLIP by OpenAI, using CLIP to guide latent vectors in StyleGAN and VQ-GAN has become more prevalent, producing more interesting outputs (Qosmo used a similar technique to create masks last year). Coupled with the rise of diffusion models, high-quality models such as DALL-E, Midjourney, and Imagen have been released one after another.

reference: https://stability.ai/blog/stable-diffusion-public-release

Since its release last August, Stable Diffusion has attracted a great deal of attention. Whereas DALL-E and Midjourney are closed models, Stable Diffusion is open source and publicly available. Many researchers and artists have since developed useful UI tools and extensions, inviting more and more people into the world of image generation.

One of our goals for this performance was to use Stable Diffusion to build the visuals. Typical outputs made with Stable Diffusion include illustrations and animations generated with Deforum. While drawing on these methods, we aimed for an expression that is more unique to AI. At the same time, we were careful not to generate images that mimic the style of existing artists or works. (On the issue of artists’ rights over AI training datasets, Stable Diffusion has begun to make progress, with opt-out support planned for a future version, but the issue remains deep-rooted.)

We focused on the following two features unique to AI (especially generative models such as Stable Diffusion).

  • Able to generate a large number of images that differ slightly from one another to produce variations.
  • Depending on the input text, abstract concepts can be represented as images.

Many experiments have already been made to generate a large number of patterns using procedural/generative methods, and AI can likewise be used to generate a large number of variations. While procedural/generative variation generation interacts with the system through searchable parameters, algorithms, and so on, in the case of AI, and Stable Diffusion in particular, the text does that work. More detailed, concrete text produces more targeted images, while more abstract text produces images with greater breadth. Finding the right balance between concrete and abstract text is the interesting point of text-to-image models such as Stable Diffusion, and the emergence of prompt engineering, a profession specializing in exactly this, has become a hot topic.
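As a rough illustration of this kind of variation generation (not the code used in the performance), here is a minimal sketch using the Hugging Face diffusers library; the model ID, prompt, and number of images are assumptions for the example.

```python
# Minimal sketch (not the production code): generating many slight variations
# of a single prompt with Stable Diffusion via the Hugging Face diffusers
# library. The model ID, prompt, and number of images are illustrative.
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A more concrete prompt narrows the output; a more abstract one widens it.
prompt = "macro photograph of human skin, extreme close-up"

Path("variations").mkdir(exist_ok=True)
for i in range(100):
    # Varying only the seed yields images that differ slightly from one another.
    generator = torch.Generator(device="cuda").manual_seed(i)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"variations/{i:04d}.png")
```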

(An example of using ChatGPT to generate prompts, for the case mentioned above where abstract concepts are represented as images)

One of the themes of this visual performance was expressing a traversal between micro and macro scales. Starting from macro perspectives such as the universe and the earth and moving down to micro perspectives such as human skin and leaf veins, the visuals travel across scales, a worldview similar to that of Powers of Ten. To realize this worldview, we used Stable Diffusion to generate a large number of images at different scales, which were then assembled to construct the visuals. We used gallery sites such as Lexica, which let users browse prompts paired with their generated images, to research and refine the input text and build a prompt list, and then generated several thousand images for each prompt.

Struggling with Lexica

This left us with a large number of images on hand (at the cost of storage), which then had to be sequenced into video. To do this, we adopted a method of sorting the images based on their feature vectors. Specifically, a feature vector (a high-dimensional vector) is extracted for each image using a general image recognition model, and the distances between all pairs of images are calculated. The route that visits all of the images with the shortest total distance is then solved as a traveling salesman problem (with an approximate solution, of course), which sorts the large set of generated images by visual similarity.
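As a minimal sketch of this sorting idea (the actual implementation may differ), the following assumes a ResNet-50 feature extractor and a simple greedy nearest-neighbour tour as the approximate traveling salesman solution.

```python
# Minimal sketch of the sorting step: extract a feature vector per image with a
# general image recognition model (ResNet-50 here, as an assumption) and order
# the images with a greedy nearest-neighbour tour as a rough TSP approximation.
import numpy as np
import torch
from PIL import Image
from torchvision import models

weights = models.ResNet50_Weights.DEFAULT
resnet = models.resnet50(weights=weights)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()  # drop the classifier head
preprocess = weights.transforms()

def extract_features(paths):
    """Return an (N, 2048) array of feature vectors, one per image."""
    feats = []
    with torch.no_grad():
        for p in paths:
            x = preprocess(Image.open(p).convert("RGB")).unsqueeze(0)
            feats.append(extractor(x).flatten().numpy())
    return np.stack(feats)

def greedy_tour(feats):
    """Approximate shortest route: repeatedly jump to the nearest unvisited image."""
    remaining = set(range(1, len(feats)))
    order = [0]
    while remaining:
        last = feats[order[-1]]
        nearest = min(remaining, key=lambda i: np.linalg.norm(feats[i] - last))
        order.append(nearest)
        remaining.remove(nearest)
    return order  # image indices sorted by visual similarity
```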

I feel that the act of thinking up and inputting text into the model for the micro and macro imagery is a top-down approach. It is akin to sharply piercing a point in the model’s search space with a needle. On the other hand, sorting the large number of generated images by their features (again using AI) seems to have exposed me to a slice of the search space itself that I had not thought of or perceived. As seen in the phenomenon of pareidolia, we tend to perceive familiar symbols in response to unknown input stimuli. In other words, the patterns we find in the visual stimuli we receive when looking at an image are fixed to some extent. The patterns found by AI, on the other hand, are similar to ours to some extent (as a result of the training process), yet we often come across puzzling outputs. I find these patterns that challenge our intrinsic cognitive perceptions to be an intriguing and appealing characteristic of AI. (I also took up this theme in the workshop “Photobook Creation with AI: Rediscovering the World with AI” held at the IAMAS-sponsored Gifu Creation Workshop in 2021.)

“Cloud Face” by Shinseungback Kimyonghun uses an image recognition model to capture the moment when clouds are recognized as faces.

This time we combined this sorted video with an AI-based frame interpolation technique. Since the video is sorted, the pixel-level and semantic differences between frames are not that large, but naturally some breakdowns occur when interpolating between images that were not originally contiguous. These breakdowns appear as a squishy, peculiar texture, which I found rather interesting.
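The loop below sketches how interpolated frames could be inserted between consecutive images of the sorted sequence; `interpolate` is a hypothetical placeholder for the AI frame-interpolation model (e.g. RIFE or FILM), and the specific model used for the performance may differ.

```python
# Sketch of expanding the sorted sequence with interpolated frames.
# `interpolate(a, b, t)` is a hypothetical placeholder for an AI
# frame-interpolation model (e.g. RIFE or FILM); it should return the
# in-between frame at time t (0 < t < 1) for images a and b.
from PIL import Image

def expand_sequence(image_paths, interpolate, steps=4):
    images = [Image.open(p).convert("RGB") for p in image_paths]
    frames = []
    for a, b in zip(images, images[1:]):
        frames.append(a)
        for k in range(1, steps):
            # Interpolating between images that were never actually contiguous
            # is what produces the squishy, peculiar textures described above.
            frames.append(interpolate(a, b, k / steps))
    frames.append(images[-1])
    return frames
```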

In addition, with the cooperation of Masumi Endo (dancer) and Maiko Kuno (movement director), we also created visual materials from live-action dance footage. Footage of the dancer dancing to the music was fed into Stable Diffusion to apply effects, a process known as img2img (applied to each frame of the video).
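As a sketch of this per-frame img2img processing, the following uses diffusers’ StableDiffusionImg2ImgPipeline; the model ID, prompt, strength, and folder names are illustrative assumptions, not the settings used in the performance.

```python
# Sketch of per-frame img2img with diffusers' StableDiffusionImg2ImgPipeline.
# The model ID, prompt, strength, and paths are illustrative assumptions.
from pathlib import Path

import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "figure made of flowing liquid metal"  # illustrative prompt
Path("img2img_frames").mkdir(exist_ok=True)

for i, path in enumerate(sorted(Path("dance_frames").glob("*.png"))):
    frame = Image.open(path).convert("RGB").resize((512, 512))
    # A fixed seed keeps the effect roughly consistent from frame to frame.
    generator = torch.Generator(device="cuda").manual_seed(0)
    out = pipe(prompt=prompt, image=frame, strength=0.5,
               generator=generator).images[0]
    out.save(f"img2img_frames/{i:05d}.png")
```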

Stable Diffusion can apply effects only to specified areas by using masks, but when we tried that method, we didn’t notice a large difference compared to a simple style transformation. So this time, we used semantic segmentation to carve the human regions out of the original material, and fed that video into the model without using any masks. As a result, we succeeded in significantly changing the texture and shape while retaining traces of the human silhouette.
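A minimal sketch of this carving-out step, assuming torchvision’s pretrained DeepLabV3 as the segmentation model (the model actually used may differ):

```python
# Sketch of carving out the human region with semantic segmentation, assuming
# torchvision's pretrained DeepLabV3 (the model actually used may differ).
# Class index 15 is "person" in the VOC label set these weights are trained on.
import numpy as np
import torch
from PIL import Image
from torchvision.models.segmentation import (DeepLabV3_ResNet50_Weights,
                                              deeplabv3_resnet50)

weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()

def carve_person(frame: Image.Image) -> Image.Image:
    """Black out everything outside the person region of a single frame."""
    with torch.no_grad():
        logits = model(preprocess(frame).unsqueeze(0))["out"][0]
    mask = logits.argmax(0).numpy() == 15  # boolean person mask, (H, W)
    arr = np.array(frame.resize(mask.shape[::-1]))  # match the mask resolution
    arr[~mask] = 0
    return Image.fromarray(arr)
```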

I think we were able to get some interesting looks while still utilizing the dancer’s movement.

We used these processes to create the visual materials. As mentioned above, we feel that the ability to generate a large number of variations is one of the strengths of AI. Among our audiovisual performances using Stable Diffusion so far, we felt a positive response to this approach, whose appeal lies in the enormous amount of variation. Once again, I would like to thank everyone who came to the venue and supported the event.

Credit

Artist : Nao Tokui (Qosmo)
Visual Programming : Ryosuke Nakajima (Qosmo)
Visual Programming : Keito Takaishi (Qosmo)
Dancer : Masumi Endo
Movement Director : Maiko Kuno
