Hongshuo Fan on his piece “Metamorphosis”

Sandris Murins · Published in 25 composers
May 13, 2022

Read my interview with Hongshuo Fan on his music piece Metamorphosis, an interactive audio-visual composition featuring one human and two AI performers. Starting from the same virtual ancient Chinese percussion instrument, the bianqing, the performers evolve their sound and form through learning, imitation, conflict, and cooperation. The text version of this interview was created by Estere Bundzēna.

Can you give a short background of the piece?

Metamorphosis is the final and largest composition of my PhD portfolio. I tried my best to include in this piece all the technology and ideas from my PhD study. The piece is quite long, normally 17 or 18 minutes. It involves body tracking, sensor-based gesture tracking and AI conditional generation algorithms, all running in real time. The piece starts with three performers, one human performer (me) and two AI performers, and together we form a trio. I start the piece, the first AI performer reads and responds to my music, the second AI musician responds to the first AI musician, and I then try to respond to the second AI musician. We establish these three question-and-answer relationships with each other.

How did you choose the instrument for this piece?

The instrument we chose is an ancient Chinese percussion instrument, the bianqing. It is a little bit similar to a bell, but the material is different: it is made of stone, so the timbre and the texture of the sound are different. In this piece, I use a physical modelling synthesiser to rebuild this ancient instrument inside the virtual world. We wanted to give the same conditions to all three musicians, the AI musicians as well as me, so we tried to find one instrument suitable for the whole piece. I found that the bianqing works really well: it has a large pitch range, the timbre and texture are quite unique, and the playing technique is not really difficult. Mimicking this playing technique with a human-computer interaction sensor is quite easy because it is just striking; nowadays you can use a gyroscope-based sensor to reproduce it.
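For readers curious what a physical-modelling approach to a struck stone chime can look like in code, here is a minimal modal-synthesis sketch in Python. It illustrates the general technique only, not Fan's Max patch; the modal ratios, amplitudes and decay times are invented values, not measurements of a real bianqing.

```python
"""Minimal modal-synthesis sketch of a struck stone chime.

Illustrative only: the modal data below is made up, not derived from a
real bianqing, and this is not the Max/MSP patch used in Metamorphosis.
"""
import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 44100

# Hypothetical modal data: (frequency ratio, relative amplitude, decay time in seconds)
MODES = [
    (1.00, 1.00, 2.5),
    (2.76, 0.60, 1.8),
    (5.40, 0.35, 1.2),
    (8.93, 0.20, 0.8),
]

def strike(fundamental_hz: float, velocity: float, duration: float = 3.0) -> np.ndarray:
    """Synthesise one strike as a sum of exponentially decaying partials."""
    t = np.linspace(0.0, duration, int(SAMPLE_RATE * duration), endpoint=False)
    out = np.zeros_like(t)
    for ratio, amp, decay in MODES:
        # Harder strikes (higher velocity) excite the upper modes a little more.
        mode_amp = amp * (velocity ** (1.0 + 0.5 * ratio / MODES[-1][0]))
        out += mode_amp * np.exp(-t / decay) * np.sin(2 * np.pi * fundamental_hz * ratio * t)
    return out / np.max(np.abs(out))

if __name__ == "__main__":
    note = strike(fundamental_hz=523.25, velocity=0.8)   # roughly C5
    wavfile.write("chime_strike.wav", SAMPLE_RATE, (note * 32767).astype(np.int16))
```

A real model would derive the modes from recordings or measurements of the stone slabs rather than fixed ratios.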

Watch full interview:

What is the main concept of the piece?

As I mentioned, my first idea was to include everything from my PhD study. During my PhD I saw and heard a lot of people talking about creativity between humans and machines, and that is one key point of this piece: creating a hybrid creativity between humans and machines, and expressing how we can work with machines and find the balance point between machine and human creativity. A lot of people think about AI with fear; they imagine AI taking over the world or being able to do everything. I wanted to express in this piece that without humans the machine cannot evolve, and also that we can learn a lot from machines, because they are good at things that can inspire us back on the human side. I use this idea in the piece, starting from the first question, from the human side to the machine side, in a question-answer pattern. Later on it merges together and you cannot really tell which part is played by the machine and which part is human. It builds a journey from one-dimensional interaction to multidimensional interaction, until you cannot differentiate them.

What kind of sensorial experience did you attempt to create for the listener?

I think it is quite obvious. In the first section, we have three images. I got the idea from online meetings, because when I composed this piece the lockdown had just started and everything had to happen on Zoom, so you have a small screen for each performer. I borrowed this idea of a very old-fashioned online conference: if someone speaks louder, their little image zooms in and the others become smaller. I used it to guide the audience's attention and to tell them who is leading each small musical phrase. Later on, I break this layout because I also want to express the idea that at that moment we somehow merge together. I have three interpolating sections between the first section and the last one, and not only the layout changes but also the colours. In the first section there is a black background and you can see the instrument itself, but later on it is another world, so I chose a different, white background. Everything else becomes more abstract, because in the first section I want to give a concrete image of this instrument and of how each musician plays it: when the instrument is played, it moves and responds visually. Later on it is more indirect: with the AI's help you guide the instrument rather than playing it directly, and the instrument gives back a different kind of timbre and texture of sound.

Do the AI models create the visual elements too?

No, but it is all part of the overall system. They have a visual presence, like me, but they did not create the visual part. I interact with the virtual world, which visually responds to my interaction, and I also wanted to create the same connection between the AI musicians and this virtual world: if an AI musician plays the instrument, it gets the same reaction. Later on it is a little bit different, because I wanted the visual style to be consistent. The human body shape you see is a real-time capture of my body using key-point tracking, so you see my movements and how I play and interact with this virtual world, and there is also a stylised image that represents the AI. The AI performers have more freedom to move around in the world, but at that point, especially in the last section, I try to create the idea that we are not directly playing the instrument but rather creating this kind of sound and image in the virtual world together.
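The interview does not name the tracking software, but as an illustration, a real-time body key-point capture loop of the kind described could be sketched in Python with the OpenCV and MediaPipe libraries (both are assumptions here, not confirmed choices):

```python
"""Minimal real-time body key-point capture sketch (illustrative only;
the piece's actual tracking pipeline is not specified in the interview)."""
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def run_capture(camera_index: int = 0) -> None:
    cap = cv2.VideoCapture(camera_index)
    with mp_pose.Pose(model_complexity=1) as pose:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB; OpenCV delivers BGR.
            results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.pose_landmarks:
                # Each landmark is a normalised (x, y, z) body key point that
                # could be forwarded to a visual engine such as TouchDesigner.
                wrist = results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_WRIST]
                print(f"right wrist: x={wrist.x:.2f} y={wrist.y:.2f}")
    cap.release()

if __name__ == "__main__":
    run_capture()
```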

Is it improvisational?

Yes, but because I have practised so much, I have a fixed pattern in my mind, so each time we start in more or less the same way. It is just like playing with someone else: once you have played together many times you know each other, you know how he or she will react to your music, so you somehow build this pattern between you.

How did you train the AI models?

The training data for the machine learning model is a mix. One part is Asian instrument scores and the other is a piano competition MIDI database; I think it is the Yamaha e-Piano Competition dataset. They recorded real human performances on an e-piano and converted that information.

I used unsupervised learning. The AI model generates MIDI-like information: it is not exactly MIDI, but each event has a pitch and a velocity. The model comes from Magenta, I think, from Google. They published a paper about the architecture, which generates musical information close to a human performance, with rich expression and velocity.
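The data preparation is not described in detail, but flattening a recorded MIDI performance into the kind of pitch-and-velocity event stream mentioned here might look roughly like this; the use of the pretty_midi library and the file name are illustrative assumptions:

```python
"""Sketch: flatten a recorded MIDI performance into (pitch, velocity, delta-time)
events, the kind of representation described in the interview. The use of
pretty_midi here is an assumption for illustration, not Fan's actual pipeline."""
import pretty_midi

def midi_to_events(path: str) -> list[tuple[int, int, float]]:
    """Return a time-ordered list of (pitch, velocity, seconds since the previous note)."""
    midi = pretty_midi.PrettyMIDI(path)
    notes = sorted(
        (note for inst in midi.instruments if not inst.is_drum for note in inst.notes),
        key=lambda n: n.start,
    )
    events, previous_start = [], 0.0
    for note in notes:
        events.append((note.pitch, note.velocity, note.start - previous_start))
        previous_start = note.start
    return events

if __name__ == "__main__":
    for pitch, velocity, dt in midi_to_events("performance.mid")[:10]:
        print(f"pitch={pitch:3d} velocity={velocity:3d} dt={dt:.3f}s")
```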

Were the two models trained differently?

They are the same, but I did want to try something with them. When I interact with an AI musician, it is human-to-machine; I also wanted to see how machine-to-machine works. I am really curious about how machines interact with each other, and especially about the difference between human-to-machine and machine-to-machine interaction. The second reason I have not trained them differently is a practical limitation: right now the computer is not really powerful enough to run everything in real time, but I still try to achieve real-time interaction during the composition, especially with the performers. With two agents I can ask one agent to listen to the other and generate musical information, so while the first musician is playing, the second one can already start listening.
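As a purely hypothetical sketch of the listening chain described above (human to first agent, first agent to second agent), the structure could look like this; `ToyAgent` and its behaviour are stand-ins for the real generative models, which the interview does not show:

```python
"""Sketch of the human -> agent 1 -> agent 2 listening chain described above.
`ToyAgent` is a hypothetical placeholder, not the actual Magenta-based model."""
import random

Event = tuple[int, int]  # (MIDI pitch, velocity)

class ToyAgent:
    def __init__(self, name: str, spread: int):
        self.name = name
        self.spread = spread  # how far the agent wanders from what it heard

    def respond(self, heard: list[Event]) -> list[Event]:
        """Generate a short answer phrase based on the material just heard."""
        response = []
        for pitch, velocity in heard:
            new_pitch = min(108, max(21, pitch + random.randint(-self.spread, self.spread)))
            response.append((new_pitch, velocity))
        return response

if __name__ == "__main__":
    human_phrase = [(60, 90), (64, 80), (67, 85)]    # played by the human performer
    agent_one = ToyAgent("agent-1", spread=2)        # stays close to the human material
    agent_two = ToyAgent("agent-2", spread=7)        # wanders further from agent 1
    phrase_one = agent_one.respond(human_phrase)     # agent 1 listens to the human
    phrase_two = agent_two.respond(phrase_one)       # agent 2 listens to agent 1
    print("human :", human_phrase)
    print("agent1:", phrase_one)
    print("agent2:", phrase_two)
```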

Do you already have some observations about their interactions?

Yes. Like I mentioned, as a composer you have to play with them a lot. When you play with the first agent and it starts to interact with the second, it somehow creates a different dynamic: the first agent tries to mimic what you, the human musician, do, and the second AI agent somehow tries to avoid what the first one did, which is a really interesting result. It is not unconditional generation; I use conditional generation, which means each AI generation is somehow limited. I ask the AI agent to generate musical information within a pitch range and within a limited speed, for example ten notes per second or just one note per second, which controls the rate of the generated information. When you give a similar condition to both agents, where one agent listens to the human and the other listens to the machine, the results are different: one mimics what the human does, but the second tries to avoid it and does something else. For example, if you play just one single note, the first agent will try to repeat what you did, but the second agent will try to warp it and make some variation, especially in the pitch range.

What kind of conditions do you have?

First of all, I think the most important condition is how many notes per second, which determines the speed and tempo of the information. The second is the pitch range and the pitch class, similar to the twelve-tone class system: you give a class of twelve notes and each note has a different weight, so if you want more C notes, for example, you give C more weight. That basically gives you the pitch combination you want. Another condition is the temperature of the model, which controls the randomness of the generation. If you set the temperature toward the low end (normally it is a zero-to-one range), the model generates the musical information with the lowest loss, which means its best choice of note. But sometimes in composition we do not want the best; we want some adventure, to give the AI agent some range, some space to play a little bit. Just as repetition is the safest option in natural language processing, repeating notes is the safest choice for the agent in music generation. However, you do not want it to repeat every time, so the temperature forces it to do something different.
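To make these conditions concrete, here is a small generic sketch of how a notes-per-second limit, pitch-class weights and a sampling temperature can shape a generator's output. It is not the Magenta-based code used in the piece; all names and values are illustrative:

```python
"""Generic sketch of conditional note generation: pitch-class weights,
a notes-per-second limit and a sampling temperature. Illustrative only;
not the model used in Metamorphosis."""
import numpy as np

def sample_notes(
    pitch_class_weights: np.ndarray,   # 12 weights, index 0 = C
    notes_per_second: float,           # the speed condition
    duration_s: float,
    temperature: float,                # low = safe choices, high = adventurous
    low: int = 48,
    high: int = 84,
    seed: int = 0,
) -> list[tuple[float, int]]:
    """Return (onset time in seconds, MIDI pitch) pairs obeying the conditions."""
    rng = np.random.default_rng(seed)
    pitches = np.arange(low, high)
    # Stand-in "model" score for each candidate pitch: its pitch-class weight.
    scores = pitch_class_weights[pitches % 12].astype(float)
    # Temperature-scaled softmax: a low temperature sharpens the distribution
    # toward the highest-scoring (safest) notes, a high one flattens it.
    logits = np.log(scores + 1e-9) / max(temperature, 1e-3)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    n_notes = int(duration_s * notes_per_second)
    return [(i / notes_per_second, int(rng.choice(pitches, p=probs))) for i in range(n_notes)]

if __name__ == "__main__":
    weights = np.ones(12)
    weights[0] = 4.0   # favour C, as in the interview's example
    for onset, pitch in sample_notes(weights, notes_per_second=2, duration_s=3, temperature=0.7):
        print(f"t={onset:4.2f}s  pitch={pitch}")
```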

How many sections do you have and are they pre-programmed?

I have three main sections, and each is different: not only the sounds but also the visuals have to play out differently in the system. The three sections have different presets and different visual and sound processing, and I have to store some preset information to recreate each environment for me and for the AI agents.

I also determine the length of each section: when I feel a section has gone on long enough, I can decide to move to the next one.

How did you build the virtual instrument?

For the physical modelling I used a modelling library, and I built the interface in Max so I could use it directly and also give an interface to the AI, because, as I mentioned, its output is MIDI-like information, so some pre-processing is needed to send all the information that controls and plays this virtual instrument. The 3D part of the instrument is really simple because the shape is not difficult, so I did the 3D modelling myself, and I used TouchDesigner for the virtual world.
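The generated events have to reach Max to sound the virtual instrument; a minimal bridge of that kind might look like the following, assuming OSC over UDP via the python-osc library. The address `/bianqing/strike` and port 7400 are hypothetical and would have to match the receiving objects in the Max patch:

```python
"""Sketch: send generated (pitch, velocity) events to a Max patch over OSC.
Assumes the python-osc library; the OSC address and port are made up for
illustration and are not taken from the actual piece."""
import time
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 7400)   # Max assumed to be listening on localhost:7400

def play_phrase(events: list[tuple[int, int, float]]) -> None:
    """events: (MIDI pitch, velocity, seconds to wait before the note)."""
    for pitch, velocity, wait_s in events:
        time.sleep(wait_s)
        # One OSC message per strike of the virtual instrument.
        client.send_message("/bianqing/strike", [pitch, velocity])

if __name__ == "__main__":
    play_phrase([(60, 100, 0.0), (67, 80, 0.5), (72, 90, 0.5)])
```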

What kind of hardware did you use?

The human-computer sensors are like a Nintendo Wii controller: there are two gyroscope controllers, one for each hand, and one camera that captures my whole body image. That is the sensor part. For the recording I use one computer, but that is a little bit dangerous, to be honest; right now I am trying a separate setup with two computers, one running the AI engine and the sound processing and another running just the visual part. The speaker configuration is an eight-channel ring, so the speakers surround the audience. Of course, sometimes I use a projector just for the video screening.
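The interview does not explain how a strike gesture is recognised from the gyroscope data, but a very simple peak-based detector over a stream of angular-speed readings could be sketched like this; the threshold and cooldown values are guesses for illustration:

```python
"""Sketch: detect strike gestures from a stream of gyroscope angular-speed
readings (rad/s). The threshold and cooldown are illustrative guesses,
not the calibration used in the actual piece."""
from collections.abc import Iterable

def detect_strikes(
    angular_speeds: Iterable[float],
    threshold: float = 6.0,    # rad/s that counts as a strike
    cooldown: int = 5,         # samples to ignore after a detected strike
) -> list[int]:
    """Return the sample indices at which a strike gesture peaks."""
    strikes, wait = [], 0
    previous = 0.0
    for i, speed in enumerate(angular_speeds):
        if wait > 0:
            wait -= 1
        elif previous >= threshold and speed < previous:
            # The peak has just passed: register the strike and start the cooldown.
            strikes.append(i - 1)
            wait = cooldown
        previous = speed
    return strikes

if __name__ == "__main__":
    fake_stream = [0.2, 1.0, 4.5, 7.2, 6.8, 2.0, 0.5, 0.3, 5.9, 8.1, 7.5, 1.2]
    print("strikes at samples:", detect_strikes(fake_stream))
```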

What kind of software did you use?

The AI agent uses a lot of libraries, but basically it is a Python program. It then sends everything over OSC to the other parts: the sound processing is in Max and the visual processing is in TouchDesigner.

Did you make everything on your own?

Yes, I learned everything myself and did everything.

How long did it take to compose the piece?

It is really hard to say, because some parts I already had. When I started to compose this piece it was just before the lockdown, so I believe it was February or March 2020. Because of the lockdown it was difficult to access my studio, so it took me about a year to finish the piece. But, like I mentioned, some of the techniques I had accumulated during my PhD studies. From when I officially started composing to the final recording of the piece, it was about a year.

What compositional stages did you go through?

I started off with an idea, for example that I want to use real-time interaction, and a rough image of the composition in my mind. The first stage was to develop the AI generation of musical information. The second stage was to study the video part, because I was actually quite new to TouchDesigner; in parallel I was doing the Max patching for the sound part. I normally develop the system first, and later on I just compose and forget about all the technology behind it. So the first stage is to develop all the technology, and the second stage is to use it in the composition. To be honest, sometimes I go back, because when I compose I find something I need to change, so I go back and fix the bug, especially in the video part. But while composing I just focus on the composition, not the technology.

Would you say that you were writing the code language and not the score?

That is not entirely true, because I do sketch out a structure for the piece, like a small reminder for myself of what I should do in the first section. Yes, when I compose I write more code than score, but I also try to use some classical musical structures.

Can you imagine the piece with one AI model?

Actually, I did that. For a second piece I used a similar idea but with one agent. It was a commission from SWR in Germany, but because of the lockdown the commission project had to be postponed.

Can you imagine the piece only with the AI models?

I have not tried it, but perhaps it is possible. I do have an installation without any human interaction. The overall idea was for the machine to have a sensor kit so it could sense the environment, for example the temperature, pressure or air conditions. When a sensor was triggered, the AI agents generated some musical information.

Can you imagine the piece without the AI models at all?

Yes, I can imagine it, because when I composed it I did not really think of the AI as an agent; I thought of it as another musician. I have always said that it is like a musician: it gives you a chance to dig into the data and lets the AI become a musician with its own personality and behaviour. It is quite similar to playing with another musician, even in the same piece.

During the compositional process, did you somehow change or alter the AI model training?

I did not. I only trained them once, and after that training I did not really change anything in the model. I think I reached the state I wanted. If you train them too much, like I mentioned, they give you the "best" results, but not the good results you want. So I limited the training process a little, so they keep their own random character, and then I give conditions to provide direction when they generate.

What have you learned during this process?

The first thing I learned is that I am not really good enough at programming AI agents, but now I understand my limitations. I think the logical sequence of composing is quite important for me, because when I start something new I also get new inspiration, and this piece gave me a lot of new ideas: as we discussed, having two machines and seeing what they can do and how human they can be. When I composed this, I tried to answer the question of what the best balance is between people and machine tools, especially when using AI for creativity. I tried my best to follow what the AI musicians did, but sometimes I got a bit lost; they can generate far more complex information than I can create with two sensors.

Another thing is knowing how to use these models to create the music I want and to find the balance. It does not happen by pressing a single button; just as with another musician, you have to actively participate.

Lastly, just by having this conversation I have realised a few things, for example about the section structures, and now I can think about different ways of presenting Metamorphosis. It makes me rethink what can and cannot be done and how to achieve certain things. I think both the music and the visual elements can guide the audience in a large-scale composition like this one.

Watch Metamorphosis:

Hongshuo Fan is a Chinese cross-disciplinary composer, new media artist and creative programmer. His work involves various real-time interactive multimedia elements, such as acoustic instruments, live electronics, generative visuals, light, and body movement. His research and creative interests focus on the fusion of traditional culture and cutting-edge technology in the form of contemporary music. His output spans chamber music, live interactive electronics, installations, and audio-visual works. His works have been presented at many international events and festivals (ICMC, AIMC, NIME, etc.). He was a faculty member of the Electronic Music Department at the Sichuan Conservatory of Music and a member of the Sichuan Key Laboratory of Digital Media Arts.

Photo: screenshot from the video of the musical piece
