Humans and AI in real-time performance — BIGYUKI x Qosmo
I had the great privilege of working as an intern this summer for Qosmo, and although my stay was rather short (a little over two months), the experience was incredibly rewarding. I not only acquired a host of new technical skills in audiovisual design, but also learned a lot about the kind of design considerations necessary to actually implement cutting-edge technology in an artful, meaningful way. My main focus during my internship was helping develop interactive audiovisual systems for a live performance by pianist BIGYUKI and Qosmo CEO Nao Tokui. In this article, I will walk through some details of what went into this performance and some lessons I learned along the way.
The central theme of this performance was exploring the interplay between humans and AI. BIGYUKI dubbed this collaboration the “John Connor Project” in reference to the Terminator franchise. However, unlike in the Terminator universe, where human protagonist John Connor leads a resistance movement against the malevolent AI Skynet, our performance represented a more symbiotic relationship, where human-AI interaction added new dimensions to musical performance that were not previously possible.
In our performance, BIGYUKI played several instruments (grand piano, synthesizer, looper, and mic), all of which were routed to Nao’s computer. From there, Nao processed the audio and MIDI data using a combination of generative AI tools and more “traditional” DJ/remixing techniques. Qosmo’s own audio plugin Neutone was heavily featured and integrated AI in multiple ways. With Neutone, audio signals from BIGYUKI were run through a variety of timbre transfer models, transforming the sounds of the piano, keyboards, and mic in real time. MIDI data from BIGYUKI’s keyboards was sent to Neutone (in an as-yet-unreleased version), where an AI model trained on MIDI from drum performances generated rhythmic variations, again in real time. Finally, audio from multiple sources was fed into a custom Max patch running MusicGen, which generated music conditioned on text prompts entered by Nao during the performance.
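For readers curious what that kind of MusicGen conditioning looks like in code, here is a minimal offline sketch using Meta’s audiocraft library. This is not the actual Max patch from the show; the model checkpoint, prompt text, duration, and file names are all illustrative assumptions.

```python
# Minimal sketch of text- and audio-conditioned generation with MusicGen
# (audiocraft). Not the performance patch; names below are placeholders.
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")  # small checkpoint for speed
model.set_generation_params(duration=8)                     # seconds of audio per clip

# Text-only conditioning, as when a prompt is typed in mid-performance.
prompt = "warm jazz piano over glitchy electronic percussion"
wav = model.generate([prompt])[0]

# Audio + text conditioning: continue from a short excerpt of the live input
# (hypothetical capture file).
excerpt, sr = torchaudio.load("live_input_excerpt.wav")
continuation = model.generate_continuation(excerpt, sr, [prompt])[0]

audio_write("musicgen_text_only", wav.cpu(), model.sample_rate, strategy="loudness")
audio_write("musicgen_continuation", continuation.cpu(), model.sample_rate, strategy="loudness")
```

In the actual show, this kind of generation ran inside the custom Max patch and its output was mixed back into the performance; the sketch above only shows the conditioning mechanics.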
In this way, there was a constant back-and-forth between the human performer and the AI models: BIGYUKI would improvise on his various instruments, and the AI tools on Nao’s computer would respond by shaping, augmenting, or complementing the incoming audio/MIDI data. This created a tightly integrated feedback loop and a perpetually shifting musical environment. Part of our objective for this performance was to convey the dynamics of this interaction to the audience through reactive visualizations. As part of the visual team (in collaboration with Ryosuke Nakajima), this was my primary responsibility: to mirror the interplay between human and AI agents using visual representations. To do this, I designed a handful of audio-reactive scenes in TouchDesigner to illustrate this juxtaposition.
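To give a concrete, if simplified, idea of what “audio-reactive” means here, below is a small TouchDesigner-style Python callback of the sort these scenes rely on, where analyzed audio levels drive scene parameters. The operator and channel names (‘geo_human’, ‘noise1’, ‘human_rms’, ‘ai_rms’) are placeholders I have assumed for illustration, not the actual scene files.

```python
# Sketch of a CHOP Execute DAT callback in TouchDesigner: incoming analysis
# channels (e.g. smoothed RMS levels of the human and AI audio) drive visual
# parameters. Operator and channel names are illustrative placeholders.

def onValueChange(channel, sampleIndex, val, prev):
    if channel.name == 'human_rms':
        # "Organic" element: swell a geometry with the human signal.
        op('geo_human').par.scale = 1 + 2 * val
    elif channel.name == 'ai_rms':
        # "Synthetic" element: push noise amplitude with the AI signal.
        op('noise1').par.amp = 5 * val
    return
```

The real scenes were considerably more layered, but the basic pattern of mapping separate human and AI analysis channels onto contrasting visual elements is the same.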
To add to the already convoluted routing of signals between the musicians, we also set up communication with the visual team. So that the visual systems could respond selectively to the human- and AI-generated sources, the audio and MIDI data from BIGYUKI and Nao were sent to the visual team’s computers separately, either over OSC or via direct connections. In keeping with the theme, the visual design of the scenes was meant to reflect the contrast between the “organic” human sources and the “synthetic” AI sources.
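As a rough illustration of how routing like this can be tagged by source, here is a tiny python-osc sketch that sends human- and AI-derived control data under separate OSC address prefixes. The IP, port, and addresses are assumptions for the example, not our exact configuration.

```python
# Hedged sketch: tagging control data by source with separate OSC address
# prefixes, so the visual systems can react to human and AI signals
# independently. IP, port, and addresses are illustrative assumptions.
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("192.168.1.50", 9000)  # a visual-team machine

# Envelope-follower value from BIGYUKI's piano channel.
client.send_message("/human/piano/level", 0.72)

# A note generated by the AI drum model (pitch, velocity).
client.send_message("/ai/drums/note", [38, 96])
```

On the receiving end, an OSC In CHOP or DAT in TouchDesigner can subscribe to these addresses and expose them as channels for the scenes to react to.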
The performance was a success — we played a sold-out show and created a truly unique experience. The synergy between BIGYUKI and Nao was electrifying and the use of the AI tools in this scenario propelled this hour-long improvisation into a whole new realm of creative expression that I (and probably most of the audience members) had never seen before. During some moments, I really thought “this is the future of live music performance”.
On the other hand, I don’t know if we achieved our other goal of conveying the complex human-computer interactions through the visual design. The sonic experience was already so novel and multifaceted that it was hard to distinguish which sounds were created by BIGYUKI and which were computer-generated. We had deliberately avoided labelling the sound sources with on-screen text so that the experience would not feel “demo-ish”. Without that explicit delineation, however, we only had visual metaphor and abstract representation to work with. Given that the AI and human sources were, at any given point, overlapping and shifting in unpredictable ways, it was almost impossible to disentangle where one began and the other ended.
I think the visual design was able to convey the human-computer interaction dynamic in a broader, more metaphorical way, and alternating between the more organic designs and the more quantized, synthetic ones was effective in communicating this theme. However, such abstract designs are not sufficient if the goal is moment-by-moment tracking of the audio sources. To make this distinction clearer, we probably would have had to introduce more structure into the performance itself.
For instance, BIGYUKI made heavy use of the looper for this show, which takes in human input and loops the playback automatically (without any AI or computer intervention). Since there is no visible human interaction, it is not clear to the audience who is producing the looper’s output, making it all the more difficult to tell which sounds were human and which were computer-generated. Additionally, while we loosely structured the musical performance to move through sections focusing on timbre transfer, rhythmic generation, and the MusicGen model, in practice we switched between tools and sometimes used multiple tools simultaneously. If the tools had been used in isolation, or if we had choreographed some intentional pauses by the human performer, perhaps it would have been clearer when AI was being used.
However, while these kinds of changes to instrumentation or musical structure might have made it easier to precisely communicate the interaction between human and AI agents, I’m not sure they would have made for an aesthetically superior performance. There was something uniquely captivating about the seamless integration of AI and human output. Furthermore, the highly improvisational nature of the performance was at the core of its appeal; imposing additional constraints, while making it more understandable, might have detracted from the infectious fun of having BIGYUKI jam with AI. Part of what made this performance so fresh and exciting was the uncertainty inherent in the mode of interaction. None of us knew which musical ideas would blossom into an interesting groove or which riffs the AI would catch on to; the sense of serendipity that came from finding these little moments is where the magic came from.
Ultimately, I think this is perhaps a lesson in designing a performance at the conceptual level: while we might have fallen short of creating a perfect explicatory showcase for the technology, the resulting performance was aesthetically fresh and, taken as a whole, of high quality. The fact that the performance seamlessly tied together human creativity and AI tools should be regarded as a strength; because it was not clear where the AI began and ended, the focus could shift to the experience as a whole. To me, this is what the future of AI in music should be about: not replacing human beings, nor playing a contrasting, opposing role, but acting as a fully integrated, even invisible, tool that allows us to create new human experiences beyond what we could have imagined.
I am grateful to have been a part of this performance, and it will continue to inspire my own research and creative projects for years to come. I will be eagerly keeping an eye out for what Qosmo and BIGYUKI do next, and I have high hopes that they will continue paving the way for these kinds of innovative and humanistic uses of AI in musical performance.