[CC Lab Spring 2020] The Dawn of “x-magic” — An Audiovisual Instrument for Magical Performances —

Ryogo Ishino
Computational Creativity Lab at Keio SFC
Aug 3, 2020 · 6 min read

Hello.

I’m Ryogo Ishino, a student at Keio SFC and a member of CC Lab. Since April 2020, Aina Ohno and I have been working on the “x-magic” project. In this article, let me introduce what we have done so far.

Contents

  • What’s “x-magic”?
  • Let’s define “magic”
  • Previous Research
  • Our Work
  • Full Demo
  • Future Work

What’s “x-magic”?

Just an image.

Music and the visual arts, like many other practical fields, have advanced technologically with the evolution of AI. I hope that we will soon be able to use “magic” like that in Harry Potter in the real world. Ahead of others, I would like to establish what “magic” is and how it works in the context of audiovisual arts. This is what made me start the “x-magic” project. Our goal is to create an audiovisual instrument and give an interactive, magical performance with sound and visuals.

Let’s define “magic”

This project is not a fantasy. Before starting, I have to define the word “magic” in this context, so that we can approach creating magic logically.

When it comes to sound, almost all acoustic and electronic instruments have some kind of physical input unit: to play a certain sound, you press a certain key or button, pluck a string, blow into a pipe, hit a snare drum, and so on. This means the player’s gestures are already restricted by the instrument. What if you could play without pressing or hitting anything, just by moving your arms freely in the air? Or what if you could play several types of sounds with a single simple motion? That sounds like magic, doesn’t it?

Here, performing magic means carrying out a wide range of operations and expressions intuitively through a UI that has only a few input units. In other words, performing without relying on buttons, knobs, sliders, or keyboards, and operating by gesture or speech recognition instead.

Previous Research

Among previous work, there are some examples that come close to magic.
- The theremin is one of the earliest electronic musical instruments. You play it not by touching it, but by moving your hands in front of it.
- Piano Genie is an interface with only 8 buttons that lets you play the piano with the assistance of AI. This is an example of extending expression.
- Dr. Atau Tanaka’s work is a great example of gesture–sound interaction, built with both machine learning and sensors.

However, I found two problems. First, their devices are too machine-like. To look like magic, the device should be sleek, like a magic wand. Second, with the theremin and Dr. Atau Tanaka’s work, the performer just moves or swings their hands in the air without holding anything. As an audience member, I felt like asking, “What are you actually doing?” When someone plays the guitar or piano, the player makes physical contact with the instrument, so it’s clear to the audience what’s going on. Therefore, I think not only the sound but also visualization is important to make the performance understandable for the audience.

Our Work

This semester, we made a prototype of a magic wand with a Nintendo Switch Joy-Con. The sensor data are sent to a Max for Live device in Ableton Live over OSC (thanks to atsukoba).
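
As a rough illustration of this data flow, here is a minimal Python sketch that sends sensor values over OSC with the python-osc library. The address names (/joycon/gyro/x, /joycon/accel), the port 8000, and the read_joycon() helper are assumptions made for this example, not the actual names used in our Max for Live device.

```python
import time
from pythonosc.udp_client import SimpleUDPClient

# Hypothetical helper: replace with your own Joy-Con reader.
# It should return (gyro_x, (accel_x, accel_y, accel_z)).
def read_joycon():
    return 0.0, (0.0, 0.0, 9.8)

client = SimpleUDPClient("127.0.0.1", 8000)  # assumed host/port of the OSC receiver

while True:
    gyro_x, accel = read_joycon()
    client.send_message("/joycon/gyro/x", gyro_x)      # used for pitch
    client.send_message("/joycon/accel", list(accel))  # used for velocity
    time.sleep(0.01)  # roughly 100 Hz sensor rate
```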

The control mapping is as follows:
- X-axis of the gyroscope: controls the pitch
- Accelerometer (any axis): controls the velocity
- ZR button: press to receive the value from the accelerometer
- R button: controls the duration
- B button: hold to turn on the Gesture Effect Trigger Mode
- Analog stick: press and tilt to control the timbre with NSynth

The sensor data are sent here.
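
On the receiving side, the Max for Live device maps the incoming values to note parameters. The sketch below shows an equivalent mapping in Python, assuming the same hypothetical OSC addresses as above; the value ranges are illustrative only.

```python
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

def scale(value, lo, hi, out_lo, out_hi):
    """Linearly map value from [lo, hi] to [out_lo, out_hi], clamped."""
    t = max(0.0, min(1.0, (value - lo) / (hi - lo)))
    return out_lo + t * (out_hi - out_lo)

def on_gyro_x(address, x):
    pitch = int(scale(x, -250.0, 250.0, 48, 84))  # gyro X -> MIDI pitch (assumed range)
    print("pitch:", pitch)

def on_accel(address, ax, ay, az):
    magnitude = (ax * ax + ay * ay + az * az) ** 0.5
    velocity = int(scale(magnitude, 0.0, 40.0, 1, 127))  # accel magnitude -> velocity (assumed range)
    print("velocity:", velocity)

dispatcher = Dispatcher()
dispatcher.map("/joycon/gyro/x", on_gyro_x)
dispatcher.map("/joycon/accel", on_accel)

BlockingOSCUDPServer(("127.0.0.1", 8000), dispatcher).serve_forever()
```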

Now, let me explain some of the technical points.

NSynth

NSynth is a neural synthesizer announced by Google Magenta in 2017. With NSynth, you can interpolate between pairs of instruments to create new sounds. For further details, check their website. In our project, we used their Max for Live device, which lets you play the synthesized sounds by exploring an intuitive grid interface. With the Joy-Con, you can control the timbre using only the analog stick. Note, however, that the sounds are synthesized ahead of time, so the device is not actually generating audio in real time.

The NSynth Max for Live device
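
To give an idea of how an analog stick can drive such a grid, here is a small sketch that maps a stick position in [-1, 1]² to one of the pre-rendered samples on an N×N grid. The grid size, file layout, and names are hypothetical; the real Max for Live device handles this internally.

```python
def stick_to_grid(stick_x, stick_y, grid_size=11):
    """Map an analog-stick position in [-1, 1] to a cell on an N x N sample grid."""
    # Normalize from [-1, 1] to [0, 1], then quantize to a grid index.
    u = (stick_x + 1.0) / 2.0
    v = (stick_y + 1.0) / 2.0
    col = min(grid_size - 1, int(u * grid_size))
    row = min(grid_size - 1, int(v * grid_size))
    return row, col

# Hypothetical layout: one pre-rendered NSynth sample per grid cell.
row, col = stick_to_grid(0.3, -0.8)
sample_path = f"nsynth_grid/sample_{row:02d}_{col:02d}.wav"
print(sample_path)  # e.g. nsynth_grid/sample_01_07.wav
```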

Gesture Effect Trigger Mode

In this mode, you can turn effects units in Ableton Live on and off using only gestures that you trained beforehand. For this I used dynamic time warping (DTW), an algorithm for measuring the similarity between two temporal sequences, which also works well for gesture recognition. The accelerometer data serve as the gesture data. You can easily implement DTW in Max with a machine learning library called ml-lib. To understand how it works, please watch the demo below.

Gesture Recognition by DTW

First, press the “clear” button to reset, then select “train” mode to start training your gestures. Select a radio button in the box and record a gesture: the device records your gesture data while the B button on the Joy-Con is held down, so keep holding it while you perform the gesture you want the device to learn. After recording a gesture for each radio button, press the “train” button to train the model; it learns to associate each radio button ID with the gesture you recorded for it. Finally, select “map” mode and perform one of the trained gestures, again holding the B button for the duration of the gesture. If training was successful, the device plays the sound assigned to the radio button ID that matches your gesture. This demo simply plays a sound to show how the gesture recognition works; in the actual performance, the same mechanism turns effects units on and off in Gesture Effect Trigger Mode.
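
As a concrete reference, here is a minimal Python sketch of the same idea: a plain DTW distance plus nearest-template classification over recorded accelerometer sequences. It is only an illustration of the algorithm; in the actual patch this is handled by ml-lib inside Max, and the recordings below are stand-ins.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two gesture recordings of shape (time, 3) accelerometer frames."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # skip a frame in a
                                 cost[i, j - 1],      # skip a frame in b
                                 cost[i - 1, j - 1])  # match frames
    return cost[n, m]

# "train": store one recording per gesture ID (the radio-button IDs in the patch).
templates = {
    1: np.random.randn(60, 3),  # stand-in for a real recorded gesture
    2: np.random.randn(45, 3),
}

# "map": classify a new recording by its nearest template.
def classify(recording, templates):
    return min(templates, key=lambda gid: dtw_distance(recording, templates[gid]))

print(classify(np.random.randn(50, 3), templates))
```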

Visualization

This part was mainly done by Aina Ohno. We made two types of visualizations with PoseNet, a deep learning model for pose estimation, and TouchDesigner.

Visualization Demo

In the first demo, PoseNet detects the direction of your face and sends the data to TouchDesigner over OSC. TouchDesigner then draws lines whose appearance corresponds to the clarity of the sound: when the sound is clear, with little noise, the lines are sharp and thick; when the sound is noisy, the lines appear blurry and thin.
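
As a sketch of this pipeline, the snippet below estimates a rough horizontal face direction from PoseNet keypoints (nose and eyes) and forwards it to TouchDesigner over OSC. The keypoint dictionary, the /face/yaw address, and port 7000 are assumptions for illustration, not the exact values used in our setup.

```python
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 7000)  # assumed TouchDesigner OSC-In port

def face_yaw(keypoints):
    """Rough left/right face direction in [-1, 1] from PoseNet keypoints (pixel coordinates)."""
    nose = keypoints["nose"]
    left_eye, right_eye = keypoints["leftEye"], keypoints["rightEye"]
    eye_mid_x = (left_eye[0] + right_eye[0]) / 2.0
    eye_span = abs(right_eye[0] - left_eye[0]) or 1.0
    # Nose offset from the eye midpoint, normalized by the eye distance.
    return max(-1.0, min(1.0, (nose[0] - eye_mid_x) / eye_span))

# Example keypoints (x, y) as PoseNet might report them for one frame.
keypoints = {"nose": (320, 240), "leftEye": (300, 225), "rightEye": (345, 226)}
client.send_message("/face/yaw", face_yaw(keypoints))
```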

Visualization Demo 2

In the second demo, particles are displayed as if they were tracing your hand, which is done by detecting the position of your hand with PoseNet. This time the particles correspond to the sound volume: as the sound gets louder, more particles appear and they remain on screen for a longer time.
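
The volume-to-particle mapping can be sketched in the same way: compute the RMS level of an audio block and turn it into a particle count and lifetime for TouchDesigner. The scaling constants and the /particles address are illustrative assumptions.

```python
import numpy as np
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 7000)  # assumed TouchDesigner OSC-In port

def particle_params(audio_block, max_particles=500, max_lifetime=3.0):
    """Map the RMS level of one audio block to a particle count and lifetime (seconds)."""
    rms = float(np.sqrt(np.mean(audio_block ** 2)))
    level = min(1.0, rms * 4.0)             # rough normalization, assumed
    count = int(level * max_particles)      # louder -> more particles
    lifetime = 0.5 + level * max_lifetime   # louder -> particles linger longer
    return count, lifetime

block = np.random.uniform(-0.2, 0.2, 1024)  # stand-in for one block of audio samples
count, lifetime = particle_params(block)
client.send_message("/particles", [count, lifetime])
```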

Full Demo

x-magic Demo
x-magic Demo 2

Future Work

  • Make an original device (with Coral, Raspberry Pi, sensors, 3D printer)
  • Explore the latent space of a timbre VAE and generate sounds in real time
  • Improve the visualization
  • AR, MR
  • Use the webcam to also generate or control the sound
  • Speech recognition

As a next step, I would like to make my own original device. Since the prototype used a Joy-Con, there were still many buttons available. The new device should have fewer buttons and be longer and thinner, so that it looks like the magic wands you’ve seen in fantasy stories. This means I need to rethink the mapping of the input units. Adding speech recognition would be one solution; it would be like chanting a spell to cast magic.

I would also like to think more about what an instrument is, how to use sensors and AI in the best way, and what impact this could have on society. This process is necessary for the project to progress in the right direction and, ultimately, to establish magic in the real world.
