Music co-creation with AI: an experiment

by MWM · Jul 22, 2022

Using machine learning for music production has been a steady field of experimentation over the past years; notably, Google’s Magenta project is working on new approaches to music production with the help of AI. So far, the focus has mostly been on desktop and web-based applications.

ML Framework

One of their early experiments in musical co-creativity was Piano Genie: a generative model for MIDI controlled entirely by 8 input buttons. No matter which button was hit, the model mapped it to harmonically “correct”-sounding keys on an 88-key piano. Higher-indexed buttons (i.e. further to the right in the 8-button row) resulted in higher keys on the piano, lower-indexed buttons in lower ones. This created a very convincing-sounding output. The creators of the experiment, Chris Donahue et al., used 8 physical buttons as inputs and a model running on a computer.

Our inspiration: Piano Genie

We loved this simple but effective method of co-creating with AI and ported the experience to mobile devices: MWM LAB 01 was born.

In order to implement Piano Genie on mobile, we first re-trained the model with PyTorch, following the excellent Music co-creation tutorial given by Chris Donahue at ISMIR 2021.

Piano Genie has an auto-encoder structure that encodes piano performances into a reduced latent space of 8 keys. The decoder learns to reconstruct the original performance from this space. Both the encoder and the decoder use RNNs (Recurrent Neural Networks) to capture meaningful relations between notes in the time domain. While the system is trained entirely end-to-end, we only use the decoder part at inference time, as the user plays on the 8-button keyboard.

Schema of the model’s architecture with the decoder part used in the app on the right

Mobile Conversion

After some test runs of the model, the results were promising and we decided to continue with the implementation. Our first approach was to convert it to CoreML, which did not work due to the large number of inputs and the sequential nature of the model. RNNs are usually harder to convert to CoreML and require some modifications of the model.

Another way of porting this model to mobile is to use the PyTorch C++ API for iOS. This requires converting the model to its TorchScript format, which can be run independently of Python and is therefore well suited for edge inference.

Mobile Implementation

ML-Model Integration

Running a TorchScript model on iOS is done with the help of LibTorch-Lite by PyTorch. They also provide a step-by-step tutorial which explains the core concepts of the setup and showcases an example. It takes a bit of tinkering with the model inputs and outputs using the PyTorch C++ API, but it generally works well and inference is performant.

Implementation Challenges

After connecting the MIDI output of the model to our in-house audio engine, there were still some issues to be tackled.

NoteOff

For noteOff we decided to track how long the user presses a button and map the corresponding touchesEnded event to the noteOff of the generated output note. Done, that was nice and easy.
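A minimal sketch of how that per-touch bookkeeping can look in UIKit is below; the callback names, the activeNotes dictionary and the fixed 8-button layout are our illustration, not the app’s actual code:

```swift
import UIKit

// Sketch of the per-touch note tracking described above (illustrative only).
class ButtonPadView: UIView {

    var onNoteOn: ((Int, Int) -> Void)?   // (buttonIndex, velocity) -> fed to the model
    var onNoteOff: ((Int) -> Void)?       // buttonIndex whose note should stop

    // Which button each active touch is currently holding down.
    private var activeNotes: [UITouch: Int] = [:]

    override func touchesBegan(_ touches: Set<UITouch>, with event: UIEvent?) {
        for touch in touches {
            let index = buttonIndex(at: touch.location(in: self))
            activeNotes[touch] = index
            onNoteOn?(index, 100)   // fixed velocity here; see the velocity hack below
        }
    }

    override func touchesEnded(_ touches: Set<UITouch>, with event: UIEvent?) {
        for touch in touches {
            if let index = activeNotes.removeValue(forKey: touch) {
                onNoteOff?(index)   // the touch that started the note has lifted
            }
        }
    }

    // 8 equally wide buttons in a row; isMultipleTouchEnabled must be true for chords.
    private func buttonIndex(at point: CGPoint) -> Int {
        let index = Int(point.x / (bounds.width / 8))
        return min(max(index, 0), 7)
    }
}
```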

Velocity

Velocity, on the other hand, is a real problem on a mobile device, as most devices do not track the pressure applied to the screen. Apple did this in earlier iPhones but abandoned 3D Touch in newer models. Without it, it is impossible to get an accurate velocity from a user touch. That said, it can be hacked. For our prototype implementation we estimated velocity by mapping the surface area covered by each UITouch on a button to a velocity scale from 0 to 127:

Velocity estimation hack

We assumed that in most cases a harder touch covers more surface area on the screen than a lighter touch. After some experimentation, the hack enabled a more expressive way to play than without it, which justified keeping it for further iterations. Usually about 4 distinct velocity levels can be detected with this method, and the user adjusts to them naturally via the feedback in sound intensity.
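A minimal sketch of this kind of mapping, using UITouch’s majorRadius as a stand-in for the covered surface (the radius bounds below are assumptions, not the values from our actual implementation):

```swift
import UIKit

/// Rough velocity estimate from the size of the finger's contact area.
/// majorRadius is the radius (in points) of the touch; a firmer press
/// usually flattens the fingertip and reports a larger radius.
func estimatedVelocity(for touch: UITouch) -> Int {
    let minRadius: CGFloat = 5    // assumed light touch
    let maxRadius: CGFloat = 30   // assumed very firm touch
    let clamped = min(max(touch.majorRadius, minRadius), maxRadius)
    let normalized = (clamped - minRadius) / (maxRadius - minRadius)
    return Int(normalized * 127)  // map to the MIDI velocity range 0...127
}
```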

Simultaneous Finger Touches

We also ran into a hardware / firmware limitation. On an iPhone, only 4 finger touches are tracked simultaneously; once a 5th finger hits the screen, all touch events and their data are erased. On an iPad this is less of an issue, as the limit is 11 simultaneous touches. Despite our efforts, we could not push past this boundary. A simple reminder in the app’s onboarding screen to play with a maximum of 4 fingers had to suffice for the time being.
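When iOS drops the touches, touchesCancelled is called on the view. One way to at least avoid hanging notes is to release everything there; a sketch that simply extends the ButtonPadView example from the noteOff section above:

```swift
override func touchesCancelled(_ touches: Set<UITouch>, with event: UIEvent?) {
    // The system cancelled our touches (e.g. too many fingers or a system gesture):
    // send a noteOff for every note that is still sounding so nothing hangs.
    for (_, index) in activeNotes {
        onNoteOff?(index)
    }
    activeNotes.removeAll()
}
```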

Going Further

To expand the existing use cases of the app, we decided to add two experimental modes: Generative Co-Creation and Hand Tracking. We also added MIDI output integration so the user can route the generated notes to instruments outside the app.

Generative Co-Creation

Generative Co-Creation is based on the simple fact that no matter which buttons are pressed, the model somehow makes sense of them and the result sounds harmonically correct. To help users get started with playing, especially those who are not familiar with the piano, a simple algorithm creates random button noteOn and noteOff events with a random velocity within a certain range. These random button presses serve as the model input and are quantized to 16th notes by the audio engine on a muted metronome track. Played back with a harp as the instrument, the model output sounds surprisingly coherent, although after a while this generative soundscape of course lacks a bigger-picture song structure. That said, it stays harmonically consistent and feels overall like an ambient harp solo.
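A sketch of how such a random 16th-note generator could be wired up; the tempo, note length and velocity range are illustrative assumptions, and a plain Timer stands in for the audio engine’s metronome-driven quantization:

```swift
import Foundation

final class RandomButtonPlayer {
    var onNoteOn: ((Int, Int) -> Void)?   // (buttonIndex, velocity) -> feed into the model
    var onNoteOff: ((Int) -> Void)?

    private var timer: Timer?

    /// Start firing random button presses on a 16th-note grid.
    func start(bpm: Double = 90) {
        let sixteenth = 60.0 / bpm / 4.0
        timer = Timer.scheduledTimer(withTimeInterval: sixteenth, repeats: true) { [weak self] _ in
            guard let self = self else { return }
            // Don't trigger on every 16th; leave some space in the phrase.
            guard Bool.random() else { return }
            let button = Int.random(in: 0..<8)
            let velocity = Int.random(in: 60...110)
            self.onNoteOn?(button, velocity)
            // Release the note one 16th later so the next trigger starts cleanly.
            DispatchQueue.main.asyncAfter(deadline: .now() + sixteenth) {
                self.onNoteOff?(button)
            }
        }
    }

    func stop() {
        timer?.invalidate()
        timer = nil
    }
}
```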

Hand Tracking

Hand Tracking has become easy to integrate thanks to Apple’s Vision framework. It offers model inference for detecting human hand poses right in Swift with a few lines of code, and there is a helpful API walkthrough from WWDC 2020. Super nice, thanks Apple 🙂

In our implementation we track one of the upper joints of each middle finger, as the normalized X/Y coordinates proved relatively accurate even without smoothing. Mapping those finger positions to buttons and generating clear sonic feedback is still tricky. To offer finer-grained control over the input triggers, we retrained the model with 16 inputs. Musically we preferred the results of the 8-button model, as the 16-button model generated a lot of higher-pitched notes, but the 16-button version still proved to be the better choice from a hand-interaction perspective: it doubles the number of possible triggers, which turns out to be easier and more expressive to play if the user’s hands keep a distance of more than 40 cm from the phone. In our tests it felt a bit like playing an invisible harp. While noteOn events are tracked successfully by detecting whether the user’s hand is hovering over an input button on the UI, mapping noteOff events meaningfully proves to be hard. In the current implementation we hardcoded a note duration of 0.1 seconds to provide a fixed noteOff. This is far from ideal but works reasonably well in most cases for a rewarding playing experience.
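A condensed sketch of the Vision side of this, tracking the middle finger’s PIP joint of up to two hands and turning its x position into one of 16 button indices. Which joint exactly the app uses, the confidence threshold and the mapping details are our assumptions:

```swift
import Vision
import CoreVideo

final class HandButtonTracker {
    private let request: VNDetectHumanHandPoseRequest = {
        let request = VNDetectHumanHandPoseRequest()
        request.maximumHandCount = 2   // one "virtual finger" per hand
        return request
    }()

    /// Returns one button index (0...15) per confidently detected hand.
    func buttonIndices(in pixelBuffer: CVPixelBuffer) -> [Int] {
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
        do { try handler.perform([request]) } catch { return [] }
        guard let observations = request.results else { return [] }

        return observations.compactMap { observation in
            // PIP = the middle joint of the finger; its normalized position is fairly stable.
            guard let joint = try? observation.recognizedPoint(.middlePIP),
                  joint.confidence > 0.5 else { return nil }
            // Vision returns normalized coordinates (0...1); split the x axis into 16 buttons.
            let index = Int(joint.location.x * 16)
            return min(max(index, 0), 15)
        }
    }
}
```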

MIDI Out

Music apps are a lot more powerful when they can be integrated into other system setups and creative ecosystems. Therefore, communication with other apps and music-creation tools was the final step to tackle. We added a MIDI OUT service to the app for channeling the generated MIDI messages to other iOS apps, or via Bluetooth or a cable / camera connection kit to hardware devices like synthesizers, and to DAWs.

We did not find many resources out there that explain in detail how to send MIDI messages with Apple’s updated MIDI UMP API. After some experimentation we put together some basic code that works. Hopefully that’s helpful for someone else out there struggling with this:

MIDI routines in Swift

NoteOff works similarly, using Apple’s built-in MIDI1UPNoteOff function.
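Since the routines above are only shown as a screenshot, here is a minimal standalone sketch along the same lines: creating a client and output port, wrapping a MIDI 1.0 note-on in a Universal MIDI Packet word with MIDI1UPNoteOn, and sending it to every destination. It is our reconstruction, not the exact code from the app, and error handling is trimmed:

```swift
import CoreMIDI

final class MIDIOutService {
    private var client = MIDIClientRef()
    private var outputPort = MIDIPortRef()

    init?() {
        // 0 == noErr; bail out if CoreMIDI setup fails.
        guard MIDIClientCreateWithBlock("LAB01" as CFString, &client, nil) == 0,
              MIDIOutputPortCreate(client, "LAB01 Out" as CFString, &outputPort) == 0 else {
            return nil
        }
    }

    /// Send a MIDI 1.0 note-on wrapped in a UMP event list to all destinations.
    func sendNoteOn(note: UInt8, velocity: UInt8, channel: UInt8 = 0) {
        // 32-bit UMP word for a MIDI 1.0 channel-voice note-on (group 0).
        let word: UInt32 = MIDI1UPNoteOn(0, channel, note, velocity)

        var eventList = MIDIEventList()
        let packet = MIDIEventListInit(&eventList, ._1_0)
        _ = MIDIEventListAdd(&eventList, MemoryLayout<MIDIEventList>.size, packet, 0, 1, [word])

        for index in 0..<MIDIGetNumberOfDestinations() {
            MIDISendEventList(outputPort, MIDIGetDestination(index), &eventList)
        }
    }
}
```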

Prototype UI

For the final App Store version we realized that changes to the UI were needed. We redesigned the main interface and the settings screen to make navigating the prototype app easier:

The app uses a simple color coding system to guide the user along the different functions: Gray is for displaying information, Yellow can be tapped / activated, Green is tapped / active. The interface elements and fonts are designed to be simple, informative and clear. Organizing the inputs as a row of buttons was based on the original hardware version of Piano Genie.

Overview of the final UI
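As a rough illustration, the color convention described above can be captured in a single state-to-color mapping (a sketch, not the app’s actual code):

```swift
import UIKit

/// The UI color coding: gray = information, yellow = tappable, green = active.
enum ControlState {
    case info
    case available
    case active

    var color: UIColor {
        switch self {
        case .info:      return .systemGray
        case .available: return .systemYellow
        case .active:    return .systemGreen
        }
    }
}
```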

Here is another video from our MWM Lab workbench with some hardware synthesizers connected to the app.

LAB 01 connected to semi-modular synthesizers

LAB 01 on App Store

That’s probably all for LAB 01. We hope you have fun with our prototype on the App Store:

Scan to download LAB01 on iOS

And again, a big thanks to the Google Magenta and Piano Genie teams for the inspirational foundations!

About MWM

MWM is a creative apps publisher and the creator of Edjing, the most downloaded mobile DJ app in the world. Our team works on a range of products including music, photo, video and drawing apps. Make sure to check them out on the App Store and Play Store.

We are hiring, so don’t hesitate to check out our open positions at https://mwm.io/

Written by Roland Arnoldt and Virgile Boulanger
