A look under the hood of Simply Piano (Part 2)

Meet MusicSense™ — JoyTunes’ acoustic piano recognition engine

Yoni Tsafir · Simply · 7 min read · Aug 8, 2018


In Part 1, I covered the application frameworks behind Simply Piano and briefly mentioned that the app gives you instant feedback on how you are playing.

This part is all about the acoustic piano recognition engine MusicSense™, the core technology behind all of JoyTunes’ products that makes this magic happen.

If you’re a mobile developer thinking about integrating Machine Learning into your app, or if you’re just curious about the unique challenges a music education app like Simply Piano faces, this post is for you.

The Mission

To understand why an acoustic recognition engine is required, it’s first important to establish a few basic assumptions:

  • If we want to teach users how to REALLY play music and have them learn and progress, we need them to use the app on a real piano or keyboard.
  • Real-time feedback is crucial for a truly fun and engaging interaction. When playing a song, you need to know right away whether you got a note right; the feedback will simply feel irrelevant if you are already thinking about the next notes that are coming.

You might be thinking: why not just let users play on a USB/MIDI-capable keyboard? They could simply connect the keyboard to their smartphone or tablet with a cable.

This would indeed give them instant, flawless in-app feedback about how they played, but it isn’t a good enough solution, for two reasons:

  • These cables aren’t cheap. For a user who only wants to try the app before committing, they could be an unnecessary blocker.
  • This doesn’t solve the problem for users with an acoustic piano.

So, essentially, we want a cable-free setup that can analyze what the user is playing in real time and give accurate, low-latency feedback about the notes that were played.

So technically, how is it done?

The high-level flow is quite intuitive:

We sample the device’s built-in mic and get a stream of PCM audio data. We then run the magic of the MusicSense™ engine and get a vector over the 88 piano keys indicating which of them were active in the last frame. The game engine then processes these keys against what the user was expected to play and advances the game state accordingly.

So “the magic” here is really the question: how do you extract a vector of played keys from raw audio data? The answer, in short, is Deep Learning.
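To make that pipeline concrete, here is a minimal sketch of the per-frame flow. Everything in it is illustrative: the frame size, the threshold, and the `model` callable merely stand in for MusicSense™, which is of course far more involved.

```python
import numpy as np

SAMPLE_RATE = 44100   # assumed mic sample rate
FRAME_SIZE = 4096     # assumed analysis window (~93 ms at 44.1 kHz)
NUM_KEYS = 88         # piano keys A0..C8

def keys_from_frame(model, pcm_frame, threshold=0.5):
    """Run one PCM frame through a (hypothetical) recognition model and
    return a boolean vector marking which of the 88 keys were active."""
    # `model` stands in for the real engine; any featurization (e.g. a
    # spectrogram) would happen inside it or just before this call.
    activations = model(pcm_frame.astype(np.float32))  # shape: (NUM_KEYS,)
    return activations > threshold

def score_frame(active_keys, expected_keys):
    """Compare the recognized keys against what the lesson expected,
    the kind of comparison the game engine makes every frame."""
    hits   = int(np.sum(active_keys & expected_keys))
    misses = int(np.sum(~active_keys & expected_keys))
    extras = int(np.sum(active_keys & ~expected_keys))
    return hits, misses, extras
```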

Why is Deep Learning (or any ML method) needed?

Since there are many standard Digital Signal Processing (DSP) algorithms out there that make seemingly simple operations like frequency detection easy, I am often asked why we don’t just combine a bunch of them and extract the frequencies of the notes being played.

This post isn’t meant to assume prior DSP knowledge or to serve as an intro to the field, but to put it in simple terms, there are four main reasons why our challenge isn’t trivial and needs more advanced techniques:

  • The piano is a harmonic instrument. Every single note is made up of many active frequencies (as opposed to a human whistle, for example), so it is sometimes very hard to differentiate between similar notes, like a C in different octaves (the short sketch below makes this concrete).
  • The piano is polyphonic, i.e. you can play many notes at once. This makes the problem above exponentially harder if you want to know exactly which notes were active. Simply Piano courses require you to play chords of up to 5 notes at the same time, which, as you can imagine, can be a big challenge to recognize.
  • The stream of audio we deal with isn’t exactly recorded under perfect laboratory conditions. We receive a recording of users playing the piano in a lot of different types of rooms and with different types of background noise, like air conditioning, TV, etc.
    This adds a lot of… noise… to the input signal.
  • To add to the challenge even further, in Simply Piano the users play songs along with high-quality production background music. This means that unless you use headphones, the microphone picks up the background music coming from the speaker, which makes it much harder to isolate the piano playing in the input signal.
An example spectrogram of an Android recording of “Wake Me Up” in Simply Piano. Look at all the frequencies!

All of the above should explain why this is a real challenge and why we need the big guns to solve it: common signal-processing algorithms are simply not enough.
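To make the harmonics point from the list above concrete, here is a tiny, self-contained experiment (plain NumPy, nothing from our actual engine). It synthesizes a crude stand-in for a C4 piano note as a fundamental plus decaying overtones and inspects the FFT; the overtones land on the pitches of other keys, which is exactly why naive frequency peak-picking can’t tell one note from a chord.

```python
import numpy as np

SAMPLE_RATE = 44100

def fake_piano_note(f0, duration=0.5, num_harmonics=6):
    """Crude stand-in for a piano note: a fundamental plus decaying
    harmonics. A real piano is far messier, but it makes the point."""
    t = np.arange(int(SAMPLE_RATE * duration)) / SAMPLE_RATE
    return sum((0.6 ** k) * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k in range(num_harmonics))

note = fake_piano_note(261.63)                     # C4
spectrum = np.abs(np.fft.rfft(note * np.hanning(len(note))))
freqs = np.fft.rfftfreq(len(note), d=1.0 / SAMPLE_RATE)

for k in range(1, 5):
    f = 261.63 * k                                 # k-th harmonic of C4
    bin_idx = int(round(f / (freqs[1] - freqs[0])))
    print(f"{f:7.1f} Hz -> magnitude {spectrum[bin_idx]:.0f}")
# All four harmonics carry substantial energy: ~523 Hz is the pitch of C5
# and ~785 Hz is close to G5, so a naive peak-picker "sees" notes the user
# never played.
```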

On-Device Inference

As if all the challenges mentioned so far weren’t enough, remember that we want to give feedback instantly, which means inference has to run with very low latency.

So, while we can train our models offline with all the processing power necessary, inference has to run on-device to avoid unreliable network latency.

Other benefits of on-device inference (apart from low latency) include:

  • Offline availability: It would be a shame if you couldn’t play a Simply Piano level when you’re away from internet access. Well, because we don’t need a server to run the models, you can! (as long as the background music track was pre-downloaded).
  • Cheap processing power: Running inference for our millions of users in the cloud would have been quite costly. Using the devices’ increasingly powerful processors to achieve the same goal is a huge money saver for the company.
  • Data Privacy: By running on-device, we avoid the need to send all of our users’ stream of audio to the cloud in order to give them feedback. A very important concern nowadays.

The downside, of course, is limited processing power (especially on older devices) and having to restrict ourselves to very small models for speed and storage reasons.

On-Device Inference — What Frameworks?

If you’ve tuned in to any big developer conference recently, you’ve probably noticed that on-device ML has become very hot, with lots of relevant frameworks like CoreML, ML Kit, or TensorFlow Lite.

JoyTunes’ first piano app launched in 2012 on iOS, well before the current hype. It already shipped with a very early version of our piano recognition engine, implemented using Accelerate.framework.

Simply Piano’s current iOS model is much more advanced than the one from 2012, but much of the original code is still relevant. Also, since iOS devices are quite powerful and not very fragmented, we were able to run complex deep learning models on them quite well even before optimized frameworks like CoreML launched, so we still use our own Accelerate.framework-based implementation (though we are in the process of replacing it).

In the Android version, however, which launched in early 2017, this wasn’t the case. We had our own Java/C++ implementation of a small model, and we had a very hard time reaching good enough performance and recognition across all the fragmentation (we are a small team, and optimizing for many different device types isn’t our strong suit). We also found it very hard to try out new types of models to see whether they improved recognition, as our implementation was tailor-made for a specific model.

In our research and training environment we use TensorFlow. Back then, TF-Lite wasn’t available yet and TF-mobile was very young and not at all trivial to integrate. Still, integrating it was the logical thing to do. I ended up writing a (pretty popular, but now obsolete) post about how I did it:

Results were good: Integrating TF immediately shortened the feedback cycle and allowed us to try out new TF models to improve recognition.
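For anyone curious what that kind of integration involves on the training side, here is a minimal sketch of the TF 1.x “freeze” step that TF-mobile deployments typically required: baking the trained variables into constants so a single .pb file can be bundled with the app. The checkpoint path and the "key_activations" output node name are placeholders, not JoyTunes’ actual pipeline.

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

with tf.Session() as sess:
    # Restore a trained model from a checkpoint (paths are illustrative).
    saver = tf.train.import_meta_graph("model.ckpt.meta")
    saver.restore(sess, "model.ckpt")
    # Bake variables into constants so the graph is self-contained.
    frozen_graph = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(),
        output_node_names=["key_activations"])

with tf.gfile.GFile("musicsense_frozen.pb", "wb") as f:
    f.write(frozen_graph.SerializeToString())
```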

Performance-wise, we initially hit a serious issue in TensorFlow (apparently FloatBuffer on Android is very slow). We ended up implementing an internal workaround, using the C++ TensorFlow API via JNI instead of the Java one.
Eventually, we reached 4x faster inference time compared to our manual implementation of the small model. Great success 🚀

Here are slides from a relevant talk I gave at Droidcon TLV:

The Future

MusicSense™ has already achieved unbelievable results in piano recognition under challenging conditions, but it’s still far from perfect. We are constantly working on identifying the models’ weak spots and trying to come up with better ones.

One front we are tackling is the fact that our models are not yet compatible with CoreML and TensorFlow Lite. There’s reason to believe that making them compatible would gain us a lot (performance-wise and storage-wise) and potentially let us run much better models. However, since most on-device models deal with images (and not audio like ours), there are a lot of incompatibility issues we’re the first in the world to tackle, so it’s not a trivial effort.
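For a sense of what the TensorFlow Lite side of that effort looks like, here is a minimal conversion sketch, assuming the model were exported as a SavedModel (the path is illustrative). In practice, audio front-end ops are where the incompatibilities tend to show up, and the select-TF-ops fallback shown here is a common escape hatch rather than a full solution.

```python
import tensorflow as tf

# Convert a (hypothetical) SavedModel export of the recognition model
# to TensorFlow Lite.
converter = tf.lite.TFLiteConverter.from_saved_model("musicsense_saved_model")
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,   # prefer native TFLite kernels
    tf.lite.OpsSet.SELECT_TF_OPS,     # fall back to TF ops where needed
]
tflite_model = converter.convert()

with open("musicsense.tflite", "wb") as f:
    f.write(tflite_model)
```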

That’s it for this part. Be sure to go back to read Part 1 if you haven’t already, and stay tuned for the next posts in the series, about how we localize, experiment with features, and more.

I would love your questions and feedback! You can comment here or find me on Twitter. If you want to be part of making this magic happen, check out our job openings.
