Published in Autonomous Agents

Peculiar Story of a Music Neural Engine

Why is Kena’s Artificial Intelligence the most powerful and accurate Music Neural Engine? The answer lies in ignoring industry standards and starting with a fresh perspective.

When I started Kena, people said, “If you do not understand music theory, you will fail to solve the problem.” Much of the existing Machine Learning community also pooh-poohed the idea that music feedback built with Neural Networks could ever beat HMMs plus hand-stitched creative rules in simplicity and compositional accuracy. (This was in 2019.)

I ignored all of it. Kena’s AI platform is now 96% accurate. Here is a Demo.

How did we do this?

While I acknowledged every aspect of the feedback, I simply wanted to see why we couldn’t borrow ideas from self-supervised and multi-task learning systems in linguistics and bring them into Music. And why couldn’t we borrow ideas of representational learning from vision computing? (I was working in Vision and Linguistics before jumping into Acoustics.)

If you squint a little, the sequence learning aspects of music are similar to language models. And if you turn your head a little, the instance segmentation of melodies in spectral densities is similar to vision computing.

The latent space was similar in my mind. I was not sure why the “Fourier analysis” crowd was chasing me with bricks and bats in the chat groups :) Just kidding, ML engineers are the kindest. If there is one tight-knit community in any industry, it is the engineering community. The code runs thicker than blood in these communal veins.

I was indeed a newbie to music analysis and acoustical computing. This was an advantage! I had nothing to “unlearn,” only a fantastic field of novel ideas to try. Well, that is not completely true. I had to learn a lot of spectral analysis to bring sound into the vision domain, and a lot of noise-elimination techniques in the auditory spectrum. But you get the point.

I ignored Hidden Markov Models entirely because they would have required me to learn music theory to shape the state machines. I ignored them not because I didn’t want to learn music theory, but because I believed that hand-shaping music theory was the wrong architectural choice for a machine-learning design of something as complex as Music.

I ignored dimensionality reduction and the hand-stitching of lower-order dimensions into MIDI generation. I ignored dynamic time warping and Viterbi decoding early in the pipeline. I threw them all out and started with a self-learning system first.

Given the past success of applying deep learning to existing problems, I was looking for a self-supervised mechanism to train the models with deep learning. I stumbled upon an excellent paper by the Google Brain team, who were working on a wave-to-MIDI-to-wave autoencoder (Onsets and Frames: Dual-Objective Piano Transcription).

Dual Objective Auto Encoder design

Voila, this architecture was beautiful and was built to train on onset loss and frame loss. Still, the MIDI it generated was super noisy, very piano-specific, and could not easily be used for sheet-music translation or diagnostics of musical frames.
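The dual-objective idea is simply two binary cross-entropy terms — one over note onsets, one over sustained frames — summed into a single training loss. A minimal numpy sketch (the 2.0 onset weight and the array shapes are illustrative, not taken from the paper):

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Element-wise BCE averaged over all (time, pitch) cells."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def dual_objective_loss(onset_pred, onset_true, frame_pred, frame_true, onset_weight=2.0):
    """Combine onset loss and frame loss; onsets are weighted higher here
    because getting note starts right matters most for transcription."""
    return onset_weight * binary_cross_entropy(onset_pred, onset_true) \
           + binary_cross_entropy(frame_pred, frame_true)
```

Each prediction/target pair is a (time × pitch) grid of probabilities; the model is pushed to nail both where notes begin and how long they are held.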

Nevertheless, the architectural idea was inspiring. I built a VQ-VAE (Vector Quantized Variational Autoencoder) based on the Onsets and Frames design, with the following details:

  1. (I will point you to Kena’s first secret.) It is in the VQ compression of the Mel Spectrogram ;)
  2. Train the models on Guitar as well, not just Piano.
  3. Focus two-tower “multi-task” training on a smaller dataset of cleaner MIDI files from sheet music, steering the errors toward specificity rather than sensitivity.
  4. Retrain the entire system to push the error profile further toward specificity over sensitivity.
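To picture what vector-quantizing mel frames looks like, here is a toy nearest-neighbour lookup — each spectrogram frame is snapped to its closest codebook vector. (A real VQ-VAE learns the codebook jointly with the encoder; this sketch just shows the quantization step.)

```python
import numpy as np

def vector_quantize(mel_frames, codebook):
    """Map each mel-spectrogram frame to its nearest codebook vector.
    mel_frames: (T, D) array of frames; codebook: (K, D) code vectors."""
    # squared distance between every frame and every code vector
    d = ((mel_frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = d.argmin(axis=1)      # one discrete code index per frame
    quantized = codebook[codes]   # the compressed representation
    return codes, quantized
```

The discrete codes are what make the downstream sequence sparse and tractable — a handful of symbols per frame instead of a dense spectrum.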

Sensitivity focuses on outputting a MIDI that sounds very close to the analog recording when played back. The problem is that you end up with a very dense MIDI file that is painful at best and useless at worst when you are training a downstream system to generate a musical transcript.

Designing the multi-task loss functions toward specificity, and curating the validation sets during training, is where most of Kena’s magic sauce lives in the Music Neural Engine.
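The post doesn’t spell out the loss, but one common way to bias a transcription model toward specificity is to weight false positives (spurious notes) more heavily inside the cross-entropy. A hypothetical sketch — the `fp_weight` value is made up:

```python
import numpy as np

def specificity_weighted_bce(pred, target, fp_weight=4.0, eps=1e-7):
    """BCE that penalizes false positives (spurious notes) more than
    false negatives, pushing the model toward a sparse, specific MIDI."""
    pred = np.clip(pred, eps, 1 - eps)
    pos = target * np.log(pred)                        # missed-note term
    neg = fp_weight * (1 - target) * np.log(1 - pred)  # spurious-note term
    return float(-np.mean(pos + neg))
```

Under this loss, hallucinating a note costs four times as much as missing one, so the model learns to stay quiet unless it is sure.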

With this, I could achieve the following:

  • A transcription accuracy of nearly 87%!! This was already miles ahead of the best-in-class HMM-based transcriptions.
  • The MIDI was sparse yet almost identical to the analog recording, without losing quality.
  • Vector quantization retained time signatures and keys.

From here, I started focusing on practice recordings and feeding them to Kena’s Music Neural Engine. Before a practice recording reaches the engine, an end-to-end pipeline eliminates noise and amplifies the signal. Feeding these amplified signals lets the Music Neural Engine spit out clean intermediary encodings, from which I then generate a MIDI file.
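The actual pipeline isn’t described, but a toy version of that noise-out/signal-up step might look like a spectral gate: zero out magnitude bins below an estimated noise floor, then boost what remains. (The threshold and boost values here are illustrative, not Kena’s.)

```python
import numpy as np

def spectral_gate(spec, noise_floor_db=-40.0, boost=2.0):
    """Toy noise gate on a magnitude spectrogram: silence bins more than
    noise_floor_db below the peak, then amplify the surviving signal."""
    mag_db = 20 * np.log10(np.maximum(spec, 1e-10))
    floor = mag_db.max() + noise_floor_db   # threshold relative to peak
    return np.where(mag_db >= floor, spec * boost, 0.0)
```

Production systems use far more sophisticated denoising, but the principle — suppress the floor, amplify the content — is the same.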

The beauty is that the VQ-VAE works cleanly across 40 different instruments and 6 different genres.

Generating this MIDI is where 70% of the magic lies. I coded the entire model up to here without understanding anything in music theory (people ask if I still code 🤷‍♂️). That was the beauty: I didn’t have to learn music theory. I built a model that learned music theory on my behalf!

The remaining 30% lies in downstream pipelines that polish the transcripts for keys and time signatures. This 30% is the last-mile veneer that requires music-theory knowledge and an understanding of the statistical footprints of music.
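The post doesn’t say how keys are detected, but a classic “statistical footprint” approach is to correlate the transcript’s pitch-class histogram against the Krumhansl–Schmuckler key profiles. A major-keys-only sketch (the minor profiles and any tie-breaking are omitted):

```python
import numpy as np

# Krumhansl–Schmuckler major-key profile: tonal weight per pitch class
MAJOR_PROFILE = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                          2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

def estimate_major_key(midi_pitches):
    """Guess the major key of a note list by correlating its pitch-class
    histogram against all 12 rotations of the K-S profile."""
    hist = np.bincount(np.asarray(midi_pitches) % 12, minlength=12).astype(float)
    scores = [np.corrcoef(hist, np.roll(MAJOR_PROFILE, k))[0, 1] for k in range(12)]
    return int(np.argmax(scores))   # 0 = C major, 2 = D major, ...
```

Feed it the MIDI note numbers of a passage and the best-correlating rotation names the likely tonic — a purely statistical stand-in for a human ear.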

Enter Mikey

Luckily, I found a professional Jazz musician and a passionate Machine Learning engineer, Mikey (Michael Schwartz). After giving him a homework interview, I hired him immediately as a founding Machine Learning engineer. Boy, has he delivered ever since. Hands down.

(He is also demoing the power of Kena’s Artificial Intelligence in the video.)

Mikey built a pipeline that picks up after the Music Neural Engine spits out a clean MIDI. Specifically, his pipelines and models do the following:

  1. Generate a MIDI output from any sheet music uploaded by the creator.
  2. Take the MIDI output from the Music Neural Engine (which is only about 87% accurate across 40 instruments and 6 genres) and compare the two to match notes and melodic lines.
  3. Build templates that provide human-like feedback on errors.
  4. Build an error-markup file for visual markups in the sheet music.

Do not get me wrong when I say this is 30% of the magic. Sequence-to-sequence alignment is not an easy task. It requires managing several intricacies:

  1. Check the speed of the practice, and apply dynamic time warping to normalize the practice and target files.
  2. Perform longest-common-subsequence alignment to find where in the sheet music the practitioner started playing.
  3. Check which sections the practitioner skipped and which sections they improvised (not present in the sheet music).
  4. Check for freestyle (rubato) rhythms and melodic time.
  5. Check for additional trills, vibratos, and other hairy dimensions of music.
  6. Check for keys and transpositions.
  7. And develop a template to give feedback.
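The first two steps above — tempo normalization via dynamic time warping and finding how much of the score the practice covers — can be sketched with the textbook algorithms. (Pitch sequences are plain MIDI numbers here; the real pipeline surely works on richer features.)

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two pitch sequences,
    so a slow and a fast rendition of the same line compare as equal."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def lcs_length(a, b):
    """Longest common subsequence of two note sequences — how much of the
    sheet music the practice recording actually covers, in order."""
    n, m = len(a), len(b)
    L = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                L[i, j] = L[i - 1, j - 1] + 1
            else:
                L[i, j] = max(L[i - 1, j], L[i, j - 1])
    return int(L[n, m])
```

A time-stretched but note-perfect practice gets a DTW distance of zero, while the LCS against the target MIDI reveals skipped or improvised sections as the notes that fall outside the common subsequence.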

This is where we could achieve 96% accuracy (from the 87% that the Music Neural Engine generates).

Separately, Mikey also built a fantastic Sheet Music simplification model that takes any complex sheet music and simplifies it to multiple levels.

Any self-respecting Machine Learning engineer knows that 80% of the effort is in improving ML models from 85% accuracy to 95% accuracy. Shaving every 1% after that is a herculean task.

I am so proud of Mikey for being part of the founding team at Kena. He is a powerhouse. If there is one thing I have been successful at in my entire leadership career, it is having an eye for exceptional talent, empowering them to achieve seemingly insurmountable outcomes, and standing by to coach when needed.

Together, the Music Neural Engine and the downstream ML pipelines are where Kena’s power comes from. Nothing else in the industry comes close to the accuracy, specificity, or feedback power of Kena’s AI platform.

It is super easy to test this claim. Play with our AI on our platform at

Let us know what you think.


