Musical Note Recognition Algorithms Need Tuning Too

How Cost Function Optimization Helped us Overcome Data Scarcity When Training Our Model

Rom Weissman
Simply
Jan 25, 2021



At Simply, our mission is to make learning a musical instrument available to every household. Our most popular app at the moment is Simply Piano, which guides millions of people toward their piano-playing goals. The app’s success depends on the ability of its note recognition engine to identify the notes learners play as they progress through the exercises, so that we can give them helpful feedback. This article discusses a challenge my team and I faced while building the note recognition engine with limited labeled data, and how we tackled it using simple math.

Data Scarcity Challenges in Engine Design

The engine was built using supervised classification machine learning algorithms, which means building it required a labeled dataset. That dataset consisted of a set of recordings from various pianos and a corresponding set of listings of all the notes played in each recording at each point in time.

We needed to collect many recordings of different types, played on different pianos in different acoustic settings and captured by different microphones, to make sure we covered the entire spectrum of sounds the engine can expect to encounter. We couldn’t use an existing engine to label the recordings for the training set automatically, because any training on those labels would only perpetuate that engine’s mistakes. That meant we had to use a labeling process that doesn’t scale well (for example, recording in specific setups where we can log the actual keys pressed, or having someone with absolute pitch label recordings manually). The result was a relatively small dataset, with a few hundred instances of each note, and a whole set of challenges.

A Highly Imbalanced Labeled Dataset

One of the biggest issues with our limited dataset is that the training data we collected behaves very differently from the data we expect our engine to encounter in-app.

For simplicity’s sake, let’s examine a model designed to identify a single note, say C3, which returns 1 if the note is identified and 0 if it isn’t.

If we assume all notes are equally represented in our labeled dataset, each note appears in only about 5% of our labeled data (1/36 of the data points, plus appearances in polyphony). For our example C3 classifier, that means 95% of the labeled data points are negative and only 5% are positive. That is vastly different from the data we expect the engine to encounter in our app. In the app, the question at every moment is whether or not the learner played the particular note the app prompted them to play, and the probability of the learner playing the correct note is significantly greater than 5% (for most people).

This difference in balance can easily cause the resulting classifier to be too strict, since it would tend to assume a note is wrong whenever there is any ambiguity. One way to deal with this is to cherry-pick a subset of the labeled set whose balance more closely resembles the real-world case, but given the scarcity of labeled data mentioned above, we don’t have the luxury of discarding most of our negatively labeled data points.
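To see how misleading this can be, here is a minimal Python sketch with made-up numbers (the 5% and 80% rates from above, and a deliberately degenerate classifier, purely for illustration):

```python
import numpy as np

# Made-up data: ~5% positives, as in our labeled set,
# versus ~80% positives, as we expect for in-app prompts.
rng = np.random.default_rng(0)
labeled_set = rng.random(10_000) < 0.05
in_app_data = rng.random(10_000) < 0.80

def always_negative(labels):
    # The most extreme "strict" classifier: it never believes C3 was played.
    return np.zeros_like(labels, dtype=bool)

def accuracy(labels, guesses):
    return np.mean(labels == guesses)

print(accuracy(labeled_set, always_negative(labeled_set)))  # ~0.95
print(accuracy(in_app_data, always_negative(in_app_data)))  # ~0.20
```

A classifier that simply refuses to recognize C3 scores about 95% accuracy on data balanced like our labeled set, yet it would be wrong roughly 80% of the time on data balanced like the in-app case. A plain error count over the labeled set barely penalizes this kind of strictness, which is exactly what the cost function below is built to fix.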

Building the Right Cost Function

The cost function of a supervised classifier model is a function that assigns a cost to every set of guesses the model makes for the target variables on a labeled dataset. The closer the model’s guesses are to the actual labels, the lower the cost should be. In the training phase the model selects parameters that minimize the cost function over a given training set. This is where we dive into some math.

Let’s get back to our simplified model that identifies C3. We will also assume that our app only uses the recognition engine to test whether or not a learner is playing the note prompted by the app and doesn’t bother checking what other notes are being played.

Index of Variables

D — our labeled data set.

L(d) — the true label of data point d in D (either 1 or 0).

|D| — the number of labels in our labeled data.

|P| — the number of positive labels in our labeled data (number of instances of C3).

|N| — the number of negative labels in our labeled data (number of instances not including C3).

(|P| + |N| = |D|)

G — a set of engine guesses for dataset D.

G(d) — the engine’s guess for a data point d in D.

F(P) — the frequency of positives in the real-world data (how likely a learner is to play note C3 when prompted).

F(N) — the frequency of negatives in the real-world data (how likely a learner is to miss note C3 when prompted).

(F(P) + F(N) = 1)

|NGF| — the number of false negatives, i.e. the number of data points d where L(d) is positive but G(d) is negative.

|PGF| — the number of false positives, i.e. the number of data points d where L(d) is negative but G(d) is positive.
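Since all of these are simple counts over the labels and guesses, it may help to see them spelled out in code. Here is a small sketch; the assumption that labels and guesses arrive as parallel 0/1 arrays is mine, for illustration only:

```python
import numpy as np

def count_labels(labels, guesses):
    """Return |P|, |N|, |NGF| (false negatives) and |PGF| (false positives)."""
    labels = np.asarray(labels, dtype=bool)    # L(d) for every d in D
    guesses = np.asarray(guesses, dtype=bool)  # G(d) for every d in D

    P = int(labels.sum())                 # |P|: data points where C3 was played
    N = int((~labels).sum())              # |N|: data points where it wasn't
    NGF = int((labels & ~guesses).sum())  # |NGF|: C3 played, but the engine missed it
    PGF = int((~labels & guesses).sum())  # |PGF|: C3 not played, but the engine "heard" it
    return P, N, NGF, PGF
```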

Rebalancing the Data via a Weighted Cost Function

Let’s assume note C3 appears in 5% of our labeled data, as stated above. In other words:

|P| / |D| = 0.05

|N| / |D| = 0.95

On the other hand, in the app itself we only try to identify a C3 note when the app is currently prompting the learner to play that note. More often than not, the learner will play the correct note. If learners play C3 correctly 80% of the time they are prompted to, then:

F(P) = 0.8

F(N) = 0.2

Clearly the two cases skew very differently. We will use the cost function itself to rebalance the labeled data toward the in-app behavior.

Our cost function C will be:

C(G) = F(P) x (|NGF| / |P|) + F(N) x (|PGF| / |N|)

If we replace F(N) with (1 - F(P)) we get a function with one free parameter, F(P):

C(G) = F(P) x (|NGF| / |P|) + (1 - F(P)) x (|PGF| / |N|)

Note that in the case where F(P) = |P| / |D| (and therefore 1 - F(P) = |N| / |D|), the |P| and |N| terms cancel and the function reduces to the probability of the model making any kind of mistake: (|NGF| + |PGF|) / |D|.

This cost function optimizes for reducing the number of errors in the real world, regardless of how our labeled data is balanced between positives and negatives.
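To make this concrete, here is a sketch of the weighted cost in Python; the function name and the tiny example arrays are made up for illustration and are not our production code:

```python
import numpy as np

def weighted_cost(labels, guesses, f_p):
    """C(G) = F(P) * (|NGF| / |P|) + (1 - F(P)) * (|PGF| / |N|)."""
    labels = np.asarray(labels, dtype=bool)
    guesses = np.asarray(guesses, dtype=bool)
    P, N = labels.sum(), (~labels).sum()
    NGF = (labels & ~guesses).sum()   # false negatives
    PGF = (~labels & guesses).sum()   # false positives
    return f_p * (NGF / P) + (1 - f_p) * (PGF / N)

# Made-up example: 10 data points, 2 of them positive (C3 was played),
# with one false negative and one false positive.
labels  = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0], dtype=bool)
guesses = np.array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0], dtype=bool)

# With F(P) = |P| / |D| the weighted cost equals the plain error rate:
print(weighted_cost(labels, guesses, labels.mean()))  # 0.2, i.e. 2 errors / 10
# With the in-app prior F(P) = 0.8, false negatives weigh much more heavily:
print(weighted_cost(labels, guesses, 0.8))            # 0.425
```

In effect, each labeled example is weighted by how often its class shows up in-app rather than in the labeled set, so driving this cost down during training pushes the model toward fewer real-world mistakes.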

Where We Go from Here

F(P), the one free parameter of our cost function, has a real-world meaning: it is the prior probability that a learner plays the prompted note correctly (and 1 - F(P) is the probability of a mistake). It has a true value, and as time goes by we get better at estimating it through lab tests, statistical modeling, and so on.

On the other hand, by tweaking the parameter and running A/B tests we can optimize the value of F(P) against our KPIs and find which value works best, thus using the cost function itself to teach us more about our learners.

Optimizing the cost function is a way for us to teach our ML model to better serve our learners, but understanding the values of the optimized parameters is a way for the ML model to teach us more about our learners.

Read more about the day-to-day of a data person at Simply: “How to Build a Data-Driven Product Roadmap” and “Impact-Driven Data Ownership — The Generalist Approach”, and check out our open data roles.

Rom Weissman
Simply
Data scientist, writer, filmmaker, music enthusiast, etc.