CREPE: A Neural Network with Perfect Pitch

CDS Affiliated Faculty Member Juan P. Bello develops the highest-performing pitch tracking technique

Audio processing, speech processing, and music information retrieval all depend on pitch tracking, a task for which computational methods have been studied for more than fifty years. Many reliable techniques have been built, but even the current gold standard of pitch tracking — an algorithm called pYIN — is a heuristic approach susceptible to inaccuracy from rapid shifts in pitch or uncommon instruments.

Working with researchers Jong Wook Kim, Justin Salamon, and Peter Li, Juan P. Bello, Associate Professor of Music and Music Education, has developed a new data-driven approach to pitch tracking that outperforms pYIN across almost every metric. The new method, called CREPE (Convolutional Representation for Pitch Estimation), is based on a six-layer deep convolutional neural network that operates directly on the time-domain audio signal.
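Because the network consumes the raw waveform rather than a spectrogram, the main preprocessing step is simply cutting the signal into short, overlapping windows. The sketch below (not the authors' code) illustrates that framing step; the frame and hop lengths are illustrative placeholders, not values taken from the paper.

```python
def frame_signal(samples, frame_length=1024, hop_length=256):
    """Slice a time-domain signal into fixed-length, overlapping frames,
    the kind of input a convolutional pitch tracker consumes."""
    frames = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frames.append(samples[start:start + frame_length])
    return frames

# Example: a 4096-sample signal yields 13 overlapping 1024-sample frames.
signal = [0.0] * 4096
frames = frame_signal(signal)
```

Each frame is then classified by the network into a fine-grained pitch estimate, so the hop length controls the time resolution of the resulting pitch track.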

To evaluate CREPE’s performance and compare it with other algorithms, the researchers use two datasets. The first consists of 6.16 hours of audio synthesized from the RWC Music Database (the same database used to evaluate pYIN); the second is a collection of 230 tracks with 25 instruments from MedleyDB, amounting to 15.56 hours of audio. Accuracy is measured with raw pitch accuracy (RPA), the proportion of frames for which the detected pitch falls within a quarter-tone (50 cents) of the ground truth, and raw chroma accuracy (RCA), which applies the same threshold but forgives octave errors.
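The two metrics differ only in how they treat octave mistakes, which a stdlib-only sketch makes concrete (this is an illustration of the standard definitions, not the researchers' evaluation code):

```python
import math

def cents_diff(f_est, f_true):
    """Pitch difference in cents (100 cents = one semitone)."""
    return 1200.0 * math.log2(f_est / f_true)

def raw_pitch_accuracy(est, true, threshold=50.0):
    """Fraction of frames whose estimate is within `threshold` cents
    of the ground truth (50 cents = a quarter-tone)."""
    hits = sum(1 for e, t in zip(est, true)
               if abs(cents_diff(e, t)) <= threshold)
    return hits / len(true)

def raw_chroma_accuracy(est, true, threshold=50.0):
    """Like RPA, but octave errors are forgiven: the cent difference
    is folded to the nearest octave before thresholding."""
    hits = 0
    for e, t in zip(est, true):
        d = abs(cents_diff(e, t)) % 1200.0
        if min(d, 1200.0 - d) <= threshold:
            hits += 1
    return hits / len(true)
```

For example, an estimate of 880 Hz against a 440 Hz ground truth is off by exactly one octave (1200 cents): it counts against RPA but not against RCA.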

CREPE’s accuracy is impressive — on the RWC dataset it achieved nearly 100% raw pitch accuracy, and on the MedleyDB dataset it exceeded 90% pitch accuracy even at a 10-cent threshold (a cent is a hundredth of a semitone, so 10 cents is a very fine tolerance). The researchers compared CREPE’s performance with that of pYIN and SWIPE (another high-performing pitch tracking algorithm) and found that CREPE outperformed both by over 8% at the 10-cent threshold.

Bello and collaborators also compared the three methods’ ability to track pitch in degraded audio. They simulated pub noise, white noise, pink noise, and brown noise (pink and brown noise concentrate their energy at lower frequencies). Except for brown noise, to which pYIN is especially well suited, CREPE generally tracks pitch in noisy audio better than pYIN or SWIPE.
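The noise colors differ in how power is distributed across frequency: white noise has a flat spectrum, while brown noise, which can be generated as a running sum of white noise (a random walk), rolls off at 6 dB per octave, putting most of its energy at low frequencies (pink noise, at 3 dB per octave, sits between the two). A minimal stdlib sketch of the white and brown cases, not the researchers' degradation pipeline:

```python
import random

def white_noise(n, seed=0):
    """Independent uniform samples: a flat power spectrum."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

def brown_noise(n, seed=0):
    """Running sum of white noise (a random walk): power falls off at
    6 dB/octave, concentrating energy at low frequencies."""
    rng = random.Random(seed)
    out, acc = [], 0.0
    for _ in range(n):
        acc += rng.uniform(-1.0, 1.0)
        out.append(acc)
    return out
```

That low-frequency concentration is why brown noise interacts differently with pitch trackers than the other noise types: it overlaps the frequency range where most fundamental frequencies live.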

CREPE’s performance is an encouraging development for pitch tracking, but Bello and his group hope to improve the model’s architecture to make it robust to distortion and reverberation. They also intend to improve CREPE’s pitch estimation by adding a recurrent architecture to increase temporal smoothness.

By Paul Oliver
