ByteDance High-Resolution AMT System Achieves SOTA in Piano Note and Pedal Transcription

Published in SyncedReview · 4 min read · Oct 12, 2020

Automatic music transcription (AMT) is the task of transcribing raw audio recordings into symbolic representations such as the Musical Instrument Digital Interface (MIDI) technical standard. The field presents a variety of research challenges in signal processing and AI, as music signals often contain multiple sound sources correlated over time and frequency. In recent years the use of neural network based approaches has increased. These can simultaneously detect several kinds of musical information, such as pitches and note onsets and offsets, and have delivered state-of-the-art (SOTA) results in AMT tasks.

AMT for piano music remains notoriously tricky because of the highly polyphonic nature of the instrument. In the recent paper High-Resolution Piano Transcription with Pedals by Regressing Onsets and Offsets Times, researchers from TikTok developer ByteDance introduce a high-resolution piano transcription system trained by regressing the precise onset and offset times of piano notes and pedals. The approach outperforms Google’s Onsets and Frames system to set a new SOTA for piano note transcription.

Previous piano transcription systems typically split audio recordings into frames and used discriminative models to predict, frame by frame, the presence or absence of onsets and offsets. This restricted transcription resolution to the frame hop size. Moreover, any misalignment in the onset or offset labels of the audio recordings made it difficult to precisely detect onset or offset times.
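
To make the resolution limit concrete, here is a minimal Python sketch of what happens when an onset can only be reported at a frame position. The 10 ms hop size, the example onset times and the function name `quantize_to_frame` are assumptions chosen for illustration; they are not taken from the paper or from any particular system.

```python
def quantize_to_frame(onset_sec, hop_sec):
    """Snap an onset time to the nearest frame position.

    A frame-wise detector can only report that an onset happened "in frame k",
    so the best time it can output is k * hop_sec.
    """
    frame_index = round(onset_sec / hop_sec)
    return frame_index * hop_sec


hop_sec = 0.010  # assumed hop size of 10 ms (illustrative only)
true_onsets = [0.1234, 0.5078, 1.0021]  # made-up onset times in seconds

for t in true_onsets:
    q = quantize_to_frame(t, hop_sec)
    print(f"true={t:.4f}s  frame-wise={q:.4f}s  error={abs(t - q) * 1000:.2f} ms")
```

Under these assumptions the reported time can be off by up to half a hop (5 ms here), regardless of how accurate the classifier itself is.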

The researchers also note that even though sustain pedals play an essential part in pianos’ musical expression, current AMT systems do not typically perform pedal transcription.

Rather than classifying the presence probability of onsets and offsets in each frame, as previous systems have, the proposed approach trains a note transcription system and a pedal transcription system on regression targets, then applies an analytical algorithm to predict the continuous onset and offset times of all note and pedal events.
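
The following Python sketch shows one way a regression target combined with an analytical step can recover a time that falls between frames. It assumes a tent-shaped target that equals 1 at the onset and falls linearly to 0 a fixed distance away; the names `tent_target` and `refine_peak`, the 10 ms hop and the 50 ms half-width are all hypothetical. It is an illustration of the general idea, not the authors' implementation, and the paper's exact formulation may differ.

```python
def tent_target(frame_time, onset_time, half_width):
    """Assumed regression target: 1 at the onset, falling linearly to 0 at +/- half_width."""
    return max(0.0, 1.0 - abs(frame_time - onset_time) / half_width)


def refine_peak(x_b, hop, y_a, y_b, y_c):
    """Estimate the continuous peak time from a local-maximum frame B and its neighbours.

    A, B, C are three consecutive frames with values y_a, y_b, y_c, and B is the
    largest. For a symmetric tent, the two frames on the same side of the peak give
    the slope, and the imbalance between A and C gives the shift away from B.
    """
    if y_c > y_a:   # peak lies between B and C
        return x_b + 0.5 * hop * (y_c - y_a) / (y_b - y_a)
    if y_a > y_c:   # peak lies between A and B
        return x_b - 0.5 * hop * (y_a - y_c) / (y_b - y_c)
    return x_b      # neighbours are equal, so the peak sits exactly on B


hop = 0.010          # assumed 10 ms hop
half_width = 0.050   # assumed target half-width of 50 ms
true_onset = 0.1234  # made-up onset time in seconds

frames = [i * hop for i in range(30)]
values = [tent_target(t, true_onset, half_width) for t in frames]

# Pick the frame with the largest value (excluding the edges), then refine it.
b = max(range(1, len(frames) - 1), key=values.__getitem__)
estimate = refine_peak(frames[b], hop, values[b - 1], values[b], values[b + 1])
print(f"nearest frame={frames[b]:.4f}s  refined={estimate:.4f}s  true={true_onset:.4f}s")
```

In this idealised setting, where the values are exact samples of the tent, the refined estimate matches the true onset up to floating-point error. A trained network's outputs would be noisy, so real estimates would only approximate it.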

On the large-scale MAESTRO dataset of paired audio recordings and high-precision MIDI files, the system achieved an onset F1 score of 96.72 percent, outperforming Google’s previous SOTA Onsets and Frames system (94.8 percent). In the first sustain pedal transcription evaluation on the MAESTRO dataset, the system set the benchmark with a pedal onset F1 score of 91.86 percent.
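
For readers unfamiliar with the metric, the Python sketch below shows how an onset F1 score can be computed from reference and estimated onset times. It assumes a 50 ms matching tolerance, which is a common convention in transcription evaluation but is not stated in this article, and it uses a simple greedy one-to-one matching on a single pitch. The function name `onset_f1` and the example times are made up; a real evaluation would also match pitches and would typically rely on an established evaluation library.

```python
def onset_f1(reference, estimated, tolerance=0.05):
    """Onset F1 for one pitch, using greedy one-to-one matching within `tolerance` seconds."""
    ref = sorted(reference)
    est = sorted(estimated)
    matched = 0
    i = j = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            matched += 1   # count a hit and consume both onsets
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1         # estimated onset has no partner: false positive
        else:
            i += 1         # reference onset has no partner: miss
    precision = matched / len(est) if est else 0.0
    recall = matched / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reference = [0.10, 0.50, 1.00, 1.50]   # made-up ground-truth onsets (seconds)
estimated = [0.11, 0.52, 1.20]         # made-up system output (seconds)
score = onset_f1(reference, estimated)
print(f"onset F1 = {score:.4f}")
```

Here two of the three estimates fall within the tolerance of a reference onset, giving a precision of 2/3, a recall of 2/4 and an F1 of 4/7.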

Experimental results also showed the pedal transcription system performing well on five-second audio clips. The team says it intends to extend its new approach to the transcription of other instruments. Some speculate TikTok owner ByteDance might use the research to develop new music sources and creative possibilities for its popular short video platform.

The paper High-Resolution Piano Transcription with Pedals by Regressing Onsets and Offsets Times is on arXiv, and the source code is on GitHub.

Reporter: Fangyu Cai | Editor: Michael Sarazen


Written by Synced