Audio Transcription with AI

Jun 1, 2023

With the latest generation of neural networks, results can be achieved that go far beyond what previous artificial intelligence methods were capable of. Prominent applications include DeepL for language translation, ChatGPT for dialog, DALL·E for image creation from text, and AlphaZero/AlphaGo for board games, to mention just a few. What makes neural networks special is that they are not programmed, but learn from examples and can continue to improve over time.

This article demonstrates how AI has improved audio transcription in capella audio2score, a program distributed by capella-software. capella audio2score analyzes recordings, such as MP3 files, and identifies tones to create a new arrangement. Users can choose an arrangement for piano, quartet, orchestra, and more. The results can be exported to MIDI, capella, or MusicXML format, as well as PDF.

Intelligent listening

capella audio2score pro 4 applies AI technology to audio recognition, achieving a real breakthrough in audio transcription. Unlike conventional software, it recognizes individual timbres, such as winds, strings, or piano/harpsichord/guitar, and can separate the instrument groups in a recording. For pure piano music, it employs a specialized neural network that recognizes piano tones particularly well.

Here is an example that illustrates the progress: the third movement, “Alla Turca”, from Mozart’s Piano Sonata K. 331. First, the original recording:

Mozart: Turkish March (recording)

And here is the result of AI recognition using capella audio2score, without any post-processing:

Mozart: Turkish March (audio transcription)

See here for more examples.

How do neural networks actually work?

Neural networks are inspired by how the brain works. They consist of neurons that are connected to each other via weighted connections, and are organized into layers including an input and output layer. In the case of note recognition, the input is an excerpt from the audio recording, and the output is the corresponding note representation.
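
The layered structure described above can be sketched in a few lines of Python. This is only an illustration, not the network used in capella audio2score: the layer sizes, the sigmoid activation, the random weights, and all names (`dense`, `forward`, `random_params`) are assumptions chosen to keep the example small.

```python
import math
import random

def dense(inputs, weights, biases):
    """One layer: each neuron sums its weighted inputs and adds a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def sigmoid(values):
    """Squash each value into the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def forward(frame, params):
    """Pass one audio excerpt through a hidden layer and an output layer."""
    hidden = sigmoid(dense(frame, params["w1"], params["b1"]))
    return sigmoid(dense(hidden, params["w2"], params["b2"]))

def random_params(n_in, n_hidden, n_out, seed=0):
    """Untrained weights: small random numbers, as before any learning."""
    rng = random.Random(seed)
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"w1": matrix(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "w2": matrix(n_out, n_hidden), "b2": [0.0] * n_out}

# Assumed sizes: 16 input values per excerpt, 8 hidden neurons,
# 12 outputs (one activation per pitch class).
params = random_params(16, 8, 12)
frame = [math.sin(0.3 * i) for i in range(16)]  # stand-in for real audio data
note_activations = forward(frame, params)
```

With untrained weights the 12 output activations are meaningless; they only become a usable note representation once the weights have been adjusted during learning.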

Neural networks learn a task by adjusting their connection weights based on examples until a given input produces the desired output. Unlike in a conventional computer program, the learned information is not stored in one specific place, but is distributed across the network’s weights. This is what makes neural networks unique.

After successful learning, neural networks can provide meaningful outputs for audio examples not used during training.
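
Learning from examples, and then handling inputs that were not among them, can be illustrated with a single artificial neuron. The sketch below is a deliberately tiny stand-in for a real network: the two input values, the toy rule "a note is present only if both inputs are strong", the learning rate, and the number of passes are all assumptions made for the illustration.

```python
import math

def predict(weights, bias, x):
    """Output of one sigmoid neuron for input x."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=2000, lr=0.5):
    """Nudge the weights after every example to reduce the output error."""
    weights, bias = [0.0] * len(examples[0][0]), 0.0
    for _ in range(epochs):
        for x, target in examples:
            error = predict(weights, bias, x) - target
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

# Toy training set: (two input strengths, 1 if the note is present else 0).
examples = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0), ([1.0, 1.0], 1)]
weights, bias = train(examples)

# Inputs that never appeared in the training set.
unseen_present = predict(weights, bias, [0.95, 0.95])
unseen_absent = predict(weights, bias, [0.05, 0.1])
```

Nothing in `train` states the rule explicitly. It ends up encoded in the values of `weights` and `bias`, which is why the neuron can also respond sensibly to the two unseen inputs.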

What makes modern neural networks special?

Neural networks have been around for over 70 years, so they are anything but new. In recent years, however, they have experienced a renewed upswing. One reason for this is significantly greater computing power. Improved hardware now makes it possible to perform many computing operations in parallel. This means that the adjustment of network weights during learning no longer takes place sequentially, but simultaneously, as in the human brain. This enables the training of significantly larger and more complex networks (Deep Learning).

The speciality of Deep Learning is that neural networks can independently extract so-called features from the learning examples. In image recognition, for example, features contain information about the texture and shape of the objects to be recognized. In audio recognition, these can be patterns of overtones (timbres). Features greatly enhance the learning process, but they are often difficult to put into words and are therefore hard for humans to interpret. This is both the curse and the blessing of this technique: its operation is difficult to understand, but this is what makes it so enormously powerful.
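
To make the idea of an overtone pattern concrete, the following sketch computes one by hand for two synthetic tones of the same pitch. A deep network finds its own features during training, so this is not what such a network does internally. It only shows the kind of information a timbre feature can carry. The sample rate, the harmonic amplitudes, and the function names are assumptions.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed)

def synth(fundamental, harmonic_amps, n_samples=800):
    """Sum of sine waves at integer multiples of the fundamental frequency."""
    return [sum(a * math.sin(2 * math.pi * fundamental * (k + 1) * t / SAMPLE_RATE)
                for k, a in enumerate(harmonic_amps))
            for t in range(n_samples)]

def harmonic_strength(signal, freq):
    """Amplitude of a single frequency component (one DFT bin)."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq * t / SAMPLE_RATE)
             for t, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * t / SAMPLE_RATE)
             for t, s in enumerate(signal))
    return 2 * math.hypot(re, im) / n

def overtone_pattern(signal, fundamental, n_harmonics=4):
    """Strength of the fundamental and its first overtones."""
    return [round(harmonic_strength(signal, fundamental * (k + 1)), 2)
            for k in range(n_harmonics)]

# Two made-up timbres playing the same note (A4, 440 Hz).
bright = synth(440, [1.0, 0.8, 0.6, 0.4])
mellow = synth(440, [1.0, 0.3, 0.1, 0.0])

bright_pattern = overtone_pattern(bright, 440)
mellow_pattern = overtone_pattern(mellow, 440)
```

Both tones have the same pitch, yet their patterns differ, and that difference is what separates one timbre from another.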

What’s next?

Of course, there is still plenty of room for improvement, particularly in the recognition of individual instruments, especially vocals and percussion. Recognizing vocal tones is especially challenging because the voice often modulates strongly. The same is partly true of instrumental playing with pronounced vibrato. Percussion sounds are currently not recognized, but are filtered out as noise.
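
A small calculation shows why a modulating voice is hard to pin down to a single note. The sketch below uses the standard conversion from frequency to MIDI note number (A4 = 440 Hz = note 69). The vibrato depth of 70 cents and the rate of 6 Hz are assumed values for illustration, not measurements.

```python
import math

def midi_note(freq_hz):
    """Nearest MIDI note number for a frequency (A4 = 440 Hz = note 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))

def vibrato_pitch(center_hz, depth_cents, rate_hz, t):
    """Frequency at time t of a tone whose pitch wobbles sinusoidally."""
    cents = depth_cents * math.sin(2 * math.pi * rate_hz * t)
    return center_hz * 2 ** (cents / 1200)

# One second of a held A4, sampled every 10 ms.
times = [i / 100 for i in range(100)]
steady = {midi_note(vibrato_pitch(440.0, 0, 6, t)) for t in times}
wobbly = {midi_note(vibrato_pitch(440.0, 70, 6, t)) for t in times}
```

Here `steady` contains only note 69, while `wobbly` also contains the neighbouring notes 68 and 70. A naive frame-by-frame reading would therefore turn one sung note into a rapid alternation of three.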

capella audio2score comes with a free trial version that allows testing of all the features of the full version for 15 days.

Written by Dominik Hörnel