How do computers hear music?

Chuck-jee Chau 周卓之
5 min read · Jun 27, 2022


For a computer to process music, we first have to digitize the music information. In graphics, digitization is the process by which a real-life image our eyes can see is captured by a camera into pixels, represented as a matrix of colour dots, to be displayed on a screen or stored as a file.

Sound comes from the vibration of particles in air or other media. When our ears capture the vibration energy, the brain hears something. Digitized sound is measured as samples, which record the vibration amplitude at particular moments. When we replay a sound file on the computer, vibrating the loudspeaker diaphragm with these amplitudes re-creates the sound.

Sound comes from vibration (Icons from icons8)

There are two general requirements for the vibration to be heard:

  1. Periodic: The vibration should be fairly repetitive, with a certain frequency. Frequency is usually measured in cycles per second; once per second is named 1 hertz, abbreviated as 1 Hz.
  2. Within the human audible range: Human ears can hardly perceive frequencies below 20 Hz as sound; they are instead felt as vibrations in the body ¹. Above around 15,000 Hz (15 kHz), frequency perception becomes less sensitive, and it declines further with age.

Music usually contains periodic sounds, with a not-so-short duration (at least around 0.1 second) and a definite frequency; these are referred to as pitched sounds. A higher frequency gives a higher pitch. The tuning pitch A440 in orchestras is the standard A note at 440 Hz. Let’s look only at pitched sounds first.

Audio Representation

The “purest” sound is a sine wave — we’ll look at more complex sounds later. Here is a 1-second 440 Hz pure sound, created with Python code. Play and hear it.

One second of 440 Hz pure tone

Python programming is popular these days! We will demo most of our code in Python.
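
The original embedded code is not shown here, so below is a minimal sketch of what it could look like, assuming NumPy and SciPy, with an assumed sample rate of 44,100 samples per second.

    import numpy as np
    from scipy.io import wavfile

    sr = 44100                       # assumed sample rate, in samples per second
    tmax = 1                         # duration in seconds
    f = 440                          # frequency of the pure tone, in Hz

    t = np.arange(0, tmax, 1 / sr)   # timeline from 0 to tmax in tiny increments
    x = np.sin(2 * np.pi * f * t)    # sine wave samples, ranging from -1 to 1

    wavfile.write("pure_tone.wav", sr, x.astype(np.float32))  # save for playback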

So what does this code do? Basically, t is a timeline, with values running from 0 to tmax (1) in very small increments. x is the value of the sine wave at each time t, calculated with the formula sin(2πft), ranging from -1 to 1. Here are the contents of the first 100 samples.
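
With the sketch above, peeking at those samples is as simple as:

    print(t[:100])   # the first 100 time points
    print(x[:100])   # the first 100 sample values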

Oopsie isn’t this too hard?? More on programming later…

Okay, how does it look? Here is the “sound” zoomed in, a lot… the chart shows only 0.01 s of the waveform.

Look closely into the 440 Hz pure tone…
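
To reproduce such a chart yourself, here is a sketch assuming matplotlib, reusing t, x, and sr from the earlier snippet:

    import matplotlib.pyplot as plt

    n = int(0.01 * sr)        # 0.01 s of samples: 441 at the assumed 44.1 kHz rate
    plt.plot(t[:n], x[:n])    # only the first 0.01 s of the waveform
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.show()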

Here is how the whole 1-second sound looks. It doesn’t look like the usual waveform seen in recording apps at all! Do not forget, this is a plain pure sound with no change in dynamics or pitch. Digitized sounds are really nothing more than lots of 0 → 1 → 0 → -1 → 0 → … (or occasionally something like 0–65535 in some integer representations).

One second of plain 440 Hz sound
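
As an aside, converting the float samples into one such integer representation might look like this; 16-bit integers are a common storage format (the scaling below is a sketch, not the only convention):

    pcm16 = (x * 32767).astype(np.int16)   # scale -1..1 floats to 16-bit integers
    print(pcm16[:10])                      # the first few samples, now as integers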

Symbolic Representation

When the complete sound is recorded, it is convenient for replaying. No matter what instruments are used, or what the acoustics are like, everything is faithfully represented, given that you have a reliable recording team. Yet much of the low-level data is simply too complex for musicians, giving rise to the symbolic representation of sounds: we only need information like pitch, duration, dynamics, and instrument.

Many electronic/digital instruments on the market support the Musical Instrument Digital Interface (MIDI) standard well ². It originated in 1982 as MIDI 1.0, and the release of MIDI 2.0 in 2020 created quite a buzz among (electronic) musicians.

The MIDI format hasn’t changed much since the ’80s, with basic messages such as “note on/off — note number — velocity”. The note number (pitch) is a 7-bit number, i.e. a value of 0–127, which covers every standard pitch in Western classical music. For example, MIDI number 60 represents the Middle C on the piano.

MIDI pitch numbers and the corresponding piano keys
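
The mapping from note number to frequency follows equal temperament, with A4 = MIDI 69 = 440 Hz. A tiny illustration, using a hypothetical helper name:

    def midi_to_freq(note):
        # Convert a MIDI note number to its frequency in Hz (equal temperament)
        return 440.0 * 2 ** ((note - 69) / 12)

    print(midi_to_freq(69))   # 440.0, the orchestral tuning A
    print(midi_to_freq(60))   # about 261.63 Hz, the Middle C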

Here the Python code uses MIDI note numbers to notate music. melody holds the pitches we want, and they are put one by one into s (the Stream). You can see the generated, admittedly boring, phrase of music with default duration and dynamics. Play and hear it!
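
The embedded notebook code is not shown here; below is a minimal sketch of how it might look with the music21 library (which the Stream object suggests), using a made-up melody:

    from music21 import note, stream

    melody = [60, 62, 64, 65, 67]    # a hypothetical phrase, as MIDI note numbers

    s = stream.Stream()
    for m in melody:
        n = note.Note()
        n.pitch.midi = m             # set the pitch from its MIDI number
        s.append(n)                  # default duration and dynamics

    s.show('midi')                   # play it back, e.g. in a notebook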

MIDI carries only simple musical commands and occupies very little storage on a computer. A decent MIDI sequencer with a good MIDI player would make the listening experience… acceptable. Before the wave of MP3 files brought by broadband Internet, MIDI files were what people had in their music collections on their home computers. At that time, Windows and Mac systems shipped with MIDI players for simple music playback.

Although MIDI files are no longer commonplace on home computers, the format is still widely used in music production software, such as the software instruments in Apple software, or score-making software like MuseScore. These tools often rely heavily on symbolically represented sounds for easy editing.

The same melody represented in “Symbolic Representation” (above) and “Audio Representation” (below)

Audio representation and symbolic representation are good for different purposes. While audio representation gives a faithful record, editing it is not trivial. Symbolic representation relies on the playback system to re-create the sound; it is great for music editing but not for non-instrumental sounds (e.g. the human voice!). Converting a symbolic file to audio is not difficult: just play it through. Yet the reverse is still a research problem in computer music, known as Automatic Music Transcription (AMT).
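
For instance, continuing the music21 sketch above, rendering the symbolic stream to a standard MIDI file is a one-liner; going the other way, from audio back to such a file, is the hard AMT direction:

    s.write('midi', fp='melody.mid')   # export the stream as a MIDI file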

Let’s talk about other interesting tidbits of the two in another article later. Try playing with these programming notebooks!

Deepnote Notebook:

Google Colab:


Chuck-jee Chau 周卓之

Lecturer/Musician in Hong Kong; created an “Intro to Computer Music” course at CUHK; appears as a collaborative pianist/percussionist in shows. A university lecturer committed to promoting computer music research to students.