Web Audio API and computer audio learnings
Intention of Article
This will serve as a note to my future self with all my learnings on computer audio and the Web Audio API. I will try to update the article whenever I learn something new.
I will try to keep it as a Q&A, so the relevant questions I asked myself are answered (by myself) here.
How the Web Audio API works
The Web Audio API constructs a graph of nodes: a source (a sound input such as a microphone, a web video/audio tag, or an audio stream) is connected through processing nodes (audio nodes) and finally reaches a destination (e.g. the speakers).
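Below is a minimal sketch of such a graph. The `<audio>` element id "player" and the gain value are just assumptions for illustration, not anything from a specific example:

```js
// Minimal sketch: build a graph from an <audio> element to the speakers.
const audioCtx = new AudioContext();
const audioElement = document.getElementById('player'); // hypothetical element id

// Source node (the sound input)
const source = audioCtx.createMediaElementSource(audioElement);

// A manipulation node (here a simple gain / volume control)
const gainNode = audioCtx.createGain();
gainNode.gain.value = 0.8;

// Wire the graph: source -> gain -> destination (speakers)
source.connect(gainNode);
gainNode.connect(audioCtx.destination);
```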
Questions related to the analyser
How does it work?
The function calls from the examples are straightforward: the upstream node passes the sound signal into the analyser node, and from the analyser we can get time-domain or frequency-domain data.
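As a rough sketch (reusing the `audioCtx` and `source` from the snippet above, with an fftSize of 2048 picked as an example):

```js
// Insert an AnalyserNode between the source and the destination,
// then read time-domain and frequency-domain data from it.
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 2048;

source.connect(analyser);
analyser.connect(audioCtx.destination);

const timeData = new Float32Array(analyser.fftSize);          // waveform samples
const freqData = new Uint8Array(analyser.frequencyBinCount);  // fftSize / 2 bins

function poll() {
  analyser.getFloatTimeDomainData(timeData);  // time-domain data
  analyser.getByteFrequencyData(freqData);    // frequency-domain data (0-255 per bin)
  requestAnimationFrame(poll);
}
poll();
```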
So what's fftSize?
FFT is short for Fast Fourier Transform, and the size controls how many output points we get. The documentation explains the effect/result only vaguely, as follows:
From my understanding, fftSize is the target bin size (or should it be the number of sinusoidal wave frequencies?) that the Fast Fourier Transform (FFT) uses when it is applied to the signal.
And according to the source code of the Chromium implementation, the actual core code is in realtime_analyser.cc; the getFloatFrequencyData and getByteFrequencyData APIs map to functions of the same name in that C++ file.
The main call is DoFFTAnalysis(), whose logic is roughly:
1. Copy data from the input buffer into a temporary buffer for the FFT
2. Call ApplyWindow(buffer, fft_size) # this transforms the data with a Blackman window to facilitate the FFT
3. Call DoFFT(...) on the buffer
4. Take the FFT output, combine the real and imaginary parts into a magnitude, and scale it (a rough sketch of this step follows below)
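This is not the actual Chromium code, just my reading of step 4 written out in JavaScript; the `real`/`imag` arrays and the normalization factor are assumptions:

```js
// Rough sketch of step 4: combine the real and imaginary FFT outputs into
// magnitudes, scale them, and convert to decibels.
// `real` and `imag` are assumed to come from some FFT routine of size fftSize.
function magnitudesToDb(real, imag, fftSize) {
  const scale = 1 / fftSize;              // assumed normalization factor
  const db = new Float32Array(real.length);
  for (let i = 0; i < real.length; i++) {
    const magnitude = Math.hypot(real[i], imag[i]) * scale;
    db[i] = 20 * Math.log10(magnitude);   // linear magnitude -> decibels
  }
  return db;
}
```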
So why is the final number of bins half of fftSize?
I believe it's due to the Nyquist limit: for a real input signal, the FFT output beyond 1/2 of the sampling frequency just mirrors the lower half and is redundant (this might also be related to the FFT treating the input samples as an infinitely repeating wave).
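A tiny (and deliberately naive) DFT in JavaScript can show this mirroring; the signal and sizes below are made up purely for the demo:

```js
// For a real-valued signal, the DFT magnitude of bin k equals that of
// bin N - k, so only the first N/2 bins carry new information.
function dftMagnitudes(signal) {
  const N = signal.length;
  const mags = new Array(N);
  for (let k = 0; k < N; k++) {
    let re = 0, im = 0;
    for (let n = 0; n < N; n++) {
      const angle = (-2 * Math.PI * k * n) / N;
      re += signal[n] * Math.cos(angle);
      im += signal[n] * Math.sin(angle);
    }
    mags[k] = Math.hypot(re, im);
  }
  return mags;
}

// A real signal: one sine wave sampled over 16 points.
const signal = Array.from({ length: 16 }, (_, n) => Math.sin((2 * Math.PI * 3 * n) / 16));
const mags = dftMagnitudes(signal);
console.log(mags[3].toFixed(3), mags[16 - 3].toFixed(3)); // same magnitude: bins mirror around N/2
```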
Why can't we see the actual FFT implementation in this code?
I figured out that the line that performs the actual FFT operation is platform-specific and relies on other libraries.
When we examine the folder, the different subfolders have their own implementation details for the analysis_frame (I "believe" the core implementation is pffft, but I am not 100% sure).
Below is the pffft implementation, and we can see that len = fft_size / 2 is implemented here:
What’s a Blackman window?
I believe these two videos (video1, video2) give a better explanation than I could. My limited understanding is that it helps the FFT reduce leakage (and produces a better result that captures the frequencies more precisely).
Note that the Blackman window is just one type of window among many.
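For a rough idea of what applying such a window looks like, here is a sketch using the textbook Blackman formula (Chromium's ApplyWindow may differ in detail):

```js
// Apply a classic Blackman window to a frame of samples.
function applyBlackmanWindow(frame) {
  const N = frame.length;
  const windowed = new Float32Array(N);
  for (let n = 0; n < N; n++) {
    const w = 0.42
      - 0.5 * Math.cos((2 * Math.PI * n) / (N - 1))
      + 0.08 * Math.cos((4 * Math.PI * n) / (N - 1));
    windowed[n] = frame[n] * w; // taper the samples toward zero at both ends
  }
  return windowed;
}
```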
What’s the difference between getByteFrequencyData() and getFloatFrequencyData()?
The byte function does the same operation as the float function, just adding a scaling step (to 0-255) before returning.
Here I have a question to myself: when I inspect the values from the float version, they are all negative decibels ranging from about -20 to -190, so how come the byte conversion works (given that min_decibels_ has a default value of -100 and max_decibels_ defaults to -30 <= these are written in realtime_analyser.h)?
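A sketch of the scaling as I understand it from the spec, which may partly explain this: values outside the [minDecibels, maxDecibels] range appear to simply be clamped to 0 or 255 (the defaults below are the ones mentioned above):

```js
// Map a decibel value linearly from [minDecibels, maxDecibels] onto [0, 255],
// clamping anything outside that range.
function dbToByte(db, minDecibels = -100, maxDecibels = -30) {
  const scaled = (255 / (maxDecibels - minDecibels)) * (db - minDecibels);
  return Math.max(0, Math.min(255, Math.floor(scaled)));
}

console.log(dbToByte(-190)); // 0   (below minDecibels, clamped)
console.log(dbToByte(-65));  // 127 (halfway between -100 and -30)
console.log(dbToByte(-20));  // 255 (above maxDecibels, clamped)
```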
How big is a bin returned by the analyser, i.e. what is its size / unit?
The number of bins, as mentioned above, is related to fftSize, so if we take 2048 as fftSize we get back 1024 (2048 / 2) numbers (in decibels or in bytes), and each of them corresponds to the amplitude of the signal within a particular range of frequencies, and that range is related to our sampling rate.
If we are sampling the audio at 44100 Hz, then each bin / data point represents the amplitude of the signal over a frequency range of width "sampling rate / fftSize".
In our example that is 44100 / 2048, around 21.5 Hz, so the first value in the returned array corresponds to the amplitude in the 0-21.5 Hz range, the second value to 21.5-43 Hz, and so on…
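In code, the mapping from bin index to frequency range looks roughly like this (numbers taken from the example above):

```js
// Map a bin index to its frequency range, given sample rate and fftSize.
const sampleRate = 44100;  // Hz
const fftSize = 2048;
const binWidth = sampleRate / fftSize;  // ~21.5 Hz per bin

function binRange(i) {
  return [i * binWidth, (i + 1) * binWidth];  // [low, high] in Hz
}

console.log(binRange(0)); // [0, ~21.5]   -> first value in the returned array
console.log(binRange(1)); // [~21.5, ~43] -> second value
```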
And for how much time are we measuring / sampling the signal?
For this one I am not 100% sure, but as far as I understand, it's related to the frame size and the sampling rate. The frame is the "analysis_frame" in the code.
And the frame size is defined by the fftSize provided.
So we can probably conclude that we are taking a frame of 2048 samples (in our example), and with a sampling rate of 44100 Hz (samples per second) we are analyzing the sound for 1 / 44100 * 2048 ≈ 0.046 seconds.
I am also referring to this online book section “Frequency and Time Resolution Trade-Off” (and following sections).
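To make the trade-off that book section describes more concrete, here is a small sketch computing both resolutions for a few fftSize values (these values are just examples):

```js
// A larger fftSize gives narrower bins (better frequency resolution)
// but a longer frame (worse time resolution).
const sampleRate = 44100; // Hz

for (const fftSize of [256, 1024, 2048, 8192]) {
  const binWidthHz = sampleRate / fftSize;   // frequency resolution
  const frameSeconds = fftSize / sampleRate; // time covered by one frame
  console.log(
    `fftSize=${fftSize}: bin width ~${binWidthHz.toFixed(1)} Hz, ` +
    `frame length ~${frameSeconds.toFixed(4)} s`
  );
}
```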
(More to go here…)
(Next item is visualization from MDN example)
Here are the references I read:
On coding and implementation
Official Web Audio API documentation - helpful up to a certain degree; you get the basics, and with the example code they reference (on GitHub) you can already make something, but when I wanted to know why and how things are implemented, I had to find other sources.
Source code of Chromium audio processing (or the original repo for Blink webaudio) - each browser implements the Web Audio API differently under the hood; this one is for Chromium (the basis of Chrome), and from here I learned how some APIs are implemented, which answered some of my "why" questions.
Audio Deep Learning Made Simple series - even though it is more focused on deep learning, the first and second articles explain how audio data is processed and understood, clearing up a lot of basic questions.
WolfSound - good coverage of audio programming concepts and math; it also has a YouTube channel if you prefer video.
Music and Computers book - a book with good coverage of computer music, with sound examples and tiny web programs to help illustrate the ideas (still reading).
Mark Newman's course on the Fourier Transform - his explanation is far better than my university professors' back then; worth at least watching the free videos.
Physics of music notes from MTU - course notes for their class, might be useful (haven't finished reading yet).