Phonograph.js: Tolerable mobile web audio

Audio on the mobile web is a mess. The easy way to play sound — creating an <audio> element and calling the audio.play() method — doesn’t work unless playback starts in response to a ‘user gesture’, and will only let you play one clip at a time.

The hard way — loading the audio, decoding it using the web audio API’s context.decodeAudioData(…), creating an AudioBufferSourceNode and playing that — gives you a lot more flexibility, but comes with a rather important caveat: it will crash your phone.
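
For reference, the hard way looks roughly like the sketch below. The name playDecoded and its parameters are made up for illustration. It uses the callback form of decodeAudioData, and assumes you already have an AudioContext and the file's bytes in an ArrayBuffer.

```javascript
// Illustrative sketch only: `playDecoded` is a hypothetical helper, not part of Phonograph.
// `context` is an AudioContext; `arrayBuffer` holds the complete, still-encoded file.
function playDecoded(context, arrayBuffer, onError) {
  context.decodeAudioData(
    arrayBuffer,
    (audioBuffer) => {
      // At this point the whole clip sits in memory as decoded audio
      const source = context.createBufferSource();
      source.buffer = audioBuffer;
      source.connect(context.destination);
      source.start(0);
    },
    onError
  );
}
```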

There’s a simple reason for that. The browser needs to store the entire audio clip, decoded, in memory. Since a 5MB mp3 file typically equates to a 55MB wav file, you can quickly run out of memory if you’re using large audio files. When that happens, the way you find out about it is the whole tab going kaput.
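
To see where a figure like that comes from, here is a back-of-the-envelope sketch. The function name is hypothetical, and the 128kbps bitrate, 44.1kHz sample rate, two channels and 16-bit samples are assumptions picked for the example rather than anything the browser guarantees.

```javascript
// Rough estimate of how much memory a decoded clip needs.
// Assumes a constant bitrate; all default values are illustrative assumptions.
function estimateDecodedBytes(mp3Bytes, bitrateKbps, sampleRate = 44100, channels = 2, bytesPerSample = 2) {
  // Encoded size and bitrate give the clip's duration in seconds
  const durationSeconds = (mp3Bytes * 8) / (bitrateKbps * 1000);
  // Uncompressed audio stores one sample per channel, per tick of the sample rate
  return durationSeconds * sampleRate * channels * bytesPerSample;
}

// A 5MB file at 128kbps lasts 312.5 seconds
console.log(estimateDecodedBytes(5000000, 128)); // 55125000, i.e. about 55MB
```

The web audio API works with 32-bit floating point samples, so passing 4 as bytesPerSample is arguably closer to what the browser ends up holding, which doubles the figure.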

For RioRun, an interactive podcast we (the Guardian US interactive team) recently built, this was a major problem. At any one time we might have as many as three separate layered audio clips playing at once, and each of those clips might be several minutes in length. And even if we didn’t have to worry about bursting the memory banks, the web audio API approach has another major drawback in that you have to download the entire clip before you can start playing any of it.

Since playback is controlled by the distance you’ve covered, as measured by your phone’s GPS, using <audio> is a non-starter — we can’t rely on the user tapping their screen.

Break it down

We created Phonograph, an open source JavaScript library, to tackle this problem. It exploits a useful fact about mp3 files: like the planarian flatworm, you can slice them into smaller chunks and they won’t die — each chunk becomes a block of audio that can be played independently.

By reading in the raw binary data and breaking it into Uint8Arrays of a few kilobytes each, we can decode just enough audio to get us through the next few seconds. As we reach the end of each chunk, the next one is decoded and starts playing.

Better still, if we’re in a browser that supports the fetch API and implements streaming, we can start playback before download is complete, by estimating the duration of the clip and how long it will take to arrive at the current rate. (That is, of course, something you get for free with traditional HTML5 <audio> — just not with the web audio API.)
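
The ‘can we start yet?’ decision might look something like the sketch below. This is a guess at the general shape of the calculation, not Phonograph’s actual code: the function name and its fields are hypothetical, and it assumes the download rate stays roughly constant and that the total size is known up front (from a Content-Length header, say).

```javascript
// Hypothetical helper: is it safe to start playing a clip that is still downloading?
// bytesLoaded / totalBytes: download progress so far
// elapsedSeconds: time since the download started
// clipDuration: estimated length of the whole clip, in seconds
function canStartPlayback({ bytesLoaded, totalBytes, elapsedSeconds, clipDuration }) {
  if (bytesLoaded >= totalBytes) return true; // already fully downloaded
  if (bytesLoaded === 0 || elapsedSeconds <= 0) return false; // no rate to go on yet

  const bytesPerSecond = bytesLoaded / elapsedSeconds;
  const secondsRemaining = (totalBytes - bytesLoaded) / bytesPerSecond;

  // If the rest of the file arrives before playback would reach the end,
  // playback should never catch up with the download.
  return secondsRemaining <= clipDuration;
}
```

In a streaming browser, bytesLoaded would be updated each time the reader obtained from response.body.getReader() hands over another piece of the file. A real implementation would probably want a safety margin on top of this.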

Here be dragons

That’s the theory, at least. It turns out to be somewhat more challenging in practice. For one thing, you can’t just slice the mp3 file anywhere: you have to do it on a frame boundary (on which more later), otherwise you’ll lose data. Even then you’ll get audible seams between chunks because of something called the bit reservoir. This is one of the tricks that mp3 encoders use to cram more data into a smaller space. By filling unused space in less-complex-to-encode parts of the clip (such as silence) with extra bytes from upcoming more-complex-to-encode parts, encoders can achieve better quality with the same filesize. (This isn’t variable bitrate encoding or VBR, by the way. That’s a whole other can of planarian flatworms that we’ll open later.) The upshot is that any one frame may depend on as many as 9 preceding frames, and since a frame represents about 1/40th of a second, your ears notice it.

Phonograph solves this problem by linking chunks together: each chunk appends the first few kilobytes of the next chunk’s raw data to its own. Rather than stopping playback at the end of the audio that ‘belongs’ to the chunk, it continues for a fraction of a second while the next chunk starts playing silently. Once it’s safe to do so, Phonograph silences the first clip and unsilences the second.
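
The appending step could be sketched as follows. The function name and the overlapBytes parameter are hypothetical, and the sketch takes for granted that the chunks have already been cut on frame boundaries.

```javascript
// Hypothetical sketch: give each chunk a copy of the start of the chunk after it.
// `chunks` is an array of Uint8Arrays; `overlapBytes` is how much of the next chunk to borrow.
function linkChunks(chunks, overlapBytes) {
  return chunks.map((chunk, i) => {
    const next = chunks[i + 1];
    if (!next) return chunk; // the last chunk has nothing to borrow from

    // subarray() clamps to the end of `next` if it is shorter than overlapBytes
    const extra = next.subarray(0, overlapBytes);
    const linked = new Uint8Array(chunk.length + extra.length);
    linked.set(chunk, 0);
    linked.set(extra, chunk.length);
    return linked;
  });
}
```

Copying the bytes costs a little memory per chunk, but it means each linked chunk can be handed to the decoder on its own.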

Synchronising the volume changes is easy since the web audio API lets us schedule things like volume changes to a quadrillionth of a second. (Really! That’s a millionth of a billionth. Of course your hardware can’t possibly match that precision, but it means that two events scheduled for the same moment in time will definitely happen together.)
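
Here is a minimal sketch of that handover, assuming each chunk plays through its own GainNode. The function name is hypothetical; setValueAtTime is the standard AudioParam method for scheduling a value change.

```javascript
// Hypothetical sketch: swap which chunk is audible at exactly `handoverTime`.
// `currentGain` and `nextGain` are GainNodes; `handoverTime` is in seconds
// on the AudioContext's clock (the same clock as context.currentTime).
function scheduleHandover(currentGain, nextGain, handoverTime) {
  currentGain.gain.setValueAtTime(0, handoverTime); // silence the outgoing chunk
  nextGain.gain.setValueAtTime(1, handoverTime); // unsilence the incoming chunk
}

// Usage, given a real AudioContext called `context`:
// scheduleHandover(gainA, gainB, context.currentTime + 0.5);
```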

But our problems have only just begun.

How long is a piece of audio?

When you decode a chunk of audio with context.decodeAudioData(…), the resulting AudioBuffer has a duration property representing the length of the clip in seconds. We need to know that in order to schedule playback of the subsequent chunk.

The problem is that duration is quite likely to be a complete fiction.

Safari will read the file header rather than the frame header for the first chunk, basing the duration property on how long the mp3 file claims to be, rather than what has actually been decoded. After that, it does a much better job, but that doesn’t help us if the second chunk starts playing several minutes too late.

Chrome, meanwhile, essentially spits out random numbers. It’s fine for CBR files — those encoded at a constant bitrate, like 128kbps — but for variable bitrate (VBR) files it’s likely to underestimate wildly.

When browser APIs fail us, we only have one choice: implement the logic the hard way in JavaScript.

The anatomy of an mp3 file

In order to do that, we need to understand the data we’re dealing with at the level of ones and zeroes. Helpfully, while the mp3 specification is copyrighted (you can get the paper if you want to pay for it; I don’t), kind and smart people have reverse engineered it and made their findings freely available. (If you’re interested in this stuff, Let’s build an MP3-decoder by Björn Edström is also essential reading, though it doesn’t strictly pertain to what we’re doing with Phonograph.)

For our current purposes, we need to understand a few things:

  • An MP3 file is composed of a series of frames. Each frame contains 1152 samples (for MPEG-1 Layer III, the usual case). Typically, the sample rate is 44.1kHz or 48kHz, making a frame last around 1/40th of a second.
  • Each frame has a frame header followed by the ‘main data’.
  • Frame headers are four bytes (32 bits) long and contain information about encoding format, sample rate, bitrate, number of channels, and various other things. Each header starts with 11 ones (or 12, if you don’t support MPEG Version 2.5, which Phonograph currently doesn’t) — this is called frame sync and is easy to identify in a sea of binary data, though you will occasionally get false positives.
  • Some of these things change from one frame to the next, meaning you can’t just look out for the same four byte sequence to identify frame headers. But some things don’t, making it possible to eliminate false positives from the frame sync check.

Knowing all that, we can scan through the data one byte at a time until we find a frame sync header, eliminate false positives by checking that the frame’s metadata matches our reference header (the first one encountered) vis-à-vis sample rate and so on, and increment a frame counter.
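
The scan described above can be sketched roughly like this. It is a simplified illustration rather than Phonograph’s source: the names are hypothetical, it only recognises MPEG-1 Layer III headers, and it ignores ‘free format’ frames. It also jumps ahead by the computed frame length after each match, which is an addition in this sketch to cut down on false positives inside the audio data.

```javascript
// MPEG-1 Layer III lookup tables, indexed by fields in the frame header
const BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320];
const SAMPLE_RATES = [44100, 48000, 32000];

// Returns header info if the four bytes at `offset` look like an
// MPEG-1 Layer III frame header, otherwise null.
function parseFrameHeader(data, offset) {
  if (offset + 4 > data.length) return null;
  const b1 = data[offset + 1];
  const b2 = data[offset + 2];

  // 0xFF then 0xFA or 0xFB: frame sync, MPEG Version 1, Layer III
  if (data[offset] !== 0xff || (b1 & 0xfe) !== 0xfa) return null;

  const bitrateIndex = b2 >> 4;
  const sampleRateIndex = (b2 >> 2) & 0x03;
  // Index 0 is 'free format' (unsupported here); 15 and 3 are invalid values
  if (bitrateIndex === 0 || bitrateIndex === 15 || sampleRateIndex === 3) return null;

  const bitrate = BITRATES_KBPS[bitrateIndex] * 1000;
  const sampleRate = SAMPLE_RATES[sampleRateIndex];
  const padding = (b2 >> 1) & 0x01;
  const frameLength = Math.floor((144 * bitrate) / sampleRate) + padding;
  return { sampleRate, frameLength };
}

// Counts frames whose sample rate matches the first valid header found.
function countFrames(data) {
  let reference = null;
  let frames = 0;
  let i = 0;
  while (i + 4 <= data.length) {
    const header = parseFrameHeader(data, i);
    if (header && (!reference || header.sampleRate === reference.sampleRate)) {
      if (!reference) reference = header;
      frames += 1;
      i += header.frameLength; // skip over this frame's data
    } else {
      i += 1; // not a header (or a false positive): keep scanning
    }
  }
  return { frames, sampleRate: reference ? reference.sampleRate : null };
}
```

One weakness worth knowing about: if the very first match is itself a false positive, it becomes the reference header, so a sturdier version would want to confirm the reference against the frame that follows it.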

Once we know how many frames there are, we multiply that number by 1152 and divide it by the sample rate (which we know from the reference header) to determine the exact duration of the clip.
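
As a quick worked example of that arithmetic (the function name is hypothetical, and 1152 is the samples-per-frame figure from the list above):

```javascript
const SAMPLES_PER_FRAME = 1152;

// Duration in seconds of `frameCount` frames at the given sample rate
function durationFromFrames(frameCount, sampleRate) {
  return (frameCount * SAMPLES_PER_FRAME) / sampleRate;
}

console.log(durationFromFrames(1225, 44100)); // 32
console.log(durationFromFrames(1000, 48000)); // 24
```

Because the result depends only on the frame count and the sample rate, it comes out the same whether the file is CBR or VBR.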

Surprisingly, all this happens in fractions of a millisecond, meaning the page continues to work smoothly and the user is none the wiser.

But seriously

These are quite mad lengths to go to for something as straightforward as playing sound. Insisting on a user gesture for <audio> playback doesn’t accomplish anything except artificially restricting the capabilities of the web relative to native apps. It’s not in any spec, doesn’t achieve its stated goals, and simply results in a worse experience for both developers and users.

Browser makers should accept that it was a dumb mistake, and change the behaviour of mobile browsers to match desktop browsers.

Phonograph is a work in progress

We’ve found it to be pretty robust on Safari and Chrome, which dominate mobile browser usage. Firefox and Opera are a different story, since both apparently refuse to decode mp3 data.

Since we plan to use it in future projects, we’ll keep hammering away at these issues. But in the meantime, we’d love additional feedback and bug reports, so if you end up using Phonograph in a project please let us know how you get on.