Co-authored with Jon Mercer
We wanted something simple: play podcasts at 4x speed, maybe even 5x. We were using Apple’s AVPlayer, but it couldn’t play past 2x without audio distortion.
Note: Since originally writing this post, we’ve discovered that AVPlayer can actually handle rates higher than 2x, although its documentation doesn’t explicitly say so. Regardless, we still value our implementation, since it can do the other audio manipulations that AVAudioEngine makes available.
Here are our options for iOS:
- + Super simple
- + Gets things done
- + Does streaming and downloaded audio
- – Doesn’t do more than 2x (not true, as explained above)
- – Can’t do the more advanced ideas we have for the future
- + Still simple
- + Still gets things done
- + Can do more than 2x
- – Doesn’t do streaming well
- * Can do some of our advanced ideas
- + Playing downloaded audio is super simple
- + Can do 32x speed
- + Can do all the advanced ideas we have: EQ, distortion and more
- + Apple endorsed as seen in WWDC 2017
- – Playing streaming audio is a huuuge pain
- + Can do all advanced ideas
- – Painful to work with
- – Apple’s deprecating parts of it
We decided that AVAudioEngine was the way to go, but there were plenty of pitfalls. If you’re a developer, here are some things to watch out for:
- Bad documentation. You’ll have to scour the smallest corners of the web to find information.
- Bugs that only a few people have experienced. StackOverflow won’t help you here. This year there were only 82 total questions for AVAudioEngine.
- You should learn how audio is digitized and stored: compressed, uncompressed, VBR, CBR, etc.
- If you want streaming to work, you’ll have to be familiar with URLSession.
- Concurrency was a pain. Get familiar with GCD and log statements.
- Even though you’re working in Swift, you’ll still need to do manual memory allocation. Pair this with concurrency problems and debugging gets disheartening.
This post outlines how we configured our engine for the Chameleon Podcast Player. Neither of us is an iOS audio engineer, so we started off with this tutorial from FastLearner as our foundation. We will start with an overview of our architecture and then delve deeper into each layer.
We chose a layer-based approach. This kept things simple from a debugging point of view by allowing us to isolate problems into a single layer. The image below shows the real architecture and this post will only focus on the green parts.
An ELI5 explanation:
Apple built the engine to take in only a file or a special format of data. Let’s call it engine format. The converter creates engine format from parsed audio format. The parser converts streamed data, in whatever format the server provided, into parsed audio format. The parser is also in charge of finding out audio properties like duration. The throttler sends internet-streamed data up into the parser, but at a steady rate so the phone does not get hot or freeze. The streamer connects to the internet and grabs the audio data online.
For the rest of this post, we’ll focus mainly on streaming. We’ll completely ignore playback from disk, as it’s an easier problem. We’ll discuss the layers in order from bottom to top.
Network Streaming Layer
We used URLSessionDataTask to stream data. We had to ensure that all background downloads were paused when the user started to stream. We also ensured that only one streaming task was running at a time.
Most of the complexity came from the case where the user seeks to a part of the audio that we haven’t received from the server yet. To handle this we had to:
- Cancel the currently running task
- Clear the currently buffered data and prevent stale data from coming in
- Convert the position the user seeked to from seconds into a byte offset
- Make a byte range request
- Resume sending up network data
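The seconds-to-bytes conversion and the byte range request from the steps above could be sketched roughly as follows. This is a minimal illustration rather than our production code: `StreamInfo`, `byteOffset(forSeconds:in:)` and `rangeRequest(for:from:)` are hypothetical names, and the arithmetic assumes a constant bit rate, so for variable bit rate audio the result is only an approximation.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on non-Apple platforms
#endif

/// Hypothetical description of a stream; assumes a constant bit rate.
struct StreamInfo {
    let audioDataOffset: Int64   // bytes of metadata before the first audio packet
    let bitRate: Double          // bits per second
    let totalBytes: Int64        // size of the whole file in bytes
}

/// Converts a seek position in seconds into a byte offset, clamped to the file.
func byteOffset(forSeconds seconds: Double, in info: StreamInfo) -> Int64 {
    let audioBytes = Int64(max(0, seconds) * info.bitRate / 8)
    let lastByte = max(info.totalBytes - 1, 0)
    return min(info.audioDataOffset + audioBytes, lastByte)
}

/// Builds a request asking the server for everything from `offset` onward.
func rangeRequest(for url: URL, from offset: Int64) -> URLRequest {
    var request = URLRequest(url: url)
    request.setValue("bytes=\(offset)-", forHTTPHeaderField: "Range")
    return request
}
```

The metadata offset is added before the audio bytes because the seek position is measured from the start of the audio, not from the start of the file. A server that honors the header typically answers with 206 Partial Content; one that ignores it sends the whole file from byte 0, so it is worth checking the response status before trusting the data.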
We found many questions on StackOverflow asking how to do data seeking. There wasn’t much information so we ended up looking at how C and Java applications did their seeking and built the same thing in Swift.
We implemented the throttler later on, when we found that our app was hitting over 100% CPU usage while streaming audio. The parser uses a lower-level API to parse network data, and that API is very CPU intensive: it takes about 20ms to parse one network data packet. We received a network data packet every 200ms and fed it directly into the parser. This meant that even though the user was listening to minute 3 of the audio, we were already processing minute 20 unnecessarily.
The solution we went with was to parse audio data only as it was needed and hold on to the rest until then. See the image below.
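As a rough sketch of the hold-and-release idea, a throttler can queue raw network chunks as they arrive and hand them upward only when asked. The `Throttler` class and its method names are hypothetical and chosen for illustration; the real layer also has to decide when to release, which is left out here.

```swift
import Foundation
import Dispatch

/// Hypothetical throttler: buffers network chunks and releases them on demand.
final class Throttler {
    private var pending: [Data] = []
    // Serial queue that guards `pending`, since the network callbacks and
    // the consumer may run on different threads.
    private let queue = DispatchQueue(label: "throttler.pending")

    /// Push side: called by the streamer whenever a network chunk arrives.
    func receive(_ chunk: Data) {
        queue.sync { pending.append(chunk) }
    }

    /// Pull side: returns at most `maxChunks` chunks, oldest first.
    func release(maxChunks: Int) -> [Data] {
        return queue.sync { () -> [Data] in
            let count = max(0, min(maxChunks, pending.count))
            let released = Array(pending.prefix(count))
            pending.removeFirst(count)
            return released
        }
    }

    /// Drops everything buffered, for example after a seek makes it stale.
    func reset() {
        queue.sync { pending.removeAll() }
    }

    /// Number of chunks still waiting to be parsed.
    var pendingCount: Int {
        return queue.sync { pending.count }
    }
}
```

Holding unparsed bytes is cheap compared with parsing them, so the expensive work is deferred until playback actually gets close to that part of the audio.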
The parser takes network data and breaks it up into meaningful chunks. The diagram below is our interpretation of the conversion from raw mp3 data to PCM buffers. BIG CAVEAT: THIS MAY NOT BE THE MOST ACCURATE INTERPRETATION. This is just our mental model. Here’s Apple's own wording.
The parser stores these meaningful chunks and provides them to the converter. Some complexities with this layer:
- Metadata can sometimes be as large as 100KB. It’s important to account for this when the user seeks.
- Some audio files will tell you the duration in their metadata, but most don’t. If they don’t, we have to predict the duration manually. We say predict because audio sometimes comes in at a variable bit rate, which makes consistent calculations difficult.
- The first few network packets are used to find the file format. Only when we know the file format can the converter (see below) start.
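To make the duration prediction concrete, here is one way it could be estimated when the metadata carries no duration. Both function names are hypothetical, and the result is only as good as the bit rate fed into it: for variable bit rate audio, an average observed over the packets parsed so far will drift as more of the file arrives.

```swift
/// Estimates duration in seconds from byte counts and an average bit rate.
/// Returns nil when the inputs cannot produce a meaningful estimate.
func estimatedDuration(totalBytes: Int64,
                       metadataBytes: Int64,
                       averageBitRate: Double) -> Double? {
    guard averageBitRate > 0, totalBytes > metadataBytes else { return nil }
    // Only the audio payload counts; metadata bytes carry no playable audio.
    return Double(totalBytes - metadataBytes) * 8 / averageBitRate
}

/// Average bit rate (bits per second) over the packets parsed so far.
func observedBitRate(packetBytes: Int64, packetSeconds: Double) -> Double? {
    guard packetSeconds > 0 else { return nil }
    return Double(packetBytes) * 8 / packetSeconds
}
```

For example, a 4,900,000-byte file with 100,000 bytes of metadata at 128,000 bits per second works out to 4,800,000 × 8 / 128,000 = 300 seconds.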
The converter takes parsed audio packets and transforms them into a format that the engine can play. It takes these converted packets and fills up a buffer for the engine. The engine polls the converter continuously for buffers, and the converter throws errors when it doesn’t have enough data.
- Lower-level APIs only return OSStatus errors. We had to convert these errors into Swift-style errors to be meaningful for the rest of the app.
- Holding relative state of audio (explained more below)
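One possible shape for the OSStatus conversion is a small error type plus a checking helper, sketched below. `AudioStatusError` and `check(_:_:)` are hypothetical names, not our actual types. The sketch uses `Int32` directly, since `OSStatus` is a typealias for `Int32` on Apple platforms, and it relies on the fact that many Core Audio status codes are four ASCII characters packed into an integer.

```swift
import Foundation

/// Hypothetical Swift-style error wrapping a raw status code from a C audio API.
struct AudioStatusError: Error, CustomStringConvertible {
    let status: Int32        // raw OSStatus value
    let operation: String    // what we were doing when the call failed

    /// The status as a four-character code, if all four bytes are printable ASCII.
    var fourCharCode: String? {
        let raw = UInt32(bitPattern: status)
        let shifts: [UInt32] = [24, 16, 8, 0]
        let bytes = shifts.map { UInt8((raw >> $0) & 0xFF) }
        guard bytes.allSatisfy({ (0x20...0x7E).contains($0) }) else { return nil }
        return String(bytes: bytes, encoding: .ascii)
    }

    var description: String {
        let code = fourCharCode.map { "'\($0)'" } ?? String(status)
        return "\(operation) failed with status \(code)"
    }
}

/// Throws if `status` is nonzero; zero (`noErr`) means success.
func check(_ status: Int32, _ operation: String) throws {
    guard status == 0 else {
        throw AudioStatusError(status: status, operation: operation)
    }
}
```

Wrapping each C call as `try check(someCall(...), "parse bytes")` lets the rest of the app use ordinary `do`/`catch`, and the decoded four-character code is usually easier to search for than the raw integer.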
AVAudioEngine uses a player that can take in two types of input: a URL for a file on disk, or PCM buffers. Since streamed audio’s URL points to a remote server, we had to convert network data (mp3) into PCM buffers.
The engine works on a pull model: it has to pull buffers in order to play them, and buffers cannot be pushed into the engine. HTTP streaming, on the other hand, works on a push model where URLSession pushes data to us from the server. The parser works as the border between the push and pull models.
The engine has no notion of its position in the audio. It simply grabs PCM buffers and plays them. The converter is responsible for holding on to the relative position of the engine: it determines what should be played next and offers it to the engine.
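The push/pull boundary and the idea of the converter owning the playback position can be illustrated with a toy model. Everything here is a simplification under stated assumptions: `PositionTrackingConverter` is a hypothetical name, plain `Float` arrays stand in for parsed packets and PCM buffers, and no thread safety is included.

```swift
/// Thrown when the engine asks for more audio than has arrived so far.
enum ConverterError: Error { case notEnoughData }

/// Hypothetical converter that owns the playback cursor on the engine's behalf.
final class PositionTrackingConverter {
    private var frames: [Float] = []     // audio pushed in from the layers below
    private(set) var nextFrame = 0       // relative position: next frame to hand out
    let sampleRate: Double

    init(sampleRate: Double) { self.sampleRate = sampleRate }

    /// Push side: audio is appended as it arrives from the network.
    func append(_ newFrames: [Float]) {
        frames.append(contentsOf: newFrames)
    }

    /// Pull side: the engine asks for the next `count` frames.
    func pullBuffer(frameCount count: Int) throws -> [Float] {
        guard count > 0, nextFrame + count <= frames.count else {
            throw ConverterError.notEnoughData
        }
        defer { nextFrame += count }     // advance the cursor after the slice is taken
        return Array(frames[nextFrame..<nextFrame + count])
    }

    /// Playback position in seconds, derived from the frames handed out so far.
    var currentTime: Double { return Double(nextFrame) / sampleRate }

    /// Seeking only moves the cursor; the engine never needs to know.
    func seek(toSeconds seconds: Double) {
        nextFrame = max(0, Int(seconds * sampleRate))
    }
}
```

Because the position lives in the converter, the engine just plays whatever buffer it is handed next, and the current time shown to the user can be derived from the converter’s cursor.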
Putting it all together
We made a high-level image of the engine’s data flow below. There are three phases for each layer.
- Initialization. Each class is initialized and the network streaming starts.
- Full blast. The streamer and throttler send data up to the parser at full rate until the file format is found. The converter waits for the file format while the engine keeps nagging the converter. When the file format is found, the converter receives it and starts feeding PCM buffers to the engine.
- Throttled. When the engine has enough buffers (300 in our case), it switches to throttle mode. Instead of continually asking the converter, the engine asks only when it consumes a PCM buffer. Throttle mode saves battery.
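The switch from full blast to throttled can be modeled as a small state machine. This is a sketch of the idea only: `BufferScheduler` and `PollingMode` are hypothetical names, and the default threshold of 300 is the figure mentioned above.

```swift
/// How the engine asks the converter for buffers.
enum PollingMode { case fullBlast, throttled }

/// Hypothetical bookkeeping for the engine's buffer requests.
struct BufferScheduler {
    let throttleThreshold: Int
    private(set) var scheduledBuffers = 0
    private(set) var mode = PollingMode.fullBlast

    init(throttleThreshold: Int = 300) {
        self.throttleThreshold = throttleThreshold
    }

    /// Call whenever a PCM buffer has been handed to the engine.
    mutating func didSchedule() {
        scheduledBuffers += 1
        if scheduledBuffers >= throttleThreshold { mode = .throttled }
    }

    /// Call whenever the engine finishes playing a buffer.
    /// Returns true when the converter should be asked for one replacement.
    mutating func didConsume() -> Bool {
        scheduledBuffers = max(0, scheduledBuffers - 1)
        return mode == .throttled
    }

    /// In full-blast mode the engine keeps polling the converter continuously.
    var shouldPollContinuously: Bool { return mode == .fullBlast }
}
```

In full-blast mode the polling loop does the asking, so `didConsume()` returns false. Once throttled, each consumed buffer triggers exactly one request, which keeps the queue topped up without constant polling.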
Here is the detailed image of everything working together: