Accelerating H264 decoding on iOS with FFMPEG and VideoToolbox

Damiaan Twelker · Published in LIVEOP X Team · Jul 10, 2018

At LIVEOP, we focus on providing first responders with the most relevant information in a concise manner, without compromising on our seamless user experience. When we partnered with Zepcam, a leading provider of wireless (body-worn) camera systems around the world, we wanted to make sure we delivered an experience that lives up to our high standards, without sacrificing performance or efficiency.

Optimal end-user experience is critical and perhaps the most valued aspect of LIVEOP X

Camera streams hosted by Zepcam come in several different formats, most importantly HTTP Live Streaming (HLS), a first-class citizen in the iOS ecosystem with built-in support in AVFoundation, and RTSP, the Real Time Streaming Protocol. HLS streams are commonly used for live television and news broadcasts. HLS focuses on a seamless experience for the viewer: frames may not be dropped or played back out of order, and a small buffer of upcoming frames is maintained to ensure smooth playback. The situations during which Zepcam streams are activated are often life-threatening. Officers could be live-streaming from their body-worn cameras while attempting to contain a riot, or a ladder engine with a camera mounted on top could be providing a bird's-eye view of a large building fire, including the position of firefighters on the ground. Our definition of a seamless user experience is different from the one prescribed by HTTP Live Streaming: in our case, it is important that the frames displayed to the user are as close to real time as possible. They may arrive out of order, and a couple of frames may be dropped, as long as this keeps the stream closer to real time. Adding up our requirements, we arrived at RTSP over UDP.

Apple does not provide support for playback of RTSP streams in any of its high-level frameworks. MPMoviePlayerController, AVPlayerItem and AVPlayer, all high-level system classes for playback of video streams, do not support RTSP. Fortunately, FFMPEG, the Swiss Army knife of audio/video processing, is equipped with the right tools to process and decode RTSP streams. FFMPEG has been around in the open-source community for 17 years and has established itself as a reliable force behind a variety of end-user applications, such as VLC, Google Chrome, and Chromium¹.

Underneath the beautiful package lies an immense amount of power. To keep it running smoothly when every second counts, heavy optimization is key.

Setting up FFMPEG

The RTSP streams served by Zepcam are encoded with the H264 codec. To prevent a massive increase in the binary size of our final iOS application file (.ipa), we chose to compile the latest release of FFMPEG (v4.0.1) from source, enabling only those features that we expect to use. We use the excellent build script found here, with a couple of adjustments:

  • Change the FF_VERSION variable to 4.0.1
  • Change the DEPLOYMENT_TARGET to the deployment target of your iOS application
  • Change the CONFIGURE_FLAGS to enable bitcode, and disable all features except those required for our stream:
CONFIGURE_FLAGS="--enable-cross-compile --disable-debug --disable-programs --disable-doc --extra-cflags=-fembed-bitcode --extra-cxxflags=-fembed-bitcode --disable-ffmpeg --disable-ffprobe --disable-avdevice --disable-avfilter --disable-encoders --disable-parsers --disable-decoders --disable-protocols --disable-filters  --disable-muxers --disable-bsfs --disable-indevs --disable-outdevs  --disable-demuxers --enable-protocol=file --enable-protocol=tcp --enable-protocol=udp --enable-decoder=mjpeg --enable-decoder=h264 --enable-parser=mjpeg --enable-parser=h264 --enable-parser=aac --enable-demuxer=rtsp --enable-videotoolbox"

In addition, a small change to the FFMPEG source file libswresample/arm/audio_convert_neon.S is required, as described here. Compilation should now succeed, yielding several libraries. Drag the libraries into your Xcode project and make sure to link them with your application target (Build Phases > Link Binary With Libraries).

The global setup required to achieve video playback through FFMPEG is quite straightforward. Open the input URL pointing to the RTSP stream with avformat_open_input, find the streams from the input with avformat_find_stream_info, allocate a codec context with avcodec_alloc_context3 and avcodec_parameters_to_context, and finally open the codec with avcodec_open2. It is important to implement proper error handling and memory cleanup for all of these calls, as any of them can fail depending on the circumstances. In our application we also chose to implement an interrupt callback in order to exit the blocking calls early in certain situations, such as a lack of internet connection as signalled by the SCNetworkReachability APIs, or after a custom timeout timer has expired. Coupling with the reachability APIs in particular allows us to circumvent built-in FFMPEG timeouts and fail early when no internet connection is detected.
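A minimal sketch of that setup against the FFmpeg 4.0 API might look as follows. Error handling is abbreviated, the function and variable names are our own, and the plumbing that flips abort_requested (from a reachability callback or a timeout timer) is assumed to exist elsewhere:

#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>

// Flag flipped elsewhere, e.g. from a reachability or timeout handler.
static volatile int abort_requested = 0;

// Returning non-zero makes the blocking FFMPEG calls bail out early.
static int interrupt_callback(void *opaque) {
    return abort_requested;
}

// Opens the RTSP URL and prepares a decoder for its video stream.
// Returns 0 on success, a negative AVERROR code on failure.
static int open_stream(const char *url, AVFormatContext **fmt_out,
                       AVCodecContext **codec_out, int *stream_index_out) {
    AVFormatContext *fmt = avformat_alloc_context();
    fmt->interrupt_callback.callback = interrupt_callback;

    AVDictionary *opts = NULL;
    av_dict_set(&opts, "rtsp_transport", "udp", 0); // we stream RTSP over UDP

    int ret = avformat_open_input(&fmt, url, NULL, &opts);
    av_dict_free(&opts);
    if (ret < 0) return ret; // fmt is freed by FFMPEG on failure

    if ((ret = avformat_find_stream_info(fmt, NULL)) < 0) goto fail;

    AVCodec *decoder = NULL;
    int index = av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, &decoder, 0);
    if (index < 0) { ret = index; goto fail; }

    AVCodecContext *codec = avcodec_alloc_context3(decoder);
    if (!codec) { ret = AVERROR(ENOMEM); goto fail; }
    if ((ret = avcodec_parameters_to_context(codec, fmt->streams[index]->codecpar)) < 0 ||
        (ret = avcodec_open2(codec, decoder, NULL)) < 0) {
        avcodec_free_context(&codec);
        goto fail;
    }

    *fmt_out = fmt;
    *codec_out = codec;
    *stream_index_out = index;
    return 0;

fail:
    avformat_close_input(&fmt);
    return ret;
}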

Decoding Frames

The AVCodecContext struct exposes a get_format field, which allows us to pick an output AVPixelFormat for the video frames delivered by the decoder, out of a list of available formats. If we leave this field unset, the video frames will be formatted as AV_PIX_FMT_YUV420P, the format automatically detected by the decoder based on the underlying stream. Images on iOS are formatted as RGB(A) (AV_PIX_FMT_RGB24), so an extra step would be required to convert the frames from AV_PIX_FMT_YUV420P to AV_PIX_FMT_RGB24 before display. libswscale provides a function, sws_scale, that does exactly this, but unfortunately it is not implemented on the GPU, meaning we incur a performance hit for the extra conversion step from YUV420P to RGB24.
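For reference, that CPU path would look roughly like the sketch below: converting a decoded YUV420P frame to RGB24 with libswscale. This is exactly the per-frame work we want to avoid; the helper name is ours, and in practice the SwsContext and destination frame would be reused rather than allocated per frame.

#include <libswscale/swscale.h>
#include <libavutil/frame.h>

// CPU fallback: convert a decoded YUV420P frame to RGB24 with libswscale.
static AVFrame *convert_to_rgb24(const AVFrame *src) {
    AVFrame *dst = av_frame_alloc();
    dst->format = AV_PIX_FMT_RGB24;
    dst->width  = src->width;
    dst->height = src->height;
    if (av_frame_get_buffer(dst, 0) < 0) {
        av_frame_free(&dst);
        return NULL;
    }

    struct SwsContext *sws = sws_getContext(src->width, src->height,
                                            (enum AVPixelFormat)src->format,
                                            dst->width, dst->height, AV_PIX_FMT_RGB24,
                                            SWS_BILINEAR, NULL, NULL, NULL);
    sws_scale(sws, (const uint8_t * const *)src->data, src->linesize,
              0, src->height, dst->data, dst->linesize);
    sws_freeContext(sws);
    return dst;
}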

Out of the list of available pixel formats we receive through the get_format function, one requires our special attention: AV_PIX_FMT_VIDEOTOOLBOX. Although poorly documented, this format tells the decoder to pass the incoming frames to Apple’s VideoToolbox.framework, which decodes each incoming frame on the GPU and returns a CVPixelBufferRef holding the decoded data. This is much preferred over the default implementation, which requires an extra conversion from YUV420P to RGB24 on the CPU. Our handler passed to the get_format field of the AVCodecContext now looks like this:

static enum AVPixelFormat negotiate_pixel_format(struct AVCodecContext *s,
                                                 const enum AVPixelFormat *fmt) {
    // Walk the list of formats offered by the decoder.
    while (*fmt != AV_PIX_FMT_NONE) {
        if (*fmt == AV_PIX_FMT_VIDEOTOOLBOX) {
            // Lazily set up the VideoToolbox hardware acceleration context.
            if (s->hwaccel_context == NULL) {
                int result = av_videotoolbox_default_init(s);
                if (result < 0) {
                    // Hardware setup failed: fall back to the decoder's own format.
                    return s->pix_fmt;
                }
            }
            return *fmt;
        }
        ++fmt;
    }
    // VideoToolbox was not offered: fall back to the decoder's own format.
    return s->pix_fmt;
}
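The handler is registered on the codec context before the codec is opened; in the setup sketch above, this single line would go just before the avcodec_open2 call:

// Register the negotiation callback; must happen before avcodec_open2().
codec->get_format = negotiate_pixel_format;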

We make sure the VideoToolbox format is available before attempting to use it. If it is not available, or if initializing the VideoToolbox integration fails, we fall back to the format originally found by the decoder. Note that AV_PIX_FMT_VIDEOTOOLBOX is unavailable on the iOS Simulator. In the teardown method of our video player class, we check whether the codec context’s hwaccel_context is not NULL and call av_videotoolbox_default_free if that is the case.
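The teardown can be sketched as follows, again with names of our own choosing and using the contexts created in the earlier setup sketch:

#include <libavformat/avformat.h>
#include <libavcodec/videotoolbox.h>

// Teardown: release the VideoToolbox hardware context if one was created,
// then free the codec and format contexts.
static void close_stream(AVCodecContext **codec, AVFormatContext **fmt) {
    if (*codec) {
        if ((*codec)->hwaccel_context != NULL) {
            av_videotoolbox_default_free(*codec);
        }
        avcodec_free_context(codec);
    }
    avformat_close_input(fmt);
}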

Individual video frames are received with avcodec_receive_frame. On a successful return, the AVFrame output parameter is filled with the data of a single decoded video frame, laid out according to the pixel format in use. If the VideoToolbox format was successfully negotiated, a CVPixelBufferRef holding the frame data can be found at AVFrame.data[3]. Although this is not explicitly documented, it can be inferred from the comments next to the definitions of other hardware-accelerated formats in pixfmt.h. The CVPixelBufferRef cannot be displayed immediately. We first convert it to a CIImage with +[CIImage imageWithCVPixelBuffer:], and then turn the CIImage into a UIImage with +[UIImage imageWithCIImage:]. The final UIImage describes one video frame and can now be displayed within your video player, e.g. implemented as a plain UIImageView.
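Putting the pieces together, the read/decode/display loop can be sketched as below (Objective-C). The function name and parameters are ours; in our app this logic runs on a background queue inside the video player class:

#import <UIKit/UIKit.h>
#import <CoreImage/CoreImage.h>
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>

// Reads packets from the stream, decodes them, and pushes each decoded frame
// to `imageView` on the main queue. Assumes the VideoToolbox pixel format was
// negotiated, so the CVPixelBufferRef sits in frame->data[3].
static void run_playback_loop(AVFormatContext *fmt, AVCodecContext *codec,
                              int videoStreamIndex, UIImageView *imageView) {
    AVPacket packet;
    av_init_packet(&packet);
    AVFrame *frame = av_frame_alloc();

    while (av_read_frame(fmt, &packet) >= 0) {
        if (packet.stream_index == videoStreamIndex &&
            avcodec_send_packet(codec, &packet) == 0) {
            while (avcodec_receive_frame(codec, frame) == 0) {
                CVPixelBufferRef pixelBuffer = (CVPixelBufferRef)frame->data[3];
                CIImage *ciImage = [CIImage imageWithCVPixelBuffer:pixelBuffer];
                UIImage *uiImage = [UIImage imageWithCIImage:ciImage];
                dispatch_async(dispatch_get_main_queue(), ^{
                    imageView.image = uiImage; // a plain UIImageView acts as the "player"
                });
            }
        }
        av_packet_unref(&packet);
    }
    av_frame_free(&frame);
}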

In our tests comparing the default pixel format (including the extra conversion step to RGB24) with the VideoToolbox format, we noticed a significant performance difference. With CPU decoding, we experienced overall sluggish playback and a relatively large number of frame drops. With GPU decoding, close to zero frames were dropped. Even though it is not documented particularly well, digging deeper into the internals of FFMPEG to achieve GPU-accelerated decoding of video streams on iOS is definitely worth it.

We cannot wait to see how our Zepcam integration will improve the workflow of the men and women working every day to keep our society safe.

  1. https://trac.ffmpeg.org/wiki/Projects
