How Arm’s NEON assembly enables efficient AV1 decoding on mobile

Ewout ter Hoeven
5 min read · May 17, 2019


Multimedia decoding is a challenge of scale. A modern processor doesn’t break a sweat over decoding a few pixels, but at 62 million pixels per second, even the tiniest amount of work per pixel adds up quickly.

This is where Arm’s NEON comes in: doing more work per clock cycle. NEON is an extension of the ARMv7 and ARMv8 instruction sets that enables single instruction, multiple data (SIMD) operations. This means that a single instruction processes not one large (or precise) number, but multiple smaller ones.

NEON allows 128 bits to be processed in a single instruction. In almost all cases, 128 bits is far too precise to be useful. If we had a coordinate system with 128-bit precision, we could specify any point as far away as the Andromeda Galaxy to an accuracy of 0.00006 picometre. To put that in perspective: the diameter of the smallest atom, hydrogen, is 32 pm, and the Andromeda Galaxy is 2.5 million light years away.

So to display your cat video, we don’t need that kind of precision. Mostly, we only need 8 bits per color per pixel, or 10 or 12 bits for HDR. This means we could potentially fit 16 color values in one 128-bit operation, or at least 8 when decoding HDR or when a slightly higher precision is needed to prevent rounding errors down the line.

Knowing that NEON can fit multiple data values in a single operation and that video decoders need to operate on a ton of values, they sound like a great match.

NEON in dav1d

dav1d is the AV1 decoder maintained by VideoLAN, the people behind the VLC media player and the x264 and x265 video encoders. Lots of developers contribute to the project, working toward the goal of being the fastest AV1 video decoder that runs on virtually any CPU.

Back in December 2018, for the dav1d 0.1.0 release post, I compared dav1d using only C code with dav1d using NEON assembly on a bunch of different ARMv8 processors (huge thanks to Janne Grunau and Martin Storsjö for the numbers). At that time only a few functions were accelerated with NEON, but performance still increased by 80% on average.

dav1d 0.1.0 without and with NEON code

A few months later, a lot more NEON assembly has been written. But before we get to the final results, it is interesting to see what exactly got faster.

Functions and their speedups

Decoding a video requires several steps, defined in the specification of the video format. Each step is handled by a separate function, and depending on the encoder, its settings, and the content, some steps are used more, less, or not at all.

dav1d’s developers rely heavily on a tool called checkasm to benchmark how long a particular function takes. They write assembly code, test it with checkasm, and if it’s faster, it’s likely to get merged.

For the tests Martin Storsjö performed, two compilers were used (Clang 9 and GCC 7) and three different cores: an Arm Cortex-A53, -A72 and -A73. The first is a ‘small’ in-order core; the last two are big out-of-order cores.

The table below shows the processed results for all functions currently accelerated with NEON. The numbers are speedups: if the C code took 5 seconds and the NEON code 2, the speedup is 2.5. The full results for all 146 functions can be viewed here.

We can observe a lot from this table. First of all, the speedups are very broadly distributed, ranging from a few percent to factors of 20+. We also see that in most cases the Clang compiler optimizes the C code a little better than GCC (so the NEON speedup is smaller). Furthermore, the speedup on the small in-order A53 core is higher than on the out-of-order cores, while the A73 profits more than the A72, likely due to the reduced decode width of the former.

Something to keep in mind is that some functions are used more than others, which makes a plain average speedup not very representative. For the current NEON functions it is between 5 and 7, depending on core and compiler. A weighted average could be calculated, but it would differ heavily per video (encoder, encoder settings, content).

But in general, hand-written NEON assembly is 4 to 5 times faster than compiler-optimized C on most functions, and over 20x faster in exceptional cases.

dav1d 0.3.1 performance

I will just start with the graph you all want to see:

On this 1080p video (details here) a huge difference can be observed. Where the Apple A7 and Snapdragon 835 can’t reach 24 fps with compiler-optimized C (Clang was used), the NEON assembly enables 30 fps without a hitch. And the Apple A10 jumps from 45 to above 100 fps. Keep in mind that on mobile, these performance boosts directly translate into lower power usage, saving that precious battery.

If we normalize the results, we can get a closer look at the exact speedup:

The Cortex-A73 in the Snapdragon 835 profits the most, with almost a 3x speedup. The other cores average just below 2.5x. This means dav1d went from 1x performance with optimized C, to 1.8x with dav1d 0.1.0’s NEON, and now to 2.5x with dav1d 0.3.1’s NEON.

Future

dav1d’s Arm64 development is far from done. The most critical functions are now sped up with NEON assembly on mobile (and with AVX2 and SSSE3 on PC), but there are still significant gains ahead, hopefully reaching an average acceleration factor of 3x at some point. Better auto-vectorization could also help a lot, but the main driver remains hand-written assembly.
