dav1d 0.3.0 Sailfish: ARMed to the teeth

TL;DR: dav1d 0.3.0 decodes AV1 videos 24% faster on SSSE3, 26% faster on SSE4.1 and 4% faster on AVX2 (all PC), and 12% faster on Arm64 (mobile).

The open-source AV1 decoder dav1d was updated yesterday to version 0.3.0. With the third release, new assembly code provides some serious performance gains on both the PC and mobile platforms.

PC

On the x86 side, this release mostly improves the SSSE3 performance of dav1d. Xuefeng Jiang contributed chroma-from-luma prediction and Paeth intra prediction functions, improving overall performance by 0.8% and 0.4% respectively.
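For readers unfamiliar with Paeth intra prediction, here is a minimal scalar sketch of the per-pixel rule as AV1 defines it: each pixel is predicted from whichever of its left, top and top-left neighbours is closest to `left + top - top_left`. The function names `paeth_predict` and `paeth_block` are made up for this illustration; this is not dav1d's SSSE3 code, which processes many pixels at once.

```python
def paeth_predict(left, top, top_left):
    """Pick the neighbour closest to the gradient estimate left + top - top_left."""
    base = left + top - top_left
    d_left = abs(base - left)
    d_top = abs(base - top)
    d_top_left = abs(base - top_left)
    # Ties are resolved in the order left, top, top-left.
    if d_left <= d_top and d_left <= d_top_left:
        return left
    if d_top <= d_top_left:
        return top
    return top_left


def paeth_block(top_row, left_col, top_left):
    """Predict a block from its top edge, left edge and the single corner pixel."""
    return [[paeth_predict(left, top, top_left) for top in top_row]
            for left in left_col]


if __name__ == "__main__":
    # Tiny 2x2 block with hypothetical edge pixel values.
    for row in paeth_block(top_row=[10, 20], left_col=[10, 30], top_left=10):
        print(row)
```

Each output pixel depends only on edge pixels, never on other predicted pixels, so the work is independent per pixel. That is presumably what makes the function a good fit for SIMD assembly.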

Liwei Wang continued his work on the inverse transforms, now covering larger blocks: 8x32, 32x16, 32x32 and up to 64x64. This provides the largest speedup of this release, well over 10% on some videos.

dav1d 0.3.0 also introduces the first SSE4.1 assembly. In most cases the instructions added by SSE4.1 offer little on top of SSSE3, but Victorien Le Couviour--Tuffet found a use case where they do. He optimized the CDEF filter, resulting in a 1.15x speedup at the module level and around 1.5% overall.
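The gap between a module-level speedup and the overall gain follows from Amdahl's law. The sketch below inverts the law to estimate what share of decode time CDEF would need to have for a 1.15x module speedup to yield about 1.5% overall. It assumes "1.5% overall" means a 1.015x end-to-end speedup and that nothing else changed; the function names are illustrative and the resulting share (roughly 11%) is an inference, not a measured figure.

```python
def overall_speedup(fraction, module_speedup):
    """Amdahl's law: end-to-end speedup when only `fraction` of runtime gets faster."""
    return 1.0 / ((1.0 - fraction) + fraction / module_speedup)


def implied_fraction(overall, module_speedup):
    """Invert Amdahl's law: runtime share the module must have had beforehand."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / module_speedup)


# Figures from the text: 1.15x on the CDEF module, about 1.5% overall
# (interpreted here as a 1.015x end-to-end speedup, which is an assumption).
cdef_share = implied_fraction(overall=1.015, module_speedup=1.15)

if __name__ == "__main__":
    print(f"implied CDEF share of decode time: {cdef_share:.1%}")
    print(f"round-trip check: {overall_speedup(cdef_share, 1.15):.3f}x overall")
```

The same relationship explains why a roughly 3x speedup of a single function can translate into anything from a few percent to tens of percent overall, depending on how much a given clip leans on that function.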

Meanwhile, Henrik Gramner wrote some very clever SSE2 code to speed up entropy decoding/bitstream reading, which had started to eat up a large proportion of decode time, especially on AVX2. The assembly code resulted in a speedup for all 64-bit x86 platforms, measured at around 4% for AVX2 and 2% for SSSE3 and SSE4.1.

Overall these commits make dav1d 0.3.0 around 24% faster on SSSE3, 26% faster on SSE4.1 and 4% faster on AVX2 CPUs (full data).

While single-threaded aomdec is still quite strong, with multiple threads dav1d 0.3.0 is making libaom an even smaller spot in the rear view mirror (full data).

Arm64

Martin Storsjö delivered two very nice commits speeding up the loopfilter and self-guided looprestoration with NEON assembly code. Both functions were sped up by about 3x, resulting in performance gains anywhere from 7% to 36%. Not only does this allow for higher resolutions, frame rates and bitrates, it also brings down power consumption on identical content.

These updates push the first 1080p video above 25 fps on a single core of a Snapdragon 835. Using multiple threads, 30 fps is now rock solid and 60 fps is reachable on some content.

After normalizing the results, the RED clip in particular profits a lot, since it relies heavily on the loopfilter. Single-thread gains are between 11% and 36% (average 19%), multi-thread gains between 7% and 16% (full data).

Adoption

The adoption of dav1d is also going very well. The big news is that Chromium, the open-source project behind Google Chrome and now also Microsoft Edge, adopted dav1d and will ship it by default in Chrome 74.

Firefox 67 has also improved the dav1d implementation a lot. dav1d was updated to 0.2.1 and multiple tile threads are now used. dav1d is also enabled by default on Linux and macOS in addition to Windows.

FFmpeg and VLC still use dav1d, and HandBrake is also looking at integrating dav1d as soon as FFmpeg 4.2 is released.

YouTube is also encoding more and more AV1 streams. They have even encoded a few videos in 4K and 8K resolutions at up to 60 fps; watch them here (and enable AV1 for YouTube here).