dav1d 0.1.0 release: The first benchmarks

Dec 11, 2018

dav1d 0.1.0 “Gazelle” was released today as the first milestone for the AV1 decoder created by VideoLAN. dav1d’s goal is to accelerate the adoption of the new AV1 codec, created by the Alliance for Open Media (AOM), by enabling software decoding on as many devices as possible. AV1 promised 25% bitrate savings at the same quality and is open-source and free, moving away from the licensing struggles of HEVC.

But how fast is it? Here are the first benchmarks.

PC: x86–64 with AVX2

Most users are going to experience AV1 first on the desktop. dav1d 0.1.0 is mainly optimized for CPUs that support Advanced Vector Extensions 2 (AVX2), an instruction set extension introduced with the Intel Haswell architecture in 2013 and supported by AMD since the Excavator architecture from 2015. It’s estimated that a little over half of desktop CPUs support AVX2 today.

Threading

One of the big strengths of dav1d is its high scalability. Where aomdec, the default decoder from the AOM, only uses tile threads to decode multiple tiles from one frame simultaneously, dav1d also uses frame threads to decode multiple frames in parallel. This makes dav1d fast no matter how the content is encoded, even with a single tile.

Here’s the performance of aomdec and dav1d, in fps, on two 1080p 8-bit 4:2:0 clips.

Thread scaling on an AMD Ryzen 5 1600 (6 core, 12 thread)

In the graphs above, aomdec stops scaling after 6 threads for the Chimera video (from Netflix) and doesn’t scale much beyond 2 threads on the Feels Like Summer video (from YouTube/Google). dav1d keeps scaling well past that point. For a more extreme example, we tested on a 32-core AMD Epyc processor:

aomdec reaches its peak performance at 8 threads, while dav1d doesn’t stop scaling until over a thousand threads. This results in hugely improved performance on systems with many cores, which are more and more common these days.

While the scaling is great, it also reveals a small weakness of dav1d. Scaling should stop at 64 threads, one per hardware thread of the 32-core CPU, but it doesn’t. This means that a single tile or frame thread can’t fully utilize a CPU core or hardware thread, so many more threads are needed, and choosing how many threads of each type to use becomes guesswork. It would be an improvement if dav1d handled this behind the scenes.
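Until that happens, one way to reduce the guesswork is to benchmark a sweep of frame-thread and tile-thread combinations. The snippet below is only a minimal sketch of how such a sweep could be enumerated. The helper name `thread_combinations` and its parameters are hypothetical, and it assumes that the total number of worker threads is the product of the two settings.

```python
def thread_combinations(max_threads, max_tile_threads=8):
    """Return (frame_threads, tile_threads) pairs worth benchmarking.

    Assumption: the total number of worker threads equals
    frame_threads * tile_threads, so only pairs whose product stays
    within max_threads are returned.
    """
    combos = []
    for tile_threads in range(1, max_tile_threads + 1):
        # Largest frame-thread count that still fits in the budget.
        for frame_threads in range(1, max_threads // tile_threads + 1):
            combos.append((frame_threads, tile_threads))
    return combos


if __name__ == "__main__":
    # Hypothetical budget of 4 worker threads and at most 2 tile threads.
    for frame_threads, tile_threads in thread_combinations(4, 2):
        print(f"frame threads: {frame_threads}, tile threads: {tile_threads}")
```

Each pair can then be timed on the same clip, and the fastest one kept for that machine.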

Bit depths and chroma subsampling

dav1d 0.1.0 focused its assembly code on 8-bit color depth and 4:2:0 chroma subsampling, since almost all content uses that format right now. However, all color depths (8, 10 and 12 bit) and chroma formats (4:0:0, 4:2:0, 4:2:2 and 4:4:4) are supported in dav1d, and here is a quick overview of its performance.

As you can see, 8-bit performance is very high in all chroma formats and is more than twice as fast as aomdec. Moving to 10- and 12-bit color depth, performance degrades quickly, but 4:0:0 (monochrome) and 4:2:0 are still faster than aomdec, while 4:2:2 is on par. Only 4:4:4 shows significantly lower performance than aomdec.

This benchmark was run on Zen, which slightly favors dav1d; on Intel Haswell/Skylake, aomdec’s relative performance is a little higher.

Overall performance

Looking at some of the most viewed content, which is all 8-bit 4:2:0, we see that dav1d is quite fast. 1080p 120fps, 1440p 60fps and 4K 30fps are no problem on a 5-year-old mid-range quad-core processor, and 1080p 60fps and 1440p 30fps will play back fine on any dual-core with AVX2 (except maybe the extreme low-power versions).

If we normalize the performance of dav1d to aomdec on both systems, we see speedups starting at 1.75x on Haswell and 1.9x on Zen, and reaching over 3x and 5x respectively in some clips.
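Normalizing here simply means dividing dav1d’s frame rate by aomdec’s frame rate for the same clip on the same machine. As a minimal sketch, with made-up fps values and a hypothetical helper name (the real measurements are in the spreadsheet linked at the end):

```python
def normalized_speedup(dav1d_fps, aomdec_fps):
    """Return how many times faster dav1d is than aomdec on one clip."""
    if aomdec_fps <= 0:
        raise ValueError("aomdec_fps must be positive")
    return dav1d_fps / aomdec_fps


if __name__ == "__main__":
    # Hypothetical (dav1d fps, aomdec fps) pairs, not measured results.
    clips = {
        "clip_a": (105.0, 60.0),
        "clip_b": (150.0, 50.0),
    }
    for name, (dav1d_fps, aomdec_fps) in clips.items():
        print(f"{name}: {normalized_speedup(dav1d_fps, aomdec_fps):.2f}x")
```

A value of 1.0x would mean both decoders are equally fast; anything above that is a win for dav1d.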

Mobile: Arm64 with NEON

64-bit mobile devices that support NEON instructions (practically all of them) have a good shot at decoding AV1 in real time, albeit with high power consumption.

dav1d single-core performance on Arm64 devices (Chimera 1080p 8-bit 4:2:0)

Adding NEON assembly code to dav1d speeds decoding up by 57% to 117%, making it faster than libaom in most cases.

dav1d multi-core performance on Arm64 devices (Chimera 1080p 8-bit 4:2:0)

Scaling to multiple cores, we see that 1080p at 30fps can be decoded on most high-end devices less than 2 years old. 720p at 30fps is possible for any device with ‘Big’ Arm cores.

On Apple’s A12X, 1440p at 60fps and 4K at 30fps are reachable. Yes, it’s insane.

Conclusion

A mere 3 months since its first release, dav1d has made huge strides. 8-bit performance on AVX2 systems has tripled and beats aomdec in every case, and Arm64 has gained enough NEON assembly to make it about 5% faster than aomdec. Older CPUs will have to wait for now, as will 10- and 12-bit content.

Writing assembly by hand is a huge undertaking. dav1d needs to support at least 3 instruction set extensions (SSSE3, AVX2, and NEON), 3 color depths (8, 10 and 12 bits) and 4 chroma formats (4:0:0, 4:2:0, 4:2:2, 4:4:4), while also supporting 32-bit builds, which have their own set of limitations. Most of the time the 10- and 12-bit code can be shared, and many functions work on all chroma formats, but this nevertheless means that a single function could need tens of different assembly implementations.
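To put a rough number on “tens”, here is a small counting sketch. It is only an upper bound under simplifying assumptions: every combination gets its own implementation, 32-bit builds are ignored, and the function name `variants` is made up for this illustration.

```python
from itertools import product

INSTRUCTION_SETS = ["SSSE3", "AVX2", "NEON"]
BIT_DEPTHS = [8, 10, 12]
CHROMA_FORMATS = ["4:0:0", "4:2:0", "4:2:2", "4:4:4"]


def variants(share_high_bit_depth=False):
    """List every (instruction set, bit depth, chroma format) combination.

    With share_high_bit_depth=True, 10- and 12-bit are counted as one
    shared code path.
    """
    if share_high_bit_depth:
        depths = ["8", "10/12"]
    else:
        depths = [str(depth) for depth in BIT_DEPTHS]
    return list(product(INSTRUCTION_SETS, depths, CHROMA_FORMATS))


if __name__ == "__main__":
    print(len(variants()))                           # 3 * 3 * 4 = 36
    print(len(variants(share_high_bit_depth=True)))  # 3 * 2 * 4 = 24
```

Functions that work on all chroma formats bring the count down further, which is why the real number differs per function.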

For the next release, dav1d will focus on improving SSSE3 performance, which is supported by 97% of x86 CPUs currently in use. Arm64 will also receive a lot of love, and some ASM for 10-bit and 4:4:4 content will be written.

dav1d is being integrated into Firefox 65 and can be enabled using the media.av1.enabled and media.av1.use-dav1d flags in about:config. It will enter Beta today and is expected to release on 2019-01-29. FFmpeg and VLC have also adopted dav1d in their master branches, so check those out if you’re interested. YouTube has already started streaming AV1 on desktop platforms, which can be enabled here, and youtube-dl supports downloading these AV1 videos. Also check out rav1e, an alternative AV1 encoder written in Rust.

dav1d itself can be found on VideoLAN’s GitLab.

Any feedback, suggestions or requests for upcoming blogs? Feel free to contact me!

References & contributions

All x86 benchmarks were run by me, except the Threadripper threading test, which was performed by Thomas Daede.
All Arm64 benchmarks were run by Janne Grunau and Martin Storsjö, many thanks.
All data can be found in this spreadsheet: https://docs.google.com/spreadsheets/d/1AO3lDZnpC8pNJffOknY1rIxXwLog_ISwHhO_sv3Xlhg
The .ivf files used can be found in this folder: https://drive.google.com/drive/folders/1mXk48bN9bcukkzeeaw4lyZ8TL2vIjuGM