Three months after dav1d 0.1.0 “Gazelle” was released, version 0.2.0 has just been tagged. Under the code name “Antelope”, huge improvements were made to the AV1 decoder for older PCs and mobile devices on 8-bit content. By hand-writing SSSE3 and NEON assembly code, the developers sped up most of the C functions by factors ranging from 2 to 20, resulting in much higher frame rates.
This blog post provides an overview of dav1d’s performance, compared to both the 0.1.0 release and the AV1 reference decoder, aomdec.
PC: SSSE3 for older x86 CPUs
Where dav1d 0.1.0 was all about performance with AVX2, an extended instruction set supported by newer processors (Intel Haswell / AMD Zen and newer), 0.2.0 focuses on SSSE3 performance for older and lower-end processors. According to the Steam Hardware Survey (Feb. 2019), 97,23% of its user base has a CPU that supports SSSE3, while only about two-thirds have one that supports AVX2.
Since different videos use different functions of the AV1 codec in different proportions, some saw larger increases than others. Below are the results for three 1080p videos, comparing dav1d at the 0.1.0 release to the current head.
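To make the relationship between these instruction sets concrete, here is a minimal sketch of how a decoder could choose a code path at runtime. It is an illustration only: the names `pick_code_path` and `read_linux_cpu_flags` are hypothetical and are not dav1d's API, and the flag parsing assumes a Linux `/proc/cpuinfo`.

```python
# Illustrative sketch only: these helpers are hypothetical, not part of dav1d.
# Idea: use the most capable assembly path the CPU supports, else fall back to C.

# Ordered from most to least capable; the first match wins.
PREFERRED_PATHS = [
    ("avx2", "AVX2 assembly"),
    ("ssse3", "SSSE3 assembly"),
]


def pick_code_path(cpu_flags):
    """Return the name of the fastest code path available for these CPU flags."""
    for flag, path in PREFERRED_PATHS:
        if flag in cpu_flags:
            return path
    return "plain C"


def read_linux_cpu_flags(path="/proc/cpuinfo"):
    """Read the feature flags of the first CPU listed (Linux only, assumption)."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()
```

Calling `pick_code_path(read_linux_cpu_flags())` on a pre-Haswell Intel CPU would select the SSSE3 path under these assumptions, which is exactly the group of processors this release targets.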
Both single-threaded (ST) and multi-threaded (MT) decoding show huge improvements, averaging around 2,25x for ST and 2,5x for MT. Looking at raw frame rates, this means that 1080p at 30fps is playable without a hitch on almost any device with SSSE3, while high-frequency quad-core processors should also be able to handle up to 1440p at 60fps and 2160p at 30fps.
The following results were obtained on an Intel Core i5-4590 (Haswell, 4c/4t, 3,5 GHz) using only SSSE3 instructions:
If we normalize the values, we can examine the gains more closely; they average around 2,23x:
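As a rough worked example of why the higher resolutions call for a faster processor, the sketch below compares the raw pixel throughput of the three targets. It assumes the usual 16:9 frame sizes for each resolution name, and it treats decode cost as proportional to pixels per second, which is only an approximation.

```python
# Rough arithmetic sketch: pixels per second for each playback target.
# Assumption: standard 16:9 frame sizes; decode cost is only roughly
# proportional to pixel rate in practice.

RESOLUTIONS = {
    "1080p": (1920, 1080),
    "1440p": (2560, 1440),
    "2160p": (3840, 2160),
}


def pixel_rate(resolution, fps):
    """Pixels that must be decoded per second at this resolution and frame rate."""
    width, height = RESOLUTIONS[resolution]
    return width * height * fps


baseline = pixel_rate("1080p", 30)
for res, fps in [("1080p", 30), ("1440p", 60), ("2160p", 30)]:
    rate = pixel_rate(res, fps)
    print(f"{res} @ {fps} fps: {rate / 1e6:.1f} Mpx/s "
          f"({rate / baseline:.2f}x the 1080p30 load)")
# 1080p @ 30 fps: 62.2 Mpx/s (1.00x the 1080p30 load)
# 1440p @ 60 fps: 221.2 Mpx/s (3.56x the 1080p30 load)
# 2160p @ 30 fps: 248.8 Mpx/s (4.00x the 1080p30 load)
```

By this estimate, the two higher targets need roughly three and a half to four times the throughput of 1080p at 30fps.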
On average, dav1d 0.2.0 is 2,23x faster on 8-bit content than 0.1.0.
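Normalizing here simply means dividing each clip's new frame rate by its old one, so that clips with very different absolute speeds can be compared. The sketch below shows the calculation with made-up placeholder numbers, not the measurements from this post (those are in the spreadsheet linked at the end). It also assumes an arithmetic mean, since the averaging method is not stated here.

```python
from statistics import mean


def normalize(fps_old, fps_new):
    """Per-clip speedup factor: new frame rate divided by old frame rate."""
    return {clip: fps_new[clip] / fps_old[clip] for clip in fps_old}


# Placeholder values for illustration only, NOT benchmark results.
fps_010 = {"clip_a": 20.0, "clip_b": 30.0, "clip_c": 40.0}
fps_020 = {"clip_a": 50.0, "clip_b": 60.0, "clip_c": 100.0}

speedups = normalize(fps_010, fps_020)
average_speedup = mean(speedups.values())

for clip, factor in speedups.items():
    print(f"{clip}: {factor:.2f}x")
print(f"average: {average_speedup:.2f}x")
```

A geometric mean would be a reasonable alternative for averaging ratios; with speedups this close together the two give similar results.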
x86 performance compared to aomdec
The target to beat for dav1d is aomdec, the AV1 reference decoder. At the release of dav1d 0.1.0 the performance on AVX2 CPUs was already spectacular, but older and lower-end processors that didn’t support it were at the time better served by aomdec. With the release of dav1d 0.2.0, that changes.
All the numbers below are for 8-bit color depth with 4:2:0 chroma subsampling. For the multi-threaded runs, aomdec used 4 threads, while dav1d used 8 framethreads and 4 tilethreads. Both configurations give optimal performance on a quad-core CPU.
Comparing SSSE3 performance, dav1d and aomdec perform about the same with a single thread. Multi-threaded, dav1d is 2,5 to 3 times faster.
Moving on to CPUs that can handle SSE4.1 instructions (95,82% of users according to Steam), aomdec claims a small lead in single-threaded performance. dav1d doesn’t have separate assembly code for SSE4.1, so its performance is (for now) identical to that on SSSE3 CPUs. Multi-threaded, dav1d is still about twice as fast.
AVX2 performance increased by a very slight 1% to 2% for dav1d, which was already very fast. Compared to aomdec, single-threaded dav1d enjoys a comfortable 40% lead, and with multiple threads it is anywhere from 2,5x to 5x faster.
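For readers who want to try a comparable run, the following is a hedged sketch of how the two decoders might be invoked and timed with these thread counts. The exact command lines used for this post are in the linked spreadsheet; the ones below are an assumption based on the command-line options of dav1d and aomdec releases from this period (`--framethreads`, `--tilethreads`, and `--muxer null` for dav1d, `--threads` for aomdec). The helper names are hypothetical.

```python
import subprocess
import time


def dav1d_cmd(clip, frame_threads=8, tile_threads=4):
    """Assumed dav1d command line: decode and discard the output."""
    return ["dav1d", "-i", clip, "--muxer", "null", "-o", "/dev/null",
            "--framethreads", str(frame_threads),
            "--tilethreads", str(tile_threads)]


def aomdec_cmd(clip, threads=4):
    """Assumed aomdec command line: decode with N threads, discard the output."""
    return ["aomdec", f"--threads={threads}", "-o", "/dev/null", clip]


def time_decode(cmd):
    """Run one decode and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start
```

Dividing the clip's frame count by the value returned from `time_decode(dav1d_cmd("clip.ivf"))` gives a frame rate. Averaging several runs helps reduce noise.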
A lot of NEON assembly has been written for both Arm and Arm64, for the 0.1.0 as well as the 0.2.0 release. When 0.1.0 got tagged, the speedup from NEON assembly over C was about 80% on average; now it’s more than double.
Starting with Arm64 (AArch64) performance, we see an average 38% improvement for single-thread and a 53% improvement for multi-thread. On a Snapdragon 835, the improvement enables 1080p at 60fps for most videos.
32-bit Arm (Armv7) also improved a lot, since most assembly code can be ported fairly easily between the two. Single-thread saw a spectacular average speedup of 62%, while multi-thread increased by 46%. 1080p at 30fps should play smoothly on most CPUs with at least two ‘big’ cores.
With the release of dav1d 0.2.0 the AV1 decoder clearly distinguishes itself from the reference decoder. AVX2 was already very fast, but now SSSE3 is also at least up to par with aomdec.
There are still some functions left to write SSSE3 assembly for, as is the case for NEON. So in future releases we will see dav1d get even faster on those platforms, but in the meantime it’s more than fast enough to provide a proper 1080p experience on most devices.
This is all on 8-bit content, however, which still makes up the vast majority of video on most platforms. For 10-bit, and later 12-bit, there isn’t any assembly code in dav1d yet; that will be something to look forward to.
VLC will very soon ship a new stable release with dav1d 0.2.0, and Firefox is also working on integrating support. FFmpeg already uses dav1d for decoding in the development branch, and HandBrake will support it soon.
Thanks to Martin Storsjö for the Arm performance numbers, and of course to all dav1d contributors for building this awesome decoder.
- Clips used: https://drive.google.com/drive/folders/1kZx75w6kKUc6B4fGEXN5jUCVn2iCO3v1
- Graphs: https://drive.google.com/drive/folders/1Fl1IkqA0SrYxS-xGn6kw3M1oKCQ1R2J3
- Spreadsheet with all data, including parameters: https://docs.google.com/spreadsheets/d/1rkPMHgy7cXEsT9KeYF-NVZQNiEGEkvrLnLo2FaLWiwA/edit#gid=22180039