AV1, Opportunity or Threat for POWER and ARM Servers?
While I haven’t seen an official announcement, Phoronix reported that the AV1 git repository was tagged 1.0, so the launch announcement is imminent. If you haven’t heard about it already, AOMedia Video 1 (AV1) is an open, royalty-free video coding format by the Alliance for Open Media.
Not a codec war
In what some are hyping as a codec war, AV1 is disrupting the traditional video landscape. It’s the first next-gen video format that is planned to be supported in all browsers on release, and never has a video format launched with such strong industry backing.
My personal opinion is that AV1 will disrupt cloud video services. Encoding AV1 is more than an order of magnitude more complex than encoding current-generation video formats, and considering the bandwidth savings promised by AV1, the encoding cost will become a much bigger part of the total cost of ownership of video content. This order-of-magnitude increase in complexity can clearly be seen in this figure from Facebook’s analysis of AV1.
Since the release of the Facebook report, many engineers have worked hard to reduce AV1 encode time. The reality, however, is that as AV1 becomes the de facto video format for the web, content providers will apply much more scrutiny to the efficiency of the video coding infrastructure they buy or rent.
As an active member of the Alliance, Intel has contributed significant resources to help speed up AV1. It follows that part of the aforementioned AV1 speedups come from Intel-specific software acceleration. As a result, non-Intel architectures are currently at a disadvantage when encoding AV1.
POWER and ARM-based servers currently at a disadvantage
Increased scrutiny of AV1 coding efficiency, combined with less efficient encodes on non-Intel hardware, is an important consideration for companies challenging Intel’s dominance in the server market.
For a concrete example, let’s consider two alternatives to Intel Xeon Servers: the ThunderX2, an ARM-based server by Cavium and the POWER9 powered Talos II, by Raptor Computing Systems. Both of these second generation offerings have considerably closed the gap with the Intel Xeon.
Even though both the ThunderX2 and the Talos II offer a higher thread count than their Xeon counterparts, without software acceleration, neither processor will outperform its Xeon counterpart at AV1 encoding.
At least that’s my opinion; we will have to wait for benchmarks to confirm it. While we wait for those, we can look at the latest 13-Way IBM POWER9 Talos II vs. Intel Xeon vs. AMD x264 Benchmarks from Phoronix and see that, as Phoronix states, POWER9 “could use improvement around multimedia/encoding”.
In the Phoronix article, the Talos II systems are top contenders in many of the benchmarks, but not when it comes to video encoding. While the article does not run benchmarks on AV1, it’s reasonable to assume that the added complexity of encoding AV1 will only make things worse.
Higher Thread Count, a Killer Feature for Video Encoding?
One thing to note about the higher thread count on the Talos II is that it is the result of a higher thread count per core. Could this higher thread count per core be a killer feature for video encoding?
Let’s find out. Thanks to Raptor Computing Systems and IntegriCloud, I have access to a dual 8-core Talos II. That’s a grand total of 64 threads. Time to run some libvpx encodes.
To test this assumption, I used libvpx v1.7.0 “Mandarin Duck” and set the threads parameter to 1, 2, or 4. I measured the time required to run batches of 1, 2, 4, 8, 16, 32 and 64 parallel instances of libvpx. Finally, I divided the measured time by the total number of frames encoded by that batch of parallel instances. For those following at home, you can reproduce this experiment using this script.
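The measurement loop described above can be sketched roughly as follows. This is not the script mentioned above, only a minimal illustration: the helper names (`vpxenc_command`, `run_batch`, `frames_per_minute`, `sweep`), the choice of VP9, and the file names are assumptions. The `vpxenc` options used here (`--codec`, `--threads`, `--limit`, `-o`) are standard ones, but the real experiment may have used different settings.

```python
import subprocess
import time


def vpxenc_command(source, output, threads, frames):
    """Build one vpxenc invocation (codec and file names are assumptions)."""
    return [
        "vpxenc", "--codec=vp9",
        f"--threads={threads}",   # encoder threads per instance
        f"--limit={frames}",      # stop after this many input frames
        "-o", output, source,
    ]


def run_batch(commands):
    """Start all commands at once, wait for all of them, return elapsed seconds."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for cmd in commands
    ]
    for proc in procs:
        if proc.wait() != 0:
            raise RuntimeError(f"encode failed: {proc.args}")
    return time.perf_counter() - start


def frames_per_minute(total_frames, elapsed_seconds):
    """Throughput of a whole batch, in frames per minute."""
    return total_frames * 60.0 / elapsed_seconds


def sweep(source, frames, threads, instance_counts=(1, 2, 4, 8, 16, 32, 64)):
    """Run one batch per instance count and return {instances: fpm}."""
    results = {}
    for instances in instance_counts:
        commands = [
            vpxenc_command(source, f"out_{instances}_{i}.webm", threads, frames)
            for i in range(instances)
        ]
        elapsed = run_batch(commands)
        results[instances] = frames_per_minute(instances * frames, elapsed)
    return results
```

Calling something like `sweep("input.y4m", frames=100, threads=1)` and repeating it with `threads=2` and `threads=4` would cover the configurations discussed here, assuming `vpxenc` is on the path and `input.y4m` is a hypothetical test clip.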
As expected, increasing the number of encoding threads increases the number of frames per minute. However, for a given total number of threads, running single-threaded encodes in parallel almost always yields the highest number of frames per minute. For example, if you have 4 threads, 4 parallel single-threaded encodes will result in 43 frames per minute (fpm), whereas 2 parallel encodes using 2 threads each will output 36.5 fpm and a single 4-threaded encode will produce 30.8 fpm.
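To put these three configurations on the same scale, the numbers quoted above can be expressed relative to the best one. The snippet below only restates the figures from the text; the variable names are illustrative.

```python
# Frames per minute with 4 threads in total, keyed by
# (parallel encodes, threads per encode). Values are the ones quoted in the text.
fpm = {(4, 1): 43.0, (2, 2): 36.5, (1, 4): 30.8}

best = max(fpm.values())
relative = {config: value / best for config, value in fpm.items()}

for (instances, threads), ratio in relative.items():
    print(f"{instances} encode(s) x {threads} thread(s): {ratio:.0%} of the best configuration")
```

In other words, splitting the same 4 threads into 2 encodes of 2 threads yields about 85% of the best throughput, and a single 4-threaded encode about 72%.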
As we can see in the following figure, the POWER9 scales almost linearly when the number of parallel encodes is below the number of cores (the Talos II has 16 cores). As we approach 16 parallel encodes, performance is slightly below the linear approximation. This could be caused by contention on core resources shared by threads.
Sadly, but not surprisingly, when the number of parallel encodes exceeds the number of cores, the POWER9 no longer scales linearly. The curve appears to be closer to a logarithmic function.
There’s a slight unexplained regression at 32 parallel encodes. Regressions appeared for every configuration I tested that used a total of 32 threads (1x32, 2x16, 4x8). This could be caused by additional resource contention or by I/O. Let me know if you have ideas or hypotheses about what could be causing this.
Opportunity
Now here’s the interesting thing: a higher thread count multiplies the impact of platform-specific software acceleration. This can be seen in the following figure. For example, if we compare Mandarin Duck to the latest master (shown in green), which contains our latest acceleration bounties, we see the gains increasing along with the number of parallel encodes.
While our 64-thread Talos II encodes about 200 frames per minute on libvpx 1.7 “Mandarin Duck”, our latest POWER-specific software acceleration bounties have increased the throughput of the Talos II to 246 frames per minute.
Based on our recent profiling of libvpx on POWER, we have updated our Bountysource bounties, and we are confident that if our bounties are funded, we can reach and exceed 300 frames per minute in time for the libvpx 1.8 release.
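For a sense of scale, the throughput figures above translate into the following relative gains. This is plain arithmetic on the numbers quoted in the text (the 200 fpm baseline is approximate), and the function and variable names are illustrative.

```python
def percent_gain(before, after):
    """Relative throughput increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


baseline_fpm = 200.0  # libvpx 1.7 "Mandarin Duck" (approximate)
current_fpm = 246.0   # latest master with the POWER-specific acceleration
target_fpm = 300.0    # goal for the libvpx 1.8 release

print(f"gain so far:        {percent_gain(baseline_fpm, current_fpm):.1f}%")
print(f"remaining to goal:  {percent_gain(current_fpm, target_fpm):.1f}%")
print(f"total at the goal:  {percent_gain(baseline_fpm, target_fpm):.1f}%")
```

That is a gain of roughly 23% so far, with about another 22% needed on top of the current throughput to reach a total of 50% over the 1.7 baseline.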
As for AV1, we have profiled libaom on POWER, and our Bountysource bounties are ready to be funded. So stay tuned for updates on AV1 speedups for POWER as we achieve them.
Conclusion
The multiplicative effect of platform-specific software acceleration on high-thread-count machines could be a competitive advantage when encoding AV1. We have shown that our work has increased the throughput of libvpx on the Talos II from about 200 to 246 frames per minute, and it could potentially have an even greater impact on AV1. Will this turn into an opportunity for POWER and ARM servers? Only time will tell.