Performance comparison of video coding standards: an adaptive streaming perspective

by Joel Sole, Liwei Guo, Andrey Norkin, Mariana Afonso, Kyle Swanson, Anne Aaron

Chef’s Table

Factors in codec comparisons

Many articles have been published comparing the performance of video codecs. The reader of these articles might often be confused by their seemingly contradicting conclusions. One article might claim that codec A is 15% better than codec B, while the next one might assert that codec B is 10% better than codec A.

  1. Encoder settings
  2. Methodology
  3. Content
  4. Metrics

Encoder implementation

Video coding standards are instantiated in software or hardware with goals as varied as research, broadcasting, or streaming. A ‘reference encoder’ is a software implementation used during the video standardization process and for research, and as a reference by implementers. Generally, this is the first implementation of a standard and not used for production. Afterward, production encoders developed by the open-source community or commercial entities come along. These are practical implementations deployed by most companies for their encoding needs and are subject to stricter speed and resource constraints. Therefore, the performance of reference and production encoders might be substantially different. Besides, the standard profile and specific version influence the observed performance, especially for a new standard with still immature implementations. Netflix deploys production encoders tuned to achieve the highest subjective quality for streaming.

Encoding settings

Encoding parameters such as the number of coding passes, parallelization tools, rate-control, visual tuning, and others introduce a high degree of variability in the results. The selection of these encoding settings is mainly application-driven.

Methodology

Testing methodology in codec standardization establishes well-defined “common test conditions” to assess new coding tools and to allow for reproducibility of the experiments. Common test conditions consist of a relatively small set of test sequences (single shots of 1 to 10 seconds) that are encoded only at the input resolution with a fixed set of quality parameters. Quality (PSNR) and bitrate are gathered for each of these quality points and used to compute the average bitrate savings, the so-called BD-rate, as illustrated below.

  1. It can use any metric to guide its optimization process.
  2. It eliminates the need for high-level rate control across shots in the encoder. Lower-level rate control, like adaptive quantization within a frame, is still useful, because DO does not operate below the shot level.

Content

For a fair comparison, the testing content should be balanced, covering a variety of distinct types of video (natural vs. animation, slow vs. high motion, etc.) or reflect the kind of content for the application at hand.

Metrics

Traditionally, PSNR has been the metric of choice given its simplicity, and it reasonably matches subjective opinion scores. Other metrics, such as VIF or SSIM, better correlate with the subjective scores. Metrics have commonly been computed at the encoding resolution.

Approximate correspondence between VMAF values and subjective opinion
  1. Temporal average: Metrics are calculated on a per-frame basis. Normally, the arithmetic mean has been the method of choice to obtain the temporal average across the entire sequence. We employ the harmonic mean, which gives more weight to outliers than the arithmetic mean. The rationale for using the harmonic mean is that if there are few frames that look really bad in a shot of a show you are watching, then your experience is not that great, no matter how good the quality of the rest of the shot is. The acronym for the harmonic VMAF is HVMAF.

Codec comparison results

Putting in practice the factors mentioned above, we show the results drawn from the two distinct codec comparison approaches, the traditional and the adaptive streaming one.

Results with the traditional approach

The traditional approach uses fixed QP (quality) encoding for a set of short sequences.

  • Content: 14 standard sequences from the MPEG Common Test Conditions set (mostly from JVET) and 14 from the Alliance for Open Media (AOM) set. All sequences are 1080p. These are short clips: about 10 seconds for the MPEG set and 1 second for the AOM set. Mostly, they are single shot sequences.
  • Metrics: BD-rate savings are computed using the Classic PSNR for the luma component.
BD-rate (in %) using PSNR of the 6 PSNR-tuned video encoders

Results from an adaptive streaming perspective

This section describes a more comprehensive experiment. It builds on top of the traditional approach, modifying some aspects in each of the factors:

  • Content: A set of 8 Netflix full titles (like Orange is the New Black, House of Cards or Bojack) is added to the other two test sets. Netflix titles are 1080p, 30fps, 8 bits/component. They contain a wide variety of content in about 8 hours of video material.
  • Metrics: HVMAF is employed to assess these perceptually-tuned encodings. The metric is computed over the relevant quality range of the convex hull. HVMAF is computed after scaling the encodes to the display resolution (assumed to be 1080p), which also matches the resolution of the source content.
BD-rate (in %) using HVMAF of 6 video encoders tuned for perceptual quality. BD-rates percentages use x264 as the reference.

Takeaways

Encoder, encoding settings, methodology, testing content, and metrics should be thoroughly described in any codec comparison since they greatly influence results. As illustrated above, a different selection of the testing conditions leads to different conclusions on the relative performance of the encoders. For example, EVE-VP9 is about 1% worse than x265 in terms of PSNR for the traditional approach, but about 12% better for the HVMAF high range case.

Acknowledgments

For more detailed technical information and results, you can check out the paper ‘Video codec comparison using the dynamic optimizer framework’ by Ioannis Katsavounidis and Liwei Guo.

Netflix TechBlog

Learn about Netflix’s world class engineering efforts, company culture, product developments and more.

Netflix Technology Blog

Written by

Learn more about how Netflix designs, builds, and operates our systems and engineering organizations

Netflix TechBlog

Learn about Netflix’s world class engineering efforts, company culture, product developments and more.