Scalable codec testing with Are We Compressed Yet?

Raphaël Zumer
Vimeo Engineering Blog
Nov 17, 2020 · 6 min read

As part of our research and development activities on the Video Transcoding team, we’re often confronted with the need to test video codecs in a standardized, reproducible, and comparable way. For example, we may want to evaluate whether a new version of a video encoder achieves a significantly higher level of quality than a previous one and what its performance tradeoffs are, compare runtime performance between different configurations, or understand how different codecs stack up against one another for different types of content. One tool that we use to answer these questions is Are We Compressed Yet? (or, as we like to call it, AWCY).

What is AWCY?

AWCY is an open-source tool created by the Xiph.Org Foundation that combines a distributed encoding and decoding framework with performance and quality metrics gathering, comparative views and reports, and data set management. Supported codecs accept custom configurations and can encode any preconfigured data set through jobs distributed across worker cores, which may be local or spread over an arbitrary set of remote machines. Each worker core runs the encode-decode-metrics pipeline for a single video and reports its results to the host, which compiles them into a report once all tasks have completed. Completed runs can be compared with one another as long as they used the same data set, which allows cross-codec comparisons.
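To make the distribution model concrete, here’s a minimal Python sketch of the same fan-out pattern. This is not AWCY’s actual implementation, and process_video is a placeholder for the real encode-decode-metrics work:

```python
# Minimal sketch of the fan-out pattern described above, not AWCY's
# actual implementation. Each task stands in for the full
# encode-decode-metrics pipeline run on a single video.
from concurrent.futures import ProcessPoolExecutor

def process_video(video_path):
    # A real worker would invoke the codec under test here, decode
    # the result, and measure quality; these placeholder values
    # stand in for that work.
    encoded_size = 0                          # bytes in the encoded bitstream
    metrics = {"size": encoded_size, "psnr": 0.0}
    return video_path, metrics

def run_data_set(videos, worker_cores):
    # One job per video, distributed over the available worker
    # cores; the host compiles a report once all tasks complete.
    with ProcessPoolExecutor(max_workers=worker_cores) as pool:
        return dict(pool.map(process_video, videos))

if __name__ == "__main__":
    print(run_data_set(["a.y4m", "b.y4m"], worker_cores=2))
```

The important property is that each video is an independent task, so throughput scales with the number of worker cores regardless of whether they live on one machine or many.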

What makes AWCY particularly useful and effective for objective quality measurement and comparison is its implementation of the Bjøntegaard delta rate, or BD-rate. Quality measurements cannot be compared directly because bitrates vary between encodes, and in offline video coding those rate differences translate to differences in compressed file size. As encoder configurations change, so do the resulting rates, and an improvement in quality isn’t meaningful if it also increases the rate, which is why rate must be controlled when making such comparisons. BD-rate measurements are made by running multiple encodes with the same configuration at varying quality levels (typically constant quantizer values). The resulting quality and rate points can be plotted on a graph and linearly interpolated to estimate the expected rate-quality tradeoff of a codec in the tested configuration. BD-rate can then be computed for two sets of points based on how well each optimizes quality while minimizing rate, effectively integrating the area between the two curves.
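To make the computation concrete, here’s a minimal Python sketch of a BD-rate calculation under the assumptions above (PSNR as the quality metric, piecewise-linear interpolation of log-rate over quality), with hypothetical rate and PSNR numbers; AWCY’s actual implementation is part of its open-source code base:

```python
# Minimal sketch of a BD-rate computation, assuming PSNR as the
# quality metric and piecewise-linear interpolation; AWCY's actual
# implementation lives in its open-source repository.
import numpy as np

def bd_rate(rates_ref, quality_ref, rates_test, quality_test):
    """Average rate change (%) of the test run relative to the
    reference at equal quality. Inputs are sorted by increasing
    quality; a negative result means the test run needs fewer bits."""
    # Restrict the comparison to the quality range covered by both runs.
    lo = max(quality_ref[0], quality_test[0])
    hi = min(quality_ref[-1], quality_test[-1])
    q = np.linspace(lo, hi, 100)

    # Interpolate log-rate as a function of quality for each run.
    log_ref = np.interp(q, quality_ref, np.log(rates_ref))
    log_test = np.interp(q, quality_test, np.log(rates_test))

    # Integrate the gap between the two curves, average it, and
    # convert the mean log-rate difference back to a percentage.
    avg_diff = np.trapz(log_test - log_ref, q) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0

# Hypothetical numbers: four constant-quantizer encodes per run.
rates_ref = np.array([900.0, 1700.0, 3200.0, 6100.0])   # kbps
psnr_ref = np.array([33.8, 36.2, 38.7, 41.1])           # dB
rates_test = np.array([650.0, 1200.0, 2300.0, 4400.0])
psnr_test = np.array([34.1, 36.6, 39.2, 41.6])
print(f"BD-rate: {bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):.1f}%")
```

The result reads the same way as in AWCY’s reports: a negative percentage means the test run needs that much less rate, on average, to match the reference quality.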

What follows is a comparison view of two AWCY runs. The configuration associated with the upper line plot offers significantly better compression at all quality levels given the resulting rates.

This BD-rate summary shows the overall improvement in coding efficiency going from the s10 configuration to the s3 configuration. BD-rate is expressed as a percentage, and a larger negative value indicates a greater gain in efficiency. It can be interpreted as the average change in the number of bits the second set of results needs to reach the same level of quality as the first. In this example, the s3 configuration needs roughly 30 percent fewer bits to match the baseline s10 level of quality.

The video below shows the Blue Sky test clip encoded as part of the above run at a constant quantizer (roughly an inverse quality level) of 172 in rav1e, the AV1 encoder that we use at Vimeo to bring high-quality video to Vimeo Staff Picks. On top is the video encoded at speed level 3, and below is the same video encoded at speed level 10. In the latter, we can see significantly more banding artifacts in the sky, as well as plenty of noise around the edges of the tree leaves.
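For reference, encodes like these can be produced from rav1e’s command line; the following Python sketch wraps it with subprocess, using a hypothetical input path (check rav1e --help for the current flags):

```python
# Hypothetical reproduction of the two encodes above; the input path
# is a placeholder, and flags should be checked against `rav1e --help`.
import subprocess

for speed in (3, 10):
    subprocess.run([
        "rav1e", "blue_sky.y4m",               # placeholder input clip
        "--speed", str(speed),                 # speed level 3 or 10
        "--quantizer", "172",                  # constant quantizer (0-255 scale)
        "--output", f"blue_sky_s{speed}.ivf",
    ], check=True)
```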

In this specific case, the speed 10 encode also has a significantly higher rate, so it’s clear that the s3 configuration fares much better than the s10. If the second video were far smaller in size, we might not be able to tell which of the two configurations is actually better by inspecting them, but because AWCY computes a measure of the tradeoff between rate and quality rather than directly comparing encoded files, we can identify improvements or regressions at a glance. For this video, the overall average improvement in coding efficiency is between 35 and 50 percent, depending on the quality metric.

Notably, AWCY was successfully used by the Alliance for Open Media to support the development of the AV1 standard, and it has been used regularly since then by developers of open-source AV1 encoders, including libaom, SVT-AV1, and rav1e. As an open-source tool, AWCY can be expanded to support any codec, configuration, or data set. An active public instance is available at https://beta.arewecompressedyet.com/, although job queuing is restricted due to the high cost of running distributed encodes. We have developed scripts to ease initial deployment of both the web server and worker nodes.

Architecture overview

The AWCY system has a modular architecture in which the web server, the job scheduler, and the worker nodes that perform the encoding can reside on the same machine or on separate ones. These components are usually grouped together when talking about AWCY as a whole, but, in practice, the AWCY component only covers the front end. The central piece of the system is rd_server, which identifies worker nodes, schedules jobs, and reports results back to AWCY for storage and visualization. rd_server is also responsible for building codec binaries before transferring them to worker nodes for processing.

AWCY can also send encoded files to a bitstream analyzer for manual inspection of coding decisions in VP9 or AV1 streams. This web application can navigate through two encoded videos in a frame-by-frame comparative view and display partitions and block-wise encoding parameters as well as frame-wise statistics, as shown in the following image.

Below is a deployment diagram for an AWCY system.

A new video test data set: vimeo-corpus-10s

As part of our contributions to AWCY, we’ve also released a new data set named vimeo-corpus-10s that includes videos highly varied in resolution, frame rate, content type, quality, and visual characteristics. The videos were chosen to be challenging to encode, and they include scene cuts. This set can support research on rate control algorithms, scene detection for keyframe placement, and other encoder components that cannot be evaluated easily with classic data sets. vimeo-corpus-10s is included by default in AWCY instances, and its 15 videos are licensed under Creative Commons. The following table provides a summary of the visual characteristics of the videos.

In summary

AWCY makes it easy to run exhaustive sets of video encoder tests consistently and at scale, as well as aggregate, compare, and share their results with coworkers and collaborators. We’ve used it at Vimeo to find encoding bugs, review our encoding parameters, and evaluate new versions of rav1e. We hope that other engineers and companies will consider adopting AWCY for their own use and contribute to its future development. Finally, we hope that researchers and engineers in the field will find the vimeo-corpus-10s test set useful to evaluate video coding and processing techniques in a practical setting.
