Scaling Image Validation Across Multiple Platforms

by Natasha Lee

Introduction

Our team, the Test, Tools, and Reliability (TTR) team, is responsible for validating the quality of the Netflix SDK that we make available to our partners (Samsung, LG, Sony, Vizio, Roku, PlayStation, Xbox, etc.). For context, the core makeup of our SDK includes the scripting engine, the layout system, the graphics backend, the text shaping/rendering engine, the network stack, the animation stack, the effects framework, and more. It is therefore essential that our various partner devices produce a consistent experience, so our customers get the Netflix experience they’ve come to love.

To achieve this goal at scale, we validate the SDK across a wide range of devices with a heavy reliance on test automation. More specifically, we leverage automated image comparison techniques to verify rendering correctness. Rendering in this context refers to the state of the device frame buffer after widgets, images, and text have been drawn.

In this post, we will:

  • Highlight the challenges with image comparison at scale across different platforms
  • Describe the solution we implemented
  • Share our best practices established during and after our implementation

Automated Image Comparison

The simplest approach to image comparison is to leverage an image comparison tool, such as ImageMagick, to perform an exact pixel-to-pixel match.
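To make this concrete, the sketch below shows what an exact pixel-to-pixel check might look like; it is illustrative only (our harness relies on ImageMagick), and the file names are placeholders.

    # exact_match.py -- illustrative sketch of an exact pixel-to-pixel comparison
    import numpy as np
    from PIL import Image

    def exact_match(reference_path: str, actual_path: str) -> bool:
        """Return True only if both images have the same size and every pixel matches."""
        ref = np.asarray(Image.open(reference_path).convert("RGB"))
        act = np.asarray(Image.open(actual_path).convert("RGB"))
        return ref.shape == act.shape and np.array_equal(ref, act)

    # Roughly equivalent ImageMagick invocation, which reports the number of
    # differing pixels (0 means an exact match):
    #   compare -metric AE reference.png actual.png diff.png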

Let’s take a few examples and see this in action…

Case 1: Corrupted Glyph Cache

In this case, the cache corruption is an obvious rendering flaw, with the image comparison also highlighting the error.

Case 2: Rendering Dust

In this case, the rendering dust is much more subtle and is easily overlooked subjectively. However, pixel-to-pixel image comparison helps us flag even the most subtle rendering differences.

Case 3: Rendering Deltas Across Different Platforms

Different platforms use different rendering APIs (for example, Direct3D and OpenGL) and can therefore exhibit rounding differences that affect anti-aliasing, scaling, blending, or effects.

In this last case, the differences are unnoticeable to the human eye but were still flagged by the automated image comparison. While that sensitivity is generally a good thing, it works against us for extremely subtle differences that clearly do not compromise the user’s rendering experience.

But what would it actually take for us to tolerate these subtle differences? The obvious solution is to maintain separate reference images for each platform. With roughly 5–10 image validations per test, 1000+ rendering tests, and 10+ devices, that translates to at least 50,000–100,000 reference images!

We came to the realization that exact pixel-to-pixel matching simply does not scale when comparing images across platforms. In other words, we want our automated image comparison to intelligently flag only actionable failures rather than inundating us with meaningless failures caused by minute platform differences.

The ideal solution is to create reference images on a single “trusted” platform and reuse them across the other device platforms. The rationale is to incur the cost of confirming a base reference image (or “golden” reference) with the human eye only once, using our primary development environment (a Linux-based reference application), and then reuse those same references for comparison on the other device platforms. These golden references are only updated when there is an accepted product change that impacts rendering.

While we remain sensitive to any image differences on the golden platform itself, we can afford to be a bit more lenient when comparing against other platforms, for scalability reasons. But what does “leniency” really mean? The challenge was to identify a measurable formula, applicable across different device platforms with a single-source reference image, that would highlight only the meaningful failures worth spending development and test cycles to address.

Defining Image Comparison Thresholds

Our solution was to investigate a few image comparison metrics and home in on the ideal metric and threshold for scaling our automated image comparison tests. While experimenting, we focused on two image comparison metrics: Absolute Error (AE) and Root Mean Square Error (RMSE).

Absolute Error (AE)

Absolute Error (AE) gives you the total number of differing pixels. With AE, you can also allow a fuzz value, which is a distance within the colorspace: colors within this fuzz value are still deemed acceptable, which alleviates some of the minor image deltas.

However, since it’s a colorspace distance, it can span one or more dimensions. This should be kept in mind when selecting an appropriate fuzz value. To help visualize these differences, here are examples of 2%, 5%, and 10% fuzz.

In the second row, the color distance is split equally across the three RGB channels, whereas in the third row, the full color distance is only on a single channel (red). When comparing the last two rows to the original, the squares in the third row appear to have a larger difference, but both rows are actually equally different to the image comparison tool.

However, each pixel is evaluated individually, so the image comparison would fail if all pixels, or even just 1 pixel, exceeded the fuzz value.

In the example below, all four error images would be viewed equally as 10% fuzz.
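Separately from the visual examples, here is a minimal sketch of how a fuzz-tolerant absolute-error count might be computed; it is illustrative only (it mirrors the idea behind ImageMagick’s -fuzz option rather than reproducing its exact colorspace math), and the file names are placeholders.

    # fuzzy_ae.py -- illustrative sketch of Absolute Error with a fuzz tolerance
    import numpy as np
    from PIL import Image

    def absolute_error(reference_path: str, actual_path: str, fuzz: float = 0.0) -> int:
        """Count pixels whose color distance from the reference exceeds `fuzz` (0.0-1.0)."""
        ref = np.asarray(Image.open(reference_path).convert("RGB"), dtype=np.float64) / 255.0
        act = np.asarray(Image.open(actual_path).convert("RGB"), dtype=np.float64) / 255.0
        # Per-pixel distance across the three channels, so the same total distance
        # can be spread over several channels or concentrated in a single one.
        distance = np.sqrt(((ref - act) ** 2).mean(axis=-1))
        return int((distance > fuzz).sum())

    # Example: tolerate per-pixel color shifts of up to 5%:
    #   differing_pixels = absolute_error("reference.png", "actual.png", fuzz=0.05)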

Root Mean Square Error (RMSE)

This now brings us to the next metric, Root Mean Square Error (or RMSE). The mathematical formula is:

    RMSE = sqrt( (1/N) * Σ (reference_i − actual_i)² )

where reference_i and actual_i are the corresponding (normalized) pixel channel values and N is the total number of samples.

Rather than looking at each pixel individually, this metric takes the mean of the squared errors across all pixels, which provides an average error for the whole image. Squaring the error before taking the mean weighs larger errors much more heavily than smaller ones, and the final square root brings the result back to the original scale so it can be expressed as an error percentage.
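As an illustrative sketch (again, not our production tooling), RMSE over normalized channel values can be computed like this; the file names are placeholders.

    # rmse.py -- illustrative sketch of a Root Mean Square Error computation
    import numpy as np
    from PIL import Image

    def rmse(reference_path: str, actual_path: str) -> float:
        """Return RMSE as a fraction of the full channel range (multiply by 100 for %)."""
        ref = np.asarray(Image.open(reference_path).convert("RGB"), dtype=np.float64) / 255.0
        act = np.asarray(Image.open(actual_path).convert("RGB"), dtype=np.float64) / 255.0
        # Square the per-channel errors (weighting large errors more heavily),
        # average them, then take the square root.
        return float(np.sqrt(np.mean((ref - act) ** 2)))

    # Example: check against a 0.1% threshold
    #   passed = rmse("golden.png", "device_capture.png") <= 0.001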

Going back to the 10% fuzz examples, this is where the mean in RMSE helps to highlight overall image error severity.

When all pixels are equally different, the RMSE value matches the fuzz value, but in the other cases the RMSE value is actually lower than the corresponding fuzz value.

To help visualize how RMSE handles error magnitudes, here are some examples of 10% RMSE.

In both cases, only a subset of the image differs, and the black-colored error region is much smaller than the white-colored one, yet both share the same 10% RMSE value. The reason is that the expected color (lavender) is much closer to white than it is to black, so RMSE tolerates a larger area of small (white) errors than of large (black) errors before reaching the same score.

These are both great metrics that would reduce a large number of image comparison failures due to minute rendering differences, but one must always be cognizant of the potential side effects of increasing these threshold values.

Threshold Selection

RMSE was the clear winner, as it let us programmatically distinguish a subtle but important difference from a subtle but unimportant one. The next step was to decide which RMSE threshold would work best, knowing full well that large values could mask real issues. With that in mind, we opted to keep our default threshold as conservatively low as possible: RMSE with a 0.1% threshold when comparing against other device platforms, falling back to an exact pixel-to-pixel match (AE with a threshold of 0) when comparing against the same base platform used to capture the “golden” reference images.
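In rough pseudocode terms, the policy looks like the sketch below, reusing the absolute_error and rmse helpers from the earlier sketches; the platform name is hypothetical, and our harness actually drives ImageMagick rather than Python.

    # compare_policy.py -- sketch of the threshold policy described above
    GOLDEN_PLATFORM = "linux-reference"   # platform used to capture the golden references
    RMSE_THRESHOLD = 0.001                # 0.1%, for cross-platform comparisons

    def passes(platform: str, golden_path: str, capture_path: str) -> bool:
        if platform == GOLDEN_PLATFORM:
            # Same platform that produced the golden reference: require an exact match (AE == 0).
            return absolute_error(golden_path, capture_path, fuzz=0.0) == 0
        # Any other platform: tolerate small rendering deltas, bounded by RMSE.
        return rmse(golden_path, capture_path) <= RMSE_THRESHOLD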

During our initial investigations, we also tested AE + fuzz, but found RMSE to scale slightly better while still allowing us to keep a conservative threshold. For our image verification tests, a 2.5% fuzz value allowed 46% image reuse, but with 0.1% RMSE we immediately achieved 78% image reuse without compromising on quality. Further loosening the RMSE threshold was only marginally better, so we opted to keep the more conservative value.

Next we will share some of the best practices that allowed us to select these conservative thresholds.

Best Practices Guide

Reduce Variability

We first review the test to see if there are improvements that can be made to facilitate image reuse without compromising the original test intent.

One example is to reduce the variability, or extraneous features, exercised by your test. Instead of crafting a visually appealing test with extras like text effects, animation, and transparency, we keep the test scene as minimalistic as possible to reduce the likelihood of being hit by rounding, blending, or anti-aliasing deltas across platforms that would hinder image reuse.

For example, semi-transparent widgets may blend differently across platforms.

Keep Assets Simple

Another best practice is to keep your assets simple. If your test validates that we can properly stretch, crop, or tile an image, use a synthetic image with simple geometric shapes rather than a photographic image of the world, which is susceptible to anti-aliased edges.

Because of the different colors, you would still be able to easily validate the various image positions, without suffering the anti-aliasing penalty.

Prefer High Contrast Elements

For RMSE, we try to use high contrast elements to avoid diluting errors. For example, if we failed to render a line of text, it’d stand out more prominently as black text on a white background (14.96% RMSE) than black text on a green background (5.08% RMSE).

Consider Image Size

Since RMSE normalizes the error across the entire image, the dimensions matter and should be taken into account when selecting a threshold value. For example, if a 50x50px square rendered as green instead of white, it would be weighted more heavily on a 100x100px image (35.36% RMSE) than on a larger 200x200px image (17.68% RMSE).
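To see this dilution effect in code, the sketch below builds white reference images of two sizes, overlays the same 50x50 green square on each, and prints the resulting RMSE; it is illustrative only, and the exact percentages depend on how channels (and alpha) are weighted, so they will not match the figures above exactly.

    # size_effect.py -- sketch showing how image size dilutes RMSE for the same absolute error
    import numpy as np

    def rmse_arrays(reference: np.ndarray, actual: np.ndarray) -> float:
        return float(np.sqrt(np.mean((reference - actual) ** 2)))

    for size in (100, 200):
        reference = np.ones((size, size, 3))      # all-white image, channels in [0, 1]
        actual = reference.copy()
        actual[:50, :50] = [0.0, 1.0, 0.0]        # the same 50x50 green square in both cases
        print(f"{size}x{size}: RMSE = {rmse_arrays(reference, actual):.2%}")

    # The 50x50 error region covers a quarter of the 100x100 image but only a
    # sixteenth of the 200x200 image, so the larger image reports roughly half
    # the RMSE even though the absolute difference is identical.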

Create Platform Goldens as Needed

In some cases, even after applying the above tips, your scenario may still exceed the default threshold slightly. The test author still has the discretion to override the threshold at a test or test step level.

However, when a platform’s rendering pipeline differs enough to exceed the threshold by a considerable amount, increasing the threshold may actually mask real issues. In these cases we instead opt for platform-specific references, and that’s OK! These special cases tend to be in the minority and apply to features that are implemented differently across platforms, so they really aren’t expected to render identically anyway.

Even here, there might be some opportunities for image reuse. For example, a PS3 reference might be reused for PS4, or an Xbox 360 reference for Xbox One.

Conclusion

All of these best practices helped us identify an ideal RMSE value for scaling our image validation tests across platforms without compromising quality or increasing maintenance overhead. In our case, we chose a 0.1% RMSE threshold because the majority of our images are 720p. If we were rendering at an even higher resolution, or using lower-contrast elements, we would need to lower that RMSE threshold even further to uphold the same level of quality confidence.

With the 0.1% RMSE threshold, we were able to reuse nearly all of our “golden” references across our platforms. This reduced our reference image count by 5000+ per platform, and we now have the foundation to easily accommodate future tests or new devices. We hope our learnings will help other teams address their own image validation scaling challenges.