Computer Vision For Complex Test Automation Challenges — Part I

Yusuf@writing
Published in Kongsberg Digital
10 min read · May 4, 2021

This is a multi-part series that discusses one of the most difficult test automation challenges: performing image comparison operations against large image datasets with the goal of achieving close to 100% accuracy.

It will uncover the means of effectively handling various types of minute differences between images, including false positives and negatives, so that the comparison operations are evaluated correctly and automation ROI is achieved. As the solution is built with Computer Vision (CV), a fascinating and powerful branch of Artificial Intelligence, the series will also cover which CV algorithms and image processing techniques were leveraged to navigate the challenges, and how.

What’s In Store In This Part?

1. The very nature of the challenges and the problem statements to be defined around them

2. Some contextual info about the application where the challenges originate

3. Some common reasons why differences exist between two different versions of the same image

4. Existing solutions for finding image similarities and their limitations

5. Various experiments conducted with very limited success

6. A glimpse of the CV algorithms that help handle the challenges

7. Lastly, a brief overview of the various pieces of the Python-based CV solution stack being built to counter the challenges

Let’s jump right into the challenges we are talking about here.

The Big Image Comparison Challenges

It’s quite common that when we compare two visually identical images captured at different points in time, they show pixel differences that are mostly impossible to locate with the naked eye. That said, such unintended pixel-level differences don’t always exist, and even when they do, they are not guaranteed to be the same set of differences every time we capture and compare. There are four types of scenarios we grapple with in the image comparison tasks of this project:

1. Visibly minute differences exist between images, but ignore them

2. Visibly minute differences exist between images, but report them as variances

3. No visible differences exist (not even minute ones), but mismatches are reported due to pixel-level variances; these should not be reported as issues

4. Clearly visible differences exist, report them as issues

Scenarios 1 and 2 are actually two conflicting requirements, but that’s the reality, and we must handle them differently on a case-by-case basis.

Scenario 3 should not exist in the first place, but it does, and it frequently results in false alarms, due to various reasons largely associated with the intricacies of image capture tools and mechanisms. Scenario 4 is straightforward, and there are no issues in dealing with it.
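To make Scenario 3 concrete, here is a minimal sketch (synthetic data, not from the actual application) of how an imperceptible pixel-level variance breaks a strict match:

```python
import numpy as np

# Two hypothetical 4x4 grayscale captures of the same screen region.
baseline = np.full((4, 4), 200, dtype=np.uint8)
runtime = baseline.copy()
runtime[2, 3] = 199  # a one-level pixel variance, invisible to the eye

# A strict (100%) match fails even though the images look identical.
strict_match = np.array_equal(baseline, runtime)

# Counting the differing pixels shows how tiny the variance really is.
diff_pixels = int(np.count_nonzero(baseline != runtime))

print(strict_match, diff_pixels)  # False 1
```

One pixel out of sixteen, off by a single intensity level, is enough to fail a strict comparison and raise a false alarm.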

Scenario 1: Images with a minute difference between A and B (the space between the bar and the black line); the variance should be ignored

Image (A)
Image (B)

Scenario 2: A visibly very minute difference (a small blue line in Image B) that should be reported as an issue

Image (A)
Image (B)

Scenario 3: No visible differences exist (not even minute ones), but mismatches are reported due to pixel-level variances; these should not be reported as issues

Image (A)
Image (B)

Scenario 4: Images with visible differences between A and B that should be reported as a variance. Picture enlarged to show the details.

Image (A)
Image (B)

Before we discuss any further about the challenges, let’s take a sneak peek into the application where the challenges need to be dealt with, the test automation requirements of the project, and the tech stack employed.

About The Application

The application with the large-scale image comparison requirement is a modern, visually rich web application serving as the front end for upstream Oil & Gas customers. The application consists of various types of charts, plots, dashboards, and widgets. It leverages the popular INT platform, libraries, and tools for visualization and analysis of complex upstream oil & gas data.

The application supports multiple desktop browsers as well as mobile browsers across 10+ Android and iOS devices.

Test Automation Requirements

As a part of fulfilling the test automation needs of the application, we have to automate many hundreds of test cases capable of running across all the supported browsers. A subset of the large test suite should run across a list of 10+ mobile devices as of today. In addition, there are scenarios dealing with real-time plotting of data on a per-second basis.

More than 90% of the test cases involve image comparisons. The majority of them need a strict match, and as a result, even a small flaw in our solution stack deals a major blow to the automation ROI.

In a nutshell, the automation system for this project has to deal with comparisons of multiple thousands of images across browsers and mobile devices with a decent performance.

Automation Tech Stack

Our automation development tech stack for this project includes: (a) Protractor, a browser automation tool from the Selenium WebDriver family running on Node.js, (b) plenty of npm packages to support automation workflows, (c) TypeScript and JavaScript for automation coding, and (d) PowerShell.

Ok, let’s get back to the discussion on the image comparison challenges, and the path to finding the potential solutions being built to tackle the challenges.

Why Do Unintended Differences Exist Between Images ?

The differences between images are commonly seen and obvious even in controlled environments. In general, it could be due to OS color scheme and user theme, image size, screen resolution, color depth, DPI settings, font smoothing, screen coordinate offset mechanisms, image format, alpha channel etc.

The above factors can also be described in slightly different terms: the lighting/illumination conditions of the image capture system, resolution differences between the capture and comparison systems, the color space used (CIE, RGB, YUV, HSL/HSV, or CMYK), subtle behavioral differences in the underlying visual rendering libraries and platforms (the INT library in my project) across different points in time, contrast and sharpness, and the different image creation mechanisms used by various screen capture libraries, among others.

All that being said, in my case, the screen-flickering behavior of Protractor, the tool we use for browser automation, at the moment of screen capture has also been causing obvious issues.

It’s a weird behavior, uncommon compared with other client libraries in the Selenium family.

Existing Image Comparison Solutions

There are plenty of off-the-shelf tools and scattered solutions out there addressing a good number of use cases, with numerous suggestions and directions to explore and try out. The existing solutions have capabilities that are black-boxed. They are not comprehensive enough to address the needs of complex projects such as the one under discussion, nor do they convincingly alleviate the risks associated with accuracy and reliability across all the scenarios.

Most of these utilities perform pixel-to-pixel comparison or apply some image processing and computer vision techniques, but I couldn’t find any of them adequate against all the scenarios defined in the project in question.

In addition, the other most common feature of such libraries and tools is adjusting the image matching threshold or tolerance level. For example, instead of letting us perform the comparison at a 100% threshold (a strict match), they let us lower the threshold so that false positives are avoided.

However, the problem with this approach is that there’s no guarantee that minute bugs won’t be missed when we lower the strict-match bar, because these tools rely on pixel-to-pixel comparison and no other proven, robust mechanisms.
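A hypothetical illustration of that trade-off (the image data and threshold values are made up for the sketch): a genuine but tiny defect survives a strict match yet slips through a relaxed threshold.

```python
import numpy as np

def pixel_match_ratio(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of pixels that are exactly equal between two same-size images."""
    return float(np.count_nonzero(a == b)) / a.size

baseline = np.zeros((100, 100), dtype=np.uint8)

# A runtime image with a genuine (but tiny) defect: a 3-pixel bright line.
defective = baseline.copy()
defective[50, 40:43] = 255

ratio = pixel_match_ratio(baseline, defective)  # 0.9997

# At a strict 100% threshold the defect is caught...
print(ratio >= 1.0)   # False -> reported as a mismatch
# ...but with a relaxed 99% threshold it slips through unnoticed.
print(ratio >= 0.99)  # True  -> silently passes
```

This is exactly Scenario 2 colliding with the tolerance knob: the same setting that suppresses false positives also hides real defects.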

These tools are also shipped with key features that are related to image processing techniques. Examples:

  • converting the colored images to grey scale before comparison,
  • calculating the aspect ratio of the images and resizing,
  • adjusting contrast and sharpness etc.,
  • edge and contour detection,
  • thresholding the grey scale images,
  • changing color spaces,
  • cropping and masking among others.
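For illustration, two of the techniques above (grayscale conversion and global thresholding) can be sketched in a few lines. Real tools typically use libraries such as OpenCV or Pillow; this NumPy-only version just shows the idea:

```python
import numpy as np

def to_grayscale(rgb: np.ndarray) -> np.ndarray:
    """Luminosity-weighted grayscale conversion (ITU-R BT.601 weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    return np.rint(rgb[..., :3] @ weights).astype(np.uint8)

def binarize(gray: np.ndarray, thresh: int = 128) -> np.ndarray:
    """Simple global thresholding: pixels above `thresh` become white."""
    return np.where(gray > thresh, 255, 0).astype(np.uint8)

# A tiny 2x2 RGB image: red, green, blue, and white pixels.
img = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

gray = to_grayscale(img)  # [[76, 150], [29, 255]]
binary = binarize(gray)   # [[0, 255], [0, 255]]
```

Thresholding after grayscale conversion discards minor intensity noise, which is why it is such a common preprocessing step before comparison.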

However, the fact remains: these techniques only take us so far, and they, too, are not adequate to counter the complex image matching cases being discussed here.

Long Explorations And Experiments

Prior to my explorations in this space, we built a JavaScript solution that takes multiple baselines of the same screen region for comparison. These baselines are compared against a single runtime image in the hope that this would increase the probability of a successful comparison operation against all odds.

However, it did not help, and even then we had to use it for a long time for lack of other known reliable options. Another problem with this approach was that it inflated the test run duration by more than 40% over what we had previously. Also, we couldn’t use it against real-time scenarios with per-second data plotting.

Given the nature of the challenge, the scarcity of online content addressing the specific needs of this project, and the project’s automation requirements across a multitude of visual scenarios, I couldn’t pick any off-the-shelf tools, libraries, or code snippets built in TypeScript/JavaScript, or even in Python, because none of the seemingly promising solutions fulfilled the very hard and stiff requirements or even came close to solving the problems laid out above.

As a part of various PoCs aimed at working out potential solutions, I tried plenty of approaches spanning a diverse range of image processing techniques, as well as small solutions for finding image similarity scores. Below is a summary of these efforts.

  • Resize failed images to equal size without using aspect ratio and perform bitmap comparisons
  • Resize failed images to equal size using aspect ratio and perform various types of comparisons
  • Update the contrast, brightness and sharpness of both images and perform various types of comparisons
  • Creating different image formats
  • Creating images with different color spaces and performing comparisons
  • Perform various image processing techniques — conversion to greyscale, thresholding the images, masking a sub-area of the images etc. — and perform various types of comparisons
  • Mean Squared Error (MSE)
  • Image subtraction
  • Manhattan distance

So, what was the outcome of trying the above techniques, both separately (one at a time) and in combination? Again, they were useful only to a limited extent; it wasn’t possible to satisfy all the defined scenarios. So, I had to take my exploration and experimentation to the next level.
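For reference, the two distance-based scores tried above (interpreted here as Mean Squared Error and Manhattan distance) are simple enough to sketch, assuming same-size grayscale arrays:

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean Squared Error: average of squared per-pixel differences (0 = identical)."""
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def manhattan(a: np.ndarray, b: np.ndarray) -> float:
    """Manhattan (L1) distance, normalized by pixel count."""
    return float(np.sum(np.abs(a.astype(np.float64) - b.astype(np.float64)))) / a.size

a = np.full((10, 10), 100, dtype=np.uint8)
b = a.copy()
b[0, 0] = 110  # one pixel differs by 10 intensity levels

print(mse(a, b))        # 1.0 -> (10**2) / 100 pixels
print(manhattan(a, b))  # 0.1 -> 10 / 100 pixels
```

Both produce a single global score, which hints at their limitation: a score of 1.0 could come from one large localized defect or from faint noise spread across the whole image, so they cannot by themselves distinguish Scenario 2 from Scenario 3.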

Computer Vision — The Answer

After long and tiresome efforts with various approaches and techniques, I found that only a combination of multiple Computer Vision algorithms could help navigate these issues in a reliable way.

With focused efforts spanning multiple weeks, I picked four major algorithms and gained a good deal of confidence through a set of small PoCs with each of them. Each has its own inner workings, merits, and limitations.

But there’s a silver bullet in combining their individual results for evaluation.

  1. Structural Similarity Index (SSIM)
  2. Perceptual Hashing (P-Hashing)
  3. BRISK with FLANN: Binary Robust Invariant Scalable Keypoints, matched via the Fast Library for Approximate Nearest Neighbors
  4. Differential Hashing (D-Hashing)
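As a taste of why these algorithms are robust where raw pixel matching is not, here is a simplified D-Hash sketch. A real implementation (e.g. the imagehash library) resizes with proper interpolation; this NumPy-only version uses crude block averaging as a stand-in:

```python
import numpy as np

def shrink(gray: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Crude box-average downscale (a stand-in for a proper image resize)."""
    return np.array([[block.mean() for block in np.array_split(band, cols, axis=1)]
                     for band in np.array_split(gray.astype(np.float64), rows, axis=0)])

def dhash(gray: np.ndarray, hash_size: int = 8) -> int:
    """Difference hash: shrink to hash_size x (hash_size + 1), then record
    whether each pixel is brighter than its left-hand neighbour."""
    small = shrink(gray, hash_size, hash_size + 1)
    bits = (small[:, 1:] > small[:, :-1]).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(h1: int, h2: int) -> int:
    """Number of differing bits; a small distance means perceptually similar images."""
    return bin(h1 ^ h2).count("1")

# A 64x64 horizontal gradient, and a copy with a tiny pixel-level variance.
grad = np.tile(np.arange(0, 256, 4, dtype=np.uint8), (64, 1))
noisy = grad.copy()
noisy[0, 0] += 3  # invisible to the eye; breaks a strict pixel match

print(np.array_equal(grad, noisy))         # False: strict match fails
print(hamming(dhash(grad), dhash(noisy)))  # 0: perceptually identical
```

Because the hash encodes relative brightness gradients rather than absolute pixel values, the minute variance that defeats a strict match (Scenario 3) leaves the hash untouched, while a genuinely different image would flip many bits.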

Multiple Pipelines

There are four major building blocks that constitute the CV-based solution, starting with a data pipeline that needs to be built upfront:

1. Preparing and setting up datasets for preprocessing stage

2. Data quality : cleaning and image preprocessing

3. Core processing pipeline

4. Net result evaluation gate

Once the processing in the data pipeline is done, the next stage is the core processing pipeline, where a set of algorithms is put into action. The standalone results of each algorithm for all images are then fed into a flexible and robust evaluation gate to arrive at a net conclusion on image similarity as a binary result: True or False. Of course, consolidated JSON and Excel reports are also generated at the end of the operations.
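We’ll detail the gate’s policy later in the series; conceptually, though, it can be thought of as a configurable vote across the per-algorithm verdicts. A minimal sketch (the names and the voting policy are illustrative, not the actual implementation):

```python
from dataclasses import dataclass

@dataclass
class AlgoResult:
    name: str
    passed: bool  # did this algorithm judge the image pair as a match?

def evaluation_gate(results: list, min_votes: int = 3) -> bool:
    """Net similarity verdict: True when at least `min_votes` algorithms agree."""
    return sum(r.passed for r in results) >= min_votes

results = [
    AlgoResult("SSIM", True),
    AlgoResult("P-Hash", True),
    AlgoResult("BRISK-FLANN", False),  # e.g. too few keypoint matches
    AlgoResult("D-Hash", True),
]

print(evaluation_gate(results))  # True: 3 of 4 algorithms voted "match"
```

Making `min_votes` configurable is what lets the gate be tightened for Scenario 2 (where minute differences must be reported) and relaxed for Scenarios 1 and 3 (where they must be ignored).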

CV Solution Stack — Components

There are two major components in the CV solution stack.

ImageBench: A standalone CV tool built with Python for performing image benchmarking against a large number of images. It has no dependency on automated tests or test runs; just feeding the baseline and runtime images to the tool is enough. It ships with rich configuration options.

ImageVision: An in-house Computer Vision framework consisting of two core layers.

A. Frontend: A TypeScript-based Test Development Kit comprising a set of APIs referencing the backend CV engine. The APIs are called from tests.

B. Backend: A CV engine built with Python and CV libraries, with rich configuration options.

We’ll discuss the solution stack in detail in the subsequent parts of this series. Meanwhile, we’ve been talking a lot about CV here. Why can’t we use Deep Learning techniques (a subset of Machine Learning) to find the answers? Let’s discuss that too.
