Objective Video Quality Analysis at Airtime

Tech @ Airtime · Published in Airtime Platform · Dec 18, 2019 · 12 min read

by Caitlin O’Callaghan

How far is New York City from Los Angeles? Far.
How cold is the North Pole? Freezing.
How hard is a diamond? Hard.

If you are inquisitive, or have a knack for technicalities, the above qualitative responses are insufficient answers. You’d prefer to know that:

NYC and LA are 2,446 miles apart, measuring along a direct flight route.
In January, the North Pole averages a high of 2°F and a low of -13°F.
A diamond has a hardness of 10 on the Mohs Hardness Scale.

This leads us to the question:
How good is the video quality on Airtime?

As humans, we like to know the trends of change, and we like to know the delta, or quantitative measure, of that change. The data of a video stream changes slightly, too, as it is sent from one device and received by another. However, video quality is still predominantly measured with subjective analysis, where adjectives like “bad”, “okay”, or “great” are assigned to video streams after they have traversed the network.

Subjective evaluation, typically measured as a Mean Opinion Score (MOS), is costly, time-consuming, and inconvenient. Furthermore, there may be disagreements in how to interpret the viewers’ opinions and scores.

At Airtime, we recognized the need to objectively quantify the quality of our real-time video chatting and streaming services. As a result, we embarked on a quest to implement objective video quality analysis into our real-time test applications.

In the Airtime app, a user publishes video captured by their camera to other members subscribing to the video chat. As the video frames are transferred, the original image can become distorted by the video encoder as it reacts to constraints like network bandwidth and the processing power of the subscribing device. Therefore, the image rendered on the subscriber’s screen can appear different from what was originally captured on the publisher’s camera. Formally, the original image is referred to as the reference image, and the received image is known as the distorted image.

High-level workflow of video traversal from a publisher to the receiving clients.

The perceived difference in quality between a reference image and a distorted image depends on the human visual system. Different types of errors carry different weight in their effect on perceived video quality. For instance, we accept changes in contrast more readily than added blurring or blockiness.

Below is an example of a reference image (top left corner) and five distorted images. All of the distorted images have the same Mean Squared Error (MSE) quality score. MSE averages the squared intensity differences between each pixel in the reference image and the corresponding pixel in the distorted image. Although the images have the same MSE score, when looking at the image collection, it is clear that a manual tester would classify some of the distortions as worse than others.

Comparison of five distorted images to a reference image, displaying the difference between algorithmic measurement of error and human-perceived error [1].
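To make the metric concrete, here is a minimal sketch of the MSE computation in C++, assuming both frames are available as 8-bit grayscale buffers of equal size; the function name and buffer layout are illustrative, not taken from any particular library.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Mean Squared Error between a reference frame and a distorted frame,
// both given as 8-bit grayscale buffers of identical dimensions.
double meanSquaredError(const std::vector<std::uint8_t>& reference,
                        const std::vector<std::uint8_t>& distorted) {
    double sum = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double diff = static_cast<double>(reference[i]) - distorted[i];
        sum += diff * diff;  // squared intensity difference at this pixel
    }
    return sum / static_cast<double>(reference.size());
}
```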

To best replicate a user’s experience and perception of quality, it is necessary that the video quality analysis algorithm implemented by Airtime considers the human visual system.

Objective Quality Analysis

Objective image quality analysis can be split into three categories of implementation:

  1. Full-Reference: The entire reference image is known.
  2. Reduced-Reference: Only features of the reference image are known and used in computation.
  3. No-Reference: Nothing is known about the reference image. This is a “blind” analysis.

Full-reference

Full-reference models allow for complete comparison of the distorted image to the reference image. They are characterized by having high accuracy; however, they require a backchannel to pass the original image to where the comparison is taking place.

Common algorithms include PSNR, histogram analysis, SSIM, and MSE, which was described above.

Peak Signal-to-Noise Ratio (PSNR) is an objective algorithmic analysis that treats the change in photo quality as an error signal overlaying the original image. This method does not assign weightings to different types of errors; thus, all errors are handled as if they have the same visual impact.
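As an illustration of how PSNR relates to MSE, the score can be computed directly from the MSE sketch above; note that every pixel error contributes equally, regardless of how visible it is (8-bit frames assumed).

```cpp
#include <cmath>
#include <limits>

// PSNR in decibels for 8-bit frames, derived from the MSE value.
// Higher is better; identical frames yield an infinite PSNR.
double psnr(double mse) {
    if (mse <= 0.0) {
        return std::numeric_limits<double>::infinity();
    }
    const double maxPixelValue = 255.0;
    return 10.0 * std::log10((maxPixelValue * maxPixelValue) / mse);
}
```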

Histograms can be used for similarity analysis. The reference and distorted images are graphically represented as histograms where the x-axis is tonal variation and the y-axis is the number of pixels for that particular tone. The two histograms are then compared for similarity using various mathematical formulas such as correlation, chi-square, intersection, Bhattacharyya distance, and Kullback–Leibler divergence. Multiple histograms may be used to analyze other features within the image, such as luminance.
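As a sketch of what such a comparison could look like, the snippet below uses OpenCV’s histogram utilities on two grayscale frames; OpenCV and the choice of a correlation metric are assumptions made for the example, not a description of Airtime’s implementation.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Compare the tonal distributions of a reference and a distorted grayscale frame.
// A correlation score of 1.0 means the histograms are identical.
double histogramSimilarity(const cv::Mat& reference, const cv::Mat& distorted) {
    const int histSize = 256;                      // one bin per 8-bit tone
    const float range[] = {0.0f, 256.0f};
    const float* ranges[] = {range};
    const int channels[] = {0};

    cv::Mat refHist, distHist;
    cv::calcHist(&reference, 1, channels, cv::Mat(), refHist, 1, &histSize, ranges);
    cv::calcHist(&distorted, 1, channels, cv::Mat(), distHist, 1, &histSize, ranges);

    // Normalize so the comparison is independent of image size.
    cv::normalize(refHist, refHist, 1.0, 0.0, cv::NORM_L1);
    cv::normalize(distHist, distHist, 1.0, 0.0, cv::NORM_L1);

    // Alternatives include HISTCMP_CHISQR, HISTCMP_INTERSECT, and HISTCMP_BHATTACHARYYA.
    return cv::compareHist(refHist, distHist, cv::HISTCMP_CORREL);
}
```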

The Structural Similarity Index (SSIM) is a variation of MSE analysis that incorporates the human visual system. Weightings are assigned to luminance, contrast, and image structure. The algorithm is computationally quick and considers human perception; however, it does not account for viewing distance or screen size.
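For reference, the per-window score defined in [1] combines those terms as shown in the sketch below; it assumes the local means, variances, and covariance of a window of 8-bit pixels have already been computed, and the full algorithm averages this score over windows slid across the frame.

```cpp
// SSIM for a single window of two 8-bit frames, following Wang et al. [1].
// muX, muY: local means; sigmaX2, sigmaY2: local variances; sigmaXY: covariance.
double ssimWindow(double muX, double muY,
                  double sigmaX2, double sigmaY2, double sigmaXY) {
    const double L  = 255.0;                   // dynamic range of 8-bit pixels
    const double c1 = (0.01 * L) * (0.01 * L);
    const double c2 = (0.03 * L) * (0.03 * L);
    return ((2.0 * muX * muY + c1) * (2.0 * sigmaXY + c2)) /
           ((muX * muX + muY * muY + c1) * (sigmaX2 + sigmaY2 + c2));
}
```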

When researching, we also came across Video Multi-method Assessment Fusion (VMAF), an open-source software package developed by Netflix. The VMAF algorithm calculates a quality score on a scale of 0 to 100, which serves to replicate the 0 to 100 opinion scoring that subjective evaluation typically uses. A score of 20 maps to “bad” quality, and a score of 100 maps to “excellent” quality. VMAF analysis considers the human visual system, as well as display resolution and viewing distance. Additionally, VMAF is capable of calculating PSNR and SSIM scores.

Reduced-Reference

Reduced-Reference models require information about the reference image, such as structural vectors, but they do not require the entire image. Therefore, they are most useful in cases with limited bandwidth. Since the full reference image is not available for analysis, there is a decrease in accuracy in comparison to full-reference models.

Both ST-RRED and SpEED-QA are reduced-reference models.

No-Reference

No-Reference models do not use any information from the reference image. Instead, a machine learning model is trained with a supplied dataset of images. The model is then able to identify features in the distorted image (blockiness, contrast levels, etc.) and relate these errors to examples found in the image dataset. A resulting quality score is then calculated.

This method has lower accuracy than full-reference models, since the computation relies on a provided training dataset. Accuracy can be improved by increasing the variation of errors found in the dataset. No-Reference models do not require a backchannel to the reference image, which allows for easier integration into the service provision chain.

Examples of no-reference models are Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) and Naturalness Image Quality Evaluator (NIQE), which differ by their model training.

The BRISQUE model is trained with a dataset of images and their corresponding subjective scores. This allows for low computational complexity, which makes BRISQUE highly compatible with real-time applications. However, because the algorithm calculates scores based on previous image-to-score pairings, the algorithm is unable to detect errors that it has not previously been exposed to during training.

On the other hand, the NIQE model is trained only with a dataset of images; it is not trained with subjective analysis data. This is beneficial because the system can begin to recognize errors that it was not explicitly exposed to during training by noting vectorized similarities between types of errors. NIQE, however, does not take into account the human visual system.

In order to ensure that the quality analysis framework Airtime implements is accurate and a true quantification of the video quality of a given stream, there are multiple considerations to be made during development, including frame synchronization, pooling strategy, and region of interest.

Frame Synchronization

Because full-reference and reduced-reference models require the reference image for evaluation, the frames of the reference and distorted streams need to be synchronized, in addition to adding a backchannel to grab the frames. It is important to note that Airtime’s encoder will drop frames when placed under network constraints; therefore, our video quality analysis system must only match reference frames that will have a decoded counterpart.

Common methods for synchronization are:

  1. Optical Character Recognition (OCR) — Text, such as a frame number, or an image, like a QR code or barcode, is included in the reference video.
  2. Pattern Match — Comparison of the structure and/or coloration of a small region of interest in both the reference and distorted frames — note, this method is only useful for pattern changes and generated examples.

Both OCR and pattern match methods require extensive work to implement, which would detract from our main focus of getting an initial quality analysis system up and running. Therefore, we decided to create our own matching solution if we proceeded with a full-reference or reduced-reference model.

Pooling Strategy

The majority of the evaluation methods described above are designed to be applied to an image or a single frame of a video. In order to measure the quality of an entire video segment, analysis is conducted on each frame, or on a sample of frames, and the data is then pooled to calculate a single quality metric. The pooling strategy that leads to the most accurate results differs with each quality analysis model. Recommended pooling strategies can be found in Study of Temporal Effects on Subjective Video Quality of Experience.
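As a simple illustration, two common strategies are arithmetic-mean pooling, which weights every frame equally (and is how the aggregate scores later in this post are computed), and harmonic-mean pooling, which penalizes low-scoring frames more heavily. The sketch below is illustrative only, not the pooling code used in our tooling.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Collapse per-frame quality scores into a single score for the segment.
double poolArithmeticMean(const std::vector<double>& frameScores) {
    assert(!frameScores.empty());
    const double sum = std::accumulate(frameScores.begin(), frameScores.end(), 0.0);
    return sum / static_cast<double>(frameScores.size());
}

// Harmonic-mean pooling gives low-quality frames a larger influence on the
// result. The +1 offset is one common way to guard against zero scores.
double poolHarmonicMean(const std::vector<double>& frameScores) {
    assert(!frameScores.empty());
    double reciprocalSum = 0.0;
    for (const double score : frameScores) {
        reciprocalSum += 1.0 / (score + 1.0);
    }
    return static_cast<double>(frameScores.size()) / reciprocalSum - 1.0;
}
```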

Region of Interest

An important consideration when conducting quality analysis is the region being analyzed. The decision to crop an image and only analyze a region of interest, or to analyze the entire image, will greatly affect quantitative scoring.

At Airtime, our general use case is a video of a user’s face centered in the screen, since our app is built for video chatting. Background imperfections are not as noticeable as errors in the center of the screen, since the viewer’s focus is on the person. The effect of selecting a region of interest is illustrated below by running an image of a cat through MATLAB and conducting image quality analysis using the built-in SSIM function. All grayscale images are compared to the corresponding original image and scored from 0 to 1, where 1 is a perfect match.

The effects of analyzing different regions of interest during objective quality analysis.

The scores greatly differ depending on the analyzed region. If we take the full image, the distortion with the highest quality is Distortion 2; however, if we use the cropped images, then the distortion with the highest quality is Distortion 3. Perceptually, Distortion 3 seems to be the best quality. This indicates that isolating a region of interest may be the most accurate method when implementing Airtime’s objective video quality analysis.
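For illustration, cropping a centered region of interest before scoring could look like the sketch below; OpenCV and the particular crop fractions are assumptions made for the example rather than part of our pipeline.

```cpp
#include <opencv2/core.hpp>

// Crop a centered region of interest (roughly where a face sits in a video
// chat) so that background distortions do not dominate the quality score.
cv::Mat centeredRegionOfInterest(const cv::Mat& frame,
                                 double widthFraction = 0.5,
                                 double heightFraction = 0.5) {
    const int roiWidth  = static_cast<int>(frame.cols * widthFraction);
    const int roiHeight = static_cast<int>(frame.rows * heightFraction);
    const int x = (frame.cols - roiWidth) / 2;
    const int y = (frame.rows - roiHeight) / 2;
    return frame(cv::Rect(x, y, roiWidth, roiHeight)).clone();  // deep copy of the crop
}
```

The cropped reference and distorted frames can then be passed to any of the metrics above in place of the full frames.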

Requirements

After preliminary research, it was determined that Airtime’s quality scoring algorithm must:

  • Provide a single quality score for a specified stream — using an accurate pooling strategy.
  • Be able to run on real-time video streams (and allow for delay in input material retrieval).
  • Reflect the human visual system in its scoring.
  • Be embedded into Airtime’s testing environments (C++ compatible).
  • Not be too computationally expensive.
  • Handle dropped frames (and match frames in the cases of full or reduced reference).
  • Handle frame rescaling.
  • Run on Mac, Linux, and iOS.

With these requirements in mind, a decision was made to select the quality analysis algorithm that best fit Airtime’s needs.

The Decision

In the end, the open-source VMAF analysis was selected as Airtime’s video quality analysis algorithm.

The full-reference VMAF static C++ library best fit Airtime’s requirements. The software calculates a VMAF score, which is based on data from subjective analysis and accurately reflects human perception. Moreover, it also has the capacity to calculate SSIM and PSNR scores, which is useful for result validation.

However, many custom modifications would need to be made to the library, including the analysis of real-time frame inputs rather than predetermined files. A backchannel in our video system would also need to be constructed in order to extract the original reference frames.

Fresno

To handle the needed changes, an application programming interface (API) was developed to act as the communication link between Airtime’s testing tools and the modified VMAF. This was the birth of the Fresno project, whose name follows Airtime’s media team’s California landmark project naming convention — fun fact, Fresno is the largest raisin producer in the world!

Fresno makes it into Airtime’s Jira project tracking dashboard!

To accurately pair corresponding reference and distorted frames, we dove deep into WebRTC, the open source video chat framework that Airtime is built upon — see here for more details about Airtime’s use of WebRTC.

In our use of WebRTC, frames are dropped before the encoding process when there is not enough bandwidth to send every frame. Because of this, we are able to intercept the pre-encoded, or reference, frame before the encoding process and after the potential frame drop checkpoint. This timing ensures that every reference image stored for Fresno will have a matching distorted frame. By accessing the reference and distorted frames in WebRTC, we were able to overcome the hurdle of frame synchronization.

In addition to dropping frames, frame resolution will be decreased by our encoder in cases with limited bandwidth. Fresno is able to detect this difference between the reference and distorted frames and rescale the distorted frame to match the dimensions of the reference frame.
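Conceptually, that rescaling step looks like the sketch below; this is illustrative only, using OpenCV rather than Fresno’s actual code.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// If the encoder lowered the resolution, scale the decoded (distorted) frame
// back up to the reference frame's dimensions before comparing the two.
cv::Mat matchReferenceSize(const cv::Mat& distorted, const cv::Mat& reference) {
    if (distorted.size() == reference.size()) {
        return distorted;  // already the same dimensions
    }
    cv::Mat rescaled;
    cv::resize(distorted, rescaled, reference.size(), 0, 0, cv::INTER_LINEAR);
    return rescaled;
}
```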

Fresno itself is written in C++ and utilizes process synchronization to ensure proper timing and data collection. It allows a user-specified start and stop of video quality analysis, and upon completion, it returns the aggregate VMAF quality score. A series of command line options allows for user-specified configuration of the analysis. A JSON data log also fills with frame-by-frame analysis data, including PSNR, SSIM, MS-SSIM, and VMAF scores.

Sample of the quality analysis data report.

Fresno is currently compatible with OS X and is integrated into Airtime’s video publication and subscription test application. Here, Fresno analyzes the frames of the publishing stream for the specified duration of the testing session.

Fresno in Action

To test the integrity of Fresno’s analysis, quality analysis was conducted on a pattern-generated video stream under differing constraints, specifically analysis duration, frame rate, and bitrate. VMAF scores can be interpreted on a linear scale where a score of 20 corresponds to “bad” quality and 100 to “excellent” quality.

VMAF score interpretation.

Bitrate is the speed at which data are transmitted along a network. Quality analysis plots for 250kbps, 1Mbps, and 2Mbps can be seen below.

Quality scores per frame for generated video streams at 250kbps, 1Mbps, and 2Mbps.

Across the board, the video quality is initially high. The first 25 to 50 frames have scores ranging from 95 to 100, which translates to excellent quality. Once the network is initially probed, the encoder responds to the available bandwidth and the quality drastically decreases.

At 250kbps, the network never improves, and the scores fluctuate between 50 and 70 for the rest of the video stream. At 1Mbps, the quality increases from 75 to 85 as the network is probed and bandwidth is recovered. At 2Mbps, sufficient bandwidth is available, and the quality score skyrockets to 100. It is important to remember that a score of 100 does not indicate identical frames. Rather, a score of 100 translates to excellent quality where the difference between the reference and distorted frame is negligible.

For each 30 second video stream, the aggregate VMAF scores are as follows:

  • 250kbps → VMAF score: 63.15, fair
  • 1Mbps → VMAF score: 84.84, good
  • 2Mbps → VMAF score: 89.76, very good

These scores are the arithmetic mean of the individual frame scores of the testing session.

Quality score distribution for generated video streams at 250kbps, 1Mbps, and 2Mbps.

Overall video quality is higher for videos streamed at a higher bitrate, which is expected.

The Future of Fresno

The Fresno project is still in progress as we continue to build upon its compatibility with our testing environments. We plan to integrate Fresno into our media server so we can evaluate the quality of transcoded streams and end-to-end data from live cameras. Further extensions of Fresno, such as analyzing a specified region of interest, are also in the future plans. Additionally, we plan to iterate upon our current WebRTC frame synchronization method and implement a less invasive method, such as optical character recognition.

In the meantime, we can celebrate the fact that we can now objectively quantify Airtime’s video quality in real-time on OS X. Fresno will allow our testers to numerically score videos under different network constraints. This quantitative analysis will allow us to further optimize our encoder and understand how it handles situations of limited bandwidth.

With Fresno in Airtime’s codebase, we can now rest, knowing that we can begin to objectively, and sufficiently, answer the question, “How good is Airtime’s video quality?”

References

[1] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image Quality Assessment: From Error Visibility to Structural Similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004. doi: 10.1109/TIP.2003.819861.
