Harder, better, faster, cheaper — Optimizing video bitrate for ultra low latency live content

Jeroen Mol
ExMachinaGroup
Nov 7, 2019

Recent breakthroughs in video technology are shaping the way live video is consumed, and one of the most crucial evolutions has been the reduction of glass-to-glass video latency.

Achieving the lowest possible latency during a live video stream requires that all parts of the video delivery chain be tuned to perfection. Certain optimizations are related to CMAF, WebRTC, LHLS or any of several other video protocols, while others work with all possible streaming protocols/setups. One optimization that is important for both the latency and streaming cost is selecting the proper video quality/bitrate.

In general, all of our customers have three main requirements for their live video stream: the lowest latency, the best video quality, and the lowest price. These three requirements can be mapped in a “Scope Triangle” illustrating the trade-offs inherent to the defined priorities.

This article will be focusing on one of the three requirements: video quality. We will be taking a look at how Livery determines the best possible bitrate and compression for its customers. If you’d like to read up on CMAF-based low-latency video before reading on, check out this post.

In general, video quality is based on bitrate. The general rule of thumb is that video quality increases as bitrate increases. As a video publisher, you generally pay for the amount of data (gigabytes) you deliver to your end users, and as an end user, you’ll usually pay for the data you receive (data bundle), so increasing the video quality affects both the publisher’s and the end user’s costs.

The Livery team generates reports that determine the point of diminishing returns for a customer’s content type, i.e. the sweet spot where there is an optimal balance between quality and data (bitrate). Moving beyond the point of diminishing returns means the extra data input no longer results in an appreciable improvement in quality (output).

There are multiple ways to measure video quality. The most popular metrics are the peak signal-to-noise ratio (PSNR) and the mean squared error (MSE), which are based on a pixel-by-pixel comparison, while metrics like SSIM, VSSIM (the video version of SSIM), NTIA’s VQM, PVQM, and PEVQ use reference information and/or the original frame to determine the relationship between the original and the decoded frame.

However, the problem with the metrics above is that none of them take human perception into account. They measure the distortion added due to the compression on a frame-by-frame basis and don’t make allowances for human perception over a period of time.
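To make the frame-by-frame nature of these metrics concrete, here is a minimal sketch of MSE and PSNR computed between two frames, with each frame represented as a flat list of 8-bit pixel values (the example frames and values are illustrative, not from our test set):

```python
import math

def mse(frame_a, frame_b):
    """Mean squared error between two frames of equal size."""
    assert len(frame_a) == len(frame_b)
    return sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)

def psnr(frame_a, frame_b, max_value=255):
    """Peak signal-to-noise ratio in dB; higher means less distortion."""
    error = mse(frame_a, frame_b)
    if error == 0:
        return float("inf")  # identical frames
    return 10 * math.log10(max_value ** 2 / error)

original = [200, 120, 64, 32]   # tiny 4-pixel "frame" for illustration
decoded  = [198, 121, 60, 33]   # same frame after lossy compression
print(round(psnr(original, decoded), 1))  # → 40.7
```

Note that both functions score each frame in isolation: a compression artifact that flickers for a single frame is penalized just as heavily as one that persists, which is exactly the blind spot described below.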

VMAF (Video Multi-Method Assessment Fusion), developed by Netflix in cooperation with the University of Southern California and the Laboratory for Image and Video Engineering (LIVE) at The University of Texas at Austin, combines human perception with a set of tools to generate a quality rating.

To start, Netflix created a dataset of short compressed video clips and asked a test group to rate the videos compared to the original (source) video. All of this data was used to train a machine-learning algorithm to rate video from 0 to 100 (the higher the score, the better the quality).

The fact that VMAF is an objective full-reference video quality metric which can be trained for specific content types makes it an ideal framework for the Livery team to find the optimal mix of codec, resolution, compression, and more for their customers.

Test setup

In the following section, the Livery team will be reviewing 4 types of content to determine the point of diminishing returns for each one using the VMAF model. The content will be consumed on mobile devices and the source materials are HD videos (1920x1080) 24 FPS. The sample set was selected to represent a wide range of content types:

  • Animation — the complexity of the content is “medium”, meaning the scenes contain many flat regions without any noise and a minimal amount of motion between the frames.
  • Sports (Soccer) — The content is “complex.” It has a high amount of temporal motion, fast-moving small objects, and a medium number of scene changes.
  • Action movie — The content is “complex,” with a high amount of spatial texture. It contains a high number of scene changes, explosions, and fast-moving objects.
  • Talking head — The content is “simple.” The background scenes contain many flat regions, no scene changes, and only a small amount of motion between frames.

The VMAF score is measured based on an algorithm that uses a combination of bitrate, resolution, and quality/compression. All of the different combinations add up to a total of 750 samples per content type (3,000 in total).

The following table is a snapshot of the data we gathered during testing. The VMAF test was performed on a c5.metal AWS instance with 96 cores, and it took a total of 24 hours to run the test for all 4 content types.
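As a sketch of how an individual sample can be scored, the snippet below builds an ffmpeg invocation using the libvmaf filter; it assumes an ffmpeg build with libvmaf enabled, and the file names are placeholders (both inputs must have matching resolution and frame rate, so the distorted stream may need to be scaled back up first):

```python
import subprocess

def vmaf_command(distorted, reference, log_path="vmaf.json"):
    """Build an ffmpeg command that scores `distorted` against `reference`
    with the libvmaf filter, writing per-frame scores to a JSON log."""
    return [
        "ffmpeg", "-i", distorted, "-i", reference,
        "-lavfi", f"libvmaf=log_fmt=json:log_path={log_path}",
        "-f", "null", "-",  # decode and score only; discard the output
    ]

# Example (placeholder file names):
# subprocess.run(vmaf_command("encode_720p.mp4", "source_1080p.mp4"), check=True)
```

Running one such command per bitrate/resolution/compression combination is what makes the sample counts above add up quickly, hence the many-core instance.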

According to the VMAF methodology, a score difference of 6 points can be perceived by the human eye. Smaller differences are too insignificant for us to notice. We asked a test group to rank different videos on a scale of 1–5 (Excellent, Good, Fair, Poor, and Bad), and each sample reviewed by the test group had a VMAF score difference of at least 6 points. This provided us with the following score mapping:

  • Source, VMAF 100pt
  • Excellent, VMAF 86–100pt
  • Good, VMAF 71–85pt
  • Fair, VMAF 41–70pt
  • Poor, VMAF 21–40pt
  • Bad, VMAF 1–20pt
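The score mapping above can be expressed as a small helper, useful when turning raw VMAF logs into the subjective labels used in our reports:

```python
def vmaf_rating(score):
    """Map a VMAF score (1-100) to the subjective label from our score mapping."""
    if score >= 86:
        return "Excellent"
    if score >= 71:
        return "Good"
    if score >= 41:
        return "Fair"
    if score >= 21:
        return "Poor"
    return "Bad"

print(vmaf_rating(87))  # → Excellent
print(vmaf_rating(55))  # → Fair
```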

For each content type, an example screenshot was gathered to provide a visual reference in relation to the score/quality mapping. It’s important to note that the screenshots were captured from a video: distortion that becomes visible when reviewing a single frame might not be visible when the video is played at 24fps, highlighting the limitations of a pixel-by-pixel comparison method (compare screenshot A with screenshot B) versus a human perception-based method like VMAF.

Results

Our Livery solution is focused on ultra-low latency live content where encoding decisions are made upfront, meaning a fixed value/setting is configured at the start of the live event. Optimizations like shot-based encoding can only be implemented when the content is available upfront. For the Livery and Ex Machina use cases, the VMAF score model has proven to be an indispensable metric to determine the proper encoder settings for ultra-low latency live streams.

When the best VMAF scores per bitrate and content type are plotted on a line graph, a clear pattern reveals itself. The information derived from the graph is used to determine the best possible encoder settings.

The graph shows a logarithmic pattern for each content type, meaning that an increasing amount of data (bitrate) is needed to get similar quality improvements. For example, increasing Animation from 0.8 to 1.0 Mbps provides a score improvement of 5pts, but going from 1.8 to 2.0 Mbps only raises the score by 1pt.

This is the graph we use to determine the “point of diminishing returns”. The exact point varies by video category and/or business requirements, but in general, it lies between a VMAF score of 71 and 94 points.

Scores above 94 will not provide a visual improvement compared to the source. The bottom bar can be set at 71 (Good) or 86 (Excellent) depending on the client’s requirements. For most Livery customers, “Good” quality suits their needs. When video quality is key, 86 pts can be considered a baseline.
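One way to automate finding this point is to walk the bitrate/score curve and stop where the marginal VMAF gain per extra Mbps falls below a threshold. The sketch below does exactly that; the sample points are illustrative numbers loosely modeled on the animation curve described above, and the threshold value is an assumption, not a Livery setting:

```python
def point_of_diminishing_returns(samples, min_gain_per_mbps=10.0):
    """samples: (bitrate_mbps, vmaf_score) pairs sorted by bitrate.
    Returns the first bitrate at which the marginal VMAF gain per extra
    Mbps drops below the (hypothetical) threshold."""
    for (b0, s0), (b1, s1) in zip(samples, samples[1:]):
        gain = (s1 - s0) / (b1 - b0)  # VMAF points gained per extra Mbps
        if gain < min_gain_per_mbps:
            return b0
    return samples[-1][0]  # gains never flattened out within the sample set

# Illustrative animation curve: +5 pts from 0.8→1.0 Mbps, +1 pt from 1.8→2.0 Mbps
animation = [(0.6, 70), (0.8, 79), (1.0, 84), (1.4, 89), (1.8, 92), (2.0, 93)]
print(point_of_diminishing_returns(animation))  # → 1.4
```

In practice the threshold would be tuned per content type and business requirement, which is why the resulting point lands anywhere in the 71–94 pt band mentioned above.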

The data clearly indicate the difference in the point of diminishing returns based on the type of content.

Adaptive bitrate

Almost all Livery live streams use an adaptive bitrate (ABR) ladder, which allows users to switch to lower bitrates/qualities when the internet connection is not fast enough or when the end user prefers to save bandwidth. The point of diminishing returns is set to be the highest quality in the ABR ladder. The VMAF results are also used to determine lower ABR settings to form a bitrate ladder.

The number of different video quality options has an impact on the encoder’s performance. Ideally, it would encode the video in all possible quality levels, but that is not realistic due to real-world hardware limitations. If a smooth transition from low to high quality (or the other way around) is requested, the following bitrate ladder is advised for animation.

Based on our experience, 6–12 pts up/down is hard to detect within the threshold of average human perception (not considering golden eyes). The viewer perceives this to be a smooth transition to a higher or lower quality. The 6 different bitrates are within the capabilities of the encoder developed by the Livery team.

In practice, a smooth transition works when quality is increasing, but stepping down in quality is an issue given the limited buffer of an ultra-low latency live stream. Increasing the bitrate difference between the steps in the bitrate ladder allows the ABR algorithm to reach the best matching bitrate without video stalls. When the target latency is lowered to 1.5 seconds, the bitrate difference needs to be increased even further.
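A simple way to derive such a ladder from the VMAF samples is to start at the top rung (the point of diminishing returns) and greedily add lower rungs whenever the downward switch would cost between 6 and 12 VMAF points. This is a sketch, not the Livery encoder’s actual algorithm, and the sample scores are illustrative:

```python
def build_ladder(samples, top_bitrate, min_step=6, max_step=12):
    """Greedily pick ABR rungs below `top_bitrate` so that every downward
    switch costs between min_step and max_step VMAF points."""
    current_score = dict(samples)[top_bitrate]
    ladder = [top_bitrate]
    for bitrate, score in sorted(samples, reverse=True):
        drop = current_score - score
        if bitrate < top_bitrate and min_step <= drop <= max_step:
            ladder.append(bitrate)
            current_score = score
    return ladder

# Hypothetical (bitrate Mbps, VMAF) pairs for animation content
animation = [(2.0, 93), (1.4, 89), (1.0, 84), (0.8, 79), (0.6, 70), (0.4, 58)]
print(build_ladder(animation, top_bitrate=1.4))  # → [1.4, 0.8, 0.6, 0.4]
```

Widening `min_step` and `max_step` reproduces the trade-off described above: larger bitrate gaps between rungs let the player drop quality fast enough to avoid stalls at very low target latencies, at the cost of a less smooth transition.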

Resolution vs Bitrate vs Quality

Video quality and resolution are often used interchangeably, especially for marketing or sales purposes. A relationship between resolution and quality exists, but a 720p stream might provide better perceptual quality than a 1080p stream.

The graph above shows the VMAF score per bitrate for a movie trailer played at various resolutions. Only at bitrates of 2.0 Mbps and above does the 1080p resolution provide better perceptual quality than the other resolutions; below that point, 1080p scores lower than both 360p and 720p. If the target quality is “Good” or “Excellent,” it can be achieved with 0.6 Mbps / 360p (76 pts) or 1.0 Mbps / 720p (87 pts).

The following explanation is a simplified version that doesn’t take any codec magic into consideration. To specify the color of a pixel, a value needs to be assigned to each unique color. The more bits you can spend per pixel, the more accurately its color can be reproduced, and the better the overall video quality. At low bitrates, the amount of data (information) per pixel is too low to reproduce the image well. Lowering the resolution increases the amount of data available per pixel, improving the overall quality.

A 0.6 Mbps stream (78,643 bytes per second) yields the following metrics:

As shown in the table above, the amount of data per pixel for 720p and 1080p resolution is too low to display a proper image.

Increasing the bitrate to 0.8 Mbps provides enough data for the 720p (82 pts) resolution to surpass 360p (81 pts), with a total of 0.11 bytes available per pixel in said 0.8 Mbps stream. The tipping point for the 1080p resolution is around 2.0 Mbps, with 96 pts versus 95 pts for the 720p resolution. At this point, a total of 0.13 bytes is used per pixel, as reflected in the VMAF score per resolution graph above.
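The “data per pixel” figure used here appears to be bytes arriving per second divided by the pixels in a single frame (with Mbps counted as 2^20 bits, matching the 78,643 bytes per second quoted above). Under that assumption, a short calculation reproduces the numbers in this section:

```python
RESOLUTIONS = {"360p": (640, 360), "720p": (1280, 720), "1080p": (1920, 1080)}

def bytes_per_pixel(bitrate_mbps, width, height):
    """Bytes delivered per second divided by the pixels in one frame
    (Mbps taken as 2**20 bits, as in the 0.6 Mbps = 78,643 B/s figure)."""
    bytes_per_second = bitrate_mbps * 1024 * 1024 / 8
    return bytes_per_second / (width * height)

for label, (w, h) in RESOLUTIONS.items():
    print(label, round(bytes_per_pixel(0.6, w, h), 3))
# → 360p 0.341, 720p 0.085, 1080p 0.038
```

The same formula gives 0.11 bytes per pixel for 0.8 Mbps at 720p and 0.13 for 2.0 Mbps at 1080p, the two tipping points mentioned above.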

Just like there is a minimum, there is also a maximum amount of data per pixel. The VMAF score will not increase when the upper limit is reached. The upper-quality limit of the different resolutions is clearly visible in the graph below. When the upper limit for a given resolution is reached, any increase in bitrate will not result in a higher VMAF score.

The only way to improve video quality when the upper limit of a resolution is reached would be to increase the compression “efficiency.” The graph below shows a movie trailer streaming at 720p with different compression efficiencies. Increasing compression efficiency also increases the CPU required by the encoder to compress the video frames.

The results of this research form the foundation of a new feature that is part of our Livery solution. It allows us to test a wide range of encoder configurations in combination with our customers’ content. When the test is complete, it recommends an ABR ladder with the best possible settings for each step. The screenshots below are an example of how the tool fits into the online stream control center.

The test set is based on an H.264 codec. In the next article, we will review the effect of different codec types (H.265, VP9, AV1) on video quality and cost.

If you enjoyed this article, we think you’ll like a few of our other insights:

Interested in learning more about our Livery platform? Get in touch! At Livery, we provide everything you need to launch your very own interactive ultra-low latency video solution, including concept creation, business modeling, front-end design, back-end development, and project management. Check out our website to get inspired by our portfolio and client list, or contact me directly on LinkedIn.



VP of Innovation @Livery Video. A creative problem solver with an educational background including an MA in Art Management and a BA in Media Management.