The Ultra-Low Latency video streaming roadmap: from WebRTC to CMAF — part 2

Welcome back! In part 1 of this series, Jeroen Mol compared a few different types of streaming solutions. He also explained why Ex Machina needs a low-latency streaming solution, and why it is currently working with ultra-low latency CMAF (ULL-CMAF).

In this article, we are going to take a look at the techniques behind ULL-CMAF. First, we will cover the basics of traditional HLS and DASH video protocols, then we will apply this knowledge to explain how ULL-CMAF can improve latency.

Seconds

When you want to reduce video latency, it is important to understand where latency comes from. The above image lays out every part of the chain that adds latency, from the encoder to the player. As the encoder encodes video, it outputs small files called segments. In this example, we use a segment duration of 6 seconds, which is a commonly used value.

In the case of a 6-second segment, the encoder has to encode 6 seconds’ worth of video before it can write the segment file to its output. This means the encoder has already added 6 seconds’ worth of latency before the segment is uploaded to the CDN. Uploading the segment to the CDN’s origin adds another couple hundred milliseconds. Then, the CDN has to propagate the segment over its servers, which takes roughly the same amount of time as the duration of the segment. Finally, the video player needs another couple hundred milliseconds or so to download the segment.

This brings us to a total of 12 seconds from the moment a frame enters the encoder until it is available on the end user’s device using the HLS or DASH protocol. However, the player will not start playback after 12 seconds: if the next segment takes slightly longer to download, the player would still be downloading it when the first segment finishes playing, stalling the feed. For this reason, the player begins playback only after it has buffered a number of segments. The HLS specification recommends buffering 3 segments before starting playback, which results in a latency of roughly 24 seconds.

That’s a lot of seconds. The good news is, the segment size can be lowered to reduce the overall latency — however, using a segment size of less than one second is not recommended since lowering the segment size increases the bitrate. This is due to the fact that each segment has to start with an IDR frame, which contains all of the pixels that are displayed on the screen. P and B frames contain only a subset of the pixels and reference other frames to fill in the missing ones. Therefore, an IDR frame uses more bytes compared to the other frame types, meaning that a smaller segment size will increase the bitrate if the video quality remains the same. With a segment size of one second, we can achieve a latency of 4 seconds.
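The back-of-the-envelope arithmetic above can be captured in a small model. This is a sketch, not a measurement: the ~0.4-second figure for upload plus download overhead and the assumption that CDN propagation takes roughly one segment duration both come from the rough numbers in this article.

```javascript
// Rough latency model for segmented HLS/DASH delivery.
// All per-hop figures are illustrative assumptions, not measurements.
function estimateLatencySeconds(segmentSeconds, bufferedSegments, transferOverheadSeconds) {
  const encode = segmentSeconds;                 // encoder buffers a full segment
  const cdnPropagation = segmentSeconds;         // ≈ one segment duration
  const extraBuffer = (bufferedSegments - 1) * segmentSeconds; // segments buffered beyond the first
  return encode + transferOverheadSeconds + cdnPropagation + extraBuffer;
}

// 6-second segments, 3 buffered segments, ~0.4 s of upload + download overhead:
console.log(estimateLatencySeconds(6, 3, 0.4)); // ≈ 24 seconds
// 1-second segments with the same buffering policy:
console.log(estimateLatencySeconds(1, 3, 0.4)); // ≈ 4 seconds
```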

Before we look at ULL-CMAF, we need to dive one level deeper into the segment files. With CMAF, the segment files are written as MP4 files, which are made up of smaller parts called boxes. With traditional HLS and DASH, each segment starts with a small MOOF box, followed by a larger MDAT box. This MOOF box describes how the player should interpret the binary data in the MDAT box. The player therefore needs the MOOF box in order to read the encoded frames inside the MDAT box.
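The box layout is simple to inspect: every top-level box begins with a 4-byte big-endian size (covering the whole box) followed by a 4-byte ASCII type. A minimal sketch of walking the boxes in a segment might look like this (32-bit sizes only; real files may also use 64-bit "largesize" boxes, which this sketch ignores):

```javascript
// Walk the top-level boxes of an MP4/CMAF segment and report their types.
function listBoxes(buf) {
  const boxes = [];
  let offset = 0;
  while (offset + 8 <= buf.length) {
    const size = buf.readUInt32BE(offset);              // total box size in bytes
    const type = buf.toString('ascii', offset + 4, offset + 8); // e.g. 'moof', 'mdat'
    boxes.push({ type, size });
    if (size < 8) break; // malformed box; stop rather than loop forever
    offset += size;
  }
  return boxes;
}

// A synthetic 12-byte "moof" followed by a 16-byte "mdat":
const demo = Buffer.concat([
  Buffer.from([0, 0, 0, 12]), Buffer.from('moof'), Buffer.alloc(4),
  Buffer.from([0, 0, 0, 16]), Buffer.from('mdat'), Buffer.alloc(8),
]);
console.log(listBoxes(demo)); // moof (12 bytes), then mdat (16 bytes)
```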

Ultra-low latency CMAF

ULL-CMAF decouples latency from segment size. It also pushes data instead of pulling it. To properly adopt this model, all components in the pipeline need to be adjusted.

To start, the encoder writes a MOOF and MDAT combination as soon as it has encoded a single frame. Instead of waiting for the full segment length, it can immediately write something to its output after the single frame has been encoded. This process reduces the latency on the encoder side from more than one second to the time it takes to encode one frame.
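To make the per-frame output concrete, here is a hedged sketch of wrapping a single encoded frame in an MDAT box so it can be emitted immediately. A real packager would also generate the matching MOOF box carrying the sample timing metadata, which is omitted here:

```javascript
// Wrap one encoded frame in an "mdat" box: 4-byte big-endian size, 4-byte
// type, then the payload. Emitted as soon as the frame is encoded, instead
// of waiting for a full segment's worth of frames.
function wrapFrameInMdat(framePayload) {
  const header = Buffer.alloc(8);
  header.writeUInt32BE(8 + framePayload.length, 0); // box size = header + payload
  header.write('mdat', 4, 'ascii');
  return Buffer.concat([header, framePayload]);
}

const box = wrapFrameInMdat(Buffer.from([1, 2, 3, 4]));
console.log(box.length); // 12 — an 8-byte header plus the 4-byte frame
```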

For each segment, the encoder opens an HTTP POST request to the CDN, sending the data over using a technique called chunked transfer encoding. As soon as the MOOF and MDAT boxes are created, they are packed into an HTTP chunk and pushed to the CDN.
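On the wire, chunked transfer encoding frames each chunk as its payload length in hexadecimal, a CRLF, the payload, and a trailing CRLF. The function below just illustrates that framing; in practice, HTTP libraries (such as Node's `http` module when no Content-Length is set) apply it automatically:

```javascript
// Illustrative chunked transfer-encoding framing: hex length, CRLF,
// payload, CRLF. Real HTTP clients and servers do this for you.
function frameChunk(payload) {
  return Buffer.concat([
    Buffer.from(payload.length.toString(16) + '\r\n', 'ascii'),
    payload,
    Buffer.from('\r\n', 'ascii'),
  ]);
}

console.log(JSON.stringify(frameChunk(Buffer.from('moof+mdat')).toString('ascii')));
// "9\r\nmoof+mdat\r\n"
```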

On the other side of the CDN, the video players do something similar: they open an HTTP GET request for a segment, and the CDN uses chunked transfer encoding to deliver the response. Whenever a chunk for that segment is pushed by the encoder, the CDN will forward that chunk to any players that have a request open for that segment. The player will then receive frames of a segment while the encoder is still creating the rest of the segment.
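The CDN's role here can be modelled as a simple fan-out: any chunk pushed by the encoder is forwarded immediately to every player with an open request for that segment. This toy in-memory version is only meant to show the data flow, not how a real CDN edge is built:

```javascript
// Toy model of the CDN edge for one segment: encoder pushes chunks,
// subscribed players receive each chunk as soon as it arrives.
class SegmentFanout {
  constructor() { this.subscribers = []; }
  subscribe(onChunk) { this.subscribers.push(onChunk); }        // player GET arrives
  push(chunk) { for (const s of this.subscribers) s(chunk); }   // encoder POSTs a chunk
}

const fanout = new SegmentFanout();
const playerA = [];
const playerB = [];
fanout.subscribe((c) => playerA.push(c));
fanout.subscribe((c) => playerB.push(c));
fanout.push('chunk-1'); // both players receive it while the segment is still growing
console.log(playerA, playerB); // ['chunk-1'] ['chunk-1']
```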

Traditional JavaScript players use a GET request to retrieve segments. The downside is that the JavaScript code can only use a segment after it has been downloaded completely. Our ULL-CMAF player instead uses the Fetch API with a fairly new feature called streaming response bodies, which allows the JavaScript code to read the data of each chunk as it arrives. Because every chunk contains a complete MOOF and MDAT box, the video player can append the contents of each chunk to its buffer immediately.
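The read loop for a streaming response body can be sketched as follows. In a browser the reader would come from `(await fetch(url)).body.getReader()` and each chunk would typically be appended to a MediaSource SourceBuffer; here the append step is a plain callback so the loop itself can be shown in isolation:

```javascript
// Pump chunks out of a streaming body reader as they arrive.
async function pumpChunks(reader, appendToBuffer) {
  for (;;) {
    const { done, value } = await reader.read();
    if (done) return;        // the encoder has closed this segment
    appendToBuffer(value);   // a complete moof+mdat pair, playable immediately
  }
}

// Demo with a synthetic stream standing in for a fetch() response body:
const stream = new ReadableStream({
  start(controller) {
    controller.enqueue('moof+mdat #1');
    controller.enqueue('moof+mdat #2');
    controller.close();
  },
});
const received = [];
pumpChunks(stream.getReader(), (c) => received.push(c))
  .then(() => console.log(received)); // ['moof+mdat #1', 'moof+mdat #2']
```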

The whole pipeline has moved from pulling segment files to pushing frames. While every component of the pipeline still adds a little bit of latency, we are now able to shave the latency down to one second.

Synchronization

Since we are dealing with video quiz apps, it is crucial that we don’t spoil information. For example, we don’t want to show the answer to a question on one device while another device still has the option to answer that same question. To solve this issue, we added synchronization to the low-latency video stream.

Synchronization means having different types of devices connected over different types of internet connections displaying the same part of the stream at the same point in time. In short, if you laid those devices out next to each other, they would be streaming the exact same image. Our video players maintain a common delay by calculating the latency and adjusting the playback speed.

In order to achieve synchronization, the encoder writes a timestamp as soon as it starts encoding the stream. This timestamp is passed along to all of the video players using the video stream, so they know when the stream started and which portion of the stream they are currently decoding. By adding its current stream position to that start timestamp, a player knows when the frame it is displaying was encoded; subtracting that moment from the current time tells it how far behind the encoder it is. This final figure is the current latency.
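That calculation is small enough to show directly. This is a sketch under the assumption that the encoder and player clocks are reasonably in sync (for example via NTP); clock skew between them would show up directly as an error in the computed latency:

```javascript
// streamStartMs: wall-clock time (ms) the encoder wrote when it started.
// positionSeconds: the media timestamp the player is currently decoding.
// The frame on screen was encoded at streamStartMs + positionSeconds,
// so the latency is the gap between that moment and "now".
function currentLatencySeconds(streamStartMs, positionSeconds, nowMs = Date.now()) {
  const frameEncodedAtMs = streamStartMs + positionSeconds * 1000;
  return (nowMs - frameEncodedAtMs) / 1000;
}

// Stream started at t = 0 ms; the player is 10 s into the stream
// while the wall clock reads 13 s:
console.log(currentLatencySeconds(0, 10, 13000)); // 3 — the player runs 3 s behind
```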

Players try to play the stream at a given target latency. If the target latency is set to 3 seconds and the player is playing with a latency of 2 seconds, it will slow down the playback speed until the latency reaches 3 seconds. On the other hand, when the latency is higher than the target latency, the player will increase the playback speed. However, one side effect of changing the playback speed is that the pitch of the sound changes along with it. We solve this problem by applying pitch correction, allowing us to modify the playback speed without giving the host a chipmunk voice!
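A minimal rate controller matching that behaviour could look like the sketch below. The 2% rate step and the 0.25-second tolerance band are assumptions for illustration, not the values our player uses:

```javascript
// Pick a playback rate from the measured latency: speed up when too far
// behind the target, slow down when too close, play normally within a band.
function playbackRate(latencySeconds, targetSeconds, band = 0.25, step = 0.02) {
  if (latencySeconds > targetSeconds + band) return 1 + step; // behind: catch up
  if (latencySeconds < targetSeconds - band) return 1 - step; // ahead: fall back
  return 1.0; // close enough: normal speed
}

console.log(playbackRate(2, 3));   // 0.98 — too close to live, slow down
console.log(playbackRate(4, 3));   // 1.02 — too far behind, speed up
console.log(playbackRate(3.1, 3)); // 1 — within tolerance
```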

By pushing frames instead of pulling complete segments, we can achieve an ultra-low latency video streaming experience. Combined with video synchronization, we have created the ultimate video streaming solution for interactive (mobile) applications.

If you are interested in our products and services, please don’t hesitate to get in touch. At Ex Machina Group, we provide everything you need to launch your very own ultra-low latency video stream, including concept creation, business modelling, front-end design, back-end development, and operations. Check out our website to get inspired by our portfolio and client list. You can also contact me directly on LinkedIn.