The Ultra-Low Latency video streaming roadmap: from WebRTC to CMAF
Millions of players interacting within a large-scale interactive game show environment presents a number of challenges. Ex Machina has tackled many such challenges over the past 10 years. The most recent is the demand for large-scale (>500K concurrent users), affordable, ultra-low latency video streaming solutions that support HQ Trivia-style applications.
In this two-part article, we will be looking back at a number of key decisions, changes, and improvements we had to make to get to where we are today: ultra-low latency via chunked CMAF. The first part is an introduction to the technology & terminology that underpins ultra-low latency streaming, while the second part will focus on the technical implications and challenges that had to be addressed in order to provide the best possible streaming solution.
The term “latency” is used in many different contexts. For the purpose of this article, the definition is the following:
Latency — [leyt-n-see]
“In the video world, latency is the amount of time between the instant a frame is captured and the instant that frame is displayed on the end user’s device”
There is no universal absolute value that defines latency. Instead, what is considered acceptable varies per organization and application. Wowza’s infographic is a great starting point to define some common thresholds, but Ex Machina created its own version to map latency, protocol, and project type in a single image:
Ex Machina works with the following thresholds:
- Legacy Latency, 10 seconds or more
- Low Latency, between 5 and 10 seconds
- Ultra-low Latency, between 1 and 5 seconds
- Sub-second Latency, less than 1 second
The graph maps a number of real-world examples within each latency threshold for some relatable context, including:
- Facebook Live, 7–13 seconds
- Cable TV, 12 seconds
- Twitch stream, 10–30 seconds
- Voice chat, less than 1 second
In general, video streaming solutions require similar setups. The technology chain starts at the point where the mixer or camera is connected to the encoder. The encoder then pushes the video chunks via the CDN to the end users (web, mobile, desktop). It is important to understand that every step adds to the total latency, so achieving the lowest possible latency means optimizing every step in the chain.
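To make the "every step adds latency" point concrete, the chain can be modelled as a simple latency budget. The stage names and figures below are illustrative assumptions, not measured values:

```python
# Hypothetical per-stage latency budget for a streaming chain, in seconds.
# These figures are illustrative assumptions, not measurements.
PIPELINE = {
    "capture_and_mixing": 0.1,
    "encoding": 0.5,
    "ingest_to_cdn": 0.2,
    "cdn_mid_tier": 0.3,
    "last_mile_delivery": 0.2,
    "player_buffer": 0.7,
}

def total_latency(stages: dict[str, float]) -> float:
    """End-to-end latency is the sum of every stage's contribution."""
    return round(sum(stages.values()), 3)

print(total_latency(PIPELINE))  # -> 2.0
```

Shaving time off any single stage lowers the total, which is why the optimization list above touches every link in the chain.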
Here are just a few innovative ways that Ex Machina was able to optimize the process:
- Shift encoder priority from quality to latency optimization
- Push video in real-time to the CDN, instead of request-based
- Ensure the ingest server is close to the encoding facility
- Optimize the CDN mid-tier for fast transfers
- Ensure servers are close to the end user with a minimal amount of hops
- Optimize buffer logic for frame-by-frame playback
- Implement sync logic
In this article, we won’t be looking at the studio setup required for an HQ Trivia-style application. If you’re interested in this topic, you can check out the article ‘How to develop your own HQ Trivia’ for more info. If you prefer watching over reading, you can find a video version of the article here.
Video Streaming Protocol
There are several protocols that are used to deliver low-latency video streams to the end user. WebRTC is the newest & fastest kid on the block, and in theory, a no-brainer when latency is the main focus. While using WebRTC for our projects, two critical limitations were exposed: firstly, WebRTC struggles with large numbers of concurrent users and spikes in traffic. Secondly, WebRTC requires a dedicated hosting solution, reducing its flexibility and increasing hosting costs.
Alternatives to WebRTC are RTMP, HLS & DASH. RTMP is one of the oldest protocols able to deliver low-latency video, but it requires Flash for web-based playback, which affects the player’s cross-platform/browser support. There are additional solutions available, like WebSocket-based streaming, but these approaches currently lack standardization.
This leaves HLS & DASH as the remaining alternative protocols to WebRTC. The default HLS & DASH setup can achieve a latency as low as 5 seconds, but with the help of CMAF (Common Media Application Format), a latency of 1 second is theoretically possible.
The following overview compares the different protocols based on our findings:
CMAF is not a protocol — it’s a container that facilitates the combination of HLS and DASH in a single optimized format. CMAF’s low-latency mode allows the encoder to push video chunks instead of request-based video segment delivery. To understand how Ex Machina uses CMAF, it is important to have a basic understanding of the default HLS & DASH solution.
HLS & DASH Explained
The encoder can be seen as a production facility: first, it receives the raw video materials from the camera/video mixer and builds streamable video parts (.ts or .mp4 files). The parts and a “manual” (.m3u8 or .mpd) are then packed into a large shipping container (CMAF). The smaller the segment, the lower the theoretical latency.
The end user requests video segments from the “distribution facility” (CDN), which forwards the request to the encoder. The encoder releases these segments, allowing the CDN to deliver the proper segment type (HLS or DASH) to the end user’s video player. The video player unboxes the segments and loads them into its buffer. The stream starts playing once enough parts are unboxed and buffered; the HLS specification states that 3 segments should be loaded before the player is allowed to start playback.
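The three-segment buffering rule explains why default HLS carries so much latency, and why smaller segments lower the theoretical floor. A back-of-the-envelope sketch (segment durations here are assumed examples):

```python
def hls_startup_latency(segment_duration: float, buffered_segments: int = 3) -> float:
    """Rough lower bound on HLS live latency: the player must buffer
    `buffered_segments` full segments before playback may start."""
    return segment_duration * buffered_segments

# With classic 6-second segments, the player sits roughly 18 s behind live:
print(hls_startup_latency(6.0))  # -> 18.0
# Shrinking segments to 1 s brings the floor down to roughly 3 s:
print(hls_startup_latency(1.0))  # -> 3.0
```

In practice, encoding, transfer, and decoding add on top of this floor, which is why segment-based HLS/DASH bottoms out around 5 seconds.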
Chunked CMAF Explained
Chunked CMAF is a highly-optimized solution for delivering HLS & DASH video to the end user.
The encoder wraps a single .mp4 file and the .mpd & .m3u8 files inside one CMAF container. Then, the data is pushed via HTTP/1.1 Chunked Transfer Encoding through the CDN to the end user’s video player. The system no longer waits for a request — instead, it assumes that the chunks can be pushed to the player as soon as they are available.
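Chunked Transfer Encoding is what makes this push model possible: each chunk is framed with its size in hexadecimal, so the sender never needs to know the total body length up front and can flush CMAF chunks the moment they leave the encoder. A minimal sketch of the wire framing (illustrative only, not a full HTTP implementation):

```python
def encode_chunk(data: bytes) -> bytes:
    """Frame one chunk: hex length, CRLF, payload, CRLF."""
    return f"{len(data):x}\r\n".encode() + data + b"\r\n"

def last_chunk() -> bytes:
    """A zero-length chunk terminates the stream."""
    return b"0\r\n\r\n"

# Each CMAF chunk can be written to the socket the moment it exists;
# the 20-byte placeholder payload below is framed with its hex length (0x14):
frame = encode_chunk(b"moof+mdat bytes here")
print(frame)  # -> b'14\r\nmoof+mdat bytes here\r\n'
```

The player can start decoding as soon as the first frames arrive, instead of waiting for a complete segment.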
The different manifests (.mpd & .m3u8) use the same .mp4 files, making the stream platform agnostic (iOS & Android). The player plays each chunk as soon as it is decoded. On top of the custom buffer implementation, syncing logic is included: the video stream is synced across all supported platforms based on a server timestamp, allowing all video players to show exactly the same frame at the same moment in time.
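The sync idea can be sketched as follows: every client derives the target playback position from a shared server timestamp, so all players land on the same frame no matter when they joined. The function names and drift-tolerance policy below are our illustrative assumptions, not the production implementation:

```python
def target_position(server_now: float, stream_start: float,
                    target_latency: float) -> float:
    """Where the playhead should be, in seconds of media time:
    wall-clock time since the stream started, minus the desired latency."""
    return (server_now - stream_start) - target_latency

def needs_correction(player_position: float, target: float,
                     tolerance: float = 0.1) -> bool:
    """Re-seek or adjust playback rate when drift exceeds the tolerance."""
    return abs(player_position - target) > tolerance

# A stream that started 60 s ago, targeting 2 s of latency, should be at 58 s:
pos = target_position(server_now=1_000_060.0, stream_start=1_000_000.0,
                      target_latency=2.0)
print(pos)                        # -> 58.0
print(needs_correction(57.5, pos))  # -> True (0.5 s drift)
```

In practice, a player would correct small drift by nudging the playback rate and only hard-seek on large drift, to avoid visible stutter.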
When it comes to low-latency video streaming, there is no one-size-fits-all solution. The streaming setup and the associated technology need to be tweaked and tuned to meet customer requirements.
In part 2 of this article, the lead developer on this project will be guiding you through the technical challenges Ex Machina ran into, and the optimization solutions put in place to achieve a latency of less than 2 seconds with HLS & DASH based on the CMAF standard.
If you are interested in our products and services, get in touch. At Ex Machina Group, we provide everything you need to launch your very own ultra-low latency video stream, including concept creation, business modelling, front-end design, back-end development, and operations. Check out our website to get inspired by our portfolio and client list. You can also contact me directly on LinkedIn.