How to go low latency, without special tricks

Stefan Kaiser
Zattoo’s Tech Blog
11 min read · Jun 23, 2021

Every two years it hits video streaming services in Europe — and, counting other events, almost every year worldwide: either the FIFA World Cup or the UEFA European Championship is coming up. This makes companies like Zattoo think about their end-to-end latency every once in a while. And this time we tackled it. Let’s talk about what needs to be done to lower the latency and how we did it without any special tricks — fully compatible with legacy devices and without losing scalability, by keeping our cacheable streaming architecture in place.

What’s the problem?

Every time, both the media and users question how far away from “live” they are when watching a stream. They wonder whether they will be alerted to a goal by screaming neighbours, a dozen push notifications, or by actually seeing the goal on their own screen.

We all know satellite TV; it’s the benchmark as it is available almost everywhere. Even if you need specific hardware to access the content aired on SAT, it’s the most common way to watch TV.
But there is also the Internet. And nowadays everyone has some device that can access the Internet. Video defines the majority of Internet traffic, not only due to its sheer volume but also due to its importance. And then there is TV streaming as the big alternative to satellite TV.

With the TV signal from the satellite, your satellite dish catches the broadcast TV signal and your satellite receiver immediately decodes it to display the images on your TV screen. This comes with a slight delay: technically about a quarter of a second for the transmission to and from the satellite. Additionally, there is a negligible delay in the transmission between the receiver and your TV. But there is a further delay in your TV’s decoder, which has to wait for the next incoming keyframe containing the first full picture to show, plus some content buffering. In total, this typically ends up a couple of seconds behind what is happening live.

TV streaming, on the other hand, is nowadays transmitted over HTTP, like any other website. This assumes the content is available on a web server, which in turn means there is pre-processing happening on the server side to make the content available. This pre-processing is similar to what the TV does with the broadcast signal. Actually, the source of most TV streams is in fact a satellite signal, which makes SAT the main reference for comparing latency. In addition to the decoding part, pre-processing also re-encodes the content into a format suited for delivery via an HTTP streaming protocol. Once the content is available, it needs to be fetched by a streaming client, which could be a TV again, or any other Internet-connected device. And there we are at the decoder step again, but due to the nature of HTTP, the delay introduced here is typically much longer than for broadcast signals. Without any low latency measures, you end up with a delay of around 30 seconds or more for TV streaming.

What is end-to-end delay in streaming?

End-to-end latency: The timespan between the moment one frame was received at the server-side and the moment the same frame is displayed to the user with the player in state playing.
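
Written as a small formula, with t_received(f) for the moment frame f hits our servers and t_displayed(f) for the moment a playing player shows it (the symbols are just labels for the two timestamps, not taken from any spec):

latency_e2e(f) = t_displayed(f) - t_received(f)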

The processing of a signal to be streamed via HTTP contains multiple steps, each introducing some amount of delay. For our definition of end-to-end latency, the clock for a specific frame starts the moment we first receive that frame. The contribution source of TV signals varies depending on the broadcaster and the technical availability of signals. Some signals can be received via IP multicast, potentially the source with the least contribution latency. But many signals still come only via satellite, the same way they would reach the end-user as well. In the worst case, a TV signal is made available via HTTP streaming and already carries a huge delay on its contribution path.

The fact that TV streams lag behind broadcast signals is a well-known topic. It concerns not only pure TV streaming services, but also the broadcasters themselves. Typically a broadcaster has an online offering in addition to its broadcast product. With that, the broadcaster competes with its own product portfolio and runs into the same competitive disadvantages on the streaming side. In the German broadcast/streaming market, we recently saw indications that broadcasters probably delay the play-out of their TV signal differently per contribution path. This would allow them to align the delays of Internet streaming and satellite signal reception.

Steps covered in the measurement of end-to-end latency for live TV streaming

No matter what the source of the signal is, our measurement of end-to-end latency starts at the time a frame arrives at our server. We process the frame, encode it, potentially transcode it, package it into some streaming format, and make it available on the CDN. From that point on, the frame can be fetched by a client and put into the player buffer. Depending on the buffer configuration, this introduces more or less additional delay. Once the player is in state playing and displays the frame we are measuring, we take that moment as the other end of our latency measurement.

Technical implementation of end-to-end delay measurement

We are talking about low latency and how to reduce it. This only makes sense if you also measure it! We thought about the technical possibilities to automatically measure end-to-end latency and came up with a solution that works across most of our platforms.

There are two main streaming formats we currently support as state of the art: MPEG-DASH and HLS in version 7. This results in fragments being served in ISO-BMFF format.
Of course, there are multiple (more or less) legacy formats we also support, but it’s always about catching the majority. And the majority updates their apps and uses modern devices.

In ISO/IEC 23009-1 (Amendment 1) we have a great extension to the ISO-BMFF specification: the awesome ‘emsg’ atom, which allows you to attach a payload that is emitted as an event at a specific media time.
Throughout this article we have talked about specific frames whose end-to-end latency we want to track. To simplify the implementation, we reduce this to the first frame of each fragment. This keeps enough granularity to detect changes, but doesn’t overload the metadata for the purpose of latency measurement.
Glueing that together, we get an event on the first frame of every fragment that tells us the current time of this frame on this device. The payload of the event contains server-side data about the timings this fragment was encoded and packaged. Together with a quick clock-drift estimation to adjust the client-side timings, we have all the data needed to regularly report latency measurements as telemetry.
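
To give an idea of the client-side half of this, here is a minimal sketch in Python. The function names, the payload key and the drift estimation are illustrative assumptions, not our actual player code or the exact ‘emsg’ payload schema:

import time

def estimate_clock_offset_ms(server_time_ms, round_trip_ms):
    # Rough clock-drift estimation: ask the server for its current time once in
    # a while and assume the answer was produced halfway through the round trip.
    client_time_ms = time.time() * 1000
    return client_time_ms - (server_time_ms + round_trip_ms / 2)

def on_emsg_event(payload, clock_offset_ms, player_is_playing):
    # Fired by the player at the media time of the first frame of a fragment.
    if not player_is_playing:
        return None  # only measure while the player is actually in state playing
    # Server-side wall-clock timestamp written into the 'emsg' payload when the
    # fragment was encoded/packaged (the payload layout is simplified here).
    server_frame_time_ms = payload["server_frame_time_ms"]
    # Map the client clock onto the server clock before comparing.
    displayed_ms = time.time() * 1000 - clock_offset_ms
    return displayed_ms - server_frame_time_ms  # end-to-end latency in ms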

This allows us to automatically track latency across many devices and also reveals interesting insights into the differences between various platform-player architectures. For our goal of reducing end-to-end latency, it provides the benchmark and the foundation to progress further with the actual changes.

What do we want to achieve?

We are users as well. And of course, we use TV streaming to watch the UEFA European Championship, so our goal from the user’s perspective is simple:

We want to reduce the end-to-end delay for the user to a value so low that, during live events, the user experience cannot be spoiled by external influences.

Possibilities for low latency streaming

Our streaming infrastructure serves HTTP streaming formats like MPEG-DASH, HLS, or Smooth Streaming. We even serve HLS in three different protocol versions (v1, v5, v7). It is built explicitly for live streaming and performs at scale by leveraging multiple caching layers. With this bouquet of streaming protocols, we reach every device on the market, legacy or cutting edge. Typically you don’t completely change a working streaming architecture for a single reason, even if that reason is low latency streaming. That means we want to consider all options carefully, check the pros and cons of each for feasibility, and also keep an eye on potential reach.
So, what technologies could be used for low latency streaming?

  • Exposing MPEG-TS via HTTP
  • RTMP
  • LL-HLS / LL-CMAF
  • Plain DASH / HLS (currently in use)

MPEG-TS via HTTP Chunked Transfer Encoding. This is actually possible, but you will have a very limited number of clients supporting this. Without special hacks, that will probably be limited to Android. That doesn’t help us much with our current player ecosystem, which is much bigger. The main objection, however, is the fact that this is not scalable like typical HTTP streaming.
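
For illustration, consuming such a stream boils down to reading one endless chunked HTTP response. A rough sketch (the URL is made up, requests is a third-party package, and a real client would feed the packets into a demuxer/decoder):

import requests  # third-party HTTP client, assumed to be installed

# Purely illustrative endpoint exposing a continuous MPEG-TS stream via
# HTTP chunked transfer encoding.
TS_URL = "https://example.com/live/channel1.ts"
TS_PACKET_SIZE = 188  # MPEG-TS uses fixed 188-byte packets starting with 0x47

with requests.get(TS_URL, stream=True, timeout=10) as response:
    response.raise_for_status()
    buffer = b""
    for chunk in response.iter_content(chunk_size=8192):
        buffer += chunk
        # Packets can be decoded as soon as they arrive, which is what makes
        # this approach attractive for latency in the first place.
        while len(buffer) >= TS_PACKET_SIZE:
            packet, buffer = buffer[:TS_PACKET_SIZE], buffer[TS_PACKET_SIZE:]
            if packet[0] != 0x47:
                raise ValueError("lost MPEG-TS sync")
            # a real client would hand the packet to a demuxer/decoder here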

RTMP. Yeah, back to the 2000s. Flash is dead, isn’t it? As huge as its usage was 15 years ago (interesting talk from one of the creators), it is nowadays a contribution format rather than a distribution format. Limited client-side support, in addition to the fact that it’s not scalable either, drops it from the list as well.

LL-HLS / Chunked CMAF. We already said that we currently use HLS and DASH, so LL-HLS and Chunked CMAF don’t seem far off from what we want. That’s true. But what does it mean to move from plain HLS to LL-HLS? On the one side, we need a client that supports this new format, and the same goes for Chunked CMAF. For LL-HLS such clients can be found; for Chunked CMAF it gets trickier. Widespread and stable client-side support is simply not there yet. Then there’s the server side: both variants assume that we can announce the availability of fractions of the content even before the full fragment is finished. Unfortunately, this doesn’t work with our architecture without a major rework, as we rely on the presence of full fragments to be picked up by on-the-fly transcoding and packaging processes. With those arguments, we realised that LL-HLS and LL-CMAF are not viable options for us at the moment.

All of that makes one thing clear: we want to reuse our current streaming infrastructure, as it already scales, has the reach we need, and is largely backwards compatible. So what can be done with plain DASH and HLS to reduce the latency significantly?

Latency reduction with plain DASH and HLS

Looks like no single technology can help us out on its own. So we stick with plain HTTP streaming — but what can we do with it to lower the latency anyway?

There is one question unanswered so far: where does our current latency come from exactly?
We measured end-to-end latency to get the full picture, but we also added measurement points for some steps in between to get indications. Additionally, we analysed our architecture in detail to find the spots where delays are introduced. In the end, we could pinpoint most of the delay as being tightly coupled to the fragment length: most delays are introduced because we finish, store, process, prefetch or buffer a specific number of fragments. Even the obvious split of the end-to-end delay into a server-side and a client-side part shows a correlation with the fragment length in each case.

For the server-side path, we consider everything from the initial encoding to potential transcoding and packaging and making it available for client requests. The client-side in the player is basically buffering and loading/starting the stream. Our fragment length has been four seconds since forever. A rough calculation would lead us to the following numbers:

+-----------------+--------------+----------------+
| Fragment length |              | 4 seconds      |
+-----------------+--------------+----------------+
| Encoding delay  | 4 fragments  | 16 seconds     |
| Random spread   | 0-3 seconds  | 0-3 seconds    |
| Player buffer   | 3 fragments  | 12 seconds     |
| Total           |              | 28-31 seconds  |
+-----------------+--------------+----------------+

With that, we were wondering how far down we could get by just reducing the fragment length. Changing the fragment length had been in our backlog for a while, as the current value of four seconds is not optimal; the reasons are explained in a previous blog post. In that post, we also identified a set of fragment lengths that are valid for all types of content. The lowest reasonable fragment length from that investigation is 1.6 seconds. So let’s have a look at what the impact of changing only this parameter would be:

+-----------------+------------------+----------------------+
| Fragment length |                  | 1.6 seconds          |
+-----------------+------------------+----------------------+
| Encoding delay  | 4 fragments      | 6.4 seconds          |
| Random spread   | 0 - 1.6 seconds  | 0 - 1.6 seconds      |
| Player buffer   | 3 fragments      | 4.8 seconds          |
| Total           |                  | 11.2 - 12.8 seconds  |
+-----------------+------------------+----------------------+

During the process of analysing our ingest architecture, we found a few additional spots we could tweak to reduce the end-to-end latency by refactoring our codebase. Basically, these are buffering settings of the ingest signal that were set higher than needed. In the end, this yields another two fragment lengths of latency reduction:

+-----------------+------------------+----------------------+
| Fragment length |                  | 1.6 seconds          |
+-----------------+------------------+----------------------+
| Encoding delay  | 2 fragments      | 3.2 seconds          |
| Random spread   | 0 - 1.6 seconds  | 0 - 1.6 seconds      |
| Player buffer   | 3 fragments      | 4.8 seconds          |
| Total           |                  | 8 - 9.6 seconds      |
+-----------------+------------------+----------------------+

Here we are! 10 seconds end-to-end latency with plain DASH and HLS.
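
By the way, the tables above are nothing more than multiples of the fragment length plus the random spread. A tiny sketch reproduces the numbers (it is just the arithmetic, not a model of our actual pipeline):

def latency_budget_s(fragment_len_s, encoder_fragments, buffer_fragments, max_spread_s):
    # Fixed part: fragments held on the encoding side plus the player buffer,
    # each worth one fragment length. The random spread adds 0..max_spread_s on top.
    fixed_s = (encoder_fragments + buffer_fragments) * fragment_len_s
    return round(fixed_s, 1), round(fixed_s + max_spread_s, 1)  # (best, worst) case

print(latency_budget_s(4.0, 4, 3, 3.0))  # old setup:              (28.0, 31.0)
print(latency_budget_s(1.6, 4, 3, 1.6))  # shorter fragments only: (11.2, 12.8)
print(latency_budget_s(1.6, 2, 3, 1.6))  # plus ingest tweaks:     (8.0, 9.6)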

Side effects

Besides the latency reduction, we also see two side effects from our change to a fragment length of 1.6 seconds.

As elaborated in detail in our article about fragment lengths, a fragment length of 4 seconds was a rather bad choice. In fact, we ended up with different fragment lengths for video and audio in DASH: 4 seconds for video and 3.2 seconds for audio, to match exact frame counts and audio block sizes per fragment. The consolidation to 1.6 seconds for everything now also makes life easier on the code side, as we no longer have to handle diverging fragment lengths.
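
A quick way to see why 1.6 seconds aligns nicely, assuming 25 fps video and 48 kHz AAC audio with 1024 samples per frame (typical values used here for illustration, not a statement about our exact encoder settings):

def frames_per_fragment(fragment_len_s, rate_hz, samples_per_frame=1):
    # Number of video frames (samples_per_frame=1) or AAC audio frames
    # (1024 samples each) that fit into one fragment.
    return fragment_len_s * rate_hz / samples_per_frame

for length_s in (4.0, 3.2, 1.6):
    video_frames = frames_per_fragment(length_s, 25)           # 25 fps video
    audio_frames = frames_per_fragment(length_s, 48000, 1024)  # 48 kHz AAC
    print(f"{length_s}s: {video_frames} video frames, {audio_frames} AAC frames")

# Output:
#   4.0s: 100.0 video frames, 187.5 AAC frames  (audio does not align)
#   3.2s: 80.0 video frames, 150.0 AAC frames   (aligns, hence the audio split)
#   1.6s: 40.0 video frames, 75.0 AAC frames    (aligns for both)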

The player will most likely also honour the smaller fragment length with a smaller stream start-up time. The threshold to start playback is typically not the same value that is configured as the target playback buffer; it’s actually much lower. Let’s run through an example where we have a player buffer configuration of 12 seconds. The start-up buffer in this example is just 2 seconds. In the case of 4-second fragment lengths, this will lead to requesting 4 seconds of content. In the case of 1.6-second fragment lengths, this will lead to requesting 3.2 seconds of content which should, in turn, be downloaded and displayed faster.
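
As a tiny illustration of that start-up arithmetic (a simplified model of the example above, not the logic of any particular player):

import math

def startup_download_s(startup_threshold_s, fragment_len_s):
    # A player can only request whole fragments, so it has to download the
    # smallest number of fragments that covers the start-up threshold.
    return math.ceil(startup_threshold_s / fragment_len_s) * fragment_len_s

print(startup_download_s(2.0, 4.0))  # 4.0 -> one 4-second fragment
print(startup_download_s(2.0, 1.6))  # 3.2 -> two 1.6-second fragments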

Follow up

With that being said, even though we are not using any special cutting-edge technology or inefficient protocols, this effort was still massive. It impacted the full chain, from ingest through encoding, transcoding and packaging to playback on all major devices. The playback part also contributes significantly to the latency reduction and comes with its own challenges. Those are described in an upcoming article: Smart Buffering — and the two types of player configurations.

This is the second part of a blog article trilogy that covers all aspects of our journey to low latency streaming.
The first part is about choosing a fragment size as the fragment size plays a crucial role in our low latency attempts: The definitive guide for picking a fragment length.
The third part deals with the client-side changes needed when approaching low latency streaming: Smart Buffering — and the two types of player configurations.
