The definitive guide for picking a fragment length

Stefan Kaiser
Zattoo’s Tech Blog
Jun 22, 2021

No matter if you are using MPEG-DASH, HLS, Smooth Streaming, or CMAF, and no matter if you are just beginning to learn about streaming or working for a company that pushes a single-digit percentage of the world's internet traffic to users, there is one thing that will haunt you for the duration of your journey in video tech.

The Fragment Length

Yes, it's that small detail that raises the first question the moment you run an FFmpeg command to generate an MPEG-DASH stream out of your MP4.

If you are aiming for a streaming protocol like HLS (Apple) or Smooth Streaming (Microsoft), you might have read their recommendations on the fragment length. Each of those two protocols is driven by a single company, which also publishes proposed values to use:

  • HLS currently recommends 6-second fragments [1] (by the way, Stefan Lederer: the 10 seconds in [2] are outdated)
  • Smooth Streaming’s default is 2 seconds

They both have the same goal but are recommending a significantly different value here.

Short vs. Long

There are also a few articles around that talk about different fragment sizes and their impact.

The executive summary of what the Internet (e.g. [2, 3, 4]) says in favor of each fragment category is:

Short fragments:

  • faster adaptation / prevent stalls
  • lower latency
  • faster encoding

Long fragments:

  • fewer files
  • smaller playlist size
  • better encoding efficiency/quality

How serious are those arguments?

Of course: it depends. It depends completely on your encoding and streaming infrastructure and your use case.

As this post is coming out of Zattoo, let’s assume we are building a live TV streaming service. Additionally, let’s assume we have on-the-fly transcoding and on-the-fly packaging in place:

  • Faster adaptation is key in a live streaming scenario where you don’t have the possibility to buffer huge amounts of media in advance.
  • Lower latency gets a significantly higher weight as well due to the live streaming situation.
  • Faster encoding is something that we also like for live encoding as we need to meet a real-time constraint.
  • In our case, we can skip “fewer files”. Shorter fragments do of course still mean more files to some degree, but our storage architecture is built to serve exactly that purpose.
  • For playlists, gzip compression can be used, so you don't need to care too much about playlist size anymore (see the small sketch right after this list). DASH also has a good mechanism with template URLs to avoid bloated playlists, and HLS offers something similar since the rather new version 10: Playlist Delta Updates.
  • For live streaming, the goal of encoding is no longer the best encoding efficiency; it is encoding fast enough to meet the real-time constraint in the first place. Only once that is met can you start optimizing for efficiency.
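
To put a rough number on the playlist-size argument from the list above, here is a toy illustration in Python (the playlist is entirely synthetic; segment names and durations are made up for the example):

    import gzip

    # Build a synthetic HLS media playlist with 1000 segment entries.
    lines = ["#EXTM3U", "#EXT-X-VERSION:7", "#EXT-X-TARGETDURATION:2"]
    for i in range(1000):
        lines += ["#EXTINF:1.600,", f"segment_{i:05d}.m4s"]
    playlist = "\n".join(lines).encode("utf-8")

    compressed = gzip.compress(playlist)
    print(f"{len(playlist)} bytes raw -> {len(compressed)} bytes gzipped")
    # The repetitive segment lines compress down to a small fraction of the raw size,
    # which is why playlist size alone rarely decides the fragment length.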

Seeing this, it looks like we should go for the category of short fragments for the Zattoo live streaming case.
On the other hand, if you are going purely for VOD, the result may differ. Maybe you want to avoid building and maintaining an on-the-fly transcoding/packaging infrastructure and rather go for a pre-encoded bitrate ladder. This could easily make longer fragments the better fit.

All in all, we end up with many pros for short fragments, but we basically still need to evaluate one specific counterargument: lower latency vs. encoding quality. More on that later in this article.

Why encoding quality is impacted by the fragment size is described well enough in [2] and [3]. In short: the need for more frequent keyframes steals bitrate from temporal prediction (B-/P-frames) and thereby lowers the overall quality.

The counterpart is your encoding latency. You need to encode the full fragment (be it, for example, 2 or 6 seconds) before you can expose it. If you are streaming live content, this means the fragment length directly adds to your latency.

But there are already solutions to the encoding latency problem. Chunked CMAF and Low-Latency HLS aim to help out here. Both approaches allow exposing parts of a fragment before the encoder has fully finished it. On the player side, however, you still need a keyframe to start playback, so stream startup time will depend on your keyframe cadence, and your latency on the length of the exposed fragment parts.
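
As a back-of-the-envelope sketch of that difference (the 500 ms chunk duration below is only an assumed example, not a value from our setup):

    from typing import Optional

    def encoder_publish_delay(fragment_s: float, chunk_s: Optional[float] = None) -> float:
        """Rough worst-case time a media sample waits in the encoder/packager before
        a player can request it (network, CDN, and player buffers are ignored here)."""
        # Classic fragments: nothing is exposed until the whole fragment is encoded.
        # Chunked CMAF / LL-HLS parts: each chunk/part is exposed as soon as it is done.
        return fragment_s if chunk_s is None else chunk_s

    print(encoder_publish_delay(4.0))        # 4.0 -> classic 4 s fragments
    print(encoder_publish_delay(1.6))        # 1.6 -> classic 1.6 s fragments
    print(encoder_publish_delay(4.0, 0.5))   # 0.5 -> 4 s fragments exposed in 500 ms chunks
    # Startup is a separate matter: the player still needs a keyframe, so time to first
    # frame stays tied to the keyframe cadence (typically one keyframe per fragment).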

Valid fragment lengths

The generic decision on the length category of a fragment is only one part of the journey to pick the right fragment length. Another important step is to find the exact fragment length that is valid in all circumstances for your content. Some pitfalls need to be avoided.

At Zattoo we used to run streams with a fragment length of 4 seconds. Company history says this was due to the recommendation of the first streaming protocol we used: HTTP Dynamic Streaming (HDS) for the Flash player. And it stuck, even after introducing HLS version 1, Smooth Streaming, MPEG-DASH, HLS version 5, and, as the latest addition, yet another HLS variant, version 7.

With those 4-second fragments, we eventually ran into issues. Issues that are not related to latency or encoding quality: we faced A/V sync issues on some devices, and we knew the reason.

Video

With a fragment length you want to cover a specific amount of time, and in the streaming protocol specification you typically see references to video GOP sizes. This might be a rather easy pick if you only have video with a single frame rate, but that’s typically not the reality. As a live TV service, you potentially want to serve streams with 50 frames per second (fps) in your highest quality levels, but lower-quality levels might be 25 fps only. In our case, we even go lower than 25 for a few very low-resolution quality levels and provide them with 12.5 fps or 5 fps.

That looks good with our 4-second fragment length, since we get a whole number of frames to put into one fragment:

+------+-----------------------------+
| FPS  | frames per 4000 ms fragment |
+------+-----------------------------+
| 50   | 200                         |
| 25   | 100                         |
| 12.5 | 50                          |
| 5    | 20                          |
+------+-----------------------------+
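
A quick way to sanity-check such a table yourself (a minimal Python sketch; the frame rates and the 4000 ms length are simply the values from the table above):

    # Frames per fragment = fps * fragment length; valid only if it is a whole number.
    fragment_ms = 4000
    for fps in (50, 25, 12.5, 5):
        frames = fps * fragment_ms / 1000
        verdict = "ok" if frames.is_integer() else "not ok"
        print(f"{fps:>5} fps -> {frames:g} frames per fragment ({verdict})")
    # 50 -> 200, 25 -> 100, 12.5 -> 50, 5 -> 20: all whole numbers, so 4 s works for video.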

The use of 50/25 frames per second is a European thing, going back to the PAL TV standard. What about going abroad and taking the US market into account while still maintaining your existing infrastructure? NTSC is your history pal. The exact frame rates would be 60000/1001 and 30000/1001 if you follow the traditional standard. But to simplify your life as an OTT operator, you might want to tweak your HTTP streams to be delivered with rounder frame rates. In our case we ended up with 60/30 fps for the US. We also added low frame rate qualities to the bouquet again, specifically 15 fps and 5 fps.

+------+-----------------------------+
| FPS  | frames per 4000 ms fragment |
+------+-----------------------------+
| 60   | 240                         |
| 30   | 120                         |
| 15   | 60                          |
| 5    | 20                          |
+------+-----------------------------+

Audio

Okay, that also works. So what’s the problem we were talking about?
It’s the neglected friend of the video streaming scene: Audio!
The classic audio codec in streaming media is AAC at 48 kHz with a block size of 1024 samples.

+--------------------+-----------------------------------+
| codec / block size | audio frames per 4000 ms fragment |
+--------------------+-----------------------------------+
| AAC / 1024         | 187.5                             |
+--------------------+-----------------------------------+

There it is. The 0.5 that makes the difference. You end up with a non-integer number of audio frames per 4-second fragment. With that, you need to place 187 audio frames in even-numbered fragments and 188 audio frames in odd-numbered fragments. This alternating frame count will potentially bloat your manifests/playlists and confuse players. In the better case, it is hidden in a manifest that is not exactly accurate but most likely works. If it doesn't work, you might run into A/V sync issues because your player does not handle the alternation correctly. And even before that, you have already spent server-side effort on flipping the number of audio frames between odd and even fragments.

The fun thing is, AAC is not the only codec around; there are more audio codecs with different block sizes. In our case we also need to deal with Dolby Digital Plus (E-AC-3):

+--------------------+-----------------------------------+
| codec / block size | audio frames per 4000 ms fragment |
+--------------------+-----------------------------------+
| E-AC-3 / 1536      | 125                               |
+--------------------+-----------------------------------+

Luckily, in the case of 4 seconds, E-AC-3 does not cause extra trouble. Still, we now know that 4 seconds is not the perfect match for us and that we need to find a proper fragment length.
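
To make the audio math explicit, here is a minimal sketch (plain Python; 48 kHz sampling and the block sizes from the tables above are assumed):

    # Audio frames per fragment at 48 kHz: 48 samples per millisecond,
    # block_size samples per audio frame.
    fragment_ms = 4000
    for codec, block_size in (("AAC", 1024), ("E-AC-3", 1536)):
        frames = fragment_ms * 48 / block_size
        print(f"{codec}: {frames:g} audio frames per {fragment_ms} ms fragment")
    # AAC:    187.5 -> fragments have to alternate between 187 and 188 audio frames
    # E-AC-3: 125   -> fits evenly into 4 seconds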

Finding the sweet spot

Thinking about fragment lengths and what's recommended on the Internet, we see values between 1 and 10 seconds. Let's put all of those into one table and compare them by calculating the frame counts for the various frame rates and audio block sizes we discussed so far. We see that there is basically one match that fits all: a fragment length of 8 seconds.

+------------+---------+---------+---------+---------+---------+
| FPS or     | frames per fragment                             |
| block size | 1000 ms | 2000 ms | 3000 ms | 4000 ms | 5000 ms |
+------------+---------+---------+---------+---------+---------+
| video:     |         |         |         |         |         |
| 50         |      50 |     100 |     150 |     200 |     250 |
| 25         |      25 |      50 |      75 |     100 |     125 |
| 12.5       |    12.5 |      25 |    37.5 |      50 |    62.5 |
| 5          |       5 |      10 |      15 |      20 |      25 |
| 60         |      60 |     120 |     180 |     240 |     300 |
| 30         |      30 |      60 |      90 |     120 |     150 |
| 15         |      15 |      30 |      45 |      60 |      75 |
| 5          |       5 |      10 |      15 |      20 |      25 |
| audio:     |         |         |         |         |         |
| 1024       |  46.875 |   93.75 | 140.625 |   187.5 | 234.375 |
| 1536       |   31.25 |    62.5 |   93.75 |     125 |  156.25 |
+------------+---------+---------+---------+---------+---------+

+------------+---------+---------+---------+---------+----------+
| FPS or     | frames per fragment                              |
| block size | 6000 ms | 7000 ms | 8000 ms | 9000 ms | 10000 ms |
+------------+---------+---------+---------+---------+----------+
| video:     |         |         |         |         |          |
| 50         |     300 |     350 |     400 |     450 |      500 |
| 25         |     150 |     175 |     200 |     225 |      250 |
| 12.5       |      75 |    87.5 |     100 |   112.5 |      125 |
| 5          |      30 |      35 |      40 |      45 |       50 |
| 60         |     360 |     420 |     480 |     540 |      600 |
| 30         |     180 |     210 |     240 |     270 |      300 |
| 15         |      90 |     105 |     120 |     135 |      150 |
| 5          |      30 |      35 |      40 |      45 |       50 |
| audio:     |         |         |         |         |          |
| 1024       |  281.25 | 328.125 |     375 | 421.875 |   468.75 |
| 1536       |   187.5 |  218.75 |     250 |  281.25 |    312.5 |
+------------+---------+---------+---------+---------+----------+

That means 8 seconds would be the clear winner. The comparison also shows that the typical recommendations of 2 or 6 seconds do not result in whole frame counts for common types of content!

But we also wanted to go for a short fragment length, and 8 seconds definitely falls into the longer category. So what's the actual common denominator? 8 seconds? 800 milliseconds? Let's put it on the table again, and maybe also try 1600 milliseconds while we're at it ;)

+------------+---------+---------+---------+
| FPS or     | frames per fragment         |
| block size |  800 ms | 1600 ms | 8000 ms |
+------------+---------+---------+---------+
| video:     |         |         |         |
| 50         |      40 |      80 |     400 |
| 25         |      20 |      40 |     200 |
| 12.5       |      10 |      20 |     100 |
| 5          |       4 |       8 |      40 |
| 60         |      48 |      96 |     480 |
| 30         |      24 |      48 |     240 |
| 15         |      12 |      24 |     120 |
| 5          |       4 |       8 |      40 |
| audio:     |         |         |         |
| 1024       |    37.5 |      75 |     375 |
| 1536       |      25 |      50 |     250 |
+------------+---------+---------+---------+

Turns out that it's not 800 but 1600 milliseconds that makes the common denominator for the content specs we aim for (EU/US frame rates plus AAC/E-AC-3 audio). This exercise didn't lead to a single perfect fragment length, but it did lead to a set of possible fragment lengths to choose from, depending on your use case. You can pick between the shorter and longer fragments to balance the trade-off between latency, quality, and potential infrastructural decisions. Considering that 8 seconds is probably the maximum you would consider, we end up with a set of five fragment lengths:

+------------+---------+---------+---------+---------+---------+
| FPS or     | frames per fragment                             |
| block size | 1600 ms | 3200 ms | 4800 ms | 6400 ms | 8000 ms |
+------------+---------+---------+---------+---------+---------+
| video:     |         |         |         |         |         |
| 50         |      80 |     160 |     240 |     320 |     400 |
| 25         |      40 |      80 |     120 |     160 |     200 |
| 12.5       |      20 |      40 |      60 |      80 |     100 |
| 5          |       8 |      16 |      24 |      32 |      40 |
| 60         |      96 |     192 |     288 |     384 |     480 |
| 30         |      48 |      96 |     144 |     192 |     240 |
| 15         |      24 |      48 |      72 |      96 |     120 |
| 5          |       8 |      16 |      24 |      32 |      40 |
| audio:     |         |         |         |         |         |
| 1024       |      75 |     150 |     225 |     300 |     375 |
| 1536       |      50 |     100 |     150 |     200 |     250 |
+------------+---------+---------+---------+---------+---------+
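
If you want to repeat this search for your own content specs, the whole exercise boils down to a divisibility check. Here is a minimal Python sketch (Python 3.9+ for math.lcm; the frame rates and block sizes are the ones used throughout this article, and 48 kHz audio is assumed):

    from fractions import Fraction
    from math import lcm

    # Frame and audio-block durations in milliseconds, kept exact as fractions.
    frame_rates = ["50", "25", "12.5", "5", "60", "30", "15"]   # EU + US video ladder
    audio_blocks = {"AAC": 1024, "E-AC-3": 1536}                # samples per frame at 48 kHz

    durations = [Fraction(1000) / Fraction(f) for f in frame_rates]
    durations += [Fraction(b, 48) for b in audio_blocks.values()]

    # A fragment length T (integer milliseconds) holds a whole number of frames of
    # duration p/q (in lowest terms) exactly when p divides T, so the smallest valid
    # fragment length is the least common multiple of all numerators.
    smallest = lcm(*(d.numerator for d in durations))
    print(smallest)                               # -> 1600
    print([smallest * k for k in range(1, 6)])    # -> [1600, 3200, 4800, 6400, 8000]

It also confirms the observation from the bigger table above: the only multiple of 1600 ms below ten seconds that lands on a whole second is 8000 ms.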

What did we choose?

Our goal was to achieve low latency, which is also what initiated the intense search for a proper fragment length. We eventually decided to go with 1.6 seconds from now on.

At the beginning of this article, we discussed the trade-off in encoding quality that comes with shorter fragments. After implementing the new fragment length, we evaluated its impact. The result was a measurable decrease in the MOS metric of around 0.03. Even though this is measurable in the data, the change is not visible. [5] states that “an absolute difference of less than 0.05 (up to 0.07 if confidence bounds are taken into consideration) in MOS value can be considered insignificant since it cannot be perceived by users”.

Summary

Check your current content specs (frame rates, audio codecs) and what you might have planned for the future. Then do the math to find your matching fragment lengths and decide whether you want to go for a shorter or longer variant, depending on your use case. Everything else will lead you into trouble at some point.

This is the first part of a blog article trilogy that covers all aspects of our journey to low latency streaming.
The second part describes our general concept for reducing the latency of live streams and goes into detail on the server-side changes we made, as well as how end-to-end latency can be measured in an OTT streaming service: How to go low latency, without special tricks.
The third part deals with the client-side changes needed when approaching low latency streaming: Smart Buffering — and the two types of player configurations.

References

[1] https://developer.apple.com/documentation/http_live_streaming/hls_authoring_specification_for_apple_devices

[2] https://bitmovin.com/mpeg-dash-hls-segment-length/

[3] https://streaminglearningcenter.com/blogs/choosing-the-optimal-segment-duration.html

[4] https://blog.zazu.berlin/internet-programmierung/mpeg-dash-and-hls-adaptive-bitrate-streaming-with-ffmpeg.html

[5] https://link.springer.com/article/10.1007/s41233-018-0019-8
