A progressive approach to the past

How we moved to a modern transcode system at Vimeo while ensuring cheap and efficient backwards compatibility.

Derek Buitenhuis
Vimeo Engineering Blog
8 min read · Aug 18, 2023

The year is 2023, but the type of file that most video transcode systems output hasn’t appreciably changed since Obama was president, even though most playback no longer happens on files of that type. One of the most promising remedies to the problem was widely believed to be technically impractical, until we here at Vimeo got our hands on it. This article explains the origins of Artax, our solution for serving legacy progressive video on the fly.

Some history

Adaptive bitrate streaming, or ABR, has been the industry standard for video playback on the Internet for almost a decade now. Pretty much every player and device you’ll come across supports it. It’s great — you no longer need to sit and wait for enough of the video to buffer before playback can start; the player can choose resolutions and bitrates on the fly to give you the best experience possible.

The standard way ABR is served is via Common Media Application Format, or CMAF, fragments. CMAF is based on fragmented ISOBMFF — that’s ISO Base Media File Format, better known as MP4 — with audio and video tracks stored as separate files. These files are fragmented, in our case at six-second intervals, which means that every six-second fragment can be used as a switching point in playback software to change resolution, bitrate, and so on. Both HLS and MPEG-DASH support CMAF, as do almost all players that use their own manifest format.

Given this, our current transcode stack naturally outputs transcodes in this format, and we store them this way, rather than as progressive ISOBMFF, which is the sort where all of the audio and video data is stored in one contiguous piece at the back (or the front) of the file, with a single index for all of that data (the moov box, in ISOBMFF terms) usually at the very beginning of the file. Our old transcode system, like most transcode systems out there, output progressive files.
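To make the difference concrete, here’s a minimal sketch (not our production code) that walks the top-level boxes of an ISOBMFF file. On a progressive file it prints something like ftyp, moov, mdat; on a fragmented one, ftyp, moov, an optional sidx, and then repeating moof/mdat pairs.

```python
import struct
import sys

def top_level_boxes(path):
    """Yield (box_type, offset, size) for each top-level ISOBMFF box."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack(">I4s", header)
            if size == 1:
                # A 64-bit "largesize" follows the type field.
                size = struct.unpack(">Q", f.read(8))[0]
            elif size == 0:
                # The box extends to the end of the file.
                yield box_type.decode("latin-1"), offset, None
                break
            yield box_type.decode("latin-1"), offset, size
            offset += size
            f.seek(offset)

if __name__ == "__main__":
    for box_type, offset, size in top_level_boxes(sys.argv[1]):
        print(f"{box_type:>4} @ {offset:>12}  size={size}")
```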

The past is not actually the past

As a developer, after moving to a modern transcode system, you might not want to support old-style progressive ISOBMFF. Here’s why your customers might want you to reconsider:

  • To enable their users to download files for offline playback, where separate and fragmented audio and video tracks are poorly supported. This is by far the most common (and most legitimate) reason.
  • Their users put a direct video file URL into an HTML5 <video> tag, for use in, for example, content management systems that don’t support player embeds. This is the second most common reason — I didn’t say they were all good reasons.
  • They want to be able to import transcoded files into non-linear editing systems.
  • You designed a user-accessible API 15 years ago that included non-expiring video file links. Whoops.

Possible solutions

To solve this — providing fragmented ISOBMFF for ABR playback, and progressive ISOBMFF for the cases above — there are a few options, each with pros and cons.

Do two muxes of each encode (one progressive, and one fragmented), and store them

This doubles storage, and you may as well set that money on fire.

Do one set of encodes and store them as progressive ISOBMFF

With this option, every single file needs to be remuxed and segmented for playback at the content delivery network, or CDN, level. Given that the vast majority of traffic a given video sees is for ABR playback, this has a significant amount of compute and cache overhead. However, it is by far the easiest semi-sane route to implement, and most vendors do this, either via an in-house packager, or via the CDN’s.

We did this for many years, until recently. There is a great talk at HAProxy Conf by my co-worker Andrew Rodland on some of the challenges we faced.

Write a Very Clever Service that transparently proxies sets of fragmented ISOBMFF files as a single progressive ISOBMFF over HTTP

Since the progressive versions may see only one request, or very infrequent requests, this is efficient from a storage and cache cost point of view. It’s not that difficult to do if you are okay with doing it all at once on disk on the first request (at the price of very high time-to-first-byte latency, a large cache footprint, and extra cost).

It’s also not that difficult to do if you are okay with the moov box being at the end, meaning that no seeking is possible until the whole request is served and muxed. This means the Content-Length header and range requests would be unsupported.

However, it’s pretty difficult to do if you want the moov box at the start, with Content-Length set, and range requests supported from the first request. That, though, is exactly what’s required if you want to be able to reasonably use this service as the source for an HTML5 <video> tag.

Pretty much nobody does this because of the problems and engineering difficulty associated with the approach. Naturally, we chose to build it: a transparent proxy that would make any set of fragmented ISOBMFF files appear as a normal progressive ISOBMFF file with the moov box at the front, and all HTTP features supported.

The swamp of sadness

Sticking with the NeverEnding Story theme of our stack, we named our solution Artax, which seemed apt for a number of reasons.

It had to be performant and cost-effective: read only the exact bytes we need from the input files for a given range request, do expensive operations like reading all the moof boxes (of which fragmented ISOBMFF has one per fragment) only once, play nice with caching for our CDN partners (hence range request support being a must), and so on.

To achieve this, you can’t just throw the files through a remux, courtesy of one of the many muxers out there like FFmpeg, GPAC, or L-SMASH — they all seek around the input as they see fit and write as they see fit, even if you use custom I/O callbacks, making them a poor fit (well, a non-starter) for this use case. This makes sense when you think about it: they are demuxing and then muxing again, with all that entails, when what we want is more akin to deserializing and then partially serializing in a slightly different way (see Figure 1).

Figure 1. Simplified MP4 anatomy. We combine FMP4 video and audio and all their boxes into a single progressive MP4. The box structures get deserialized, combined, and reserialized in the final output.

With that in mind, we designed an approach.

The easy bit: deserializing, indexing, and caching

First, the easy bit: when the first request for a given progressive version comes to our origin, you request and parse the ftyp, moov, and sidx boxes from each input file’s header in parallel. (In our case, what’s happening is closer to deserialization than parsing.)
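As a rough sketch of that step, assuming each stored track is reachable over HTTP and leaving the actual box parsing aside, it might look something like the following. The URLs, the header-size guess, and the function names are illustrative, not Artax’s real API.

```python
import concurrent.futures
import urllib.request

# Illustrative inputs: the stored fragmented video and audio tracks.
INPUTS = [
    "https://storage.example/video_1080p.mp4",
    "https://storage.example/audio_128k.mp4",
]

# A guess large enough to cover ftyp + moov + sidx; a real implementation
# would verify it got the whole sidx and fetch more if the guess was short.
HEADER_BYTES = 256 * 1024

def fetch_range(url, start, end):
    """Fetch an inclusive byte range via an HTTP Range request."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:  # expects a 206 Partial Content
        return resp.read()

def fetch_headers(urls):
    """Fetch each input's head (ftyp + moov + sidx) concurrently."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(urls)) as pool:
        futures = {pool.submit(fetch_range, u, 0, HEADER_BYTES - 1): u for u in urls}
        return {futures[f]: f.result() for f in concurrent.futures.as_completed(futures)}
```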

It’s important to note two things regarding this step:

  • The fragmented ISOBMFF files you create and store need to use a single global sidx box that covers all fragments (known as VOD style), rather than one sidx box per fragment (known as live style). This lets you cheaply know the position of every fragment in the file without having to traverse the whole thing.
  • The moov boxes in fragmented ISOBMFF files don’t contain any of the indexed info about the packets in the file, such as offsets and frame properties, unlike the moov box in a progressive file. Each fragment has a moof box containing that fragment’s info.

Now that you have the sidx box from each input file, which contains the offsets to all the moof boxes in them, you can grab all of those at once in parallel — though, since this is a large number of requests, use a connection pool for efficiency reasons.
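Turning a sidx into fragment byte ranges is mechanical. Here’s a sketch that follows the ISOBMFF layout of the box; it assumes the common 32-bit box size and that the whole box is already in memory:

```python
import struct

def parse_sidx(box, box_file_offset):
    """Parse a sidx box (passed with its 8-byte header) into a list of
    (absolute_offset, size, duration_seconds) tuples, one per fragment."""
    size, box_type = struct.unpack_from(">I4s", box, 0)
    assert box_type == b"sidx"
    version = box[8]  # FullBox: 1 byte version, 3 bytes flags
    reference_id, timescale = struct.unpack_from(">II", box, 12)
    if version == 0:
        earliest_pts, first_offset = struct.unpack_from(">II", box, 20)
        pos = 28
    else:
        earliest_pts, first_offset = struct.unpack_from(">QQ", box, 20)
        pos = 36
    _reserved, reference_count = struct.unpack_from(">HH", box, pos)
    pos += 4

    # Referenced offsets are relative to the first byte after the sidx box.
    anchor = box_file_offset + size + first_offset
    fragments = []
    for _ in range(reference_count):
        ref, duration, _sap = struct.unpack_from(">III", box, pos)
        pos += 12
        referenced_size = ref & 0x7FFFFFFF  # top bit is reference_type
        fragments.append((anchor, referenced_size, duration / timescale))
        anchor += referenced_size
    return fragments
```

Each returned range covers one whole fragment (a moof plus its mdat); the moof sits at the start of each range, so you know exactly where to go grab them.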

At this point, there is enough information to construct the output’s moov box (which does contain all the packet info), and to directly calculate the exact output size for Content-Length, if needed. Note that knowing the size of all the boxes that appear in the output before the main mdat box (which contains all the packets) is important for being able to satisfy range requests as well, since all packet positions are offset by this amount.
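The size arithmetic is simple enough to sketch (the names here are illustrative):

```python
def output_layout(ftyp_size, moov_size, packet_sizes):
    """Compute the progressive output's Content-Length and the constant
    offset that every packet position in the output is shifted by.

    packet_sizes: sizes of all audio and video packets in their final
    interleaved output order, known from the input moof boxes.
    """
    MDAT_HEADER = 8  # 32-bit size + 'mdat'; 16 if a 64-bit largesize is needed

    # The output moov's size is knowable up front: its tables (stts, stsz,
    # stco/co64, and so on) have fixed-width entries, so its size doesn't
    # depend on the offsets that get written into it.
    header_size = ftyp_size + moov_size
    mdat_size = MDAT_HEADER + sum(packet_sizes)

    content_length = header_size + mdat_size
    first_packet_offset = header_size + MDAT_HEADER
    return content_length, first_packet_offset
```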

Of course, this structure is cached, so you only ever have to do it once, as it contains all the information you need to efficiently satisfy any range requested in the output file.

See Figure 2 for a flowchart of this process.

Figure 2. Flowchart of the input file deserialization and output index generation process. Input boxes are parsed in parallel and used to calculate the output box contents.

The hard bit: dealing with the packets

However, constructing and serializing the output moov is the simple part — satisfying arbitrary range requests within the output mdat box is trickier. You need to be able to figure out exactly which output packets the range request intersects and exactly where in each file those packets are read from on the input side; fragmented ISOBMFF files have one mdat box per fragment, after all. Further, you ideally want to use a single request per input file to grab all of the packets you need, even if there is a little waste from grabbing the moofs again in between. The benefits of a single request outweigh the small loss.
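A sketch of that mapping, assuming a cached per-packet index (the structure and names are illustrative, not what Artax actually stores), looks roughly like this:

```python
import bisect
from dataclasses import dataclass

@dataclass
class Packet:
    out_offset: int  # where this packet starts in the progressive output
    size: int
    source: str      # which input file it comes from ("video" or "audio")
    in_offset: int   # where its bytes live in that input file

def plan_range(packets, start, end):
    """For an inclusive output byte range within the mdat, return the packets
    it intersects, how many bytes to skip at the head of the first and the
    tail of the last, and one contiguous byte range to fetch per input file."""
    offsets = [p.out_offset for p in packets]  # packets sorted by out_offset
    first = bisect.bisect_right(offsets, start) - 1
    last = bisect.bisect_right(offsets, end) - 1
    hit = packets[first:last + 1]

    skip_head = start - hit[0].out_offset
    skip_tail = (hit[-1].out_offset + hit[-1].size - 1) - end

    # Collapse to one range per input file, even though that re-reads the
    # moof boxes sitting between the packets we actually need.
    spans = {}
    for p in hit:
        lo, hi = spans.get(p.source, (p.in_offset, p.in_offset + p.size))
        spans[p.source] = (min(lo, p.in_offset), max(hi, p.in_offset + p.size))
    return hit, skip_head, skip_tail, spans
```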

You also need to calculate how many bytes of the first and last written packets to ignore to satisfy the range, as well as know the exact position and state of the packet interleaving where this range starts:

  • Since the output progressive ISOBMFF contains audio and video interleaved inside its mdat box (500 ms interleaving in our case), you need to calculate exactly where in the interleaving process the intersecting packets lie.
  • You can either cache a packet location map on the first request, or rely on the fact that your interleaving is deterministic (for example, 12 video packets, then 24 audio packets) and calculate the position per request, which is a small amount of CPU, but far easier infrastructure-wise.
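Here’s what the deterministic-calculation option can look like, using the example numbers above. Real interleaving is done by duration (500 ms in our case), so the final partial block and differing track lengths need handling that’s omitted from this sketch:

```python
VIDEO_RUN = 12  # video packets per interleave block (example numbers from above)
AUDIO_RUN = 24  # audio packets per interleave block
BLOCK = VIDEO_RUN + AUDIO_RUN

def locate(n):
    """For the n-th packet in output order (0-based), return which track it
    comes from and its index within that track, assuming a fixed repeating
    pattern of VIDEO_RUN video packets followed by AUDIO_RUN audio packets."""
    block, pos = divmod(n, BLOCK)
    if pos < VIDEO_RUN:
        return "video", block * VIDEO_RUN + pos
    return "audio", block * AUDIO_RUN + (pos - VIDEO_RUN)
```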

How we fared

Putting this all together, though it may not sound as difficult as you were expecting, required extensive planning, careful bookkeeping, and working with CDN partners. For example, our CDN partner always makes a 0–0 range request upon the first request for a given progressive version, which enables us to deserialize and cache the input boxes and return a Content-Length to it, so that it can grab chunks of the file to cache at its edges as needed while starting the customer’s download or playback immediately.
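That probe and its answer are just ordinary HTTP; here’s a sketch of what our origin effectively returns for it (the header names are standard HTTP, the function itself is illustrative):

```python
def respond_to_probe(total_size, first_byte):
    """Answer the CDN's initial 'Range: bytes=0-0' request: a 206 carrying a
    single byte, whose Content-Range advertises the full progressive size."""
    headers = {
        "Content-Range": f"bytes 0-0/{total_size}",
        "Content-Length": "1",
        "Accept-Ranges": "bytes",
        "Content-Type": "video/mp4",
    }
    return 206, headers, first_byte  # first_byte: first byte of the ftyp box
```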

We quietly rolled this out to customers in January 2022, and well, nobody noticed at all. Which was, by all accounts, a sign of a great launch! It is a transparent proxy, after all. If you have downloaded a video from Vimeo since then and seen Vimeo Artax Video Handler in the handler_name field, you’ve got a file that went through this system without even knowing!

There was also a side benefit: it helped us kill off the last internal users of our previous, very old transcoding system, by letting services that expected progressive ISOBMFF output be silently proxied through this new service while they worked to move to a more modern workflow.

All in all

We had one of the quietest and most successful rollouts possible for a technically complex service. I could not be more proud of the teams involved — Video R&D and Video Platform, who brought this idea to fruition and to millions of users.

If you would like to learn more and see a demo (and a shaggier beard, as this was filmed during lockdown), I did a talk on this subject at London Video Tech in 2021, before Artax went to production.
