What’s new at Vimeo behind the curtains

A look at recently added multimedia features.

Raphaël Zumer
Vimeo Engineering Blog
9 min read · Dec 14, 2023


In the past several months, we’ve been sneakily working on supporting and improving our coverage of new (and old) features and standards in the video, audio, and image spaces at Vimeo. This post covers some of the recent features that we rolled out and haven’t yet communicated broadly.

Dynamic HDR video with HDR10+

Complementing our integration of Dolby Vision in 2021, we’re bringing the other major dynamic high-dynamic-range video format to Vimeo: HDR10+. Like Dolby Vision, HDR10+ standardizes dynamic metadata that can modify lighting properties as needed, from scene to scene or even frame to frame. In contrast to static HDR metadata, which is fixed for the whole video and limits the level of detail that can be represented across its scenes, dynamic metadata enables videos with diverse visual content to look their best: any given frame can use the full dynamic range of the display, improving the contrast within darker or brighter scenes without compromising the quality of the rest of the video.

Any system that supports HDR display and HEVC decoding will receive HDR10+ videos by default from us, although not all displays can render the dynamic parameters — they need to implement HDR10+ technology specifically to do that. Since HDR10+ is a backwards-compatible extension of the HDR10 format with a simple implementation, incompatible displays simply ignore the dynamic component and fall back on static metadata.

You can find a list of HDR10+-certified products on the official standard website. To capture HDR10+ video on compatible devices like smartphones, you may need to toggle it in the settings. A successful HDR10+ upload is labeled as such next to the video title on Vimeo, either in the player (see Figure 1) or below it on the watch page.

Figure 1. Vimeo player with an HDR10+ label.

As of this writing, we support HDR10+ files only in the HEVC format with dynamic metadata embedded in the video stream at the codec level, which is the most common distribution format, although we’re always on the lookout for improvements to our format and device coverage.

We also don’t take dynamic metadata into account when tonemapping (compressing the dynamic range), so it won’t affect the look of the video on a display with no HDR support at all, or in lower-resolution renditions. While dynamic tonemapping helps maintain fidelity in brightness and contrast, in SDR there are significant tradeoffs that limit its appeal. With a limited dynamic range available to begin with, the benefits of dynamic metadata for improving contrast are minimized. Pulling the dynamic range towards a darker or brighter range without adequate support on the display also leads to flattened areas of low-contrast highlights and shadows. Finally, tonemapping in SDR would lead to inconsistencies between the look of SDR renditions and regular HDR, since HDR panels without HDR10+ support ignore the dynamic metadata rather than attempting to tonemap it themselves. Given the limited device support for dynamic HDR formats, typical dynamic HDR content is mastered with this caveat in mind and should look good with static metadata alone.
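If you’d like to confirm that a file carries codec-level HDR10+ metadata before uploading, here’s a minimal sketch using ffprobe’s JSON output. The script name and sample path are hypothetical, and the exact side-data label can vary between FFmpeg versions, so treat the substring match as a heuristic:

# hdr10plus_check.py - minimal sketch; requires ffprobe on PATH.
# The side-data label matched below follows recent FFmpeg builds
# and may differ in older versions.
import json
import subprocess

def has_hdr10plus(path: str, frames_to_check: int = 10) -> bool:
    """Return True if any sampled frame carries HDR10+ dynamic metadata."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "quiet",
            "-select_streams", "v:0",
            "-read_intervals", f"%+#{frames_to_check}",  # only the first N frames
            "-show_frames", "-of", "json",
            path,
        ],
        capture_output=True, text=True, check=True,
    )
    frames = json.loads(result.stdout).get("frames", [])
    for frame in frames:
        for side_data in frame.get("side_data_list", []):
            # SMPTE ST 2094-40 is the HDR10+ dynamic metadata standard.
            if "2094-40" in side_data.get("side_data_type", ""):
                return True
    return False

if __name__ == "__main__":
    print(has_hdr10plus("my_hdr10plus_video.mp4"))  # hypothetical file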

To visualize the relationship between dynamic metadata and tonemapping, below is a comparison between frames of the HDR10+ Dynamic Metadata Test from FF Pictures, tonemapped and rendered in SDR via libplacebo, along with their respective luminance histograms. When dynamic metadata is ignored (just as when viewing the video on an HDR display without software tonemapping), Figure 2 looks most similar to the source.

Figure 2. HDR10+ dynamic metadata test for ideal tonemapping, rendered in SDR, with a luminance histogram.

The alternate shots have contrast and brightness altered to an extreme degree, which transforms the look of the scene and emphasizes highlights and shadows. When the video is viewed on a supported HDR10+ display, this is achieved without any loss of detail, but when tonemapped to SDR, Figures 3 and 4 show that the luminance histograms are compressed, crushing the brighter and darker parts of the image respectively and destroying all texture in those areas.

Figure 3. HDR10+ dynamic metadata test for increased contrast, rendered in SDR, with a luminance histogram showing compressed highlights.
Figure 4. HDR10+ dynamic metadata test for decreased brightness, rendered in SDR, with a luminance histogram showing compressed shadows.
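To make the tradeoff concrete, below is a toy numpy sketch of how a simple Reinhard-style tonemapping curve behaves when it’s driven by the static mastering peak versus a per-frame peak of the kind dynamic metadata provides. With the per-frame peak, a dark frame spreads across the full SDR range instead of staying compressed near black. This is purely illustrative and unrelated to our production tonemapper, which (as noted above) ignores dynamic metadata:

# Toy illustration of static vs. per-frame tonemapping peaks,
# not Vimeo's pipeline. Luminance values are in nits (cd/m^2).
import numpy as np

SDR_PEAK = 100.0  # a common reference level for SDR white

def tonemap(lum_nits: np.ndarray, peak_nits: float) -> np.ndarray:
    """Normalized Reinhard curve: maps [0, peak_nits] onto [0, SDR_PEAK]."""
    x = lum_nits / SDR_PEAK
    p = peak_nits / SDR_PEAK
    y = (x / (1.0 + x)) / (p / (1.0 + p))  # scaled so the peak maps to 1.0
    return np.clip(y, 0.0, 1.0) * SDR_PEAK

# A dark frame whose brightest pixel is only 120 nits.
frame = np.array([1.0, 10.0, 60.0, 120.0])
static = tonemap(frame, peak_nits=1000.0)  # static mastering peak
dynamic = tonemap(frame, peak_nits=120.0)  # per-frame peak

print(np.round(static, 1))   # [ 1.1 10.  41.2 60. ] - shadows stay near black
print(np.round(dynamic, 1))  # [ 1.8 16.7 68.8 100.] - frame fills the SDR range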

Immersive audio with surround and ambisonics

We’ve given a much-needed upgrade to our audio pipeline by enabling surround audio tracks with standard 5.1-channel and 7.1-channel layouts. Eligible videos newly uploaded to Vimeo receive these formats automatically. We attempt to convert any track with more than two channels to 5.1-channel audio, and any track with more than six channels to 7.1-channel audio as well, retaining audio directionality at a higher fidelity for immersive content. All major browsers support playback of 5.1-channel audio, since we provide AAC as a fallback format for those that don’t support the Opus codec in MP4 containers; that isn’t the case for 7.1-channel audio, however, so Safari and Internet Explorer users are limited to 5.1 AAC for now.
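As a rough sketch of the kind of remix involved, here’s one way to convert a multichannel track to a 5.1 layout with FFmpeg’s built-in channel remapping. This stands in for, and doesn’t reproduce, our internal pipeline, and the filenames are hypothetical:

# Rough sketch of a multichannel-to-5.1 remix; requires ffmpeg on PATH.
# Filenames are hypothetical; this is not Vimeo's internal pipeline.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-i", "input_8ch.mov",
        "-c:v", "copy",                        # leave the video stream untouched
        "-af", "aformat=channel_layouts=5.1",  # remix audio to a 5.1 layout
        "-c:a", "aac",                         # widely supported fallback codec
        "output_5_1.mp4",
    ],
    check=True,
)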

Vimeo 360 is getting some love as well: spatial audio, which tracks the camera’s orientation when viewing 360° spherical videos, is now preserved for first-order and even second-order ambisonic tracks for the highest fidelity in immersive video. Ambisonic audio is typically captured using a specialized microphone array. The vast majority of ambisonic microphones are designed for first-order capture, corresponding to four channels of output: omnidirectional (the sound level captured in every direction), front-back, left-right, and top-bottom. Second-order ambisonic microphones are rarer, but provide five more channels of audio for higher definition. Figure 5 shows the channel orientations for ambisonic orders 0 (meaning monophonic) to 3. Each ambisonic order includes all the channels of the orders below it, in addition to the ones displayed on its corresponding row.

Figure 5. Spherical harmonics up to degree 3, as used in third-order ambisonics. Image by Dr. Franz Zotter, CC BY-SA 3.0.

Since directional information is known at the source with ambisonic audio, playback can accommodate variable speaker positions. In spatial video like Vimeo 360, the locations of our speakers are virtual, based on the orientation of the camera or VR headset. As we move those virtual speakers by adjusting the view of the video, the ambisonic channels are interpolated to match the viewer’s perspective and provide a more lifelike experience.
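For intuition, here’s a minimal numpy sketch of that interpolation for a first-order track in ACN channel order (W, Y, Z, X). Rotating the view by a yaw angle only mixes the X and Y components, since they transform like the horizontal coordinates of the sound direction. Sign conventions vary between tools, so treat this as illustrative rather than a drop-in implementation:

# Minimal sketch: rotate a first-order ambisonic field (ACN order:
# W, Y, Z, X) by a yaw angle. Sign conventions differ between tools.
import numpy as np

def rotate_yaw(foa: np.ndarray, yaw_radians: float) -> np.ndarray:
    """foa has shape (4, n_samples); returns the rotated sound field."""
    w, y, z, x = foa
    c, s = np.cos(yaw_radians), np.sin(yaw_radians)
    # W (omnidirectional) and Z (up-down) are unaffected by yaw;
    # X (front-back) and Y (left-right) rotate like 2D coordinates.
    return np.stack([w, c * y - s * x, z, c * x + s * y])

# Example: turn the virtual listener 90 degrees.
field = np.random.default_rng(0).normal(size=(4, 48000))
rotated = rotate_yaw(field, np.pi / 2)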

There are a few ways to produce ambisonic audio tracks, but the preferred format today, and the one that should be used when uploading spherical videos with spatial audio, is B-format, using SN3D normalization and ACN channel ordering — a combination that is sometimes referred to as AmbiX (although technically this is a distinct audio format). For those who need to provide additional audio content that isn’t part of the 360° video environment, such as narration, this content can be embedded as additional head-locked (fixed-position) stereo channels within the same track.

We recommend encoding ambisonic tracks in the Opus format using a mapping family value of 2, since it provides codec-level support for those channel layouts, but AAC tracks are also processed as long as the files include container-level metadata via the SA3D box defined in the Spatial Audio RFC draft from Google. As with HDR10+, successful ambisonic uploads are labeled on Vimeo near the title of the video, as shown in Figure 6.
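For example, a first-order AmbiX WAV can be encoded that way with FFmpeg’s libopus encoder, assuming a build whose libopus supports ambisonic channel mappings; the filenames here are hypothetical:

# Sketch: encode a 4-channel first-order AmbiX WAV to Opus with
# channel mapping family 2. Requires an FFmpeg build whose libopus
# supports ambisonics; filenames are hypothetical.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-i", "ambix_first_order.wav",
        "-c:a", "libopus",
        "-mapping_family", "2",  # codec-level ambisonic channel layout
        "ambix_first_order.opus",
    ],
    check=True,
)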

Figure 6. Vimeo player with a 360 Ambisonic label.

Finally, as with 7.1-channel surround audio, the browser needs to support Opus in MP4 for playback, since we convert all ambisonic tracks to that format. To provide audio on unsupported browsers, we downmix the ambisonic track to 0th-order by eliminating the directional channels and leaving only the omnidirectional information. Then we encode it as a mono track, which makes it possible for affected users to watch a 360° video without the directional audio information.
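In other words, the fallback keeps only channel 0, which is W in ACN ordering. A tiny numpy sketch of that reduction, assuming a (channels, samples) array:

# Sketch: fall back from ambisonics to mono by keeping only the
# omnidirectional W channel (channel 0 in ACN ordering).
import numpy as np

def ambisonic_to_mono(ambi: np.ndarray) -> np.ndarray:
    """ambi has shape (n_channels, n_samples); returns (n_samples,)."""
    return ambi[0]  # W carries the non-directional sound level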

For more details on playback requirements for Vimeo 360, see our Help Center.

An AVIF update

We’ve made some updates to images recently to enhance quality and performance when serving AVIF. The magnitude of these improvements depends on the properties of the source image, but overall this means more consistent and lower encoding times, as well as better detail preservation across the board.

In our initial benchmarks, encoding time was the main downside of AVIF compared to legacy image formats. Some of the added latency is offset by lower load times thanks to superior compression efficiency, and the large majority of images in a typical browsing session are served directly from the CDN cache, which makes image processing time irrelevant, so we considered this a reasonable tradeoff. However, there are corner cases where images are still encoded dynamically, so fetching a large image like a full-size 4K video thumbnail could still negatively affect load times.

As we’ve broadened our usage of dynamic image format selection, the amount and variety of AVIF content that we serve has also expanded, leading to higher peak load times for images. As we’re mindful of the need to keep performance reasonable across the Vimeo platform, we recently worked to lower the 99th-percentile response time for our image server, mainly by developing improved benchmarking tools and making our AVIF encodes smarter.

While our AVIF encoding stack (libavif backed by the libaom encoder) is capable of parallel processing via tile threading, there’s a limit to how well this strategy scales, due to both resource constraints and diminishing returns in encoder performance as the number of tiles increases. To navigate this, we needed a way to evaluate tradeoffs between compression efficiency, encoding speed, and output image quality, so we first developed new benchmarking tools to collect and visualize performance across those three metrics. This confirmed that while our AVIF encodes yielded lower file sizes and higher quality than other formats, response time was becoming an issue at higher resolutions. We then tweaked our encoding parameters iteratively to flatten that curve while ensuring that our AVIF outputs remain as good as or better than JPEG and WebP in the other dimensions.
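As an illustration of the knobs involved, avifenc exposes threading, speed, tiling, and quantizer range directly. The values below are arbitrary examples, not our production settings:

# Illustration of the avifenc knobs discussed above; parameter values
# are arbitrary examples, not Vimeo's production settings.
import subprocess

subprocess.run(
    [
        "avifenc",
        "--jobs", "4",                 # worker threads
        "--speed", "7",                # 0 (slowest/best) to 10 (fastest)
        "--tilerowslog2", "1",         # 2^1 = 2 tile rows ...
        "--tilecolslog2", "1",         # ... x 2 tile columns = 4 tiles
        "--min", "18", "--max", "34",  # quantizer range (quality/size tradeoff)
        "thumbnail.png", "thumbnail.avif",
    ],
    check=True,
)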

As a result, we’re now seeing AVIF scale better with resolution than other formats in many cases, with no degradation in quality! In addition, thanks to our benchmarking work and aided by performance improvements to our encoding stack in recent years, we were able to prioritize compression efficiency at lower resolutions, where response time is not as much of a concern, yielding even lower file sizes and better-looking thumbnails without sacrificing performance in a typical session.

Figure 7 illustrates some local benchmarks, which demonstrate AVIF’s competitiveness on a diverse set of images sampled from our vimeo-corpus-10s data set, at resolutions ranging from downsampled 360p to Ultra HD. AVIF is more than ever the clear winner when it comes to optimizing for file size and quality, and encoding speed is competitive with mozjpeg and libwebp in many cases, sometimes even faster depending on the image characteristics.

Figure 7. 2D convex-hull comparisons of JPEG, WebP, and AVIF as functions of size vs. speed, size vs. quality, and speed vs. quality, showing higher quality and more scalable, if less predictable, encoding speed for AVIF.

Thanks to broader adoption of the format in the most popular desktop and mobile browsers, as well as broader use of dynamic image format selection across Vimeo services since our initial release, about two thirds of all images served on Vimeo today are AVIF, up from less than 40 percent at release. The full distribution of image formats served on Vimeo in the 18 months following the initial AVIF release is shown in Figure 8.

Figure 8. Image formats served on Vimeo from June 2021 to December 2023, following the initial AVIF release, with AVIF rising from below 40 percent to above 60 percent of images served.

When fetching images from Vimeo outside of a typical web browser environment (through the API or otherwise), be sure to include the values image/avif or image/webp explicitly in the Accept header of your requests in order to receive these respective formats. Otherwise, your client will fall back to JPEG, which is ubiquitous but far less performant.
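For instance, with Python’s requests library (the URL below is a placeholder, not a real thumbnail):

# Sketch: request an AVIF rendition explicitly; the URL is a placeholder.
import requests

response = requests.get(
    "https://i.vimeocdn.com/video/some-thumbnail.jpg",  # placeholder URL
    headers={"Accept": "image/avif,image/webp,*/*"},    # advertise AVIF first
)
print(response.headers.get("Content-Type"))  # e.g. "image/avif" if honored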

Looking forward

We’re continually exploring opportunities to enhance the quality of experience on Vimeo and broaden our reach by incorporating the latest technologies in web multimedia. While we can’t reveal all the specifics of our current undertakings, our focus remains on gradual refinements in image, audio, and video quality, improvements in performance, and the addition of support for new devices, formats, and features, while leveraging AI technologies* along the way. There’s much to anticipate in the pipeline, so stay tuned for upcoming updates.

* For example, by writing conclusions to our blog posts with ChatGPT.
