Optimizing video playback performance

Published in Pinterest Engineering Blog, Aug 11, 2017

Norbert Potocki | Pinterest engineering manager, Video & Image Platform

Guaranteeing a great video playback experience for all Pinterest users is a big engineering challenge. In this post, we’ll discuss various aspects of playback performance and how our engineering team optimized it. Read on!

The importance of playback performance

Engagement with video is significantly impacted by how fast it performs. External analyses show that interruptions or slow video loading can lead users to abandon the experience. A three-second delay can cost you 13 percent of views. It takes a lot of infrastructure and fine-tuning to consistently meet a performance bar. This is especially true in places with poor network infrastructure. The type of smartphone and data plan a user has can also make a big difference.

Five symptoms of poor video performance

To make sound decisions and set meaningful performance goals we needed visibility into how video performs on all client platforms (web, Android and iOS) and across various geographical regions. To do this, we started working backwards by identifying how performance bottlenecks manifest themselves to Pinners and how we can measure them.

Symptom 1: Video loads slowly

How many times have you become restless while watching the spinner turn as you wait for a video to load? We’ve all been there. To capture and measure the scale of this issue, we use a metric called Perceived Wait Time (PWT). It measures the time between the user’s intent to see a video (clicking on a video, or scrolling an auto-playing video into the viewport) and the moment playback starts. There are a few other metrics we pay close attention to:

Time to first frame (TTFF): The time between the video request being sent and the moment the video is loaded (i.e. enough data is received to render the first frame). It can be significantly higher than PWT if the client implements prefetching or stores a copy of the video locally.
The number of adaptive bitrate (ABR) streaming segments needed for playback to start. (We’ll cover this in more detail later.)
Network-level metrics: Technical metrics for connection performance analysis. These include DNS resolution time, connection handshake time, download speed and bandwidth utilization.
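
As an illustration, PWT and TTFF can be derived from a handful of client-side timestamps. This is a minimal sketch, not our production instrumentation: the `PlaybackEvents` class and its field names are hypothetical, and it assumes all timestamps come from the same clock, in milliseconds.

```python
from dataclasses import dataclass


@dataclass
class PlaybackEvents:
    """Client-side timestamps in milliseconds (hypothetical field names)."""
    intent_ms: float             # user clicked, or video scrolled into the viewport
    request_sent_ms: float       # video request sent (before intent if prefetched)
    first_frame_ready_ms: float  # enough data received to render the first frame
    playback_started_ms: float   # playback actually started


def perceived_wait_time_ms(e: PlaybackEvents) -> float:
    """PWT: time from the user's intent to the start of playback."""
    return e.playback_started_ms - e.intent_ms


def time_to_first_frame_ms(e: PlaybackEvents) -> float:
    """TTFF: time from the video request to the first renderable frame."""
    return e.first_frame_ready_ms - e.request_sent_ms


# A prefetched video: the request went out two seconds before the user's intent.
prefetched = PlaybackEvents(
    intent_ms=2000.0,
    request_sent_ms=0.0,
    first_frame_ready_ms=1200.0,
    playback_started_ms=2050.0,
)
```

In this example the user waits only 50ms even though loading took 1200ms, which is the prefetching case where TTFF is much higher than PWT.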

*Waiting for a video to load can be annoying — displaying a spinner doesn’t help much.*

Symptom 2: Video pauses randomly

Have you ever seen a video suddenly pause halfway through? There are several reasons why this could happen, and we try to measure the issue with these metrics:

Number of stalls per play: How often video playback pauses and falls into a “buffering” state. This is weighted by the video length when we aggregate over multiple playback sessions.
Time-to-resume (lag length): How long it takes from the moment a user sees the stalled video to the moment it resumes.
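
A rough sketch of how these two metrics could be aggregated over several playback sessions. The session shape is hypothetical, and normalizing by length as "stalls per minute of video" is one possible reading of the weighting, not necessarily the exact formula we use.

```python
from typing import List, Tuple

# Each session: (video length in seconds, stall durations in seconds).
# This shape is an assumption made for the example.
Session = Tuple[float, List[float]]


def stalls_per_minute(sessions: List[Session]) -> float:
    """Total stalls divided by total video minutes, so long videos weigh more."""
    total_stalls = sum(len(stalls) for _, stalls in sessions)
    total_minutes = sum(length for length, _ in sessions) / 60.0
    return total_stalls / total_minutes if total_minutes else 0.0


def mean_time_to_resume(sessions: List[Session]) -> float:
    """Average stall duration (time-to-resume) across all sessions."""
    durations = [d for _, stalls in sessions for d in stalls]
    return sum(durations) / len(durations) if durations else 0.0


sessions = [(30.0, [1.0]), (90.0, [0.5, 1.5])]
```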

Symptom 3: Low picture quality

Another symptom of poor performance is low picture quality, which is commonly the result of insufficient bandwidth to deliver a higher quality video stream. It can also be the result of a poorly optimized video or lagging infrastructure. To identify which scenario applies, we review the following metrics:

Video variant usage: How many times each video variant is used, counted at the level of individual ABR segments.
Number of variant changes during playback: How often the player switches the variant shown to the user.
Bandwidth utilization: How much of the available bandwidth is used by the video stream.
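
To make these three metrics concrete, here is a small sketch that computes them from the sequence of variants a player used, one entry per ABR segment. The variant labels and function names are made up for the example.

```python
from collections import Counter
from typing import Dict, List


def variant_usage(segment_variants: List[str]) -> Dict[str, int]:
    """Count how many segments were played from each variant."""
    return dict(Counter(segment_variants))


def variant_changes(segment_variants: List[str]) -> int:
    """Count how often consecutive segments came from different variants."""
    pairs = zip(segment_variants, segment_variants[1:])
    return sum(1 for prev, cur in pairs if prev != cur)


def bandwidth_utilization(stream_bps: float, available_bps: float) -> float:
    """Fraction of the available bandwidth taken by the video stream."""
    return stream_bps / available_bps


# Variant used for each consecutive segment of one playback (hypothetical labels).
played = ["480p", "480p", "720p", "720p", "480p"]
```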

Symptom 4: When seeking (scrubbing) through a video it takes a while for it to resume

When you drag and release the scrubber, you expect the video to resume immediately. Often that’s not the case, and you have to wait for the playback to start again. We monitor this lag with a metric called Trick Mode Delay.

*Scrubbing should be a snappy experience.*

Symptom 5: Video is out of sync with audio or doesn’t play altogether

This behavior can be caused by a multitude of reasons, but very often it’s due to the end device being underpowered. To understand the issue better, we use information about device capabilities including data on hardware support for the codecs we use and device resource utilization during playback.

Our remedy: Well-tuned Adaptive Bitrate (ABR) streaming

To enable efficient streaming and mitigate some of the problems listed above, we use adaptive bitrate streaming, which is currently the industry standard for video delivery. Two of the most popular implementations of ABR streaming are Apple’s HLS and MPEG-DASH. Both of these technologies work by encoding a source video file into multiple streams with different bitrates. Those streams are then split into smaller segments of similar duration (e.g. a few seconds). The video player then seamlessly switches between the streams depending on the available bandwidth and other factors. This allows the video to keep playing (at a lower quality) when the signal is poor and jump to a higher quality stream when the signal strengthens.
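
The switching logic can be sketched as a simple throughput-based rule. This is only an illustration: real players combine bandwidth estimates with buffer levels and other signals, and the bitrate ladder and `safety_factor` below are assumptions, not our configuration.

```python
from typing import List


def select_bitrate(
    available_bps: List[int],
    measured_bandwidth_bps: float,
    safety_factor: float = 0.8,
) -> int:
    """Pick the highest stream bitrate that fits within a share of the bandwidth.

    Falls back to the lowest bitrate when nothing fits, so playback can continue.
    """
    budget = measured_bandwidth_bps * safety_factor
    affordable = [b for b in available_bps if b <= budget]
    return max(affordable) if affordable else min(available_bps)


# Hypothetical bitrate ladder in bits per second.
ladder = [400_000, 1_000_000, 2_500_000]
```

With this ladder, a measured bandwidth of 1.5 Mbps selects the 1 Mbps stream, and a drop to 300 kbps falls back to the 400 kbps stream rather than stopping playback.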

Tuning ABR streaming

The tricky part of working with ABR is fine-tuning the configuration for each of the streams so they perform at their best within the product. It’s worth mentioning that there’s no single configuration that can be universally applied to all products. It’s a highly product-specific setup which evolves over time. At Pinterest we deal mostly with short, pre-generated content (think VoD) that autoplays as users scroll through their home feed. It’s mostly viewed on mobile devices, where bandwidth can be limited. Let’s discuss which parameters can be tuned to satisfy those criteria.

The entire process is sort of “rinse and repeat” by nature. You select starting parameters, observe how playback performs on live traffic and adjust your configuration based on those signals. You may need to continue updating your configuration as the network infrastructure and device capabilities change over time. Collecting the metrics mentioned above does simplify the process.

Number of streams, resolutions and bitrates

Our initial decisions about how many streams to provide and which resolutions to use were based on two pieces of information. First, we had a good understanding of all product surfaces where a video could appear and their dimensions. Second, we used our knowledge of the network infrastructure available in the major markets we operate in to set initial bitrates.

Once we rolled out the initial setup, we started collecting metrics: PWT, TTFF, how often the video player switches streams and how closely our streams match the available bandwidth. Based on those signals we further tuned our setup to minimize PWT.

Frame rate

The frame rate you choose for your video streams significantly impacts the bandwidth needed for smooth playback. It may also alter the feel of the video. There are three commonly used frame rates for streaming.

24fps: Historically used in movie theaters. Can be a bit jerky on modern devices.
30fps: This is the standard for most streaming services, and it performs well for most types of content. Fast action sequences (e.g. sports) and situations when you want a “real life” effect (e.g. news, theater plays, nature videos) can benefit from a higher frame rate.
60fps: This is pretty close to the capacity of how much information the human eye can process. Going beyond this level normally doesn’t make much sense unless you intend on slowing down the video during playback. It’s often used for high-paced scenes and when you want a “real life” feel to your video. It’s not a great cinematic experience since it looks like real life and reveals too much detail (e.g. flaws in makeup, costumes, scenery). Thus viewers don’t feel the same “movie magic” as they would with 30fps (or even 24fps) movies. However, this is great for fast action gaming.

For our use case, 30fps works well as a rule of thumb. As our media corpus evolves, we’re planning to add 60fps variants to some media files. It’s worth mentioning that it makes sense to reduce the frame rate for the lowest quality bitrates. At Pinterest we re-encode those variants at 15fps, because we’ve found users are less satisfied if a video stalls than if it has a slightly jerky look.

*Choosing the best frame rate for your videos will help you achieve the desired visual effect.*

Segment size and duration

There are many recommended segment durations found on the internet. For a long time Apple’s recommended segment duration was 10s, and they recently revised it to 6s for HLS streams. The segment duration that works best for Pinterest is 4s. We used the following guidelines to make this decision:

Some of the video players (most notably iOS AVFoundation) load a few video segments before the playback can actually start. For high-quality streams and long segments, this can result in a very high PWT.
The longer the segment is, the longer the playback will take to adjust to changing network conditions.
Making segments very short (e.g. 1s) will result in a lot of requests to the server which increases network and processing overhead.
As a rule of thumb, each segment should be smaller than 1MB. This results in a lower eviction rate with most CDN providers.
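
The trade-offs above come down to simple arithmetic. The sketch below assumes a constant average bitrate (video plus audio) and reads "1MB" as 10^6 bytes; real segments vary in size around this estimate, and the function names are illustrative.

```python
import math

MAX_SEGMENT_BYTES = 1_000_000  # "1MB" read as 10^6 bytes (an assumption)


def segment_size_bytes(total_bitrate_bps: int, segment_duration_s: float) -> float:
    """Approximate segment size: bits per second times seconds, divided by 8."""
    return total_bitrate_bps * segment_duration_s / 8


def max_bitrate_bps(segment_duration_s: float) -> float:
    """Highest average bitrate that keeps a segment under the size limit."""
    return MAX_SEGMENT_BYTES * 8 / segment_duration_s


def requests_per_video(video_duration_s: float, segment_duration_s: float) -> int:
    """Number of segment requests needed to play the whole video."""
    return math.ceil(video_duration_s / segment_duration_s)
```

Under these assumptions, a 4s segment stays below 1MB up to roughly 2 Mbps, while a 10s segment only fits up to 800 kbps. A one-minute video needs 15 requests with 4s segments and 60 requests with 1s segments.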

Video and audio codecs

When deciding which codec to use, there are two limiting factors:

First is the technology you use for streaming. For example, HLS forces you to stick to the h.264 video codec and one of a few supported audio codecs. DASH can be more flexible because it’s codec-agnostic.
The second factor is the capabilities of users’ devices. Pinterest is available globally, and many of our users have older devices with limited processing power. For their sake, we use the most widely supported configuration: the h.264 codec, main profile, at level 3.1, accompanied by the HE-AAC v1 and v2 codecs for audio channels.

Fine tuning

A few final minor adjustments included selecting the quality of the stream used for the first segment playback. To minimize PWT we use a medium-quality stream for the first four seconds of playback instead of going with a high-quality one.
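
One way to influence the starting stream with HLS is the order of variants in the master playlist, since many HLS players begin with the first variant listed. The sketch below is an assumption about how this could be done, not a description of our pipeline; the bitrates, resolutions and URIs are placeholders.

```python
from typing import List, Tuple

# (bandwidth in bits per second, resolution, playlist URI); all values are placeholders.
Variant = Tuple[int, str, str]


def master_playlist(variants: List[Variant], first_uri: str) -> str:
    """Build an HLS master playlist with the chosen variant listed first."""
    # Stable sort: the chosen variant moves to the front, the rest keep their order.
    ordered = sorted(variants, key=lambda v: v[2] != first_uri)
    lines = ["#EXTM3U"]
    for bandwidth, resolution, uri in ordered:
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}")
        lines.append(uri)
    return "\n".join(lines) + "\n"


variants = [
    (400_000, "320x180", "low.m3u8"),
    (1_000_000, "640x360", "medium.m3u8"),
    (2_500_000, "1280x720", "high.m3u8"),
]
playlist = master_playlist(variants, first_uri="medium.m3u8")
```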

Another setting forces our encoder to place an IDR frame at the beginning of each segment. This means the video player doesn’t have to load an entire segment before it starts rendering frames, which greatly reduces the PWT and time-to-resume metrics.
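
For fixed-duration segments, this requirement translates into a keyframe interval of frame rate times segment duration. The helper below is a hypothetical check, assuming a constant frame rate and a whole number of seconds per segment, that every segment boundary lands on a keyframe.

```python
from typing import Iterable


def keyframe_interval_frames(fps: int, segment_duration_s: int) -> int:
    """Frames between forced keyframes so each segment starts with one."""
    return fps * segment_duration_s


def segments_start_on_keyframes(
    keyframe_indices: Iterable[int],
    fps: int,
    segment_duration_s: int,
    total_frames: int,
) -> bool:
    """Return True if the first frame of every segment is a keyframe."""
    interval = keyframe_interval_frames(fps, segment_duration_s)
    keyframes = set(keyframe_indices)
    return all(start in keyframes for start in range(0, total_frames, interval))
```

For example, 4s segments at 30fps need a keyframe every 120 frames, and the 15fps variants need one every 60 frames.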

We also ensure that trick mode is supported and enabled in our streaming setup. Both HLS and DASH support this feature, which enables performant scrubbing and minimizes the Trick Mode Delay metric.

The final optimization is making sure our delivery infrastructure is extremely fast. To guarantee this we use several CDN providers and select the one that’s currently the fastest for each user.
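
A simplified sketch of what "select the fastest" could look like. It assumes recent latency samples are already available per CDN for a given user or region; the CDN names, the sample data and the use of the median are illustrative choices, not our selection logic.

```python
from statistics import median
from typing import Dict, List


def pick_fastest_cdn(recent_latencies_ms: Dict[str, List[float]]) -> str:
    """Return the CDN with the lowest median latency among those with samples."""
    candidates = {
        name: median(samples)
        for name, samples in recent_latencies_ms.items()
        if samples  # skip CDNs we have no measurements for
    }
    if not candidates:
        raise ValueError("no latency samples available")
    return min(candidates, key=candidates.get)


# Hypothetical measurements in milliseconds.
samples = {
    "cdn-a": [120.0, 110.0, 400.0],
    "cdn-b": [95.0, 90.0, 100.0],
    "cdn-c": [],
}
```

Using the median rather than the mean keeps a single slow request (the 400ms sample above) from dominating the comparison.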

Final thoughts

Making video playback performant in a large-scale product requires a lot of tuning. We highly encourage you to play with different streaming technologies and settings to see what setup works best for your product. We’d also love to discuss the results with you, so feel free to reach out to our Video & Image Platform team!

Acknowledgements: Video & Image Platform team members: Rui Zhang, Jared Wong, Tianyu Lang, Nick DeChant as well as other contributors: Josh Enders, Kynan Lalone from Traffic team and Dom Bhuphaibool, Bin Liu and Edmarc Hedrick from Core Experience team.