Shot-based encoding

For the remainder of this tech blog, we assume the reader is familiar with the basics of adaptive streaming, such as:

Multiple coded representations of the same visual content in different resolutions and/or qualities, using a basic unit of processing, referred to as a “streaming segment”

Delivery of encoded segments that belong to different representations from a server, as requested by a streaming client, in order to accommodate varying channel conditions (bitstream switching)

Temporal alignment of segments among different coded representations of the same visual content to allow bitstream switching

Interested readers can refer to a number of available adaptive streaming tutorials, such as this Wiki page [7].

Chunked encoding

In a previous Netflix tech blog [8], published in Dec. 2015, we described how encoding in the cloud benefits greatly from “chunked” encoding. This means breaking a long video sequence, e.g. one of 1 hour duration, into multiple chunks, each of a certain duration: for example, 20 chunks, each 3 min. long. We then encode each chunk independently with a certain encoding recipe and concatenate, or “assemble”, the encodes to obtain an encoded version of the entire video sequence.
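The split, encode, and assemble steps can be sketched as follows. This is only an illustration under simple assumptions: chunks are fixed-length and measured in whole seconds, and `split_into_chunks`, `encode_chunk`, and `assemble` are hypothetical names, with the "encoder" returning a text label rather than a bitstream.

```python
def split_into_chunks(total_seconds, chunk_seconds):
    """Return (start, end) times in seconds that cover [0, total_seconds)."""
    if total_seconds <= 0 or chunk_seconds <= 0:
        raise ValueError("durations must be positive")
    chunks = []
    start = 0
    while start < total_seconds:
        # The last chunk may be shorter than chunk_seconds.
        end = min(start + chunk_seconds, total_seconds)
        chunks.append((start, end))
        start = end
    return chunks


def encode_chunk(chunk, recipe):
    """Hypothetical stand-in for an encoder call; returns a label only."""
    start, end = chunk
    return f"{recipe}[{start}-{end}s]"


def assemble(encoded_chunks):
    """Concatenate the independently produced encodes in presentation order."""
    return "|".join(encoded_chunks)


chunks = split_into_chunks(3600, 180)  # 1 hour split into 3-minute chunks
encoded = [encode_chunk(c, "recipe_A") for c in chunks]
assembled = assemble(encoded)
print(len(chunks), chunks[0], chunks[-1])  # 20 (0, 180) (3420, 3600)
```

Because each chunk is encoded without reference to its neighbours, the list comprehension above could be replaced by any parallel or distributed map without changing the assembled result.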

Among the advantages of chunked encoding, the most important is that it allows a robust system to be built in the cloud using software video encoding. If a cloud instance fails to complete a certain encode, only the corresponding chunk needs to be re-processed, instead of restarting an entire hour-long video encode. Chunked encoding also reduces end-to-end delay, since different chunks can be encoded in parallel, so the throughput of the overall encoding system scales with the number of instances available.
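A minimal sketch of these two properties, per-chunk retry and parallel encoding, is shown below. It assumes that local threads stand in for cloud instances and that a failure surfaces as a `RuntimeError`; `encode_with_retry`, `encode_all`, and `flaky_encode` are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_with_retry(chunk_id, encode_fn, max_attempts=3):
    """Encode one chunk, re-running only this chunk if an attempt fails."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return chunk_id, encode_fn(chunk_id), attempt
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(
        f"chunk {chunk_id} failed after {max_attempts} attempts"
    ) from last_error


def encode_all(chunk_ids, encode_fn, workers=4):
    """Encode chunks in parallel; results come back in chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(lambda cid: encode_with_retry(cid, encode_fn), chunk_ids)
        )


# Hypothetical flaky encoder: chunk 7 fails on its first attempt only.
_failed_once = set()


def flaky_encode(chunk_id):
    if chunk_id == 7 and chunk_id not in _failed_once:
        _failed_once.add(chunk_id)
        raise RuntimeError("simulated instance failure")
    return f"encoded-{chunk_id}"


results = encode_all(range(20), flaky_encode)
retried = [cid for cid, _, attempts in results if attempts > 1]
print(retried)  # [7]
```

Only chunk 7 is encoded twice; the other 19 chunks are unaffected by the failure.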

There are some penalties that come with chunked encoding. A video encoder operating over the full hour-long sequence, especially in two-pass mode, can look ahead at what follows and therefore do better long-term bitrate allocation, achieving better overall quality at the same bitrate; an encoder that sees only one chunk at a time loses that ability. Yet the advantages of chunked encoding outweigh these penalties.

Per-title and per-chunk encode optimization

At Netflix, we have been constantly improving video quality for our members all over the world. One major milestone in our continuous efforts has been “Per-title encode optimization”, described in great detail in our tech blog posted in Dec. 2015 [9]. Per-title encode optimization introduced the concept of customizing encoding according to complexity, which translates to proper resolution and bitrate selection for each video sequence we have in our catalog. This provided significant improvement over our previous fixed resolution/bitrate ladder generation, by taking into account the characteristics of video (amount of motion, level of detail, colorfulness) and optimizing coding efficiency by selecting encoding parameters that better fit each title.

Another important milestone has been “Per-chunk encode optimization”, introduced in Dec. 2016 as part of our “Mobile encodes for downloads” initiative and explained in more detail in this Netflix tech blog [10]. The concept of equalizing rate-distortion slopes, discussed in more detail in a subsequent section, was also used in that work and provided significant improvements. In fact, one can consider the current work a natural extension of “Per-title encode optimization” and “Per-chunk encode optimization”; we can call it “Perceptual per-shot encode optimization”.

From chunks to shots

In an ideal world, one would like to chunk a video and apply a different set of parameters to each chunk, in a way that optimizes the final assembled video. The first step toward this ideal bit allocation is to split the video into its natural atoms, consisting of frames that are very similar to each other and thus respond similarly to changes in encoding parameters: these are the “shots” that make up a long video sequence. A shot is a portion of video with a relatively short duration, coming from the same camera under fairly constant lighting and environment conditions. It captures the same or similar visual content, for example the face of an actor standing in front of a tree, and, most importantly, it behaves uniformly when coding parameters change. The natural boundaries of shots are established by relatively simple algorithms, called shot-change detection algorithms, which check the amount of difference between pixels that belong to consecutive frames, as well as other statistics. When that amount of difference exceeds a certain fixed or dynamically adapted threshold, a new shot boundary is announced.

Some cases, such as cross-fades or other visual effects applied at the boundary between two consecutive shots, call for more sophisticated algorithms.
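The simple thresholding idea can be sketched as follows. This is an illustrative toy, not a production detector: it assumes frames are flat lists of luma values, uses the mean absolute pixel difference as the only statistic, applies a fixed threshold of 30 chosen arbitrarily, and does not handle cross-fades. The names `mean_abs_diff` and `detect_shots` are hypothetical.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute difference between two equally sized luma frames."""
    if not frame_a or len(frame_a) != len(frame_b):
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def detect_shots(frames, fps, threshold=30.0):
    """Return a list of (start_sec, end_sec) shots using a fixed threshold."""
    if not frames:
        return []
    boundaries = [0]
    for i in range(1, len(frames)):
        # A large jump between consecutive frames announces a new shot.
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            boundaries.append(i)
    boundaries.append(len(frames))
    return [(s / fps, e / fps) for s, e in zip(boundaries, boundaries[1:])]


# Toy input: 4-pixel "frames" whose content jumps at frame 3 and frame 5.
frames = [[10] * 4] * 3 + [[200] * 4] * 2 + [[90] * 4] * 3
shots = detect_shots(frames, fps=1)
print(shots)  # [(0.0, 3.0), (3.0, 5.0), (5.0, 8.0)]
```

A dynamically adapted threshold, as mentioned above, could replace the fixed `threshold` argument, for example by comparing each difference against a running average of recent differences.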

The end result of a shot-change detection algorithm is a list of shots and their timestamps. One can use the resulting shots as the basic encoding block, instead of a fixed-length chunk. This provides a few unique opportunities:

The placement of Intra frames can now be “consistently irregular”, a term that means (a) Intra frames can be placed at “random” positions, for example the first 4 Intra frames can be at times 0, 2, 5, 7 secs., and (b) their temporal positions are nevertheless always aligned among encodes of the same title; in other words, the location of the first 4 Intra frames remains at 0, 2, 5, 7 secs. for all encodes of this title. The irregular placement of Intra frames results in minimum coding overhead; keep in mind that Intra frames are the least efficient among the 3 different frame types (I/P/B) used in video coding, and thus one wishes to minimize their presence in an encoded video.

Seeking in a long video sequence now leads to natural points of interest, which are signaled by shot boundaries.

There is no prediction penalty when encoding shots independently. If one instead places an Intra frame in the middle of a shot, this breaks the shot into parts that, when coded independently instead of as a single unit, require more bits, since pixels after the Intra frame can’t reference their similar counterparts in frames before the Intra frame.

Any significant encoding parameter change between consecutive shots is much less likely to be noticed by the human eye, since the change of visual content between shots is far more disruptive to the human visual system than any possible change of encoding parameters, such as resolution or quality.

Within a homogeneous set of frames, such as those that belong to the same shot, there is much less need to use rate control, since very simple coding schemes, such as the fixed-quantization parameter (“fixed QP”) mode supported by virtually all existing video encoders, offer very consistent video quality with very little bitrate variation. In fact, “fixed QP” has always been used during development of video codecs, since almost all sequences used for testing in MPEG, ITU and other standards bodies consist of single-shot video chunks.
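To make “consistently irregular” concrete, the sketch below builds a per-shot plan for several representations and checks that Intra frames land at the same times in all of them. It is a sketch under stated assumptions: the shot list, the three-rung ladder, and the QP values are invented for illustration, and `keyframe_times`, `build_plan`, and `is_aligned` are hypothetical helpers rather than part of any encoder.

```python
def keyframe_times(shots):
    """One Intra frame at the start of every shot."""
    return [start for start, _end in shots]


def build_plan(shots, ladder):
    """Per-representation plan: shared shot boundaries, one fixed QP per shot."""
    plan = {}
    for name, qps in ladder.items():
        if len(qps) != len(shots):
            raise ValueError(f"{name}: need exactly one QP per shot")
        plan[name] = [
            {"start": start, "end": end, "qp": qp, "intra_at": start}
            for (start, end), qp in zip(shots, qps)
        ]
    return plan


def is_aligned(plan):
    """True if every representation places Intra frames at identical times."""
    placements = {
        tuple(shot["intra_at"] for shot in shot_list)
        for shot_list in plan.values()
    }
    return len(placements) <= 1


# Assumed shot boundaries (seconds) and illustrative per-shot QP values.
shots = [(0, 2), (2, 5), (5, 7), (7, 11)]
ladder = {
    "1080p": [22, 26, 24, 28],
    "720p": [26, 30, 28, 32],
    "480p": [30, 34, 32, 36],
}
plan = build_plan(shots, ladder)
print(keyframe_times(shots), is_aligned(plan))  # [0, 2, 5, 7] True
```

The Intra positions are irregular in time but identical across the three representations, which is what allows a client to switch bitstreams at any shot boundary, while the QP is free to differ from shot to shot.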