Implementing Dynamic Optimizer in Production

As described in more detail in this blog post, the Dynamic Optimizer analyzes an entire video over multiple quality and resolution points in order to obtain the optimal compression trajectory for an encode, given an optimization objective. In particular, we use VMAF, the Netflix perceptual video quality metric, as our optimization objective, since our goal is to generate streams at the best perceptual quality.

The primary challenge we faced in implementing the Dynamic Optimizer framework in production was retrofitting our parallel encoding pipeline to process significantly more encode units. First, the analysis step for the Dynamic Optimizer required encoding at different resolutions and qualities (QPs), adding an order of magnitude more complexity. Second, we transitioned from encoding video chunks a few minutes long to encoding on a per-shot basis. For example, in the original system, a 1-hour episode of Stranger Things results in twenty 3-minute chunks. With shot-based encoding and an average shot length of 4 seconds, the same episode requires processing 900 shots. Assuming each chunk corresponds to a shot (Fig. 1B), the number of chunks per encode grows 45-fold; multiplied by the extra resolution and QP points of the analysis step, the new framework increased the number of encode units by more than two orders of magnitude per title. This increase exposed system bottlenecks related to the number of messages passed between compute instances. We introduced several engineering innovations to address these limitations, and we discuss two of them here: Collation and Checkpoints.
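The figures in this example can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above, plus one clearly labeled assumption: `ANALYSIS_POINTS`, the number of resolution and QP combinations tried per shot, is a hypothetical value chosen for illustration and is not a figure from our pipeline.

```python
# Back-of-the-envelope count of encode units for a 1-hour episode.
EPISODE_SECONDS = 60 * 60   # 1-hour episode
CHUNK_SECONDS = 3 * 60      # 3-minute chunks in the original pipeline
AVG_SHOT_SECONDS = 4        # average shot length

chunks_before = EPISODE_SECONDS // CHUNK_SECONDS   # chunks per encode, original system
shots = EPISODE_SECONDS // AVG_SHOT_SECONDS        # shots per encode, shot-based system
growth_from_shots = shots / chunks_before          # increase from per-shot encoding alone

# ASSUMPTION (illustrative only): number of (resolution, QP) points
# encoded per shot during the analysis step.
ANALYSIS_POINTS = 10
analysis_encodes = shots * ANALYSIS_POINTS
growth_with_analysis = analysis_encodes / chunks_before

print(chunks_before, shots, growth_from_shots, growth_with_analysis)
```

Per-shot encoding alone accounts for a 45-fold increase; any realistic number of analysis points per shot pushes the total well past two orders of magnitude.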

While we could have improved the core messaging system to handle such an increase in message volume, that was not the most feasible or expedient solution at the time. Instead, we adapted our pipeline by introducing collation.

Figure 1: Collation of shots into ‘chunks’. (A) Representation of a video timeline. The dashed vertical black lines represent shot boundaries. (B) One shot in one chunk: Each shot is assigned to its own chunk. (C) Collate shots into a chunk: Accumulate an integer number of shots within a target chunk duration.

In collation, we group consecutive shots so that together they make up a chunk. Because we have flexibility in how this grouping occurs, we can combine an integer number of shots to produce approximately the same 3-minute chunk duration as under the original chunk-based encode model (Fig. 1C). These chunks can be configured to be approximately the same size, which helps with resource allocation for instances previously tuned to encode chunks a few minutes long. Within each chunk, the compute instance encodes each shot independently, with its own set of defined parameters.
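One simple way to picture collation is as a greedy pass over the shot list. The following is a minimal sketch, not our production code: the function name `collate_shots`, its parameters, and the rule of closing a chunk as soon as the next shot would overshoot the target are all assumptions made for illustration.

```python
from typing import List


def collate_shots(
    shot_durations: List[float],
    target_chunk_seconds: float = 180.0,
) -> List[List[int]]:
    """Group consecutive shot indices into chunks near the target duration.

    Hypothetical greedy rule: keep adding shots to the current chunk and
    close it when the next shot would push it past the target. Shots are
    never split, so each chunk holds an integer number of shots.
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    current_duration = 0.0

    for index, duration in enumerate(shot_durations):
        # Close the current chunk if this shot would overshoot the target.
        if current and current_duration + duration > target_chunk_seconds:
            chunks.append(current)
            current, current_duration = [], 0.0
        current.append(index)
        current_duration += duration

    # Flush the final, possibly shorter, chunk.
    if current:
        chunks.append(current)
    return chunks


# 900 shots of 4 seconds each, as in the 1-hour episode example.
episode_chunks = collate_shots([4.0] * 900)
print(len(episode_chunks), len(episode_chunks[0]))
```

For the 1-hour episode with uniform 4-second shots, this yields 20 chunks of 45 shots each, matching the chunk count of the original pipeline. A shot longer than the target simply becomes a chunk of its own.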

Figure 2: Checkpoints.

Collating independently encoded shots within a chunk led to an additional system improvement we call checkpoints. Previously, if we lost a compute instance (because we had borrowed it and it was suddenly needed for higher priority tasks), we re-encoded the entire chunk. With shot-based encoding, each shot is encoded independently, so a completed shot does not need to be re-encoded if the instance is lost while encoding the rest of the chunk. We created a system of checkpoints (Fig. 2) to ensure that each encoded shot and its associated metadata are stored immediately after completion. Now, if the same chunk is retried on another compute instance, encoding does not start from scratch but resumes from the shot where it left off, which saves computation.
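The checkpoint idea can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not our actual implementation: checkpoints are modeled as JSON files in a directory shared across retries, and `encode_chunk`, `encode_shot`, and `checkpoint_dir` are hypothetical names.

```python
import json
from pathlib import Path
from typing import Callable, List


def encode_chunk(
    shot_ids: List[str],
    encode_shot: Callable[[str], dict],
    checkpoint_dir: Path,
) -> List[str]:
    """Encode the shots of a chunk, skipping any that already have a checkpoint.

    Returns the shot ids that were actually encoded during this call.
    ASSUMPTION: checkpoint_dir stands in for durable storage that survives
    the loss of a compute instance.
    """
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    encoded_now: List[str] = []

    for shot_id in shot_ids:
        marker = checkpoint_dir / f"{shot_id}.json"
        if marker.exists():
            continue  # Finished on an earlier attempt; do not re-encode.

        metadata = encode_shot(shot_id)

        # Store the result immediately after the shot completes. Writing to
        # a temporary file and renaming keeps a half-written checkpoint
        # from being mistaken for a finished one.
        tmp = marker.with_suffix(".tmp")
        tmp.write_text(json.dumps(metadata))
        tmp.replace(marker)
        encoded_now.append(shot_id)

    return encoded_now
```

If an instance is lost partway through a chunk, calling `encode_chunk` again with the same `checkpoint_dir` on another instance encodes only the shots that have no checkpoint yet.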