A Scalable System for Ingestion and Delivery of Timed Text
Offering the same great Netflix experience to diverse audiences and cultures around the world is a core aspect of the global Netflix video delivery service. With high quality subtitle localization being a key component of the experience, we have developed (and are continuously refining) a Unicode standard based i18n-grade timed text processing pipeline. This pipeline allows us to meet the challenges of scale brought by the global Netflix platform as well as features unique to each script and language. In this article, we provide a description of this timed text processing pipeline at Netflix including factors and insights that shaped its architecture.
Timed Text At Netflix: Overview
As described above, Timed Text — subtitles, Closed Captions (CC), and Subtitles for Deaf and Hard of Hearing (SDH) — is a core component of the Netflix experience. A simplified view of the Netflix timed text processing pipeline is shown in the accompanying figure. Every timed text source delivered to Netflix by a content provider or a fulfillment partner goes through the following major steps before showing up on the Netflix service:
- Ingestion: The first step involves delivery of the authored timed text asset to Netflix. Once a transfer has been completed, data is verified for any transport corruption and corresponding metadata is duly verified. Examples of such metadata include (but are not limited to) associated movie title and primary language of timed text source.
- Inspection: The ingested asset is then subject to a rigorous set of automated checks to identify any authoring errors. These errors fall mostly into two categories, namely specification conformance and style compliance. The following sections give more detail on the types and stages of these inspections.
- Conversion: An error-free inspected file is then considered good for generating output files to support the device ecosystem. Netflix needs to host different output versions of the ingested asset to satisfy varying device capabilities in the field.
As the number of regions, devices and file formats grows, we must accommodate the ever-growing requirements on the system. We have responded to these challenges by designing an i18n-grade Unicode-based pipeline. Let’s look at these individual components in the next sections.
Inspection: What’s In Timed Text?
The core information communicated in a timed text file is a text translation of what is being spoken on screen, along with the associated active time intervals. In addition, timed text files might carry positional information (e.g., to indicate who might be speaking or to place rendered text in a non-active area of the screen) as well as any associated text styles such as color (e.g., to distinguish speakers), italics, etc. Readers who are familiar with HTML and CSS might think of timed text as a similar but lightweight way of formatting text data with layout and style information.
Multiple formats for authoring timed text have evolved over time and across regions, each with different capabilities. Based on factors including the extent of standardization as well as the availability of authored sources, Netflix predominantly accepts the following timed text formats:
- CEA-608 based Scenarist Closed Captions (.scc)
- EBU Subtitling data exchange format (.stl)
- TTML (.ttml, .dfxp, .xml)
- Lambda Cap (.cap) (for Japanese language only)
An approximate distribution of timed text sources delivered to Netflix is depicted below. Given our experience with the broadcast lineage (.scc and .stl) as well as regional source formats (.cap), we prefer delivery in the TTML (Timed Text Markup Language) format. Formats like .scc and .stl have limited language support (e.g., neither includes Asian character sets), while .cap and .scc are ambiguous from a specification point of view. As an example, .scc files use drop-frame timecode syntax to indicate 23.976 fps (frames per second), but SMPTE (Society of Motion Picture and Television Engineers) defines drop-frame timecode only for the NTSC 29.97 frame rate. As a result, the proportion of TTML-based subtitles in our catalog is on the rise.
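To make the drop-frame ambiguity concrete, the following is a minimal sketch of the standard drop-frame calculation, which is defined for a nominal 30-frame count running at 29.97 fps. The function name and the `HH:MM:SS;FF` input pattern are assumptions made for illustration; this is not code from the Netflix pipeline.

```python
import re

# Hypothetical helper for illustration only.
_TIMECODE = re.compile(r"^(\d{2}):(\d{2}):(\d{2});(\d{2})$")


def drop_frame_to_frame_count(timecode: str) -> int:
    """Convert an 'HH:MM:SS;FF' drop-frame timecode (29.97 fps) to a frame index."""
    match = _TIMECODE.match(timecode)
    if match is None:
        raise ValueError(f"not a drop-frame timecode: {timecode!r}")
    hours, minutes, seconds, frames = (int(group) for group in match.groups())
    total_minutes = 60 * hours + minutes
    # Frame numbers 00 and 01 are skipped at the start of every minute,
    # except for minutes divisible by ten.
    dropped = 2 * (total_minutes - total_minutes // 10)
    nominal = (total_minutes * 60 + seconds) * 30 + frames
    return nominal - dropped
```

No equivalent rule is defined for 23.976 fps, so a parser that meets the `;` syntax in a film-rate file has to guess what the author intended.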
The Inspections Pipeline
Let’s now see how timed text files are inspected through the Netflix TimedText Processing Pipeline. In terms of control flow, given a source file, the appropriate parser performs source-specific inspections to ensure the file adheres to the purported specification. The source is then converted to a common canonical format on which semantic inspections are performed. An example of such an inspection is checking whether timed text events would collide spatially when rendered; more examples are shown in the adjoining figure.
Given many possible character encodings (e.g., UTF-8, UTF-16, Shift-JIS), the first step is to detect the most probable charset. This information is used by the appropriate parser to parse the source file. Most parsing errors are fatal: they terminate inspection processing and trigger a redelivery request to the content partner.
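As a rough illustration of charset detection, the sketch below checks for a byte order mark and then tries strict decoding with a short list of candidates. The function name, the candidate list and its order are assumptions; a production detector would likely use statistical heuristics and cover more encodings.

```python
import codecs
from typing import Optional


def detect_charset(data: bytes) -> Optional[str]:
    """Return the first plausible encoding name for `data`, or None."""
    # A byte order mark is the strongest signal, so check it first.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Otherwise fall back to strict decoding. UTF-8 goes first because
    # it rejects most byte sequences produced by other encodings.
    for name in ("utf-8", "shift_jis"):
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

A `None` result would correspond to the fatal case described above, where the file cannot be parsed and has to be redelivered.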
Semantic checks that are common to all formats are performed in the ISD (Intermediate Synchronic Document) based canonical domain. Parsed objects from the various sources generate ISD sequences, on which further analysis is carried out. An ISD representation can be thought of as a decomposition of the subtitle timeline into a sequence of time intervals such that, within each interval, the rendered subtitle matter stays the same (see adjacent figure). Each snapshot includes the style and positional information for its interval and is completely isolated from other events in the sequence. This also makes it a great model for running inspections concurrently. The following diagram depicts how the ISD format can be visualized.
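The decomposition into intervals can be sketched as follows. The `Event` and `Isd` types and the function name are hypothetical, and the sketch carries only text, whereas a real ISD would also carry style and position.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    begin: float  # seconds
    end: float    # seconds
    text: str


@dataclass(frozen=True)
class Isd:
    begin: float
    end: float
    texts: Tuple[str, ...]  # everything on screen during [begin, end)


def to_isd_sequence(events: List[Event]) -> List[Isd]:
    """Split the timeline so that content is constant within each interval."""
    # Every begin or end time is a point where rendered content may change.
    boundaries = sorted({t for e in events for t in (e.begin, e.end)})
    sequence = []
    for begin, end in zip(boundaries, boundaries[1:]):
        active = tuple(e.text for e in events if e.begin <= begin < e.end)
        if active:  # this sketch skips intervals where nothing is shown
            sequence.append(Isd(begin, end, active))
    return sequence
```

Two events that overlap between seconds 2 and 4 yield three intervals: the first event alone, both together, then the second alone. Because each interval is self-contained, checks can run on the intervals independently.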
Stylistic and language specific checks are performed on this data. An example of a canonical check is counting the number of active lines on the screen at any point in time. Some of these checks may be fatal, others may trigger a human-review-required warning and allow the source to continue down the workflow.
Another class of inspections is built around Unicode recommendations. Unicode TR-9, which specifies the Unicode Bidirectional (BiDi) Algorithm, is used to check whether a file with bidirectional text conforms to the specification and whether the final display-ordered output would make sense. (Bidirectional text is common in languages such as Arabic and Hebrew, where the displayed text runs from right to left while numbers and text in other languages run from left to right.) Normalization rules (TR-15) may have to be applied to check glyph conformance and rendering viability before these assets can be accepted.
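A minimal sketch of two such checks, using only Python's standard `unicodedata` module, is shown below. The function names are hypothetical. The second function does not run the bidirectional algorithm, which the standard library does not implement; it only flags text that mixes directions and would therefore need it.

```python
import unicodedata

# Strong right-to-left classes, and the left-to-right or numeric classes
# that create mixed-direction runs when they appear alongside them.
RTL_CLASSES = {"R", "AL"}
LTR_OR_NUMERIC_CLASSES = {"L", "EN", "AN"}


def is_nfc(text: str) -> bool:
    """True if the text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def needs_bidi_review(text: str) -> bool:
    """True if right-to-left text is mixed with left-to-right text or digits."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    return bool(classes & RTL_CLASSES) and bool(classes & LTR_OR_NUMERIC_CLASSES)
```

Text flagged by `needs_bidi_review` would then go to a full bidirectional reordering step, whose output is what actually needs to be checked.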
Language based checks are an interesting study. Take, for example, character limits. A sentence that is too long will force wrapping or line breaks. This ignores authoring intent and compromises rendering aesthetics. The actual limit will vary between languages (think Arabic versus Japanese). If enforcing reading speed, those character limits must also account for display time. For these reasons, canonical inspections must be highly configurable and pluggable.
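One way to picture a configurable, language-aware check is sketched below. The names and the numeric limits are placeholders chosen for illustration, not Netflix's actual style rules.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Limits:
    max_chars_per_line: int
    max_chars_per_second: float


# Placeholder values for illustration only.
LIMITS_BY_LANGUAGE: Dict[str, Limits] = {
    "en": Limits(max_chars_per_line=40, max_chars_per_second=18.0),
    "ja": Limits(max_chars_per_line=12, max_chars_per_second=5.0),
}


def inspect_event(lines: List[str], duration_seconds: float, language: str) -> List[str]:
    """Return warning codes for one subtitle event; an empty list means it passes."""
    limits = LIMITS_BY_LANGUAGE[language]
    warnings = []
    # len() counts code points, which only approximates displayed characters.
    if any(len(line) > limits.max_chars_per_line for line in lines):
        warnings.append("line-too-long")
    total_chars = sum(len(line) for line in lines)
    if duration_seconds > 0 and total_chars / duration_seconds > limits.max_chars_per_second:
        warnings.append("reading-speed")
    return warnings
```

Keeping the limits in a table keyed by language lets new languages or stricter rules be added as data, without touching the check itself.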
Output Profiles: Why We Convert
While source formats for timed text are designed for the purpose of archiving, delivery formats are designed to be nimble so as to facilitate streaming and playback in bandwidth, CPU and memory constrained environments. To achieve this objective, we convert all timed text sources to the following family of formats:
- TTML Based Output profiles
- WebVTT Based Output profiles
- Image Subtitles
After a source file passes inspection, the ISD-based canonical representation is saved in cloud storage. This forms the starting point for the conversion step. First, a set of broad filters that are applicable to all output profiles are applied. Then, models for corresponding standards (TTML, WebVTT) are generated. We continue to filter down based on output profile and language. From there, it’s simply a matter of writing and storing the downloadables. The following figure describes conversion modules and output profiles in the Netflix TimedText Processing Pipeline.
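The final writing step can be sketched for the WebVTT family as follows. The cue tuple layout and function names are assumptions, and the sketch covers serialization only: the filters and profile-specific models described above are omitted, as is escaping of cue text.

```python
from typing import Iterable, Tuple

Cue = Tuple[float, float, str]  # (begin seconds, end seconds, text)


def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.ttt."""
    millis = round(seconds * 1000)
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def write_webvtt(cues: Iterable[Cue]) -> str:
    """Serialize cues into a WebVTT document."""
    blocks = ["WEBVTT"]  # required file signature
    for begin, end, text in cues:
        timing = f"{format_timestamp(begin)} --> {format_timestamp(end)}"
        blocks.append(f"{timing}\n{text}")
    # Blocks are separated by a blank line.
    return "\n\n".join(blocks) + "\n"
```

Since ISD intervals already have non-overlapping begin and end times, each interval maps naturally onto one cue.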
Multiple profiles within a family of formats may be required. Depending on device capabilities, the TTML set, for example, is further divided into the following profiles:
- simple-sdh: Supports only text and timing information. This profile doesn’t support any positional information and is expected to be consumed by the most resource-limited devices.
- ls-sdh: Abbreviated from less-simple sdh, it supports a richer variety of text styles and positional data on top of simple-sdh. The simple-sdh and ls-sdh serve the Pan-European and American geography and use the WGL4 (Windows Glyph List 4) character repertoire.
- ntflx-ttml-gsdh: This is the latest addition to the TTML family. It supports a broad range of Unicode code points as well as language features like bidi, rubies, etc. The following text snippets show vertical writing mode with ruby and “tate-chu-yoko” features.
When a Netflix user activates subtitles, the device requests a specific profile based on its capabilities. Devices with enough RAM (and a good connection) might download the ls-sdh file. Resource-limited devices may ask for the smaller simple-sdh file.
Additionally, certain language features (like Japanese rubies and boutens) may require advanced rendering capabilities not available on all devices. To support this, the image profile pre-renders subtitles as images in the cloud and transmits them to end devices using a progressive transfer model. The WebVTT family of output profiles is primarily used by the Apple iOS platform. The accompanying pie chart shows how consumption is split across these output profiles.
QC: How Does it Look?
We have automated (or are working towards automating) a number of quality-related measurements. These include spelling checks, text-to-audio sync, text overlapping burned-in text, reading speed limits, characters per line, and total lines per screen. While such metrics go a long way towards improving the quality of subtitles, they are by no means enough to guarantee a flawless user experience.
There are times when rendered subtitle text might occlude an important visual element, or subtitle translations from one language to another can result in an unidiomatic experience. Other times there could be intentional misspellings or onomatopoeia — we still need to rely on human eyes and human ears to judge subtitling quality in such cases. A lot of work remains to achieve full QC automation.
Netflix and the Community
Given the significance of subtitles to the Netflix business, Netflix has been actively involved in timed text standardization forums such as W3C TTWG (Timed Text Working Group). IMSC1 (Internet Media Subtitles and Captions) is a TTML-based specification that addresses some of the limitations encountered in existing source formats. Further, it has been deemed mandatory for IMF (Interoperable Master Format). Netflix is 100% committed to IMF and we expect that our ingest implementation will support the text profile of IMSC. To that end, we have been actively involved in moving the IMSC1 specification forward. Multiple Netflix-sponsored implementations of IMSC1 were announced to TTWG in February 2016, paving the way for the specification to move to recommendation status.
IMSC1 does not support essential features (e.g., rubies) for rendering Japanese and other Asian subtitles. To accelerate support for these features, we are actively involved in the standardization of TTML2, from both a specification and an implementation perspective. Our objective is to get to a TTML2-based IMSC2.
Examples of OSS (open source software) projects sponsored by Netflix in this context include “ttt” (Timed Text Toolkit). This project offers tools for validation and rendering of W3C TTML family of formats (e.g., TTML1, IMSC1, and TTML2). “Photon” is an example of a project developed internally at Netflix. The objective of Photon is to provide the complete set of tools for validation of IMF packages.
The role of Netflix in advancing subtitle standards in the industry has been recognized by The National Academy of Television Arts & Sciences, and Netflix was a co-recipient of the 2015 Technology and Engineering Emmy Award for “Standardization and Pioneering Development of Non-Live Broadcast Captioning”.
In a closed system such as the Netflix playback system, where the generation and consumption of timed text delivery formats can be controlled, it is possible to have a firm grip on the end-to-end system. Further, the streaming player industry has moved to support leading formats. However, support on the ingestion side remains tricky. New markets can introduce new formats with new features.
Consider right-to-left languages. The bidirectional (bidi) algorithm has gone through many revisions. Many tools still in use were developed against older versions of the bidi specification. As these files are passed to newer tools with newer bidi implementations, chaos ensues.
Old but popular formats like SCC and STL were developed for broadcast frame rates. When conformed to film content, they fall outside the scope of the original intent. The following chart shows the distribution of these sources as delivered to Netflix. More than 60% of our broadcast-minded assets have been conformed to film content.
Such challenges place ever-increasing requirements on inspection routines. The resulting thrash requires operational support to manage communication across teams for triage and redelivery. One way to reduce this overhead could be to offer inspections as a web service (see accompanying figure).
In such a model, a partner uploads their timed text file(s), inspections are run, and the file(s) are returned in a common format. This common format will be an open standard like TTML. The service also provides a preview of how the text will be rendered. If an error is found, we can show the partner where the error is and recommend a fix.
Not only will this model reduce the frequency of software maintenance and enhancement, but it will also drastically cut down the need for manual intervention. It also gives our content partners an opportunity to integrate this capability into their authoring workflow and iteratively improve the quality of the authored subtitles.
Timed text assets carry a lot of untapped potential. For example, timed text files may contain object references in dialogue. Words used in a context could provide more information about possible facial expressions or actions on the screen. Machine learning and natural language processing may help solve labor-intensive QC challenges. Data mining into the timed text could even help automate movie ratings. As media consumption becomes more global, timed text sources will explode in number and importance. Developing a system that scales and learns over time is the demand of the hour.
— by Shinjan Tiwary, Dae Kim, Harold Sutherland, Rohit Puri, and David Ronca
Originally published at techblog.netflix.com on April 18, 2016.