The Challenges of Live Linear Video Ingest — Part Three: Key Learnings
By Allison Deal, Senior Software Developer
If you’re just joining us, check out parts one and two of our live video ingest blog series before jumping into our final post. In Part One, we talked about the challenges and design requirements for our live video ingest system, and outlined how we built the system in Part Two. In the final post of the series, we’ll take a closer look at specific learnings around the most challenging issues we encountered when building our live video ingest service.
Unlike most consumer-facing systems, our live video ingest service has a steady and predictable request rate, because video playlists and segments are published at a consistent cadence. Our goal is to provide the most available live streaming service possible, with the highest video quality that each viewer’s bandwidth can support. Here are some of the specific challenges we identified and mitigated to reduce downstream rebuffering and playback errors for our viewers.
Varied Inputs Require a Robust, Flexible System
If you’ve been following along from our previous post, you know that we work with multiple vendors that provide us with encoded streams from many networks. Because there are many sources and parties involved in this process, the video files and metadata we receive are often changed in a variety of ways before the stream reaches Hulu. We follow multiple industry standards to ensure that system inputs are received in a regulated, consistent manner. However, these specifications are often implemented differently by each party.
In order to optimize the service for each input set, we have developed unique configurations that can be automatically or manually applied on a per channel, per provider, or per vendor basis. These configurations allow us to calibrate processing and specify error thresholds based on the traits of any given stream or set of streams.
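As a rough illustration of how layered configurations like these can be resolved, here is a minimal Python sketch. It is not Hulu's implementation: the `IngestConfig` fields, their default values, and the order of precedence (vendor, then provider, then channel as the most specific) are all assumptions made for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class IngestConfig:
    # Hypothetical settings; names and defaults are illustrative only.
    timestamp_precision_s: float = 1.0
    ad_timeout_s: float = 180.0
    max_missing_segments: int = 5


def resolve_config(defaults, overrides, vendor, provider, channel):
    """Apply overrides from least to most specific scope.

    `overrides` maps (scope, name) tuples to dicts of field overrides.
    The vendor < provider < channel precedence is an assumption.
    """
    config = defaults
    for scope in (("vendor", vendor), ("provider", provider), ("channel", channel)):
        config = replace(config, **overrides.get(scope, {}))
    return config


overrides = {
    ("vendor", "vendor-a"): {"ad_timeout_s": 240.0},
    ("channel", "news-1"): {"timestamp_precision_s": 0.1},
}
config = resolve_config(IngestConfig(), overrides, "vendor-a", "provider-x", "news-1")
print(config)
```

Settings that no scope overrides fall through to the defaults, so a new channel can start with a vendor's baseline and be tuned individually later.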
Timestamp Alignment and Precision
One important function of the ingest system is identifying the different renditions that contain the same video, which is necessary for the client to switch smoothly between qualities. The system initially assumed, incorrectly, that wall clock timestamps would always be aligned for the same content across the bitrate ladder. To mitigate this problem, we added a configuration to control timestamp precision. In some cases, the precision is relaxed to as much as one tenth of a second to correctly align video segments across qualities. In other cases, a separate configuration is applied so that these rendition groups are identified by common video PTS (presentation timestamp) values.
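The two grouping strategies can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the `Segment` fields and function names are hypothetical, and rounding to a precision bucket is only one way to compare timestamps (two values that straddle a bucket boundary would still be split, which a production system would need to handle).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    rendition: str       # e.g. "1080p"; hypothetical label
    wall_clock_s: float  # wall clock timestamp in seconds
    pts: int             # presentation timestamp


def group_key(segment, precision_s=None):
    """Key that identifies which segments carry the same content."""
    if precision_s is None:
        # Alternative configuration: group by common PTS values.
        return ("pts", segment.pts)
    # Bucket the wall clock timestamp at the configured precision.
    return ("wall_clock", round(segment.wall_clock_s / precision_s))


def group_renditions(segments, precision_s=None):
    groups = defaultdict(list)
    for segment in segments:
        groups[group_key(segment, precision_s)].append(segment.rendition)
    return dict(groups)


segments = [
    Segment("1080p", 1000.02, 90000),
    Segment("720p", 1000.04, 90000),
    Segment("1080p", 1004.01, 450000),
    Segment("720p", 1004.03, 450000),
]
print(group_renditions(segments, precision_s=0.1))
```

With a precision of one tenth of a second, the slightly misaligned renditions fall into the same group; at millisecond precision each segment would end up alone.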
Automatically Ending Ad Breaks
SCTE-35 markers indicate when ad pods and programs start and end. The hardware and systems used to insert this metadata were originally designed for digital television and cable. The SCTE-35 specification, which details how these messages are sent, has evolved and expanded its scope over the years, but systems in the workflow aren’t always able to keep up with recent versions. The specification also contains very loose definitions around converting content metadata for OTT compatibility, so vendors, channels, and providers often interpret it in ways that aren’t compatible or interoperable. These markers are generated by each TV station and are often modified as they pass through each provider and vendor before reaching Hulu. Occasionally, ad start markers indicate inaccurate ad durations, and sometimes ad end markers are not received by Hulu at all. To prevent the user from experiencing an unending ad state when inaccurate markers are sent, Hulu ingest automatically times out the ad and puts the user back into the program after a configurable amount of time. The system’s ad timeline logic simply logs any late cue-in (ad end) events for later optimization of the channel’s timeout limit.
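The timeout behavior can be sketched as a small state machine. The Python below is a hypothetical illustration, not the actual ad timeline logic: the class and method names are invented, times are plain seconds, and recording how late a cue-in arrived (rather than just that it was late) is an assumption about what would be useful for tuning.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AdTimeline:
    timeout_s: float                        # configurable per channel
    ad_started_at: Optional[float] = None
    timed_out_at: Optional[float] = None
    late_cue_ins: List[float] = field(default_factory=list)

    def _expire(self, now):
        # End the ad automatically once the timeout has elapsed.
        if self.ad_started_at is not None and now - self.ad_started_at >= self.timeout_s:
            self.timed_out_at = self.ad_started_at + self.timeout_s
            self.ad_started_at = None

    def on_cue_out(self, now):
        """Ad start marker received."""
        self.ad_started_at = now
        self.timed_out_at = None

    def on_cue_in(self, now):
        """Ad end marker received."""
        self._expire(now)
        if self.ad_started_at is not None:
            self.ad_started_at = None  # normal end of ad break
        elif self.timed_out_at is not None:
            # The ad already timed out: log the lateness for later tuning.
            self.late_cue_ins.append(now - self.timed_out_at)
            self.timed_out_at = None

    def in_ad(self, now):
        self._expire(now)
        return self.ad_started_at is not None


timeline = AdTimeline(timeout_s=120.0)
timeline.on_cue_out(0.0)
print(timeline.in_ad(60.0), timeline.in_ad(120.0))
```

If late cue-ins for a channel cluster around a consistent lateness, that suggests its timeout limit is too short.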
Occasionally, we see media playlists whose timestamps reference media files in the past or future. To ensure that we only process live video, we verify that incoming playlists and media fall within a reasonable window around the channel’s current timestamp before ingesting.
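A check of this kind can be as simple as the following Python sketch. The function name and the window sizes are assumptions chosen for illustration; the post does not state what window each channel uses.

```python
from datetime import datetime, timedelta, timezone


def within_live_window(
    media_time,
    now,
    max_behind=timedelta(seconds=60),  # assumed tolerance for late media
    max_ahead=timedelta(seconds=10),   # assumed tolerance for clock skew
):
    """Return True if media_time is close enough to now to count as live."""
    return now - max_behind <= media_time <= now + max_ahead


now = datetime(2018, 9, 1, 12, 0, 0, tzinfo=timezone.utc)
print(within_live_window(now - timedelta(seconds=8), now))
print(within_live_window(now - timedelta(hours=1), now))
```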
To Build the Best System: Fine Tune, Fine Tune, Fine Tune
Each component of our system needs to be finely tuned and optimized to minimize latency and errors. Video processing is complex, and one seemingly small error or latency can cause streams to be incorrectly ingested or not processed in time to keep up with the live edge.
Minimum Segment Duration
Video segments are split by the encoder at a regular cadence of four seconds. However, segments are cut short whenever content transitions between a program and an advertisement, regardless of the resulting duration, so that each media segment contains only ad content or only program content. This is necessary so that we can dynamically replace original ad segments with new ads relevant to each viewer. Consecutive ad markers occurring very close together were resulting in multiple sub-second segments in a row. Often, the time it takes to transfer and process each of these segments is longer than the segment’s duration, resulting in rebuffering and poor playback quality for users. To mitigate this problem, we worked with video encoding vendors to combine consecutive ad markers and ensure a minimum segment duration of 0.5 seconds.
Rebuffer event count over time. Minimum segment duration change was enabled just after 21:00.
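One way to picture the marker-combining step is the Python sketch below. It is only an illustration of the idea: the actual logic lives in the encoding vendors' systems and is not described here, and the choice to keep the first marker of a close-together run (rather than the last) is an assumption.

```python
def coalesce_markers(marker_times_s, min_gap_s=0.5):
    """Fold markers that follow a kept marker by less than min_gap_s into it."""
    kept = []
    for t in sorted(marker_times_s):
        if kept and t - kept[-1] < min_gap_s:
            continue  # too close to the previous marker: would cut a tiny segment
        kept.append(t)
    return kept


print(coalesce_markers([10.0, 10.2, 10.3, 14.0]))
```

In this example the three markers between 10.0 and 10.3 seconds collapse into one cut point, so the encoder no longer emits two sub-second segments in a row.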
Segment Publishing Timeout
Encoding vendors first attempt to post media files to Hulu’s ingest service, followed by the corresponding media playlist. If a media file cannot be published within a certain amount of time, the media playlist will contain a discontinuity to indicate the missing segment, and that segment will not be available to the end user during video playback. By working with our vendors to set varied minimum segment publish timeouts between 150% of segment duration (for longer segments) and 250% of segment duration (for shorter segments), we decreased the missing segments in our system by 52%. This is compared to the previous configuration, which used minimum timeouts equivalent to 150% of segment duration across the board.
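To make the scaling concrete, the Python sketch below computes a timeout that moves from 250% of duration for the shortest segments to 150% for the longest. Only the two endpoint percentages come from the text above; the linear interpolation between them, and the use of 0.5 and 4 seconds as the short and long reference durations, are assumptions for illustration.

```python
def publish_timeout_s(
    duration_s,
    short_s=0.5,       # assumed shortest segment duration
    long_s=4.0,        # assumed regular segment duration
    short_factor=2.5,  # 250% of duration for shorter segments
    long_factor=1.5,   # 150% of duration for longer segments
):
    """Timeout that gives short segments proportionally more time to publish."""
    clamped = min(max(duration_s, short_s), long_s)
    fraction = (clamped - short_s) / (long_s - short_s)
    factor = short_factor + fraction * (long_factor - short_factor)
    return duration_s * factor


for duration in (0.5, 2.25, 4.0):
    print(duration, publish_timeout_s(duration))
```

The intuition is that fixed overheads such as connection setup do not shrink with segment length, so a flat 150% leaves very short segments with almost no slack.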
When our packaging service detects a high number of missing segments on a channel, we alter a configuration to increase the time the system waits for segments to arrive from the encoding vendor before it gives up and moves on to more recent video. Increasing this wait time causes users to fall further behind the live edge, but fewer segments will be missing and users will have a more continuous playback experience, so we only enable this offset on the most problematic channels. Decreasing the wait time causes more missing segments, but viewers remain closer to real time. By analyzing missing segment metrics, we found that setting the wait duration to 100% of the segment length decreases the frequency of missing segments by 63%. Any increase beyond this amount only minimally improves the playback experience and pushes users unnecessarily far behind the live edge.
Tips for Better Media File Transfer: Private Vendor Connections and Optimizing Amazon S3
Another major challenge was speeding up transfer times of media files during ingest. These media files are first s ion.
Vendor Network Connections
Hulu’s encoding vendors are located in various regions across the United States. We noticed that transferring media files from vendors on the opposite coast to our ingest service did not perform as well as we wanted: these transfers used public internet connections, which added latency and made performance unpredictable. To overcome this challenge, we worked closely with our vendors to set up AWS Direct Connect and establish private connections between vendors’ publishing platforms and Hulu’s ingest service. This bypasses the public internet, resulting in faster and more consistent file transfer speeds.
S3 File Operations
Our service uses S3 for both temporary and permanent storage of playlists and video segments. We identified sporadically slow S3 file operations as an obstacle to consistent playback quality. S3 upload and copy operations are critical to processing: if a video cannot be saved or moved to the correct location in time, it will not be available to end users, resulting in a playback interruption. To eliminate these sporadic slow operations, we continuously analyze metrics to determine the current expected median time for each file based on its size. Once the elapsed publish time for a file exceeds this expected time, the publish operation is immediately cancelled and retried by the publishing service. This implementation improved under-performing S3 operation times by up to 35% and eliminated nearly all cases of playback degradation.
Slowest 1% of publish operation times (milliseconds). Retry feature was enabled just before 15:00.
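The cancel-and-retry pattern can be sketched with Python's `asyncio`, where `asyncio.wait_for` cancels the awaited operation when the time budget is exceeded. This is a simplified illustration rather than the production service: `expected_time_s` is a made-up linear model standing in for the metrics-derived median, `publish` is a hypothetical callable wrapping the S3 operation, and the cap of three attempts is an assumption.

```python
import asyncio


def expected_time_s(size_bytes, base_s=0.05, per_mb_s=0.02):
    """Hypothetical stand-in for the expected median time at this file size."""
    return base_s + per_mb_s * size_bytes / 1_000_000


async def publish_with_retry(publish, size_bytes, max_attempts=3):
    """Run publish(); cancel and retry if it exceeds the expected time."""
    budget = expected_time_s(size_bytes)
    for attempt in range(1, max_attempts + 1):
        try:
            # wait_for cancels the in-flight operation on timeout.
            return await asyncio.wait_for(publish(), timeout=budget)
        except asyncio.TimeoutError:
            if attempt == max_attempts:
                raise  # give up after the assumed attempt limit


async def demo():
    calls = []

    async def flaky_publish():
        # Simulated upload: the first attempt stalls, later ones are fast.
        calls.append(1)
        await asyncio.sleep(1.0 if len(calls) == 1 else 0.0)
        return "ok"

    result = await publish_with_retry(flaky_publish, size_bytes=1_000_000)
    return result, len(calls)


print(asyncio.run(demo()))
```

Retrying early works because a slow operation is usually an outlier: a fresh attempt is more likely to complete near the median than the stalled one is to recover.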
Although we encountered a variety of new challenges when working with multiple input sources and connections, in many cases we were able to identify and mitigate problematic aspects of our original implementation to meet our initial requirements and improve our video ingestion pipeline. Overall, our design was sufficient for our initial Live TV launch, but we’re continuously improving and adding new features and capabilities to build an even better playback experience for our viewers.
Attending Grace Hopper this year? Come say hello and join me for an in-depth talk about our live linear video ingest system on Thursday, 27th at 1PM!
Interested in joining our video team and working on similar systems? See our full list of openings now.
Allison Deal is a senior software developer at Hulu, specializing in video encoding and streaming technologies. She works on building and scaling the end-to-end live and on-demand video pipelines, with the ultimate goal of improving the playback experience for all viewers. She has been at Hulu for over three years, with prior stints at Rdio and Boeing, where she worked in Research and Development.