Video@Scale: Playback Innovation

Ning Zhang
5 min read · Mar 28, 2018


This blog series describes the work of the Video Infrastructure team at Instagram during 2017. This post covers how we improved the video delivery and playback experience and shares our learnings. Disclaimer: This is a personal blog. I do NOT speak for Facebook, its subsidiaries, or associates.

The mission of the video delivery and playback infrastructure team is to provide the best video watching experience:

  • Instantaneous: video plays immediately upon start or seek, measured as the percentage of videos that start playing within one second of the play or seek request.
  • Smooth: video plays without stalls, measured by the stall ratio: stall time over total watch time (a rough sketch of these first two measurements follows this list).
  • High definition: video plays at the optimal quality allowed by the device, network, and other factors, measured by the High Video Quality ratio, a rather complex aggregate metric built from SSIM and other signals.
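To make the first two measurements concrete, here is a minimal sketch of how they could be computed from per-session logs. The data class and field names are hypothetical illustrations, not our actual logging schema, and the High Video Quality ratio is omitted since it aggregates SSIM and other signals.

```kotlin
// Hypothetical per-session playback record; field names are illustrative only.
data class PlaybackSession(
    val timeToFirstFrameMs: Long, // delay from the play/seek request to the first rendered frame
    val stallTimeMs: Long,        // total time spent rebuffering during the session
    val watchTimeMs: Long         // total watch time of the session
)

// Instantaneous: percentage of sessions that start within one second of the request.
fun instantPlayRatio(sessions: List<PlaybackSession>): Double =
    sessions.count { it.timeToFirstFrameMs <= 1_000 } * 100.0 / sessions.size

// Smooth: stall ratio = total stall time over total watch time.
fun stallRatio(sessions: List<PlaybackSession>): Double =
    sessions.sumOf { it.stallTimeMs }.toDouble() / sessions.sumOf { it.watchTimeMs }
```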

Below is a simplified conceptual diagram of the video delivery and playback infrastructure:

On the client side:

  • When the app starts, it binds a UI element and a video URL through the PlayerController to the VideoPlayer, which plays (i.e., decodes and renders) the video content. It also issues prefetch requests for videos it thinks are likely to play next.
  • The VideoPlayer plays video content from a playback buffer, which smooths over flaky networks at the cost of some playback latency. The BufferManager monitors and fills the playback buffer from cache; on a cache miss, a network request is issued to the NetworkManager.
  • The NetworkManager queues and prioritizes video fetch and prefetch requests, together with other network requests, to deliver the best user experience with the least bandwidth usage.
  • The BandwidthEstimator (BWE) uses network statistics to predict bandwidth. The ABR (Adaptive Bitrate Streaming) algorithm uses the bandwidth estimate and playback buffer stats to dynamically switch among bitrates to ensure smooth playback at optimal quality.
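As a rough illustration of the ABR bullet above, here is a minimal sketch of a rate-switching rule that combines the bandwidth estimate with the playback buffer level. The class name, thresholds, and safety factor are illustrative assumptions, not the algorithm we actually shipped.

```kotlin
// Hypothetical ABR rule: pick the highest rendition the estimated bandwidth can
// sustain, but step down when the playback buffer is nearly drained.
class SimpleAbrRule(
    private val bitratesBps: List<Long>,          // available renditions, ascending
    private val safetyFactor: Double = 0.7,       // only budget ~70% of the estimate
    private val lowBufferThresholdMs: Long = 4_000
) {
    fun selectBitrate(estimatedBandwidthBps: Long, bufferedMs: Long): Long {
        val budget = (estimatedBandwidthBps * safetyFactor).toLong()
        // Highest rendition that fits within the discounted bandwidth estimate.
        var choice = bitratesBps.lastOrNull { it <= budget } ?: bitratesBps.first()
        // If the buffer is almost empty, drop one level to reduce the risk of a stall.
        if (bufferedMs < lowBufferThresholdMs) {
            val index = bitratesBps.indexOf(choice)
            choice = bitratesBps[maxOf(0, index - 1)]
        }
        return choice
    }
}
```

A production rule also has to handle switch-up/switch-down hysteresis, segment sizes, and device constraints; the sketch only captures the two inputs named in the bullet above.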

On the server side:

  • All network requests go through at least two levels of cache, the edge cache in regional PoPs and the origin cache in Facebook data centers, to improve scalability and latency (a toy sketch of this lookup chain follows this list).
  • The origin cache can be populated either from transcoding servers (e.g., Live video and on-demand transcoding) or from storage servers (e.g., pre-transcoded VOD and post-Live content).
  • The server has complex logic to decide between proactive and on-demand transcoding and to deliver the optimal bitrate and format of the video content for each request; it may also proactively push video content to the origin and edge caches before it is requested, to reduce cache misses and playback latency.
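As a toy illustration of that request path (not the actual CDN code), each tier in the sketch below answers from its own cache or falls through to the next one, with the backing source standing in for the storage and transcoding servers. All names are hypothetical.

```kotlin
// Hypothetical two-tier cache in front of the backing source (pre-transcoded
// storage or an on-demand transcoder); each tier answers from its own cache
// or falls through to the next one and remembers the result.
interface SegmentSource {
    fun fetch(key: String): ByteArray?
}

class CacheTier(private val upstream: SegmentSource) : SegmentSource {
    private val cache = HashMap<String, ByteArray>()
    override fun fetch(key: String): ByteArray? =
        cache[key] ?: upstream.fetch(key)?.also { cache[key] = it }
}

// Edge cache in a regional PoP wrapping the origin cache in the data center.
fun buildDeliveryChain(backing: SegmentSource): SegmentSource =
    CacheTier(CacheTier(backing))
```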

Of course, we didn’t get to the above system in one go. Here are some of the key efforts before and mostly during 2017:

  • Simple things first: We started with the video player of the client operating system (Android, iOS, and browser), supported progressive download only, and pretty much treated all videos the same. It worked: we launched video features and grew the user base quickly.
  • Application Player: The system video players differed dramatically from client to client (different devices, OSes, browsers, and different versions of the same device/OS/browser), lacked the features and performance we needed, and couldn’t keep up with our rapid growth and iteration. We needed a player shipped in the app to all clients that we could improve from release to release. On Android, we wrapped ExoPlayer; on iOS, we developed our own, called FnFPlayer.
  • Out of Proc Player: Both system and application video players use low-level media stack APIs for decoding and rendering. The Android platform was fragmented: OEMs routinely customize Android, especially the media stack, to differentiate their products. This caused lots of crashes while playing video, so we moved the video player into a separate PlayerService process and used IPC (a local socket) to communicate between the app and the player processes. Our Android lead Zen demonstrated the reliability improvement: he killed the player service using adb but the video continued playing; behind the scenes, the video infra immediately detected the PlayerService termination, restarted it, and resumed the playback (a rough sketch of this pattern follows this list).
  • Refactoring and unification: As in video upload, initially there was no clear separation of concerns between product and infra code, or among the different layers and components of video playback. As new video features were implemented, the playback stack was usually copied and tweaked for each new feature. So we ended up with several different video players and player controllers; many products did their own fetching, prefetching, and buffer management; and the same video metrics had different meanings for different product and platform combinations :-) This is definitely not good. Throughout 2017 we continuously refactored the playback stack to eliminate duplicate code and implemented a clear separation between product and infra code and among the different components of the playback stack.
  • Logging and metrics: I’d like to specifically call out the logging and metrics effort, as it is super important for measuring business and engineering progress and guiding investment. We defined the state machine for video playback, spec-ed out all logging and metrics calculations, replaced product-specific logging with a unified logger, and validated and cleaned up logs with the State Based Logger. We built comprehensive dashboards, alerts, and a weekly status report for our metrics. Our server lead Lukas, iOS engineer Wee, and TPM Mona played key roles in driving the logging and metrics effort.
  • ABR & Innovations: Another key effort and achievement in 2017 was implementing ABR. Progressive download is old technology. To provide instantaneous, smooth, and high-quality video playback for almost a billion monthly active users, we adopted and improved ABR, and experimented with and implemented all kinds of changes: from client-side bandwidth estimation and network shaping, to server-side smart transcoding and CDN priming, to new video protocols, storage, and compute infrastructure. We are at a scale and complexity where existing technologies may not be good enough, so we have to experiment and innovate to keep serving our fast-growing user community better.
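For the out-of-process player mentioned above, the actual implementation used a local socket for IPC. Purely to illustrate the detect-restart-resume pattern, here is a hedged sketch built on a standard Android bound-service connection; the class names and callbacks wired through it are hypothetical, not our real code.

```kotlin
import android.content.ComponentName
import android.content.Context
import android.content.Intent
import android.content.ServiceConnection
import android.os.IBinder

// Hypothetical supervisor for an out-of-process player: bind to the player
// service, and if its process dies, let the system restart it and resume
// playback from the last reported position.
class PlayerServiceSupervisor(
    private val context: Context,
    private val serviceIntent: Intent,  // explicit intent for the (hypothetical) player service
    private val onReconnected: (binder: IBinder, resumeAtMs: Long) -> Unit
) : ServiceConnection {
    @Volatile var lastPositionMs: Long = 0L  // updated by playback progress callbacks

    fun bind() {
        // BIND_AUTO_CREATE keeps the service alive and recreates it after a crash.
        context.bindService(serviceIntent, this, Context.BIND_AUTO_CREATE)
    }

    override fun onServiceConnected(name: ComponentName, binder: IBinder) {
        // Fires on first connect and again after the system restarts a dead process.
        onReconnected(binder, lastPositionMs)
    }

    override fun onServiceDisconnected(name: ComponentName) {
        // The remote process was killed; the framework will bring the service
        // back and call onServiceConnected again, where playback resumes.
    }
}
```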

Looking back, here are some learnings that may be worth sharing:

  • Always focus on speed and results. The industry changes so rapidly and our product grows so fast that it is hard to anticipate what’s next, so we focus on shipping high-quality MVPs (minimum viable products: the simplest, complete, key use cases) quickly, leveraging existing technology and components as much as possible. If something works out well, we increase investment in both product and infrastructure; if not, we keep tweaking it or try something else :-)
  • Developer infrastructure is an enabler. To be able to try out new ideas concurrently and continuously, we need CI, CD, feature toggles, A/B testing, and similar infrastructure. Most of the changes mentioned above were carried out concurrently, across features, platforms, client and server, by many teams across Instagram and Facebook. It wouldn’t have been possible without the amazing infrastructure at Facebook.
  • Continuous innovation is a differentiator. Ideas are a dime a dozen, and first-mover advantage doesn’t carry you far on the Internet. It is the capability to continuously innovate, disrupt, deliver, and solve the hardest problems ahead of others that differentiates winners from losers. For that, you need passionate, dedicated people and a culture of excellence.

All posts of this Video@Scale series:
