Video@Scale: Upload Unification

Ning Zhang
5 min read · Mar 16, 2018


This blog series describes the work of the Video Infrastructure team at Instagram during 2017. This post covers how we unified multiple video upload code paths into one well-designed, layered, and componentized upload infra with big metrics wins, and our learnings. Disclaimer: This is a personal blog. I do not speak for Facebook, its subsidiaries, or associates.

As video features in Instagram feed, direct, and stories were developed over the years, the video upload code path was copied and modified for each new use case, so we ended up with multiple different video upload code paths on client and server :-) While the copy-and-paste approach decoupled and simplified the upload code paths across use cases and sped up the launch of new features, it also accumulated technical debt with obvious downsides:

  • Product code is deeply coupled with infra code, making it hard for either team to own and improve their code base.
  • For infra team, the maintenance cost for video upload code path increases almost quadratically, as bug fixes and performance, efficiency, and reliability (PER) improvements need to be replicated and customized for each code path.
  • For product teams, the inconsistent video infrastructure complicated new feature development and became a key pain point.
  • And the duplicated code increases app weight and method count, degrading overall app performance.

So we decided to “garbage collect” the duplicated legacy code paths into one unified video upload infrastructure that can support all existing and planned use cases. It has clearly defined interfaces that separate product from infra and decouple the layers and components of the upload infra, so we can improve or replace each component and layer independently, and all products benefit from infra improvements automatically, without code changes on their side.

We defined success metrics as improvements in latency (how long it takes to upload a video) and reliability (the percentage of successful uploads). We did an end-to-end analysis of the video upload-to-playback flow (see the diagram below) to see where time was spent and where failures happened.

Here are two key findings:

  1. Most of the upload time was spent, and most failures happened, on the client during the encode and upload steps. Video files are huge, so they need to be compressed (encoded) before uploading. Mobile phones usually have limited CPU and memory, and flaky network connections, so both encoding and uploading can take a long time. Mobile apps also get killed frequently, either by the user or by the system (e.g., out of memory), so retries or restarts need to happen as well to improve reliability, at the cost of latency.
  2. On the server side, we used nginx servers with chunked Transfer-Encoding to receive the huge video files via multiple HTTP requests. This turned out to be a major operational headache and a source of failures and latency, and it was not well instrumented for logging and monitoring.

For #1, we decided to do “segmented upload”: each video file is broken down into multiple segments, each segment is encoded and uploaded individually, so we can parallelize/overlap encoding and uploading at segment level and cut the upload time significantly for long videos. We also do smart resume & retry at segment level instead of file level to improve both reliability and latency.
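The segment-level pipeline can be sketched in Python. Everything here is illustrative: real segmentation is time-based rather than byte-based, the encoder is a real video codec (zlib stands in for it), and `send` stands in for the network call; the function names are invented for the sketch.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

SEGMENT_SIZE = 1 << 16  # hypothetical fixed-size segments; real segments are time-based


def split_into_segments(data: bytes, size: int = SEGMENT_SIZE) -> list:
    # Break the source file into independent segments.
    return [data[i:i + size] for i in range(0, len(data), size)]


def encode_segment(segment: bytes) -> bytes:
    # Stand-in for video encoding: zlib plays the role of the codec here.
    return zlib.compress(segment)


def upload_segment(index: int, payload: bytes, send, max_attempts: int = 3) -> None:
    # Retry at segment granularity: a transient failure re-sends only this
    # segment, not the whole file.
    for attempt in range(1, max_attempts + 1):
        try:
            send(index, payload)
            return
        except ConnectionError:
            if attempt == max_attempts:
                raise


def segmented_upload(data: bytes, send, workers: int = 4) -> int:
    # Encode and upload segments in parallel so the two steps overlap
    # instead of running file-encode then file-upload sequentially.
    segments = split_into_segments(data)

    def process(item):
        index, segment = item
        upload_segment(index, encode_segment(segment), send)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(process, enumerate(segments)))
    return len(segments)
```

The key property is that a dropped connection mid-upload only costs one segment's worth of work, which is why segment-level resume improves both reliability and latency for long videos.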

For #2, we decided to deprecate our own nginx server and chunked transfer-encoding protocol and adopt FBUpload, the upload service and protocol provided by Facebook Video Infra team, for better latency, reliability, and scalability. With this migration, we also benefit from future improvements by FBUpload team.

We chose feed as the first use case for the new video upload infra, as feed has the longest videos (up to one minute) and hence can benefit the most from the new infra. It took us 6 months to implement the new video upload infrastructure with upload unification, segmented upload, and FBUpload integration. The results were very encouraging: latency was cut by more than 2x, and failure rate was down by almost 5x. Once we demonstrated the metrics improvements to product teams and outlined our plan for future improvements, product teams embraced the new video upload infra enthusiastically, committed resources to migrate product code to the new infra, and saw significant product metrics wins, in both the number of videos uploaded/sent and time spent. The upload unification project was highlighted at various Better Engineering reviews within Facebook.

Here are some of our learnings from the upload unification project:

  • A well designed, unified infrastructure is an enabler and a multiplier. With the new upload infra, we were able to support tremendous growth of Instagram video use cases. Product teams can now focus on product development and benefit automatically from performance, efficiency and reliability improvements from infrastructure, and all Facebook apps (Facebook, Instagram, Messenger etc) benefit from the company wide common infrastructure like FBUpload.
  • Infrastructure must always focus on product benefits and demonstrate value in product metrics gains. From the beginning we focused on the metrics that matter to product teams, latency and reliability, and demonstrated the value of the new infra with product metrics gains: the number of videos uploaded/sent and time spent.
  • Focus on continuous and incremental gains to mitigate risk and establish credibility. The upload unification was very complex and took almost a year to finish, across infra features (upload unification, segmented upload, and FBUpload integration), platforms (Android, iOS, and server), and product use cases. The combination matrix was mind-boggling. We focused on one use case only at the beginning; decoupled upload unification, segmented upload, and FBUpload integration as separate work streams; and focused on the end-to-end scenario from client (either Android or iOS) to server. This approach allowed us to always work on the simplest possible end-to-end use case, deliver real infra and product metrics gains at every step, and build up product adoption over time. It may have lengthened the overall project timeline, but the benefits (in team morale, risk mitigation, and credibility building) far outweighed the cost.

I’d like to highlight a key Instagram engineering principle: Do the simple thing first. This is demonstrated throughout the project:

  • It was OK to copy and paste the video upload code for a new use case when there was no upload infra. Always focus on the end result, speed to market, and fail fast if necessary.
  • We did the simplest thing possible at the beginning: treat all videos the same, encode and then upload sequentially. It worked :-)
  • We then built the unified upload infra with segmented upload and FBUpload integration. It greatly improved performance, efficiency, and reliability.
  • We then made it smart. There is no “typical” video or user: an influencer is different from someone with few followers, an iPhone user in the US is different from an Android user in India, and video in Direct is different from video in Feed. We need to treat each video differently, start with heuristics, and then move to machine learning trained models, to optimize the upload process for each video individually.
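To make the heuristic stage of the last point concrete, here is a sketch of a per-video settings chooser. The function name, inputs, and every threshold below are invented for illustration, not Instagram's actual rules; the point is the shape of the decision, which a trained model would later replace.

```python
def choose_upload_settings(duration_s: float, network: str, device_class: str):
    """Pick per-video upload settings from simple heuristics.

    Returns (target_bitrate_kbps, use_segmented_upload). All names and
    thresholds are illustrative; a production system would learn them from data.
    """
    # Long videos benefit most from segment-level encode/upload overlap.
    use_segmented = duration_s >= 10

    if network == "wifi":
        # Plenty of bandwidth: spend bits on quality.
        bitrate = 3500 if device_class == "high" else 2000
    else:
        # Cellular: spend more CPU on compression to shrink the payload.
        bitrate = 1500 if device_class == "high" else 800

    return bitrate, use_segmented
```

Starting from a transparent rule table like this makes it easy to A/B test each branch, and gives the later machine-learned model a baseline to beat.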
