Video social analytics at scale using Apache Spark

zeev
Vimeo Engineering Blog
Aug 13, 2020

In today’s world of social distancing, videos are more important than ever before, and social media is a big part of this new world. Vimeo provides members like you with in-depth analytics when you publish videos to social media platforms. In this blog post, we take a deep dive into how we developed semi-complex data pipelines to access the APIs of platforms like Facebook, YouTube, LinkedIn, and Instagram at scale using Apache Spark.

As Albert Einstein once said, “Wisdom is not a product of schooling but of the lifelong attempt to acquire it.” Acquiring wisdom and knowledge is a long and expensive endeavor, and by providing one place with all the video statistics for social media, we can shorten that journey for video creators.


What problem are we trying to solve?

Imagine a situation where your goal is to retrieve statistics for every video in every social platform channel or page that the user owns. Now think in terms of Big Data, where you have millions or even billions of user videos, and you need to make multiple API calls per user video daily without overwhelming the API service.

As a result, our problem has four main layers:

  • Performance. How do we retrieve daily video data from each platform efficiently and quickly?
  • Rate limits. How do we create a sustainable framework that never exceeds the social platform’s rate limits while collecting the majority of statistics about the videos on the platform?
  • Fault tolerance. How do we ensure that nothing breaks, with minimum data loss, maximum consistency, and failure recovery?
  • API policy. How do we consistently and thoroughly follow API and platform policies without getting our token or app revoked or banned? Don’t ask me how I know :)

These problems, along with many others, were front of mind when we designed our solution and pipelines.

A quick note about the tech stack and Spark

For simplicity’s sake, the focus of this blog is mostly on the processing side — the Spark Structured Streaming pipelines — and less on the storage, schemas, and so on. However, in a nutshell, we are using HBase, Phoenix, HDFS, Druid, ClickHouse, Kafka, Spark (> 2.4), and Airflow, and we code mostly in Scala, Java, and Python.

Spark is all about distributed computing

Before we jump into the solution, it’s important to understand Spark and Spark streaming. I’ll be brief.

In the simplest terms, Spark is a distributed cluster-computing framework where processing begins by distributing data and work across machines, or executors, in the cluster. Spark splits the data into smaller parts or logical divisions called partitions, the most basic unit of parallelism. Each one of them is sent to an executor to be processed, where only one partition is computed per executor thread at a time. However, when a transformation such as summation requires information from other partitions, shuffling occurs, where data is written to disk and transferred across the network.

Cluster mode overview

Correspondingly, every structured streaming job in Spark relies on some type of checkpoint, a place to store the information needed to recover from failures, along with offsets. Checkpointing supports exactly-once delivery in micro-batch mode and at-least-once delivery in continuous mode. In our case, we’re using micro batches, which you can think of as a series of small, regular Spark batch jobs running on some time interval.
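To make checkpoints and micro batches concrete, here is a minimal structured streaming sketch in Scala; the Kafka topic, broker address, and paths are placeholders rather than our production configuration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("micro-batch-sketch")
      .getOrCreate()

    // Read tokens from a Kafka topic (topic name and broker are hypothetical).
    val tokens = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "social-tokens")
      .load()

    // Write out as micro batches: each trigger runs a small Spark batch job,
    // and the checkpoint directory stores offsets and state for recovery.
    tokens.writeStream
      .format("parquet")
      .option("path", "/data/tokens/output")
      .option("checkpointLocation", "/checkpoints/tokens")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()
      .awaitTermination()
  }
}
```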

Understanding the architecture

To solve the problems outlined above, you first need to identify the biggest bottleneck or barrier to a robust solution. Most of the problems that I mentioned tend to be standard issues that come up in every new data pipeline; however, as with everything else in life, the biggest barriers are the ones we can’t control or have very little influence over: social platform API rate limits.

Our solution for accessing external data services at scale is centered around structured streaming and using backpressure to control the number of API calls. In addition, since there are a limited number of social media platforms with widely different APIs and rate limits, it makes sense to decouple the components and have a separate pipeline for each social platform.

The following figure shows a simplified overview of the pipeline architecture: getting the tokens, scheduling them, retrieving data from the APIs as a stream in a distributed and fault-tolerant fashion, and ingesting the results into storage.

The next sections go into detail about the different pipeline stages and design choices.

Token ingestion

Every data pipeline starts with a source. Our pipelines depend on social platform tokens that users grant through different points on Vimeo, usually through our website or mobile app. These tokens then get ingested into our secure internal tables.

We have additional pipelines to revoke and remove tokens and information automatically based on user request.

Token scheduler

This is the first important step in applying backpressure and ultimately limiting the number of API calls per time interval. Since our goal is to deliver daily video statistics for every social platform without overwhelming the API service, we schedule an even proportion of tokens so that they are distributed equally throughout the day, with some buffer.

For instance, if we have 240 valid Facebook tokens in the table, we retrieve 10 tokens every hour; this way we guarantee to cover all the existing 240 tokens (10 tokens times 24 hours) while having enough buffer for downstream execution to complete.
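A rough sketch of that calculation, where the buffer parameter is purely illustrative:

```scala
// Even-proportion scheduling: how many tokens to pull per hourly run, given the
// total number of valid tokens for a platform. bufferRuns just leaves slack for
// downstream jobs to finish and is a hypothetical knob, not our actual setting.
def tokensPerRun(totalTokens: Long, runsPerDay: Int = 24, bufferRuns: Int = 0): Long = {
  val effectiveRuns = math.max(runsPerDay - bufferRuns, 1)
  math.ceil(totalTokens.toDouble / effectiveRuns).toLong
}

tokensPerRun(240) // 240 Facebook tokens over 24 hourly runs => 10 tokens per run
```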

Accessing external data services and APIs using Spark

Here comes the meat of the pipeline: the solution to our most troublesome problem of accessing APIs in parallel, quickly, without disrespecting the rate limits.

Why Apache Spark?

It goes without saying that the solution for accessing external data from APIs at scale depends heavily on a distributed computing framework. The following figure shows how the Apache Spark framework can be utilized for the API calls.

Don’t be alarmed just yet if it’s a bit confusing. We’re about to go over it in detail (but please also review the previous section about Spark and distributed computing).

Every micro batch in our stream is essentially a data frame or data set (that is, a distributed collection of data organized into rows and columns) of social user tokens produced by the token scheduler stage. This enables us to transform the data frame by applying an operation or function to each token, or partition of tokens, that executes the API calls and computes the final results into another data frame, all in parallel across many machines and threads in the cluster.
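As an illustration, a simplified sketch of that per-partition transformation might look like the following, where fetchChannels stands in for a hypothetical HTTP call to the platform API:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class UserToken(userId: Long, accessToken: String)
case class ChannelRow(userId: Long, channelId: String)

object ChannelFetchSketch {
  // Hypothetical API call: fetch the channels or pages owned by this token's user.
  def fetchChannels(token: UserToken): Seq[ChannelRow] = Seq.empty // stubbed out

  def channelsForBatch(spark: SparkSession, tokens: Dataset[UserToken]): Dataset[ChannelRow] = {
    import spark.implicits._
    // Each partition is handled by one executor thread, so the API calls run in
    // parallel across the cluster while the results land in a new data set.
    tokens.mapPartitions(_.flatMap(fetchChannels))
  }
}
```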

Streaming flow and design

Before we explore the design choices and optimizations, it’s important to understand the sequence of events that you need to follow to retrieve video statistics in most social media APIs:

  1. Make one API call per user token to retrieve all the channels or pages that the user owns, such as YouTube channels or Facebook pages.
  2. Make one API call per channel or page user to retrieve all the videos that the user has.
  3. Make one API call per video to retrieve video statistics and analytics.

Each type of API call usually has its own rate limit criteria to prevent the API from being overwhelmed. For instance, the API for retrieving channels and pages has a limit of 250,000 calls per minute, while the video API has a limit of 150,000 calls per minute, and so on.

Therefore, to decouple the logic without exceeding rate limits while ensuring fast processing, the streaming design should look like this:

By decoupling into one streaming job per API call, we can address and manage issues for each API call type independently, minimize failures and data loss, scale based on each API, and make future changes with minimal impact.

Backpressure: the secret sauce

This is where the magic truly happens. With backpressure, you can guarantee that your Spark streaming application runs reliably and steadily.

In Spark Structured Streaming, the magic is in the source option maxOffsetsPerTrigger, which lets you set the maximum number of messages or offsets processed per trigger interval. Therefore, for the first stage of retrieving user channels or pages, with a rate limit of 250,000 calls per minute, we can set maxOffsetsPerTrigger to 250000 and specify a streaming trigger of 1 minute. For retrieving videos, we can set maxOffsetsPerTrigger to 150000, and so on for the other APIs.
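A minimal sketch of that configuration for the channels/pages stage, with placeholder topic names and paths:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("channels-stream").getOrCreate()

// Cap each micro batch at the per-minute rate limit of the channels/pages API.
// The topic, broker, and paths are hypothetical.
val channelRequests = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "scheduled-tokens")
  .option("maxOffsetsPerTrigger", "250000") // at most 250,000 offsets per trigger
  .load()

channelRequests.writeStream
  .format("parquet")
  .option("path", "/data/channels")
  .option("checkpointLocation", "/checkpoints/channels")
  .trigger(Trigger.ProcessingTime("1 minute")) // one micro batch per minute
  .start()
```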

Putting it all together

Our pipelines depend tremendously on the ever-evolving social media APIs and their rate limits. Even so, by carefully designing the pipelines around those rate limits — the scheduling, the backpressure, and the decoupling of the various APIs into separate Spark streaming applications — we managed to develop an extendable and robust framework for calling external APIs.

Looking at optimizations

We have utilized many optimizations throughout the pipelines. Here are some of them.

Network throttling

To retry per API call in case of a network TTL exception, you can do something like this:
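A minimal retry wrapper with exponential backoff; the exception type, attempt count, and backoff values are illustrative:

```scala
import scala.util.{Failure, Success, Try}

// Retry a single API call on a network timeout, backing off between attempts.
def withRetry[T](maxAttempts: Int = 3, backoffMs: Long = 1000)(call: => T): T = {
  Try(call) match {
    case Success(result) => result
    case Failure(_: java.net.SocketTimeoutException) if maxAttempts > 1 =>
      Thread.sleep(backoffMs) // back off before trying again
      withRetry(maxAttempts - 1, backoffMs * 2)(call)
    case Failure(e) => throw e
  }
}

// Usage: wrap the HTTP call that fetches video statistics (fetchVideoStats is hypothetical).
// val stats = withRetry() { fetchVideoStats(token, videoId) }
```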

Batch requests

In some APIs, you can send a batch request that contains multiple calls but counts as only one request against your rate limit. Facebook offers batch requests, for example. It makes good sense to take advantage of this capability whenever possible.
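As a rough illustration, grouping per-video requests into Graph API batch payloads might look like this; the relative URLs, the insights edge, and the 50-request chunk size are assumptions to check against the current Graph API documentation:

```scala
val videoIds: Seq[String] = Seq("1234567890", "9876543210") // example IDs

// Build the JSON array that goes into the Graph API "batch" parameter.
def batchPayload(ids: Seq[String]): String = {
  val requests = ids.map { id =>
    s"""{"method":"GET","relative_url":"$id/video_insights"}"""
  }
  requests.mkString("[", ",", "]")
}

// The Graph API caps the number of operations per batch (50 at the time of writing),
// so chunk the videos before building each payload.
val payloads: Iterator[String] = videoIds.grouped(50).map(batchPayload)
```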

Smart token scheduling

In our example above, we demonstrated a pretty simple formula to schedule an even proportion of tokens throughout the day. However, the following are issues that we have addressed and optimized:

  • Data delay. Since some users might see their data refreshed later in the day, it makes sense to prioritize users dynamically based on different criteria, such as the number of videos, importance, audience size, and so on, or even randomize the order so that no one group of users has more priority than the others.
  • API overload. We can estimate the maximum total number of API calls per group of tokens and schedule more or fewer tokens at a given interval.
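The sketch below illustrates both ideas, with a hypothetical priority score and a hypothetical per-token call estimate:

```scala
case class ScheduledToken(userId: Long, videoCount: Long, audienceSize: Long)

// Order tokens by a priority score and cap each run by an estimated API budget.
def scheduleRun(tokens: Seq[ScheduledToken],
                apiBudgetPerRun: Long,
                estimatedCallsPerToken: Long = 20): Seq[ScheduledToken] = {
  // Keep the estimated total calls for this run within the API budget.
  val maxTokens = (apiBudgetPerRun / estimatedCallsPerToken).toInt
  tokens
    .sortBy(t => -(t.videoCount + t.audienceSize / 1000)) // simple priority score
    .take(maxTokens)
}
```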

Token validation

Since tokens can get invalidated or expire at any moment in time, each and every Spark-streaming API job must validate and update the tokens table — the place where we store all the tokens — based on the validity of the tokens themselves.
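A minimal sketch of that update step, assuming the phoenix-spark connector and hypothetical table, column, and connection settings:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Flag tokens that an API batch reported as invalid, so the scheduler skips them.
def markInvalidTokens(apiResults: DataFrame): Unit = {
  // apiResults is expected to carry (userId, token, isValid) from the streaming job.
  apiResults
    .filter("isValid = false")
    .select("userId", "token")
    .withColumn("isValid", lit(false))
    .write
    .format("org.apache.phoenix.spark") // phoenix-spark connector
    .option("table", "TOKENS")          // hypothetical tokens table
    .option("zkUrl", "zookeeper:2181")  // hypothetical ZooKeeper quorum
    .mode("overwrite")                  // the connector treats this as an upsert
    .save()
}
```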

Meet my team

None of this would have been possible without Evan Potter (back end, front end, and full stack engineer; yes, he is that good), Rohit Chaudhry (video analytics manager), Rachel Ng (product design), Michal Enriquez (product manager), Iryna Prokopova (scrum leader), Lior Solomon (data director), and me (Big Data engineer).
