Alpaca: Airflow at JW Player

William Bartos
JW Player Engineering
7 min read · Apr 22, 2020

The mission of the Data Pipelines team at JW Player is to collect, process, and surface data from the world’s largest network-independent video platform. Because our customers span a wide range of industries and use cases, the tools we develop have to be scalable, flexible and reliable for both our internal and external customers.

Batch processing makes up half of our data processing pipelines; the other half is real-time processing with Apache Flink. As our scale continues to grow, so does the need for a flexible, stable and scalable batch processing platform. The focus of this blog post is how we built a platform on top of Apache Airflow called Alpaca. Alpaca has been a huge improvement to how we process batch data at JW Player, allowing us to create new datasets and batch ETL jobs faster than ever before.

What is our data and scale?

Our data is mostly an event stream of Pings. A single Ping represents an event that happened on a specific JW Player, ranging from a video being played, to an ad being clicked, to a video being recommended. Since we have tens of thousands of instances of JW Player across the open web and millions of viewers, we see around 3 billion of these Pings per day.

Where we came from

Usage-Mini is the name of our previous batch processing platform. It’s a custom application built on top of Luigi. Using Luigi meant that much of the difficult work of operating a data pipeline, scheduling and state management, had to be developed and maintained by us:

  • Calendar scheduling — running at a specific time rather than a cron interval
  • State/lifecycle management
  • Successes, failures, retries
  • Backfills, historical reruns
  • Run history

While Luigi served our needs at the time, we quickly outgrew it. A big pain point with Usage-Mini was the speed at which we could create or modify datasets. A simple ask like adding a column could take over a week on our largest tables.

The other pain point was job state management: we had no easy way to reprocess, backfill, or scale up jobs short of manually intervening in the backend database.

What makes Airflow great?

Refactoring our Luigi-based system wouldn’t fix the fundamental problems that we had with it. Over the past summer, our Data Science team started adopting Airflow for their data processing needs, and we decided to do a proof of concept ourselves. This is what we love about Airflow:

  • DAGs (Directed Acyclic Graphs) as code — Expressing workflows as Python is a lot easier than using YAML.
  • High observability — The Airflow UI makes managing jobs much simpler than managing state through a backend database CLI. We also have detailed job history and logs all in one place.
  • Native calendar scheduling — Airflow has great built-in tools and macros to specify when your DAG should run. No more cronjobs.
  • Scalability — We’re quickly able to parallelize and scale out our DAGs via Airflow’s distributed execution.
  • Decoupled — We can deploy changes without disrupting the entire system. That includes redeploying the scheduler!
  • Fault tolerance — Built-in tools for retrying, backfilling, and reprocessing.
  • Sensors — We heavily rely on sensors polling databases to launch jobs. We also use ExternalTaskSensors to separate our compute and loading DAGs for better reliability and performance (see the sketch after this list).
  • Plugins — Airflow at JW Player is a cross team effort between the Data Pipelines team and the Data Science team. The plugins feature allows us to work collaboratively on new features and easily distribute them.
  • Strong open source community — Airflow has a lot of great developers and momentum, allowing us to use excellent community developed tools.
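
To make the compute/loading split concrete, here is a minimal sketch of a loading DAG that waits on a separate compute DAG via an ExternalTaskSensor. The DAG IDs, task IDs, and schedule are hypothetical, and both DAGs are assumed to run on the same schedule:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

# Hypothetical loading DAG that waits for a separate compute DAG to finish.
with DAG(
    dag_id="pings_load",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@hourly",
) as dag:
    # By default the sensor waits on the run of the external DAG that shares
    # this run's execution date, so the two schedules must line up.
    wait_for_compute = ExternalTaskSensor(
        task_id="wait_for_pings_compute",
        external_dag_id="pings_compute",     # the upstream compute DAG
        external_task_id="aggregate_pings",  # its final task
    )

    load = DummyOperator(task_id="load_into_warehouse")  # stand-in for a real loader

    wait_for_compute >> load
```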

Migrating to Airflow

Usage-Mini, our Luigi-based orchestrator, is an application. Alpaca is a platform built on top of Airflow. We provide a shared set of tooling and utilities that allows our developers to quickly create new datasets from modular components.

Alpaca is:

  • Fully containerized — Our base Airflow image is based on Puckel’s Docker-Airflow. Additionally, our custom operators are containerized.
  • Completely on Kubernetes — The Airflow UI, Executor, and Scheduler all live within one pod. All Airflow tasks (besides sensors) are launched as their own Kubernetes jobs.
  • Focused on modularity — The base of our custom operators are written in code, and use cases are written as YAML configs. This allows us to quickly create new jobs from a range of generic tools. We provide a range of operators that developers can use to quickly create custom Airflow tasks.
  • Batteries included — We designed Alpaca for use by anyone, not just Data Engineers. However, as Data Engineers we built guarantees into the platform around idempotency, fault tolerance, testing, and monitoring. This lets end users focus on the core of the job they want to create.
  • Lightweight — We use S3 for storing configs and writing out intermediate data, as opposed to holding them in memory. And instead of passing data through XCom, we pass S3 paths (see the sketch below). This means that our Kubernetes jobs can run with fewer resources.
  • Designed for frictionless development — DAGs, configs, and operator images are all deployed separately via Buildkite.

[Figure: The Alpaca Ecosystem]
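
To illustrate the XCom point above: instead of pushing intermediate data itself, a task returns the S3 path where it wrote that data, and the downstream task pulls only the path. A minimal sketch, with hypothetical bucket and task names:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def compute(**context):
    # Write intermediate data to S3 (omitted) and hand back only its path.
    # The return value is automatically pushed to XCom.
    return "s3://example-bucket/intermediate/%s/part-0.parquet" % context["ds"]

def load(**context):
    # Pull the S3 path, not the data itself, keeping XCom payloads tiny.
    path = context["ti"].xcom_pull(task_ids="compute")
    print("loading %s into the warehouse" % path)

with DAG(
    dag_id="xcom_s3_paths_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    compute_task = PythonOperator(
        task_id="compute", python_callable=compute, provide_context=True
    )
    load_task = PythonOperator(
        task_id="load", python_callable=load, provide_context=True
    )
    compute_task >> load_task
```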

What makes Alpaca unique?

Custom Operator Images

We want Alpaca to be as flexible and composable as possible. To accomplish this, we provide a range of base operator images. These images allow us to run Airflow tasks that create pyspark-emr jobs, run database queries, and execute Python scripts.

On top of these images, developers specify what their task does via a flavor. A flavor can be thought of as an implementation of a specific base image. In most cases, flavors are completely config driven. We provide all of the code in the base image, and then let the config specify the use case.

We see this as a best-of-both-worlds approach. Configuration files are great for specifying job-related parameters, such as which database to run a query on. Configs are also easy for end users to develop, and we can enforce the config schema with tools like Yamale. We also don’t want end users to have to touch the internal operator code. However, configuration as code doesn’t work well when we want to define DAGs and complex logic. So we offload the code to the base image and the DAG definitions, and use configs for the rest.
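
As a concrete example of that schema enforcement, here is roughly how a flavor config could be validated with Yamale. The schema, file paths, and fields are hypothetical, not Alpaca’s actual schema:

```python
import yamale

# Hypothetical schema for a snowflake-loader flavor, e.g.:
#   database: str()
#   table: str()
#   s3_prefix: str()
#   warehouse: str(required=False)
schema = yamale.make_schema("schemas/snowflake_loader.yaml")

# Validate a flavor config before it ever reaches Airflow.
data = yamale.make_data("flavors/example_flavor.yaml")
yamale.validate(schema, data)  # raises on a config that violates the schema
```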

All tasks run on Kubernetes via a single operator

We have a unified platform called the Deployment System for managing Kubernetes deployments. Additionally, we have an API called Gantry which is used to launch and manage these deployments. In order to launch Airflow tasks on the Deployment System, our Data Science team created the GantryJobOperator. Built on Airflow’s BaseOperator, it interfaces with the Gantry API, allowing Alpaca to launch Airflow tasks as Kubernetes Jobs. All the operator needs is a DockerHub image reference along with configuration options.
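
Since Gantry and the GantryJobOperator are internal to JW Player, here is a hedged sketch of what the operator’s surface might look like, based purely on the description above; the constructor arguments and client are assumptions:

```python
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class GantryJobOperator(BaseOperator):
    """Launches an Airflow task as a Kubernetes Job through the Gantry API.

    Illustrative skeleton only: the real operator is internal to JW Player.
    """

    @apply_defaults
    def __init__(self, image, config_path, gantry_client=None, *args, **kwargs):
        super(GantryJobOperator, self).__init__(*args, **kwargs)
        self.image = image                  # DockerHub reference for the operator image
        self.config_path = config_path      # S3 path to the flavor's configs
        self.gantry_client = gantry_client  # hypothetical client for the Gantry API
```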

A single operator with a uniform interface means that users have to learn a lot less about Airflow to make a new DAG. Developers just have to worry about their job-specific code, not how multiple different Airflow operators work or how to fix them if things go wrong. They’re free to test their job code before even making a DAG. Check out Bluecore’s blog post for further explanation of why using a single operator is a great idea.

The Lifecycle of an Alpaca Job

Suppose we want to load data from AWS S3 into Snowflake, a cloud-based data warehouse. This would be a DAG composed of an S3 prefix sensor followed by our GantryJobOperator with the snowflake-loader image and the example_flavor flavor.

[Figure: An example Alpaca DAG for loading data from S3 into Snowflake]
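
In code, that DAG might look roughly like this; the bucket, prefix, image name, and import paths are all hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.s3_prefix_sensor import S3PrefixSensor

# Hypothetical import path for the internal operator sketched above.
from alpaca.operators import GantryJobOperator

with DAG(
    dag_id="snowflake_load_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@hourly",
) as dag:
    # Wait for the hour's files to land under the expected prefix.
    wait_for_files = S3PrefixSensor(
        task_id="wait_for_s3_files",
        bucket_name="example-bucket",
        prefix="pings/2020-04-22/",  # in practice, templated on the execution date
    )

    load_to_snowflake = GantryJobOperator(
        task_id="snowflake_loader",
        image="jwplayer/snowflake-loader",  # base operator image on DockerHub
        config_path="s3://example-bucket/flavors/example_flavor.yaml",
    )

    wait_for_files >> load_to_snowflake
```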

Once the Airflow Scheduler triggers the job, the S3 prefix sensor will begin to poll the S3 bucket. When the files land, the sensor will complete and trigger the GantryJobOperator (GJO). The GJO will (sketched in code after this list):

  • Make a request to the Gantry API to create a new Kubernetes job with the DockerHub location of the snowflake-loader image and the S3 path of the flavor’s configs.
  • Poll the Gantry API until the snowflake-loader job finishes.
  • Tail logs from the Kubernetes pod back to Airflow.
  • Retry or mark itself successful based on whether the snowflake-loader Kubernetes job exited successfully.
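
Those four steps map naturally onto the operator’s execute method. Continuing the hedged skeleton from earlier, and again with every gantry_client call invented for illustration:

```python
import time

from airflow.exceptions import AirflowException
from airflow.models import BaseOperator

class GantryJobOperator(BaseOperator):
    """Continuation of the earlier sketch; the Gantry client is hypothetical."""

    def execute(self, context):
        # 1. Ask Gantry to create the Kubernetes job from the image + configs.
        job_id = self.gantry_client.create_job(
            image=self.image, config_path=self.config_path
        )

        # 2. Poll the Gantry API until the job reaches a terminal state.
        status = self.gantry_client.get_status(job_id)
        while status not in ("SUCCEEDED", "FAILED"):
            time.sleep(30)
            status = self.gantry_client.get_status(job_id)

        # 3. Tail the pod's logs back into the Airflow task log.
        for line in self.gantry_client.get_logs(job_id):
            self.log.info(line)

        # 4. Raising here lets Airflow's normal retry machinery take over.
        if status == "FAILED":
            raise AirflowException("Gantry job %s failed" % job_id)
```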

The Results

  • Significantly faster development speed — We can create new jobs and datasets within a day, rather than weeks.
  • Parallelized development — Decoupled components and integration with Buildkite allows us to independently improve and deploy different parts of the system.
  • Improved observability and control — Airflow’s built-in metrics and web UI give us better insight into how our pipelines are behaving, and better tools to fix problems as they arise. This also leads to faster incident recovery.
  • Cross team standardization — Shared tooling is available via plugins which are upstream of Alpaca. This reduces duplicated effort and siloing of information. The Data Pipelines team uses tons of Airflow tools developed by the Data Science team.
  • Lower barrier to entry — Non-data engineers have been able to quickly and easily add new DAGs to Alpaca.

The Future: Tables as a service

With Alpaca we’ve built a platform on top of Airflow that allows us to quickly build new jobs from config-driven, batteries-included, custom operators. We didn’t build these with just the Data Pipelines team in mind, but with the goal that anyone should be able to create their own Airflow DAGs without specific data engineering knowledge. We built Flink as a Service at JW Player to let any user create real-time data streams using Apache Flink, and now we want to do the same with Airflow. While users have already created Alpaca DAGs on their own, we can go further and build an even easier-to-use platform on top of what we’ve already created.

Part 2

In the next post we’ll be taking a deeper dive into how our custom operators work, what they can do, and how we developed them to be easy for anyone to create new jobs.
