Apache Flink as a Service at JW Player

Steve Whelan
JW Player Engineering
6 min read · Jan 3, 2020

JW Player is the world’s largest network-independent platform for video delivery and intelligence. Our global footprint of over 1 billion unique users creates a powerful data graph of consumer insights and generates billions of incremental video views.

At JW Player, the Data Pipelines team’s mission is to collect, process, and surface this data. We then develop tools so that this data is easily accessible, scalable, and flexible for internal and external customers.

In this post, we will discuss the limitations of our batch pipeline and how the adoption of Apache Flink helped us overcome them.

How it Was

Traditionally, our data pipelines revolved around a series of cascading Apache Spark batch processing jobs. Over time, two pain points emerged:

  • The chaining of jobs introduced latency
  • Our orchestrator application grew so complex that no one outside the Data Pipelines team could use it.

Latency

We found that the optimal way to run these jobs was to chunk incoming data into 20-minute batches. Under normal conditions, data took about an hour to surface to our end users, both internal and external. Additionally, other datasets were only produced on a daily basis.

In many cases, this latency was acceptable. However, it was particularly problematic around releases. After a release, it could be an hour or longer before we surfaced the data points needed to validate the changes that went out.

Development

Our batch pipeline was built utilizing Spotify’s Luigi. Over time, we built large DAGs with complex fan out patterns, and as complexity grew, adding a new job to the platform became increasingly difficult. Writing jobs required detailed knowledge of the orchestrator, so much so that only members of the Data Pipelines team could do it. We were responsible for both maintaining the platform and creating the jobs running on it, meaning our team evolved into one giant bottleneck.

As we reflected on these pain points, we thought: there must be a better way.

How We Wanted It to Be

At JW Player, we make data-driven decisions. As a result, we are always collecting more data and offering aggregations across more dimensions. Facing the pain points above, we came to realize that it was not feasible for a single engineering team to be responsible for both a data processing platform and the jobs running on it. We needed to turn our data processing into a self-service model. We also wanted to offer our data at lower latency. For releases in particular, how could we evaluate changes within minutes instead of hours?

In designing a self-service data processing platform, we narrowed the requirements down to the following:

  1. Low barrier to entry — a user should have to write zero code to create a job
  2. SQL driven — since SQL proficiency is widespread at JW Player, users should be able to define their jobs using SQL queries
  3. Highly configurable — to support a wide range of teams with varying degrees of data engineering knowledge, the platform needed to be simple by default yet flexible enough for advanced users who want fine-grained control of their jobs
  4. User controlled — the user should have complete control over starting and stopping their job
  5. Low latency — the platform should stream data in real time

How it is Now

Our team already had a real-time platform built on Apache Storm, but due to stability issues and a complex development process, we did not iterate on it much. It ran a few legacy jobs that worked, and we left it alone. Given the declining activity of the Storm community, we decided it wasn't a platform we wanted to keep building on. We needed something new.

With those requirements in mind, we evaluated other streaming technologies, and Apache Flink stood out from the rest. It hit all of our requirements, including:

  • Realtime streaming execution
  • An extensible codebase enabling the creation of highly configurable abstraction layers
  • Support for standard ANSI SQL
  • Out of the box connectors for various sources/sinks
  • An active development community

The Self Service Model

We started designing the self-service platform with a single question: how will non-Flink developers create Flink jobs? For the platform to work, users had to be able to create a job without learning Flink's internals or reading through all of its documentation.

Luckily, Flink is very extensible. We were able to build a layer of abstraction on top of the framework that allows for dynamic configuration of sources, sinks, and serializers/deserializers. And Flink's support for ANSI SQL meant a user could define their job in terms of SQL rather than code.
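
To give a sense of what that abstraction layer sits on top of, here is a minimal sketch (not the platform's actual code) of running an ANSI SQL query against a stream with Flink's Table API, roughly as it looked in the 1.9 era:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class SqlJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Stand-in for a configured source; the real platform would wire up Kafka here.
        DataStream<Tuple2<String, Integer>> pings = env.fromElements(
                Tuple2.of("us-east", 1),
                Tuple2.of("eu-west", 0));

        // Register the stream as a table so it can be queried with plain SQL.
        tableEnv.registerDataStream("pings", pings, "region, errorCount");

        // The user-supplied SQL query.
        Table errorsByRegion = tableEnv.sqlQuery(
                "SELECT region, SUM(errorCount) AS errors FROM pings GROUP BY region");

        // Convert back to a stream and print; the real platform routes this to a configured sink.
        tableEnv.toRetractStream(errorsByRegion, Row.class).print();

        env.execute("sql-job-sketch");
    }
}

Roughly speaking, the abstraction layer builds everything around that sqlQuery call (the sources, sinks, and serializers) from the user's two files.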

To create a job, a user provides two files:

  • A yaml configuration file defining the sources and sinks
  • A file of SQL queries to execute

These files are currently submitted via a git repository. They are merged and deployed into our Flink as a Service platform, which is essentially a packaged jar application. We also built a simple REST API that lets the user start and stop their job, and the Flink jobs themselves are launched onto AWS EMR clusters.

A Use Case

Prior to the Flink as a Service platform, JW's Video Player team would analyze player data the day after a release to validate that the new code was behaving as expected. Following the launch of the platform, a member of the Player team built a job to aggregate our player data (which we call pings) in real time into a Datadog dashboard that the team can use to monitor the impact of player releases.

Within minutes of a release, the dashboard is populated with data produced by the new version. The team can spot spikes in error rates or player setup times across dimensions such as region, browser, or operating system. Given the numerous permutations of possible player setups and browser versions, testing every single one is not realistic. Being able to spot anomalies early helps the Player team home in on potential edge cases and resolve issues quickly.

The Job

The Player team's job configuration yaml defines three operators, each representing a SQL query. An operator defines the following:

  • source
  • deserializer
  • sink
  • serializer (if needed)

The job consumes an Avro-encoded Kafka topic, executes a SQL query on it, and stores the resulting datastream in what we call an "Internal Table." This allows intermediate results to be stored and then queried by downstream operators. The job then aggregates data from the Internal Table and produces metrics that are sent to Datadog. The DatadogAppendStreamTableSink is a custom sink written by the Data Pipelines team.
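
The exact configuration schema is internal to the platform, but a simplified, hypothetical sketch of a job along these lines might look like the following (keys and values are illustrative, not the platform's actual schema):

# Hypothetical sketch: keys and values are illustrative, not the platform's actual schema.
job_name: player-release-monitoring

operators:
  - name: parse_pings
    source: kafka                 # Avro-encoded player pings
    deserializer: avro
    sink: internal_table          # intermediate results, queryable by downstream operators
    query: parse_pings.sql

  - name: aggregate_errors
    source: internal_table
    sink: datadog                 # custom DatadogAppendStreamTableSink
    serializer: datadog_metrics
    query: aggregate_errors.sql

  - name: aggregate_setup_times
    source: internal_table
    sink: datadog
    serializer: datadog_metrics
    query: aggregate_setup_times.sql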

This yaml, along with the SQL queries, is all that’s needed to get the job off the ground. The user can define as many Operators with as many Sources and Sinks as they need.
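
For illustration, the accompanying SQL file might contain queries along these lines (table and column names are assumptions, not the real ping schema; the third operator's setup-time query would look similar to the second):

-- Hypothetical sketch: table and column names are illustrative.
-- Operator 1: filter raw pings for the new release into the Internal Table.
SELECT region, browser, os, player_version, error_code, setup_time_ms, event_time
FROM pings
WHERE player_version = '<new release version>';

-- Operator 2: aggregate error counts per minute for the Datadog dashboard.
-- Assumes event_time is configured as an event-time (rowtime) attribute.
SELECT
  region,
  browser,
  os,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  COUNT(error_code) AS errors
FROM internal_pings
GROUP BY
  region,
  browser,
  os,
  TUMBLE(event_time, INTERVAL '1' MINUTE);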

In order to give our users as much control as they want, there are over 100 configuration options they can use for their job. But for those less hands-on, over 75% have sensible default values. Additionally, we have containerized the whole platform so users can develop locally.

Since we created the platform, Flink has introduced a SQL client, which is still in beta as of v1.9. It is similarly driven by yaml configuration and is something we are looking to evaluate in the future.

Conclusion

Flink’s flexibility and active community made it the ideal solution for the problems outlined above, and it has helped us achieve our goals of accessible and scalable data. Now that we have the Flink as a Service platform, teams can author their own jobs and get real-time insights into their data in a way that was never before possible — a great step forward for the Data Pipelines team and JW Player as a whole.
