Keystone Real-time Stream Processing Platform

Keystone Stream Processing Platform is Netflix’s data backbone and an essential piece of infrastructure that enables engineering data-driven culture. While Keystone focuses on data analytics, it is worth mentioning there is another Netflix homegrown reactive stream processing platform called Mantis that targets operational use cases. We’ll discuss Mantis and its important role in the Netflix ecosystem in a future post.

Today, the Keystone platform offers two production services:

  • Data Pipeline: streaming enabled Routing Service and Kafka enabled Messaging Service, together is responsible for producing, collecting, processing, aggregating, and moving all microservice events in near real-time.
  • Stream Processing as a Service (SPaaS): enables users to build & operate custom managed stream processing applications, allowing them to focus on business application logic while platform provides the scale, operations, and domain expertise.

In this post, we’ll go over some of the challenges, design principles, our platform mindset, high level architecture, and finally our visions and core values the platform offers to Netflix.

Anatomy of a single streaming job:

…and the platform manages these jobs:


Challenges

1. Scale

Netflix services 130 million subscribers from 190+ countries. The streaming platform processes trillions of events and petabytes worth of data per day to support day to day business needs. This platform is expected to scale out as subscribers continues to grow.

2. Diverse Use-cases

Keystone Routing Service: this service is responsible for routing any events to managed sink per user configuration. Each delivery route is realized by an embarrassingly parallel stream processing job. Users may define optional filter and/or projection aggregations. Events are eventually delivered to a storage sink for further batch/stream processing with at-least-once delivery semantics. Users may choose different latency and duplicate tradeoffs.

Stream Processing as a Service: SPaaS platform has only been in production for about a year, yet we have seen tremendous engineering interests, as well as a wide variety of requirements. Below is a summary of some common asks and tradeoffs.

  • Job State: ranging from complete stateless parallel processing to jobs requiring 10s of TB large local states.
  • Job Complexity: ranging from embarrassingly parallel jobs with all operators chained together to very complex job DAG with multiple shuffling stages and complex sessionization logic.
  • Windows/Sessions: window size ranging from within a few second (i.e. to capture transaction start/end event) to hours long custom user behavior session windows.
  • Traffic pattern: traffic pattern varies significantly depending on each use case. Traffic can be bursty or consistent at GB/sec level.
  • Failure recovery: some use cases require low failure recovery latency at seconds level, this becomes much more challenging when job both holds large state and involves shuffling.
  • Backfill & rewind: some jobs require replay of data either from a batch source or rewind from a previous checkpoint.
  • Resource contention: jobs may be bottlenecked on any physical resources: CPU, network bandwidth, or memory, etc. User relies on the platform to provide insights and guidance to tune application performance.
  • Duplicates vs latency: application may have different tradeoff preference in terms of duplicates vs latency.
  • Ordering of events: most use cases do not rely on strict ordering assumptions, however some do require it.
  • Delivery/Processing semantics: some use case is ok with losing some events in the pipeline, while other ones may require much higher durability guarantees. Some stateful streaming job also expects exactly-once processing guarantee where the computed states should always be consistent.
  • Audience: our user ranges from very technical distributed system engineers to casual business analysts. Some team may also choose to build a domain specific platform service on our platform offerings.

3. Multi-tenancy

Keystone supports thousands of streaming jobs, targeting wide problem spaces ranging from data delivery, data analytics, all the way to enabling microservices architectural patterns. Due to the diverse nature of the the streaming jobs, in order to provide meaningful service level guarantees to each user, the infrastructure need to provide runtime & operational isolation, while at the same time, minimizing shared platform overhead.

4. Elasticity

Although majority of the streams have fixed traffic pattern, we have to design the system to prepare for sudden changes (i.e. spikes due to a popular show coming online or unexpected failure scenarios), and be able to adapt and react to them in an automated fashion.

5. Cloud Native Resiliency

Netflix operates its microservices fully in the cloud. Due to the elastic, constant changing, higher failure probability characteristics of the cloud. We need to design the system to be able to monitor, detect and tolerate failures all the way from network blips, instance failure, zone failure, cluster failure, inter-service congestion/backpressure, to regional disaster failures, etc.

6. Operation overhead

The platform currently services thousands of routing jobs and streaming applications. It’s cost prohibitive to rely on platform team to manually manage all of the streams. Instead, the user should be responsible to declare the lifecycle details of the jobs, and the infrastructure should automate as much as possible.

7. Agility

We’d like to be able to develop and deploy changes quickly, multiple times a day. We’d also like to allow our users to confidently use the service with the same level of agility.


Platform Mindset & Design Principles

1. Enablement

One of the primary goals of the platform is to enable other teams to focus on business logic, making experimentation, implementation, operation of stream processing jobs easy. By having a platform to abstract the “hard stuff”, removing complexities away from users, this would unleash broader team agility and product innovations.

On a high level, we strive to enable our user to:

  • Quickly discover and experiment with data, enable data driven innovations to drive product
  • Fast prototyping of stream processing solutions
  • Productionize and operationalize services with confidence
  • Gain insight into performance, cost, job lifecycle states etc to be able to make informed local decisions
  • Provide tooling to enable users to self-serve

2. Building Blocks

To enable user to focus on business logic without having to worry about the complexity involved in a distributed system or mundane details of some pre-existed solution, it is our goal to provide a rich set of composable operators that can be easily plugged into a streaming job DAG.

Furthermore, streaming jobs themselves can become building blocks for other downstream services as well. We work with some of our partner teams to build “Managed Datasets” and other domain specific platforms.

From our platform downwards, we also strive to integrate deeply with Netflix software ecosystem by leveraging other building blocks such as container runtime services, platform dynamic configuration, common injection framework, etc. This does not just help us to build a service based on other existing solutions, it also make development & operation environment familiar to our users.

3. Tuneable Tradeoffs

Any complex distributed system inherently comes with certain limitations, thus designings of such system should take considerations of various tradeoffs, i.e. latency vs duplicates, consistency vs availability, strict ordering vs random ordering etc. Certain use cases may require different combinations of these tradeoffs, so it’s essential that platform should expose the knobs and allow individual user to customize and declare the needs to the system.

4. Failure as a First Class Citizen

Failure is a norm in any large scale distributed system, especially in the cloud environment. Any properly designed cloud-native system should treat failures as a first class citizen.

Here are some important aspects that impacted our design:

  • Assume unreliable network
  • Trust underlying runtime infrastructure, but design automatic healing capabilities
  • Enforce job level isolation for multi-tenants support
  • Reduce blast radius when failure arise
  • Design for automatic reconciliation if any components drifts from desired state or even if disaster failure occurs
  • Handle & propagate back pressure correctly

5. Separation of Concerns

Between users and platform: user should be able to declare the “goal state” via platform UI or API. The goal states are stored in a single source of truth store, the actual execution to move from “current state” towards the “goal state” is handled by the platform workflow without interaction with users.

Between control plane and data plane: Control plane is responsible for workflow orchestration/coordination while data plane does the heavy lifting to make sure things happens and stay in desired state.

Between different subcomponents: Each component is responsible for their own work and states. Each component lifecycle is independent.

Runtime infrastructure: stream processing jobs are deployed on open sourced Netflix Titus Container runtime service, this service provides provisioning, scheduling, resource level isolations (CPU, Network, Memory), advanced networking etc.


Our Approach

With considerations to aforementioned challenges and design principles, we closed on a declarative reconciliation architecture to drive a self-servable platform. On a high level, this architecture allows user to come to the UI to declare desired job attributes, the platform will orchestrate and coordinate subservices to ensure goal states are met as quickly as possible, even in face of failures.

This following section covers the high level architecture and lightly touches various areas of the design. We’ll share more in depth technical details and use cases in future follow up posts.

1. Declarative Reconciliation

The declarative reconciliation protocol is used across the entire architectural stack, from control plane to data plane. The logical conclusion for taking advantage of this protocol is to store a single copy of user declared goal states as durable source of truth, where all other services will reconcile from. When state conflict arises, either due to transient failures or normal user trigger actions, the source of truth should always be treated as authoritative, all other versions of the states should be considered as the current view of the world. The entire system is expected to eventually reconcile towards the source of truth.

Source of Truth Store is a durable, persistent storage that keeps all the desired state information. We currently use AWS RDS. It is the single source of truth for the entire system. For example, if a Kafka cluster blows away because of corrupted ZK states, we can always recreate the entire cluster solely based off the source of truth. Same principles apply to the stream processing layer, to correct any processing layer’s current states that deviates from its desired goal states. This makes continuous self healing, and automated operations possible.

Another advantage we can take from this protocol design is that operations are encouraged to be idempotent. This means control instructions passed from user to control plane and then to the job cluster, inevitable failure conditions will not result in prolonged adversary effect. The services would just eventually reconcile on its own. This also in term brings operational agility.

2. Deployment Orchestration

Control plane facilitates orchestration workflow through interactions with Netflix internal continuous deployment engine Spinnaker. Spinnaker internally abstracts integration with Titus container runtime, which would allow control plane to orchestrates deployment with different tradeoffs.

A flink cluster is composed of job managers and task managers. Today, we enforce complete job instance level isolation by creating independent Flink cluster for each job. The only shared service is ZooKeeper for consensus coordination and S3 backend for storing checkpoint states.

During redeployment, stateless application may choose between latency or duplicate trade-offs, corresponding deployment workflow will be used to satisfy the requirement. For stateful application user can choose to resume from a checkpoint/savepoint or start from fresh state.

3. Self-service Tooling

For routing jobs: through self service, a user can request a stream to produce events to, optionally declare filtering / projection and then route events to managed sink, such as Elasticsearch, Hive or made available for downstream real-time consuming. Self service UI is able to take these inputs from user and translate into concrete eventual desired system states. This allows us to build a decoupled orchestration layer that drives the goal states, it also allows us to abstract out certain information that user may not care, for example which Kafka cluster to produce to, or certain container configurations, and gives us the flexibility when it’s needed.

For custom SPaaS jobs, we provide command line tooling to generate flink code template repository and CI integration etc.

Once user customizes and checks in the code, the CI automation will be kicked off to build docker image, register the image and configurations with platform backend, and allow user to perform deployment and other administrative operations.

4. Stream Processing Engines

We are currently focusing on leveraging Apache Flink and build an ecosystem around it for Keystone analytic use cases. Moving forward, we have plans to integrate and extend Mantis stream processing engine for operational use cases.

5. Connectors, Managed Operators and Application Abstraction

To help our users to increase development agility and innovations, we offer a full range of abstractions that includes managed connectors, operators for users to plug in to the processing DAG, as well as integration with various platform services.

We provide managed connectors to Kafka, Elasticsearch, Hive, etc. The connectors abstract away underlying complexity around custom wire format, serialization (so we can keep track of different format of payload to optimize on storage and transport), batching/throttling behaviors, and is easy to plug into processing DAG. We also provide dynamic source/sink operator that allows user to switch between different sources or sinks at runtime without having to rebuild.

Other managed operators includes filter, projector, data hygiene with easy to understand custom DSL. We continue to work with our users to contribute proven operators to the collection and make them accessible to more teams.

6. Configuration & Immutable Deployment

Multi-tenancy configuration management is challenging. We want to make configuration experience dynamic (so users do not have to rebuild/reship code), and at the same time easily manageable.

Both default managed and user defined configurations are stored along with application properties files, we’ve done the plumbing to allow these configurations to be overriable by environment variable and can be further overridden through self-service UI. This approach fits with the reconciliation architecture, which allows user to come to our UI to declare the intended configs and deployment orchestration will ensure eventual consistency at runtime.

7. Self-healing

Failures are inevitable in distributed systems. We fully expect it can happen at any time, and designed our system to self heal so we don’t have to be woken up in the middle of night for incident mitigations.

Architecturally, platform component services are isolated to reduce blast radius when failure arises. The reconciliation architecture also ensures system level self-recovery by continuous reconciling away from drift behavior.

On individual job level, the same isolation pattern is followed to reduce failure impact. However, to deal and recover from such failures, each managed streaming job comes with a health monitor. The health monitor is an internal component runs on in Flink cluster which is responsible for detecting failure scenarios and perform self-healing:

  • Cluster Task Manager drift: if Flink’s view of the container resources persistently unmatched with container runtime’s view. The drift will be automatically corrected by proactive termination of affected containers.
  • Stall Job Manager leader: if leader fails to be elected, the cluster becomes brainless. Corrective action will be performed on the job manager.
  • Unstable container resources: if certain task manager shows unstable pattern such as periodical restart/failure, it will be replaced.
  • Network partition: if any container experiences network connectivity issues, it will be automatically terminated.

8. Backfill & Rewind

Again, failures are inevitable, sometimes user may be required to backfill or rewind the processing job.

For source data that is backed up into data warehouse, we have built functionality into the platform to allow dynamically switching source without having to modify and rebuild code. This approach comes with certain limitations and is only recommended for stateless jobs.

Alternatively, user can choose to rewind processing to a previous automatically taken checkpoint.

9. Monitoring & Alerting

All individual streaming jobs comes with a personalized monitor and alert dashboard. This helps both platform/infrastructure team and application team to diagnose and monitor for issues.

10. Reliability & Testing

As platform and underlying infrastructure services innovate to provide new features and improvements, the pressure to quickly adopt the changes comes from bottom up (architecturally).

As applications being developed and productionized, the pressure for reliability comes from top down.

The pressure meets in the middle. In order for us to provide and gain trust, we need to enable both platform and users to efficiently test the entire stack.

We are big believers in making unit tests, integration tests, operational canary and data parity canary accessible for all our users, and easy to adopt for the stream processing paradigm. We are making progress on this front, and still seeing lots of challenges to solve.


Now and Future

In the past year and half, the Keystone stream processing platform has proven itself beyond the trillion events per day scale. Our partner teams have built and productionized various analytical streaming use cases. Furthermore, we are starting to see higher level platforms being built on top.

However, our story does not end here. We still have a long journey ahead of us to fulfill our platform vision. Below are some of the interesting items we are looking into:

  • Schema
  • Service layer to enable more flexible platform interaction
  • Provide streaming SQL and other higher level abstractions to unlock values for different audiences
  • Analytics & Machine Learning use cases
  • Microservices event sourcing architectural pattern

This post presented a high level view of Keystone platform. In the future, we will follow up with more detailed drill downs into use cases, component features, and implementations. Please stay tuned.