Bridging the gap between Kubernetes clusters and Data Science use cases.
Joint work with:
(Ankit Shah, Gal Sivan, Brian Wu, Chinmay Nerurkar and Moussa Taifi)
Scheduling and managing batch jobs is a central competency for Data Science (DS) teams, and an even more crucial part of a data scientist's toolkit in the ad tech space. The dynamic nature of ad tech brings growing challenges and opportunities: according to PwC research, digital ad spend grew 23% year over year to $50 billion in the first half of 2018. To take advantage of this growth, DS teams need flexible tools. Most machine learning products include long-running data pipelines that transform data, build models, and upload them to a machine learning (ML) serving layer, then close the product loop with monitoring and constant product iteration.
In this post we present our k8s-workqueue system, a pluggable scheduling mechanism for Kubernetes workloads. This system combines familiar features of traditional VM cron jobs and standalone Docker containers with the newer Kubernetes API. Its simplicity gives DS teams an easy transition path from research pipeline, to prototype, to production workload.
With the adoption of Kubernetes as the leading container management platform, many efforts around recurrent scheduled jobs have emerged in the industry. As many engineering and data science organizations have discovered, no single scheduler fits every use case. The current spectrum includes a set of open source projects that aim to fill the gaps in Kubernetes scheduling.
On the far left side of figure 1 we find projects such as Argo and Brigade, Kubernetes-centric tools built specifically to run on k8s clusters. Moving to the right on that spectrum are projects such as Luigi and Airflow, the latter of which was recently adapted to run on and integrate with k8s concepts and resources.
Moving right once more, we have the native Kubernetes CronJob, which provides a way to trigger jobs on a schedule using k8s-native techniques, but without the functionality of a full workflow management framework (WMF). All of these projects are used successfully in production at various companies and organizations.
On the far right we find the workhorse of data pipeline prototyping: the VM cron job, which launches processes locally on a machine and can be used successfully to schedule jobs. However, these methods do not use the underlying resources efficiently in a multi-use-case scenario.
The contrast between the two ends of the spectrum becomes considerable when moving a job from a prototype running on a data scientist's development machine to a production setting.
This is the central motivation for building a middle-ground scheduling system that sits between the data-engineering-heavy frameworks and the free-form cron job methods.
In this blog post we present k8s-workqueue, a custom system we built to bridge these two worlds.
Target Use Cases and SLAs
The targeted use cases for this system percolated from the specific needs of our Data Science team, which requires a high level of flexibility in application functionality, as well as access to shared compute resources, data sources, and monitoring systems.
The kind of use cases we target with the k8s-workqueue system are the following:
- The final head of upstream data pipelines managed by other, more reliable data tools. This class of jobs includes simple load-and-dump data jobs for data science prototype development, as well as pre-prod pipelines built for internal reporting, dashboards, and reports derived from upstream core pipeline data sets.
- Production-level medium- to low-frequency ML training and model deployment. This category requires an approximately hourly, daily, or weekly model training/update frequency. These jobs benefit from job status logging and the provided monitoring backends, which allow the jobs to publish metrics (e.g. number of models updated, records read, etc.) and to build simple alerting pipelines around model freshness and relevance.
- A/B tests for prototypes to compare model variations in production. This job family requires full flexibility in the functionality of the job itself, because it needs to manage a set of internal code paths to create and upload different model variants while sharing much of the data pre-processing tooling included in the job.
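As a sketch of the kind of metric publishing these jobs do, the snippet below formats a couple of job metrics in the Prometheus text exposition format and POSTs them to a pushgateway. The gateway URL, metric names, and helper names are illustrative, not part of our actual system:

```python
import urllib.request

def build_payload(metrics):
    """Render a dict of metric name -> value in the Prometheus
    text exposition format expected by a pushgateway."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

def push_metrics(gateway, job_id, metrics):
    """POST the rendered metrics to a pushgateway (hypothetical URL)."""
    url = f"{gateway}/metrics/job/{job_id}"
    req = urllib.request.Request(
        url,
        data=build_payload(metrics).encode(),
        method="POST",
        headers={"Content-Type": "text/plain"},
    )
    urllib.request.urlopen(req)

# Example: push_metrics("http://pushgateway.example:9091", "ctr-train",
#                       {"models_updated": 3, "records_read": 120000})
```

A job publishing `models_updated` this way can then back a freshness alert with a single dashboard query.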
SLAs and Features
From an SLA perspective, we make sure to explain to users that this scheduling system tries to mimic the Linux cron system as closely as possible.
The good part of this strategy is that many members of the DS community are deeply familiar with cron concepts. The bad part is that users need to design and build their apps so that they either avoid or implement features such as job retries, job chaining, conditional workflow execution, and data source sensing themselves.
The k8s-workqueue Job Boundary Definition
In contrast to existing methods for managing jobs on Kubernetes, we opted to operate in a space that keeps the benefits of k8s in terms of resource and namespace isolation while retaining the flexibility and familiarity of setting up a cron job on a VM.
The central API boundary we ended up focusing on is the job schedule tuple composed of (Job ID, Job Docker Image Tag, CPU Request, Memory Request, Cron Schedule), as shown in Figure 2.
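As an illustration, such a tuple could be expressed as a plain dictionary with a minimal validation step. The field names and values here are hypothetical, not our actual schema:

```python
# A hypothetical job schedule tuple, expressed as a dict.
job_schedule = {
    "job_id": "ctr-model-train",                      # unique job identifier
    "image_tag": "registry.example.com/ds/ctr:1.4.2",  # Docker image tag
    "cpu_request": "2",                                # Kubernetes CPU units
    "memory_request": "4Gi",
    "cron_schedule": "0 */6 * * *",                    # every six hours
}

REQUIRED_KEYS = {"job_id", "image_tag", "cpu_request",
                 "memory_request", "cron_schedule"}

def validate_schedule(schedule):
    """Check that the tuple has all five fields and a five-field
    cron expression, the only contract the system asks for."""
    missing = REQUIRED_KEYS - schedule.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if len(schedule["cron_schedule"].split()) != 5:
        raise ValueError("cron schedule must have 5 fields")
    return True
```

Keeping the boundary this small is what makes the tuple usable unchanged across laptop, dev cluster, and prod cluster.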
Backend System Architecture and Developer/Data Scientist Workflow
As an overview of the k8s-workqueue system we provide a brief description of the backend that powers this system. The overall system is divided into 5 components:
- Docker Continuous Integration (CI) pipelines management
- App Scheduling/Triggering
- Kubernetes Cluster Resources
- Job Status Monitoring
- Logging Channels
In this blog post, we focus on the k8s-workqueue specific components (in blue in the above diagram) which are the backbone of this project: The App Scheduling/Triggering mechanisms and the Job status monitoring.
As shown in figure 3, k8s-workqueue is but a small part of a larger system. The workqueue sits on top of a shared engineering platform developed and maintained by a large number of internal teams, including the SysOps, DevOps, and DBA teams. This extensibility of the platform allowed us to move quickly when building the workqueue component for our DS team.
App scheduling mechanism
- Workqueue Handler WebService and Workqueue State DB: A job webservice to register and update/deploy jobs, with an associated DB backend to keep the state of job schedules.
- ETL-Timer: A timer process to collect the eligible jobs and publish their job schedule information to a job messaging queue.
- Job Broker: A job broker that continuously consumes messages from the job message queue and generates Kubernetes “Run to Completion” Jobs from the job schedule information found in each message. The programmatic interface of the Kubernetes API allows the job broker to configure and generate k8s objects in a scalable and flexible way.
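To make the broker's translation step concrete, here is a minimal sketch of how a job schedule tuple could become a batch/v1 “run to completion” Job manifest. Our broker talks to the Kubernetes API programmatically; this sketch builds the equivalent manifest as a plain dict, with assumed field names for the tuple:

```python
def build_job_manifest(schedule, run_id):
    """Translate a job schedule tuple into a Kubernetes Job manifest
    (plain dict; field names follow the batch/v1 API)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{schedule['job_id']}-{run_id}"},
        "spec": {
            # Mimic cron semantics: no framework-level retries.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": schedule["job_id"],
                        "image": schedule["image_tag"],
                        "resources": {
                            "requests": {
                                "cpu": schedule["cpu_request"],
                                "memory": schedule["memory_request"],
                            },
                        },
                    }],
                },
            },
        },
    }
```

Everything the broker needs is already in the tuple; the run ID only disambiguates repeated triggers of the same job.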
Job monitoring component
- Workqueue Watcher and Job Statuses DB: A passive workqueue job watcher that listens to the k8s API stream for any pod/container lifecycle state change in a specific k8s namespace. This process records any pod state change to a job DB as well as to a Prometheus gateway. Users can subsequently build dashboards and alerts from these two sources.
These components all run inside a k8s cluster as long-running deployments with full production support.
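To illustrate the watcher's job, the sketch below reduces a pod lifecycle event, as delivered by the k8s watch API but shown here as a plain dict, to the status row that would be recorded in the job DB. The `job-id` label and the row fields are assumptions for illustration:

```python
import datetime

# Terminal pod phases in the Kubernetes pod lifecycle.
TERMINAL_PHASES = {"Succeeded", "Failed"}

def record_from_event(event):
    """Turn one pod lifecycle event into a status row suitable for
    the job DB (and for publishing to a Prometheus gateway)."""
    pod = event["object"]
    phase = pod["status"]["phase"]
    return {
        "job_id": pod["metadata"]["labels"].get("job-id", "unknown"),
        "pod_name": pod["metadata"]["name"],
        "phase": phase,
        "terminal": phase in TERMINAL_PHASES,
        "observed_at": datetime.datetime.utcnow().isoformat(),
    }
```

Because the watcher is passive, it can crash and restart without affecting running jobs; at worst a few state transitions are recorded late.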
Developer and Data Scientist Workflow
From a developer perspective, we aimed to simplify the development, testing, deployment and monitoring to a minimal set of possible workflows.
As shown in figure 3 the interaction is composed of the following:
- The developer is able to fully test their application on their developer machines/laptops with the help of Docker.
- When a new feature is ready, the code can be pushed to a repository, which triggers a build and a full run of the unit tests (Step A). If the push is a merge to the master branch, a production image is produced and uploaded to our internal image repository.
- For scalability tests, the job owners can manually trigger their jobs on a development cluster that mirrors the production cluster in terms of workqueue components (Step B).
- Once the job is in a production state, the owner can create or update the job schedule tuple composed exclusively of the (Job ID, Job image tag, CPU request, Memory request, Cron schedule) (Step C).
- Both during the dev and prod job runs the job owner has access to short term and long term logs/metrics through easy to use endpoints (Step D).
- Finally, the job owner creates a set of self-service dashboards and alerts to monitor their jobs based on their jobs’ metrics (Steps E.1, E.2).
One of the most powerful features of this workflow is that during dev/scalability tests the job owner can replicate the production run configuration with high fidelity. We explicitly constrain the dev/scalability job triggering mechanism to use the same job schedule tuple with no additional modifications.
This lets the developer replicate behavior across their laptop, the dev cluster, and the prod cluster without adapting the configuration during the job's promotion, and without pulling in external dependencies that are not part of the core goals of the job.
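As a sketch of what this parity looks like in practice, the helper below derives a local `docker run` command from the same job schedule tuple used in production. The mapping from k8s resource units to Docker's `--cpus`/`--memory` flags is a simplification, and the helper itself is illustrative:

```python
def local_run_command(schedule):
    """Build a `docker run` argv that mirrors the production job's
    resource requests, so a laptop run exercises the same tuple."""
    return [
        "docker", "run", "--rm",
        "--cpus", schedule["cpu_request"],
        # Simplified unit mapping: "4Gi" -> "4g", "512Mi" -> "512m".
        "--memory", schedule["memory_request"].lower().rstrip("i"),
        schedule["image_tag"],
    ]
```

The point is not the flag translation itself but that no field outside the tuple is consulted: whatever runs locally is exactly what the broker will launch in the cluster.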
5 Lessons Learned
1. Dev-Prod Parity
- Providing a method to developers/data scientists to exactly replicate a job on both their laptops, a dev cluster and a prod cluster, with no additional assumptions, is a big win.
- Docker allows this dev-prod parity to happen on VMs already to some extent, but moving workloads to k8s clusters usually introduces subtle assumptions about the runtime.
- The k8s cluster components can introduce new concepts that cannot be easily replicated on a developer machine. Minikube is one way to replicate a full k8s cluster locally, but the use cases we were targeting did not warrant full local cluster mirroring.
- We explored the opposite design direction, where we constrained the interface that users have when interacting with the cluster.
- Going through the workqueue system dramatically reduces the number of knobs the Dev/DS member has to customize a job, because those knobs represent a subset of the intersection of the three runtimes, as shown in figure 4.
2. De-Coupling Apps from Scheduling Frameworks
- One benefit of the previous Dev-Prod Parity principle is that it led us to decouple apps from the scheduling framework (i.e. the k8s clusters and other WMFs' APIs).
- Contrary to some of the other high quality and fully featured WMFs, which provide job management tools such as retries, job dependency management and conditional workflow selection, the k8s-workqueue system does not attempt to provide these features.
- The benefit of not providing these features and sticking with a pure “launch and forget” method is that developers do not need to learn the new sets of Operators, Sensors, and custom DSLs (domain specific languages) that some of the more featured WMFs provide.
- If a use case needs to handle these scenarios, the target application is built with these concerns in mind, which lets the developer test the scenarios fully and locally on their dev system without inheriting from additional external frameworks in their application.
- This can be seen as the good ol’ composition over inheritance method to provide pluggable job scheduling without deep-coupling of external frameworks.
- This method of course has drawbacks due to the unsupported features. When apps outgrow their k8s-workqueue runtime, we redirect them to more fully featured systems that can handle the SLA levels these applications require.
- This allowed us to scale the number of apps that can share the underlying k8s compute infrastructure, while servicing a range of different DS/ML use cases.
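To illustrate the composition approach, a simple in-app retry wrapper might look like the following. Because it is plain application code with no framework dependency, it can be unit-tested locally; the names and defaults are illustrative:

```python
import time

def with_retries(fn, attempts=3, backoff_seconds=0):
    """Wrap one step of the app in retry logic. Retries live inside
    the app (composition), not in the scheduler, so they behave the
    same on a laptop, the dev cluster, and the prod cluster."""
    def wrapped(*args, **kwargs):
        last_error = None
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception as err:
                last_error = err
                # Linear backoff between attempts.
                time.sleep(backoff_seconds * (attempt + 1))
        raise last_error
    return wrapped
```

A flaky data-source read, for example, becomes `with_retries(read_source, attempts=3)()`, with no Operator or DSL involved.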
3. Avoid Tooling Viscosity
- One of the concerns we faced when attempting to use other more featured WMFs is the viscosity (aka: “thick, sticky, and semi-fluid”) of the data job testing process.
- By providing a minimal interface to the workqueue system the developer/DS member is free to include/exclude components as they see fit.
- Instead of making integration tests rely on other components that manage the workflow, the developer can fully exercise simple scenarios on their dev machine without touching multiple systems, with the net effect of increasing the fluidity of the development process.
4. Not All Use Cases Can Be k8s-workqueue Jobs
- By explaining the limitations of the cron method, we were able to determine which apps can live with this flexibility-reliability tradeoff.
- The velocity of some projects has increased greatly due to the independence of code paths, when the SLA requirements fit the k8s-workqueue guarantees.
- Other jobs that need a higher level of support are redirected to more fully featured internal WMF systems.
5. Simplicity Drives Adoption
- The main goal of this system was to simplify the process of bringing prototypes into production. This allowed us to train and rapidly onboard non-engineering team members on cutting-edge production technology and tools.
- The minimal interface between the developer and the scheduling backend/cluster made it very intuitive for the users to onboard.
- The ability to incrementally include pieces of the system, by being able to fully test all their pipeline components on their laptops/dev machines, encouraged experimentation and adoption of this system’s functionality.
In this blog post we presented our k8s-workqueue system, a pluggable scheduling mechanism on Kubernetes clusters that combines features of traditional VMs, standalone Docker containers, and the Kubernetes API. The simplicity of this system has provided our Data Science team members with an easy transition path from research to prototype to production workloads. We hope that other teams, in the growing Kubernetes open source community, will be able to integrate some of the lessons we learnt along the way in their own data science platforms. There you have it, thanks for reading.
Parts of this work were presented at PAPIs Latam 2019.
We would like to thank these people that helped us make this project a success: Andrew Davidoff, Ashutosh Sanzgiri, Chinmay Nerurkar, Rebecca Conneely, Mark Ha, Donny Yung, Weston Jackson, Ron Lissack.
We are Hiring!
If these challenges interest you, checkout our open roles: https://xandr.att.jobs/job/new-york/data-science-platform-engineer/25348/12859712