Telemetry-Aware Scheduling: Enhancing Kubernetes Scheduling Capabilities
Story Update: In November 2019, the Telemetry Aware Scheduling project was open sourced, enabling intelligent, per-workload pod placement driven by timely platform metrics. Learn more & join us!
***
Containers are increasingly being used for far more demanding workloads, such as IoT, artificial intelligence (AI), machine learning, software defined networking (SDN), and network function virtualization (NFV). In turn, container orchestration tools like Kubernetes must evolve to keep pace, ensuring maximum utilization of computing resources. Swati Sehgal, Cloud Software Engineer at Intel, shares her path into open source, and her collaboration with the cloud native community around telemetry-aware scheduling to address these mounting requirements.
Tell us a little bit about yourself and your journey into tech.
Computer science sparked my interest when I was in school, where I learned C++ as my first programming language. I decided to pursue my Bachelor of Technology in Computer Science at ITM University in India. Three years into my studies, I earned an honors scholarship to study Computer Engineering in Ireland, where I was introduced to embedded systems, distributed systems, wireless communications, and networking. My final-year project, an Android-operated remote webcam, is still one of my proudest achievements to date. From there, I pursued my Master's in Computer Science at Trinity College Dublin, and then joined Intel as a Cloud Software Engineer on the Cloud Native Orchestration team four years ago, based in Shannon, Ireland.
Can you tell us more about your team, and its focuses?
My team and I deliver cloud native orchestration solutions for customers across networking, compute, and telemetry use cases, demonstrating Intel® Architecture capabilities across platforms. Our overall goal is to enhance Kubernetes for network function virtualization (NFV) workloads, since Kubernetes lacks certain features required to support them.
What excites you about the world of Kubernetes?
Kubernetes is one of the largest, fastest-growing open source projects with widespread adoption among start-ups, platform vendors and enterprises as well as commercial offerings. Every major infrastructure provider is either already supporting it, or working to support it. It’s great to be part of this ecosystem as it allows me to have widespread impact and influence.
What kinds of projects are you currently focused on?
I've worked on a number of projects to address the networking gaps in Kubernetes, such as Node Feature Discovery, which exposes hardware capabilities like SR-IOV and boot card in a Kubernetes cluster, and resource management features like huge pages and an Intel® QuickAssist Technology (QAT) device plug-in.
I’m currently looking into advanced scheduling capabilities in Kubernetes — specifically, telemetry-aware scheduling.
What is the aim of telemetry-aware scheduling? What is it trying to achieve?
Containers are becoming the desired deployment model for a wide range of workloads such as big data, IoT, artificial intelligence (AI), machine learning, software defined networking (SDN), and network function virtualization (NFV). As a result, container orchestration tools like Kubernetes need to evolve to meet these stringent networking and resource management requirements, particularly the need for optimum utilization of compute, network, and storage. Automation based on policies, along with predictive and reactive closed loops, is therefore becoming indispensable in the orchestration layer: it is critical to achieving operational efficiency, better service performance, reductions in capital and operating expenditures (CapEx and OpEx), and improved customer experience.
The scheduler is a key component in Kubernetes: it is responsible for placing each pod on a suitable node in the cluster. It tracks the resources requested by pods and tries to spread pods uniformly across nodes. However, it does not balance actual node resource utilization, and it provides no mechanism for taking cluster telemetry into consideration when making scheduling decisions. Telemetry-aware scheduling extends Kubernetes scheduling to perform such automated actions and to place workloads intelligently by becoming aware of telemetry and workload SLAs. Combined with autoscaling, this sort of decision-making could make the disruptive activity of adding and removing nodes from a cluster less costly.
The project is implemented as a scheduler extender, extending the Kubernetes scheduling capability to take telemetry into consideration. We were approached by a customer who was interested in scheduling based on specific networking telemetry. In response, we started looking into potential solutions with a focus on solving the problem more broadly rather than focusing solely on network telemetry; it's important that the capabilities we work on apply to a variety of use cases, not just one highly specific one.
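To make the extender idea concrete: the scheduler calls out to the extender over HTTP with the pod and its candidate nodes, and the extender answers with the subset of nodes that pass its checks, plus reasons for the ones that fail. The sketch below shows only that filtering step, with a hypothetical metric name and threshold; it is an illustration of the pattern, not TAS's actual policy format.

```python
# Sketch of a scheduler-extender "filter" step: given candidate nodes and a
# snapshot of per-node telemetry, keep only the nodes whose metric stays
# under a policy threshold. The metric name "health_metric" and the
# threshold of 50 are illustrative assumptions.

def filter_nodes(candidate_nodes, telemetry, metric="health_metric", threshold=50):
    """Return (passing_nodes, failed_nodes), the shape an extender would
    echo back to the scheduler as an ExtenderFilterResult."""
    passing, failed = [], {}
    for node in candidate_nodes:
        value = telemetry.get(node, {}).get(metric)
        if value is None:
            failed[node] = "no telemetry for %s" % metric
        elif value >= threshold:
            failed[node] = "%s=%s exceeds threshold %s" % (metric, value, threshold)
        else:
            passing.append(node)
    return passing, failed
```

With telemetry reporting `{"node-a": {"health_metric": 10}, "node-b": {"health_metric": 90}}`, only `node-a` survives the filter; `node-b` exceeds the threshold and a node with no telemetry is rejected outright.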
Can you talk about some of these use cases?
Telemetry-aware scheduling can be used for closed-loop automation and service assurance: it lets the system change its behavior in response to telemetry output. For example, say a node's hardware is reporting a memory fault. A scheduler with this extender can steer pods away from that node, forming a closed loop that lets the cluster react quickly to telemetry changes. Similarly, for service level agreements (SLAs), the scheduler can inspect SLA-related metrics on a particular node or group of nodes and ensure that the workload is scheduled on a node that meets the terms of the agreement.
Another use case is resource utilization and management. Overutilized node resources can cause instability in a system, while underutilization and inefficiency increase hardware cost. Telemetry-aware scheduling can judge the utilization of each node in a cluster, using a definition of utilization that is meaningful to your workload, and steer placement away from both extremes.
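One way to picture that balancing act is as a node scoring function that peaks when utilization sits near a target band and falls off toward both extremes. The target value and 0-10 scale below are illustrative assumptions, not part of the project:

```python
def score_node(cpu_util, target=0.6):
    """Score a node from 0 to 10 for placement priority: highest when
    utilization sits near the target, penalizing both overutilized nodes
    (instability risk) and underutilized ones (wasted hardware).
    cpu_util and target are fractions of capacity (0.0-1.0)."""
    return max(0, round(10 * (1 - abs(cpu_util - target) / target)))
```

A node at the 60% target scores 10, one at 90% scores 5, and a fully idle or badly overloaded node scores 0, so the scheduler would prefer nodes in the healthy middle of the range.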
Have you been working with others in the community?
This project drew a lot of inspiration from another project called usage-based scheduling, authored by Ravi Santosh Gudimetla at Red Hat. Although the Kubernetes scheduler takes CPU and memory requests into account, it excludes best-effort pods; the aim of usage-based scheduling was to account for the current utilization of nodes in a cluster, including best-effort pods. We've taken this idea a step further by integrating with Prometheus, an open source monitoring tool, which brings telemetry into the equation. We shared our work with Ravi, along with a range of use cases we'd like to address, and we're discussing how to collaborate moving forward.
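To show where the telemetry side of that integration comes from: Prometheus exposes an instant-query endpoint (`/api/v1/query`) that returns a JSON vector of samples, one per matching series. A scheduler extender needs those samples as a simple per-node map, which a small parsing step can provide. Using the `instance` label as the node identifier is an assumption here; real deployments may relabel.

```python
def node_metrics(prom_response, label="instance"):
    """Map each node (identified by `label`) to its metric value, given the
    decoded JSON body of a Prometheus /api/v1/query instant query."""
    out = {}
    for sample in prom_response["data"]["result"]:
        node = sample["metric"].get(label, "unknown")
        # Prometheus encodes each sample as [timestamp, "value-as-string"]
        out[node] = float(sample["value"][1])
    return out
```

A response carrying samples for `node-a` (value `"12.5"`) and `node-b` (value `"90"`) parses to `{"node-a": 12.5, "node-b": 90.0}`, which is exactly the shape the filtering and scoring steps above consume.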
We also raised the concept in one of the Kubernetes SIG Scheduling meetings; that SIG is responsible for the components that make pod placement decisions. We presented the concept at the Contributor Summit in Barcelona as well, to an extremely positive reception. The community has been especially interested in using telemetry-aware scheduling for dynamic resource utilization, for example to identify anomalous behavior and predict upcoming failures on a node.
Are there any key achievements or milestones you’d like to share?
Presenting our work at KubeCon in Barcelona was a big milestone for us, and I really appreciated the opportunity to share our work at such a large conference, connect with the community, gauge their interest, and get their feedback to help inform our next steps! I think this project is of great value to the community because it addresses a wide range of use cases, and I was excited to hear how receptive the community is to it. There we discovered that SIG Instrumentation is also interested in the work we’re doing.
In Barcelona, I was also excited to share my work on a small integration of telemetry-aware scheduling with the descheduler, a project I discovered in the course of my other work and one that Ravi also contributes to.
Telemetry-aware scheduling addresses a wide range of use cases and I’m looking forward to seeing it adopted in the community and used by customers.
What advice would you give to others thinking of diving into open source, or the Kubernetes community?
Being on the Cloud Native Orchestration team has been a great learning experience. When I started, I had no clue how to participate in an open source community. Over the last four years, I've learned how to engage in the community and how to write production code. For those who are curious about the open source community, and Kubernetes specifically, I would encourage them to dive in. It can be a bit intimidating, but everyone is very approachable if you just ask. I would suggest joining SIG meetings, subscribing to the Kubernetes Slack channel, and first listening to understand how things work. It's perfectly acceptable to observe at first; you don't need to speak up or contribute until you get a feel for things.
by
Community & Developer Advocate