Hyperpilot open sourced 100% of its products

Timothy Chen
6 min read · Mar 10, 2018


Today we open sourced all of the products we worked on last year, and in this post I want to quickly highlight each of them.

Hyperpilot remained in stealth mode for the entire last year, so let me explain a bit about what we were going after. Our mission is to bring intelligence to infrastructure that can drastically improve efficiency and performance. We see devops and systems engineers constantly challenged to make lots of choices about their container infrastructure with limited information and a very manual process. These choices range all the way from VM configs (instance type, region, etc.) and container configs (resource request/limit, count, affinity, etc.) to app-level configuration choices (JVM settings, etc.). Operators or developers often make a static choice, and most future maintainers have no idea why that choice was made. Worse, operators tend to overprovision in every possible way, which leads to very inefficient use of their infrastructure.

Therefore, we worked on three products that give operators the tools to continuously make better choices, and that in the future will automate these recommendations in their environment. In the following sections I will explain, at a high level, the three products that are now freely available to use and contribute to.

HyperConfig: intelligent configuration search

If you have used the cloud and deployed Docker containers with Kubernetes or Mesos, then you know that one of the first problems you face is figuring out the best resource configuration for each component you deploy. For example: what VM instance type should I use? How many nodes should I deploy? What container CPU and memory request/limit should I configure? All of these questions imply trade-offs between cost and performance. Using VM size as a concrete example, picking a large VM instance type costs a lot more but may give you better application performance, while choosing a VM that is too small leads to performance and SLA problems. The correct choice is not obvious: if you take a MySQL TPC-C benchmark and run it through every AWS instance type, the best performance-to-cost choice doesn't follow a linear, predictable pattern:

Doing an exhaustive search is also prohibitively expensive in both time and cost. Luckily this isn't a brand new problem, and there are quite a few research solutions out there, but we couldn't find a generic open source solution that works from a generic load test output.

Therefore, we created HyperConfig, inspired by the work on CherryPick, which suggests a set of AWS instance types for different criteria based on generic load test results.

Instead of exhaustively searching through every instance type, HyperConfig uses a well-known optimization technique called Bayesian optimization to find a near-optimal result while running far fewer sample points. And since the samples can be run in parallel, it greatly reduces the time and cost of finding a near-optimal instance type. Note that HyperConfig cannot guarantee finding the single best option, but in practice we found its selections to be close enough.
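To make the idea concrete, here is a minimal, self-contained sketch of such a search loop in Python. This is not HyperConfig's actual code: the candidate list, the toy `load_test` objective, the kernel length scale, and the upper-confidence-bound acquisition are all invented for illustration; a real run would replace `load_test` with an actual benchmark deployment.

```python
import numpy as np

# Hypothetical candidate instance types, encoded as (vCPUs, memory GiB).
# The names are real AWS types, but everything measured below is made up.
CANDIDATES = {
    "m4.large":   (2.0, 8.0),
    "c4.xlarge":  (4.0, 7.5),
    "m4.xlarge":  (4.0, 16.0),
    "r4.xlarge":  (4.0, 30.5),
    "m4.2xlarge": (8.0, 32.0),
}

def load_test(name):
    """Stand-in for a real benchmark run; returns a perf/cost score."""
    vcpu, mem = CANDIDATES[name]
    return vcpu * 10.0 - 0.5 * mem  # toy objective, not real data

def rbf(a, b, scale=10.0):
    """Squared-exponential kernel over the (vCPU, mem) encoding."""
    d = np.array(a) - np.array(b)
    return float(np.exp(-np.dot(d, d) / (2.0 * scale ** 2)))

def gp_posterior(observed, untried):
    """GP mean/std for each untried candidate given observed (x, y) pairs."""
    xs = [x for x, _ in observed]
    ys = np.array([y for _, y in observed])
    K = np.array([[rbf(a, b) for b in xs] for a in xs]) + 1e-6 * np.eye(len(xs))
    alpha = np.linalg.solve(K, ys)
    stats = {}
    for name, x in untried.items():
        k = np.array([rbf(x, xi) for xi in xs])
        mean = float(k @ alpha)
        var = max(1.0 - float(k @ np.linalg.solve(K, k)), 1e-9)
        stats[name] = (mean, var ** 0.5)
    return stats

def search(budget=3, kappa=2.0):
    """Benchmark `budget` instance types, choosing each next sample by the
    upper-confidence-bound acquisition (mean + kappa * std)."""
    tried = {"m4.large": load_test("m4.large")}  # seed sample
    while len(tried) < budget:
        untried = {n: x for n, x in CANDIDATES.items() if n not in tried}
        if not untried:
            break
        observed = [(CANDIDATES[n], y) for n, y in tried.items()]
        stats = gp_posterior(observed, untried)
        nxt = max(untried, key=lambda n: stats[n][0] + kappa * stats[n][1])
        tried[nxt] = load_test(nxt)
    best = max(tried, key=tried.get)
    return best, tried
```

The key property is the one described above: the loop spends its limited benchmark budget on candidates the surrogate model is either optimistic or uncertain about, rather than sweeping every instance type.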

For more information about how to run our demo and details about the code, please refer to sizing section of the analyzer.

HyperPath: resource bottleneck analysis

One common problem we see operators have is finding the root cause of a performance problem that shows up in their Kubernetes cluster. This is a very difficult task, as the performance issue can come from many different parts of your infrastructure. However, if we narrow the problem down to resource bottlenecks, then one can develop a system that attempts to diagnose which known resource bottleneck an application is experiencing when its performance suffers. HyperPath focuses on detecting CPU/memory/network/IO bottlenecks, and on diagnosing whether the problem comes from a container limit or a node limit.

Roughly, HyperPath works by assuming it can access both an application SLO metric (e.g., 95th-percentile latency) and resource metrics, which include container CPU/memory/network/IO and similar node-level metrics. With these data sources, it attempts to correlate which resource metrics that exceeded some threshold are the most likely root cause of the change in the application metric, and ranks the top few metrics with the highest correlation scores.
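A stripped-down sketch of that two-step idea (threshold filter, then correlation ranking) might look like the following. All of the metric names, sampled values, and thresholds here are invented for illustration; the real analyzer pulls these series from its metric store.

```python
import numpy as np

# Invented metric windows, sampled at a fixed interval during a period
# when the app's p95 latency exceeded its SLO. Thresholds are illustrative.
latency = np.array([110.0, 180.0, 240.0, 300.0, 280.0, 220.0])  # ms, SLO 150 ms
metrics = {
    "container/cpu_usage": np.array([0.55, 0.82, 0.95, 0.99, 0.97, 0.90]),
    "container/mem_usage": np.array([0.40, 0.41, 0.40, 0.42, 0.41, 0.40]),
    "node/net_tx_util":    np.array([0.30, 0.45, 0.50, 0.52, 0.50, 0.48]),
    "node/disk_io_util":   np.array([0.20, 0.21, 0.19, 0.22, 0.20, 0.21]),
}
thresholds = {
    "container/cpu_usage": 0.90,
    "container/mem_usage": 0.90,
    "node/net_tx_util":    0.50,
    "node/disk_io_util":   0.80,
}

def rank_bottlenecks(latency, metrics, thresholds, top_k=3):
    """Keep only metrics that crossed their threshold during the window,
    then rank them by Pearson correlation with the SLO metric."""
    scored = []
    for name, series in metrics.items():
        if series.max() < thresholds[name]:
            continue  # never saturated -> unlikely to be the root cause
        r = float(np.corrcoef(latency, series)[0, 1])
        scored.append((name, r))
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]
```

In this made-up window, memory and disk never cross their thresholds and are filtered out, while CPU tracks the latency curve most closely and ranks first.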

In the following demo, you will see that we are able to detect CPU and other resource bottlenecks that occur while app latency exceeds its SLO threshold:

For more information and source code, please refer to the diagnosis section of the analyzer.

Best effort controller: oversubscription for best effort jobs on Kubernetes

It is well known that operators overprovision resources for their applications. One of the most important reasons they do so intentionally is to accommodate spikes that occur unpredictably or infrequently. This also leads to low cluster utilization, since peak usage doesn't happen all the time. We can't simply allocate a small amount of resources and rely on cloud or container-orchestrator autoscalers, as they may take minutes to scale back up during a traffic spike. How do we utilize the overprovisioned resources, then? One way is to launch best-effort (BE) workloads next to the applications, with a way to ensure these workloads are throttled or killed in a timely manner when spikes happen.

Christos Kozyrakis and David Lo's work on Heracles aimed to solve this problem, and they evaluated it on a Google search workload. For the details of how it works, please refer to the original paper. At a very high level, it runs a controller on every node, and this controller has a sub-controller for each resource (CPU, memory, network, IO, caching, etc.) that watches its utilization. It then uses the main application's SLO metric as an input signal to determine when and how to scale resources for each workload. When the app metric is healthy, we can give more resources to the BE jobs, and inversely when the app metric suffers.
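The top-level decision logic can be sketched in a few lines. This collapses Heracles' per-resource sub-controllers into a single hypothetical CPU-shares knob for the BE job, and the SLO value, thresholds, and step sizes are all made-up illustrative numbers, not values from the paper or our controller.

```python
# Drastically simplified sketch of one tick of a per-node BE controller.
# All numbers below are invented for illustration.
SLO_MS = 150.0     # latency SLO for the latency-critical app
GROW_STEP = 32     # CPU shares granted to BE per tick when there is headroom
MAX_SHARES = 512   # cap on what the BE job may ever receive

def controller_tick(app_latency_ms, be_cpu_shares):
    """Return the BE job's CPU shares for the next interval."""
    if app_latency_ms > SLO_MS:
        return 0                                   # SLO violated: disable BE now
    if app_latency_ms > 0.9 * SLO_MS:
        return be_cpu_shares // 2                  # near the limit: back off fast
    return min(be_cpu_shares + GROW_STEP, MAX_SHARES)  # headroom: grow slowly
```

The asymmetry is the point: resources are taken away from BE jobs immediately when the SLO is at risk, but handed back only in small increments when there is clear headroom.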

At Hyperpilot, we implemented the Heracles algorithm and made it work on top of Kubernetes. In the following video you can see the BE controller in action when we run Spark with the BestEffort QoS class next to a microservice.

When Spark runs next to the microservice without the BE controller, you will see latency spike due to interference from the Spark job. Notice that even setting the BestEffort QoS class on the Spark job doesn't avoid the interference, as contention happens on resources other than the ones Kubernetes monitors. With the BE controller enabled, latency stays within the SLO threshold while BE jobs still make progress instead of simply being killed. In this demo we see a 2–3x utilization increase.

For more information about the codebase, please refer to here.

I hope these projects show how leveraging data from Kubernetes and applications can make a real difference in cost and performance.

Feel free to reach out if you have any questions to me (tim at hyperpilot dot io).


Timothy Chen

Entrepreneur focusing on solving problems with infrastructure and data