Running our infrastructure on autopilot!

Ibrahim Attwa
4 min read · Nov 6, 2018


The challenge for most businesses nowadays is how to balance faster product delivery with keeping cost in check. As an early-stage startup, keeping our cost under control is of paramount importance from a longevity and profitability standpoint. So, we decided to take a close look at one of the major spend items in our budget: infrastructure. The critical question was: are we utilizing our infrastructure efficiently, and if not, how can we improve to achieve better outcomes? Our mission as a company is to help businesses and organizations get the most out of their infrastructure without sacrificing service performance. Our product uses machine learning to monitor critical metrics and continuously balances application performance with infrastructure utilization. You can check our website for more details.

So, we wanted to dogfood our product and draw some learnings from it. One of our internal KPIs as a company is to keep infrastructure utilization above 70% across the three dimensions we care about: CPU, memory, and I/O. I’m sharing our journey and findings in the course of answering that question, and how we adjusted to improve infrastructure utilization.

The setup

Our product has two pieces. The first is an AI system consisting of time-series predictions, scalability decision analysis, optimization, and a feedback loop to learn from those decisions. The other piece is a real-time system in which our Agent scans, collects, and streams data into the analytics and prediction pipeline. The system was designed to be highly available with very low latency, for quick response to scalability needs. The product as such can be broken down into:

  • Services: A total of 100 services distributed across Deployments, StatefulSets, and Jobs
  • Infrastructure: The cluster ranged from 6 nodes during quiet periods all the way up to 50 nodes at peak data-crunching times
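To make the shape of that predict-and-decide loop concrete, here is a toy sketch (purely illustrative; our real pipeline uses ML models, and the numbers below are made up): a naive moving-average forecast of CPU usage feeding a replica-count decision.

```python
# Illustrative only: a naive moving-average "prediction" feeding a replica decision.
# Our real pipeline uses ML models; this just shows the shape of the loop.
from math import ceil

def predict_cpu_millicores(history, window=12):
    """Forecast the next interval's CPU usage as a moving average of recent samples."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def desired_replicas(predicted_millicores, target_per_replica=500, min_r=2, max_r=50):
    """Pick a replica count so each replica stays near its target utilization."""
    wanted = ceil(predicted_millicores / target_per_replica)
    return max(min_r, min(max_r, wanted))

usage_history = [900, 1100, 1250, 1400, 1600]   # millicores observed per interval
forecast = predict_cpu_millicores(usage_history)
print(desired_replicas(forecast))               # e.g. 3 replicas for ~1250m predicted
```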

Since this is a Kubernetes cluster, we started out using a combination of the HPA, to scale pods horizontally (e.g. add replicas based on observed CPU utilization), and the VPA, to automatically manage container requests so pods get scheduled onto nodes with sufficient resources available. Up to this point, we thought things were under control, but we were far from it!
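For readers less familiar with the HPA, this is roughly what wiring one up looks like with the official Kubernetes Python client; the deployment name, replica bounds, and 70% CPU target are illustrative values, not our actual configuration.

```python
# Hedged sketch: creating an autoscaling/v1 HPA with the official Python client.
# "metrics-api" and the 70% CPU target are illustrative, not our real config.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="metrics-api"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="metrics-api"),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=70,
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa)
```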

We had containers OOM-terminated, nodes made inaccessible by containers hijacking CPU, unexpected evictions of pods, and failed deployments, just to name a few. To address this, we had to either over-provision infrastructure to ensure critical services were always running (expensive), or continuously watch low-level metrics to keep our utilization at the proper levels. However, with more than 100 microservices each generating around 20 metrics at different resolutions, the old way of hand-crafting scalability rules around those metrics seemed like a “mission impossible”.

Walking the walk

We knew that the first real cluster to use our autopilot capability would be ours. The way it works starts with deploying an Agent to the cluster to gain insights into resource distribution and utilization. The Agent, which is just another pod, uses kubelet APIs to gather metrics, events, and other information, and streams them back to our backend every second. That information is then sampled and fed into our AI to analyze and build usage and prediction models. The AI system predicts and generates future scalability decisions based on actual usage metrics, which are then passed to the autopilot to apply. The chart below shows what the resource utilization of our cluster looked like at the beginning of this journey.

With 12% memory and 15% CPU utilization of cluster resources, we had some obvious work to do!
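We can’t share the Agent itself, but a stripped-down sketch of the kind of collection it starts from, reading per-pod CPU and memory usage from the kubelet Summary API (proxied here through the API server), might look something like this:

```python
# Minimal sketch: read per-pod CPU/memory usage from the kubelet Summary API,
# proxied through the API server. Our Agent does considerably more than this.
import json
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    resp = v1.connect_get_node_proxy_with_path(
        node.metadata.name, "stats/summary", _preload_content=False)
    summary = json.loads(resp.data)
    for pod in summary.get("pods", []):
        cpu = pod.get("cpu", {}).get("usageNanoCores", 0)
        mem = pod.get("memory", {}).get("workingSetBytes", 0)
        print(pod["podRef"]["name"], cpu / 1e6, "millicores", mem / 2**20, "MiB")
```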

Engaging the Autopilot

We decided to turn the Autopilot on for a set of namespaces and observe the results. Things seemed to work for a little while, but we ran into turbulence! It took the team a number of days and attempts to fix our code and stabilize things. Here is how the flight went:

Crashing the Plane!

On the first attempt, we discovered an issue with resource units which caused the AP to allocate far too many resources to containers, leading to depleted resources, many unschedulable pods, and eventually a crashed cluster. To make matters worse, our Agent didn’t yet have enough smarts to recover containers from bad decisions that caused them to crash due to OOM.
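This class of bug is easy to hit: Kubernetes expresses CPU in cores or millicores ("0.5" vs "500m") and memory with either decimal or binary suffixes ("512M" vs "512Mi"), and mixing them up inflates allocations dramatically. Below is a simplified sketch of that conversion, not our actual code, just to show where it can go wrong.

```python
# Simplified sketch of Kubernetes quantity parsing; mixing these units up is
# exactly the kind of bug that over-allocated our containers.
def cpu_to_millicores(q: str) -> int:
    """'250m' -> 250, '2' -> 2000."""
    return int(q[:-1]) if q.endswith("m") else int(float(q) * 1000)

def memory_to_bytes(q: str) -> int:
    """'512Mi' -> 536870912, '512M' -> 512000000, '1Gi' -> 1073741824."""
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
             "K": 10**3, "M": 10**6, "G": 10**9}
    for suffix, factor in sorted(units.items(), key=lambda kv: -len(kv[0])):
        if q.endswith(suffix):
            return int(float(q[:-len(suffix)]) * factor)
    return int(q)

# Treating 2000 millicores as 2000 whole cores is how you make a pod unschedulable.
assert cpu_to_millicores("250m") == 250
assert memory_to_bytes("512Mi") == 536870912
```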

Flying Too High

The 2nd attempt was better, though it wasn’t totally free of air bumps! The autopilot spun up too many replicas and instances of the same container, but the cluster was able to hold. Furthermore, the AP was busy executing scalability decisions at a very high frequency, every single hour, which caused a lot of disruption to the production environment.
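Dialing the frequency back essentially means putting a cooldown and a minimum-change threshold in front of every decision before it is applied. Here is a hedged sketch of such a guard; the six-hour window and 20% threshold are illustrative, not the values we settled on.

```python
# Illustrative guard: only apply a scaling decision if enough time has passed
# and the change is big enough to matter. Values are made up, not our settings.
import time

class DecisionThrottle:
    def __init__(self, cooldown_seconds=6 * 3600, min_change_ratio=0.2):
        self.cooldown = cooldown_seconds
        self.min_change = min_change_ratio
        self.last_applied = 0.0

    def should_apply(self, current_replicas: int, proposed_replicas: int) -> bool:
        if time.time() - self.last_applied < self.cooldown:
            return False  # still inside the cooldown window
        change = abs(proposed_replicas - current_replicas) / max(current_replicas, 1)
        if change < self.min_change:
            return False  # not worth disrupting production for a tiny adjustment
        self.last_applied = time.time()
        return True
```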

Safe Landing

The 3rd attempt was the charm: after fixing the unit-conversion bugs, adjusting the decision frequency, and enabling the Agent to recover OOMs in less than 10 seconds, things started to stabilize.
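Fast OOM recovery boils down to watching container statuses for an OOMKilled termination and raising the owning workload’s memory limit before a crash loop sets in. The sketch below is a rough illustration of that idea, not our Agent’s implementation; the "app" label lookup and the 1.5× bump factor are assumptions.

```python
# Rough sketch: watch for OOMKilled containers and raise the owning Deployment's
# memory limit. The 1.5x bump and label lookup are illustrative simplifications.
from kubernetes import client, config, watch

config.load_kube_config()
core, apps = client.CoreV1Api(), client.AppsV1Api()

for event in watch.Watch().stream(core.list_pod_for_all_namespaces):
    pod = event["object"]
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated or (cs.state.terminated if cs.state else None)
        if term and term.reason == "OOMKilled":
            # Assume the pod is labeled with its owning deployment (illustrative).
            deploy_name = (pod.metadata.labels or {}).get("app")
            if not deploy_name:
                continue
            dep = apps.read_namespaced_deployment(deploy_name, pod.metadata.namespace)
            container = dep.spec.template.spec.containers[0]
            limits = dict(container.resources.limits or {})  # assumes limits exist, in Mi
            current_mib = int(limits.get("memory", "256Mi").rstrip("Mi"))
            limits["memory"] = f"{int(current_mib * 1.5)}Mi"
            container.resources.limits = limits
            apps.patch_namespaced_deployment(deploy_name, pod.metadata.namespace, dep)
```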

As you can see from the chart above, we went from 15% memory utilization to almost 62%, and from 12% CPU to 28%. This reduced our monthly bill by 50% while maintaining the performance of our services. The biggest win for us, though, was the learnings and ideas we discovered throughout this exercise about providing DevOps with full control over their environment: things like maintenance windows, safety buffers, and much more, which we will share soon.

If you want to check the health of your cluster, gain some insights, and enjoy a hands-free ride, I encourage you to give it a try and let us know what you think.
