I created Kube Eagle, a Prometheus exporter that has proved to be a great tool for gaining more visibility into cluster resources in small to medium-sized clusters. Ultimately it helped me save hundreds of dollars by using appropriate machine types and tweaking my applications’ resource constraints so that they are better aligned with my workloads.
Before diving into the advantages and features of Kube Eagle, let me start by explaining the use case and why we needed better monitoring.
I was in charge of managing multiple clusters, each having between 4 and 50 nodes. There were up to 200 different microservices and applications running in a cluster. In order to utilize the available hardware resources more efficiently, the majority of these deployments were configured with burstable RAM and CPU resources. This way pods can take available resources if they actually need them, but they don’t block other applications from being scheduled on that node. Sounds great, doesn’t it?
Even though our overall cluster CPU (8%) and RAM usage (40%) were relatively low, we often faced problems with evicted pods. Those pods got evicted because they were trying to allocate more RAM than was available on a node. Back then we only had one dashboard to monitor the Kubernetes resources and it looked like this:
Using such a dashboard one can easily find nodes with high RAM and CPU usage, so I could quickly identify the overutilized nodes. However, the real struggle begins when you try to fix the cause. One option to avoid evictions would be to set guaranteed resources on all pods (request and limit resources are equal). The downside is that this leads to much worse hardware utilization. Cluster-wide we had hundreds of gigabytes of RAM available, yet some nodes were apparently running out of RAM while others still had 4–10GB of free RAM.
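To make the difference between burstable and guaranteed pods concrete, here is a minimal sketch of how Kubernetes assigns a quality of service class from a pod's requests and limits. The function name `qos_class` and the dictionary layout are made up for illustration, and the sketch compares quantity strings literally (so it would treat `"1"` and `"1000m"` as different), which the real implementation does not.

```python
def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or BestEffort.

    `containers` is a list of dicts shaped like the `resources` section of a
    container spec (hypothetical layout for this sketch):
    {"requests": {"cpu": "100m", "memory": "128Mi"}, "limits": {...}}
    """
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            req = requests.get(resource)
            lim = limits.get(resource)
            if req is not None or lim is not None:
                any_set = True
            # A missing request defaults to the limit, so only a missing
            # limit or an explicit request that differs breaks "Guaranteed".
            if lim is None or (req is not None and req != lim):
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


# Request below limit: the pod may burst, but can be evicted under pressure.
burstable = [{"requests": {"cpu": "100m", "memory": "128Mi"},
              "limits": {"cpu": "500m", "memory": "512Mi"}}]
# Request equals limit for CPU and RAM: resources are reserved up front.
guaranteed = [{"requests": {"cpu": "500m", "memory": "512Mi"},
               "limits": {"cpu": "500m", "memory": "512Mi"}}]

print(qos_class(burstable))
print(qos_class(guaranteed))
```

Most of our deployments fell into the first category, which is why a node could look half empty to the scheduler while its actual RAM was nearly exhausted.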
Apparently the Kubernetes scheduler didn’t distribute our workloads evenly across our available resources. The Kubernetes scheduler has to respect various configurations, e.g. affinity rules, taints & tolerations, and node selectors, which may restrict the set of available nodes. In my use case none of these configurations were in place, though. In that case, pod scheduling is based on the requested resources on each node.
The node which has the most unreserved resources and can satisfy the requested resource conditions will be chosen to run the pod. For our use case this means that the requested resources on our nodes did not align with the actual usage, and this is where Kube Eagle comes to the rescue, as it provides better resource monitoring for this purpose.
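The selection rule described above can be sketched in a few lines. This is a deliberately simplified model of what I just described, not the real kube-scheduler, which runs a whole pipeline of filter and scoring steps. The function `pick_node`, the node names, and all numbers are invented for illustration.

```python
def pick_node(nodes, pod_request_mib):
    """Return the node with the most unreserved RAM that fits the request.

    `nodes` maps a node name to its allocatable and already requested RAM
    in MiB. Actual usage is intentionally ignored, just as in the
    simplified rule described in the text.
    """
    unreserved = {
        name: node["allocatable"] - node["requested"]
        for name, node in nodes.items()
    }
    candidates = {
        name: free for name, free in unreserved.items()
        if free >= pod_request_mib
    }
    if not candidates:
        return None  # no node can satisfy the request
    return max(candidates, key=candidates.get)


# Hypothetical cluster: node-a has small requests but high real usage,
# node-b has large requests but is mostly idle.
nodes = {
    "node-a": {"allocatable": 16000, "requested": 4000, "used": 15000},
    "node-b": {"allocatable": 16000, "requested": 12000, "used": 5000},
}

chosen = pick_node(nodes, pod_request_mib=1000)
print(chosen)  # node-a
```

In this made-up example the new pod lands on `node-a`, the node that is already almost out of RAM, because only the requests are taken into account.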
Most Kubernetes clusters I worked with were only running Node exporter and Kube State Metrics to monitor the cluster. While Node exporter reports metrics about disk usage, IO stats, and CPU & RAM usage, Kube State Metrics exposes metrics about Kubernetes objects, such as requested & limit CPU / RAM resources.
We need to combine and aggregate the usage metrics with the resource request & limit metrics in Grafana to get really valuable information for our problem. Using the given two tools this is unfortunately way more cumbersome than it sounds, because the labels are named differently, and some metrics lack the metadata labels needed to combine and aggregate them easily. Kube Eagle does that for you and this is the suggested dashboard created based on its metrics:
We were able to identify and solve multiple issues with our resources, which ultimately allowed us to save a lot of hardware resources:
- Developers who deployed their microservices did not know how many resources their services actually needed (or didn’t care). We had no monitoring in place to easily identify misconfigured resource requests, because this requires seeing the usage along with the resource requests and limits. They can now use the newly provided Prometheus metrics to monitor the actual usage and adjust their resource requests & limits.
- JVM applications take as much RAM as they can. The garbage collector only releases memory if more than ~75% RAM is in use. Since most services were burstable in RAM, it was always consumed by the JVM. Thus we had much higher RAM usage on all these Java services than expected.
- Some applications had way too large RAM requests, which caused the Kubernetes scheduler to skip these nodes for all the other applications, even though the actual usage was lower than on all other nodes. The large RAM request was accidentally caused by a developer who specified resource requests in bytes and added another digit by mistake. 20GB of RAM instead of 2GB had been requested and no one ever noticed it. The application had 3 replicas and thus 3 different nodes were affected by that over-allocation.
- After adjusting the resource constraints and rescheduling the pods with appropriate requests, we achieved a pretty much perfectly balanced hardware utilization across all nodes. We noticed we could shut down a couple of nodes because of that. Then we noticed that we were using the wrong machine types (CPU optimized instead of high memory machines). After changing the machine type we could drain and remove more nodes again.
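The kind of check that surfaced these issues boils down to putting each pod's requested RAM next to its actual usage and looking at the ratio. The following is only a rough sketch of that idea and not Kube Eagle's code: the function `find_misaligned`, the threshold of 3x, and the pod names and numbers are all hypothetical.

```python
GIB = 1024 ** 3


def find_misaligned(requests_bytes, usage_bytes, factor=3.0):
    """Flag pods whose RAM request is far above or far below their usage.

    Both arguments map a pod name to a value in bytes. A pod is reported
    when request / usage is at least `factor` or at most 1 / `factor`.
    """
    findings = []
    for pod, requested in sorted(requests_bytes.items()):
        used = usage_bytes.get(pod)
        if not used:
            continue  # no usage sample for this pod, nothing to compare
        ratio = requested / used
        if ratio >= factor:
            findings.append((pod, "over-requested", round(ratio, 1)))
        elif ratio <= 1 / factor:
            findings.append((pod, "under-requested", round(ratio, 1)))
    return findings


# Hypothetical values: one request with an extra digit (20 GiB instead of
# 2 GiB), one well-sized service, and one JVM service bursting far beyond
# its request.
requests_bytes = {"billing": 20 * GIB, "search": 2 * GIB, "jvm-api": 1 * GIB}
usage_bytes = {"billing": 1.5 * GIB, "search": 1.8 * GIB, "jvm-api": 5 * GIB}

for pod, verdict, ratio in find_misaligned(requests_bytes, usage_bytes):
    print(f"{pod}: {verdict} (request/usage = {ratio})")
```

With these sample values the sketch reports `billing` as over-requested and `jvm-api` as under-requested, while `search` passes. In practice you would look at usage over a longer time window rather than a single sample before changing any requests.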
Burstable resource configurations in a cluster provide the benefit of utilizing the available hardware more efficiently, but they can also cause trouble, as the Kubernetes scheduler does pod scheduling based on resource requests. In order to achieve a better degree of capacity utilization without running into issues you need a decent monitoring solution. Kube Eagle (the Prometheus exporter and its Grafana dashboard) has been created for this purpose and can assist you with this challenge.