Inspector Murphy, The Kubernetes quality enforcer
The quality enforcer bot whose prime directive is to terminate bad apps & services (yes, it's RoboCop… kinda… next stop, Skynet… actually, yes)
- Failed pods remaining in Kubernetes for up to 5 months, taking up resources (disk, memory, and generally making kubectl proxy very busy)
- Old ReplicaSets remaining even after new deployments
- Misbehaving pods/apps causing havoc in test environments (demonstrating low quality), driving up Logz.io costs
- Directive 1: Burn failed/old pods and ReplicaSets with fire
- Directive 2: Remove apps exhibiting bad behaviour from environment
A daily CronJob (deployed to Kubernetes) which uses the official Kubernetes .NET client to fetch all pod objects. It then passes each pod through a set of rules; if a rule is applicable, an action is performed. Simples. Link to full repo coming soon.
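The loop above can be sketched roughly as follows. This is a minimal illustration, not the actual repo's code: the IPodRule interface, class names, and the rule wiring are all assumptions; only ListPodForAllNamespacesAsync is a real method on the official .NET client.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using k8s;
using k8s.Models;

// Hypothetical rule contract: decide whether a pod matches, then act on it.
public interface IPodRule
{
    bool IsApplicable(V1Pod pod);
    Task ApplyAsync(V1Pod pod);
}

public class InspectorMurphy
{
    private readonly IKubernetes _kubernetes;
    private readonly IEnumerable<IPodRule> _rules;

    public InspectorMurphy(IKubernetes kubernetes, IEnumerable<IPodRule> rules)
    {
        _kubernetes = kubernetes;
        _rules = rules;
    }

    public async Task PatrolAsync()
    {
        // Fetch every pod in the cluster and run each one through the rule set.
        var pods = await _kubernetes.ListPodForAllNamespacesAsync();
        foreach (var pod in pods.Items)
        {
            foreach (var rule in _rules)
            {
                if (rule.IsApplicable(pod))
                {
                    await rule.ApplyAsync(pod);
                }
            }
        }
    }
}
```

Keeping each rule behind a small interface like this makes it cheap to add new directives later without touching the patrol loop.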
To delete a pod, I use the following method on the Kubernetes API (note that namespace is a reserved word in C#, so the pod's namespace is read from its metadata):
await _kubernetes.DeleteNamespacedPodWithHttpMessagesAsync(new V1DeleteOptions(), pod.Metadata.Name, pod.Metadata.NamespaceProperty);
For misbehaving apps, I scale replicas to zero.
var deploymentName = GetDeploymentNameFromPod(item);
var newReplicaCount = await ReplaceCurrentScaleValuesWithZero(item, deploymentName);
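A sketch of what a helper like ReplaceCurrentScaleValuesWithZero might do, using the client's Scale subresource methods. The method name, namespace handling, and return value here are assumptions; the Read/Replace scale calls exist on recent versions of the official client.

```csharp
using System.Threading.Tasks;
using k8s;
using k8s.Models;

public partial class InspectorMurphy
{
    // Hypothetical helper: scale a deployment to zero and report how many
    // replicas it had before, so the action can be logged or reverted.
    private async Task<int> ScaleDeploymentToZeroAsync(string deploymentName, string ns)
    {
        var scale = await _kubernetes.ReadNamespacedDeploymentScaleAsync(deploymentName, ns);
        var previousReplicas = scale.Spec.Replicas ?? 0;

        // Stop the misbehaving app without deleting its Deployment,
        // so the owning team can still inspect and fix it.
        scale.Spec.Replicas = 0;
        await _kubernetes.ReplaceNamespacedDeploymentScaleAsync(scale, deploymentName, ns);

        return previousReplicas;
    }
}
```

Scaling to zero rather than deleting keeps the Deployment (and its history) around, which is friendlier than Directive 1's flamethrower.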
P.S. This was possible because we follow the same logging conventions for Kubernetes apps (see Logging is great but…)
What do teams see?
In total, a select few apps were responsible for 2,636 errors every 30 minutes. This has now been trimmed to nearly nothing since Inspector Murphy went on duty (unless, of course, something breaks). The number of failed pods also went from 1,105 to zero.
The Kubernetes infrastructure also benefited from Inspector Murphy. Available memory went up on the worker nodes (the graph shows just one box); in total, 3 of the 14 boxes benefited.
Overall CPU usage from the kube-controller-manager also dropped (the spike came from calling delete on 5,532 ReplicaSets in dev and 4,000 ReplicaSets in staging). The ReplicaSet rule was added later on, hence the different times in the graphs.
We saw fewer warnings from Logz.io. Since we no longer host our own Elasticsearch for logging, every gigabyte matters.
Rules, rules and more rules!
- Pods incorrectly named like “JG-Demo-xxx” or “JG-Luis-XXX” (yes, some of my own pods will be killed too)
- Pods in constant crash loop backoff
- Pods with expired Vault tokens
- Pods not conforming to standards
- Pods missing team name label
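As one example of how a rule from the list above might look, here is a possible crash-loop check. The class name and restart threshold are assumptions; the container-status fields are standard in the pod status object.

```csharp
using System.Linq;
using k8s.Models;

// Hypothetical rule: a pod qualifies for termination when any of its
// containers is waiting in CrashLoopBackOff with a high restart count.
public class CrashLoopBackOffRule
{
    private const int RestartThreshold = 5; // assumed cut-off, not from the repo

    public bool IsApplicable(V1Pod pod)
    {
        var statuses = pod.Status?.ContainerStatuses;
        if (statuses == null)
        {
            return false;
        }

        return statuses.Any(s =>
            s.State?.Waiting?.Reason == "CrashLoopBackOff" &&
            s.RestartCount >= RestartThreshold);
    }
}
```

A restart threshold keeps the rule from killing pods that merely hiccuped once during a rollout.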
- Integrate with Skynet (aka JG Genesis) to prevent deployments in the first place and track quality of an app
- Keep cloning myself into bots
We also need to address the root cause of the many failed pods and old ReplicaSets. This stems from adopting Kubernetes early, before revision history limits were introduced as defaults. For example, changing a resource's apiVersion from “extensions/v1beta1” to “apps/v1beta1” introduces a default revisionHistoryLimit of 2.
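For deployments still on the older apiVersion, the limit can be set explicitly instead of waiting for the migration; a fragment of a Deployment spec (the name is hypothetical):

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-app            # hypothetical app name
spec:
  revisionHistoryLimit: 2 # cap how many old ReplicaSets are kept for rollbacks
```

With this set, Kubernetes garbage-collects the older ReplicaSets itself, and Directive 1 has far less to burn.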
Further reading:
- Garbage collection — “a helpful function of kubelet that will clean up unused images and unused…” (kubernetes.io)
- “Are your Kubernetes ReplicaSets slowing you down? With a quick little clean up, our CPU load went down by 10%!” (www.weave.works)
- “Here at Shazam we have been deploying our apps with Kubernetes for a while. In order to make the management of the deployed application…” (blog.shazam.com)