Inspector Murphy, the Kubernetes quality enforcer


The quality-enforcer bot whose prime directive is to terminate bad apps & services (yes, it’s RoboCop… kinda… next stop Skynet… actually yes)
Problem
  • Failed pods remaining in Kubernetes for up to 5 months, taking up resources (disk, memory) and generally keeping kubectl proxy very busy
  • Old ReplicaSets remaining even after new deployments
  • Misbehaving pods/apps causing havoc in test environments (demonstrating low quality) and driving up Logz.io costs
Container statuses across different environments (1105 failed containers in total)
Solution
  • Directive 1: Burn failed/old pods and ReplicaSets with fire
  • Directive 2: Remove apps exhibiting bad behaviour from the environment
How

A daily cron job (deployed to Kubernetes) uses the Kubernetes .NET client to get all pod objects, then passes each pod through a set of rules. If a rule is applicable, an action is performed. Simples. Link to full repo coming soon.
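
To give a feel for that pipeline, here is a minimal sketch of how it could be wired up with the .NET client. The IPodRule interface and the ShouldApply/ApplyAsync/PatrolAsync names are illustrative, not taken from the actual repo.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using k8s;
using k8s.Models;

// Each rule decides whether it applies to a pod and, if so, performs its clean-up action.
public interface IPodRule
{
    bool ShouldApply(V1Pod pod);
    Task ApplyAsync(V1Pod pod);
}

public class InspectorMurphy
{
    private readonly IKubernetes _kubernetes;
    private readonly IEnumerable<IPodRule> _rules;

    public InspectorMurphy(IKubernetes kubernetes, IEnumerable<IPodRule> rules)
    {
        _kubernetes = kubernetes;
        _rules = rules;
    }

    public async Task PatrolAsync()
    {
        // Fetch every pod in the cluster and run each one through the rule set.
        var pods = await _kubernetes.ListPodForAllNamespacesAsync();
        foreach (var pod in pods.Items)
        {
            foreach (var rule in _rules.Where(r => r.ShouldApply(pod)))
            {
                await rule.ApplyAsync(pod);
            }
        }
    }
}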

Meets the first directive: marks failed pods for deletion
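
The embedded snippet doesn’t come through here, but a rule along these lines (same usings and IPodRule interface as the pipeline sketch above; the class and member names are illustrative) would do the job: flag any pod whose phase is Failed and delete it.

// Hypothetical failed-pod rule; the phase check and delete call mirror what the post describes.
public class FailedPodRule : IPodRule
{
    private readonly IKubernetes _kubernetes;

    public FailedPodRule(IKubernetes kubernetes) => _kubernetes = kubernetes;

    public bool ShouldApply(V1Pod pod) =>
        string.Equals(pod.Status?.Phase, "Failed", StringComparison.OrdinalIgnoreCase);

    public async Task ApplyAsync(V1Pod pod) =>
        await _kubernetes.DeleteNamespacedPodWithHttpMessagesAsync(
            new V1DeleteOptions(), pod.Metadata.Name, pod.Metadata.NamespaceProperty);
}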

Fulfills part of the second directive: checks for high error counts from apps
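
Again the embedded code isn’t shown, but the shape of the rule is roughly this (same usings and IPodRule interface as the earlier sketch). How the error count is obtained, presumably a Logz.io/Elasticsearch query keyed on the shared logging conventions, is an assumption here, as are IErrorCountProvider, IDeploymentScaler and the threshold.

// Hypothetical source of recent error counts for the app behind a pod.
public interface IErrorCountProvider
{
    Task<int> GetErrorCountAsync(V1Pod pod, TimeSpan window);
}

// Hypothetical wrapper around the scale-to-zero call shown further down.
public interface IDeploymentScaler
{
    Task ScaleToZeroAsync(V1Pod pod);
}

public class HighErrorCountRule : IPodRule
{
    private const int ErrorThreshold = 500; // illustrative: errors per 30-minute window
    private readonly IErrorCountProvider _errorCounts;
    private readonly IDeploymentScaler _scaler;

    public HighErrorCountRule(IErrorCountProvider errorCounts, IDeploymentScaler scaler)
    {
        _errorCounts = errorCounts;
        _scaler = scaler;
    }

    public bool ShouldApply(V1Pod pod) => pod.Status?.Phase == "Running";

    public async Task ApplyAsync(V1Pod pod)
    {
        var errors = await _errorCounts.GetErrorCountAsync(pod, TimeSpan.FromMinutes(30));
        if (errors > ErrorThreshold)
        {
            // Misbehaving app: scale its owning deployment down to zero replicas.
            await _scaler.ScaleToZeroAsync(pod);
        }
    }
}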

To delete a pod, I use the following method on the Kubernetes API.

await _kubernetes.DeleteNamespacedPodWithHttpMessagesAsync(new V1DeleteOptions(), pod.Metadata.Name, pod.Metadata.NamespaceProperty);

For misbehaving apps, I scale replicas to zero.

var deploymentName = GetDeploymentNameFromPod(item);
var newReplicaCount = await ReplaceCurrentScaleValuesWithZero(item, deploymentName);
await _kubernetes.ReplaceNamespacedDeploymentScaleWithHttpMessagesAsync(
    newReplicaCount.Body,
    deploymentName,
    item.Metadata.NamespaceProperty);
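
The two helpers above aren’t shown in the post. Under the assumption that they work against the deployment’s scale subresource and the pod’s owner references, and that the client is an older generation whose WithHttpMessagesAsync calls return Microsoft.Rest HttpOperationResponse wrappers, they might look roughly like this (the exact scale type varies between client versions):

// Read the deployment's scale subresource, set desired replicas to zero and hand
// the modified response back so its Body can be passed to the Replace call above.
private async Task<HttpOperationResponse<V1Scale>> ReplaceCurrentScaleValuesWithZero(
    V1Pod pod, string deploymentName)
{
    var scale = await _kubernetes.ReadNamespacedDeploymentScaleWithHttpMessagesAsync(
        deploymentName, pod.Metadata.NamespaceProperty);
    scale.Body.Spec.Replicas = 0;
    return scale;
}

// A pod created by a Deployment is owned by a ReplicaSet named
// "<deployment>-<pod-template-hash>", so stripping the final segment of the
// owning ReplicaSet's name recovers the deployment name.
private static string GetDeploymentNameFromPod(V1Pod pod)
{
    var replicaSetName = pod.Metadata.OwnerReferences
        .First(o => o.Kind == "ReplicaSet").Name;
    return replicaSetName.Substring(0, replicaSetName.LastIndexOf('-'));
}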

P.S. This was possible because we follow the same logging conventions for Kubernetes apps… see Logging is great but…

What do teams see?
Pods being scaled down/terminated, as seen in kubectl
Inspector Murphy notifying teams via Slack that their pod was terminated
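
The post doesn’t show how the Slack message is sent. A simple way to do it, assuming an incoming webhook (the SlackNotifier class and webhook URL are placeholders, not from the repo), would be:

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using k8s.Models;
using Newtonsoft.Json;

// Posts a short message to a Slack incoming webhook when a pod is terminated.
public class SlackNotifier
{
    private static readonly HttpClient Http = new HttpClient();
    private readonly Uri _webhookUrl; // injected from configuration

    public SlackNotifier(Uri webhookUrl) => _webhookUrl = webhookUrl;

    public async Task NotifyPodTerminatedAsync(V1Pod pod, string reason)
    {
        var payload = new
        {
            text = $"Inspector Murphy terminated pod {pod.Metadata.Name} " +
                   $"in namespace {pod.Metadata.NamespaceProperty}: {reason}"
        };
        var content = new StringContent(
            JsonConvert.SerializeObject(payload), Encoding.UTF8, "application/json");
        await Http.PostAsync(_webhookUrl, content);
    }
}
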
Why?
Reduce overall errors in Logz.io

Remove misbehaving apps from dev/staging

In total, a select few apps were responsible for 2636 errors every 30 minutes. This has now been trimmed to nearly nothing since Inspector Murphy went on duty (unless, of course, something breaks). The number of failed pods also went from 1105 to zero.

The Kubernetes infrastructure also benefited from Inspector Murphy. Available memory went up on worker nodes (this is just one box); in total, 3 out of the 14 boxes benefited.

Memory freed up when the first rule ran across all namespaces (Deleting failed pods)

Reduction in overall CPU usage from the kube-controller-manager (the spike was from calling delete on 5532 ReplicaSets on dev and 4000 ReplicaSets on staging). The ReplicaSet rule was added later on (hence the different times in the graphs)

CPU reducing after the second rule (delete old ReplicaSets) was executed on dev and staging only.
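
The ReplicaSet rule itself isn’t shown in the post either. A rough sketch, assuming a client version that exposes the apps-group ReplicaSet endpoints under these method names, is to delete any ReplicaSet whose desired and actual replica counts are both zero (i.e. one superseded by a newer revision):

// Hypothetical old-ReplicaSet clean-up: superseded ReplicaSets sit at 0/0 replicas.
public async Task DeleteOldReplicaSetsAsync()
{
    var replicaSets = await _kubernetes.ListReplicaSetForAllNamespacesAsync();
    foreach (var rs in replicaSets.Items)
    {
        var desired = rs.Spec?.Replicas ?? 0;
        var current = rs.Status?.Replicas ?? 0;
        if (desired == 0 && current == 0)
        {
            await _kubernetes.DeleteNamespacedReplicaSetAsync(
                rs.Metadata.Name, rs.Metadata.NamespaceProperty);
        }
    }
}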

We saw fewer warnings from Logz.io. Since we no longer host our own Elasticsearch for logging, every gigabyte matters.

What’s next?

Rules, rules and more rules!

  1. Pods incorrectly named, like “JG-Demo-xxx” or “JG-Luis-XXX” (yes, some of my own pods will be killed too)
  2. Pods in constant crash loop backoff (see the sketch after this list)
  3. Pods with expired Vault tokens
  4. Pods not conforming to standards
  5. Pods missing team name label
  6. Integrate with Skynet (aka JG Genesis) to prevent deployments in the first place and track quality of an app
  7. Keep cloning myself into bots
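
As a taste of rule 2, a crash-loop check (again reusing the hypothetical IPodRule shape and usings from the earlier sketches; the restart threshold is illustrative) could look like this:

// A pod is treated as crash-looping when any container is waiting with reason
// "CrashLoopBackOff" and has already restarted more than a handful of times.
public class CrashLoopRule : IPodRule
{
    private const int RestartThreshold = 5; // illustrative

    public bool ShouldApply(V1Pod pod) =>
        pod.Status?.ContainerStatuses?.Any(c =>
            c.State?.Waiting?.Reason == "CrashLoopBackOff" &&
            c.RestartCount >= RestartThreshold) == true;

    public Task ApplyAsync(V1Pod pod)
    {
        // Same treatment as other misbehaving apps: scale the owning deployment to zero.
        return Task.CompletedTask;
    }
}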

We also need to address the root cause of many failed pods and old ReplicaSets. This stems from us adopting Kubernetes early, before revision history limits were introduced as defaults. For example, changing the resource type from “extensions/v1beta1” to “apps/v1beta1” introduces a default revision history limit of 2.

Useful links: