Keiko: Automated Kubernetes at Scale

David Masselink
Published in keikoproj
Nov 15, 2019 · 6 min read


Keiko is an open-source project that automates the operation of Kubernetes clusters at scale using Custom Resources that run as an integrated part of the cluster rather than via external tools and management frameworks. Keiko custom resources can be used together or individually to automate specific aspects of cluster operation.

Intuit initially developed the Keiko custom resources to automate our Kubernetes infrastructure consisting of 160 clusters and 6000 nodes running thousands of services in production. These clusters serve live user traffic 24x7 and are upgraded every two to four weeks to meet stringent security and compliance requirements. Here is a talk from AWS Community Day about the challenges faced daily while operating this massive infrastructure.

Keiko is a new project, so we hope to quickly expand the scope of automation as well as supported cloud environments with help from the community!

The Problem

Given Intuit’s business needs, it was clear that simply running a few large, manually managed clusters was not going to scale. Different business units had different security and compliance needs and required isolation from each other. This brought up some basic questions around automating the operation of Kubernetes clusters.

  • How to regularly upgrade clusters?
  • How to configure and manage critical cluster services for logging, monitoring, policy enforcement, forensic analysis, etc?
  • How to ensure healthy and responsive nodes and services?
  • How to optimize cost?

Each Keiko custom resource is designed to automate a specific operational concern.

Addon-Manager automates the deployment and lifecycle management of critical cluster-wide services. Examples of such services include metrics, logging, networking, autoscaling and ingress.

$ kubectl get addons
NAME                   VERSION   STATUS      REASON   AGE
cluster-autoscaler     v0.1      Succeeded            17h
event-router           v0.2      Succeeded            17h
external-dns           v0.2      Succeeded            17h
minion-manager         v0.8      Succeeded            17h
node-reaper            v0.2      Succeeded            17h
pod-reaper             v0.2      Succeeded            17h
upgrade-manager        v0.1      Succeeded            17h
wavefront-s3-adapter   v0.2      Succeeded            17h
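
Each addon above is declared as an Addon custom resource. Here is a minimal sketch of what such a manifest might look like, assuming the addonmgr.keikoproj.io/v1alpha1 API group and Helm-style packaging (exact field names may differ across versions):

apiVersion: addonmgr.keikoproj.io/v1alpha1
kind: Addon
metadata:
  name: cluster-autoscaler
  namespace: addon-manager-system
spec:
  pkgName: cluster-autoscaler   # name of the packaged addon
  pkgVersion: v0.1              # version shown in the listing above
  pkgType: helm                 # packaging/deployment mechanism (assumed)
  params:
    namespace: kube-system      # where the addon's workloads are installed

Declaring addons this way means the desired state lives in the cluster itself, so a GitOps pipeline or an operator can reconcile it without external tooling.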

Instance-Manager enables multi-tenant clusters via a Kubernetes-native grouping of nodes. In this context, a “tenant” could be a development team focused on a particular application or service. This abstraction allows each team the illusion of an independent cluster while solving noisy neighbor concerns.

Instance-Manager also supports updating the instance group configuration, and works with upgrade-manager to upgrade the nodes themselves.

$ kubectl get instancegroups
NAME               STATE   MIN   MAX   GROUP NAME
instance-group-1   Ready   3     6     instan-NodeGroup-LSXLFVMIK2WQ
instance-group-2   Ready   3     6     instan-NodeGroup-RJJ6AJNFAYIH

$ kubectl get nodes
NAME             STATUS   ROLES              AGE
ip-10-10-10-10   Ready    instance-group-1   17h
ip-10-10-10-11   Ready    instance-group-1   17h
ip-10-10-10-12   Ready    instance-group-1   17h
ip-10-10-10-20   Ready    instance-group-2   17h
ip-10-10-10-21   Ready    instance-group-2   17h
ip-10-10-10-22   Ready    instance-group-2   17h
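
Each row in the instancegroups listing corresponds to an InstanceGroup custom resource. A minimal sketch, assuming the instancemgr.keikoproj.io/v1alpha1 API group; the provisioner and configuration fields vary by version and cloud setup:

apiVersion: instancemgr.keikoproj.io/v1alpha1
kind: InstanceGroup
metadata:
  name: instance-group-1
  namespace: instance-manager
spec:
  provisioner: eks-cf                 # provisioning backend (assumed: EKS via CloudFormation)
  eks-cf:
    minSize: 3                        # matches MIN in the listing above
    maxSize: 6                        # matches MAX in the listing above
    configuration:
      clusterName: my-cluster         # hypothetical cluster name
      instanceType: m5.large          # hypothetical instance type
      subnets:
      - subnet-0bf9bc85fEXAMPLE       # hypothetical subnet for the node group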

Upgrade-Manager defines a “RollingUpgrade” custom resource which makes rolling updates of instances in an AutoScaling group simple and safe to carry out. It also allows for custom pre- and post-drain scripts for nodes.

$ kubectl get rollingupgrades
NAME                      STATUS      TOTAL NODES   NODES PROCESSED
rolling-instancegroup-1   Completed   3             3
rolling-instancegroup-2   Completed   3             3
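
A RollingUpgrade targets a specific AutoScaling group and can run hooks around each node drain. A minimal sketch, assuming the upgrademgr.keikoproj.io/v1alpha1 API group; field names are illustrative:

apiVersion: upgrademgr.keikoproj.io/v1alpha1
kind: RollingUpgrade
metadata:
  name: rolling-instancegroup-1
spec:
  asgName: instan-NodeGroup-LSXLFVMIK2WQ   # scaling group from the instancegroups listing
  nodeIntervalSeconds: 300                 # pause between nodes (assumed field)
  preDrain:
    script: |
      echo "about to drain a node"         # custom pre-drain hook
  postDrain:
    script: |
      echo "finished draining a node"      # custom post-drain hook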

Lifecycle-Manager carries out the important work of draining and decommissioning instances as well as deregistering ALB/ELB targets, eliminating the negative impact of rebalance and termination events.

$ aws autoscaling terminate-instance-in-auto-scaling-group --instance-id i-0d3ba307bc6cebeda --region us-west-2 --no-should-decrement-desired-capacity
{
    "Activity": {
        "ActivityId": "5285b629-6a18-0a43-7c3c-f76bac8205f0",
        "AutoScalingGroupName": "my-scaling-group",
        "Description": "Terminating EC2 instance: i-0d3ba307bc6cebeda",
        "Cause": "At 2019-10-02T02:44:11Z instance i-0d3ba307bc6cebeda was taken out of service in response to a user request.",
        "StartTime": "2019-10-02T02:44:11.394Z",
        "StatusCode": "InProgress",
        "Progress": 0,
        "Details": "{\"Subnet ID\":\"subnet-0bf9bc85fEXAMPLE\",\"Availability Zone\":\"us-west-2c\"}"
    }
}

$ kubectl logs lifecycle-manager
time="2019-10-02T02:44:05Z" level=info msg="starting lifecycle-manager service v0.3.0"
time="2019-10-02T02:44:05Z" level=info msg="region = us-west-2"
time="2019-10-02T02:44:05Z" level=info msg="queue = https://sqs.us-west-2.amazonaws.com/00000EXAMPLE/lifecycle-manager-queue"
time="2019-10-02T02:44:05Z" level=info msg="polling interval seconds = 10"
time="2019-10-02T02:44:05Z" level=info msg="drain timeout seconds = 300"
time="2019-10-02T02:44:05Z" level=info msg="drain retry interval seconds = 30"
time="2019-10-02T02:44:05Z" level=info msg="spawning sqs poller"
time="2019-10-02T02:44:12Z" level=info msg="spawning event handler"
time="2019-10-02T02:44:12Z" level=info msg="hook heartbeat timeout interval is 60, will send heartbeat every 30 seconds"
time="2019-10-02T02:44:12Z" level=info msg="draining node ip-10-10-10-10.us-west-2.compute.internal"
time="2019-10-02T02:44:42Z" level=info msg="sending heartbeat for event with instance 'i-0d3ba307bc6cebeda' and sleeping for 30 seconds"
time="2019-10-02T02:44:45Z" level=info msg="completed drain for node 'ip-10-10-10-10.us-west-2.compute.internal'"
time="2019-10-02T02:44:45Z" level=info msg="deregistering i-0d3ba307bc6cebeda from arn:aws:elasticloadbalancing:us-west-2:00000EXAMPLE:targetgroup/targetgroup-9b26c8689f3b53a1ef0/53e66aede612f044"
time="2019-10-02T02:44:45Z" level=info msg="sending heartbeat for event with instance 'i-0d3ba307bc6cebeda' and sleeping for 30 seconds"
time="2019-10-02T02:45:15Z" level=info msg="setting lifecycle event as completed with result: 'CONTINUE'"

Governor is a collection of tools that improves the stability of large Kubernetes clusters by proactively terminating stuck pods as well as misbehaving nodes.

It is delivered as a single Docker image and contains both a node-reaper and a pod-reaper. Node-reaper force-terminates unhealthy scaling-group worker nodes so that healthy replacements can come up in their place. Pod-reaper automatically force-terminates pods that have been stuck in the “Terminating” state for a configurable period of time.

$ kubectl get pods
NAME             READY   STATUS        RESTARTS   AGE
test-pod-7k4r5   1/1     Terminating   0          18h
test-pod-bgctm   1/1     Terminating   0          17h
test-pod-dqmjc   1/1     Terminating   0          18h
test-pod-jzn52   1/1     Running       0          17h

$ kubectl get nodes
NAME             STATUS     ROLES              AGE
ip-10-10-10-10   Unknown    instance-group-1   17h
ip-10-10-10-11   NotReady   instance-group-1   17h
ip-10-10-10-12   NotReady   instance-group-1   17h
ip-10-10-10-20   Ready      instance-group-2   17h
ip-10-10-10-21   Ready      instance-group-2   17h
ip-10-10-10-22   Ready      instance-group-2   17h
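
Both reapers are typically run on a schedule from the single governor image. A hedged sketch of a CronJob invoking the pod-reaper (the image tag, binary path, and flags are assumptions; consult the governor docs for the exact set):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: pod-reaper
  namespace: governor
spec:
  schedule: "*/10 * * * *"                  # look for stuck pods every 10 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: pod-reaper
            image: keikoproj/governor:latest  # single image contains both reapers (tag assumed)
            command: ["/bin/governor"]        # binary path assumed
            args: ["reap", "pod", "--reap-after", "10"]  # reap pods stuck >10 min (flag assumed)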

Kube-Forensics provides security capabilities by allowing a cluster administrator to dump the current state of a running pod, as well as all of its containers, so that security professionals can perform off-line forensic analysis.

This functionality is especially valuable if a suspicious pod is found and admins want to remove it ASAP, but without losing the details of its history needed to determine adversary capabilities and attack vectors.
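
The dump is requested declaratively. A minimal sketch of such a checkpoint request, assuming a forensics.keikoproj.io/v1alpha1 API group with a PodCheckpoint kind and an S3 destination (field names are illustrative):

apiVersion: forensics.keikoproj.io/v1alpha1
kind: PodCheckpoint
metadata:
  name: suspicious-pod-checkpoint
  namespace: forensics-system
spec:
  destination: s3://my-forensics-bucket   # hypothetical bucket for the dumped state
  subject: suspicious-pod                 # pod to capture before it is removed
  namespace: default                      # namespace of the target pod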

Active-Monitor provides a “HealthCheck” custom resource, which is a specially wrapped Argo Workflow, for the purpose of deep cluster monitoring.

While liveness and readiness probes can indicate the first-order health of a pod, they can’t determine the health of other critical cluster configuration. For instance, can new pods successfully make DNS lookups? Do they have access to the necessary role-based access controls (RBAC) or volumes?

Active-Monitor allows for the collection of custom metrics as well as auto-remediation steps to be carried out upon a “HealthCheck” failure.
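
A HealthCheck wraps an Argo Workflow and re-runs it on an interval. A minimal sketch of a DNS-lookup check, assuming an activemonitor.keikoproj.io/v1alpha1 API group (the exact embedding of the workflow may differ by version):

apiVersion: activemonitor.keikoproj.io/v1alpha1
kind: HealthCheck
metadata:
  name: dns-healthcheck
  namespace: health
spec:
  repeatAfterSec: 60              # re-run the check every minute (assumed field)
  workflow:
    generate:
      metadata:
        generateName: dns-healthcheck-
      spec:
        entrypoint: check-dns
        templates:
        - name: check-dns
          container:
            image: busybox
            command: [nslookup, kubernetes.default.svc.cluster.local]  # fails if in-cluster DNS is broken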

Minion-Manager allows for cost optimization by enabling the intelligent use of spot instances in Kubernetes clusters running on AWS.

This tool factors in on-demand prices, spot-instance prices, and the current state of AutoScaling groups to make its optimization decisions. It can yield significant savings, especially on development and test clusters, which often experience bursty usage patterns.
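
Minion-Manager is typically opted into per AutoScaling group. As a sketch, an ASG could be marked for spot management with a tag via the standard AWS CLI (the tag key and value shown are assumptions; check the minion-manager docs for the exact convention):

$ aws autoscaling create-or-update-tags --tags \
    ResourceId=my-scaling-group,ResourceType=auto-scaling-group,Key=k8s-minion-manager,Value=use-spot,PropagateAtLaunch=true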

If the goals of Keiko resonate with you, please take a closer look at the project on GitHub.
