Anatomy of a production Kubernetes cluster

Andy Hume
7 min read · Feb 10, 2020


At Brandwatch we’ve been running Kubernetes in production since October 2016. We started out on GKE, which gave us a huge early advantage in running and maintaining the core components of Kubernetes. It was where we proved that Kubernetes was a good fit for many of our existing services, and that we could effectively scale a diverse and growing set of microservices.

Three years down the road, the historical quirks and backgrounds of our engineering teams mean we are now running Kubernetes on AWS (EKS), on GCP (GKE), and in our own colocation data centres in London. We currently run six production clusters across these sites, installing and maintaining a standard stack of cluster-level services and tools on each one.

We’re now running over 150 production services across six clusters. This is a write-up of the basic cluster specification and operational features.

Versions and upgrades

Probably the biggest advantage of GKE and EKS is that they run and maintain the core services of Kubernetes for you. The control plane is a managed service, and upgrades to it (and to the worker nodes themselves, if you want) are a single click or command away. We typically upgrade to the latest minor release of Kubernetes once it reaches 1.x.2 or 1.x.3. This means a minor upgrade roughly once every three months, followed by patch upgrades on one or two occasions after that, or when critical vulnerabilities are announced.

Depending on the kind of services running in the cluster we usually carry out node upgrades by provisioning a new node pool with the upgraded version, cordoning off the old nodes, then moving services between them manually. Although both GKE and EKS can do rolling upgrades of nodes automatically, even with well configured Pod Disruption Budgets there’s an element of risk in setting off an automated upgrade and watching it roll to completion. We’ve found we’re more comfortable moving workloads over in a more controlled and manual way — particularly as we can’t always guarantee that a team’s configurations are as resilient to node draining as they could be.
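
For reference, the kind of Pod Disruption Budget that makes node draining reasonably safe looks something like the sketch below. Names and thresholds are illustrative, not a real service of ours.

```yaml
# A minimal Pod Disruption Budget sketch: during a drain, the eviction API
# will never take this service below two available replicas.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: example-api
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: example-api
```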

Manifests

We store all our stage and production manifests for all clusters in a single Git repository that we call Kuber. Having a single repo of manifests comes at the cost of keeping them separate from application source code, but it’s a trade-off that drastically simplifies the mechanics of how changes are rolled out. Having a single repo gives us a central hub around which we can build robust deployment pipelines and change control tooling for all our clusters.

Some application repos publish Helm charts alongside container images as part of their CI builds. In these cases the Kuber repo contains only a reference to the required Helm chart and the environment-specific values.

Deploying

Our deployments run as per-service jobs on Concourse CI. Changes to configurations are committed to Git, either manually via a PR, or by automated systems. We have an in-house artifact metadata and deployment service called Slipstream, which makes its own commits into the Git repo for things like updating container image digests. Concourse picks up these changes, authenticates to the appropriate cluster, and pushes all the configs for the changed service.

Each service defines a service configuration file (kuber.yaml) that contains a bunch of metadata about the service: its name, description, the team that owns it, links to metrics, logs, and runbooks, and a description of the commands required to deploy it. These commands are basically a dumb wrapper around kubectl, but the approach has served us surprisingly well and allows teams to do relatively custom things as part of their deployment pipelines.
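
To give a flavour, a kuber.yaml looks roughly like the sketch below. The field names and URLs here are illustrative rather than our exact schema.

```yaml
# Illustrative sketch of a kuber.yaml; field names are approximate, not the
# exact internal schema.
name: audience-api
description: Serves audience segment queries for the front-end
team: audiences
links:
  metrics: https://grafana.example.internal/d/audience-api
  logs: https://kibana.example.internal/app/kibana#/discover
  runbook: https://wiki.example.internal/runbooks/audience-api
deploy:
  # Essentially a dumb wrapper around kubectl, run by Concourse.
  - kubectl apply -f k8s/deployment.yaml
  - kubectl apply -f k8s/service.yaml
  - kubectl rollout status deployment/audience-api --timeout=120s
```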

We also support rollout of Helm templates through this configuration file (we don’t use Helm itself in production, only its chart/templating elements), as well as other supporting mechanisms such as creating imagePullSecrets and restarting pods when config or secrets have changed.
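
One common way to implement that kind of restart (not necessarily exactly what our tooling does) is the checksum-annotation pattern from Helm templating: hash the rendered config into the pod template so that any config change alters the pod spec and triggers a normal rolling update. A sketch, with an illustrative template path:

```yaml
# Deployment fragment: the annotation changes whenever the rendered ConfigMap
# changes, which forces a rolling update of the pods.
spec:
  template:
    metadata:
      annotations:
        checksum/config: '{{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}'
```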

The manifests for all of the standard cluster-level services we run are managed with Kustomize. The base configs for these do in fact live in a separate repository, which allows us to version them independently and then roll out upgrades to individual clusters through Kuber as necessary. Since Kustomize is now part of kubectl, no special tooling or additional steps need to be taken in deploy pipelines. A simple kubectl apply -k has us covered.
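
A typical overlay in Kuber for one of these cluster-level services looks something like the sketch below; the repository URL, ref, and paths are illustrative.

```yaml
# kustomization.yaml for a single cluster's overlay. The base comes from a
# separate, versioned repository so upgrades can be rolled out per cluster.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: system-monitoring
bases:
  - github.com/example-org/cluster-services//monitoring/base?ref=v1.4.0
patchesStrategicMerge:
  - prometheus-resources.yaml   # cluster-specific sizing
```

Rolling it out is then just `kubectl apply -k` against the overlay directory.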

Namespaces

All our user-facing applications run in the default namespace, and we haven’t got to the point where we’re scaling clusters for true multi-tenancy. At this point we’re more likely to spin up another cluster to provide that kind of isolation.

We do, however, run most of the standard cluster-level services in separate namespaces prefixed with system-, mostly to keep them out of developers’ way and to make it straightforward to rip out a set of services with kubectl delete ns if we need to.

Monitoring

We provision a standard Prometheus stack into every cluster. This includes Prometheus, Alertmanager, kube-state-metrics, and node-exporter, all installed via CoreOS’s Prometheus Operator. The Prometheus Operator has been incredibly valuable compared to our early attempts to run all of these independently via Helm charts; we’ve had to do very little configuration of any of these components to make them reliable and appropriately scaled. We run two Prometheus instances for high availability, and expose the metrics to our centralised Grafana installations hosted on-premises in our data centre.
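
The operator turns a small custom resource into a fully configured Prometheus deployment; something like the sketch below, where `replicas: 2` gives the HA pair mentioned above. Sizing, selectors, and namespace are illustrative.

```yaml
# Sketch of the Prometheus custom resource the operator reconciles.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: system-monitoring
spec:
  replicas: 2                    # two instances for high availability
  retention: 24h
  serviceAccountName: prometheus
  serviceMonitorSelector: {}     # pick up every ServiceMonitor in the cluster
  resources:
    requests:
      cpu: "1"
      memory: 4Gi
```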

Logging

As with metrics, we consider it a requirement to have monitoring/logging and observability data available to view and query from a single central place. It doesn’t really work if during an incident engineers are trying to remember which tool they need to access metrics or logs for a particular service, or if they can’t easily line up metrics from different services alongside each other.

Until recently on GKE we were using the legacy logging add-on with fluentd shipping logs to Google Stackdriver. We have log sinks for a subset of these which export them via PubSub into the Elastic Stack (ELK) services in our data centre.

However, as we’ve begun running clusters on different platforms we wanted a standard way of shipping logs that worked across all our clusters. We now run a standard filebeat daemonset which ships logs directly from the cluster to Logstash in our data centres.
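
The interesting part of the Filebeat DaemonSet is the filebeat.yml it mounts. A sketch of its shape, with illustrative hostnames:

```yaml
# Sketch of the filebeat.yml mounted into the DaemonSet.
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
processors:
  - add_kubernetes_metadata:       # enrich each event with pod/namespace labels
      host: ${NODE_NAME}
      matchers:
        - logs_path:
            logs_path: /var/log/containers/
output.logstash:
  hosts: ["logstash.example.internal:5044"]
```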

Ingress

We run a standard deployment of the nginx-ingress-controller, with a consistent configuration across all clusters. It applies a standard set of response headers (including security headers such as `X-XSS-Protection`, `X-Content-Type-Options`), strips some internal-only headers from responses, and sets up some other proxy settings that our services tend to require, though these can be overridden on a per-service basis.

It also handles SSL redirects, and intercepts errors from backends to return standard responses for 404s and some other error types. The nginx-ingress-controller Pods run in the system-ingress namespace.
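
With ingress-nginx, the cluster-wide behaviour above largely boils down to the controller’s ConfigMap. A sketch of the relevant parts, with illustrative names and values:

```yaml
# The response headers applied cluster-wide live in their own ConfigMap...
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-response-headers
  namespace: system-ingress
data:
  X-XSS-Protection: "1; mode=block"
  X-Content-Type-Options: nosniff
---
# ...which the controller's main ConfigMap points at via add-headers.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-ingress-controller
  namespace: system-ingress
data:
  add-headers: system-ingress/custom-response-headers
  ssl-redirect: "true"
```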

We also find it convenient to use the nginx-ingress-controller as a proxy in front of Google Cloud Storage buckets. A number of our applications are single-page web apps, and serving them from object storage is simple, resilient, and cheap.
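
One way to wire this up is an ExternalName Service pointing at the GCS endpoint, plus a couple of annotations to rewrite the upstream host. Bucket and hostnames below are illustrative.

```yaml
# Sketch: proxy a GCS bucket through ingress-nginx.
apiVersion: v1
kind: Service
metadata:
  name: gcs-webapp
spec:
  type: ExternalName
  externalName: storage.googleapis.com
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: webapp
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    nginx.ingress.kubernetes.io/upstream-vhost: example-webapp.storage.googleapis.com
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: gcs-webapp
              servicePort: 443
```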

Secrets

HashiCorp Vault is a popular tool for managing cryptography and storing secret information. We use Vault in various ways, including for public key infrastructure, one-time password generation, and storage of runtime application secrets.

Our clusters request secrets from Vault using the provided Kubernetes auth mechanism, which allows applications to request secrets from a cluster-specific path in Vault. These are typically requested directly by applications as they start up, or by an init container that shares a mounted volume with the main application container.
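
The init-container variant looks roughly like the sketch below: it logs in with the pod’s service account token, writes the secret to a shared in-memory volume, and exits before the application container starts. The Vault role, paths, hostnames, and image tag are illustrative.

```yaml
# Sketch of the init-container pattern for pulling secrets from Vault.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  serviceAccountName: example-app
  volumes:
    - name: secrets
      emptyDir:
        medium: Memory            # keep rendered secrets off the node's disk
  initContainers:
    - name: fetch-secrets
      image: vault:1.3.2
      env:
        - name: VAULT_ADDR
          value: https://vault.example.internal:8200
      command: ["/bin/sh", "-c"]
      args:
        - |
          # Log in with the pod's service account JWT, then pull the secret
          # from the cluster-specific path and write it to the shared volume.
          export VAULT_TOKEN=$(vault write -field=token auth/kubernetes/login \
            role=example-app \
            jwt=@/var/run/secrets/kubernetes.io/serviceaccount/token)
          vault kv get -field=app-config secret/clusters/gke-prod-1/example-app \
            > /secrets/app-config
      volumeMounts:
        - name: secrets
          mountPath: /secrets
  containers:
    - name: app
      image: registry.example.internal/example-app:1.0.0
      volumeMounts:
        - name: secrets
          mountPath: /secrets
          readOnly: true
```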

Developers are able to easily create new secrets in Vault, but not able to update or read them back out. Once stored, only the appropriate Kubernetes clusters can read application secrets.

TLS

We use JetStack’s cert-manager to issue TLS certificates from Let’s Encrypt for publicly exposed endpoints. Developers define Certificate manifests alongside their application manifests and the certificates/keys are made available to them in just a few seconds, via either HTTP or DNS challenges.
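
A Certificate manifest for a public endpoint is only a few lines. The apiVersion depends on the cert-manager release in the cluster, and the issuer name and hostname here are illustrative.

```yaml
# Sketch of a Certificate for a publicly exposed endpoint.
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
  name: webapp-tls
spec:
  secretName: webapp-tls          # cert-manager writes the key/cert here
  dnsNames:
    - app.example.com
  issuerRef:
    name: letsencrypt-production
    kind: ClusterIssuer
```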

cert-manager can also connect to Vault to request certificates from our internal certificate authority. Few applications do this currently, but on the odd occasion we require this facility it’s very simple to make these certificates and keys available.

Service Mesh

We don’t have any end-to-end service mesh infrastructure in our clusters currently, although we do run Linkerd in front of some high concurrency/throughput services where resiliency and low latencies are particularly important. Linkerd’s P2C load-balancing helps us get significantly better utilisation from underlying compute resources for services that have sharp rises and falls in requests.

We also use Linkerd in stage clusters for per-request routing to test builds. Teams build versions of their application from PRs and then deploy these into test namespaces. Linkerd can then route certain requests through these services allowing both humans and automated test runners to test new code while making use of the existing stable services in the stage cluster.

We do plan to add service-mesh-style proxies at some point, for resiliency and observability reasons, but it hasn’t been a priority so far and most of our clusters aren’t dealing with the kind of scale where the benefits really come into play.

Cluster auto-scaling

All our cloud-based clusters make use of auto-scaling node-pools/groups.

This is primarily useful for ensuring that clusters are optimally sized for the workloads our teams schedule onto them, and that we don’t end up in a situation where the cluster has “run out” of capacity for new or scaling services.

The Cluster Autoscaler works less well when responding to a Horizontal Pod Autoscaler scaling individual workloads up, mostly because of the time it takes to boot a new worker and add it to the cluster. Most services that autoscale based on metrics can tolerate that lag by queueing up work, or by cancelling and retrying slow requests. Having said that, we often pre-scale workloads when we know a spike is coming.
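
The workload autoscaling here is the standard HorizontalPodAutoscaler, and one simple way to pre-scale ahead of a known spike is to bump minReplicas in advance. A sketch, with illustrative names and numbers:

```yaml
# Sketch of an HPA scaling on CPU utilisation.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: example-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-worker
  minReplicas: 4                  # raise ahead of a known spike to pre-scale
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```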

Preemptible & Spot Instance node pools

In GKE the majority of our worker nodes are preemptible VM instances. These cost about 30% of the full instance price but can be terminated at any time (and at least once every 24 hours). Our stateless workloads are well suited to running on this kind of machine, and we haven’t suffered any resiliency or availability problems with the VM instances themselves.
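
One way to steer stateless workloads toward the preemptible pool is a node-affinity preference on the node label GKE applies to these instances. A sketch of the relevant pod-spec fragment:

```yaml
# Pod-spec fragment: prefer (but don't require) preemptible nodes. GKE labels
# these nodes with cloud.google.com/gke-preemptible: "true".
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: cloud.google.com/gke-preemptible
              operator: In
              values: ["true"]
```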

