How We Manage Kubernetes Clusters at Mintel

Nick Badger
Published in Mintel Tech Blog
Mar 4, 2020 · 5 min read

We have been running Kubernetes for a while now and have recently made some improvements in how we manage our clusters.

When we first adopted Kubernetes, it became clear that we would spend a large portion of our time just trying to keep up — the release cycle was getting ever shorter, and we felt a pressure to track the latest release.

Rightly or wrongly, it seemed we had at least one person assigned to “cluster upgrades” full-time, which obviously raised some eyebrows!

At this point, the obvious thing for us was to move to a managed solution — so we did just that, and ended up on Google Kubernetes Engine (GKE).

This move alone went some way towards simplifying our infrastructure. We generally saw a reduction in cluster-creation code (Terraform), but understanding how upgrades happen and reading release notes is still very much a part of running GKE.

GKE makes it easy to spin up multiple clusters in different environments, but getting our manifests shipped to those clusters was still a push task — our clusters were essentially still Pets, and the operational tasks we wanted to free ourselves from were still there.

Our GitOps Adoption

Software development boils down to something pretty simple in my view:

gather-requirements, write-code, commit, test, ship, repeat!

What does this mean when we apply it to Kubernetes clusters? Well, together with a few of the tools out there (flux-cd/argo-cd), it creates a pretty powerful deployment strategy.

The goal is to ship value in less time. As an SRE team, we want to be focusing on adding useful functionality to our infrastructure — improving areas such as observability, security, safer deployments (and rollbacks), and simplifying processes.

By not focusing on how to get these changes to our clusters, we free up time to push forward in these other areas.

How our clusters are implemented

We deploy our core clusters on GKE via Terraform & Terragrunt.

We run a couple of flux deployments per cluster: one is geared towards the initial bootstrap of the cluster (namespaces, resource-quotas, security policies) and the other focuses on application deployments.
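
As a rough sketch, this is how two Flux (v1) instances can each track a different path in the same Git repo via the Flux Helm chart’s values; the repo URL, paths, and poll interval below are illustrative, not our actual configuration:

    # values for the "bootstrap" Flux release: namespaces, resource-quotas, security policies
    git:
      url: git@gitlab.example.com:sre/k8s-manifests.git
      branch: master
      path: core-cluster/production
      pollInterval: 1m
    ---
    # values for the "apps" Flux release: application deployments
    git:
      url: git@gitlab.example.com:sre/k8s-manifests.git
      branch: master
      path: apps/production
      pollInterval: 1m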

We split our workloads into two areas, one called “Core Cluster” and the other “Apps”, currently all within a single Git mono-repo.

The “Core Cluster” workload represents functionality that runs on all clusters, even different environments. This would include workloads to deal with Authentication, Monitoring, Log collection, and Secrets management.

The “Apps” workload represents our internal and external (client-facing) applications: essentially the services written by our development teams, although the SRE team typically owns the manifests.
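
A simplified, hypothetical view of how a mono-repo split along those lines might be laid out (the directory names are illustrative):

    k8s-manifests/
    ├── core-cluster/        # runs on every cluster: auth, monitoring, log collection, secrets
    │   ├── namespaces/
    │   ├── resource-quotas/
    │   └── monitoring/
    └── apps/                # internal and client-facing services owned by development teams
        ├── service-one/
        └── service-two/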

Continuous Integration

We have two main CI pipelines that we work with — mostly running on GitLab CI, but the underlying logic is fairly agnostic. Keep in mind all tests run as part of Git PRs (GitOps!).

Changes to Kubernetes manifests require a PR to promote changes to our different environments. Each commit runs a series of tests as well.

Schema validation is achieved using kubecfg and Open Policy Agent (OPA). Running OPA checks with conftest is an easy way to get started, and it integrates nicely into CI. OPA tests let you validate anything in a manifest — for example, we use them to validate image registries, readiness/liveness probes, and apiVersions.
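
A minimal GitLab CI job along these lines might look like the following sketch; the image, paths, and policy directory are illustrative rather than our actual pipeline:

    validate-manifests:
      stage: test
      image: registry.example.com/ci-tools:latest   # hypothetical image with kubecfg and conftest installed
      script:
        # validate rendered manifests against the Kubernetes API schema
        - kubecfg validate manifests/core-cluster.jsonnet
        # run OPA policies (image registries, probes, apiVersions, ...) via conftest
        - conftest test rendered/ --policy policy/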

In addition, we create golden tests, which validate that the generated output matches the expected output — important when you render your manifests with tools such as kustomize or jsonnet, and a useful way to check whether upgrading those tools affects your manifests.
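
One simple way to implement a golden test, assuming manifests are rendered with kustomize and the expected output is committed under a golden/ directory (both assumptions are for illustration only):

    golden-test:
      stage: test
      script:
        # re-render the manifests and fail the job if the output drifts from the committed golden copy
        - kustomize build overlays/production > /tmp/rendered.yaml
        - diff -u golden/production.yaml /tmp/rendered.yaml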

We then deploy our “Core Cluster” manifests to KinD via CI and validate that the pods actually start and the cluster is healthy. In addition, we use Popeye to check for any issues in the cluster, which also scores our cluster health.
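
That stage could look roughly like this in CI; the cluster name, manifest path, and timeout are illustrative:

    kind-smoke-test:
      stage: test
      script:
        - kind create cluster --name ci-core
        # apply the "Core Cluster" manifests and wait for every pod to become Ready
        - kubectl apply -k core-cluster/
        - kubectl wait --for=condition=Ready pods --all --all-namespaces --timeout=300s
        # lint the live cluster and report a health score
        - popeye
      after_script:
        - kind delete cluster --name ci-core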

When a PR is accepted into an environment, flux will sync the manifests to the cluster, and Kubernetes will do its thing (roll updates etc).

Application code, the code that our developers own, also uses a shared CI pipeline which our team maintains.

We try to standardize on CI stages such as build, test, report, and publish. Within each stage, we inject jobs such as container scanning (trivy), Dockerfile linting (hadolint), and more recently Static Application Security Testing (SAST).
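
A stripped-down sketch of the kind of jobs that pipeline injects; the images and variables are illustrative (and assume a recent trivy with the “image” subcommand):

    stages: [build, test, report, publish]

    lint-dockerfile:
      stage: test
      image:
        name: hadolint/hadolint:latest-debian
        entrypoint: [""]
      script:
        - hadolint Dockerfile                                  # Dockerfile linting

    scan-image:
      stage: report
      image:
        name: aquasec/trivy:latest
        entrypoint: [""]
      script:
        - trivy image "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"      # container vulnerability scanning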

Secrets and GitOps

Bitnami’s Sealed Secrets is an operator that allows one-way encrypted secrets. There are similar tools out there, but essentially you end up with a Kubernetes SealedSecret resource that contains your encrypted data. You commit it to Git, and only the controller running in the cluster can decrypt it, at which point it creates (and manages) a regular Kubernetes Secret resource.
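
For illustration, sealing a secret looks roughly like this; the names, namespace, and ciphertext are made up:

    # kubeseal --format yaml < db-secret.yaml > db-sealedsecret.yaml
    apiVersion: bitnami.com/v1alpha1
    kind: SealedSecret
    metadata:
      name: db-credentials
      namespace: apps
    spec:
      encryptedData:
        password: AgB4xm...   # long ciphertext blob, truncated here; safe to commit to Git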

There are a couple of downsides here:

  • Your pull-requests now contain a lot of binary data, which is painful :)
  • You still need to commit the source of the secret, otherwise how would you ever re-generate the sealed-secret? To achieve this we gpg-encrypt the original source-secret as well (but it’s for humans only).

We did find that we ended up with a number of wrapper scripts around this process, which in turn made it more painful to manage.

HashiCorp Vault is another option that we’ve recently adopted. We have nearly migrated away from sealed-secrets, and instead run Vault in Kubernetes using the bank-vaults operator.

If you have the experience in your team to manage Vault, then it really is worth it. It’s not just about managing secrets — it can do so much more.

Running Vault is probably not something you should do unless you research and understand some fundamentals first. As a starting point, read up on the concepts of Vault sealing and unsealing.
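
With bank-vaults, the operator takes care of init and unseal for you, storing the unseal keys according to the unsealConfig you choose. A heavily trimmed (and not production-ready) Vault custom resource might look like this; the image versions and namespace are illustrative:

    apiVersion: vault.banzaicloud.com/v1alpha1
    kind: Vault
    metadata:
      name: vault
    spec:
      size: 1
      image: vault:1.3.1
      bankVaultsImage: banzaicloud/bank-vaults:latest
      # the operator stores the unseal keys and root token as a Kubernetes Secret in this namespace
      unsealConfig:
        kubernetes:
          secretNamespace: vault
      # raw Vault server configuration
      config:
        storage:
          file:
            path: /vault/file
        listener:
          tcp:
            address: 0.0.0.0:8200
            tls_disable: true
        ui: true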

A couple of benefits we get from Vault are the ability to self-service secrets (developer teams can do this themselves), as well as securely injecting secrets into applications at runtime.
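
Runtime injection with the bank-vaults mutating webhook then looks roughly like this on the consuming workload; the annotation values, Vault path, and image are illustrative:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
          annotations:
            vault.security.banzaicloud.io/vault-addr: "https://vault.vault:8200"
            vault.security.banzaicloud.io/vault-role: "my-app"
        spec:
          containers:
            - name: my-app
              image: registry.example.com/my-app:1.0.0
              env:
                # the webhook resolves this placeholder and injects the real value at container start
                - name: DB_PASSWORD
                  value: "vault:secret/data/my-app/db#password"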

I would recommend reading these blog posts by HashiCorp and Banzai Cloud.

To wrap up, what’s next for us?

In general, it’s about making our infrastructure easier to manage by investing time in the right tools.

Recently we’ve been focusing on improving our Grafana dashboards by moving them to grafonnet-lib — this in turn makes it easier to build and re-use widgets across projects, as well as speeding up upgrades when underlying metrics change.

On the horizon, we hope to look into Google’s Config Connector to simplify resource creation and management. We’re also exploring progressive application delivery (perhaps via Flagger) to handle rollouts and rollbacks using monitoring and canary-style deployments.
