The SafetyCulture journey to Kubernetes — Part 2

Tim Curtin
SafetyCulture Engineering
6 min read · Dec 21, 2018

Welcome to part 2 of the SafetyCulture journey to Kubernetes.

In this part we will take a deep dive into three aspects of our current implementation of Kubernetes on Amazon EKS:

  • Rollout of EKS and our experience to date (including our service deployment pattern)
  • Working with Helm, Helmsman, and open source charts
  • Near-term roadmap (canary and service mesh)

The SafetyCulture rollout of EKS

One of the first decisions we needed to make was how we would deploy our cluster. The natural choice for us was Terraform:

  • We were already heavily invested in Terraform throughout engineering, so it is the most logical place to define the Kubernetes-related VPC, subnets, security groups, autoscaling groups, Route 53 records, and EKS clusters.
  • We use Terraform for service deployments (to ECS, though this will largely be replaced by Helm charts), for datastore infrastructure, and for critical subnet and VPC definitions, in addition to many other use cases (Consul, PagerDuty, etc.)
  • By reusing CI tooling and existing Terraform experience, we can rapidly create new VPCs and subnets, peer these to our existing VPCs using AWS VPC Peering, and utilise the available EKS Terraform resources, as sketched below.
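To give a sense of scale, the core of a cluster definition is only a handful of Terraform resources. A minimal sketch (resource names, roles, and VPC references are illustrative placeholders, not our actual configuration):

```hcl
# Sketch of an EKS cluster plus peering back to an existing VPC.
# All names and references here are placeholders.
resource "aws_eks_cluster" "main" {
  name     = "sc-kubernetes"
  role_arn = aws_iam_role.eks_cluster.arn

  vpc_config {
    subnet_ids         = aws_subnet.eks.*.id
    security_group_ids = [aws_security_group.eks_cluster.id]
  }
}

# Peer the new Kubernetes VPC to an existing VPC so services can
# reach datastores and other infrastructure during the migration.
resource "aws_vpc_peering_connection" "eks_to_core" {
  vpc_id      = aws_vpc.eks.id
  peer_vpc_id = aws_vpc.core.id
  auto_accept = true
}
```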

Service deployments using Helm and Helm Charts

Early in the decision process, we also had to determine how we would deploy our services to Kubernetes. Primarily, we wanted to solve two major issues:

  1. Remove the burden on product teams of creating, updating, and keeping Terraform aligned for services, security groups, and load balancers
  2. Give teams an easy-to-use deployment pattern that provides features out of the box (e.g., standardised port definitions, health checks, DNS records, load balancer creation, and more)

Helm is the natural choice for this, with templating available to Helm charts through Go templating (soon to be superseded by Lua in Helm 3). Helm charts give us the flexibility we require, and they have the added benefit of replacing a ton of duplicated (and sometimes insecure) Terraform code that exists throughout our 150+ microservices 🎉

We settled on defining our Helm charts based on the type of health check a service exposes (or the lack thereof). By doing this, we can consolidate to three centralised patterns for service deployments (the probe differences are sketched after the list):

  • http: standard service chart for those services with a REST API
  • grpc: provides catered gRPC health checks for services that require them
  • noservice: services with no API endpoints, fitting a simple Kafka consumer model (and some internal tools)
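The essential difference between the three charts is the probe configuration each one templates into a service's Deployment. A rough sketch of the fragments involved (ports and paths are assumptions, not our real defaults):

```yaml
# http chart: a plain HTTP readiness probe against the REST API.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
---
# grpc chart: an exec probe using the grpc_health_probe binary, the
# common pattern for gRPC health checking on Kubernetes today.
readinessProbe:
  exec:
    command: ["/bin/grpc_health_probe", "-addr=:9090"]
```

The noservice chart, by contrast, simply leaves these stanzas out.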

Once a service chart is chosen and the required modifications are made to the service pipeline, service deployment can proceed.

Figure: SafetyCulture Helm deployment model with Buildkite
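With the chart doing the heavy lifting, the deploy step in a service's pipeline reduces to a single helm upgrade. A hypothetical Buildkite step (the chart repository, service name, namespace, and values file are assumptions):

```yaml
steps:
  - label: ":kubernetes: deploy to EKS"
    command: |
      helm upgrade --install my-service safetyculture/http \
        --namespace my-team \
        --set image.tag="${BUILDKITE_COMMIT}" \
        --values .buildkite/values.yaml
```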

Chart setup

These three charts are set up in their own CI pipelines, and we utilise ChartMuseum to publish them to an S3 bucket.

With a bit of shell scripting over the top, we had a working chart pipeline that we have continued to iterate upon as we march toward a production rollout.
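The "bit of shell scripting" is genuinely small. A sketch of what a chart pipeline runs on each change (the ChartMuseum endpoint is a placeholder):

```sh
#!/usr/bin/env sh
set -eu

CHART=http
VERSION="$(awk '/^version:/ {print $2}' "charts/${CHART}/Chart.yaml")"

# Lint and package the chart, then POST the tarball to ChartMuseum's
# upload API; ChartMuseum persists it to its S3 storage backend.
helm lint "charts/${CHART}"
helm package "charts/${CHART}" --destination dist
curl --fail --data-binary "@dist/${CHART}-${VERSION}.tgz" \
  https://charts.example.internal/api/charts
```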

Kubernetes and EKS Platform Versions

EKS now supports Kubernetes 1.10.11 by default (i.e., eks.3) and has recently provided the capability to self-upgrade to 1.11.5 (eks.1).

Platform versioning was initially slow going, but the pace has increased following some recent events in the upstream open-source Kubernetes codebase (e.g., the API server vulnerability CVE-2018-1002105: https://github.com/kubernetes/kubernetes/issues/71411).

  • Platform versioning is the mechanism by which EKS delivers patch-level Kubernetes upgrades within a given minor version (e.g., Kubernetes 1.10.11 corresponds to eks.3)
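Both values are visible on the cluster itself, for example via the AWS CLI (the cluster name here is a placeholder):

```sh
aws eks describe-cluster --name sc-kubernetes \
  --query 'cluster.{kubernetes: version, platform: platformVersion}'
# {
#     "kubernetes": "1.10",
#     "platform": "eks.3"
# }
```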

Working with Helm and Helmsman

We work with Helm for our own internal charts, and that fits our needs, but how do we approach public charts (e.g., Nginx Ingress Controller, Prometheus, Grafana)?

This post will not dive too deeply into the inner workings of Helmsman, but our experience so far has been extremely positive. Shoutout to the team currently responsible for its development.

  • Written in Go (our preferred language at SafetyCulture Engineering)
  • Provides the concept of a ‘state file’ (quite similar to Terraform in this respect), and supports ‘merging’ state files so that multiple states can be defined (see the sketch after this list)
  • Worked really well with our Helm setup mentioned previously; only some minor modifications were required (why reinvent the wheel?)
  • Allowed us to utilise native Helm features (e.g., value merging using multiple value overrides)
  • Works well with any charts (open source, private, etc.)
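As referenced above, everything Helmsman needs is captured in a declarative desired state file. A minimal sketch (the context, chart version, and values path are assumptions):

```toml
[settings]
kubeContext = "sc-kubernetes"

[namespaces]
  [namespaces.monitoring]

[apps]
  [apps.prometheus]
    namespace = "monitoring"
    enabled = true
    chart = "stable/prometheus"
    version = "7.0.0"
    valuesFile = "values/prometheus.yaml"
```

Multiple state files can then be merged at apply time, e.g. helmsman -f base.toml -f cluster-overrides.toml -apply.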

Our current setup, which retrieves all open source (stable) charts, is as follows:

Figure: SafetyCulture Helm pipelines with Helmsman

Did we try any other options for Helm chart deployments?

Yes, quite a few of them.

We dived into Flux, Jenkins-X, Kubecfg and Ksonnet, all of which are very capable tools but did not fit our current requirements.

In the case of the first two, we would have needed to deploy yet more pieces into the cluster, increasing the complexity and moving parts of our deployment pipelines. In the case of the latter two, we would have had to learn (and teach others) an entirely new language (jsonnet).

Helmsman is declarative in its approach (.yaml and .toml state files) and does not require any additional cluster installation. 🏆

Open source helm charts

As a team, we decided early on that, wherever possible, we would utilise open source Helm charts.

In order to provide stability to our clusters, we only deploy charts from the stable track unless given a valid reason to install from the incubator track (we have not come across a reason for exemption to date).

Most of the available charts are highly customisable via value overrides, and for those that are not, we have contributed improvements back upstream (❤ open source).
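For most charts, that customisation is nothing more than a values file layered over the chart defaults. For example, a hypothetical override for stable/nginx-ingress (keys follow that chart's documented values; the settings shown are illustrative):

```yaml
# Keep the AWS load balancer internal and run multiple controller
# replicas; everything else falls through to the chart defaults.
controller:
  replicaCount: 3
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0
```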

We have also contributed a missing chart back to the stable track ourselves, and will continue to maintain and improve it as needed.

Near-term roadmap

Next on the list for us is a complete production rollout to all services. Whilst we have had a small number of services operating in production since October 2018, we want to increase this over the next 3–6 months.

Huge thanks to Shaun Mansell, Lachlan Cooper and all other SafetyCulture engineers who have helped get the implementation to the point that it is today.

A note on Service Mesh

But what about a Service Mesh?

To date, we have not deployed any type of service mesh to our cluster. With ingress controller records and, where required, direct use of internal Kubernetes service DNS records, we have not yet had the need, despite our ever-increasing number of microservices.

This is not to say we will never reach a point where a service mesh becomes a worthwhile addition for us; the space is constantly expanding.

With the constant rate of change occurring in the service mesh space, we are currently in a ‘wait and see’ pattern on how introducing a service mesh would benefit us beyond the service discovery and ingress pattern that is currently deployed.

Are you an engineer looking for your next challenge who loves working with the latest technologies? Check out what it’s like to work at SafetyCulture.
