The SafetyCulture journey to Kubernetes

Tim Curtin
SafetyCulture Engineering
2 min read · Aug 14, 2018

Welcome to the first in a multi-part series about the SafetyCulture journey to using Kubernetes (on Amazon EKS) as our microservice orchestrator.

Where we are today

We make heavy use of AWS ECS today (both running our own hosts, as well as dabbling with Fargate tasks in some areas where it makes sense).

We run a varied collection of microservices, including:

  • Node.js,
  • Golang,
  • Python, and
  • Java (for our data analytics)

At last count, we have approximately 300 containers running in production across 120+ microservices.

Why move

Our history with Kubernetes so far has mostly been dipping our toe in the water for local development (Minikube, a couple of kops clusters for testing, etc.), with no real push to migrate our production microservices footprint.

Moving to Kubernetes presents quite a few challenges during the initial phase:

  1. How exactly do we move to Kubernetes (K8s) without impacting production workloads? (traffic migration via weighted DNS? a scheduled outage?)
  2. What decisions do we need to make early on to set ourselves up for success? (logging, monitoring, autoscaling, deployments via Helm or kubectl)
  3. What does the roadmap look like for all the awesome developer flexibility K8s can provide? (on-demand environments, canary deployments out of the box, service mesh)
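To make the weighted-DNS option in question 1 concrete: with Route 53, traffic can be shifted gradually by giving the old and new endpoints the same record name with different weights. A minimal sketch of a change batch (hostnames, weights, and the `example.com` zone are all hypothetical, not our actual setup):

```json
{
  "Comment": "Shift 10% of traffic from the ECS load balancer to the EKS one",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "CNAME",
        "SetIdentifier": "ecs-legacy",
        "Weight": 90,
        "TTL": 60,
        "ResourceRecords": [{ "Value": "ecs-lb.example.com" }]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "CNAME",
        "SetIdentifier": "eks-new",
        "Weight": 10,
        "TTL": 60,
        "ResourceRecords": [{ "Value": "eks-lb.example.com" }]
      }
    }
  ]
}
```

Applied via `aws route53 change-resource-record-sets`, ratcheting the weights over time lets us roll traffic over (and back) without a scheduled outage.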

The Vision

Architecturally we need to plan how we incorporate the existing VPCs with our persistent data layers into the migration.

This will continue to evolve as we focus more on how to maintain a reliable orchestration platform for our services.

Initial architecture diagram

EKS

Starting with EKS gives us the benefit of not having to deploy and maintain the Kubernetes control plane ourselves while we are still establishing the patterns for how Kubernetes will be used in SafetyCulture engineering.
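For a sense of what a migrated service looks like on the cluster, here is a minimal Deployment manifest sketch. The service name, image, port, and resource figures are illustrative assumptions, not our actual configuration:

```yaml
# Hypothetical manifest for one of the Node.js microservices mentioned above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inspections-api        # assumed service name
spec:
  replicas: 3                  # illustrative; real counts come from autoscaling
  selector:
    matchLabels:
      app: inspections-api
  template:
    metadata:
      labels:
        app: inspections-api
    spec:
      containers:
        - name: inspections-api
          image: example.ecr.aws/inspections-api:1.0.0   # assumed registry/tag
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
```

On EKS this manifest is applied against the managed control plane exactly as it would be on any other Kubernetes cluster (`kubectl apply -f deployment.yaml`), which is part of the appeal: the patterns we establish now are portable.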

Coming up in the next blog posts in this series

  • How we used Helm within our Buildkite CI environment to simplify our deployments down to a single downstream pipeline (from one per service), and how this helped us cut deployment times to under 50 seconds
  • The power of Prometheus and Grafana to centralise our monitoring and observability
  • Service mesh, and where that road will lead us (particularly with gRPC streaming from mobile clients through to the backend services)

Are you an infrastructure engineer looking for your next challenge, who loves working with the latest technologies? Check out what it’s like to work at SafetyCulture.
