The SafetyCulture journey to Kubernetes

Tim Curtin
Aug 14, 2018 · 2 min read

Welcome to the first in a multi-part series about the SafetyCulture journey to using Kubernetes (on Amazon EKS) as our microservice orchestrator.

Where we are today

We make heavy use of AWS ECS today, both running our own hosts and dabbling with Fargate tasks in the areas where it makes sense.

We run a varied collection of microservices including:

  • Node.js,
  • Golang,
  • Python, and
  • Java (for our data analytics)

At last count, we have approximately 300 containers running in production across 120+ microservices.

Why move

Our history with Kubernetes so far has mostly been dipping our toes in the water for local development (Minikube, a couple of kops clusters for testing, etc.), with no real push to migrate our production microservices footprint.

Moving to Kubernetes presents quite a few challenges during the initial phase:

  1. How exactly do we move to Kubernetes (K8s) without impacting production workloads (traffic migration via weighted DNS? a scheduled outage?)
  2. What decisions do we need to make early on to set ourselves up for success (logging, monitoring, autoscaling, deployments with Helm or kubectl)?
  3. What does the roadmap look like for all the awesome developer flexibility K8s can provide (on-demand environments, canary deployments out of the box, service mesh)?
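
To make the weighted DNS option in the first question concrete, here is a minimal sketch of how a staged traffic shift could be expressed. It only builds a change batch in the shape that Route 53 weighted record sets use (a `SetIdentifier` plus a `Weight` from 0 to 255) and does not call AWS. The record name, the two target hostnames, and the helper names `split_weights` and `change_batch` are hypothetical, and nothing here is a statement of the approach we ended up taking.

```python
from typing import Dict

MAX_WEIGHT = 255  # Route 53 weights are integers from 0 to 255


def split_weights(k8s_percent: int) -> Dict[str, int]:
    """Split the weight range so roughly k8s_percent of DNS answers point at K8s."""
    if not 0 <= k8s_percent <= 100:
        raise ValueError("k8s_percent must be between 0 and 100")
    k8s = MAX_WEIGHT * k8s_percent // 100  # integer division keeps the result exact
    return {"ecs": MAX_WEIGHT - k8s, "k8s": k8s}


def change_batch(record_name: str, targets: Dict[str, str], k8s_percent: int) -> dict:
    """Build a weighted-record change batch for one step of the migration.

    targets maps the backend label ("ecs" or "k8s") to its hostname.
    """
    changes = []
    for backend, weight in split_weights(k8s_percent).items():
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": backend,  # distinguishes records that share a name
                "Weight": weight,
                "TTL": 60,  # a short TTL so each step takes effect quickly
                "ResourceRecords": [{"Value": targets[backend]}],
            },
        })
    return {"Comment": f"shift {k8s_percent}% of traffic to k8s", "Changes": changes}


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    targets = {"ecs": "ecs-lb.example.com", "k8s": "k8s-lb.example.com"}
    for step in (0, 10, 50, 100):
        batch = change_batch("api.example.com", targets, step)
        print(step, [c["ResourceRecordSet"]["Weight"] for c in batch["Changes"]])
```

Because resolvers cache answers, the split that clients actually see is only approximately the configured percentage, which is one reason to move in small steps and watch error rates between them.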

The Vision

Architecturally, we need to plan how we incorporate the existing VPCs, along with our persistent data layers, into the migration.

This will continue to evolve as we focus more on how to maintain a reliable orchestration platform for our services.


Starting with EKS spares us from having to deploy and maintain the Kubernetes control plane while we are still establishing the patterns for how it will be used in SafetyCulture engineering.

Coming up in the next blog posts in this series

  • How we used Helm within our Buildkite CI environment to simplify our deployments to a single downstream pipeline (down from one per service), and how this helped us cut deployment times to under 50 seconds
  • The power of Prometheus and Grafana to centralise our monitoring and observability
  • Service mesh, and where that road will lead us (particularly with gRPC streaming from mobile clients through to the backend services)

Are you an Infra Engineer who loves working with the latest technologies and is looking for your next challenge? Check out what it’s like to work at SafetyCulture.

SafetyCulture Engineering

Building something that truly impacts people's lives
