How we DevOps @ Onfido

Our team, technology stack, recent projects and challenges

Harvey Johal
Onfido Product and Tech
6 min read · Mar 8, 2018


I’m Harvey and I lead our DevOps team. Our team ensures that our engineering teams can deploy and run their services on a platform that’s stable, reliable, secure and scalable.

We’re tight-knit with our security team so we’re pretty much a DevSecOps team — and that’s what makes working in this team unique. Richard, Director of Security, and Pawel, Security Engineer, ensure we implement best practices in everything we do — they’re involved from the genesis of a new implementation to the deployment. They help us think deeply about threat models and our security posture.

Thinking deeply about (pool-related) threat models and security postures. Also we all randomly chose to wear these jumpers on the same day. No joke.

What does our production service look like?

Here’s a high-level view of the platform we run for Onfido’s production service:

Glorious boxes

Services developed by our engineering teams are deployed with Jenkins onto a foundation built on AWS and Kubernetes, provisioned and configured via Terraform and Ansible, and monitored with Datadog, Sentry and the ELK stack.

What problems are we currently solving?

Our work falls into several larger themes and, at any one time, we’re working on different projects across them.

Observability

It’s vital to know what’s happening with our infrastructure and applications, and to have rich data for diagnosing any issues. As our infrastructure and product grow, we need a deep understanding of the live system.

Beautiful traces (credit Datadog)

We work hard to design for resiliency and redundancy, but ultimately, software can be fragile. It’s our job to make sure we’ve got the data to put things right rapidly if (and when!) they do go wrong.

We use a lot of tools to keep us informed:

  • Datadog for infrastructure metrics, dashboards and alerting (there’s a small example after this list)
  • Datadog’s APM product for service-level tracing
  • ELK stack for log aggregation and search
  • Sentry for error reporting
  • PagerDuty and Slack bots for error escalation
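
For a flavour of how lightweight the Datadog side of this can be: the Datadog agent listens for StatsD packets on UDP port 8125, so anything that can write a UDP datagram can emit a custom metric. A minimal sketch, with a hypothetical metric name and tags:

```bash
# Emit a custom counter to the local DogStatsD agent.
# DogStatsD extends the plain StatsD text format with |#key:value tags.
echo -n "deploys.count:1|c|#service:hello-api,env:production" \
  | nc -u -w1 127.0.0.1 8125
```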

We’re always looking for ways to improve the depth of data we have available and to use that data for investigation and self-healing.

Fast, frictionless software delivery

Our engineers want to get new features into production as quickly and painlessly as possible, so deployment tooling is a big part of our job. We’re not there to deploy services for other teams, but to give them the tooling to deploy rapidly and reliably themselves.

Since we rolled out Kubernetes and Jenkins Pipeline, we’ve cut the time it takes to get a brand-new service into production from several days down to a few hours, and we’re hoping to improve that even further! Being able to deploy new services that rapidly is crucial for microservices work, where a service is a primary unit of change.
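
The deploy step at the end of a pipeline is deliberately thin. As a rough sketch of what such a stage might run (the service, registry and image names are hypothetical, and this isn’t our exact pipeline):

```bash
# Build and push an image tagged with the current commit.
docker build -t registry.example.com/hello-api:"$GIT_COMMIT" .
docker push registry.example.com/hello-api:"$GIT_COMMIT"

# Point the Deployment at the new image and wait for the rollout to finish.
kubectl set image deployment/hello-api hello-api=registry.example.com/hello-api:"$GIT_COMMIT"
kubectl rollout status deployment/hello-api
```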

Design for redundancy and scalability

There’s a lot of work that goes into making sure our services are always “production ready”. We’re continuing to invest energy in minimising single points of failure within our infrastructure: ensuring it’s deployed across multiple availability zones, that auto-scaling rules are appropriate, that scale-out and scale-in are rapid, and that all of this is cost-efficient, using the right blend of reserved, on-demand and spot instances.
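
On the application side, some of this maps onto standard Kubernetes primitives. A minimal sketch, with a hypothetical deployment name and thresholds:

```bash
# Run several replicas so a single pod (or node) failure isn't user-visible;
# with multi-AZ node groups, the scheduler prefers to spread them across zones.
kubectl scale deployment/hello-api --replicas=3

# Scale out and in automatically based on CPU utilisation.
kubectl autoscale deployment/hello-api --min=3 --max=20 --cpu-percent=70
```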

What else have we recently been working on?

Kubernetes migration

As we’ve grown, we’ve diversified our technology stack (from a humble Ruby on Rails app toward a significant focus on machine learning), entered new markets, grown the engineering team and run millions of checks. These pressures led us to start splitting our product into independently deployable services.

It was natural to use Docker to package these services, and eventually, we selected Kubernetes as a container scheduler to ease running containers in production (this was in late 2016). We now run all production services on Kubernetes, but that’s taken a lot of work:

  • Build k8s clusters as code. We started by building them the hard way (not advised, but you definitely learn a lot); then we used kraken-lib, which uses Terraform, Ansible and CoreOS to provision a cluster. Eventually kops was released, and it was a no-brainer to use it to deploy our clusters on AWS. Using kops to provision our clusters and outputting the result as Terraform means it’s much easier to manage, and code changes can be peer-reviewed across teams before we implement them (there’s a sketch of this after the list).
  • Migrating existing applications: while we quickly moved new services onto our cluster, this wasn’t a trivial task for the original Onfido application (first commit: 2013). We spent significant time simplifying that application’s setup, running it in staging and dynamically shifting traffic to make sure we were happy with its behaviour on Kubernetes.
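
To give a feel for the kops workflow, here’s roughly what generating a cluster as reviewable Terraform looks like (the cluster name, state bucket and zones are hypothetical):

```bash
# kops keeps cluster state in an S3 bucket.
export KOPS_STATE_STORE=s3://example-kops-state

# Emit the cluster definition as Terraform instead of applying it directly,
# so the resulting code can be peer-reviewed before anything is created.
kops create cluster \
  --name=cluster.example.com \
  --zones=eu-west-1a,eu-west-1b,eu-west-1c \
  --target=terraform \
  --out=.

terraform init
terraform plan   # review the proposed changes
terraform apply  # apply once the review is approved
```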

K8s Warden

We wanted a more secure way of protecting our secrets. Using a combination of AWS KMS and a command-line tool (k8s-warden), we encrypt our secrets locally before pushing them to S3. We add an entrypoint for warden in our base Dockerfile, so when a service is deployed it can decrypt the secrets at runtime using an AWS role that allows KMS decryption. The secrets are stored encrypted within Kubernetes, so even if someone gained access to our k8s cluster, they wouldn’t be able to expose them.

k8s-warden flow (credit @cyph4)
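
k8s-warden is our own tooling, so rather than reproduce its exact commands, here’s the underlying KMS flow sketched with the plain AWS CLI (the key alias, bucket and paths are hypothetical):

```bash
# Encrypt secrets locally under a KMS key; only ciphertext ever leaves the machine.
aws kms encrypt \
  --key-id alias/example-secrets \
  --plaintext fileb://secrets.env \
  --query CiphertextBlob --output text | base64 -d > secrets.env.enc
aws s3 cp secrets.env.enc s3://example-secrets/hello-api/

# At container start, the entrypoint fetches the ciphertext and decrypts it
# using the pod's IAM role, so plaintext exists only at runtime.
aws s3 cp s3://example-secrets/hello-api/secrets.env.enc /tmp/secrets.env.enc
aws kms decrypt \
  --ciphertext-blob fileb:///tmp/secrets.env.enc \
  --query Plaintext --output text | base64 -d > /tmp/secrets.env
```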

AWS Account Split

Originally, we ran isolated environments within separate VPCs inside the same AWS account. We weren’t really happy about this, and wanted to completely segregate our production stack from the others (e.g. Staging, Management…).

This has a number of benefits:

  • Easier billing accountability: for workload-specific accounts (e.g. “development”, “security” or “business analytics”), we can bill at the account level, rather than segregating resources within an account by tags
  • Simpler security model: we apply permissions at the account level
  • Easier provisioning: we can spin up entirely separate environments (e.g. for ad-hoc testing or demos) quickly and consistently

The accounts are configured entirely with Terraform, so we can easily replicate stacks for different environments and be sure they’re built the same way. Our Security team is happy because accounts and access are segregated, and our Finance team is happy too: we now know exactly how much we spend on development!
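
A side effect of the account-level security model is that cross-account access becomes explicit: a person or CI job assumes a short-lived role in the target account rather than holding standing credentials there. A minimal sketch, with a hypothetical account ID and role name:

```bash
# From the management account, assume a scoped role in the production account.
# The returned credentials are temporary and bounded by that single role.
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/deploy \
  --role-session-name ci-deploy
```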

Terraform Automation

Running Terraform on your local machine is all well and good, but what if you want to automate the process? We moved all of our Terraform projects over to Jenkins Pipeline, so changes can be peer-reviewed in our source control (Bitbucket) and then deployed automatically once approved. This means we have logs of every terraform plan and apply, and can track changes. We no longer need to worry about which version of Terraform we’re running locally!
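
The steps the pipeline runs are deliberately simple; the value is the review gate between plan and apply. Roughly (a sketch, not our exact Jenkinsfile):

```bash
# CI: produce a reviewable plan using a pinned Terraform version.
terraform init
terraform plan -out=tfplan   # the plan output is logged alongside the change

# After the change is approved, apply exactly the plan that was reviewed.
terraform apply tfplan
```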

That’s not all — we’ve got some big projects in the works for the rest of 2018, including multi-region deployments, data pipelines, machine learning industrialization, changing how we do continuous delivery, improving our hack day project and more!

We’re not sharing this just to make noise

We’re sharing this because we’re looking for people who want to help us solve some of these problems. There’s only so much insight we can fit into a job advert, so we hope this has given a bit more and whetted your appetite. If you’re keeping an open mind about a new role or just want a chat — get in touch or apply — we’d love to hear from you!
