Convergence to Kubernetes

Standardisation to Scale

While preparing this content for a conference presentation, I asked our CTO what he thought was interesting about our use of Kubernetes and he replied:

Teams don’t realise how much they haven’t had to do.

His comment was inspired by having recently read Factfulness: it’s harder to notice small but continual improvements, and we consequently fail to recognise the progress we’ve made.

Our move to Kubernetes is significant though.

We have close to 30 teams that run some or all of their workloads on our clusters. Approximately 70% of all HTTP traffic we serve is generated from applications within our Kubernetes clusters. It’s probably the single largest convergence of technology since I joined (as a result of uSwitch’s acquisition by Forward) in 2010, when we moved from .NET and physical servers to AWS, and from a monolithic system to microservices.

It’s been one of the quickest changes I’ve seen. In late 2017 all teams ran all their own AWS infrastructure. They were responsible for configuring load-balancers, EC2 instances, ECS cluster upgrades and more. In a little over a year that’s changed for all teams.

Kubernetes has been so useful, and our convergence so fast, because it helped overcome a real organisational problem: ever-growing cloud and organisational complexity, and the difficulty of scaling teams. We didn’t change our organisation because we wanted to use Kubernetes, we used Kubernetes because we wanted to change our organisation.

Engineering staff may not recognise the change but our data does. More on that in a bit.

Many years ago I attended a Clojure conference and saw Michael Nygard present a talk titled “Architecture Without an End State”. It’s a superb presentation. Clean, ordered, straight-line system architecture is lampooned promptly when he draws a comparison between late-night infomercials and large-scale software architecture: your current systems are the blunt knives that mash and squash rather than slice. Before you can contemplate making a salad you need new knives.

The analogy targets organisations’ fondness for three-year projects: design and prepare in year one, roll out in year two, pay off in year three. In the presentation he points out that such projects are often undertaken continually and rarely reach the end of year two (often as a result of acquisitions and changes in direction or strategy), so most architectures are better described as:

A “steady state” [that] is a superposition of ongoing wavefronts of change.

uSwitch is a fine example of such an idea.

The move to AWS was prompted by many things: an existing system unable to cope with peak traffic demands, and an organisation whose movement and speed were inhibited by too rigid a system and by highly coupled, functionally structured, project-based teams.

Our response was not to stop the world, migrate everything and get back to it. We created new services, proxied to them from the existing load-balancer and gradually strangled the old application. We focused on demonstrating value: A/B testing the first version of our new service in production within the first week. We eventually started to organise teams around long-running products, filled with people from engineering, design, data science and other necessary disciplines. And the performance of the business responded. It felt revolutionary in 2010.

Over the years we added more teams, services and applications, and gradually strangled ever more of the monolith. Teams were able to progress fast because they had all necessary skills embedded and were decoupled from each other. We minimised the amount of coordination needed to release. Only the load-balancer configuration needed multiple teams to commit to.

Teams were free to choose their own process, tools, languages. They were given ownership over the problem and, as they were closest, would be best suited to deciding how to respond. AWS made it easy to support this change.

Our intuition came from software engineering: loosely coupled teams would need less frequent communication and less coordination, both relatively expensive activities. It was fantastic to see this properly covered in the recent, and excellent, Accelerate book.

The result was, as described by Michael Nygard, a system composed of many wavefronts of change: some systems were automated with Puppet, some with Terraform, some used ECS and others used straight EC2.

In 2012 we were proud to have an architecture that could evolve so frequently, letting us experiment continually, discovering what worked and doing more of it.

In 2017, however, we finally recognised that things had changed.

AWS is significantly more complex today than when we started using it within uSwitch in 2010. It provides an incredible amount of choice and power, but not without cost. Any team that interacts with EC2 today must navigate decisions on VPCs, networking and much more.

Anecdotally we felt we’d observed this effect: teams reported spending more days on infrastructure-related activities like upgrading instances in their Amazon ECS clusters and EC2 machines, migrating from Elastic Load Balancers to Application Load Balancers, and so on.

In mid-2017 I presented at an internal away-day on the need for us to standardise, with a view to increasing the overall quality of our systems. I borrowed the oft-used iceberg metaphor to describe how we build and operate software:

Operating software iceberg. Screenshot of my immaculate Google Slides diagram ;)

My argument was that most teams within our organisation should be focused on building services or products: their decisions should focus on solving a problem, application code, frameworks and libraries etc. in that order. Much other software still sits beneath the waterline: logging integration, observability tools, secrets management amongst others.

Each application team, at the time, owned almost all of the iceberg. Teams had decisions to make across the whole spectrum: which language, application framework, metrics library and tool, operating system, instance type, storage.

At the bottom of our iceberg sat Amazon Web Services. Not all AWS services are equal though. Some are Backend-as-a-Service (BaaS) for things like authentication, data storage and data warehousing. Other services, like EC2, are relatively low-level. I wanted to look at data to see whether the anecdotal reports had support: were people spending more of their time interacting with relatively low-level services, and consequently on relatively low-value decisions?

I categorised services and used CloudTrail to gather as much historical data as I could, then used a combination of BigQuery, Athena and ggplot2 to visualise how things had changed for technology staff over time. Growth in services like RDS, Redshift etc. would be encouraging (and expected), whereas growth in EC2, CloudFormation etc. would not.

Use of Low-level AWS Services (EC2, IAM, STS, Autoscaling etc.) over time. Data covers January 2015 to January 2017.

Each point in the scatter plot shows the 90th (red), 80th (green) and 50th (blue) percentiles for the number of low-level services used by people each week, plotted over time. I’ve added smoothing lines to help visualise the trend.
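The actual pipeline used BigQuery, Athena and ggplot2, but as a rough illustration of the aggregation here’s a minimal Python sketch (the service categorisation and event records are hypothetical, not our real data) that computes, per week, the distinct low-level services each person used and then the percentiles over people:

```python
from collections import defaultdict
from datetime import date

# Hypothetical categorisation of which AWS services count as "low-level".
LOW_LEVEL = {"ec2", "iam", "sts", "autoscaling", "cloudformation"}

def weekly_percentiles(events, pcts=(50, 80, 90)):
    """events: iterable of (user, date, service) tuples, e.g. parsed from
    CloudTrail logs. Returns {(year, week): {pct: value}}, where value is
    a simple percentile of distinct low-level services used per person."""
    per_user = defaultdict(set)  # (year, week, user) -> services used
    for user, day, service in events:
        if service in LOW_LEVEL:
            year, week, _ = day.isocalendar()
            per_user[(year, week, user)].add(service)

    by_week = defaultdict(list)  # (year, week) -> per-person counts
    for (year, week, _user), services in per_user.items():
        by_week[(year, week)].append(len(services))

    result = {}
    for key, counts in by_week.items():
        counts.sort()
        # simple percentile over the sorted per-person counts
        result[key] = {
            p: counts[min(len(counts) - 1, int(p / 100 * len(counts)))]
            for p in pcts
        }
    return result

# Illustrative events only.
events = [
    ("alice", date(2016, 3, 7), "ec2"),
    ("alice", date(2016, 3, 8), "iam"),
    ("bob", date(2016, 3, 9), "sts"),
    ("bob", date(2016, 3, 9), "rds"),  # higher-level service: ignored
]
print(weekly_percentiles(events))
```

The same grouping (person × ISO week, counting distinct services, then percentiles across people) is what each point in the scatter plot represents.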

Despite a general trend within uSwitch engineering towards higher-level abstractions for deploying software, like using containers and Amazon ECS, the number of AWS services people needed to regularly use increased substantially; teams and people were not being sufficiently abstracted from the complexity of running their applications. Over the course of two years, 50% of people had seen it double and 20% had seen it nearly triple.

This placed an additional constraint on the scaling of the organisation. Teams wanted to be self-sufficient, but this made hiring difficult. We needed people who were strong application and product developers and who also had an ever-expanding depth of AWS knowledge.

We wanted to scale our teams further but maintain the principles of what helped us move fast: autonomy, work with minimal coordination, self-service infrastructure.

Kubernetes helps us achieve this in two ways:

  • It provides application-focused abstractions
  • We operate and configure our clusters to minimise coordination

Application-focused abstractions

At the core of Kubernetes are concepts that map closely to the language used by an application developer. For example, you manage versions of your applications as a Deployment. You can run multiple replicas behind a Service and map that to HTTP via Ingress. And, through Custom Resources, it’s possible to extend and specialise this language to your own needs.

These abstractions help application teams be more productive. The ones I’ve described above are pretty much all you need to deploy and run a web application, for example. Kubernetes automates the rest.
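For example, a minimal web application expressed in these concepts might look like the following sketch (the names, image and hostname are hypothetical, not our actual configuration, and the Ingress API version shown is the current one rather than what existed at the time):

```yaml
# Hypothetical web application: Deployment manages versions and replicas,
# Service gives them a stable address, Ingress maps them to HTTP.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-web
  template:
    metadata:
      labels:
        app: example-web
    spec:
      containers:
        - name: web
          image: example/web:1.0.0   # illustrative image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: example-web
spec:
  selector:
    app: example-web
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-web
spec:
  rules:
    - host: example.invalid   # illustrative hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-web
                port:
                  number: 80
```

Notice that nothing here mentions instances, subnets or load-balancer listeners: the developer describes the application, and the platform fills in the rest.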

In the iceberg picture I showed earlier, these core concepts sit at the waterline, connecting what an application developer is trying to achieve with the platform underneath. Our cluster operations team can make many of the lower-level, lower-value decisions (like managing metrics, logging etc.) but have a conceptual language that connects them to the application teams above.

In 2010 uSwitch operated a traditional operations team responsible for running the monolith and, until relatively recently, had an IT team partly responsible for managing our AWS account. I believe one of the things that constrained the success of that team was the lack of conceptual sharing.

When your language only includes concepts like EC2 instances, load-balancers and subnets, it’s hard to communicate much meaning. It was difficult, sometimes impossible, to describe what an application was: sometimes it was a Debian package, sometimes something deployed with Capistrano. There was no shared language across teams for describing an application.

In the early 2000s I worked at ThoughtWorks in London. During my interviews I was recommended Eric Evans’ Domain Driven Design book. I bought a copy from Foyles on my way home, started reading it on the train and have referenced it on most projects and systems I’ve worked on ever since.

One of the key concepts presented in the book is Ubiquitous Language: emphasising the careful extraction of common vocabulary to aid communication amongst people and teams. I believe that one of Kubernetes’ greatest strengths is providing a ubiquitous language that connects applications teams and infrastructure teams. And, because it’s extensible, this can grow beyond the core concepts to more domain and business specific concepts.
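As a sketch of that extensibility, a Custom Resource Definition can introduce a domain concept into the cluster’s vocabulary (the group, kind and fields below are invented purely for illustration):

```yaml
# Hypothetical CRD: teams could then write "kind: Comparison" resources,
# extending the shared language beyond Deployments and Services.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: comparisons.products.example.com
spec:
  group: products.example.com
  scope: Namespaced
  names:
    kind: Comparison
    plural: comparisons
    singular: comparison
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                vertical:
                  type: string   # e.g. the product area this serves
                replicas:
                  type: integer
```

A controller watching these resources would translate the domain concept into the lower-level core objects.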

Shared language helps us communicate more effectively when we need to but we still want to ensure teams can operate with minimal coordination.

Minimise Necessary Coordination

In the Accelerate book the authors highlight characteristics of loosely coupled architectures that drive IT performance:

the biggest contributor to continuous delivery in the 2017 analysis… is whether teams can:
Make large-scale changes to the design of their system without the permission of somebody outside the team
Make large-scale changes to the design of their system without depending on other teams to make changes in their systems or creating significant work for other teams
Complete their work without communicating and coordinating with people outside their team
Deploy and release their product or service on demand, regardless of other services it depends upon
Do most of their testing on demand, without requiring an integrated test environment

We wanted to run centralised, soft multi-tenant clusters that all teams could build upon while retaining many of the characteristics described above. Some coordination is impossible to avoid entirely, but we operate Kubernetes as follows to minimise it:

  • We run multiple production clusters and teams are able to choose which clusters to run their applications in. We don’t use Federation yet (we’re waiting on AWS support) but instead use Envoy to load-balance across the different clusters’ Ingress load-balancers. We can automate much of this with our Continuous Delivery pipeline (we use Drone) and other AWS services.
  • All clusters are configured with the same Namespaces. These map approximately 1:1 with teams.
  • We use RBAC to control access to Namespaces. All access is authenticated and authorised against our corporate identity in Active Directory.
  • Clusters are auto-scaled and we do as much as we can to optimise node start-up time. It’s still a couple of minutes but it means that, in general, no coordination is needed even when teams need to run large workloads.
  • Applications auto-scale using application-level metrics exported from Prometheus. Application teams can export Queries per Second, Operations per Second etc. and manage the autoscaling of their application in response to that metric. And, because we use the Cluster autoscaler, nodes will be provisioned if demand exceeds our current cluster capacity.
  • We wrote a Go command-line tool called u that standardises the way teams authenticate to Kubernetes and Vault, request temporary AWS credentials, and more.
Authenticating to Kubernetes using the u command-line tool
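The application-level autoscaling described in the list above can be sketched as a HorizontalPodAutoscaler targeting a Prometheus-derived metric surfaced through a metrics adapter (the names, thresholds and API version are illustrative, not our actual configuration):

```yaml
# Hypothetical HPA: scales a Deployment on an application-level metric
# (HTTP requests per second) exported via Prometheus and exposed to the
# custom metrics API by an adapter. The cluster autoscaler then adds
# nodes if demand exceeds current capacity.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-web
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # illustrative metric name
        target:
          type: AverageValue
          averageValue: "100"   # scale to ~100 req/s per replica
```

The team owning the application chooses the metric and the target; no coordination with the cluster operators is needed.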

I’m not arguing that Kubernetes has increased our autonomy, although that may be the case; it has certainly helped us maintain high levels of self-service and autonomy while reducing some of the pain we felt.

Adoption of Kubernetes has been rapid. The plot below shows the cumulative number of Namespaces (which approximate teams) running on our production clusters, with the first recorded in February 2017.

Growth in Namespaces/teams over time

Early adoption was driven from necessity: we focused on absorbing infrastructure pains for teams that were small and focused on their product.

The first team to move was convinced after their application server ran out of disk space as a result of a misconfigured logrotate. Migrating to Kubernetes took a few days and, in return, they could worry about other things.

Recently teams have been moving to take advantage of better tooling. Running on the Kubernetes clusters gives easier integration with our secrets management system (Hashicorp Vault), distributed tracing (Google Cloud Trace) and more. We’re providing ever more high-value features to all teams.

Earlier I showed a plot of the percentiles of the number of services used by people each week, from late 2014 to 2017. Below is the same data extended up to the present day.

Low-level service use has improved since our convergence to Kubernetes in early 2017

We’ve managed to move the needle in managing AWS complexity. It’s encouraging that 50% of people now experience something close to what people had in early 2015. Our Cloud team has around 4–6 members, approximately 10% of technology staff, so it’s not surprising that the 90th percentile hasn’t moved much yet, but it’s something I’d like to see continue to fall.

Finally, I’d like to turn to engineering outcomes, again inspired by having recently read Accelerate.

The authors discuss two measures associated with Lean: lead time and batch size. Lead time is the time between a demand for something and it being delivered. Batch size is a measure of the size of the units of work flowing through the system, and reducing batch size is associated with a number of positive effects on work:

Reducing batch sizes reduces cycle times and variability in flow, accelerates feedback, reduces risk and overhead, improves efficiency, increases motivation and urgency, and reduces costs and schedule growth

As a more readily available proxy for batch size, the authors use deployment frequency. Their intuition is that batch size is the reciprocal of deployment frequency: if we deploy more frequently, we’re releasing smaller batches.

We have data for some deployments. It’s not precise: some teams release directly from pushing to the master branch of a repository, others have different release mechanisms. It doesn’t cover all applications either, but the last 12 months of data give a reasonable range to cover.

Deploys per person per week. Data covers approximately 1 year to May 2018.

The dip around week 30 is the Christmas period but, aside from that, there’s a general trend towards increasing deployment frequency and, consequently, reduced batch size. Between March 2018 and May 2018 our release rate nearly doubled, and recently we’ve had weeks in which we released over 100 times a day.

Our adoption of Kubernetes is just part of a wider move towards greater standardisation, automation and better tooling. All of those factors are likely to have contributed to increasing our release frequency.

Also in the Accelerate book, the authors look at the relationship between deployment rate and the number of people, and how fast an organisation can move as more people are added. They highlight a constraint of more tightly coupled architectures and teams:

The orthodox view of scaling software development teams states that while adding developers to a team may increase overall productivity, individual developer productivity will in fact decrease

Plotting the same data to show the relationship between people and deployments reveals that we’re able to increase our release frequency even as we add more people.

Our ability to release increases as we add people

At the start of this article I referenced Factfulness (well, a quote from our CTO that was inspired by it, to be precise). Our move to Kubernetes is one of the most significant, and fastest, convergences of technology across our engineering teams. Each step feels like a small change, to the extent that it’s easy to miss how significantly things have improved. It’s great to have some indicative data suggesting it’s having the desired effect: helping people focus on their product and make the high-value decisions they’re best placed to make.

I would never have described what we had as bad: microservices, AWS, long-lived product focused teams, developers that owned their services in production, loosely-coupled teams and architecture; all were themes I touched on in a presentation titled “Our Age of Enlightenment” I gave at a conference in 2012. We should, however, always strive for better.

I’ll finish with a reference to another book I recently started reading: Scale. It’s a fascinating book, and early on it makes a striking point about energy consumption in complex systems:

To maintain order and structure in an evolving system requires the continual supply and use of energy whose by-product is disorder. That’s why to stay alive we need to continually eat so as to combat the inevitable, destructive forces of entropy production.
The battle to combat entropy by continually having to supply more energy for growth, innovation, maintenance, and repair, which becomes increasingly more challenging as the system ages, underlies any serious discussion of aging, mortality, resilience, and sustainability, whether for organisms, companies, or societies.

I think software systems could probably be added to that list. Hopefully our recent work keeps entropy at bay a little while longer.

My thanks to Tom Booth, Michael Jones, Naadir Jeewa and Dan North for their feedback on writing this article and presentation, and Shannon Wirtz for helping me visualise the data.
