Kubernetes as the simplest complex solution
Outlining my current thinking on the tradeoffs around administering production software with Kubernetes
Kubernetes has become a bizarrely hot topic of late. One must have opinions on it: is it good? Bad? A sign of Google being predatory in the industry? The new OpenStack? At any rate, it seems that no matter where people are, they have some sort of notion of it.
One of the notions I’ve come across most commonly is that Kubernetes is “too complex”, and we should do “simpler things”.
The challenge is not without merit; Kubernetes is a fundamentally new way of approaching software delivery and it is indeed very complex. To application developers, “serverless” is perhaps less of a conceptual jump. However, while it’s complex, I don’t think it’s more complex than other ways of managing production software.
First, it’s worth picking apart exactly what I mean by complexity. I’ve found the best definition to be that of Wikipedia:
Complexity characterises the behaviour of a system or model whose components interact in multiple ways and follow local rules, meaning there is no reasonable higher instruction to define the various possible interactions.
This definition is useful because it illustrates the bounds of what we think of as things that are “simple” versus “complex”. “Simple” in this case means “a magic black box that I do not have to understand but can use anyway”. Lots of things fit into this pattern:
Cars are engineered on a worldwide basis with multiple companies collaborating to engineer and composite components into a frame. They have several computers, carefully machined engine components, sensors and a vast array of other complexity.
The vast majority of the time, however, we do not think about this complexity. We interact with the steering wheel, accelerator and brake while singing our hearts out to Carly Rae Jepsen.
Cars are both complex and simple.
The internet is a stack of complex engineering on whose foundation we, as web developers, build every day. But the internet is extraordinarily complex, surviving thanks to its focus on redundancy, resilience and self repair.
The vast majority of the time, however, this is opaque to us. Instead we simply consume Twitter and Netflix, and push our code to and from the magic servers in “the cloud”.
The internet is both complex and simple.
And so on
Broadly, there are countless examples of things that are ostensibly simple, but hide an amazing level of complexity. Perhaps the best example from a non-technical context is Thomas Thwaites’s quest to build a toaster from “scratch”.
Given this it stands to reason that while things can be complex, they can also be simplified — so long as they’re predictable.
Though the nature of Kubernetes is complex it’s worth evaluating the nature of the problem that Kubernetes tries to solve — that is, running software in production.
Ostensibly, running software in production is simple. Take something that exists locally and put it on the production machine:
This model is fairly common in shared hosting, and popularised by FTP and other tools; it’s basically drag and drop between computers.
However, there is a set of risks to production, and where those risks are known it makes sense to invest more up front to mitigate them. Specifically, it makes sense to invest enough that those risks coming true becomes an irrelevant cost to the business: “buying predictability”.
Those risks include:
In our modern age of Cloud this is one that we sometimes forget, but machines do invariably fail. In the early days of Google, teams got so sick of unscrewing and rescrewing drives that they velcroed them in instead, simply replacing them as they failed.
Computers of all kinds fail. The only way to keep a service up is to copy it to several computers; to back it up. Then when a computer dies, we can fetch that software from another computer and put it back up in front of users.
However, users have come to expect that long downtimes are simply not a thing. They’re not likely to tolerate an 8 hour outage while we drive down to the DC to switch a machine over. So, to address this, we run multiple copies of our software such that users can always get to at least one healthy copy, no matter what.
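A minimal sketch of that idea, assuming a fixed list of copies and a health check we can call. The replica names and the `is_healthy` callback are made up for illustration; in a real system a load balancer does this job.

```python
from typing import Callable, Optional, Sequence


def first_healthy(
    replicas: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first replica that passes its health check, or None."""
    for replica in replicas:
        if is_healthy(replica):
            return replica
    return None


# Hypothetical replica names, purely for illustration.
REPLICAS = ["app-1", "app-2", "app-3"]
FAILED = {"app-1"}  # pretend this machine has died

# Traffic is routed to whichever copy is still healthy.
target = first_healthy(REPLICAS, lambda name: name not in FAILED)
print(target)
```

The point is only that the user never has to care which copy answered, so long as one did.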
Generally speaking we do not build the entire stack of our software; such effort would be ridiculously expensive. Instead we composite together a set of software that lets us do the things we need to do; things like:
Each of those things; the Virtual Machine, Linux Kernel, Webserver and PHP interpreter can be configured to run in thousands of different ways. We can get significantly better performance and reliability out of our application by knowing the respective knobs and dials on this software, and being able to customize it to our particular requirements.
Keeping track of this stack of configuration is not, strictly speaking, application development — rather it’s “infrastructure”. But the difference is meaningless to the user and we need to factor in whether we want to be able to optimize things for the best experience.
As much as we try, software invariably goes wrong. Even in mission critical, extremely careful environments software occasionally behaves in ways that we did not anticipate. We’ve broadly, as an industry, accepted that perfection is impossible.
However, given this truth it’s important that when the software does go wrong we can understand how and why it went wrong, and how we need to change it to prevent this circumstance in future.
There is a swathe of techniques for addressing this: logs, time series data, transaction tracing, application runtime introspection, inspection between the app and the kernel. However, those things are only useful to us if we know that the application did not behave as expected, and if we captured enough information at the time the application failed to help us reproduce and resolve the issue.
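As a rough sketch of “capturing enough information at the time of failure”, here is one way to log an exception together with its context as a single structured line. The field names, the `request_id` and the `handle` function are assumptions of mine rather than any particular logging standard.

```python
import json
import logging
import time
import traceback


def failure_record(request_id: str, exc: BaseException) -> str:
    """Serialise an exception plus its context as one JSON log line."""
    return json.dumps({
        "ts": time.time(),
        "request_id": request_id,  # hypothetical correlation ID
        "error": type(exc).__name__,
        "message": str(exc),
        "trace": traceback.format_exception(type(exc), exc, exc.__traceback__),
    })


def handle(request_id: str, payload: dict) -> int:
    """Toy request handler that logs enough to reproduce a failure."""
    try:
        return 100 // payload["divisor"]
    except Exception as exc:
        logging.getLogger("app").error(failure_record(request_id, exc))
        raise  # still fail loudly; the log line is for later diagnosis
```

Because each failure is one machine-readable line, it can be searched and correlated later instead of being reconstructed from memory.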
Production has real people
It’s easy to forget as we’re building an application to specification that on the other end of this application will be real people with their own complex lives. Those people trust us to take care with the data they supply to us and will be extremely upset (and express their displeasure with legal action) if we allow a third party access to their data.
Accordingly, we need to be careful with what has access to a production system and how what’s accessing it behaves, such that we can guarantee users a level of safety and security for their data.
Sometimes things change
While the vast majority of software stays fairly stable, innocuous and out of the way, software that’s under heavy development, especially consumer-grade or customer-facing software, is required to change regularly.
Google has managed to organise a bizarre array of the world’s information within 20 years, Twitter has managed to connect strangers within 10, and TED has managed to collect some of the leading creative thinkers of our time and give us access to their ideas for our own worlds.
This change means new risks introduced all the time, and new requirements of our software. Unless we can change the software quickly and easily we’ll lose market share to others who can.
The aforementioned issues have been solved, in different and often excellent ways, for perhaps 30–50 years. In a sense, while computing is one of the areas being innovated most quickly, it’s also one with a bizarrely stable heritage.
However, the way in which we’ve solved those problems has changed over time.
At first, specialists were required for each part of the application stack: specialists to rack and flash machines, other specialists to connect those machines to computer networks, other specialists to configure those machines and finally software developers to build the software that will be shipped on those machines.
The invariably human process meant that any change took time: the amount of time required to convince other humans that change was required, for them to learn and upskill themselves to implement the change, test it and release it to the next set of specialists. However, more recently, with the advent of “Cloud”, much of that has changed. Software has, in this sense, eaten those jobs. Rather than solving each of those problems on a case by case basis, those specialists instead write software that automates the process quickly, reliably and painlessly.
What makes this possible is a set of nested, reliable abstractions that allow users to care only about the “layer” they’re working with. It’s here that Kubernetes, and more particularly Docker, provide a superb amount of value and can actually reduce the knowledge required of a given developer.
I think of the layers as follows:
At the lowest end of the stack is the development of the application. That can be written in a number of languages, expose a REST or RPC API or a full user interface and require an arbitrary number of additional services.
However, so long as the writers of those applications stick to the APIs defined by their language and the rules defined by the “12 factor application”, that application can be run on an extremely wide array of machines, including, perhaps most usefully to the developer, the machine the developer is most familiar with.
Ideally, application developers should need no notion of Docker, containers, Kubernetes or anything else: they just follow the 12 factor rules.
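To make one of those rules concrete: the 12 factor approach asks that configuration live in the environment rather than in the code. A minimal sketch, assuming variable names (`DATABASE_URL`, `PORT`, `DEBUG`) that I have picked purely for illustration:

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    port: int
    debug: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables (names are illustrative)."""
    return Settings(
        database_url=env["DATABASE_URL"],  # required: KeyError if missing
        port=int(env.get("PORT", "8080")),
        debug=env.get("DEBUG", "false").lower() == "true",
    )


# The same code runs on a laptop or in production; only the environment differs.
local = load_settings({"DATABASE_URL": "postgres://localhost/dev"})
print(local)
```

Nothing in this code knows where it is running, which is exactly what lets the layers further up the stack move it around freely.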
Application packing (Docker)
In most cases applications do not exist in a vacuum, but depend on a specific set of libraries, components or other programs that allow them to function. Compiling all of these components into a hermetic, sealed unit allows the application to behave consistently no matter what environment it is deployed to.
There are various ways of doing this:
But the container is perhaps the best. It packages everything down to the underlying operating system (if required), reducing the application’s margin for error to an extremely small amount.
Application developers should not need any notion of operating systems or, in many cases, system libraries or other such details. In most cases those things are provided for them by their operating system, and where specific libraries are required, users with more experience can help by packing them into these sealed units.
Application Production (Kubernetes)
Software invariably has to end up in front of users — that is, in production. It requires a set of supporting services, including:
- Blob Store
- Secret Store
- Another Application
Or any number of other oddities. Additionally, it needs to be able to reach all of these things reliably, which means they in turn need to be reliable, and it needs to talk to them securely.
Lastly, the aforementioned problems need to be solved; the application needs to be redundant, well configured, well instrumented and widely available to users.
Kubernetes provides a language and a system for doing this effectively. It is indeed a complex system, but similar systems such as Ansible, Elastic Beanstalk or other tooling are of similar complexity.
Developers who are able to pack software into containers should not have to be aware of how that software is deployed into a production system. So long as the application works in some capacity in the container, and so long as the application developer has followed the aforementioned 12 factor rules, deploying the application to Kubernetes is fairly trivial and the problems outlined earlier are simply solved.
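For a sense of what that “language” looks like, here is a sketch that builds a Kubernetes Deployment manifest as a plain dictionary and prints it as JSON. The application name, image and environment values are hypothetical; the field names are those of the `apps/v1` Deployment object.

```python
import json


def deployment(name: str, image: str, replicas: int, env: dict) -> dict:
    """Build a Kubernetes apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,  # redundancy: several copies of the app
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        # 12 factor style configuration via the environment
                        "env": [
                            {"name": k, "value": v}
                            for k, v in sorted(env.items())
                        ],
                    }],
                },
            },
        },
    }


# Hypothetical application and image, for illustration only.
manifest = deployment("shop", "registry.example.com/shop:1.0.0", 3, {"PORT": "8080"})
print(json.dumps(manifest, indent=2))
```

Kubernetes accepts manifests as JSON as well as YAML, so output like this could in principle be piped to `kubectl apply -f -`. The redundancy and configuration concerns from earlier each become a single declarative field.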
A complexity stack
So, there are three levels of nested abstraction:
- Kubernetes, which depends on
- Docker, which depends on
- Application developer
Which means users have three opportunities to simply consume the service, rather than poke at the nasty details of the abstraction layer beneath.
There will invariably need to be users who understand the entire stack; there always have. But by keeping it segmented into the units above, there are clear handovers between each layer and users can reliably depend on the layer below theirs.
The above is still exceedingly complex. So far there is no clear path for an application developer to “hand” their code to the Docker maintainer, and no path for the Docker maintainer to “hand” their containers to the Kubernetes system.
The CI/CD pipeline is where this hand off takes place. The pipeline lends itself extremely well to this chunked approach to release, where:
- The application will be compiled, linted and unit tested
- The container will be assembled and smoke tested
- The deployment will be released and tested with canaries or blue/green deployments
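The staged hand-off above can be sketched as a list of steps that run in order and halt at the first failure. The stage names mirror the list; the pass/fail results are simulated, and this is not how any particular CI product is implemented.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order, stopping at the first failure.

    Returns the names of the stages that passed.
    """
    passed = []
    for name, step in stages:
        if not step():
            print(f"{name}: failed, halting")
            break
        print(f"{name}: ok")
        passed.append(name)
    return passed


stages = [
    ("application: compile, lint, unit test", lambda: True),
    ("container: assemble, smoke test", lambda: True),
    ("deployment: canary release", lambda: False),  # simulated failure
]
completed = run_pipeline(stages)
```

Each stage only has to trust that the one before it passed, which is the same layering as the stack itself.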
Buying in to a complex application stack invariably means buying into a CI/CD pipeline of some kind.
However, by buying into the pipeline, the complexity of the system to the application developer is reduced to only the application. The rest of it should be a reliable black box the application developer does not need to know about; rather, they only use the metaphorical pedal and brake of the system: kicking off deployments by merging to master, or via a clicky button in the UI.
Kubernetes is indeed very complex software. However, it allows us to construct a layered approach to software release which allows us to “chunk” that complexity into black boxes that users can consume without needing to understand the entire block.
In that sense Kubernetes buys us simplicity. As in the automobile, that complexity doesn’t disappear, but so long as the car behaves reliably, drivers do not care. In our case we’ve been running Kubernetes backed by Bitbucket Pipelines for ~2 years, and in some cases developers do not even know the system that powers their code; they just follow the magic.
I hope that makes some sense. ❤