Pragmatism vs Kubernetes

Richard Zhang
11 min read · Mar 3, 2019


Just a little preview of my sick editing skills

I spent a while thinking about the title of this: it should really be “Pragmatism and Kubernetes”, because in some cases the two do indeed overlap. But I figured “versus” would be more fun and provocative (a nice way of saying CLICKBAIT). So who is this guy, and what's he got against Kubernetes? I guess I have some explaining to do.

Now before I wrote this article, I did a quick search to see if anyone had already written something similar. This article got close and asked the right questions, but I couldn't find one that articulated what I was trying to say.

During my search, I also noted some of the more popular problems which people in the industry see Kubernetes solving. I thought breaking them down and discussing them could be a good format for the article.

A little context first, just so you know that I'm not biased in any particular way and that there is at least some merit behind my opinions. I should preface this post by saying that I've built deployment pipelines/tooling for, and operated, applications in the cloud (I will speak mainly from an AWS perspective) on a range of different infrastructure patterns: EC2s with application AMIs, Docker on EC2, ECS, Nomad, Kubernetes, EC2s configured by Puppet and, as much as I'd like to forget, EC2s deployed, configured and orchestrated by Ansible (dark days).

But enough of that, let’s get into it, shall we?

Speed

Speed, in many respects, is one of the big drivers for change in tech companies, especially now that everyone has drunk the Agile Kool-Aid. We need speedy deployments for short feedback loops, speedy recoveries for those extra few nines in our uptime, speedy applications for performance; the list goes on.

There are a few ways Kubernetes brings speed to the table. Applications are deployed to a cluster via pods: essentially a set of containers which scale together, are connected over a local network, and share a single IP address and port space. You can think of them as docker-compose deployments if you've ever used those. Since this all comes down to containers, your application and its environment are packaged inside docker images, which means very fast boot times (depending on how you initialise your application, of course). So when you deploy a new version of your application, your deployment (at minimum) takes as long as it takes to:

  • Build your docker image
  • Upload your docker image to a docker registry
  • Determine which node to run your container on
  • Initialise your container

There are of course many ways to optimise this deployment (this also assumes that you have enough resources in your cluster and don't have to wait for horizontal scaling), but ideally this is the fastest way you can deploy your application. Naturally, this construct allows for speed in other respects, such as time to scale: bringing up new instances of your application to handle extra load only requires the last two steps (plus cluster scaling time if needed). A minimally sized docker image with a simple bootstrap script will result in a lightning-fast deployment.
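To make that concrete, here is a minimal sketch of the kind of Deployment manifest this flow revolves around; the names, image and replica count are placeholders rather than anything from a real setup.

```yaml
# Hypothetical example: names, labels and the image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.2.3  # bump this tag on each deploy
          ports:
            - containerPort: 8080
```

Bumping the image tag and re-applying the manifest is what kicks off a rolling update, so in the happy path the only waiting is for the build, push, scheduling and container start-up steps listed above.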

Having said this, here are a few points to consider:

Firstly, these features aren't features of Kubernetes: they're features of any containerised app. Additionally, Kubernetes is not by any means a simple system; there are many different aspects to consider when operating a cluster (as we will discuss later). So if speed is your objective, it can be achieved with much simpler solutions. Take ECS, for example, as an easy transition from virtual machines to containers for both the developers of the applications that run on it and the engineers who operate and support it (heavily dependent on your current setup, of course). It's important to remember that adopting such a technology doesn't only mean your infrastructure engineers need to learn to operate the platform. Devs will need to learn kubectl, Helm and something like minikube if they'd like to see how their code will run locally. All of this takes time (and thus reduces speed).

Secondly (and probably most importantly), how will speeding up your deployments actually help the business? At the end of the day, all of us are paid to bring value to our businesses: will saving a few minutes off your deployment time be the best use of you and your team's time? In some organisations, absolutely, but you'll find that in most, this is not where the bottleneck lies. Let's take a look at a simplified software development life-cycle (SDLC).

  • First, a problem that needs solving is identified.
  • Then the task is prioritised and assigned to developers.
  • Devs design and develop the solution.
  • Testing and approval (if needed) is done on the proposed solution.
  • The solution is deployed and validated.

And the cycle continues.

If speed is your objective, it's pretty clear what we need to do: find the bottleneck in your SDLC and speed it up. If you're shaving 5 minutes off your deployment time but it takes days to test or approve your deployments, you just end up with a pile of deployments waiting to be tested and approved. You've probably seen this concept in things like networking and database/app performance, so it should be pretty easy to identify where the bottlenecks in your company lie.

I did warn you

Has upper management lost confidence in releases because they’ve experienced too many bad ones? Then you should probably work on ensuring you have better, more automated testing and monitoring to ensure that deployments go smoothly.

Does it maybe take far too long for product managers to decide what problems need to be solved? Or to validate that an implemented solution works? Then you probably need to work on gathering appropriate data to allow for these decisions to be made.

Does it take too long to determine why a solution isn’t working as well as you thought it should? Maybe you should be working on increasing the observability of your infrastructure/apps.

There are so many things that can be sped up, but you get the gist of it.

Uniform Workflows

This is another factor that seems to be mentioned a lot when talking about Kubernetes. It enforces lots of patterns in the way developers interact with it and eventually run their code, and one of the big ways it does this is via the APIs it exposes to them. Kubectl acts as an interface to the cluster, controlling developer access and allowing them to inspect what's running on the cluster, fetch and create secrets and, most importantly, deploy their apps. Helm charts and Dockerfiles allow developers to template their deployments and define exactly how and where their code will run, even controlling how external and internal traffic is routed to it. It also forces developers to think about how many resources are required to run their code (and hopefully also about how to optimise this!). Everyone who decides to use Kubernetes follows the same patterns.
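As a rough illustration of what developers end up declaring, here is a sketch of the resources section of a pod spec; the numbers are invented and would need to come from actually profiling the app.

```yaml
# Hypothetical numbers: real requests/limits should come from profiling the app.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.2.3
      resources:
        requests:
          cpu: 250m       # reserved for scheduling: a node must have this much free
          memory: 256Mi
        limits:
          cpu: 500m       # CPU usage beyond this is throttled
          memory: 512Mi   # memory usage beyond this gets the container killed
```

Requests are what the scheduler reserves on a node; limits are where the container gets throttled (CPU) or killed (memory). Getting these numbers right is exactly the kind of thinking the platform forces onto developers.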

Deployments and config become easily understood between developers as well as “DevOps people”. Additionally, if designed well, your monitoring and observability solutions (maybe something simple like the metrics API + Prometheus + Grafana) will expose the same infrastructure performance and APM metrics from across the cluster, as well as application logs. This way, developers can choose what they'd like their apps to expose and see everything in the same place.
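For example, a common (though not built-in) convention is to have Prometheus discover scrape targets via pod annotations. A hedged sketch, assuming your Prometheus scrape config honours these annotations, with made-up names and ports:

```yaml
# Hypothetical sketch: assumes Prometheus is configured (via Kubernetes service
# discovery) to honour these widely used, but not built-in, scrape annotations.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9102"      # where the app exposes its metrics
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.2.3
      ports:
        - containerPort: 9102
```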

These points are all extremely compelling and, especially if you’re in the DevOps space, might have you frothing at the mouth at this point. It’s all starting to sound a little bit like a tech utopia.

“Look dad, a service mesh!”

But before our heads get too high up in the clouds (see what I did there?), let's take a closer look at reality. If you haven't already noticed, almost all of my points boil down to:

Yeah, but it’s not Kubernetes that’s solving that problem.

And this one isn’t an exception . I mentioned people in the “DevOps space” being especially attracted to the above features because this is basically the end goal of all DevOps principles. An organisation where developers understand and take ownership of their deployments. Know how to setup the monitoring they need to test and debug their apps. Consider and optimise resource usage. Genereally understand where their code fits in with the infrastructure. The thing is; that’s a lot to ask for, and none of this just magically happens when you flip the Kubernetes switch on. A lot of this is a change in attitude and takes a lot of time and open-mindedness to develop trust and the skills needed to adapt. You will find that you face the same amount of (if not more) resistance from developers to adopt Kubernetes as you would any other platform you try to mandate. Especially since Kubernetes will enforce all of the above practices in one big magnificent bang instead of allowing you to introduce them one concept at a time.

Kubernetes doesn’t teach developers how to optimise their resource usage.

Kubernetes doesn’t make developers decide they want to better understand the infrastructure that their code runs on.

Kubernetes doesn’t show developers the benefits of having standardised, reproducible deployments. Or the pain of not having them.

Kubernetes does offer one solution once these things have been achieved. By no means does it magically solve your issues with uniform workflows and DevOps practices in general. If you think it does, you'll implement Kubernetes and end up with n+1 workflows that your engineers need to learn, maintain and support.

Scale and Cost

I grouped these last two points because they're heavily dependent on each other.

The most common argument for scale I see is:

Well, Google can use it for services at their scale, so it's proven to be reliable at any scale.

While simple, it is a perfectly sound argument, and this is actually a very easy way to choose what tech your company will rely on. However, it should also be taken with a grain of salt. Yes, if your company runs at the scale of Google and has enough engineers to dedicate to running Kubernetes clusters, then by all means, scale away. But let's be honest: almost everyone reading this post works at a company that serves orders of magnitude less traffic than Google does. This is obvious, but engineers usually plan ahead and want to be able to scale, so why not choose a solution that we know won't run into those issues? Well, a quick Google search will show you that almost every docker orchestration system has been benchmarked and proven to run at a scale that most people will never come close to (e.g. Nomad, Docker Swarm, ECS).

I can already hear you: “But it takes lots of work and tweaking to be able to do that.”

Kubernetes is no exception.

Google didn't build Kubernetes, turn it on and forget about it. It takes a huge amount of work from developers and cloud/infrastructure engineers alike to run successfully on top of Kubernetes (plenty of people will vouch for this). There's often a misconception that Kubernetes is some mystical unicorn that will make your apps invincible: resilient to failure and immune to scaling issues. This is not true in the slightest, and I think disappointment awaits those who think it is.

On to cost. Again, the idea is simple: if I can run more instances of my apps on the same number of virtual machines, I pay less for servers. Additionally, if I have an intelligent system scaling those instances and virtual machines to match my resource consumption, I waste less money.

Quick mafs

But as you'd probably expect, it's not that simple. In order to get any value out of using containers, developers who are deploying their applications need to think about how many resources those applications need to run comfortably. This isn't a simple task at all (as I touched on when discussing uniform workflows), and the act of doing so reaps benefits whether you're on Kubernetes or not. Knowing exactly how much of each resource your application needs means you can choose the minimum necessary unit of server, virtual machine, container or whatever you've chosen.

In some cases, it actually takes more engineering effort to share your resources across the same cluster of virtual machines and achieve the same level of cost effectiveness. Imagine you run a Kubernetes cluster on top of EC2 instances with fixed CPU/memory/network resources. One day your CPU-bound apps start scaling more than the others, and your cluster ends up under-utilised on every other resource, even though your developers have thoroughly tested and specified exactly what their applications need. An additional layer of complexity comes from needing a separate group of instances for each different resource-requirement profile (which, again, needs to be known and considered by developers), as sketched below.
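To give a feel for that extra layer, here is a hypothetical sketch of pinning a CPU-heavy workload to its own node group using a node label and a nodeSelector; the label key, values and numbers are all made up. Each such group is another set of instances to size, scale and pay for.

```yaml
# Hypothetical sketch: the label key/value and names are made up.
# Nodes in the CPU-optimised group would be labelled beforehand, e.g.:
#   kubectl label node <node-name> workload-profile=cpu-bound
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-heavy-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cpu-heavy-app
  template:
    metadata:
      labels:
        app: cpu-heavy-app
    spec:
      nodeSelector:
        workload-profile: cpu-bound   # only schedule onto the CPU-optimised node group
      containers:
        - name: cpu-heavy-app
          image: registry.example.com/cpu-heavy-app:1.0.0
          resources:
            requests:
              cpu: "2"
              memory: 512Mi
```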

On top of this, each cluster you create requires additional resources to orchestrate your containers (this is true of all orchestration platforms). Kubernetes, at a minimum, requires 3 extra instances for the control-plane servers in production and 1 in non-production environments for orchestration and API access. If you're running a test and a production cluster, that's an additional 4 instances. If you further split these out into other clusters (internal, external, maybe a staging environment), you end up with a non-negligible amount of extra cost just to run Kubernetes. This is where scale comes into it: if you're running at a scale where this overhead is significant, choosing to implement Kubernetes could well increase your cost instead of decreasing it, especially if you don't properly migrate to it and end up still having to run your “legacy” platform even after choosing to use Kubernetes. I talked about mandating legacy and touched on a bunch of these points in this post if you're interested.

What if my application is so small in terms of resource requirements that my minimum unit of compute/memory/network doesn't fit it sensibly? Well then, to be frank, it sounds like your application shouldn't run on servers you maintain at all: make it a serverless function. I'll admit this may be simplifying things a little too much. If you're at the point where it doesn't make sense to run serverless, and it doesn't make sense cost-wise to run on individual machines, you should definitely consider containers. But even then, consider containers, not Kubernetes (I did warn you that this was the basis of most of my points).

I'd just like to finish by saying that I think Kubernetes is an incredible piece of software, and some of the things you can achieve with it are mind-blowing. All I ask is that we, as engineers, consider pragmatism when we make these decisions. It's very easy to get sucked into the ecosystem that has been built around it, because we all like building cool shit; that's why we do what we do (hopefully). However, much to the disappointment of a certain colleague of mine, we probably shouldn't make big technical decisions based purely on how cool something is.

I welcome all feedback, discussion and constructive criticism — I’ve recently started using Twitter so please, tweet at me with your thoughts!
