The Gravity of Kubernetes
Kubernetes has become the standard way of deploying new distributed applications.
Most new internet businesses started in the foreseeable future will leverage Kubernetes (whether they realize it or not). Many old applications are migrating to Kubernetes too.
Before Kubernetes, there was no standardization around a specific distributed systems platform. Just like Linux became the standard server-side operating system for a single node, Kubernetes has become the standard way to orchestrate all of the nodes in your application.
With Kubernetes, distributed systems tools can have network effects. Every time someone builds a new tool for Kubernetes, it makes all the other tools better. And it further cements Kubernetes as the standard.
Google, Microsoft, Amazon, and IBM each have a Kubernetes-as-a-service offering, making it easier to shift infrastructure between the major cloud providers. We are likely to see Digital Ocean, Heroku, and longer tail cloud providers offer a managed, hosted Kubernetes eventually.
In this editorial, I explore the following questions:
- Why is this happening?
- What are the implications for developers?
- How are cloud providers affected?
- What are the new business models that are possible in a Kubernetes-standardized world?
Standardized software platforms are both good and bad.
Standards allow developers to have expectations around how their software will run. If a developer builds something for a standardized platform, they can make smart estimations about the total addressable market for that piece of software.
Standardized proprietary platforms can lead to massive profit returns for the platform provider. In 1995, Windows was such a good platform that Microsoft could sell a CD in a cardboard box for $100. In 2018, the iPhone is so good that Apple can take 30% from all app sales on its platform.
Proprietary standards lead to fragmentation.
Your iPhone app does not run on my Kindle Fire. I can’t use your Snapchat augmented reality sticker on Facebook Messenger. My favorite digital audio workstation only runs on Windows — so I have to keep a Windows computer around just to make music.
When developers see this fragmentation, they groan. They imagine the greedy capitalists, who are making money at the expense of software quality. Developers think, “why can’t we all just get along? Why can’t we have things be open and free?”
Developers think, “we don’t need proprietary standards. We can have open standards.”
Growth of Apache, part of the LAMP (Linux, Apache, MySQL, PHP) Stack
This happened with Linux. These days, new server side applications are mostly in Linux. There was a time when that was controversial (see the far left hand side of chart above).
More recently, we saw a newer open standard with Docker. Docker gave us an open, standardized way of packaging, deploying, and distributing a single node. This was hugely valuable! But for all of the big problems that Docker solved, it highlighted a new problem — how should we be orchestrating these nodes together?
After all — your application is not just a single node. You know you want to be deploying a Docker container — but how are your containers communicating with each other? How are you scaling up instances of these containers? How are you routing traffic between container instances?
After Docker became popular, a scramble of open source projects and proprietary platforms emerged to solve the problem of container orchestration.
Mesos, Docker Swarm, and Kubernetes each offered a different set of abstractions for managing containers. Amazon ECS offered a proprietary managed service that took care of installation and scaling of Docker containers for AWS users.
Some developers did not adopt any orchestration platform, and used BASH, Puppet, Chef, and other tools to script their deployments.
Whether a developer was deploying their container by using an orchestration platform or a script, Docker sped up development, and made things more harmonious between developers and operations.
As more developers deployed containers with Docker, the importance of the orchestration platform was becoming clear. A container is as fundamental to a distributed system as an object is to an object oriented program. Everyone wanted to be using a container platform, but there was a struggle between these platforms, and it was hard to see which would come out on top — or if it would be a decades long struggle, like iOS vs. Android.
This struggle between the different container orchestration platforms was causing fragmentation — not because any of the popular orchestration frameworks were proprietary (Swarm, Kubernetes, and Mesos are all open source), but because each container orchestration community had invested so much in their own system.
So, from 2013–2016, there was some anxiety among Docker power users. Choosing a container orchestration framework is a huge bet — and if you chose the wrong orchestration system, it would be like opening a movie store and choosing HD DVD over Blu-ray.
The war between container orchestrators felt like a winner-take-all affair.
And as in any war, there was a fog that was hard to see through. When I was reporting on container orchestration wars, I recorded podcast after podcast with container orchestration experts, where I would ask some form of the question, “so, which container orchestration system is going to win?”
I did this until someone I respect told me that asking about who was going to “win the orchestration wars” was a less interesting question than evaluating the technical tradeoffs between the orchestrators.
Looking back, I regret buying into the narrative of a war between container orchestrators.
As the debates about container orchestrators raged on, the smartest people in the room were mostly having calm, scientific discussions — even when reporters like me were getting worked up, and thinking that this was a story about tribalism.
The conflicts between container orchestrators were not about tribalism — they were more about differences of opinion, and developer ergonomics.
OK, maybe the container orchestration war wasn’t only about differences of opinion. There is a boatload of money to be made in this space. We are talking about contracts with billion dollar legacy organizations — banks, and telcos, and insurance companies — who are making their way onto the cloud.
If you are in the business of helping telcos move onto the winning platform, business is good. If you champion the wrong platform, you end up with a warehouse full of HD-DVDs.
The time where the conflict was the worst was near the end of 2016, when there were rumors about Docker potentially forking, so that Docker the company could change the Docker standard to fit better with their container orchestration system Docker Swarm — but even in those times, it would have made sense to be an optimist.
Creative destruction is painful, but it is a sign of progress — and in the struggle for container orchestration dominance, there was a ton of creative destruction.
And when the dust cleared around the end of 2016, Kubernetes was the clear winner.
Today, with Kubernetes becoming the safe choice, CIOs feel more comfortable adopting container orchestration at their enterprises — which makes vendors feel more comfortable investing in Kubernetes-specific tools to sell to those CIOs.
This brings us to the present:
- The open source developers are rowing in the same direction, excited about what to build.
- Major enterprises (both new and legacy) are buying into Kubernetes.
- The major cloud providers are ready with low-cost Kubernetes-as-a-service.
- The numerous vendors of monitoring, logging, security, and compliance software are thrilled because the underlying software stack that they have to integrate with is becoming more predictable.
Today, the most lucrative provider of proprietary backend developer infrastructure is Amazon Web Services. Developers do not view AWS resentfully, because AWS is innovative, empowering, and cheap. If you are paying AWS a lot of money, that probably means your business is doing well.
With AWS, developers do not feel the level of lock-in that was administered by the dominant proprietary platforms of the 90s. But there is some lock-in. Once you are deeply embedded in the AWS ecosystem, using services like DynamoDB, Amazon Elastic Container Service, or Amazon Kinesis, it becomes a daunting task to move away from Amazon.
With Kubernetes creating an open, common layer for infrastructure, it becomes theoretically possible to “lift and shift” your Kubernetes cluster from one cloud provider to another.
If you decided to lift-and-shift your application, you would have to rewrite parts of your application to stop using the Amazon-specific services (like Amazon S3). For example, if you wanted an S3 replacement that would run on any cloud, you could configure a Kubernetes cluster with Rook, and start to store objects on Rook with the same APIs that you would use to store them on S3.
This is a nice option to have, but I have not yet heard of anyone actually lifting and shifting their application away from a cloud — except for Dropbox, and their migration was so epic it took two and a half years.
Certainly there is someone out there other than Dropbox who spends so much money on Amazon S3 that they will consider spinning up their own object store, but it would be a huge effort to do such a migration.
Kubernetes probably won’t be a tool for widespread lifting and shifting any time soon. A more likely scenario is that Kubernetes will become a ubiquitous control plane, that enterprises will use on multiple clouds.
NodeJS is a useful analogy. Why do people like NodeJS for their server side applications? It’s not necessarily because Node is the fastest web server. It’s because people like to use the same language on both the client and the server. Just like NodeJS lets you move between your client and server code without having to switch languages, Kubernetes will let you switch between clouds without having to change how you think about operations.
On each cloud, you will have some custom application code running on Kubernetes that interacts with the managed services provided by that cloud.
Companies want to be multi-cloud — partly for disaster recovery, but also because there is actual upside to having access to managed services on the different clouds.
At Thumbtack, the production infrastructure on AWS serves user requests. The log of transactions that occur get pushed from AWS to Google Cloud, where the data engineering occurs. On Google Cloud, the transaction records are queued in Cloud PubSub, a message queueing service. Those transactions are pulled off the queue and stored in BigQuery, a system for storage and querying of high volumes of data.
BigQuery is used as the data lake to pull from when orchestrating machine learning jobs. These machine learning jobs are run in Cloud Dataproc, a managed service that runs Apache Spark. After training a model in Google Cloud, that model is deployed on the AWS side, where it serves user traffic. On the Google Cloud side, the orchestration of these different managed services is done by Apache Airflow, an open source tool that is one of the few pieces of infrastructure that Thumbtack does have to manage themselves on Google Cloud.
Today, Thumbtack uses AWS for serving user requests and Google Cloud for data engineering and queueing in PubSub. Thumbtack trains its machine learning models in Google, and deploys them to AWS.
This is just the way things are today. Thumbtack might eventually use Google Cloud for user-facing services as well.
More companies will gradually move towards multiple clouds — and some of them will manage a unique Kubernetes cluster on each cloud.
You might have a GKE Kubernetes cluster on Google to orchestrate workloads between BigQuery, Cloud PubSub, and Google Cloud ML, and you might have an Amazon EKS cluster to orchestrate workloads between DynamoDB, Amazon Aurora, and your production NodeJS application.
Cloud providers are not replaceable commodities. The services provided by the different clouds will get increasingly exotic and differentiated. Enterprises will benefit from having access to the different services on the different cloud providers.
Distributed Systems Distribution
Services like Google BigQuery and AWS Redshift are popular because they give you a powerful, scalable, multi-node tool with a simple API. Developers often choose these managed services because they are so easy to use.
You don’t see these types of paid tools for single nodes. I don’t pay for NodeJS, or React, or Ruby on Rails.
Tools for a single node are much easier to operate than tools for a distributed system. It’s harder to deploy Hadoop across servers than to run a Ruby on Rails application on my laptop. With Kubernetes, this is going to change.
If you are using Kubernetes, you can use a distributed systems package manager called Helm. This is like npm for Kubernetes applications. If you are using Kubernetes, you can use Helm to easily install a complicated multi-node application, no matter what cloud provider you are on.
Here’s a description of Helm:
Helm helps you manage Kubernetes applications — Helm Charts helps you define, install, and upgrade even the most complex Kubernetes application. Charts are easy to create, version, share, and publish — so start using Helm and stop the copy-and-paste madness.
A package manager for distributed systems. Amazing! Let’s take a look at what I can install.
Distributed systems are hard to set up. Ask anyone who has set up their own Kafka cluster. Helm is going to make installing Kafka as easy as installing a new version of Photoshop on your MacBook. And you will be able to do this on any cloud.
Before Helm, the closest thing to a distributed systems package manager (that I am aware of) was the marketplace that you find on AWS or Azure, or the Google Cloud Launcher. Here again we see how proprietary software can lead to fragmentation. Before Helm, there was no standard, platform-agnostic way of one-click installing Kafka.
You can find a way to one-click install Kafka on AWS, Google, or Azure. But each of these installations had to be written independently, for that specific cloud provider. And to install Kafka on Digital Ocean, you need to follow a 10-step tutorial.
Helm represents a cross-platform system for distributing multi-node software on any Kubernetes instance. You could use an application installed with Helm in any cloud provider or on your own hardware. You could easily install Apache Spark or Cassandra — systems that are notoriously difficult to set up and operate.
Helm is a package manager for Kubernetes — but it also looks like the beginnings of an app store for Kubernetes. With an app store, you could sell software for Kubernetes.
What kind of software could you sell?
You could sell enterprise distributions of distributed systems platforms like Cloudera Hadoop, Databricks Spark, and Confluent Kafka. New monitoring systems like Prometheus and multi-node databases like Cassandra that are hard to install.
Maybe you could even sell higher-level, consumer-grade software like Zendesk.
The idea of a self-hosted Zendesk sounds crazy, but someone could build that, and sell it in the form of a proprietary binary, at the cost of a flat fee instead of a subscription. If I sell you a $99 Zendesk-for-Kubernetes, and you can easily run it on your Kubernetes cluster on AWS, you are going to end up saving a lot of money on support ticketing software.
Enterprises often run their own WordPress to manage a company blog. Is the software for Zendesk that much more complicated than WordPress? I don’t think so — but enterprises are much more scared of managing their own help desk software than managing their own blogging software.
I run a pretty small business but I subscribe to a LOT of different software-as-a-service tools. An expensive WordPress host, an expensive CRM for advertising sales, MailChimp for the newsletter. I pay for these services because they are super reliable and secure, and they are complex multi-node applications. I wouldn’t want to host my own. I wouldn’t want to support them. I don’t want to be responsible for technical troubleshooting when my newsletter fails to send. I want to run less software.
In contrast, I’m not afraid that my single-node software is going to break.
Software that I use from a single node tends to be much less expensive because I don’t have to buy it as a “service”. The software that I use to write music has a 1-time fixed cost. Photoshop has a 1-time fixed cost. I pay for the electricity to run my computer, but other than that I have no ongoing capital expense to run Photoshop.
When multi-node applications are as reliable as single-node applications, we will see changes in the pricing models.
Maybe someday I will be able to purchase a Zendesk-for-Kubernetes. The Zendesk-for-Kubernetes will give me everything I need from a help desk — it will spin up an email server, it will give me a web frontend to manage tickets. And if anything goes wrong, I can pay for support when I need it.
With Kubernetes, it becomes easier to deploy and manage distributed applications. With Helm, it becomes easier to distribute those applications to other users. But it’s still pretty hard to develop distributed systems.
This was the focus of a recent CloudNative Con/KubeCon Keynote by Brendan Burns, called “This Job is Too Hard: Building New Tools, Patterns, and Paradigms to Democratize Distributed Systems Development”.
In his presentation, Brendan presented a project called Metaparticle. Metaparticle is a standard library for cloud-native development, with the goal of democratizing distributed systems. In an introduction to Metaparticle, Brendan wrote:
Cloud native development is bespoke, complicated and limited to a small number of expert developers. Metaparticle changes this by introducing a set of utilities in familiar programming languages that meet the developer where they are and enables them to begin to develop cloud-native systems using familiar language features.
Metaparticle builds on top of Kubernetes primitives to make distributed synchronization easier as well. It supplies language independent modules for locking and leader election as easy-to-use abstractions in familiar programming languages.
After decades of distributed systems research and application, patterns have emerged about how we build these systems. We need a way to do locking of a variable, so that two nodes will not be able to write to that variable in a nondeterministic fashion. We need a way to do master election, so that if the master node dies, the other nodes can pick a new node to orchestrate the system.
Today, we use tools like etcd and Zookeeper to help us with master election and locking — and these tools have an onboarding cost.
Brendan illustrates this with the example of Zookeeper, which is used by both Hadoop and Kafka to do master election. Zookeeper takes significant time and effort to learn how to operate. During the construction of both Hadoop and Kafka, the founding engineers of those projects engineered their systems to work with Zookeeper to maintain a master node.
If I am writing a system to do distributed mapreduce, I would like to avoid thinking about node failures and race conditions. Brendan’s idea is to push those problems down into a standard library — so the next developer who comes along with a new idea for a multi-node application has an easier time.
Important meta-point: metaparticle is built with the assumption that you are on Kubernetes. This is a language-level abstraction that is built with an assumption about the underlying (distributed) operating system — which once again brings us back to standards. If everyone is on the same distributed operating system, we can make big assumptions about the downstream users of our projects.
This is why my mind is blown by Kubernetes. It is a layer of standards across a world of heterogeneous systems.
Functions as a service (often called “serverless” functions) are a powerful, cheap abstraction that developers can use alongside Kubernetes, on top of Kubernetes, and in some cases instead of Kubernetes altogether.
Let’s quickly review the modern landscape of serverless applications, and then consider the relationship between serverless and Kubernetes.
Functions as a service are deployable functions that run without an addressable server.
Functions as a service deploy, execute, and scale with a single call by the developer. Until you make that call to the serverless function, your function is not running anywhere — so you are not using up resources other than the database that is storing your raw code. When you call a function as a service, your cluster will take care of scheduling and running that function.
You don’t have to worry about spinning up a new machine and monitoring that machine, and spinning the machine down once it becomes idle. You just tell the cluster that you want to run a function, and the cluster executes it and returns the result.
When you “deploy” a serverless function, the function code is not actually deployed. Your code sits in a database in plain text. When you call the function, your code is being taken out of the database entry, loaded onto a Docker container, and executed.
AWS Lambda pioneered this idea of a function-as-a-service in 2014. Since then, developers have been thinking of all kinds of use cases. For a comprehensive list of how developers are using serverless, there is a shared Google Doc created by the CNCF Serverless Working Group (34 pages at the time of this article).
From my conversations on Software Engineering Daily, these functions as a service have shown at least two clear applications:
- Compute that can scale up quickly and cheaply in response to bursty workloads (for example, Yubl’s social media scalability case study)
- Event-driven “glue code” with variable workload frequency (for example, an event sourcing model with a variety of database consumers)
To create a FaaS platform, a cloud provider provisions a cluster of Docker containers called invokers. Those invokers wait to get blobs of code scheduled onto them. When you make a request for your code to execute, there is a period of time that you have to wait for that code to be loaded onto an invoker and executed. This time spent waiting is the “cold start” problem.
The cold start problem is one of the tradeoffs that you make if you decide to run part of your application on FaaS. You don’t pay for the uptime of a server that isn’t doing any work — but when you want to call your function, you have to wait for your code to be scheduled onto an invoker.
On AWS there are invokers designated for whatever requests for AWS Lambda come in. On Microsoft Azure there are invokers designated for Azure Functions requests. On Google Cloud there are invokers reserved for Google Cloud Functions.
For most developers, using the function-as-a-service platforms from AWS, Microsoft, Google, or IBM will be fine. The costs are low, and the cold start problem is not problematic for most applications. But some developers will want to get costs even lower. Or they might want to write their own scheduler that defines how code gets scheduled onto invoker containers. These developers can roll their own serverless platform.
Open source FaaS projects such as Kubeless let you provision your own serverless cluster on top of Kubernetes. You can define your own pool of invokers. You can determine how containers are scheduled against jobs. You can decide on how to solve the cold start problem for your own cluster.
The open source FaaS for Kubernetes are just one type of resource scheduler. They are just a preview of other custom schedulers we will see on top of Kubernetes. Developers are always building new schedulers in order to build more efficient services on top of those schedulers.
So — what other types of schedulers will be built on top of Kubernetes? Well, as they say, the future is already here, but it’s only available as an AWS managed service.
AWS has a new service called Amazon Aurora Serverless — which is a database that scales storage and compute up and down automatically. From Jeff Barr’s post about AWS Serverless Aurora:
When you create an Aurora Database Instance, you choose the desired instance size and have the option to increase read throughput using read replicas. If your processing needs or your query rate changes you have the option to modify the instance size or to alter the number of read replicas as needed. This model works really well in an environment where the workload is predictable, with bounds on the request rate and processing requirement.
In some cases the workloads can be intermittent and/or unpredictable, with bursts of requests that might span just a few minutes or hours per day or per week. Flash sales, infrequent or one-time events, online gaming, reporting workloads (hourly or daily), dev/test, and brand-new applications all fit the bill. Arranging to have just the right amount of capacity can be a lot work; paying for it on steady-state basis might not be sensible.
Because storage and processing are separate, you can scale all the way down to zero and pay only for storage. I think this is really cool, and I expect it to lead to the creations of new kinds of instant-on, transient applications. Scaling happens in seconds, building upon a pool of “warm” resources that are raring to go and eager to serve your requests.
We are not surprised that AWS can build something like this, but it’s hard to imagine how it could emerge as an open source project — until Kubernetes. This is the type of system that random developers could build on top of Kubernetes.
If you wanted to build your own serverless database on top of Kubernetes, you’ve got a number of scheduling problems to solve. You need different resource tiers for networking, storage, logging, buffering, and caching. For each of those different resource tiers, you need to define how resources will get scheduled to scale up and down in response to demand.
Just like Kubeless offers a scheduler for the small bits of functional code, we might see other custom schedulers that people use as building blocks for bigger applications.
And once you actually build your serverless database, perhaps you could sell it on the Helm app store as a $99 one-time purchase.
I hope you have enjoyed this tour through some Kubernetes history and speculation about the future.
As 2018 begins, here are some of the areas we’d like to explore this year:
- How do people manage deployment of machine learning models on Kubernetes? We did a show with Matt Zeiler where he discussed this, and it sounded complicated.
- Is Kubernetes used for self-driving cars? If so, do you deploy one single cluster to manage the entire car?
- What does a Kubernetes IoT deployment look like? Does it make sense to run Kubernetes across a set of devices with intermittent network connections?
- What are the new infrastructure products and developer tools that will be built with Kubernetes? What are the new business opportunities?
Kubernetes is a fantastic tool for building modern application backends — but it is still just a tool.
If Kubernetes fulfills its mission, it will eventually fade into the background. There will come a time when we talk about Kubernetes like we talk about compilers or operating system kernels. Kubernetes will be lower level plumbing that is not in the purview of the average application developer.
But until then, we’ll continue to report on it.