The Curious Case of Linux Containers

sumbrymatic
Dec 4, 2015 · 7 min read


The next person that says “Linux Containers will solve all our problems” is going to get taken out back and shot. This is a surefire way to make me instantly believe that you don’t actually understand anything about containers or the real issue of deploying and operating distributed systems at scale.

Definitions

Let’s agree on some definitions first because the word container has already been overloaded to hell:

  • container: functionality provided by the Linux kernel via cgroups and namespaces that enables a set of processes to have their own resource-level isolation (CPU, memory, IO); sketched in code after this list
  • lxc: a set of command-line tools and libraries for managing containers
  • application/service container: a method of distributing an application or service where all the resources necessary for running in production are bundled together. Examples include: static binary, JAR file, compressed archive, file system image, system package with bundled dependencies
  • platform: A system which enables applications or services to be built, packaged, tested, provisioned, deployed, configured, secured, monitored, orchestrated and operated in a production environment. Basically all of the steps involved in SDLC (software development lifecycle) and then some
  • docker: a platform for and enabled by containers
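
To make the first definition concrete, here’s a minimal sketch of the kernel primitives a container is built on: a cgroup to cap resources and unshared namespaces to isolate what the processes can see. It assumes a cgroup v2 host with the memory and cpu controllers enabled and root privileges; the “demo” group name is made up.

```python
import os
import subprocess

# A cgroup is just a directory in the cgroup v2 hierarchy; writing to its
# control files caps the resources of every process placed inside it.
CGROUP = "/sys/fs/cgroup/demo"  # "demo" is a made-up group name
os.makedirs(CGROUP, exist_ok=True)

with open(f"{CGROUP}/memory.max", "w") as f:
    f.write("256M")             # cap memory at 256 MiB
with open(f"{CGROUP}/cpu.max", "w") as f:
    f.write("50000 100000")     # allow 50% of one CPU

# Namespaces isolate what the processes can see (PIDs, mounts, hostname, ...).
# unshare(1) is part of util-linux; the child gets fresh PID and mount
# namespaces, and adding its PID to the cgroup makes the limits apply to it.
child = subprocess.Popen(["unshare", "--pid", "--mount", "--fork", "sleep", "60"])
with open(f"{CGROUP}/cgroup.procs", "w") as f:
    f.write(str(child.pid))
```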

Containers by themselves are very powerful but a platform they are not. I’m going to guess that when most people talk about how containers will solve all their problems, they really mean a platform that is enabled by containers will solve all their problems.

If that’s true, why do they care whether the platform they want uses containers at all? Containers are just another technology; they are a solution to a problem. You could easily build a platform around VMs or JAR files. In fact, the industry has already been doing this for years! If you really want a platform then you don’t care what technology it uses to solve your problem, only that it solves your fucking problem!!!

The Problem

So if we’re really trying to solve a problem, then what is it? Naming things and defining the problem are two of the hardest things we do in computing. Startups will spend years and tons of engineering hours just trying to define what the problem actually is. Once you do that, coming up with a solution is easy.

So I’m going to try to define the problem that I’d like a platform to solve. I’m being stingy here as this is for me. This suits my needs today and is based around problems that I’ve experienced over the years. It may not be for everyone.

What we really need is a distributed systems platform, or DSP (sorry hardware geeks). This is basically a platform that runs on top of a public or private cloud (or both) and enables you to build and operate as many (micro)services as you’d like in production, both stateful and stateless.

Now we don’t care if our platform is microservice-centric, only that it enables services. In the small team’s world, the unit of deployment is a service (and an API). We also want our platform to grow and scale easily as the business does, without having to completely re-architect every time our server count grows by an order of magnitude. The problems we solve for hundreds of servers shouldn’t be that different from the ones we solve for thousands or tens of thousands.

Use Cases

Now rather than iterate on every individual attribute I want in my platform I am going to instead describe a few use cases that the platform needs to solve.

1) Pivoting a Stateful Service

Let’s use MySQL in this example. I’ve got a typical setup with a single master that handles writes and multiple read-only slaves. Now it goes without saying that I should be able to run databases on my platform. Managing stateful services should be core to any platform. If your platform cannot manage stateful services today then guess what, it is not a platform. You should be ashamed of yourself. You have only solved a problem that I could solve myself in two days with a bash script.

Not only should I be able to deploy a typical MySQL setup with my platform, but my platform should be expressive enough that if I lose my master database, it knows how to promote one of my slave databases to be the new master and communicate this to the rest of the cluster. There should be a brief service interruption lasting at most a few seconds, and the rest of my cluster should know who the new master is and start communicating with it.
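
Something like the following is what the platform would have to automate under the hood. This is a minimal sketch using PyMySQL with made-up hostnames and credentials, and it assumes GTID-based replication; a real failover also has to fence the dead master and pick the most caught-up replica.

```python
import pymysql

# Made-up hostnames and credentials; in reality the platform's service
# discovery owns this list and knows which replica is most caught up.
REPLICAS = ["db-replica-1", "db-replica-2"]
NEW_MASTER = REPLICAS[0]

def run(host, *statements):
    conn = pymysql.connect(host=host, user="admin", password="secret", autocommit=True)
    try:
        with conn.cursor() as cur:
            for stmt in statements:
                cur.execute(stmt)
    finally:
        conn.close()

# 1. Promote the chosen replica: stop replication and make it writable.
run(NEW_MASTER, "STOP SLAVE", "RESET SLAVE ALL", "SET GLOBAL read_only = 0")

# 2. Repoint every other replica at the promoted node (GTID auto-positioning).
for host in REPLICAS[1:]:
    run(host,
        "STOP SLAVE",
        f"CHANGE MASTER TO MASTER_HOST='{NEW_MASTER}', MASTER_AUTO_POSITION=1",
        "START SLAVE")

# 3. The platform still has to fence the dead master and tell every client
#    (service discovery, config push) who the new master is.
```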

2) Upgrading a Stateful Service

Continuing with the MySQL example above, it should be possible to do things like perform a MySQL upgrade with my platform. If I want to do something like upgrade MySQL from 5.5 to 5.6, does my platform let me do this?

Traditionally to get this accomplished I would do something like boot additional MySQL slaves with the new version running on them, have them pull a backup from another server and do some type of conversion or upgrade operation. Then my new slaves would connect to the current master and begin replicating. My platform needs to have a way that lets me express this workflow.

Once my slaves have finished replicating the current db and are “caught up”, I want to be able to programmatically perform some type of verify operation. Next, my platform should shoot the old master in the head and promote one of my new upgraded slaves to be the new master. I’d be happy if my platform supported just this, but the ability to roll back to the old master in an emergency (data loss can be acceptable) would be a nice-to-have.
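
I don’t much care what syntax the platform uses to express this, but here is a hypothetical sketch of the workflow’s shape. Every name on the `platform` object is invented for illustration; nothing here is a real API.

```python
# Hypothetical workflow definition; nothing on the `platform` object is a real API.
def upgrade_mysql(platform, cluster="mysql-main", new_version="5.6"):
    # 1. Boot replacement slaves on the new version, restored from a fresh backup.
    new_slaves = platform.launch(service="mysql", version=new_version, count=2,
                                 restore_from=platform.latest_backup(cluster))

    # 2. Attach them to the current master and wait until replication is caught up.
    for slave in new_slaves:
        slave.replicate_from(platform.master(cluster))
    platform.wait_until(lambda: all(s.replication_lag() == 0 for s in new_slaves))

    # 3. Programmatic verification hook before anything destructive happens.
    platform.run_checks(cluster, checks=["row_counts", "checksum_sample_tables"])

    # 4. Shoot the old master in the head and pivot to an upgraded slave.
    old_master = platform.master(cluster)
    platform.promote(new_slaves[0])
    platform.retire(old_master, keep_for_rollback=True)  # nice-to-have: emergency rollback
```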

It actually gets trickier if I want to do something like a schema change but the workflow is still the same. Boot new slave databases, perform some type of DDL update or ALTER TABLE, verify and then pivot.

3) Responding to Heartbleed

The pain from Heartbleed is still fresh in my mind, so let’s imagine a Heartbleed 2.0 scenario. You are running 500 services across 10,000 service containers and an OpenSSL vulnerability is announced. You need to upgrade OpenSSL across your entire cluster immediately and re-issue twenty certificates.

How will my platform enable this? It needs to:

  • Tell me every single service that has libssl as a dependency and that is using a tainted SSL certificate
  • Let me build a new libssl package and make it available. Sometimes my OS package repositories may not be fast enough
  • Rebuild every service container bundled with the new libssl package that we just built
  • Deploy all my affected service containers into my testing and staging environments and run them through a full suite of tests ensuring that there is no unexpected behavior
  • Replace all my affected service containers in my production environment. This could be thousands of service containers. We would need tight coordination and dependency mapping as we started replacing and upgrading giant chunks of our infrastructure as fast as possible
  • Identify any services using tainted SSL certificates and update their certificates via a re-deployment or configuration update

How easy does my platform make this? The last Heartbleed had everyone scrambling; let’s try not to do that again.
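
Here’s a rough sketch of what that response could look like if the platform kept a queryable index of what every service container was built from. The `platform` client and all of its methods are invented, and the version and date are the original Heartbleed’s (OpenSSL 1.0.1g, disclosed 2014-04-07) standing in for whatever the 2.0 scenario would require.

```python
# Hypothetical incident-response script; `platform` and all of its methods are invented.
affected = platform.query(depends_on="libssl", version_below="1.0.1g")
tainted = platform.query(uses_certificate_issued_before="2014-04-07")

# Rebuild every affected service container against the patched package we just
# published, push through testing/staging, then roll production in dependency order.
for service in sorted(set(affected) | set(tainted), key=platform.dependency_order):
    build = platform.rebuild(service, packages={"libssl": "1.0.1g-internal1"})
    platform.deploy(build, environment="staging", run_tests=True)
    platform.deploy(build, environment="production", strategy="rolling")
    platform.rotate_certificates(service)
```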

4) Testing with Dependent Services

In the world of services and microservices I need to understand my dependencies and be able to express and audit them. Most services do not exist in isolation. In the same way that an OS-level package like Nginx may have a dependency on libc and libssl, my service has the same types of dependencies on other services and resources, and they need to be expressed.

Think about something as simple as an API service. The API service may be stateless, but it probably depends on an accounts service that contains the business logic for customer authentication. The accounts service may itself depend on another MySQL service, or talk to MySQL directly. I need to be able to express this relationship, launch all of these components together in my testing environment, and verify that they work. I am not testing a simple service in isolation; I am now validating the functionality of a cluster of services as a logical unit.

api depends on accounts-service>=2.1 which depends on accounts-db>4.21
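
That one-line dependency chain could be captured in something like the following hypothetical manifest. The format, the version numbers, and the `platform` calls are all made up; the point is that dependencies are declared, versioned, and resolvable by the platform.

```python
# Hypothetical service manifest; the point is that dependencies are declared,
# versioned, and resolvable, not this particular syntax.
MANIFEST = {
    "service": "api",
    "version": "1.4.0",
    "depends_on": {
        "accounts-service": ">=2.1",  # which itself declares accounts-db > 4.21
    },
}

# Resolve the transitive graph and stand up the whole cluster of services as
# one logical unit in the testing environment, then exercise it together.
cluster = platform.materialize(MANIFEST, environment="testing", resolve_transitive=True)
cluster.run_tests("api-integration-suite")

# In production we might deploy only the api container itself, against the
# accounts-service versions already running there.
platform.deploy(MANIFEST, environment="production", subset=["api"])
```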

It gets worse when the reality in production is that we are often running several different versions of a service at the same time, so we may need to test against multiple service versions simultaneously in our testing environment. To further complicate things, what we deploy in production may end up being only a subset of what we deployed in our testing environment. In the example above I deploy three services in testing but only one service in production.

On that note, how easy would it be using my platform to spin up an exact copy of what I have in production, as it existed one month ago? What if I wanted to spin up ten copies of that environment? Fifty?

5) Bootstrap Apache Hadoop

This one goes without saying: can I easily bootstrap the beast known as Apache Hadoop using my platform? Hadoop has all kinds of crazy dependencies and coordinated steps that need to happen to get a cluster running, and it also has state!

In Summary

This is just a start for what I’d like to see solved in my ideal platform. There are many things I left out, but if we want to tackle the problem of managing a distributed system of (micro)services, this is level one. These are problems and issues that many of us have today!

Did you know that even with all the advances in technology, the amount of time a housewife spends maintaining a household today is roughly the same as it was in the 1950s? We have all this technology, but it hasn’t actually made things that much easier. The same can be said of the cloud: it has enabled lots of amazing things, but only a few things have actually gotten easier. I’d like to see more!

Also be warned — I have no idea what I’m talking about. Take everything in this article with a grain of salt. This message will self-destruct.

Also I’m not sure I defined the problem yet but I want a platform that manages (micro)services.
