The Beginner’s Guide to Modern DevOps
Coordinated Releases, Task Runners, Configuration Management, Immutable Infrastructure, Containers and Images explained for budding devs and ops.
Coordinated Releases and Snowflake Servers
Ah, those good old days when all servers were manually provisioned. An interactive shell was the preferred method, and tmux/screen/shell scripting were the tools. A nice, quiet weekend would be planned for a deployment. SysAdmins, Developers and Testers would come in and be allotted responsibilities for the various components that would be installed or upgraded that day. Then the orchestra would begin. Log in to each server, run commands, copy this here, symlink this there, compile packages, upgrade packages, ... Then have the Devs vet it and QAs test it. Rinse, lather, repeat for every component. A Conductor would direct this concert, waving a wand at different teams and coordinating the whole show over a large email thread and a conference hotline.
If you were lucky, the orchestra would be over in one take. But as systems got larger, deployments got longer, multiple re-takes were required, and we started having multiple shifts as well. And as people and systems grew, the whole thing started to break: a symbolic link missed here, a config file not copied there, the wrong version running over there. Such mistakes were extremely difficult to spot. Out of 10 servers, how would you make sure every command had run flawlessly on each one? A famous example is the 2012 Knight Capital stock trading disruption, caused by new code not being copied to one of 8 servers, which led to losses of more than $400 million in a single day!
Oh yeah, and if your server ever died, nobody had any record of how it was set up and modified over all those years, or how it should be set up again! If it’s lost, the whole setup is gone forever! These are the Snowflake Servers, and the orchestra is the Coordinated Release. They can happen independently too, i.e. you could set up a Snowflake Server without a Coordinated Release, and vice versa.
Task Runners
To counter this, people started writing their own scripts in various languages: Bash, Perl, Python, etc. These would automatically perform the various deployment tasks so that nothing got missed. They came to be known as Task Runners, because they simply run the same tasks a SysAdmin would run by hand. But these scripts have two problems:
- They are not idempotent. Let’s say you have a command to create a user. Running it the first time would work. The second time, it would bomb saying “User already exists”. Think about how many places this can happen: moving files and folders, checking out source code, installing packages, etc. If you ran the same script twice, there is no guarantee it would run correctly. So if your script failed somewhere in the middle for some legitimate reason, you now have more work troubleshooting from somewhere in the middle than you would have had doing the whole thing manually from scratch.
- They are not portable. It is common for an application to use Ubuntu for development, CentOS for QA/UAT and RHEL in production. On Debian/Ubuntu it would be apt-get install, on RedHat/CentOS it would be yum install, and on some other machines it could be a manual compile. Even the GNU coreutils have subtle differences across platforms. And think about a mix of Linux and Unix servers too! How many conditions would you put in your scripts?
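To make the idempotency problem concrete, here is a minimal Python sketch. The `FakeHost` class and its `create_user` method are hypothetical stand-ins for a real server (they are not part of any actual tool); they only mimic a command that refuses to create a user that already exists.

```python
class FakeHost:
    """Hypothetical in-memory stand-in for a server (illustration only)."""

    def __init__(self):
        self.users = set()

    def create_user(self, name):
        # Mimics a "create user" command: it fails if the user already exists.
        if name in self.users:
            raise RuntimeError(f"user '{name}' already exists")
        self.users.add(name)


def deploy_script(host):
    """A task-runner style script: a fixed list of steps with no state checks."""
    host.create_user("www")


def run_twice():
    """Run the same script twice on one host and report what happens."""
    host = FakeHost()
    deploy_script(host)          # first run succeeds
    try:
        deploy_script(host)      # second run trips over the existing user
    except RuntimeError as err:
        return str(err)
    return "ok"
```

Calling `run_twice()` returns the error message from the second run rather than `"ok"`, which is exactly the "works once, bombs the second time" behaviour described above.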
Given the above problems, some fundamental rethinking was required. What came next was drastically different from Task Runners.
Configuration Management
You would describe the End State your machine should be in. For example:
I want user www to be present with nobody group
I want directory /var/www to be present and owned by www
I want OpenJDK 1.6 to be present
Once you have described the End State in some DSL syntax, a configuration management tool will actually come up with the necessary tasks to get your system to that state. You can imagine it as a kind of “State Diffing” — comparing your server’s current state with the End State you want to be in, and making corrections wherever necessary. So if the user www doesn’t exist, it would create it. If the user existed but with a different group, it would simply change the group. And so on.
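The “State Diffing” idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: state is modelled as a plain dictionary mapping a user name to its group, and the function names `plan` and `converge` are made up for this example, not taken from any real configuration management tool.

```python
def plan(current, desired):
    """Return the tasks needed to move `current` toward `desired`.

    Both arguments map a user name to its group (users only, for brevity).
    """
    tasks = []
    for user, group in desired.items():
        if user not in current:
            tasks.append(("create_user", user, group))
        elif current[user] != group:
            tasks.append(("change_group", user, group))
    return tasks


def converge(current, tasks):
    """Apply the planned tasks to a copy of the current state."""
    new_state = dict(current)
    for _action, user, group in tasks:
        new_state[user] = group
    return new_state


desired = {"www": "nobody"}

# Fresh machine: the user has to be created.
fresh_plan = plan({}, desired)

# User exists with the wrong group: only the group is corrected.
fix_plan = plan({"www": "staff"}, desired)

# Once converged, planning again yields nothing to do.
converged = converge({}, fresh_plan)
second_plan = plan(converged, desired)
```

Here `fresh_plan` contains a single `create_user` task, `fix_plan` contains a single `change_group` task, and `second_plan` is empty because the state is already met.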
This led to the rise of CFEngine, Puppet, Chef, Ansible, Salt and many more configuration management tools. Different tools use different DSLs to represent the End State: Chef uses Ruby, Ansible uses YAML, Puppet uses its own language, etc. An example in Chef would look like this:
user "www" do
  gid "nobody"
end
directory "/var/www" do
  owner "www"
end
package "openjdk-6-jdk"
Advantages:
- Describe your Infrastructure as State
- Idempotency. When you run the above on a fresh machine, it will run tasks for every state. When you run it a second time, it simply won’t do anything because the state is already met. So you can run this as many times as you want and the results will be the same.
- Cross-platform support. Most built-in states (creating users, groups, directories, copying files, installing packages, etc.) will automatically translate to the correct tasks for the platform, e.g. “package” would translate to “apt-get install” on Debian/Ubuntu, “yum install” on CentOS, and so on.
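The cross-platform translation mentioned above can be pictured as a lookup from platform to install command. The following Python sketch is a deliberately tiny illustration; the `INSTALL_COMMANDS` table and the `install_command` function are hypothetical names for this example, and real tools detect the platform and handle far more cases.

```python
# Hypothetical mapping from platform name to its package install command.
INSTALL_COMMANDS = {
    "debian": ["apt-get", "install", "-y"],
    "ubuntu": ["apt-get", "install", "-y"],
    "centos": ["yum", "install", "-y"],
    "rhel": ["yum", "install", "-y"],
}


def install_command(platform, package):
    """Translate a platform-neutral 'package' state into a concrete command."""
    try:
        base = INSTALL_COMMANDS[platform]
    except KeyError:
        raise ValueError(f"unsupported platform: {platform}") from None
    return base + [package]


ubuntu_cmd = install_command("ubuntu", "nginx")
centos_cmd = install_command("centos", "nginx")
```

The caller only says “I want this package”; the platform-specific conditions live in one place instead of being scattered through every script. Note that this sketch assumes the package has the same name on every platform, which is often not true in practice.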
Disadvantages:
- SysAdmins who are used to running tasks tend to approach State management like this:
# Tasks disguised as state: each resource just shells out to a command
execute "apt-get update"
execute "apt-add-repository ppa:nginx"
execute "wget rvm.io | bash -s"
- Developers with little production system administration experience tend to approach State management like this:
# Show their Ruby mastery with dozens of gems and cool shorthands
# Use Design patterns and anti-patterns
# Most of the cookbooks would be:
# Nobody knows how the whole deployment works anymore
- A symptom I’ve seen in many projects: people try to learn both a DSL and System Administration, but end up masters of neither. You need a balance of system administration and development background to get this right. If you are a traditional SysAdmin used to running tasks by hand, or a Developer who has not managed systems in production, you can end up making things worse than before. In fact, I’ve seen automation code grow larger than the application code in some projects!
The switch from Task Runners to Configuration Management was a fundamental change from Tasks to State. The next leap needed another round of fundamental re-thinking.
Immutable Infrastructure
What if you always threw away the old server and started with a new one? And what if you were guaranteed that the new server would always have the exact same OS and version (say Ubuntu 14.04 LTS all the time)? Then you no longer need State management, because you are always starting with a clean slate. And there are no cross-platform issues, because you know for sure it’s always the same OS! You can go back to Shell scripts that assume a fresh machine. Just create the user, create the directory, install stuff. No need to check for pre-existence or platform compatibility.
A (bad) analogy would be the switch from Object-oriented programming to Functional programming: you no longer manipulate state, you just operate upon new copies, and the results are reliable. This gives rise to the Immutable Server. You no longer patch an existing server; instead, for every deployment you tear it down and bring up a new one.
With bare metal servers, this was expensive. But with the rise of VMs in the Data Center and Cloud, Immutable Servers became operationally possible. You could destroy the old VM, bring up a new one in a matter of minutes, and then run your configuration scripts to install everything.
At the same time, tools for development VMs, like Vagrant, enabled fast development of Immutable infrastructure. You would spin up a Vagrant VM, test all the configuration scripts, and then destroy the VM. Rinse and repeat to get your development workflow done.
Advantages:
- No more patching a live machine. No more manipulating state manually in your infrastructure at all.
- No more need for “soft upgrades” and making sure that migration from one version to another is gracefully handled. You can worry less about historical assumptions and just set up the latest and greatest on the new server.
- Zero-downtime deployments become possible. Provision a new server and switch across to it immediately.
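The “provision a new server and switch across” step can be sketched as follows. Everything here is hypothetical and in-memory: `Router` stands in for whatever sits in front of your servers, `provision` stands in for creating a VM, and the health check before switching is an assumption added for this example rather than something the workflow above prescribes.

```python
class Router:
    """Hypothetical stand-in for a load balancer pointing at one live server."""

    def __init__(self, live):
        self.live = live

    def switch_to(self, server):
        # Point traffic at the new server and hand back the old one.
        old, self.live = self.live, server
        return old


def provision(version):
    """Pretend to build a brand-new server running the given version."""
    return {"version": version, "healthy": True, "destroyed": False}


def deploy(router, provision_fn, version):
    """Immutable-style deploy: build new, switch over, throw the old away."""
    new_server = provision_fn(version)   # built off to the side
    if not new_server["healthy"]:
        return router.live               # keep serving from the old server
    old = router.switch_to(new_server)
    old["destroyed"] = True              # the old server is never patched
    return router.live


router = Router(provision("v1"))
previous = router.live
live = deploy(router, provision, "v2")
```

After the call, `live` is the new `v2` server and `previous` has been marked destroyed. At no point was the running `v1` server modified in place.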
Disadvantages:
- Wasted time and bandwidth. If your setup involves running a large number of tasks, it will take a long time to bring up a new server.
- More complicated Service Discovery and integration issues as IP addresses change.
- You can no longer have any persistent data on the server, because the server can be thrown away at any time. But this can be good or bad depending upon the context.
- OK for application servers. But for persistent data like databases, NFS, etc., Immutable servers are not easy.
- What if you need to run two applications on the same server for cost reasons? Infrastructure is all about budgets too.
The Rise of Containers
We had already got deployment time down from a day in Ye Glorious Days to a couple of minutes for an Immutable server. Except, now that we had this, we started to deploy more often. Instead of one day once in a blue moon, we were now doing 5–10 minute deployments dozens of times in the same day, and that adds up.
Even if you didn’t actively deploy, your Load Balancer was spinning VMs up and down as per demand, and servers were getting annihilated. People started taking snapshots of their VMs to scale up faster, but these Whole Machine snapshots were humongous, painful to maintain, tied to vendor-specific image formats, and more.
People tried to run VMs inside VMs to standardize things. But that didn’t work out — it was either too slow or impossible.
Then people dug into the OS. Linux/Unix already allowed you to do the following:
- Manage a bunch of processes as a group and set CPU/Memory limits
- Have a fake root (chroot) for some services so that they can operate upon their own independent root filesystem
- Create virtual ethernet devices (say for firewalls, bridging, etc)
- Mount file systems anywhere
Put all these together: a group of processes with virtual ethernet, a chroot and file system mounts, purely kernel-based with very little runtime overhead, no special hardware requirements, and able to be started or stopped as a whole in a matter of seconds! And well, you have a lightweight sandbox that comes close to acting like a VM!
Enter Containers. People built on top of these OS facilities to make Containers. Unix has had Zones (on Solaris) for a long time, but it was with LXC that containers became mainstream on Linux. LXC allowed you to create a container with a specific Linux distribution, and boot it in a matter of seconds. You could install anything you wanted inside this new “machine” just as you would on a normal server. So now instead of creating and destroying VMs, people could create and destroy containers much faster.
LXC brought containers mainstream, but it still didn’t solve repeatable builds and images. Immutability was something you had to bake in by yourself.
Then came Docker. Docker brought repeatable container images to the forefront. With a bunch of instructions in a Dockerfile, you could copy your application inside, install all dependencies and package it up into an image. Think of the whole image as a Fat Application, with everything needed to run the application. It’s like packaging up a whole server into a single executable.
You could now ship this single image from Development to Production. Those servers don’t need anything installed on them except Docker. Everything they run is inside containers. This in turn gave birth to lightweight OSes which only run Docker: CoreOS, RancherOS, etc.
State is not Dead, Yet.
OK, so even if you do have Immutable Servers, you will still have different configurations for different environments like Dev, QA, UAT, etc. These include environment variables, caching services, shared file mounts, which version of the app to run in which environment, the IP address of every service in the application, and so on.
State is not dead in your infrastructure as a whole yet. But earlier, you had to worry about the state of installed software versions, symlinks and paths. With container images, you have reduced state to a simple bunch of environment variables.
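As a rough illustration of state reduced to environment variables, an application inside a container might build all of its settings like this. The variable names `APP_ENV`, `DB_HOST` and `CACHE_URL` are invented for this sketch; your application would define its own.

```python
import os


def load_config(env=None):
    """Build runtime settings purely from environment variables.

    `env` defaults to the real process environment; passing a dict
    makes the function easy to try out without touching os.environ.
    """
    env = os.environ if env is None else env
    return {
        "app_env": env.get("APP_ENV", "dev"),   # which environment we run in
        "db_host": env["DB_HOST"],              # required: fail fast if missing
        "cache_url": env.get("CACHE_URL"),      # optional caching service
    }


qa_config = load_config({"APP_ENV": "qa", "DB_HOST": "10.0.0.5"})
```

The same image can then run in Dev, QA or UAT unchanged, and only the variables passed to the container differ. A missing required variable raises a `KeyError` at startup instead of failing later in some obscure way.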
So what about DevOps?
With Container Images comes the next paradigm shift:
Developers (or Dev teams) are now squarely in charge of their Application’s Image. They make sure everything they need is installed on their container, and if given the necessary environment variables, the container would just run. Operations is now baked into their development.
SysAdmins (or Ops teams) are no longer task runners for every deployment. They are instead building an infrastructure that enables developers to run their containers. They worry about what matters: Scalability, Performance, Resilience, Uptime, and so on of the infrastructure as a whole. They will have development cycles for building and maintaining this infrastructure.
Developers are also doing Ops.
Ops are also doing Development.