State of Cloud Instance Provisioning
I have been meaning to write about this for a while, since it is what I have been up to lately at work. I spent quite some time investigating the state of instance provisioning on each cloud provider, and I thought I would share my findings here.
If you are deploying instances (a.k.a. Virtual Machines or VMs) to a public cloud (e.g. AWS, Azure), you might be wondering what your instance goes through before you can start using it.
What Is Provisioning?
You have an application and you need to run it without purchasing physical servers. You go to a cloud provider and ask for virtual servers to run your service on. You tell the cloud provider: “I want a Debian 8 Linux machine, and I want you to add this SSH public key to the machine so that I can log in and run my app”. You get what you asked for within a matter of seconds, if not minutes.
(Very ordinary, right? I am very sure this was mind-blowing a decade ago when AWS EC2 came out. It is certainly still mind-blowing to me.)
Everything that happens from the moment you request a VM to the moment you can log in to it is called provisioning.
Most of the provisioning magic happens in the cloud provider’s proprietary/internal software that manages the physical machines in the datacenter. A physical node is picked, the VM image you specified is copied to that machine, and the hypervisor boots up your VM. This is provisioning on the infrastructure side, and we are not going to talk about it here.
Then the provisioning goes on… Your machine is now up (think of a Debian or Ubuntu Server image). It has no accounts and no SSH keys. It is almost like a vanilla OS image you could download from the internet yourself. You can’t log in.
This is where user-mode provisioning kicks in. Your machine runs some code and starts specializing this image. Typical steps include:
- creating the OS user you wanted
- adding SSH credentials to the machine so that you can log in
- running startup scripts you provided (to install or configure stuff)
- mounting an ephemeral/scratch disk from the physical host
All of this is part of provisioning, and once it is done, you have a VM ready for you to log in and use!
This user-mode provisioning runs only once. When it is all set, it gets out of your way and lets you run your workloads.
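As a toy illustration, the SSH-key step above might look something like this minimal Python sketch. This is not what any real agent does verbatim (tools like cloud-init handle far more edge cases); the permissions follow sshd's usual expectations:

```python
import os
import stat

def install_ssh_key(home_dir, public_key):
    """Append an SSH public key to a user's authorized_keys,
    creating ~/.ssh with the permissions sshd expects."""
    ssh_dir = os.path.join(home_dir, ".ssh")
    os.makedirs(ssh_dir, exist_ok=True)
    os.chmod(ssh_dir, stat.S_IRWXU)  # 0700: sshd refuses group/world access
    auth_keys = os.path.join(ssh_dir, "authorized_keys")
    with open(auth_keys, "a") as f:
        f.write(public_key.rstrip("\n") + "\n")
    os.chmod(auth_keys, stat.S_IRUSR | stat.S_IWUSR)  # 0600
    return auth_keys
```

A real provisioning agent would also create the OS user first (e.g. via useradd) and chown these files to it.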
Key Tools for Provisioning
How does a vanilla Linux server image learn about your configuration (such as your credentials) and set up the virtual machine accordingly? To understand that, you should know about a few tools that play key roles here:
⚒ Instance Metadata API
This is an HTTP API that runs at http://169.254.169.254/ if you have a VM running on the cloud. You make calls to it and it gives you information about your VM, such as:
- the name of the VM
- the instance size
- the region your VM is in
- the SSH public keys assigned to the VM
- the startup script that the VM should execute
It is a trivial, text-based API. You just make the request and get what you want:
$ curl http://169.254.169.254/latest/meta-data/ami-id
Instance Metadata is provided to the VM by the hypervisor or other underlying infrastructure of the cloud provider. This is how your VM knows about itself and what it should do.
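Fetching a value from an EC2-style metadata endpoint is a one-liner in most languages. Here is an illustrative Python sketch; note that the link-local address 169.254.169.254 is only reachable from inside a VM, so get_metadata is for illustration only:

```python
from urllib.request import urlopen

METADATA_BASE = "http://169.254.169.254/latest/meta-data/"

def metadata_url(key):
    """Build the URL for a metadata key, e.g. 'ami-id'."""
    return METADATA_BASE + key.lstrip("/")

def get_metadata(key, timeout=2):
    """Fetch a metadata value. Works only from inside a VM,
    where the link-local metadata address is routable."""
    with urlopen(metadata_url(key), timeout=timeout) as resp:
        return resp.read().decode().strip()
```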
In case you are interested, I have a comprehensive post on my blog comparing instance metadata APIs across public cloud providers.
⚒ cloud-init
cloud-init is a Linux tool that runs when your instance boots and handles provisioning from within the instance. It sets up your virtual machine by configuring networking and the hostname, placing your SSH credentials, and optionally running the startup scripts you provided.
cloud-init detects which cloud provider you are running on (using heuristics on the filesystem or the metadata API protocol) and then figures out which data source class to use.
Then it calls the data source class to get the data it needs (such as SSH keys) and provisions your instance with it. Basically, the cloud-init package is what makes a cloud server image different from a vanilla distro image.
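Conceptually, the data sources are a strategy pattern: each cloud implements "get the data", and the rest of provisioning stays cloud-agnostic. This is not cloud-init's real class layout, just a hypothetical sketch of the idea:

```python
class DataSource:
    """Each cloud provider implements get_data(); the rest of the
    provisioning logic does not care which cloud it runs on."""
    def get_data(self):
        raise NotImplementedError

class FakeEC2DataSource(DataSource):
    """Hypothetical stand-in: a real EC2 data source would query
    http://169.254.169.254 instead of reading a local dict."""
    def __init__(self, metadata):
        self.metadata = metadata

    def get_data(self):
        return {
            "hostname": self.metadata["hostname"],
            "ssh_keys": self.metadata.get("public-keys", []),
        }

def provision(source):
    """Cloud-agnostic provisioning driver (heavily simplified)."""
    data = source.get_data()
    # ...create users, install data["ssh_keys"], set data["hostname"]...
    return data
```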
cloud-init is written in Python and was originally developed by Canonical for Ubuntu Server. Over time it got popular, received love from other Linux distro vendors as well as cloud providers, and became a widely adopted package baked into many distro images on the cloud.
Provisioning in Public Cloud
In this section I am going to explain how each cloud provider provisions instances in user space. Some are similar, although some have differences interesting enough to point out.
☁︎ Amazon Web Services EC2
Obviously AWS started all this, since it was the first IaaS provider, but the notion of provisioning is older than that: people have long run virtualization software in their on-premises datacenters and on their own servers.
The way EC2 provisions instances is plain and simple:
- Most images on EC2 (AMIs) have cloud-init baked into the image.
- cloud-init queries EC2 Instance Metadata API and gathers data to provision the instance.
The whole AWS implementation in cloud-init is only 200 lines of code. I think having cloud-init everywhere gives EC2 a clean and unified way of provisioning Linux instances.
☁︎ DigitalOcean
I love DigitalOcean and use it personally myself. When it comes to provisioning, they follow the AWS EC2 approach:
- have the cloud-init package baked into all images
- use the metadata API to get the data about the instance
Since DigitalOcean has only a few images available (at least today), provisioning is not very exciting here either. The cloud-init implementation of DigitalOcean is also very small, just 110 lines of code.
The only difference I spotted: DigitalOcean has a way of reordering the steps cloud-init executes (listed in cloud.cfg) via a cloud-init feature called vendor-data. This data also comes from the metadata API, and as far as I can tell, nobody except DigitalOcean uses this feature in the cloud-init codebase. They use it for keeping the root user enabled, managing /etc/hosts via cloud-init, etc. (In case you want to dig deep, here is the vendor-data DigitalOcean presents and its diff with cloud.cfg.)
☁ Google Compute Engine
Google does not use cloud-init (BOOM!). I do not know why but what they came up with instead is remarkably cool:
Google wrote their own instance guest agents in Python. They are installed on all stock images on GCE and are open source on GitHub. Note that I said agents: not a single monolithic agent, but a bunch of small services. You can find a list of them on your GCE instance:
# systemctl list-units | grep ^google
google-accounts-daemon.service Google Compute Engine Accounts Daemon
google-clock-skew-daemon.service Google Compute Engine Clock Skew Daemon
google-ip-forwarding-daemon.service Google Compute Engine IP Forwarding Daemon
google-shutdown-scripts.service Google Compute Engine Shutdown Scripts
Self-documenting enough… The one I really want to talk about is google-accounts-daemon, the one which creates the user accounts and places the SSH keys you provided when creating your VM.
Those who are GCE customers will know these two fantastic features. First: if you ever lose your credentials, you can drop new SSH keys onto the VM from the gcloud CLI or the web console at any time and restore access.
Another killer feature: GCE has an “SSH” button on the web interface and a gcloud compute ssh <vm> command that give you on-the-fly SSH access to the instance by creating short-lived SSH keys and dropping them onto the instance in 10 seconds.
What is this magic? How is it possible and so fast? First you need to know this: the Google metadata API supports long polling. You start a long-standing HTTP request to the metadata API, and when the key you are watching changes, the server returns a response with the new value. google-accounts-daemon then parses the instance/attributes/ssh-keys value in the response (which contains the new SSH keys) and creates Linux users and adds SSH keys accordingly. This is how the magic works.
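A rough Python sketch of that long poll, using GCE's wait_for_change and last_etag query parameters and the mandatory Metadata-Flavor header (the fetch itself only works from inside a GCE VM, so it is for illustration):

```python
from urllib.request import Request, urlopen

SSH_KEYS_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/attributes/ssh-keys")

def watch_url(last_etag):
    """Long-poll URL: the server holds the request open until the
    value's ETag differs from last_etag (or a timeout elapses)."""
    return SSH_KEYS_URL + "?wait_for_change=true&last_etag=" + last_etag

def wait_for_new_keys(last_etag, timeout=360):
    """Blocks until ssh-keys changes, then returns (etag, value)."""
    req = Request(watch_url(last_etag),
                  headers={"Metadata-Flavor": "Google"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("ETag"), resp.read().decode()
```

A daemon would call wait_for_new_keys in a loop, feeding each returned ETag into the next request.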
If you ever lose your SSH key on EC2 or DigitalOcean, you are doomed. But GCE has this (and Azure has something similar). So this is pretty cool.
I said Google does not use cloud-init, but if you bring your own custom VM image (with the cloud-init package in it), it will provision just fine, as there is a GCE implementation in cloud-init. It is short (160 lines of code) and just queries the GCE metadata API to get all the data it needs to provision. If you go down this route, though, you won’t get all these cool features.
☁︎ Microsoft Azure
(Before I begin, a quick disclaimer to save my butt: I work on the Microsoft Azure Linux team, and this is precisely the area I work on. These are my personal opinions, and it goes without saying that I tried to write this section as objectively as I can.)
Azure started as a Windows PaaS provider in 2010 and stayed that way until 2013, when its IaaS offering was made generally available. When IaaS launched, Azure did not have many Linux images; however, it was picking up. (read: Microsoft is now doing Linux, can ya believe it?!! and I was there!!1) Because most of the infrastructure and the APIs were designed for Windows, provisioning Linux instances on Azure is a bit unconventional and non-trivial.
As of this writing, Azure does not have an instance metadata service. It has an undocumented HTTP API, internally called the “Wire Server”. One of the Red Hat engineers kindly documented it here. It is XML-based (whereas the other providers’ metadata servers are JSON/text-based) and a bit cryptic at first.
However, Azure does not use this Wire Server for provisioning. So where does the provisioning data come from? When a new virtual machine boots for the first time, Hyper-V (the hypervisor in Azure datacenters) attaches a DVD-ROM device to the instance. The provisioning code then mounts the device and reads a file called ovf-env.xml from it. This file contains the username, SSH key and/or password data (yes, Azure allows creating Linux VMs with passwords) and is used to provision the instance.
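To make this concrete, here is a heavily abridged, illustrative ovf-env.xml and a Python sketch of parsing it. The real file is XML-namespaced and has more sections; the element layout below is simplified for the example:

```python
import xml.etree.ElementTree as ET

# Abridged, illustrative sample; the real ovf-env.xml is namespaced
# and carries more configuration than shown here.
SAMPLE = """<ProvisioningSection>
  <LinuxProvisioningConfigurationSet>
    <HostName>myvm</HostName>
    <UserName>azureuser</UserName>
    <SSH><PublicKeys><PublicKey>
      <Path>/home/azureuser/.ssh/authorized_keys</Path>
    </PublicKey></PublicKeys></SSH>
  </LinuxProvisioningConfigurationSet>
</ProvisioningSection>"""

def parse_ovf_env(xml_text):
    """Pull the provisioning basics out of an ovf-env.xml document."""
    root = ET.fromstring(xml_text)
    cfg = root.find("LinuxProvisioningConfigurationSet")
    return {
        "hostname": cfg.findtext("HostName"),
        "username": cfg.findtext("UserName"),
        "key_paths": [pk.findtext("Path") for pk in cfg.iter("PublicKey")],
    }
```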
So who provisions the instance, then? Well, it depends. First of all, Azure has its own guest agent running on all Linux instances, called waagent. It is written in Python and open source on GitHub.
On most Linux images on Azure, waagent is the provisioning tool. However, for images like Debian and Ubuntu Server, cloud-init does the provisioning, and waagent is still there. This is mostly because the Azure guest agent does a lot more than provisioning. It has quite a few tasks; some important ones:
- formatting the ephemeral disk to ext4 (Hyper-V gives it as NTFS)
- processing virtual machine extensions
- enabling RDMA (remote direct memory access) for HPC (high performance computing) workloads
- retrieving and placing instance certificates and private keys by converting them from PKCS#12 format to PEM (PKCS#12 works well with Windows, however it is not conventional in POSIX environments.)
In cases where cloud-init and waagent are both present, they coordinate and do not step on each other (although these parts are a bit hacky). Finally, Azure has to be sent a “provisioned” signal via the Wire Server; otherwise Azure thinks provisioning is still going on, and your VM will not be listed as started on the API or in the management portal.
Some of the extra features of waagent, such as VM extensions, offer flexibility to users: extensions are application bundles users can install on their VMs via the CLI or REST APIs. Extensions can do things like restore access, run arbitrary scripts, or install anti-malware software on the machine, without you even having to SSH in.
As I said before, provisioning on Azure is a bit non-trivial. You can see this in the cloud-init Azure data source implementation, which is about 650 lines long (plus another 280 lines of utility methods). This is mostly because Azure has no text-based metadata service, delivers pieces of information in unconventional ways, and requires signaling that provisioning has completed via a complicated protocol.
Now you know how each cloud provider brings up your virtual machine. cloud-init is a big deal and plays a key role in instance provisioning across cloud providers. Instance metadata APIs are another key building block of provisioning for cloud providers.
Most people have no reason to care about any of this, or even know about it, as long as the cloud providers are doing their job right. But I hope you now have more visibility into what your instances on the cloud go through before you get to use them.
If you have read this far, please let me know in the comments what you think!