How reliable is my virtual server?

Antonis Christofides
Published in Django Deployment
Dec 22, 2020

Digital Ocean advertises its services as “cloud computing”, and sometimes refers to its virtual servers, its “droplets” that is, as “cloud servers”. Reader Chris Pantazis asked me if this means it has less downtime than a provider that doesn’t advertise them in this way. The answer is that “cloud” doesn’t mean anything at all. In this post I explain how virtual server providers minimize downtime.

I assume you understand clearly what a “virtual machine” is. If you don’t, download VirtualBox on your computer, create a virtual machine, and run it; the best way to grasp the concept is to see it in action. We often use virtual machines as servers, in which case we also call them virtual servers.

Virtual machines run inside physical machines. Depending on the capacity (mostly the RAM and CPU) of the physical machine and the size of the virtual machines, a physical machine can run from a handful to a few hundred virtual machines. Virtual machine providers like Digital Ocean have many computers stacked on a rack like the one in the picture on the right, and a data centre has many racks, as seen in the picture on the left. The virtual machine you have at your provider runs on a single physical computer. If that physical computer goes down, your virtual machine also goes down.

The hardware part that breaks most often is the disk. In order to avoid machines going down because of disk failures, we use arrays of disks called RAID. Don’t confuse real RAIDs with the ones your cheap motherboard supports, which often result in less rather than more reliability. Real RAIDs have high quality circuits and redundant power supplies, and they can reliably isolate a malfunctioning disk. A RAID array with seven 1 TB disks can have a total capacity of 6 TB, with one disk’s worth of space used for redundancy. The data is distributed on the disks in such a manner that if any one of the disks crashes, the other six disks can still serve the total 6 TB. Meanwhile indicators light up, people are notified, and an administrator comes along, unplugs the broken disk, and plugs in a new one in its place. The RAID then rebuilds the missing data on the new disk from the contents of the other disks, in order to re-establish sufficient redundancy; this is called “synchronizing the disk”. Then everything’s back to normal. The physical computer to which the RAID is connected has not noticed anything at all; from its viewpoint, the RAID is just a single external 6 TB disk.
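To make the 7-disks-for-6-TB arithmetic concrete, here is a minimal sketch of single-parity redundancy, the idea behind RAID level 5. It is an assumption that this is the scheme in play, since the text above does not name a RAID level, and real arrays spread the parity across all disks rather than keeping it on one. The names `xor_blocks` and `rebuild` are made up for this illustration, and the 4-byte blocks stand in for whole disks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Six data blocks, one per data disk (tiny stand-ins for 1 TB disks).
data_blocks = [bytes([i] * 4) for i in range(1, 7)]

# The seventh block holds the parity: the XOR of the six data blocks.
parity_block = xor_blocks(data_blocks)
stripe = data_blocks + [parity_block]

def rebuild(surviving_blocks):
    """Recompute the block of the single missing disk from the six survivors."""
    return xor_blocks(surviving_blocks)

# Simulate the third disk (index 2) failing and being replaced.
lost_index = 2
survivors = [block for i, block in enumerate(stripe) if i != lost_index]
rebuilt = rebuild(survivors)

# Seven 1 TB disks, minus one disk's worth of parity.
usable_tb = (len(stripe) - 1) * 1
```

Because XOR is its own inverse, XORing the six surviving blocks yields the missing one, whichever disk was lost. That is why the array keeps serving all 6 TB with one disk out, and also why it cannot survive a second failure before the rebuild finishes.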

However, I believe that large providers such as Digital Ocean are using a SAN (storage area network), which is RAID on steroids. A SAN is like a RAID, but it divides the total disk space into several virtual disks. Computers connect to it through the network and view the virtual disks given to them as if they were physical disks. So the data center might have only one SAN with thousands of disks, and each physical computer might get one such virtual disk from the SAN. Or it might get many, one for each virtual machine it runs. I don’t really know. I’m not an expert, but I guess such large SANs have redundant power supplies, redundant caches, redundant network interfaces, redundant everything. They must be like many RAIDs joined together. You can probably take down a small part of a SAN for maintenance while the rest of it is still running. They are probably designed for zero downtime.

Even if the “disk” has zero downtime, your virtual machine runs on a single physical computer. It can’t run on many computers. It can be moved from one computer to another though. If the move is planned, one way to do it with minimal downtime is to copy the virtual machine to another computer. Copying means mostly copying its RAM. During the initial copying you keep the virtual machine running; the state of the virtual machine changes and its RAM is being modified by its normal operation, so when the copying ends, the copy is not exactly the same as the original. So now you freeze the virtual machine and copy it again, but you only copy the parts that have changed. After the copy is exactly the same as the original, you reroute the virtual storage to the new instance, you reroute the IP address to the new instance, then you unfreeze the new instance. The downtime in that case can be only a few seconds. Smaller providers may not have the equipment to do this, in which case they will bring down the virtual machines, copy them, and then start the copied instances, which will result in a downtime of a few minutes to half an hour.
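The two-pass copy described above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not how any real hypervisor does it: RAM is modelled as a list of integers, one per page, and the function `migrate` and its parameter `write_probability` are invented for the example.

```python
import random

def migrate(source_ram, write_probability=0.1, seed=0):
    """Copy source_ram to a new host in two passes; return the copy and
    the number of pages that had to be copied while the VM was frozen."""
    rng = random.Random(seed)
    target_ram = [None] * len(source_ram)
    dirty_pages = set()

    # Pass 1: the VM is still running, so it may write to any page while
    # we copy. Every written page is marked dirty. This over-marks pages
    # that were written before we reached them, which is harmless.
    for page in range(len(source_ram)):
        target_ram[page] = source_ram[page]
        if rng.random() < write_probability:
            written = rng.randrange(len(source_ram))
            source_ram[written] += 1
            dirty_pages.add(written)

    # Pass 2: the VM is frozen, so nothing changes any more. Only the
    # dirty pages need copying, which is why the freeze is short.
    for page in dirty_pages:
        target_ram[page] = source_ram[page]

    return target_ram, len(dirty_pages)

source = list(range(1000))
target, copied_while_frozen = migrate(source)
```

The downtime is roughly proportional to `copied_while_frozen` rather than to the total amount of RAM, which is the point of keeping the machine running during the first pass.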

Unplanned moving is different. If the physical computer has a hardware error and goes down, the virtual machines go down with it. If a virtual machine’s disk is on a SAN, it can be attached to another physical computer very fast, and the virtual machine can be started immediately. In that case, the downtime may be a minute or two.
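As a rough sketch of that unplanned case, the toy model below restarts on a spare computer every virtual machine whose disk lives on the SAN. The classes, the `fail_over` function, and the `disk_on_san` flag are all hypothetical names for this illustration. The branch for machines with local disks is my own assumption: presumably they stay down until the broken computer is repaired.

```python
from dataclasses import dataclass, field

@dataclass
class VirtualMachine:
    name: str
    disk_on_san: bool
    running: bool = True

@dataclass
class Host:
    name: str
    healthy: bool = True
    vms: list = field(default_factory=list)

def fail_over(failed, spare):
    """Handle a hardware failure of `failed` by restarting what we can on `spare`."""
    failed.healthy = False
    restarted, stranded = [], []
    # Everything on the failed computer goes down with it.
    for vm in failed.vms:
        vm.running = False
    for vm in list(failed.vms):
        if vm.disk_on_san:
            # The disk is reachable over the network, so attach it to the
            # spare computer and boot the virtual machine there.
            failed.vms.remove(vm)
            spare.vms.append(vm)
            vm.running = True
            restarted.append(vm.name)
        else:
            # Assumption: a local disk is stuck inside the broken computer.
            stranded.append(vm.name)
    return restarted, stranded

broken = Host("host-a", vms=[VirtualMachine("web", True), VirtualMachine("db", False)])
spare = Host("host-b")
restarted, stranded = fail_over(broken, spare)
```

Note that the restarted machine boots from scratch, so unlike the planned move its RAM contents are lost; only what was on disk survives.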

I’ve never worked in a data centre and I’m confident I’ve got something wrong in all that, but as a general idea I’m pretty certain it’s OK.

“Cloud” is just a buzzword. It doesn’t mean anything. People say “I have my emails on the cloud”, but I had my emails on a server using IMAP long before the word “cloud” started to be used for that purpose. Some companies may be using “cloud” to mean virtual servers, or SAN, or automatic failover of virtual servers, or whatever you like. The problem is that everyone means something different, and I think everyone would be better off if we just said directly what we mean.

Finally, this is just the theory. In practice these systems are set up and run by people, and people can screw things up. Having the best equipment is not enough; a company must also have good people and good procedures to minimize downtime, so you won’t really know until you’ve stayed with a company for a few years.

This is a republication of an old article first published on my Django Deployment site.

The picture of the data centre (original page) is © 2011 Wikimedia Commons user “123net” and licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license. The picture of the rack (original page) is © 2007 Wikimedia Commons user “jfreyre” and licensed under the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
