Highly available single point of bugs

George Shuklin · Published in OpsOps · Jul 2, 2018

There is a nasty topic which has bothered me for a long time.

Let’s say we have a computer system. It has redundant power supplies (Tier 4: the best electricity from two independent providers, its own generators, UPS, etc.). It has redundant PDUs, redundant disk arrays, and even the memory is mirrored with multi-bit ECC. Let’s assume that even the CPUs are redundant (I have no idea how to do it on x86, but let’s assume so). The network is redundant too, obviously.

And we run some storage software (e.g. Ceph, Gluster, etc.) with ‘three copy’ redundancy. The cluster guarantees that you get confirmation for a write operation only after the datum has been written at least three times.
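To make the ‘three copy’ guarantee concrete, here is a minimal sketch in Python (not Ceph’s or Gluster’s actual code; Node and replicated_write are illustrative names):

REPLICAS = 3

class Node:
    def __init__(self):
        self.data = {}

    def store(self, key, value):
        # stand-in for a durable local write on one replica
        self.data[key] = value
        return True

def replicated_write(nodes, key, value):
    # confirm the write to the client only after REPLICAS nodes stored it
    acks = sum(1 for node in nodes if node.store(key, value))
    if acks < REPLICAS:
        raise IOError("only %d of %d required replicas acknowledged" % (acks, REPLICAS))
    return True

cluster = [Node() for _ in range(5)]
replicated_write(cluster, "answer", 42)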

This system is redundant, isn’t it? Can it withstand a single error? You might guess: yes, it can.

… How about a single assert statement somewhere in the monitor software, triggered by some specific combination of cluster-wide configuration, with a delayed occurrence (long enough for the data to propagate through the whole cluster)?

Is this cluster capable of withstanding a single typo in the source code?

Absolutely not!

Each node runs the same code. If that code contains a bug, it will manifest in every copy of the software, regardless of how many redundant CPUs, PDUs or other machinery the cluster has.
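A toy illustration (hypothetical code, not taken from any real cluster): three ‘redundant’ monitors run the same handler, so a single poisoned message trips the same assert on every one of them.

class Monitor:
    def __init__(self, epoch):
        self.epoch = epoch

    def handle(self, message):
        # the 'single typo': an over-strict assumption, copied to every node
        assert message["epoch"] == self.epoch, "epoch mismatch"
        return "applied"

monitors = [Monitor(epoch=7) for _ in range(3)]  # redundant hardware, identical code
poisoned = {"epoch": 8}                          # one unexpected input...

for mon in monitors:
    mon.handle(poisoned)  # ...and every monitor fails identically on this input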

How could a single piece of software possibly be redundant?

My engineering intuition says that ‘redundant software’ is the answer. But can you imagine a redundant STONITH? How would it save the day if the other part of the cluster decides to go rogue?

The software industry has gotten used to buggy software. We have CVEs, semantic versioning with a patch level, automatic updates, and procedures and best practices for dealing with buggy software. We assume that every piece of software we run is buggy and are prepared to patch it.

Except for cluster design.

Every cluster design I have seen in my life assumes that there are no bugs in the cluster code. It assumes that each member of the cluster acts independently and flawlessly (and that only external vile forces may bring a node to its knees).

As an operator, I’m horrified by this picture. The larger the cluster is, the more dire the consequences. All those thousands of nodes run the same ‘flawless cluster code’ which would never, ever dare to do a cluster fuck to all that precious data it stores. If you are curious how it looks to a rapidly graying operator, it looks like this:

...
0> 7fd2857aa700 -1 mon/OSDMonitor.cc: In function 'bool OSDMonitor::prepare_boot(MonOpRequestRef)' thread 7fd2857aa700
mon/OSDMonitor.cc: 2105: FAILED assert(osdmap.get_uuid(from) == m->sb.osd_fsid)

It happened on all monitors in the cluster simultaneously. I was lucky it was a test cluster, so no gray hairs here. But what if it had been in production?

No more cluster?

I know of only one database which has never, ever crashed (since its wide adoption): DNS.

The reason DNS is so resilient lies not in its design, and not in some flawless implementation in the BIND code. No! Its resiliency arises from the multitude of implementations. Even if you crash one specific implementation with creepy data, the other implementations will continue to work.

The key thing here is that the DNS protocol is absolutely independent of any implementation. You implement the RFC, and you are good to go.

The same idea should be applied to any kind of distributed storage of anything. More than one implementation is an absolute requirement for a highly available storage solution.
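As a hedged sketch of what that could look like (the server classes and query interface here are hypothetical, not any real product’s API): fan the same read out to servers built from independent codebases and take the majority answer, so a bug in any single implementation is survivable.

def resilient_read(servers, key):
    # each server is assumed to run an independently written implementation
    # of the same protocol (as with DNS: BIND, Unbound, Knot, ...)
    answers = []
    for server in servers:
        try:
            answers.append(server.query(key))
        except Exception:
            continue  # a crash in one codebase is survivable
    if not answers:
        raise RuntimeError("all implementations failed")
    return max(set(answers), key=answers.count)  # majority vote

class GoodServer:
    def query(self, key):
        return "93.184.216.34"  # pretend: a correct implementation

class BuggyServer:
    def query(self, key):
        raise AssertionError("the 'single typo' strikes only this codebase")

print(resilient_read([GoodServer(), BuggyServer(), GoodServer()], "example.com"))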

… and…

How many distributed storage systems have this property? Many, if you count things like Winny, DHT, and Usenet news. None, if you want synchronous operations, and absolutely zero if you look at commercially available solutions.

Why? Is it that hard to write the same cluster software twice? I suspect it’s much easier than filling code repositories for all those folks. And yet, every new project aims to implement its own protocol. No one fancies the idea of implementing Gluster a second time. No one has tried to code ‘Ceph in Rust’. Nah. Every new project decides that it is capable of delivering cluster software without a cluster fuck.
