Protocols vs. Software (Part 2)
A protocol approach for resilient distributed systems

Sara Feenan · Clearmatics · Feb 1, 2019

In our last post on protocols vs. software, we argued that building systems from formal protocol specifications is essential for interoperability.

This post looks at building a resilient distributed system, and lays out the argument that resilient systems should follow a protocol-first approach.

Resilience, as a systems engineering term, emerged around 2006, although the concept itself is of course much older. For this post, we will use the terminology derived from that systems engineering context. Resilience is an emergent property of a complex adaptive system, and its scope covers the resilience of systems as well as the resilience of the organisations that design and operate them. In its most fundamental sense, resilience is the ability to provide required capability in the face of adversity. The fundamental objectives of achieving resilience are to avoid, withstand, recover from, and evolve and adapt to, adversity.

Heartbleed

Let’s look at the above in the context of an example. OpenSSL, an open-source cryptography library, is the most widely used implementation of the Transport Layer Security (TLS) protocol. In April 2014, the infamous Heartbleed bug was revealed by a Finnish cyber security company, Codenomicon, two years after an extension called ‘Heartbeat’ was proposed, accepted as a standard, and introduced into the OpenSSL source code repository in 2012. Heartbleed exploits a weakness in that extension, and its impact was far-reaching; we may never know the full effects of this long-undiscovered security hole across millions of systems.

While the details would make for a great disaster story of our time, the key point to understand is that the fault was in the implementation, not the protocol specification. New code introduced a fatal weakness, even though the underlying design was sound. Other implementations were not affected and, indeed, OpenSSL has since been forked twice, into LibreSSL and BoringSSL.
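To make the implementation-versus-specification distinction concrete, here is a minimal sketch of the Heartbleed pattern (CVE-2014-0160) in C. It is an illustration of the class of bug, not the actual OpenSSL source, and the message layout is simplified: the code trusts a sender-supplied payload length and copies that many bytes back, without checking how many bytes were actually received.

```c
/* A minimal sketch of the Heartbleed pattern (CVE-2014-0160), written for
 * illustration only; this is NOT the actual OpenSSL source, and the
 * message layout is simplified. */
#include <stdlib.h>
#include <string.h>

unsigned char *build_heartbeat_response(const unsigned char *record,
                                        size_t record_len)
{
    /* The sender claims how many payload bytes it sent (up to 64 KB). */
    size_t payload_len = ((size_t)record[0] << 8) | record[1];

    (void)record_len;  /* BUG: the real length of the record is never consulted. */

    unsigned char *response = malloc(payload_len);
    if (response == NULL)
        return NULL;

    /* Because payload_len is trusted, memcpy can read far past the end of
     * the bytes actually received, leaking adjacent heap memory
     * (private keys, passwords, session data) back to the sender. */
    memcpy(response, record + 2, payload_len);

    /* The fix amounts to one bounds check before copying, e.g.:
     *   if (record_len < 2 || payload_len > record_len - 2)
     *       return NULL;   // silently discard the malformed message
     */
    return response;
}
```

Notably, the Heartbeat specification (RFC 6520) already required that a message with an implausible payload length be discarded silently; it was the implementation that omitted the check, which is precisely the implementation-versus-protocol distinction made above.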

To quantify the risk at the time: the two open-source web servers that used the OpenSSL implementation, Apache and nginx, had a combined market share of over 66% of active sites on the Internet, according to Netcraft’s Web Server Survey for the month the vulnerability was disclosed.

It is not the intention of this piece to lay the fault at the feet of a small number of underfunded volunteers building critical infrastructure for the internet; the point is to reiterate that having several competing software implementations of a protocol is paramount for resilient systems.

Resiliency within blockchains

The failure caused by the OpenSSL Heartbeat extension is known as a ‘common mode failure’, where two or more systems fail in the same way for the same reason. Common mode failures are an important risk category that can be particularly lethal in distributed systems. This is recognised across a number of industries, such as aeronautics, astronautics, and reactor protection systems. Systemically important Financial Market Infrastructures are no different, whether the underlying technology is blockchain-based or not.

Proponents of blockchain argue that its redundancy and decentralisation increase resiliency. We argue that this is true only if the ecosystem assembles purposefully to mitigate the risk of common-mode failure; this heterogeneity is vital for increased systemic resilience. In the words of Ethereum founder Vitalik Buterin:

“it’s critically important to have multiple competing implementations.”

Vitalik underlines the role of ‘diversity’ in the concept of ‘defence-by-depth and diversity’. In the post quoted above, he lays out some scenarios in public blockchains that could lead to common mode failure. In a situation where a majority of nodes in a public blockchain run just one or two client implementations of the protocol, the entire network is highly exposed to failures due to technical bugs and/or malicious interference.

A good example of this is the Parity multi-sig bug. The contagion was contained across the Ethereum community because multiple software implementations exist, although more Ethereum client implementations could have further reduced the overall impact. The Ethereum community has learned that to mitigate the risk of convergence towards a single point of failure (and thus a single entity in which to place all our trust), we need heterogeneity, and that means competition between different implementations. The same, of course, applies to enterprise blockchains.

In the enterprise space, certain implementations are being pushed as reference implementations or industry standards. A market with very few competing implementations of widely documented protocols reduces the opportunity to find bugs or flaws in the logic of those protocols, and it massively increases the risk of common mode failures. On the other hand, a market without open protocols results in implementations of proprietary protocols, which easily descend towards commercial monopolies and customer lock-in, both of which are antithetical to the concept of resilience.

This is demonstrably true of existing communication-layer protocol implementations, and it extends to the commercial space. A number of high-profile commercial entities, such as Oracle, McAfee and Hewlett Packard, had software products affected by the bug in the OpenSSL implementation of TLS.

Commercial Models

A blockchain protocol is actually a protocol suite comprising a number of formal specifications: rules about communication formats and processes. This is different from a DLT platform or blockchain client, which is simply a piece of software able to follow such protocol specifications. Proprietary solutions that do not explicitly define communication conventions which can, in principle, be implemented by other, competing products cannot really be said to implement a protocol at all. They just do their own thing, and will not interoperate.
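As a toy illustration of that distinction, below is a hypothetical wire-format “specification” alongside one possible implementation of it in C. Every name here (the PING message, its field layout, the encode_ping function) is invented for the example; the point is that the byte-level rules are what competing clients must agree on, while the code is just one of many interchangeable implementations.

```c
/* Hypothetical "specification" excerpt (invented for illustration):
 *   A PING message is exactly 13 bytes on the wire:
 *     byte  0      message type, always 0x01
 *     bytes 1-4    sequence number, big-endian uint32
 *     bytes 5-12   sender timestamp in ms since epoch, big-endian uint64
 *
 * One possible implementation of that rule follows; any competing client
 * written against the same spec will produce byte-identical output. */
#include <stddef.h>
#include <stdint.h>

#define MSG_PING        0x01
#define PING_WIRE_SIZE  13

size_t encode_ping(uint8_t out[PING_WIRE_SIZE], uint32_t seq, uint64_t ts_ms)
{
    out[0] = MSG_PING;
    for (int i = 0; i < 4; i++)          /* sequence number, big-endian */
        out[1 + i] = (uint8_t)(seq >> (8 * (3 - i)));
    for (int i = 0; i < 8; i++)          /* timestamp, big-endian */
        out[5 + i] = (uint8_t)(ts_ms >> (8 * (7 - i)));
    return PING_WIRE_SIZE;
}
```

Another team could implement the same 13-byte rule independently, in any language, and the two products would interoperate on the wire. A proprietary product that never publishes such rules gives up exactly that property.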

Of course, a vendor that does not take a protocol-first approach to interoperability will experience a great deal of economic pressure to solve for interoperability through market share. That is, for a large ecosystem to interoperate, all participants simply have to run that single vendor’s software. This inevitably concentrates risk around a single commercial entity, which could lead to degradation of service should that entity cease to exist, in addition to client lock-in.

A software solution on a drive to monopolise the market can create network externalities that are, by and large, a downside for the market, whatever the intentions behind it. One example is a piece of bespoke software that, should it be widely implemented, may reduce the resilience of the system by compounding vendor lock-in and concentrated risk.

The Bank of England warned specifically about the risk of negative network externalities in a 2017 Staff Working Paper, which investigated the role blockchain and distributed ledger-based systems could play in the future of securities settlement. Drawing on experience from existing business models, the Bank noted that network externalities can push an industry towards one solution (or a small number of solutions), concluding:

“Despite the promise of disintermediation that the DL technology brings about, the industry may well simply transition from a CSD-centred monopoly to a DL-provider-centred one.”

An economically driven software monopoly running critical market infrastructure for a global value transfer layer is a risk nobody should be willing to take, and one that central banks are increasingly aware of. The recognition of this problem underpins the mission of the Technical Standards Working Group at the Enterprise Ethereum Alliance, where we and others in the community are working towards multiple different — potentially competing! — implementations of a common protocol. In addition, a great deal of research and development is taking place in the open-source community on improvements to the protocol stack. Collaboration is happening across a thriving ecosystem of developers and academics to further design, specify, formally verify, and implement protocols.

There is a clear drive globally to renovate, improve, and sometimes re-invent Financial Market Infrastructures, and resilience is quite clearly at the heart of each of those initiatives. By designing well-defined protocols with multiple competing implementations, we can work towards increasing systemic resilience. What is more, our systems benefit from the additional scrutiny, from the challenges of true interoperability, and from the opportunities offered by a thriving community of practice around open standards. Ultimately, this enables networks as well as whole ecosystems to better withstand (and recover from) adversity, while continuing to evolve and adapt.

If you would like to participate in our protocol-first approach to interoperability, see the details of an upcoming hackathon for our recently open-sourced decentralised interoperability framework, Ion.



Drop us a line with any thoughts you might have on our Protocols vs. Software series.

Tweet us @Clearmatics
