Chaos Engineering & The Blockchain

Vipin Bharathan
Jan 2, 2019 · 8 min read

Part 1. Principles of chaos engineering applied to blockchain frameworks.

The words chaos and engineering do not appear to go together. In this article, we explore why they do, and then how this engineering discipline applies to blockchain. Part 2 will focus on a specific implementation of chaos engineering for Hyperledger Indy.

This is the age of micro-services composed into huge distributed systems at scale: Netflix, LinkedIn, Medium, Amazon, Microsoft Azure, Uber, Airbnb and others. No single person, nor even an entire team of architects and programmers, can hope to hold the complex architecture of such a distributed system in their head. Even the static configuration of such a system consists of multiple services running on heterogeneous hardware or the cloud, interconnected by networks with multiple SLAs, with user interfaces (UIs) running on numerous edge devices. On top of this static complexity, the real-time behavior of such a system adds independent inputs from users and processes, driven through unreliable networked components.

These components can crash, degrade, or misbehave. Malicious or incompetent users are everywhere. It is in this age that chaos engineering developed, first as a crude method of taking the measure of such a system, then refined by practice into a philosophy and a well-accepted approach with conferences, tooling and widespread adoption.

You could argue that permissionless public blockchain networks like Bitcoin and Ethereum exist in a chaotic environment. They are already, unwittingly, being subjected to chaos: nodes join and rejoin the network, malicious attackers continually probe the system, network connections break. There is a difference between this chaos and chaos engineering. Chaos engineering is an engineering discipline that surfaces this inherent chaos deliberately and uses experimental data to uncover systemic weaknesses.

First we set the scene with some basic history and principles of chaos engineering, as well as its application in existing distributed systems. There is an open source repository for chaos engineering called the Chaos Toolkit. It generalizes chaos engineering interactions using an open API for expressing experiments. The toolkit is extensible through that API, and several drivers are already available for Kubernetes, AWS, Azure etc. It can also be used to automate chaos engineering in continuous integration and builds.

In the second article in this series, we look at how the Chaos Toolkit is being adopted for these experiments on Hyperledger Indy. Hopefully, this will inspire people to look closely at their own DLT platform and build a maturing chaos experimentation suite to harden it.

HISTORY

Since 2008, when Netflix started moving its servers from the data center to the cloud, its engineers have been practicing some form of resiliency testing in production. Only later did their take on it become known as Chaos Engineering. Chaos Monkey started the practice and became known for turning off services in the production environment. Principles of Chaos formalized the discipline. Netflix’s Chaos Automation Platform runs chaos experimentation across its production micro-service architecture 24/7.

For those interested in chaos engineering as a discipline, here is a curated list of resources. There is an excellent backgrounder on chaos engineering published by O’Reilly and available for free. The title is “Chaos Engineering: Building Confidence in System Behavior through Experiments”. Since O’Reilly requires a form of registration to download it, a link is not provided. Our thanks to the authors, who are leaders in chaos engineering practice in many enterprises.

THE PRACTICE OF CHAOS ENGINEERING

Chaos Engineering can be thought of as the creation and running of experiments to uncover systemic weaknesses in distributed systems at scale. The surfaced weaknesses can then be addressed or noted as limits of the system. Evidence that a weakness has been addressed can be obtained by repeating the experiment.
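The idea of an experiment that is run, acted upon, and then repeated can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the design of any particular tool: the `Experiment` and `ToyService` classes and all their method names are hypothetical, and the "system" is a toy service that stays healthy as long as one replica is up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One chaos experiment: a health probe plus a fault and its undo (hypothetical shape)."""
    name: str
    is_healthy: Callable[[], bool]    # probe of the system's observable output
    inject_fault: Callable[[], None]  # the deliberate disturbance
    rollback: Callable[[], None]      # restores the system afterwards


def run_experiment(exp: Experiment) -> str:
    """Return 'skipped', 'weakness found' or 'held' for a single run."""
    if not exp.is_healthy():
        # Starting from an unhealthy system would make the result meaningless.
        return "skipped"
    exp.inject_fault()
    try:
        survived = exp.is_healthy()
    finally:
        exp.rollback()  # always undo the fault, even if the probe raises
    return "held" if survived else "weakness found"


class ToyService:
    """Toy stand-in for a real system: healthy while at least one replica is up."""

    def __init__(self, replicas: int) -> None:
        self.replicas = replicas
        self.down = 0

    def healthy(self) -> bool:
        return self.replicas - self.down >= 1

    def kill_one(self) -> None:
        self.down += 1

    def restore(self) -> None:
        self.down = 0


def experiment_for(service: ToyService) -> Experiment:
    return Experiment("lose one replica", service.healthy, service.kill_one, service.restore)


# First run surfaces the weakness; the same experiment is repeated after the fix.
before_fix = run_experiment(experiment_for(ToyService(replicas=1)))
after_fix = run_experiment(experiment_for(ToyService(replicas=2)))
print(before_fix, "->", after_fix)
```

Repeating the identical experiment against the fixed system is what turns "we think it is fixed" into evidence.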

The first step is the measurement of the steady state of the system. The system is known through its outputs. A stable and light-touch monitoring system is needed to measure the steady state of the system. Light touch means the act of measuring does not significantly change the behavior of the system. Discovering the steady state requires answering the following questions.

  • What is being measured? System variables like CPU usage and memory consumption, or business variables like response time and other application-specific metrics. Sometimes metrics cover both aspects.
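One simple way to turn such measurements into a usable steady state is to record a baseline and accept later readings that fall within a band around it. The sketch below assumes a band of the mean plus or minus three standard deviations; that rule, the function names, and the sample response times are illustrative assumptions, not something prescribed by chaos engineering practice.

```python
import statistics
from typing import Sequence, Tuple


def steady_state_band(samples: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (low, high): the baseline mean minus/plus k sample standard deviations."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate variation")
    mean = statistics.mean(samples)
    spread = statistics.stdev(samples)
    return mean - k * spread, mean + k * spread


def within_steady_state(value: float, band: Tuple[float, float]) -> bool:
    """True if a new reading falls inside the baseline band."""
    low, high = band
    return low <= value <= high


# Hypothetical response times in milliseconds, gathered before any fault is injected.
baseline_ms = [102.0, 98.0, 101.0, 99.0, 100.0]
band = steady_state_band(baseline_ms)

print(within_steady_state(103.0, band))  # a reading close to the baseline
print(within_steady_state(250.0, band))  # a reading far outside it
```

The same check works for a system variable such as CPU usage or a business variable such as transactions confirmed per minute; only the samples change.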

Given below is a guide to the design of the experiments and the setup of a Chaos Automation Platform (ChAP) from a blockchain viewpoint. In this post, we refer to ChAP even if there isn’t much automation in the Chaos Platform.

  • A known weakness should not be the subject of an experiment. If an attack controlling one third of the nodes subverts consensus (as in BFT protocols), turning off a fatal fraction of consensus members has known consequences, and no insight can be gained from this sort of experiment. There could, however, be experiments where the numbers stay short of the crucial thresholds.
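To make "short of the crucial thresholds" concrete, the sketch below assumes the classical BFT bound, under which a network of n validators tolerates at most f faulty ones when n ≥ 3f + 1. A specific framework may use a different bound, and both function names are hypothetical. The helper caps the number of nodes an experiment is allowed to stop at the tolerated maximum.

```python
def max_tolerated_faults(n_validators: int) -> int:
    """Largest f with n >= 3f + 1 (classical BFT bound, assumed here)."""
    if n_validators < 1:
        raise ValueError("a network needs at least one validator")
    return (n_validators - 1) // 3


def plan_node_shutdowns(n_validators: int, requested: int) -> int:
    """Cap the number of nodes to stop so the experiment stays within the tolerated bound."""
    if requested < 0:
        raise ValueError("cannot stop a negative number of nodes")
    return min(requested, max_tolerated_faults(n_validators))


for n in (4, 7, 10):
    print(n, "validators: stop at most", max_tolerated_faults(n))
```

With 7 validators, for example, a request to stop 5 nodes is reduced to 2, so the experiment probes behavior under tolerated faults instead of re-confirming a known failure.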

CONCLUSION

A look at chaos engineering practice in current large-scale distributed systems reveals its promise and power. Adoption in many firms, including in areas like aircraft testing and hospital systems where the practice is performed in production systems, shows its usefulness even in sensitive applications.

Design of experiments in blockchain frameworks needs a combination of specialist knowledge of the framework, exposure to the principles behind ChAP, and a team working at various levels. Together these create a practice that grows with the platform, increases confidence in the specific implementation, and hence drives adoption.

We take up the case study of the ChAP for the Indy platform in the next post in this series. This can help guide our thinking about a ChAP implementation in specific DLT frameworks.

DLT NYC
