Part 1. Principles of chaos engineering applied to blockchain frameworks.
The words chaos and engineering do not appear to go together. In this article, we explore why they belong together, and how this engineering discipline applies to the blockchain. Part 2 of this series will focus on a specific implementation of chaos engineering for Hyperledger Indy.
This is the age of microservices composing huge distributed systems at scale: Netflix, LinkedIn, Medium, Amazon, Microsoft Azure, Uber, Airbnb, etc. No one person, nor even entire teams of architects and programmers, can hope to hold the complex architecture of such distributed systems in their head. Even the static configuration of such a system consists of multiple services running on heterogeneous hardware or the cloud, interconnected by networks with multiple SLAs, and user interfaces (UIs) running on numerous edge devices. Layered on this static complexity, the real-time behavior of such a system adds independent inputs from users and processes driving through unreliable networked components.
These components can crash, degrade, or misbehave. Malicious or incompetent users are everywhere. It is in this age that chaos engineering developed, first as a crude method of taking the measure of such a system, then refined by practice into a philosophy and a well-accepted approach with conferences, tooling, and widespread adoption.
You could argue that permissionless public blockchain networks like Bitcoin and Ethereum exist in a chaotic environment. They are already unwittingly subjected to chaos: nodes join and rejoin the network, malicious attackers continually probe the system, network connections break. There is a difference between this chaos and chaos engineering. Chaos engineering, by surfacing this inherent chaos, is an engineering discipline that uses experimental data to uncover systemic weaknesses.
First we set the scene with some basic history and principles of chaos engineering, as well as its application in existing distributed systems. The Chaos Toolkit is an open source project that generalizes chaos engineering interactions through an open API for expressing experiments. The toolkit is extensible via this API, and several drivers are already available for Kubernetes, AWS, Azure, etc. It can also be used to automate chaos engineering in continuous integration and builds.
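To make the experiment format concrete, here is a minimal sketch of a Chaos Toolkit-style experiment, built as a Python dict and serialized to the JSON file format the toolkit consumes. The health-check URL and the stop/start scripts are hypothetical placeholders, not part of any real deployment.

```python
import json

# A minimal experiment in the Chaos Toolkit's open format: a steady-state
# hypothesis (probes), a method (actions that perturb the system), and
# rollbacks. The URL and script paths below are hypothetical placeholders.
experiment = {
    "version": "1.0.0",
    "title": "A single node outage does not break the steady state",
    "description": "Stop one node and verify the service still responds.",
    "steady-state-hypothesis": {
        "title": "The service answers within tolerance",
        "probes": [
            {
                "type": "probe",
                "name": "service-is-up",
                "tolerance": 200,  # expected HTTP status code
                "provider": {"type": "http", "url": "http://localhost:9000/health"},
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "stop-one-node",
            "provider": {"type": "process", "path": "./stop_node.sh"},
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "restart-node",
            "provider": {"type": "process", "path": "./start_node.sh"},
        }
    ],
}

# Serialize to the JSON file a runner would consume.
as_json = json.dumps(experiment, indent=2)
print(as_json)
```

The steady-state hypothesis is checked before and after the method runs; a deviation after the perturbation is the experimental signal of a weakness.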
We look at the open source chaos toolkit and see how it is being adopted for these experiments on Hyperledger Indy in our second article in this series. Hopefully, this will inspire people to look closely at their own DLT platform and create a maturing chaos experimentation suite to harden their own platform.
Since 2008, when Netflix started moving their servers out of the data center to the cloud, their engineers have been practicing some form of resiliency testing in production. Only later did their take on it become known as Chaos Engineering. Chaos Monkey started the practice, becoming known for turning off services in the production environment. Principles of Chaos formalized the discipline. Netflix’s Chaos Automation Platform runs chaos experimentation across their production microservice architecture 24/7.
For those interested in chaos engineering as a discipline, here is a curated list of resources. There is an excellent backgrounder on chaos engineering published by O’Reilly and available for free. Since O’Reilly requires a form of registration to download, a link is not provided. Our thanks to the authors, who are the leaders in chaos engineering practice in many enterprises. The title is “Chaos Engineering: Building Confidence in System Behavior through Experiments”.
THE PRACTICE OF CHAOS ENGINEERING
To address the weaknesses of distributed systems at scale, Chaos Engineering can be thought of as the creation and running of experiments to uncover systemic weaknesses. The surfaced weaknesses can then be addressed or be noted as limits of the system. The evidence that such weakness has been addressed can be checked by repeating the experiment.
The first step is the measurement of the steady state of the system. The system is known through its outputs. A stable and light touch monitoring system is needed to measure the steady state of the system. Light touch means the act of measuring does not significantly change the behavior of the system. The discovery of the steady state needs the following questions to be answered.
- What is being measured? System variables like CPU usage and memory consumption, or business variables like response time and other application-specific metrics. Sometimes metrics cover both aspects.
- Is there a time dependent element to the steady state? Patterns of usage and resource utilization may be different at different times of the day, week or month or over different seasons or times of the year or larger cycles. The steady state is really an unsteady state.
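Once these questions are answered, the steady state can be captured as a tolerance band around the baseline measurements. The sketch below, with hypothetical response-time readings, derives a mean ± k-standard-deviation band and checks whether a new observation stays inside it; a real monitoring system would use richer, possibly time-dependent baselines.

```python
import statistics

def steady_state_band(samples, k=3.0):
    """Derive a tolerance band (mean +/- k standard deviations) from
    baseline measurements of a metric, e.g. response time in ms."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return (mean - k * stdev, mean + k * stdev)

def within_steady_state(value, band):
    """True when the observation falls inside the steady-state band."""
    low, high = band
    return low <= value <= high

# Hypothetical baseline response times (ms) gathered by a light-touch monitor.
baseline = [102, 98, 105, 99, 101, 97, 103, 100]
band = steady_state_band(baseline)

print(within_steady_state(101, band))  # a typical reading stays in band
print(within_steady_state(450, band))  # a degraded reading falls outside
```

Because the steady state is really an unsteady state, a production system would maintain separate bands per time-of-day or season rather than one global band.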
Given below is a guide to the design of the experiments and set up of a Chaos Automation Platform (ChAP) from a blockchain viewpoint. In our post, we refer to ChAP even if there isn’t much automation in the Chaos Platform.
- A known weakness should not be the subject of an experiment. If a 1/3 attack subverts consensus (for BFT systems), turning off a fatal percentage of consensus members has known consequences, and no insight can be gained from this sort of experiment. There could be experiments where the numbers stay short of the crucial thresholds.
- For blockchains, chaos engineering experiments should look at the consensus, networking, and storage layers and the cross-cutting elements of identity, smart contracts, governance, user interaction, etc. through a random combination of experiments. When discussing the existing chaos practice on Indy in our second article, we will see how the practice is applied.
- When the experiment reveals weaknesses in the underlying framework, harvest as much information as possible to isolate the processes, APIs or a combination of system behaviors and correlate to the experiment that caused the problem. This data will help in making changes to harden the system.
- Chaos engineering is not the same as unit and integration testing. Nor is it the same as fault injection and failure testing, although a ChAP may use some fault injection tools. Fault injection and failure testing usually target one mode of failure at a time. Chaos engineering aims to surface new knowledge of the system through a random combination of events, including benign or beneficial scenarios like a spike in customer traffic. Chaos engineering should be practiced in addition to the usual testing tools and practices.
- Start with experiments on development and test networks; after ensuring the integrity of such networks with fixes to the uncovered problems, graduate to production. Only in production can the non-linear effects of a chaos experiment be truly observed.
- Get communication and buy-in from the entire team: the DevOps engineers, the development team, legal, IT security, compliance, and business reps. Emphasize that chaos engineering is not an adversarial practice; demonstrate how the experiments harden the system as a whole. The knowledge gained will also feed back into the upper layers of development activity, including architecture, design, and engineering implementation. Communication with the business end of the enterprise is also needed.
- Randomize the experiment, both in terms of timing and the experiments themselves. Be aware of the cycles of resource utilization and system responses harvested during the study of steady state, also keep a watch on any special circumstances that apply during the experiment.
- Automate the running of the experiments, including a way to quickly turn off the experiment, especially if you are experimenting in production. Of course this means automated monitoring and some form of feedback between the chaos framework and the monitoring system.
- Minimize the blast radius. The result of the experiment should not be highly disruptive to production. The various steps discussed above should help with this.
- In an advanced experiment, one could divide the system into two parts; a control system which is not perturbed by the experiment and a system under chaos that can also be measured to see the effects of the experiment, if any. This is advanced practice of chaos engineering.
- Scale: at Netflix, Chaos Monkey turned off only individual processes or VMs; they graduated to Chaos Kong, which turned off entire data centers or regions. This way they were able to see the effects of failing over entire regions.
- The chaos maturity model speaks to the various levels of maturity in chaos engineering. Its axes form a continuum: development systems to production; the variety and sophistication of the experiments; the level of automation; the scale of experimentation; and the familiarity of the teams with the practice. There are some rough and ready names for the stages of the journey: elementary, simple, sophisticated, advanced, etc. A taxonomy for determining this is available in the book cited before.
- Blockchain frameworks are most effective in a multi-enterprise environment in the case of federated or permissioned blockchains. In public blockchains, the environment is not controlled by single types of entities. Specific to blockchain is creating, communicating and executing a ChAP in a multi-stakeholder, multi-enterprise environment. The benefits of using a ChAP should be made very clear. If the ChAP is instituted right in the beginning stages of development this should not pose a very big challenge as the developers, the business users and operations folk have low expectations of the stability of the platform. The ChAP should then be allowed to grow along with the rest of the DLT framework and can become a natural part of the ecosystem. An agreement on ChAP practices should be part of the initial agreement and governance discussion between parties in a permissioned setting.
- For public blockchains, buy-in from the developer community as well as clear communication with the other participants are necessary for adoption; a path from well-established testnets to production systems is needed for ChAP deployment. This may not be easy, as the stakeholders and the governance aspects of public blockchains are still emerging and developing. Existential crises, like the DAO incident in the case of Ethereum or the scaling debate in Bitcoin, expose the vulnerability of these systems and bring forth solutions that are ad hoc. A good ChAP, and progress along the chaos maturity model, might have exposed these vulnerabilities earlier and started the search for solutions sooner. There are scores of other vulnerabilities in the core and edge systems that could have been targeted by a well-developed ChAP.
- A federated testnet is necessary for the ChAP to be ramped up into production; this is true of most enterprise blockchains.
- Knowledge of the specific architecture should drive the ChAP engineering practice. For example, in Hyperledger Fabric, endorsement policies guide the formation of consensus, so removing endorsers until only the minimal number required by the endorsement policy remains can reveal weaknesses in a specific implementation. In Corda, taking out a percentage of the notary network, introducing latency in parts of the network traffic, interfering with the Corda Firewall, etc. could reveal weaknesses in a specific deployment.
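The threshold reasoning in the BFT bullet above can be sketched numerically: a classic BFT protocol needs n ≥ 3f + 1 nodes to tolerate f faults, so an informative experiment stops at most f = ⌊(n − 1)/3⌋ nodes, since exceeding that has a known outcome.

```python
def bft_fault_tolerance(n):
    """Maximum number of faulty nodes a classic BFT protocol tolerates:
    safety requires n >= 3f + 1, hence f = (n - 1) // 3."""
    return (n - 1) // 3

def worth_experimenting(n, nodes_to_stop):
    """Stopping more than f nodes has a known consequence (consensus
    halts), so it yields no new insight; stay at or below the threshold."""
    return nodes_to_stop <= bft_fault_tolerance(n)

print(bft_fault_tolerance(25))    # a 25-node pool tolerates 8 faults
print(worth_experimenting(25, 8)) # at the threshold: still informative
print(worth_experimenting(25, 9)) # beyond it: known failure, skip
```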
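The automated kill switch described earlier can be sketched as a loop that injects a fault, polls a health probe, and rolls back immediately when the monitoring feedback fails. Every hook here is a hypothetical stand-in for real platform tooling.

```python
import time

def run_with_kill_switch(inject_fault, rollback, healthy, checks=5, interval=0.01):
    """Run a chaos action under automated monitoring: abort and roll back
    the moment the health probe fails; always restore the system."""
    inject_fault()
    try:
        for _ in range(checks):
            if not healthy():
                return "aborted"
            time.sleep(interval)
        return "completed"
    finally:
        rollback()  # runs on both completion and abort

# Toy stand-ins for a real fault injector and monitoring probe.
state = {"fault": False}
result = run_with_kill_switch(
    inject_fault=lambda: state.update(fault=True),
    rollback=lambda: state.update(fault=False),
    healthy=lambda: True,  # monitoring says the steady state holds
)
print(result, state["fault"])
```

In production, `healthy` would query the same monitoring system used to establish the steady state, closing the feedback loop between the chaos framework and the monitors.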
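The Fabric example above reduces to simple arithmetic. Assuming a hypothetical "required-of-total" endorsement policy (real Fabric policies can be more elaborate boolean expressions over organizations), the number of endorsers an experiment can safely take offline is:

```python
def removable_endorsers(total, required):
    """For a hypothetical 'required-of-total' endorsement policy, how many
    endorsers can be taken offline while transactions can still gather
    enough endorsements. Stopping one more has the known outcome of
    stalling transactions, so experiments should stop short of that."""
    if required > total:
        raise ValueError("policy cannot be satisfied even with all endorsers up")
    return total - required

# e.g. a 2-of-5 policy: up to 3 endorsers can be offline and endorsement
# remains possible; a 4th outage stalls transactions by construction.
print(removable_endorsers(5, 2))
```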
A look at chaos engineering practice in current large-scale distributed systems reveals its promise and power. Adoption by many firms, including in areas like aircraft testing and hospital systems where the practice is performed on production systems, shows its usefulness even in sensitive applications.
Designing experiments for blockchain frameworks needs a combination of specialist knowledge of the framework, exposure to the principles behind a ChAP, and a team working at various levels to create a practice that grows with the platform, increases confidence in the specific implementation, and hence drives adoption.
We take up the case study of the ChAP for the Indy platform in our next post in this series. This can help us guide our thoughts for a ChAP implementation in specific DLT frameworks.