System Failure: How Complexity and Convexity Make the World Fragile

Bishr Tabbaa
Published in DataSeries · Sep 8, 2019 · 18 min read

Staring at the proverbial blue screen, rebooting one’s machine after a buggy system crash, interpreting inscrutable computer error messages, putting a freeze on your credit report following another data breach, surviving information technology (IT) projects that are knotty, stressful, delayed, and over budget, and emailing IT technical support have all become so banal and routine that we accept them as facts of Life in the Digital Age. Beyond these everyday experiences, there have been critical computer system bugs and defects that have resulted in the loss of human life: the 346 people who died on board the Boeing 737 Max 8 flights in Indonesia and Ethiopia during 2018–2019, the 290 people who perished on Iran Air Flight 655 when it was mistakenly identified as an enemy combatant and shot down by the USS Vincennes in 1988, the 28 US soldiers killed in 1991 by an Iraqi Scud missile that penetrated an errant Patriot missile defense system, and the 6 patients overdosed by the Therac-25 radiation therapy machine during 1985–1987. There have also been computer system failures that have affected thousands of people, such as the Arpanet collapse of 1980, the AT&T network glitch of 1991, the Black Monday Wall Street market crash of 1987, the Y2K flaw, the North American electric grid blackout of 2003, the failed launch of healthcare.gov in 2013, the 9-1-1 US emergency telephone service outage of 2014, and the biblical flood of digital data leaks at many major companies over the last two decades.

Thanks to the general-purpose nature of Boolean logic and binary arithmetic, represented in silicon integrated circuits that can be composed to calculate, store, and communicate, computer systems have become woven into the fabric of Life; they help manage human activities and assets in a remarkable array of fields including commerce, education, entertainment, government, healthcare, infrastructure, military, science, and transportation. However, as computers expand their reach and human reliance upon them grows, our modern economy and society bear substantial costs and serious risks from these computer system defects and IT project problems. Gartner and IDC estimated global IT spending at $3.65–4.8 trillion in 2018, and Tricentis and CISQ appraised the cost of system failures at more than $1.7 trillion in the same year.

Peter Neumann wrote that “hindsight can be valuable when it leads to new foresight”, and Mark Twain is credited with saying that “history may not repeat itself, but it does rhyme.” With these inspirations and computing’s impact in mind, I have been collecting computer systems failure stories, researching insights from different fields, and building a comprehensive model for understanding, preventing, and mitigating computer system failure. The strategic areas of this failure model are familiar to those already in IT: Technology, Organization, and Process. The Technology domain is composed of Complexity, Quality, Security, and DevOps. The Process area is comprised of Scope, Flow, and Communications. The Organization realm consists of Culture, Governance, and Resources. Some of these parts affect others within their area and also connect across the broader area boundaries. We will explore each element of the model, illustrate them with representative stories, and describe solutions to prevent and recover from computer system and IT project failure.
The primary audience of this book is the current and next generation of computer systems professionals; my hope is that the text helps them avoid the problems encountered in the past and renews their ethical and moral imperative to deliver better, safer, more secure systems in the future, because our lives depend on it. The secondary audience is the general reader who may be interested in computer systems and how these systems impact their lives and the world around them.

Computer Systems Failure Model (Source: © Bishr Tabbaa, 2019)

The analysis of the computer systems failure model naturally starts with Technology itself. Complexity is one of the critical technology elements in the model that fundamentally differentiates computer systems and projects from other human artifacts. While the layperson has an intuitive sense of what complexity means, scientists in biology, computer science, economics, and physics are still wrestling to refine and precisely employ this subtle concept, so I want to introduce some important definitions at the outset. Simple comes from the Latin root simplex, and it means easy to know and understand. Simple objects are atomic or readily reducible to just a few elementary parts; simple relationships connote direct, linear, sequential connections between objects. Complex comes from the two Latin roots com- and plect-, which denote “with” and “to braid, intertwine, or weave”; in our computational context, it means unknowable, unpredictable, intractable, or irreducible, owing to many components with dynamic, hidden, non-linear interactions, distributed feedback loops, sensitivity to initial conditions, openness to the environment, and uncertain, emergent behavior. Christian von Ehrenfels summarized the nature of complex systems when he wrote that “the whole is more than the sum of its parts”. Between the simple and the complex lies the complicated, which refers to parts, units, and systems that are not simple but remain knowable, linear, exhaustively describable, relatively bounded, and somewhat predictable, understandable, and manageable through best practices, checklists, design and implementation heuristics, maintenance intervals, reference manuals, visual diagrams, detailed plans, and institutionalized human expertise, perhaps assisted by computers themselves. For a progressive computing analogy using these terms, consider that my keyboard and mouse are simple, my personal computer is complicated, and Internet security is complex. Examples of complex systems in other fields abound and include ant colonies, financial markets, the human brain, some chemical compounds, genetic regulatory networks, and the world’s climate.

A major thesis of this text is that the complexity of computer hardware and software systems has exceeded our current understanding of how these systems work and fail; furthermore, these systems are approaching the complexity of biological systems in their cardinality and networked hierarchy, thanks to the widespread connectivity of the Internet and World Wide Web. The physicist Seth Lloyd proposed three criteria for measuring the complexity of an object or process (how difficult it is to describe, how difficult it is to create, and what its degree of organization is) and enumerated more than forty different metrics. For the purposes of our computer systems failure model, we shall focus on metrics of cardinality and networked hierarchy because they align with complexity, are measurable, and can be associated with computer systems failures.

Let us first examine system complexity based on cardinality, or size. The original computer hardware metric is transistor count on an integrated circuit (IC). For the last 50 years, transistor counts have generally increased in line with Gordon Moore’s Law, which states that the transistor count doubles approximately every two years. ICs are used in microprocessors, field programmable gate arrays (FPGA), flash memory chips, and graphics processing units (GPU). Since Intel introduced the microprocessor in 1971, transistor counts have risen exponentially from about 2,300 in the Intel 4004 to eight billion in the 72-core Intel Xeon Phi, a remarkable increase of nearly 3.5 million times, while the gate length concomitantly dropped from 10,000 nm in the 4004 to just 14 nm in the Xeon Phi. Due to the higher transistor density and smaller gate length, computer scientists and engineers are reaching the physical limits of silicon atoms (around 0.2 nm in size) and are now weathering quantum side-effects of light during the miniaturized lithographic process, as well as the increasing heat generated by the denser circuitry.

The sister metric for software size is Lines of Code (LOC), whereby each line represents roughly one computer instruction, analogous to a deoxyribonucleic acid (DNA) base pair. For example, Google’s cloud platform has about 2 billion lines of code spread across 8 million unique files; by comparison, the human genome has about 3.3 billion base pairs (bp), a field mouse genome has about 120 million bp, the baker’s yeast genome has 12 million bp, and the genome of the syphilis bacterium, a prokaryote, has about 1 million bp. For a broader perspective, I have shared below a tabular excerpt of the Lines of Code data set sourced from Visual Capitalist. Although LOC is a reliable indicator of program size and does correlate with the number of system defects (some studies show that high-quality code has low defects/KLOC), the LOC metric does not entirely capture complexity, which includes individual component states and their relationships, nor has it been shown to correlate with important system properties of interest such as performance, reliability, safety, and security. LOC also does not take programming language abstraction into account: a low-level line of x86, MIPS, or SPARC assembly language code is fundamentally different from, and more limited in the logical work it can accomplish than, a higher-level line of C, Java, or Python.

Lines of Code for selected software systems (Source: VisualCapitalist.com)
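To make the LOC metric concrete, here is a minimal Python sketch of a line counter (my own illustration, not part of the original data set); the source directory and file extensions are placeholder assumptions.

```python
# Minimal sketch (illustrative only): tally non-blank physical lines of code
# per source file under a directory tree. SOURCE_ROOT and EXTENSIONS are
# placeholder assumptions, not values from the article or data set.
from pathlib import Path

SOURCE_ROOT = Path("./src")           # hypothetical project root
EXTENSIONS = {".py", ".c", ".java"}   # hypothetical language mix

def count_loc(root: Path) -> dict:
    """Return a mapping of file path -> count of non-blank lines."""
    totals = {}
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in EXTENSIONS:
            lines = path.read_text(errors="ignore").splitlines()
            totals[str(path)] = sum(1 for line in lines if line.strip())
    return totals

if __name__ == "__main__":
    per_file = count_loc(SOURCE_ROOT)
    print(f"{len(per_file)} files, {sum(per_file.values())} total LOC")
```

Even this toy counter exposes the metric’s blind spot: a terse line of Python and a verbose line of assembly each count as one.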

Next, we shall extend the concept of system cardinality from its external dimensions to its internal state space. In 1948, Claude Shannon proposed encoding and transmitting information on communication channels built from digital relays and flip-flop circuits using binary digits and arithmetic; for messages with N possible states, Shannon showed that channels using binary encodings could represent these messages using log2(N) bits. Algorithmic complexity was later defined by Chaitin, Kolmogorov, and Solomonoff in the 1960s as the length of the shortest computer program that could produce a specified output, and this concept was further applied, using big O notation, to the space and time needed to compute an output for a given input of size N.

In 1976, Thomas McCabe Sr. introduced the idea of cyclomatic complexity (CC), which defines, for a logical system modeled as a control flow graph G, CC = E - N + 2P, where E is the number of edges (control flow transfers), N is the number of nodes (sequential statement blocks), and P is the number of connected components of G. CC equals the number of linearly independent control paths through a software program, roughly the number of logical decisions plus one; binary decisions such as “if” and “while” statements each add one to the CC metric. McCabe recommended an empirical maximum CC of 10 for a function, method, module, or component (limits as high as 15 have also been used in practice); when this level was exceeded, a split was advised. Carnegie Mellon University’s Software Engineering Institute has suggested a set of heuristics: a component with a CC between 1 and 10 is simple and easy to work with, values between 10 and 20 indicate more complicated code that may still be comprehensible, values above 20 are typical of code that can only be grasped with much effort and should be refactored, and components with values higher than 50 are too complicated and unmaintainable and should be rewritten entirely.

In 1977, Maurice Howard Halstead articulated a comprehensive set of software metrics in an effort to make software engineering more like a science and less of an art. For a given program P with n1 distinct operators, n2 distinct operands, N1 total operators, and N2 total operands, Halstead proposed several measures: a length N = N1 + N2, a volume V = N * log2(n1 + n2), an abstraction level L = (2 * n2) / (n1 * N2), and an overall effort E = V / L = (n1 * N2 * (N1 + N2) * log2(n1 + n2)) / (2 * n2). One can then apply these ideas together to a limited Turing machine model known as a Finite State Machine (FSM) representing the internal logical behavior of a component with an alphabet of A symbols, S states, and T transitions; this suggests a complexity metric of T * (log(A) + log(S)). While Halstead’s metrics were impractical to compute for large systems and did not take root in industry and practice, the other metrics continue to be used.

The common intuition behind these algorithmic component metrics is that most computations should be short, simple, and sequential enough to reliably understand, document, code, test, and troubleshoot. Computer programmers still commonly use the ideas of Chaitin, Kolmogorov, and Solomonoff to evaluate the performance tradeoffs of different algorithms. Furthermore, a sound white box testing strategy can be derived from the McCabe CC metric such that each function or method has a number of test cases (preferably automated) equal to its cyclomatic complexity.
While some studies have shown a positive correlation between the CC metric and the number of code defects, the research has not been conclusive. This has not stopped international safety standards such as ISO 26262 and IEC 62304, however, from mandating that software have low cyclomatic complexity.
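To ground these formulas, the following Python sketch (my illustration, not the author’s tooling) evaluates McCabe’s cyclomatic complexity, Halstead’s measures, and the FSM metric directly from the quantities defined above; the sample values are made up, and base-2 logarithms are assumed throughout in keeping with Shannon’s bit measure.

```python
# Minimal sketch with made-up sample values: the component complexity
# metrics discussed above, computed directly from their formulas.
from math import log2

def cyclomatic_complexity(edges: int, nodes: int, parts: int = 1) -> int:
    """McCabe: CC = E - N + 2P for a control flow graph."""
    return edges - nodes + 2 * parts

def halstead(n1: int, n2: int, N1: int, N2: int) -> dict:
    """Halstead length, volume, abstraction level, and effort."""
    length = N1 + N2                       # N = N1 + N2
    volume = length * log2(n1 + n2)        # V = N * log2(n1 + n2)
    level = (2 * n2) / (n1 * N2)           # L = (2*n2) / (n1*N2)
    effort = volume / level                # E = V / L
    return {"length": length, "volume": volume, "level": level, "effort": effort}

def fsm_complexity(transitions: int, alphabet: int, states: int) -> float:
    """T * (log2(A) + log2(S)) for a finite state machine."""
    return transitions * (log2(alphabet) + log2(states))

# Hypothetical component: 11 edges, 9 nodes, 1 connected part -> CC = 4,
# i.e. four linearly independent paths and, per the white box strategy
# above, ideally at least four automated test cases.
print(cyclomatic_complexity(edges=11, nodes=9, parts=1))
print(halstead(n1=12, n2=7, N1=27, N2=15))
print(fsm_complexity(transitions=20, alphabet=4, states=8))
```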

McCabe’s usage of control flow graphs and our earlier comparisons to biology lead us to the last aspect of Complexity as it relates to computer systems: networked hierarchy. Again, we draw inspiration from biology and briefly reflect on its strata, which consist of DNA, proteins, organelles, cells, tissues, organs, organisms, and ecosystems; individuals, families, and communities of organisms also compete and collaborate with each other for resources within ecosystems. The biological analogy also shows the limits of linear concepts: although the human genome has 3 billion bp, only about 1% of the DNA is encoded into roughly 20,000 genes and expressed into proteins, and yet we bear witness every day to the remarkable diversity of humankind. It is not just the base pair instructions as linear sequences, but their non-linear relationships as genes in epiphenomenal regulatory networks, as well as the non-coding base pairs, that give rise to the stunning complexity of biological systems. Similarly, computer systems consist of a progressive networked hierarchy: silicon-integrated transistor circuits; computers hosting operating systems, containers, and virtual machines, all executing software applications written in different programming languages with dependencies crossing process, machine, and network boundaries; telecommunication channels made of cable and wireless links that transmit information between devices across your home, the office, and our world; digital networks of switches and routers that move packets of information across those physical channels; databases, magnetic hard disks, network storage arrays, and backup tapes storing persistent information and shared state across those hardware and software stacks; cloud infrastructure providers bundling many of these compute, network, storage, and application services; and miniaturized digital devices such as IoT actuators and sensors embedded into the physical environment, integrating all of these systems with smartphones, robots, and much more. So both biological and computational ecosystems are multilayer structures composed of interacting components whose dynamic relationships pulse within their layers and reverberate across them.

Given this diversity in the computer network hierarchy, one might well ask what reasonable metrics exist to measure it. Some have proposed trivial metrics such as the size, height, or volume of the graph. Others have suggested variations on connectivity, which is the minimum number of edges that must be removed before a graph is split into two non-connected subgraphs. An interesting approach refining connectivity, called the global reaching centrality (GRC), computes the difference between the maximum and the average value of the local reaching centralities over the network, where the local reaching centrality of a node i in graph G is the proportion of all nodes in the graph that can be reached from i via outgoing edges. For example, a star graph has a GRC of 1, since all nodes are connected through a central node; randomized trees with random branching have GRCs that follow a power law and are approximately 1; and real networks such as brain neurons, food webs, electric grids, metabolic pathways, organizational trust, human language, and the Internet have been measured with GRCs varying between 0.1 and 0.9, depending on the specific network and the average degree of its vertices; see the table with details below, sourced from the fascinating paper by Mones, Vicsek, et al.
Although measuring network complexity remains an active area of research, efforts to quantify node degree and dependence are confirming the fundamental hypothesis of network and complex systems researchers across multiple disciplines: relationship transitivity matters more than it is often credited. The traditional Newtonian-Cartesian ethic, rooted in linear cause-and-effect, decomposability, reductionism, foreseeability of harm, time reversibility, and an obsession with finding broken parts and blaming people, still dominates mainstream intellectual theory and practice in accident investigations, the law, and systems engineering.

Global Reaching Centrality Statistics for Real-World Networks (Source: Mones, Vicsek, et al.)
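For readers who want to experiment with GRC on their own graphs, here is a minimal Python sketch (mine, not the paper’s code) that follows the definition above, using networkx only for graph storage and reachability; it reproduces the star-graph result of GRC = 1.

```python
# Minimal sketch: local and global reaching centrality (GRC) for a directed
# graph, following the definition above. networkx supplies the graph
# structure and reachability; the GRC arithmetic is spelled out explicitly.
import networkx as nx

def local_reaching_centrality(G: nx.DiGraph, node) -> float:
    """Fraction of all other nodes reachable from `node` via outgoing edges."""
    return len(nx.descendants(G, node)) / (G.number_of_nodes() - 1)

def global_reaching_centrality(G: nx.DiGraph) -> float:
    """Average gap between the most-reaching node and every other node."""
    c = [local_reaching_centrality(G, v) for v in G.nodes]
    c_max = max(c)
    return sum(c_max - ci for ci in c) / (G.number_of_nodes() - 1)

# A ten-node star with edges pointing outward from the hub: only the hub can
# reach everything and the leaves can reach nothing, so GRC should be 1.0.
star = nx.DiGraph([(0, leaf) for leaf in range(1, 10)])
print(global_reaching_centrality(star))   # -> 1.0
```

Recent versions of networkx also ship a built-in global_reaching_centrality function, which should agree with this hand-rolled version for unweighted directed graphs.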

Some component and system requirements, such as correctness, cost, compliance, performance, and usability, do range between simple and complicated, and they can be managed in a relatively straightforward way. We shall investigate technologies and processes to improve system Quality for these requirements in another chapter. However, other important system properties we care about, such as reliability, safety, and security, are emergent, macroscopic, and transcendentally transitive; they require a thoroughly different language and mental model than the Newtonian-Cartesian ethic, and ultimately more systemic solutions that go up-and-out instead of just down-and-in.

For example, the Arpanet collapse on October 27, 1980 occurred due to several contributing factors: a hardware defect that dropped bits in memory; software that had error-detecting codes for transmission but not for storage; a separate software flaw in which the garbage collection of messages was poisoned by the simultaneous existence of duplicate messages; and finally, the sheer growth of the network itself, a consequence of its initial success, which magnified the impact of the failure. Similarly, in the case of the 9-1-1 service outage on April 9, 2014 in the Pacific Northwest of the US, multiple causes contributed to the six-hour service failure: a numeric counter overflow in Intrado’s remote data center in Colorado; the delay in Intrado identifying the problem and transferring emergency phone services to its redundant facility in Florida; poor crisis coordination between public and private organizations; and the absence of any local, municipal backup systems due to government cost pressures, negligence, and an over-emphasis on centralization. In the case of the North American electricity grid blackout on August 14, 2003, there was also a multitude of reasons for the network failure: a software race condition in GE’s XA/21 energy monitoring system used by FirstEnergy (FE); inadequate FE corporate policies, technical maintenance, and situational awareness of its Midwest grid sections related to tree pruning, voltage reserves, and system monitoring; insufficient real-time alert, diagnostic, and coordination systems shared between public and private organizations; and inadequate regulatory oversight and inspections due to government cost pressures and other priorities.

The common threads of these stories about computer systems failure, and the others we will explore in the text, are multiple causes, intertwined technological and human aspects, and an intrinsic complexity reflecting the modern, globalized, technocratic, bureaucratic society that we live in. Like entropy in thermodynamics, technological complexity has grown as computer systems are used to solve more problems of larger Scope and as they become more interconnected through intentional system integration as well as the unforeseen dependencies of distributed systems. Peter Deutsch, a senior distinguished engineer at Sun Microsystems, articulated the now-classic fallacies of distributed systems programming common among computer professionals: 1) the network is reliable, 2) latency is zero, 3) bandwidth is infinite, 4) the network is secure, 5) topology does not change, 6) there is one administrator, 7) transport cost is zero, and 8) the network is homogeneous (this last fallacy was added later by James Gosling). In practice, none of these assumptions ever holds, and yet too often, as we shall see throughout the text, they rear their heads like the hydra of Greek mythology.
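None of these fallacies can be engineered away, but they can be acknowledged in code. Below is a small, illustrative Python sketch (not drawn from any of the incidents above) of the defensive posture the first two fallacies demand: every remote call carries an explicit timeout and bounded, backed-off retries rather than an assumption that the network is reliable and latency is zero; the URL and retry parameters are hypothetical.

```python
# Illustrative sketch: treat the network as unreliable and slow by default.
# Each remote call gets an explicit timeout plus bounded retries with
# exponential backoff and jitter. The URL below is hypothetical.
import random
import time
import urllib.request

def call_with_retries(url: str, attempts: int = 4, timeout: float = 2.0) -> bytes:
    """Fetch `url`, retrying transient network failures with backoff."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError:                      # covers URLError and socket timeouts
            if attempt == attempts - 1:
                raise                        # give up after the final attempt
            time.sleep((2 ** attempt) + random.random())  # backoff + jitter

# Hypothetical usage:
# payload = call_with_retries("https://example.com/api/status")
```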

Let’s step back further from the purely technological aspects of computer system complexity and consider how it is affected by the human dimensions reflected in economics, politics, psychology, and sociology. These computer hardware and software systems are designed, constructed, deployed, maintained, and then used by different people in different places over different periods of time, regulated by different legal jurisdictions, and under the pressure of important environmental influences (e.g. competition, resource scarcity, production pressures, cultural values). Sociologists, psychologists, and human factors researchers became especially interested in these ingredients in the latter half of the 20th century and in how they affect the recipes of system success and failure. Melvin Conway presciently stated in 1967 that organizations which design systems are constrained to produce designs which are copies of the communication structures of those organizations. Charles Perrow coined the phrase “normal accidents” in 1984 to describe a system failure that is not simply the cause-and-effect result of an imperfect part, process, or person that can be remedied in an isolated, linear, reductionist manner, but rather a failure unfolding from the coupled interfaces, hidden connections, and non-linear feedback loops between components situated in a complex system with high-risk, catastrophic potential. Even after the Challenger space shuttle accident in 1986, NASA projects of the 1990s were heavily shaped by the agency’s motto of “faster, better, cheaper”, which spilled into mainstream business jargon and came to epitomize the intractable conflicts of system requirements. Furthermore, the Silicon Valley engineering mantra to “move fast and break things”, as well as the oft-repeated management maxim of “doing more with less”, point to the intense production pressure across technology industries and the constant tension of cost, speed, and efficiency pitted against quality, safety, and security. Now weigh the high turnover rates of technology workers, estimated at over 10% annually, against their median tenure of just two years: the impact of organizational decisions on stakeholders outlasts individual employee tenure, which suggests that individuals rarely stay around long enough to learn from and resolve their mistakes.

We must also note that software manufacturers are generally exempt from the lemon laws in North America, Europe, and Asia that are commonplace for physical device makers; users of computer software in particular receive a license, filled with a litany of abstruse disclaimers, to use but not to own the system. Furthermore, in the hyper-competitive, relatively unregulated societies of the USA and Asia, there is a remarkably dominant paradigm that emphasizes economic growth, materialism, and market forces, treats the natural environment as a resource rather than as something to be intrinsically valued, favors risk-and-reward relativity over absolute safety, and socializes catastrophic risks such as environmental disasters and financial crises as externalities for which private corporations and individuals can absolve themselves of responsibility and delegate to the general public and the taxpayer.
Sidney Dekker formulated the metaphor of “drift into failure” to describe how normal people doing normal work in normal organizations, confronted with a Bermuda triangle of unruly technology, production pressures, and conflicting goals, go on to normalize deviance from important safety norms on the basis of empirical success (a slippery slope), and how this decrementalism continues while the growing risks incubate until there is a major catastrophic accident. Dekker wrote that “the bright side brews the dark side, given enough time, enough uncertainty, enough pressure”, explained how system complexity supports the existence of non-linear phase shifts and tipping points, and concluded that “the growth of complexity in society has got ahead of our understanding of how complex systems work and fail.”

For want of a nail, the shoe was lost.

For want of a shoe, the horse was lost.

For want of a horse, the rider was lost.

For want of a rider, the battle was lost.

For want of a battle, the kingdom was lost.

All for the loss of a horseshoe nail.

A rural English saying collected by the English poet George Herbert (1640)

Managing system complexity is less about fixing the broken gears of a machine and more about introducing plasticity, resilience, and robustness throughout the lattice of relationships of an amphibious high-rise structure balanced upon pontoons. We need our systems to bend and not break, and Nassim Taleb’s philosophy of antifragility is a good framework for integrating a comprehensive mix of technology, organization, and process elements that can help systems survive and thrive under complexity. Taleb defines antifragility as a convex response to a stressor or source of harm, leading to a positive sensitivity to increases in volatility (or variability, stress, dispersion of outcomes, or uncertainty).

From the Technology dimension, the primary antifragile principles, practices, and parts are abstraction, decoupling, de-centralization, dependence mapping, end-to-end testing, error handling, fault tolerance, formal verification, hazard analysis (e.g. checklists, fault trees, FMEA, state machines), intuitive user experience, layering, microservices, randomized fault injection in production (e.g. Chaos Monkey), self-healing (e.g. Kubernetes, OpenShift, AWS Elastic Beanstalk), separation of privileges, virtualization, and, paradoxically, simplicity when appropriate.

From the Process facet, project and product activities must flow in an Agile manner (not waterfall) and thus be open to change. Smaller scopes should be embraced earlier to allow teams time for trial-and-error and proofs-of-concept that reduce risk. Test-driven engineering and peer review should be adopted during design and construction to improve quality. Communications ought to include diverse stakeholders’ views and be shared across the organization. Scrums, sprint retrospectives, and milestone post-mortems should be regularly scheduled and their minutes published to enable learning throughout the system life cycle.

Finally, from the Organizational aspect, good governance means assigning clear system and project accountability, combining risk and reward metrics to evaluate initiatives across the portfolio, eliciting executive support, and engaging stakeholders. From a resource standpoint, organizations must manage talent, schedules, and budgets carefully with an eye on the aforementioned governance metrics. Organizations also need to foster a healthier, more heterogeneous culture that makes space for humility, paranoia, pessimism, and vigilance to balance the complacency, optimism, overconfidence, and technophilia that underpin the hegemonic mental model for many in engineering, business, government, and society. Nancy Leveson, who coined the term Safeware to describe a multifactorial approach to system safety engineering, wrote that since the 19th century “industrialisation has substituted man-made hazards for those rooted in nature”, with increasing risks from new hazard sources (e.g. aviation, chemistry, biotechnology, energy, environment, medicine, military, pesticides, space), more exposure of people to technological risk as they concentrate in urban areas, more energy consumed in production, more automation, faster technological change, and, lastly, more interdependent networks of system dependencies. Complexity is at the heart of this Gordian knot, tied as mankind has reached beyond his grasp; however, it can be managed and gently loosened if we are open-minded about a comprehensive set of strategies and tactics.
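As one concrete illustration of the randomized fault injection practice listed above, and only in the spirit of Chaos Monkey rather than its actual implementation, here is a small Python sketch of a decorator that, when an opt-in environment flag is set, randomly injects latency or errors into a wrapped call so that retries, timeouts, and fallbacks are exercised before a real outage does it for us; the function name, environment variable, and rates are all hypothetical.

```python
# Illustrative sketch of randomized fault injection (in the spirit of Chaos
# Monkey, not its actual code). When CHAOS_ENABLED=1 is set in the target
# environment, wrapped calls occasionally gain extra latency or fail outright,
# forcing error handling and fault tolerance to be exercised continuously.
import functools
import os
import random
import time

CHAOS_ENABLED = os.environ.get("CHAOS_ENABLED") == "1"   # opt-in guard
FAULT_RATE = 0.1                                          # inject into ~10% of calls

def chaos(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if CHAOS_ENABLED and random.random() < FAULT_RATE:
            if random.random() < 0.5:
                time.sleep(random.uniform(0.5, 3.0))             # inject latency
            else:
                raise ConnectionError("chaos: injected fault")   # inject failure
        return func(*args, **kwargs)
    return wrapper

@chaos
def lookup_inventory(sku: str) -> int:
    """Hypothetical downstream call wrapped by the chaos decorator."""
    return 42

if __name__ == "__main__":
    print(lookup_inventory("ABC-123"))
```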


