This is an excerpt from the Stella Report, which is the result of the first project, “Coping With Complexity”, performed by the Resilience Engineering In Business-Critical Software Consortium (“SNAFUcatchers”). A video introduction to the report and the consortium is also available.
4.5 “Strange loop” dependencies
As systems become more complex, strange loops emerge, in which a part that provides a function also depends on the function it provides (Hofstadter 2007, p. 101). This can remain unproblematic while systems function normally, but strange loops produce difficulties when surprises occur and anomalies arise. Managing, monitoring, and modifying digital services all depend on digital services in the same network. (The cardiovascular system and nuclear power plants also contain strange loops.) All three cases (Sections 3.1–3.3) were complicated by strange loop dependencies.
For example, assembling an understanding of the Logstash SNAFU (3.3) depended on examining logs created by rsyslog, but rsyslog was unable to stream entries into the remote log because Logstash was processing new messages so slowly that the kernel TCP/IP queue was nearly always full. Issuing new console commands that would normally have produced meaningful log entries instead slowed the system even further, because those commands generate traffic via snoopy directed towards rsyslog.
Strange loop dependencies also contributed to the Travis CI SNAFU (3.2) and the Apache SNAFU (3.1).
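The Logstash example can be made concrete by writing its dependencies down as a small directed graph and searching it for cycles. The following is only a minimal sketch under stated assumptions: the node names are illustrative labels rather than identifiers from the report, and the edge from `logstash` back to `diagnose_logstash` is a modelling assumption expressing that a degraded Logstash can only be restored once it has been diagnosed.

```python
from typing import Dict, List

# Hypothetical dependency map: each key depends on the items in its list.
# Node names are illustrative labels for the Logstash example only.
DEPENDS_ON: Dict[str, List[str]] = {
    "diagnose_logstash": ["remote_logs", "console_commands"],
    "remote_logs": ["rsyslog"],
    "console_commands": ["snoopy"],   # commands are logged via snoopy
    "snoopy": ["rsyslog"],
    "rsyslog": ["logstash"],          # rsyslog streams entries to Logstash
    "logstash": ["diagnose_logstash"],  # assumption: recovery needs diagnosis
}


def find_cycles(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Return every simple dependency cycle in graph, each listed once.

    A cycle is reported starting (and ending) at its alphabetically
    smallest node, which avoids listing rotations of the same loop.
    """
    cycles: List[List[str]] = []

    def visit(start: str, node: str, path: List[str]) -> None:
        for dep in graph.get(node, []):
            if dep == start:
                cycles.append(path + [start])
            elif dep > start and dep not in path:
                visit(start, dep, path + [dep])

    for start in sorted(graph):
        visit(start, start, [start])
    return cycles


if __name__ == "__main__":
    for cycle in find_cycles(DEPENDS_ON):
        print(" -> ".join(cycle))
```

With this map the search reports two loops, one through the remote logs and one through the console commands, and both pass through `rsyslog` and `logstash`. A check like this can only surface loops that someone has already recognized and recorded as edges; it does not address how such dependencies should be managed.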
Strange loop phenomena are common in modern computing with its elaborate tool chains and complex dependencies. Sometimes a strange loop complication can be anticipated. An example from another source:
A site was constantly being revised and corrected according to the continuous deployment paradigm and using an automated process that had become reliable over several years of improvement. The site was changed as often as twenty times per day using this automation. Routine maintenance on the automation created a fault pathway which, if activated, would keep the deploy automation from upgrading the site. Because the viability of the site was assured by the constant attention it received and the capacity to immediately deploy a (corrective) change, this was regarded as a higher order emergency than a site outage.
The realization that the organization was so dependent on its deployment automation came as a shock. Because the firm had invested so heavily in deployment automation, monitoring, and the ability to quickly correct faults, site failures had become ‘ordinary’ events, in contrast to a failure of the deployment automation which now took on an existential character.
Discussion of strange loops among the members at the Stella workshop was lively. In all three cases presented, strange loop dependencies either complicated the response or were central to the anomaly. There is also a troubling association between the strange loop quality of the anomalies and failures of automation. What is clear is:
(a) the complexity of business-critical software means strange loops are present; and
(b) strange loop dependencies make anomalies difficult to resolve.
What is not clear is how to manage the risks posed by strange loop dependencies in business-critical software.