Learning Through Failure
This article is also available in Italian
Errare humanum est, perseverare autem diabolicum, “to err is human, but to persist with the mistake is devilish,” is a famous Latin quote that makes the argument that human error inspires self-improvement and the avoidance of future errors. It’s an ancient truism but one that the IT industry has only recently started taking into consideration.
Anyone who has ever worked with technology knows that errors and outages are very common. They are practically impossible to avoid and can sometimes cause considerable damage. Consider what happens when an application crashes and hundreds of employees can no longer do their job, or when data is accidentally lost or corrupted. Even the smallest interruption of website function can result in the loss of citizen trust, delaying the adoption of new digital tools and resulting in continued inefficiency.
When digital processes are adopted and become interconnected — for instance, when all the municipal front offices in Italy start using a shared infrastructure provided by the central government — significant efficiencies are achieved but, as a consequence, the demand for higher service availability will increase.
Thanks to studies of large-scale IT operations conducted by companies like Google and Facebook, and the adoption of manufacturing methodologies (1), it has become possible in recent years to develop methodologies that effectively address these issues.
According to these analyses, outages should not be thought of as the result of isolated actions committed by individuals or by the individual components of a system. Rather, outages happen when certain protections are lacking in the event of unforeseen circumstances. This paradigm shift gives us a different perspective from which to analyze incidents, allowing us to implement methodologies capable of solving problems quickly and reducing the frequency of their occurrence by using postmortem documents to elaborate on their potential causes.
The postmortem (2) is a well-known practice in technology. It is a document that records the events, the impact, the actions taken to solve the problem, and so on: a sort of black box that allows us to understand what went wrong. Most importantly, it allows us to determine the causes of an outage and to draw lessons from it.
The primary objectives of a postmortem document are to ensure that the incident is well-documented, that its main causes are understood, and that effective countermeasures have been studied and implemented to reduce the likelihood and impact of its recurrence. It is also recommended that this document be produced within 48 hours of the incident, so that precious details and information are not lost.
It is essential that the analysis and drafting of the postmortem document focus on the causes of an incident rather than on accusations directed at people or work groups for having acted incorrectly or inappropriately (blameless postmortem).
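As an illustration of the objectives above, the following is a minimal sketch of how a postmortem record could be modelled. The class, its field names, and the choice to measure the 48-hour window from the end of the incident are all assumptions made for this example, not a standard format. In keeping with the blameless approach, the record deliberately has no field for naming a responsible person.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Assumption: the 48-hour recommendation is measured from the end of the incident.
DRAFT_DEADLINE = timedelta(hours=48)


@dataclass
class Postmortem:
    """Hypothetical minimal postmortem record; field names are illustrative."""
    title: str
    incident_start: datetime
    incident_end: datetime
    impact: str
    causes: list[str] = field(default_factory=list)
    countermeasures: list[str] = field(default_factory=list)
    timeline: list[tuple[datetime, str]] = field(default_factory=list)

    def duration(self) -> timedelta:
        """Length of the interruption."""
        return self.incident_end - self.incident_start

    def is_timely(self, published_at: datetime) -> bool:
        """True if the document was published within the 48-hour window."""
        return published_at - self.incident_end <= DRAFT_DEADLINE

    def missing_sections(self) -> list[str]:
        """Names of the sections that are still empty."""
        sections = {
            "causes": self.causes,
            "countermeasures": self.countermeasures,
            "timeline": self.timeline,
        }
        return [name for name, content in sections.items() if not content]
```

A draft could call `missing_sections()` before publication to confirm that causes, countermeasures, and a timeline have all been filled in.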
We believe that the introduction of DevOps (3) culture to the Public Administration and its suppliers is essential (as already described in this post). Adopting the use of postmortem documents is particularly important, so that many of the most widespread problems involving the management of PA services can be identified and shared. We want to make our small contribution towards the diffusion of this practice by sharing the postmortem that we produced following the interruption of Cloud SPC Lotto 1 services, an incident that caused disruptions to the Digital Team and to thirty other public administrations for, in some cases, over forty consecutive hours. Although it is uncommon for suppliers to issue detailed postmortem documents to the PA, the supplier of the SPC Cloud contract was able to provide us with this document, even if it took 19 days after the incident to do so. It is our hope that postmortem documents will be published within a few days of the most serious incidents involving public services. This practice should always be included in the tender requirements.
The postmortem document that we are presenting provides an account of the service disruption from the point of view of SPC Cloud users and only concerns the impact of the disruption on the Digital Team’s websites.
The success of the digital transformation of the PA is closely linked to people. Therefore, in addition to redefining the processes involved, it is necessary to foster a change in culture by introducing practices that improve the quality of the work environment and, most importantly, the quality of services.
Impact: the following services cannot be reached:
Duration: 28 hours
Cause: OpenStack network outage – cloud provider “Cloud SPC Lotto 1”
The Digital Team’s websites are based mainly on static HTML generated from the source content of the repositories on GitHub. The HTML is published via a web server (nginx) and served over HTTPS. Forum Italia (http://forum.italia.it) is the only exception to this deployment model, and is managed separately via Docker containers. At any given time, one or more web servers can be deployed on the cloud provider’s (Cloud SPC Lotto 1) OpenStack virtual machines, using the API provided by the platform.
On 19/05/2018, the following services became unreachable due to an internal connectivity issue of the Cloud Service Provider “Cloud SPC”:
Causes and Contributing Factors
According to a postmortem document released by the supplier on 07/06/2018, the interruption of connectivity experienced by the 31 users (tenants) of the SPC Cloud service was triggered by a planned update of the OpenStack platform carried out on the night of Thursday 17/05/2018. The problem was detected the following morning (18/05/2018), thanks to reports from users who were no longer able to access the services provided on the Cloud SPC platform.
The document states that a restart of the control nodes of the OpenStack platform (nodes that handle OpenStack’s management services: neutron, glance, cinder, etc.) caused “an anomaly” in the network infrastructure, blocking the traffic on several computing nodes (nodes where virtual instances are executed), and causing virtual machines belonging to 31 users to become unreachable. The postmortem document also attributes the blocked network activity to a bug in the playbook (update script), which modified the permissions of the file “/var/run/neutron/lock/neutron-iptables”, as indicated in the platform’s official documentation.
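A permissions change like the one described for the lock file can be spotted with a simple check after an update. The sketch below uses only the Python standard library. The helper name `describe_access` is hypothetical, and the supplier's document does not say which owner or mode the neutron services actually require, so the example only reports what it finds rather than asserting a correct value.

```python
import os
import stat


def describe_access(path: str) -> dict:
    """Report a file's mode string and whether the current process can read and write it."""
    info = os.stat(path)
    return {
        "mode": stat.filemode(info.st_mode),       # for example "-rw-r--r--"
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }


if __name__ == "__main__":
    # Path quoted in the supplier's postmortem; it exists only on an OpenStack node,
    # and the check must run as the service's user to be meaningful (assumption).
    lock_file = "/var/run/neutron/lock/neutron-iptables"
    if os.path.exists(lock_file):
        print(describe_access(lock_file))
    else:
        print(f"{lock_file} not found on this machine")
```

Comparing the output before and after a playbook run would show whether the update altered the file's permissions.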
The unavailability of the Cloud SPC infrastructure was undoubtedly the root cause of the problem, but the lack of an application-level protection mechanism for the Digital Team’s services prolonged their unavailability. Because the possibility of the entire cloud provider becoming unreachable had not been taken into account during the design phase of the services, it was not possible to respond adequately to this event. Despite the SPC Cloud provider’s failover mechanisms, the web services were not protected from generalized outages capable of undermining the entire infrastructure of the only Cloud provider at our disposal.
The Cloud SPC platform cannot currently distribute virtual machines across data centers or different regions (OpenStack regions). It would have been useful to be able to distribute virtual resources across independent infrastructures, even infrastructures provided by the same supplier.
In hindsight, the Public Administration should have access to multiple cloud providers, so as to ensure the resilience of its services even when the main cloud provider is interrupted.
The most important lesson we learned from this experience is the need to continue investing in the development of a cross-platform, multi-supplier Cloud model. This model would guarantee the reliability of Public Administration services even when the main cloud provider becomes affected by problems that make it unreachable for a long period of time.
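The core of a multi-supplier model is the ability to route traffic to whichever provider is still reachable. The sketch below shows that selection logic in its simplest form. The provider names and the `is_healthy` probe are hypothetical; in practice the probe might be an HTTPS request to a status page, and the switch itself would typically happen at the DNS or load-balancer level.

```python
from typing import Callable, Optional, Sequence


def pick_provider(
    providers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first provider, in priority order, whose health probe succeeds."""
    for name in providers:
        try:
            if is_healthy(name):
                return name
        except Exception:
            # A probe that raises is treated the same as an unhealthy provider.
            continue
    return None  # every provider is unreachable


if __name__ == "__main__":
    # Illustrative status table standing in for real health probes.
    status = {"main-cloud": False, "backup-cloud": True}
    print(pick_provider(["main-cloud", "backup-cloud"], lambda name: status[name]))
```

With a single provider in the list, as was the case during this incident, the function can only return that provider or nothing, which is the limitation the lesson above describes.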
22:30 CEST: The SPC MaaS alert service sends alerts through email indicating that several nodes can no longer be reached. <START of programmed activities>
06:50 CEST: The aforementioned services, available at the IP address 22.214.171.124, can no longer be reached <START of INTERRUPTION>
08:00 CEST: The problem is detected and reported to the supplier
09:30 CEST: The machines are determined to be accessible through OpenStack’s administration interface (API and GUI) and internal connectivity reveals no issue. Virtual machines can communicate through the tenant’s private network, but do not connect to the Internet.
15:56 CEST: The Digital Team sends the supplier and CONSIP a help request via email
18:00 CEST: The supplier communicates that it has identified the problem, which turns out to be the same problem experienced by the DAF project, and that it has begun work on a manual workaround
19:00 CEST: The supplier informs us that a fix has been produced and that it will be applied to the virtual machines belonging to the 31 public administrations (tenants) involved.
11:10 CEST: The supplier restores connectivity to the VMs of the AgID tenant
11:30 CEST: The Digital Team reboots the web services and the sites are again reachable <END OF INTERRUPTION>
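The 28-hour duration reported in the summary can be checked against the timeline. The entries above carry no dates, so the dates in this sketch are assumptions: the interruption is taken to start on the morning of 18/05/2018, the morning after the planned update, and to end on 19/05/2018. The function name is hypothetical.

```python
from datetime import datetime


def outage_duration(start: datetime, end: datetime) -> tuple[int, int]:
    """Return the elapsed time between two timestamps as (hours, minutes)."""
    total_seconds = int((end - start).total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    return hours, remainder // 60


if __name__ == "__main__":
    # Assumed dates; the times come from the START and END markers in the timeline.
    start = datetime(2018, 5, 18, 6, 50)   # START of INTERRUPTION
    end = datetime(2018, 5, 19, 11, 30)    # END OF INTERRUPTION
    hours, minutes = outage_duration(start, end)
    print(f"{hours} h {minutes} min")      # prints "28 h 40 min"
```

Under these assumptions the result matches the 28 hours stated in the summary, rounded down to the hour.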
(1) Manufacturing methodologies, particularly lean production derived from the Toyota manufacturing system and adapted to IT (Lean IT).
(2) This document reports the results of a Root Cause Analysis, a process that helps to identify the causes of a particular incident, how it happened, and why so that preventative measures can be taken. Google’s SRE culture has been particularly involved in developing this methodology as it applies to IT services.
(3) For further information, see: L. Fong-Jones, N. R. Murphy, B. Beyer, How SRE relates to DevOps, O’Reilly Media, Inc., 2018; N. Forsgren, J. Humble, G. Kim, Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations, 2018; N. R. Murphy, J. Petoff, C. Jones, B. Beyer, Site Reliability Engineering, O’Reilly Media, Inc., 2016.