Improving Incident Learning Part 3

Andrew Hatch
Published in SEEK blog
11 min read · Apr 8, 2020

This is part 3 of a companion piece to a talk that was presented throughout 2019 and 2020: Learning from Incidents at SEEK.

In our previous post we were left pondering a few questions:

  • What is the origin story behind the 5 Whys and Root Cause Analysis? What was so compelling about using them in incident post-mortems and reviews?
  • Why couldn’t we see incidents as an unplanned investment in building more resilient software, rather than something to be avoided and controlled?

In this blog post we will explore these questions further by proposing a theory on how the origins of the management of work (the discipline of ensuring that what needs to be done gets done) still influence our approach to managing incidents.

Management of work

People working in organisations today are used to multiple levels of management. It’s how modern organisations manage hundreds, if not thousands, of people to ensure they work together to produce the products and services their customers need. There are many different flavours of management, ranging from draconian command-and-control models through to modern, highly decentralised practices such as holacratic management.

Hierarchical management structures as we know them today originated around 150 years ago. Yet despite the long passage of time, their influence on management thinking, born in the era of coal and steam power, can still be found in many workplaces.

Origins of management

As the Second Industrial Revolution reached full flight by the turn of the 20th century, workforces within companies were scaling up rapidly, as new advances in technology accelerated innovation on an industrial scale. Inventions such as the telegraph and the mass expansion of railway networks created compelling opportunities for organisations to capitalise on the wealth they could create.

Between 1869 and 1910, the value of American manufacturing rose from $3 billion to $13 billion. The steel industry produced just 68,000 tons in 1870, but 4.2 million tons in 1890. The central vehicle of this surge in economic productivity was the modern corporation.

Photo by Science in HD on Unsplash

Managing these rapidly growing workforces became a significant challenge.

These challenges led to many new ideas and practices for the effective and efficient management of labour. This period has become known as the Efficiency Movement.

Frederick Winslow Taylor is widely considered one of the most prominent thinkers of this time. His famous publication, The Principles of Scientific Management (1911), was widely adopted by companies that wanted to extract maximum performance from their workers and establish management structures that could scale with growth.

Frederick Winslow Taylor — One of the most influential management consultants of the 20th century

Known generally as Taylorism, the methods and teachings within his publication became highly influential on 20th-century organisations, most notably because of the success companies enjoyed by employing his methods.

While much has been written about Taylor’s scientific management theories and practices, such as the use of time and motion studies, there is just as much (if not more) written about their negative impacts: the poor psychological health of workers, tendencies towards eugenics in the selection of management personnel, and the way command-and-control leadership structures stifle agility and innovation. Any further discussion of this, however, is well beyond the scope of this blog!

So what does Taylorism have to do with incident management?

To understand this we need to understand an important philosophy behind Taylorism:

The relationship between work-as-imagined and work-as-done, must be kept equal, at all times.

To put it another way: a desirable success measure for any organisation is to consistently and efficiently produce the outcomes it requires, using processes, enforced by management, that demonstrate traceability from the corner office down to the factory floor.

We’ll refer back to this equation frequently for the remainder of this post.

Under this equation an incident is viewed as a breakdown in the process: a potential failure of management to stay in control of the factory floor or meet production commitments, or, worse, something that puts the company at risk of legal proceedings if the aftermath is bad enough.

Cynically, it isn’t too difficult to imagine how many incidents over the decades were unfairly attributed to “human error”. For one thing, it exonerates managers by placing the blame on the worker and their inability to follow prescriptive processes!

While much has changed in workplaces since the beginning of the 20th century, these influences still exist. When I entered the workforce after university in 1999 as a software engineer, these were the conditions I worked under:

  • My performance was measured by the lines of code I was able to produce per day, correlated with the time spent on multiple tasks and tracked using numerous timesheet codes, i.e. time and motion studies
  • Work was planned and forecast using Gantt charts, which originated in the 1910s with Henry Gantt, a former colleague of Frederick Taylor
  • Incident analysis strongly advocated the 5 Whys and Root Cause Analysis techniques; we were instructed to treat the relationship between the cause and effect of an incident as 1:1
  • Software was developed using the Waterfall model of delivery, which is, in all respects, a modern interpretation of an early 20th-century assembly line.

An example of waterfall project stages used in software delivery. Progression to each stage usually requires extensive documentation, sign-offs and lengthy approval processes

In the 20 years since I entered the workforce, software engineering has been through a paradigm shift as adaptive methods of work displaced predictive ones. Yet the same shift can’t be seen in how we approach and cope with complex software systems when they fail. Specifically, there is still a strong tendency to:

  • Establish heavily documented processes to maintain the perception of quality control, i.e. keep work-as-imagined = work-as-done
  • Isolate failure down to single origins using the 5 Whys and Root Cause Analysis
  • Create localised action items that address only specific causes of failure; learnings, contributing factors and systemic issues often get missed (see the sketch after this list)
  • Generate reports, passed back up the hierarchy, that use quantitative metrics and lagging indicators to demonstrate that control is being regained.
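
To make that contrast concrete, here is a minimal sketch of the difference between the two report shapes. It is purely illustrative: the class and field names are hypothetical, not part of any real incident tooling.

```python
from dataclasses import dataclass, field

# Hypothetical report shapes, for illustration only.

# Linear-thinking record: the incident is reduced to a single origin,
# and the action item targets only that one cause.
@dataclass
class RootCauseReport:
    incident_id: str
    root_cause: str          # exactly one cause, assumed 1:1 with the failure
    corrective_action: str   # a localised fix for that single cause

# A record shaped for complexity: many interacting factors, systemic
# issues and learnings, each of them a thread worth pulling.
@dataclass
class IncidentLearningReview:
    incident_id: str
    contributing_factors: list[str] = field(default_factory=list)
    systemic_issues: list[str] = field(default_factory=list)
    learnings: list[str] = field(default_factory=list)
```

Nothing about the second shape is clever; the point is that the schema we report with quietly dictates what we go looking for.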

So where does the thinking behind these approaches to managing incidents come from?

The same thinking that early management theory was founded on: linear thinking.

Linear thinking: where it comes from and how it influences incidents

A linear, scalable and repeatable process such as an assembly line should, by design, produce the same product over and over again, consistently and efficiently. It stands to reason that the thought processes, theory and management structures used to govern it will follow the same mindset.

Influenced by early 20th-century management practices, we approach incident management using this mindset, meaning we see every failure as a sequence of events that are predictable (in hindsight) and believed to stem from a single origin. One well-known form of this style of incident analysis (there are others) is the Swiss Cheese Model.

So, to understand “what went wrong” using linear thinking, we isolate each incident down to a single origin of failure, then work backwards (along the theoretical assembly line), following a 5 Whys sequence until we find the breakdown in process.
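
As a toy illustration, here is what that walk-back amounts to. The incident details are invented and this is not a real analysis tool; the point is the shape of the data, in which every effect is assumed to have exactly one cause.

```python
# Linear root cause analysis as a single-parent causal chain: each effect
# maps to exactly one cause, so the incident can be walked backwards like
# stations on an assembly line. (Invented example, not a real incident.)
CAUSE_OF = {
    "site returned 500 errors": "API processes crashed",
    "API processes crashed": "database connections were exhausted",
    "database connections were exhausted": "connection pool was misconfigured",
    "connection pool was misconfigured": "config change was not reviewed",
    "config change was not reviewed": "process was not followed",  # the "root cause"
}

def five_whys(effect: str, depth: int = 5) -> None:
    """Ask 'why' up to `depth` times, following the one-parent chain."""
    for _ in range(depth):
        cause = CAUSE_OF.get(effect)
        if cause is None:
            return
        print(f"Why did '{effect}' happen? Because '{cause}'.")
        effect = cause

five_whys("site returned 500 errors")
```

The structure itself is the tell: a mapping with one parent per event cannot represent several factors combining, so whatever we feed it, the analysis can only ever surface a single “root”.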

By the mid 20th century management thinking was starting to change. The Toyota Production System, developed by Taiichi Ohno and heavily influenced by the quality teachings of W. Edwards Deming, brought many changes to the Taylorist-influenced world: most notably a greater focus on humanising work, putting control in the hands of operators, and establishing processes focused on quality at the source and just-in-time manufacturing. This work is broadly admired, regularly referenced and widely adopted in the software industry.

It is interesting to note that the 5 Whys and Root Cause Analysis came from Toyota and have seen widespread use in Lean Manufacturing.

Linear thinking influences on incident management, how far have we really come? — Source

The Toyota Production System was conceived and developed almost 70 years ago, at a time when humans were nowhere near producing or coping with systems as complex as those we have now. How can we really know we are coping with failure in this dynamic and unpredictable world as best we can by following decades-old techniques such as the 5 Whys and RCA?

First we need to understand what modern system complexity looks like, and how we build, maintain and support such systems, before we can understand how to learn from their failures and, ideally, think differently about how to deal with them.

Modern complexity

A modern website like SEEK has hundreds of services, systems and processes, all integrating and sharing information in real time, 24 hours a day, 7 days a week. It is incredibly dynamic and complex; in fact, at the time of publication, the AWS infrastructure my organisation is managing consists of:

[Image: a snapshot of SEEK’s AWS infrastructure footprint at the time of publication]

This highly complex system is built, maintained and supported by approximately 300 people. Agile and agile-derived development practices are used everywhere to build our products and services. We place a high value on collaboration and on being able to deploy new versions of software on demand at any time of day.

And those are just the headlines; there is a lot more under the surface, including hundreds of code repositories and deployment rates that can exceed 1,000 production changes in a single calendar month. In fact, how it all works together has become so complex that no one person could know it all. Even if they did, the rate of change is so fast that their knowledge could be out of date within days, if not hours.

Key takeaway: the way our systems work does not come close to resembling an assembly line, and our processes have little, if anything, to do with linear-thinking practices.

By comparison, in 1999, when I was one of 150 software engineers building a large ERP system, I was strongly influenced by linear thinking in almost every facet of my work. Delivery and requirements were heavily controlled by the Rational Unified Process. UML models were everywhere. Architectural decisions took months. And over time we were even directed to conform to CMMI Level 5, which added even more processes and documentation to our work, pushing out delivery schedules even further and slowing our output to a crawl.

The infrastructure was nothing like what we deal with now, either: at its peak it comprised only 100 servers and 20 databases. Yet it was terribly fragile; it failed often, and simple changes took months. Nightly builds of the software were standard, as it literally took hours to build everything.

Looking back, it seems inconceivable to build any software system the way I was doing it at the turn of the millennium, following decades-old work management practices. So why don’t we see the same flaws in managing incidents with these linear-thinking-influenced approaches too?

Where it commonly goes wrong

In my experience, the following two situations constrain us to linear-thinking incident management mindsets and, in turn, negatively influence an organisation’s ability to learn from incidents and see them as a valuable contribution to building more resilient software.

Lack of empathy for software engineering complexity

This occurs when the individuals tasked with building and running incident “command and control” centres, including overseeing incident review processes, have little or no recent experience in software engineering roles. A low level of empathy for the day-to-day challenges of building modern, highly complex software systems can emerge. The outcome can be the establishment of exhaustive incident management and review processes aimed at “closing the management knowledge gap”, which work against the operating rhythms of high-velocity software teams. Worse, they reinforce existing software delivery anti-patterns.

Only using traditional incident management processes such as ITIL

In organisations that either align with ITIL, or have people whose careers have been heavily shaped by it, the management of incidents can come to resemble traditional infrastructure incident management. This can be a danger for a few reasons:

  • Clash of cultures
    If your software engineering teams have been, or are being, driven by philosophies such as Fail Fast, Extreme Programming and DevOps, the incident management process can descend into a theatrical reporting exercise that won’t be respected, because neither side will value or understand the other’s views and there will be no safe environment in which to “put yourself in the operator’s shoes”.
  • Reinforcement of linear thinking
    An organisation already struggling to maintain our Taylorist equation will be further constrained by the sheer weight of process and documentation from the increased checks and balances ITIL creates.
  • Every incident is seen as “break fix”
    With physical infrastructure, most problems are resolved by swapping a broken component out for a working one. This narrow, “isolate the broken component” mindset leads to an over-emphasis on localised band-aid solutions and less focus on systemic issues requiring broader action and discussion. Simply put, we become more concerned with ensuring that ‘as few things as possible go wrong’ rather than that ‘as many things as possible go right’, also known as Safety-I vs Safety-II thinking.

I have seen the situations described above occur in almost every organisation I have worked in.

Changing the “that’s not how we do it here” approach

Don’t just focus on and react to single weather events; look at all the contributing factors, as they occur over time, to understand what the climate is telling us — Source

Complex, ever-changing systems such as our planet’s climate do not follow a linear sequence of events. Humans have survived not because our species is the strongest or the fittest, but because of our capacity to adapt and to find new ways to learn from unexpected events in complex systems. And as the systems we create continue to grow in complexity, our ability to maintain resilience must adapt and grow along with them.

But changing mindsets is hard. Moreover, it’s exhausting, because it isn’t intuitive to minds conditioned by century-old management dogma. Yet it will become a critical skill for coping with systems of ever-growing complexity, and for enabling Resilience Engineering to become the new normal of “that’s just how we do it here”.

Ultimately we must create cultures in software engineering organisations that are empowered to stop, debate and take action on the question: “why are we doing what we’re doing?”.

Lastly, back to the Taylorist equation of keeping work-as-imagined equal to work-as-done:

…the difference between them should not be looked at simply as a problem that ought to be eliminated if at all possible. The difference should instead be seen as a source of information about how work is actually done and as an opportunity to improve work — Source

In the next post we’ll propose a new way for coping with incidents and how a change in mindset and approach yielded better learnings and understanding of the complex system we deal with every day.
