operational-sympathy - Medium

What Artemis II Recovery Taught Me About Changes in Production

Suranga Nisiwasala — Fri, 17 Apr 2026 04:41:56 GMT

The Pacific Ocean is a great place to run your SRE playbook.

Last Saturday I watched NASA’s Artemis II splashdown live — four astronauts returning from a 10-day mission around the Moon, splashing down in the Pacific Ocean off the coast of San Diego.

I wasn’t watching it as an SRE. I was just watching it like everyone else.

But from the moment the parachutes deployed, something kept pulling at me. A rhythm. A sequence. Every step waited for the previous one to complete. Nobody rushed. Nobody skipped. And yet none of it felt slow — it felt precise.

Then came the moment that made everything click. Four helicopters landed on the flight deck. Rotors still spinning. One rotor stopped. Then another. Then another. Then the last. Only after every rotor had fully stopped did the astronauts begin to step off.

And I thought: that’s exactly how we do production changes.

I went back and watched the key moments again, looking for more of what I’d noticed. I saw it everywhere — in exactly the same order we run a production maintenance window.

NASA wasn’t just “landing astronauts” — they were executing a zero-failure, high-risk production release. Here’s what I took from it.

1. Shipping With a Known Issue Is Sometimes the Right Call

Before re-entry, everyone knew: Orion’s heat shield had documented design flaws from the uncrewed Artemis I mission. NASA didn’t ground the mission. Instead, they modified the re-entry trajectory to reduce exposure time to extreme temperatures — cutting the highest-heat phase from 20 minutes to 13.5 minutes.

This is not recklessness. This is calculated risk mitigation.

In production, we face this constantly. A release has a known non-critical bug. A dependency has a memory leak under heavy load. The system isn’t perfect, but the window is open, the rollback is tested, and the blast radius is understood.

You don’t always wait for perfection. You reduce the exposure surface, document the known risk, and proceed with eyes wide open. What you never do is pretend the flaw doesn’t exist.

2. Prepare Around the Risk — Everything in Place Before the Critical Phase

The crew hadn’t been waiting before re-entry. They’d been preparing. Procedures reviewed line by line. Equipment stowed. Suits checked for leaks. Weather briefed. Recovery force confirmed.

Meanwhile, the recovery ship, the USS John P. Murtha, had left port days before splashdown. The recovery team was already on site and waiting. Nobody scrambled when Orion hit the water. The response capability was pre-staged.

In SRE, this is your maintenance window brief. That means:

Runbook reviewed and validated before the change window opens
Rollback steps tested, not just written
On-call engineer confirmed available — not just “reachable”
Monitoring dashboards open and baselined before the first command runs

The change doesn’t start when you run the first command. It starts when you begin preparing for what happens if something goes wrong.
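One way to make this brief enforceable is to treat it as a gate: the window stays closed until every item is confirmed. The following is a minimal sketch in Python, assuming the check results are gathered by hand or by tooling that is not shown here. The `PreflightCheck` type and the sample values are illustrative, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreflightCheck:
    name: str
    passed: bool


def window_may_open(checks: list[PreflightCheck]) -> tuple[bool, list[str]]:
    """Return (ok, failed_names). The window opens only if every check passed."""
    failed = [c.name for c in checks if not c.passed]
    return (not failed, failed)


# Hypothetical state for one maintenance window.
checks = [
    PreflightCheck("runbook reviewed and validated", True),
    PreflightCheck("rollback rehearsed end to end", True),
    PreflightCheck("on-call engineer confirmed available", False),
    PreflightCheck("dashboards open and baselined", True),
]

ok, failed = window_may_open(checks)
print("open window" if ok else f"hold: {failed}")
```

In this sample the window stays closed because the on-call confirmation is missing, and the gate names the item that blocked it.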

3. Each Stage Creates the Conditions for the Next

Orion wasn’t slowed by one big parachute. Small drogue chutes deployed first to stabilise the capsule and bleed off speed. Only after they had done their job were the three main parachutes released, slowing the capsule to 20 mph at splashdown.

Why not open the big ones from the start? Because at that speed, they would have torn apart. The drogues don’t complete the job — they create the conditions under which the next step is survivable.

In production, you don’t take down a chunk of your fleet at once. You drain one node — removing it from the load balancer, letting in-flight requests complete, bringing it to a quiet state. Apply the change, confirm it’s healthy, move to the next.

Each stage isn’t a precaution layered on top. It’s what makes the next stage possible at all.
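The drain, change, verify, next-node loop can be sketched as follows. This is an illustration under stated assumptions, not a deployment tool: the four callables (`drain`, `apply_change`, `is_healthy`, `restore`) are hypothetical hooks you would wire to your own load balancer and health checks.

```python
from typing import Callable


def rolling_change(
    nodes: list[str],
    drain: Callable[[str], None],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    restore: Callable[[str], None],
) -> list[str]:
    """Change nodes one at a time and stop at the first failed health check.

    Returns the nodes that were changed and put back in service.
    """
    done: list[str] = []
    for node in nodes:
        drain(node)  # out of the load balancer, in-flight requests finish
        apply_change(node)
        if not is_healthy(node):
            # Leave this node drained and halt; the rest of the fleet is untouched.
            raise RuntimeError(f"{node} unhealthy after change; completed: {done}")
        restore(node)  # back into rotation before touching the next node
        done.append(node)
    return done


# Usage with stub hooks that only record what happened.
log: list[str] = []
changed = rolling_change(
    ["node-a", "node-b"],
    drain=lambda n: log.append(f"drain {n}"),
    apply_change=lambda n: log.append(f"apply {n}"),
    is_healthy=lambda n: True,
    restore=lambda n: log.append(f"restore {n}"),
)
```

The design choice that matters is the early stop: a failed health check leaves one node out of rotation and every remaining node on the old version.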

4. Check for Fumes First — In Production, That’s Your Baseline

Before anyone approached the Orion capsule, recovery teams swept with air quality sensors — checking specifically for hydrazine propellant and ammonia coolant from the reaction control system. These were known hazards from the capsule’s design. They weren’t doing a routine scan hoping everything was fine — they were verifying a specific, documented risk before anyone got close.

If they had skipped that check and a diver got sick, they’d have no way to know whether the fumes were already leaking or whether approaching the capsule disturbed something. The check isn’t just about protecting the divers — it’s about knowing whether the capsule was already leaking before anyone touched it. That’s attribution.

In production, this is why you run your test suite and review your observability dashboards before the change window opens. Not to look for problems — but to know what normal looks like. If something goes wrong after the change, you need to answer one question with confidence: was this caused by the change, or was it already there?

You can only answer that question if you captured a clean baseline first. A change made without a baseline is a change you can’t fully explain afterward.
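A baseline only helps with attribution if you actually compare against it. Here is a small sketch, assuming metrics where higher is worse (error rate, latency) and a hypothetical 10% tolerance; the metric names and values are made up for illustration.

```python
def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.10,
) -> dict[str, tuple[float, float]]:
    """Return metrics that got worse than baseline by more than `tolerance`.

    Assumes higher values are worse. Metrics missing from `current` are skipped.
    """
    worse = {}
    for name, before in baseline.items():
        after = current.get(name)
        if after is not None and after > before * (1 + tolerance):
            worse[name] = (before, after)
    return worse


# Captured before the window opened (hypothetical numbers).
baseline = {"error_rate": 0.02, "p99_latency_ms": 400.0}
# Captured after the change was applied.
after_change = {"error_rate": 0.021, "p99_latency_ms": 650.0}

suspects = regressions(baseline, after_change)
```

In this sample the error rate was already at 2% before the change, so it is not attributed to the change. Only the latency shift is flagged.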

5. The Medics Went In First. The Astronauts Came Out Second.

When the hatch opened, medical officers entered the capsule first to assess the astronauts. The crew only began exiting after the medical team had evaluated them and cleared each person.

They didn’t pop the hatch and wave everyone out. They sent in observers with expertise, got a status report, and then began the extraction.

In production, your observability team (or your synthetic testing suite) is that medical team. The change has been applied. But the maintenance window isn’t over yet. Before you cut traffic over to the new version, before you close the maintenance window, before you hand back to users:

Run your smoke tests
Check your synthetic transactions
Get a human eye on the key dashboards

You open the hatch only after someone with expertise has looked inside and confirmed it’s safe.

6. All Helicopters Land. Rotors Stop One by One. Then Offboarding Begins.

This was the detail that stopped me completely.

After each hoist, the helicopter didn’t immediately fly to the ship. It moved to a holding position and waited. Only once all four astronauts were collected did both helicopters fly to the ship together. All four landed on the flight deck — rotors still running. Then, one by one, each rotor stopped. Only after every helicopter had landed and every rotor had fully stopped did the crew begin to step off.

They didn’t start extracting astronaut #1 the moment the first chopper touched down. They waited for the full system to reach a stable state first.

This is distributed systems consensus applied to physical recovery operations.

In a rolling deployment, this is the mistake we see constantly: the first replica is healthy, so the team relaxes. Traffic starts shifting. Someone declares it a success while three other replicas are still deploying.

Partial commitment is the most dangerous state in any system. Your rollout isn’t healthy when the first pod is healthy. It’s healthy when all pods are healthy, traffic has fully shifted, and your error rate has held steady for a meaningful observation window.

Wait for all rotors to stop before anyone steps off.
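That definition of "healthy" can be written down as a single predicate. The following is a sketch, assuming you can sample replica health, the fraction of traffic on the new version, and a recent series of error rates. The function name, parameters, and thresholds are illustrative.

```python
def rollout_is_done(
    replicas_healthy: list[bool],
    traffic_on_new: float,
    error_rates: list[float],
    max_error_rate: float,
    window: int,
) -> bool:
    """True only when the whole system has reached a stable state."""
    if not replicas_healthy or not all(replicas_healthy):
        return False  # at least one rotor is still spinning
    if traffic_on_new < 1.0:
        return False  # traffic has not fully shifted
    if len(error_rates) < window:
        return False  # observation window is not complete yet
    return all(rate <= max_error_rate for rate in error_rates[-window:])


# First replica is up, three are still deploying: not done.
early = rollout_is_done([True, False, False, False], 0.25, [0.001] * 5, 0.01, 5)
# Everything healthy, traffic shifted, five clean samples: done.
final = rollout_is_done([True] * 4, 1.0, [0.001] * 5, 0.01, 5)
```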

7. Even Experienced Astronauts Follow the Checklist

Christina Koch has spent nearly a year in space. Victor Glover is a decorated Navy test pilot. Reid Wiseman served as Chief of the NASA Astronaut Office. Jeremy Hansen is a Royal Canadian Air Force fighter pilot. These are not people who need to be told how a recovery operation works.

And yet — they sat in the capsule and waited. They moved when told. Were hoisted one by one. Waited for every rotor to stop before stepping off. Not one said “I’ve done this before, I’ll go first.”

This is the defining characteristic of elite operators: experience doesn’t make you skip the process. It’s what makes you understand why the process exists.

In SRE, your most senior engineers should be your most rigorous about the runbook. Not because they need to be told what to do — but because they’ve seen what happens when steps are skipped. The junior engineer skips the drain check because they don’t know what can go wrong. The senior engineer does it because they do.

Experience doesn’t make you skip the process. It’s what makes you understand why the process exists.

8. The Learning Phase Started Before the Crew Left the Ship

About 90 minutes after splashdown, the crew were aboard the USS John P. Murtha undergoing medical checks. NASA held a post-mission press conference the same evening. Engineers had already begun inspecting the heat shield. Onboard data retrieval was being planned. The mission wasn’t over — the learning phase had started immediately.

In production: run your retrospective while context is accurate — what wasn’t in the runbook, what assumption was wrong, what held up under pressure. Don’t wait until next week when the adrenaline has faded and the details have blurred.

Every change window is a test flight that produces data. Capture it while it’s fresh.

What I Actually Watched

Stability over speed — that’s the message NASA just showed us. And it’s exactly how production changes should be run.

What I was feeling watching that recovery — before I could articulate it — was the comfort of watching a system that had genuinely internalised operational discipline. Not a checklist followed because someone said to. A culture where the sequence, the waiting, the one-at-a-time extraction were simply how things are done — no explanation needed, no enforcement required.

The principles map directly to how we run production changes:

Ship with known risks only when the mitigation is understood and the exposure is reduced
Pre-stage your recovery capability — nobody should scramble when it matters
Each stage creates the conditions for the next — never skip ahead
Check for fumes first — capture your baseline before the change begins
Wait for full system stability before declaring the rollout done
Verify system health before exposing users
Your most experienced engineers should be your most process-disciplined
Capture the learnings immediately — every change window is a test flight

The Artemis II crew is home. All four walked off that ship unaided, waving at the cameras.

That’s what a well-executed production change looks like.

SRE Technical Lead and CNCF Kubestronaut at WSO2. I pay close attention to process, stability over speed, and keeping production systems simple enough to actually operate — production lessons show up in the most unexpected places, apparently including a NASA recovery operation on a quiet Saturday morning.

Did you catch different SRE lessons watching the splashdown? I’d love to hear them in the comments.

What Artemis II Recovery Taught Me About Changes in Production was originally published in operational-sympathy on Medium, where people are continuing the conversation by highlighting and responding to this story.

Beyond Blameless Postmortems: How We Turn Production Failures into Design Improvements

Dilshan Fardil — Mon, 02 Mar 2026 05:23:15 GMT

What our internal RCA process revealed about building operationally sympathetic systems

After our tenth production incident in Q3, we sat down with the postmortem reports spread across the table. Ten different incidents. Ten different immediate causes. But as we looked closer, something uncomfortable became clear: seven of them shared the same root cause category. We weren’t learning. We were just fighting the same fire in different rooms.

That’s when we realized our RCA process wasn’t actually working. We were documenting failures, sure. Writing action items, absolutely. But we weren’t changing how we designed systems. We weren’t building operational sympathy into our architecture.

This is the story of how we transformed our RCA process from a paperwork exercise into the foundation of operationally sympathetic design at WSO2.

What Is RCA (Root Cause Analysis)?

Before we dive into what we learned, let’s establish what an RCA actually is and what it should do.

An RCA is a structured process you run whenever you have a customer-impacting event. Not just a minor glitch. A serious incident where something went wrong and customers felt it. Downtime. Data loss. Performance degradation. Service unavailability.

The goal isn’t just to identify what broke. It’s to understand why it broke and, most importantly, how to ensure it never happens again.

If your RCA doesn’t change how you build systems, it’s just documentation theater.

The Problem: RCAs That Don’t Change Behavior

Here’s the uncomfortable truth about our early RCAs: they were theater. Professional, thorough documentation of our failures that changed absolutely nothing.

The pattern was predictable: Incident happens. War room. Fix it. Write it up. File it away. Months later? Different service, different team, same root cause. Nobody connected the dots.

Our action items were Band-Aids: “Restart the service,” “Increase memory,” “Add a retry.” We treated symptoms, not causes.

Then came the moment that changed everything.

Frustrated after yet another “new” incident that felt oddly familiar, we decided to categorize our RCAs differently. Instead of reading them chronologically, we grouped them by pattern. What we saw was eye-opening:

Most were observability gaps. We couldn’t see what was failing. Flying blind every time.

Many were resource exhaustion. Thread pools maxed. Connections drained. Memory leaking. We kept running out because we weren’t tracking.

Several were timeout chaos. External services timing out. Our services retrying forever. Cascading failures.

Some were deployment disasters. Bad code to production. No gradual rollout. No rollback. When deploys went wrong, they went spectacularly wrong.

We weren’t having unique incidents. We were having the same categories of disasters on repeat, just wearing different costumes. The same patterns kept emerging, over and over.
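Grouping by pattern instead of by date needs very little tooling. Here is a minimal sketch, assuming each RCA has been tagged with a root-cause category. The incident IDs and the tagging below are invented for illustration and are not our real data.

```python
from collections import Counter

# Hypothetical incident log: (incident id, root-cause category).
incidents = [
    ("INC-101", "observability gap"),
    ("INC-102", "resource exhaustion"),
    ("INC-103", "observability gap"),
    ("INC-104", "timeout handling"),
    ("INC-105", "observability gap"),
    ("INC-106", "deployment"),
    ("INC-107", "resource exhaustion"),
]

by_category = Counter(category for _, category in incidents)

# Most frequent categories first: these are the design problems to fix.
for category, count in by_category.most_common():
    print(f"{category}: {count}")
```

Read chronologically, these look like seven different incidents. Counted by category, they are four problems, one of which accounts for nearly half the list.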

Something had to change.

The Seven Steps of Effective RCA

We rebuilt our RCA process around seven steps. But more importantly, we changed how we approached each step with operational sympathy in mind.

Step 1: Identify the Problem (Not Just the Symptom)

The symptom is what users experienced. The problem is why the system behaved that way. Don’t stop at “the service was slow” or “the database went offline.” Dig deeper: What cascade of events led to this failure? What design assumption broke down?

Get to the actual problem. Don’t stop at the symptom.

Step 2: Collect Data (Your Decisions Must Be Evidence-Based)

This is where observability debt hits hardest. If you didn’t instrument the right things, you can’t collect the right data. You’re forced to guess.

What we learned: The data you wish you had during the RCA is exactly what you should have instrumented before the incident. Every gap in our data collection became an action item: “Add this metric to prevent future blind spots.”

We started tracking data gaps discovered during RCAs. When we found ourselves saying “we wish we had logged this,” that became immediate feedback for our observability strategy.

Step 3: Ask Why (Make Causal Connections)

Asking “why” once isn’t enough. You need to ask it multiple times. This is the “Five Whys” technique, and it’s transformative.

A Simplified Example:

Why did the service fail? → Resource exhausted
Why did the resource exhaust? → External dependency became slow
Why did we not handle slow dependencies? → Missing defensive patterns
Why were defensive patterns missing? → Not in our design standards
Why not in standards? → We never documented failure-mode thinking

Now we’ve reached the real root cause: a gap in how we approach system design. The fix isn’t just addressing this one incident; it’s updating how we design all integrations going forward.

This is operational sympathy in action: understanding that incidents are almost never single failures. They’re cascades. Find the cascade triggers.
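To make "defensive patterns" concrete, here is one common example: a bounded retry with exponential backoff, so a slow dependency cannot pull callers into retrying forever. This is a generic sketch, not the pattern from our design standards. It assumes the wrapped call enforces its own timeout and raises `TimeoutError`. The function name and defaults are illustrative.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `call` at most `attempts` times, backing off between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # retry budget spent: fail fast instead of piling on
            sleep(base_delay_s * (2 ** attempt))
    raise AssertionError("unreachable")


# Usage: a stub dependency that times out twice, then answers.
state = {"calls": 0}


def flaky_dependency() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("dependency slow")
    return "ok"


delays: list[float] = []
result = call_with_retries(flaky_dependency, sleep=delays.append)
```

The `sleep` parameter is injected only so the example can run without waiting. Production code would typically add jitter and a circuit breaker on top of this.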

Step 4: Identify Corrections (What Will You Fix?)

Here's where we changed our approach dramatically. We now identify corrections at three levels:

Level 1 - Immediate Fix: What stops the bleeding right now?
Level 2 - Prevent This Specific Incident: What prevents this exact scenario?
Level 3 - Prevent This Category of Incidents: What prevents all incidents in this class?

Level 3 is where operational sympathy lives. It's where you change your design patterns, not just fix bugs.

Step 5: Find the Gaps (Monitoring & Logging Defects)

Every RCA should answer: What didn't we see that we should have?

When we discover these gaps, each becomes an action item. Each becomes an addition to our monitoring standards.

Critical insight: Monitoring and logging go hand in hand. You can't monitor what you don't log. You can't debug what you don't monitor. If you found gaps in the RCA, fix them before the next deployment.

Step 6: Implement the Solution (Not Just the Quick Fix)

This is where many RCAs die. Great analysis. Clear action items. Nothing changes. We need to make implementation mandatory and trackable.

Every action item has an owner and a deadline
Level 1 fixes: implemented immediately
Level 2 fixes: in the next sprint
Level 3 fixes: architectural changes tracked as initiatives
We track completion rates and review them regularly

If you don't implement the changes, why did you do the RCA?
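"Mandatory and trackable" can start as a very small data model. The following is a sketch, assuming action items live in some tracker you can export. The `ActionItem` fields, the owners, and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    level: int  # 1 = immediate fix, 2 = this incident, 3 = this category
    owner: str
    due: date
    done: bool = False


def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of action items that are done (0.0 for an empty list)."""
    return sum(item.done for item in items) / len(items) if items else 0.0


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose deadline has passed."""
    return [item for item in items if not item.done and item.due < today]


# Hypothetical items from one RCA.
items = [
    ActionItem("Restart stuck workers", 1, "alice", date(2026, 3, 2), done=True),
    ActionItem("Add connection-pool saturation alert", 2, "bob", date(2026, 3, 16)),
    ActionItem("Add timeout budgets to the design standard", 3, "carol", date(2026, 6, 30)),
]

late = overdue(items, today=date(2026, 3, 20))
rate = completion_rate(items)
```

Reviewing `late` and `rate` in a regular meeting is the whole mechanism. The data model matters less than the habit of looking at it.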

Step 7: Communication (The Hardest Step)

This step is emphasised in our process because it's where we fail most often. Not technically, but emotionally.

It's hard to admit mistakes. It's hard to tell customers "we failed you, and here's why." But this communication is crucial for two reasons:

It rebuilds customer trust. Transparency about what went wrong and how you're preventing it shows you take incidents seriously.
It creates internal accountability. When you have to explain the incident externally, teams take the fixes more seriously.

Our communication template:

What happened (timeline)
What the impact was (be specific)
What the root cause was (not just the symptom)
What we're doing to prevent it (all three levels)
How we're verifying it won't happen again

Don't write two sentences. Write the real story. Restore trust through honesty and detail.

Use Our RCA Template

Azeez has released a comprehensive RCA template that we now use at WSO2. It’s structured, thorough, and forces you to think about operational sympathy at every step.

Get the template here: https://medium.com/operational-sympathy/root-cause-analysis-report-template-released-de1add6345f8

Azeez also talks about the process and the thinking behind preparing an RCA here: https://medium.com/operational-sympathy/the-curious-case-of-the-leaking-land-rover-38e28758e6f5

Don’t reinvent the wheel. Use a proven template. Focus your energy on learning from the incident, not formatting the report.

IMPORTANT: What Rigorous RCA Practice Taught Us

After committing to rigorous RCA practice, here’s what changed:

Measurable Improvements:

Mean Time To Resolution dropped dramatically
Repeat incidents (same root cause) became rare
RCA action item completion rate improved significantly
More incidents caught in staging before reaching production

Behavioural Changes:

Teams now ask “What would the RCA say?” during design
Code reviews explicitly check operational sympathy dimensions
New services are born with monitoring, not bolted on later
On-call engineers feel more confident — they have the tools to diagnose issues

Cultural Shifts:

Blameless culture is real: people admit mistakes without fear
RCAs are learning opportunities, not punishment
We celebrate good RCAs that drive meaningful change

The shift from “fixing incidents” to “preventing categories of incidents” was the turning point.

Start With One RCA

You don’t need to overhaul your entire incident response process tomorrow. Start small:

Pick your last incident. Even a minor one. Run the seven-step RCA on it.
Ask the Five Whys. Don’t stop at the surface. Get to the real root cause.
Identify all three levels of corrections. Immediate, preventive, and categorical.
Find the observability gaps. What didn’t you see that you should have? Add those metrics.
Actually implement the fixes. Track them. Make someone accountable.

Do this once well. Then make it your standard. Before you know it, you’re not fighting the same fires. You’re preventing them.

The Best Incident Is the One That Teaches You to Prevent Ten More

RCAs aren’t paperwork. They’re your feedback loop. They’re how production teaches you to build better systems.

Every incident is painful. Every customer impact hurts. But if you learn from it, really learn, systematically, through rigorous RCA, then that pain has meaning. It makes you better. It makes your systems more operationally sympathetic.

The goal isn’t zero incidents. That’s impossible. The goal is:

Detect faster (better observability)
Resolve faster (better operability)
Never repeat (better design)

RCA gets you there. But only if you treat it as a design tool, not a documentation exercise.

So the next time something breaks, don’t just fix it. Learn from it. Systematically. Rigorously. And use that learning to build systems that fail less, recover faster, and operate with sympathy for the people who have to keep them running.

The first step of solving a problem is recognising there is one.

The next step is making sure it never happens again.

Lessons from production at WSO2, where we’ve learned that great RCAs prevent future incidents.

Read more about operational sympathy: https://medium.com/@afkham-azeez/operational-sympathy-8a9c5dc26b5a

Get the RCA template: https://medium.com/operational-sympathy/root-cause-analysis-report-template-released-de1add6345f8

Use the operational sympathy scorecard: https://docs.google.com/spreadsheets/d/1jryXy-aNQDoDgjMC8T2D5grdgP5bxQr-DwKJkB2hNfE/edit

Cloud that Grows, NOT bills that Explode

Wickram Bagawathinathan — Mon, 02 Mar 2026 05:21:05 GMT

When we talk about building systems in the cloud, our eyes light up at scalability, resilience, performance, and all those shiny new services. But cost? That’s the quiet guest nobody notices… until the first “surprise” bill drops like a plot twist in a thriller.

Treating cloud cost as an afterthought is like realizing halfway through your dream house that you have no idea how much cement costs!!! Yikes… 😜

What if, instead, cost had its own seat at the architecture table from day one? Suddenly, scaling up doesn’t feel like gambling with your wallet: it’s strategy, not chaos.

Cost awareness by design

In a cloud system, small inefficiencies don’t stay small for long. Hence, architectural and implementation decisions should be made with a clear understanding of how resource usage burns money. Every database read, every background job, every autoscaling rule: they all have a price tag attached. Cloud bills don’t explode randomly. They usually explode because of assumptions.

“It’s much cheaper to redesign a diagram than to refactor a live system.”

Architecture shapes cloud bill

Two systems can solve the same problem and have wildly different cost profiles.

A microservices architecture with multiple active services might look elegant, but idle services still cost money.
An autoscaling group without proper limits can scale beautifully… and so can the bill.
A logging system that retains everything forever may seem safe… until storage costs creep up month after month.

Architectural decisions determine the answers to questions like:

How does compute scale?
How does storage grow?
How does data flow between services, and how does that traffic grow?
How often are expensive operations triggered?
How many DB calls are triggered to get a simple API subscription list?

Without guardrails, scaling becomes a liability instead of a strength. A traffic spike shouldn’t feel like a financial emergency.

Good cost-aware design asks:

What happens at 10x traffic?
What happens if a queue backs up?
What happens if someone misconfigures a default?
What happens if the system generates millions of log lines in a single day?

Defaults are dangerous

Many cloud services are powerful, but their defaults are designed for flexibility, not frugality. For example:

Autoscaling without upper limits.
Provisioned databases sized “just to be safe”.
High-performance storage tiers used by default.
Excessive interactive log retention.

None of these are wrong… But without intentional choices, they quietly accumulate cost. So what could be intentional?

Setting sensible scaling boundaries.
Choosing appropriate instance sizes.
Implementing lifecycle rules for storage.
Monitoring usage from day one.

The goal isn’t to minimize cost at all costs: it’s to align spending with value.
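A scaling boundary is easy to reason about in code. The sketch below uses a simplified proportional rule (replicas needed for the observed load) and clamps it between a floor and a ceiling. It is not any cloud provider's API. The function, its parameters, and the numbers are assumptions for illustration.

```python
import math


def desired_replicas(
    observed_rps: float,
    target_rps_per_replica: float,
    min_replicas: int,
    max_replicas: int,
) -> tuple[int, bool]:
    """Return (replica count, cap_hit).

    The count is clamped to [min_replicas, max_replicas]. `cap_hit` is True
    when demand wanted more than the ceiling allows, which is worth an alert.
    """
    wanted = math.ceil(observed_rps / target_rps_per_replica)
    clamped = max(min_replicas, min(max_replicas, wanted))
    return clamped, wanted > max_replicas


normal = desired_replicas(450, 100, min_replicas=2, max_replicas=10)
spike = desired_replicas(3000, 100, min_replicas=2, max_replicas=10)
quiet = desired_replicas(50, 100, min_replicas=2, max_replicas=10)
```

Without the ceiling, the spike would ask for 30 replicas and the bill would follow. With it, you get 10 replicas plus a signal that a human should decide whether the extra spend is worth it.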

Native cloud services vs cloud-agnostic solutions

There’s an ongoing debate in architecture discussions: should we use native cloud services or build cloud-agnostic systems?

Cloud-native services (managed databases, serverless compute, managed messaging systems) often:

Reduce operational overhead.
Improve reliability.
Scale automatically.
Potentially optimize costs through usage-based pricing.

But they can increase vendor lock-in.

Cloud-agnostic solutions (like self-managed containers, portable databases, or abstraction layers) offer flexibility, but often at the cost of:

Higher operational complexity.
Always-on (active) infrastructure.
Hidden management costs.

The real question isn’t ideology… It’s economics and context.

Sometimes native services are cheaper in the long run because they eliminate operational overhead. Other times, a portable solution might prevent expensive migrations later. Cost awareness means evaluating all the possible solutions, not just technically, but economically too.

A cost approximation

One of the most underrated engineering exercises is cost modeling. Before finalizing architecture, ask:

What will this cost at the expected traffic?
What will it cost at 10x growth?
What’s the worst-case scenario?
Which components dominate the bill?

A cost approximation exercise:

Forces clarity about traffic assumptions.
Highlights expensive data paths.
Encourages right-sizing decisions early.
Identifies risks before production.

Even a simple spreadsheet can uncover insights you didn’t see coming. I’ve put together a “simple” cost-engineering template you can use as a starting point or reference: Mini_Cloud_Cost_Engineering_Framework
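The same exercise fits in a few lines of code. This is a deliberately crude sketch with made-up prices: one fixed monthly component plus one usage-based component (database reads). Every number and name below is an assumption, not a real price list.

```python
def monthly_cost(
    requests_per_month: int,
    db_reads_per_request: int,
    price_per_million_reads: float,
    fixed_monthly: float,
) -> dict[str, float]:
    """Break a monthly bill into its fixed and usage-based parts."""
    reads = requests_per_month * db_reads_per_request
    usage = reads / 1_000_000 * price_per_million_reads
    return {"fixed": fixed_monthly, "usage": usage, "total": fixed_monthly + usage}


# Hypothetical inputs: 10M requests/month, 5 DB reads each,
# 0.25 per million reads, 200 of always-on infrastructure.
expected = monthly_cost(10_000_000, 5, 0.25, 200.0)
at_10x = monthly_cost(100_000_000, 5, 0.25, 200.0)

for label, bill in (("expected", expected), ("10x", at_10x)):
    print(label, bill)
```

Even this toy model answers two of the questions above. At expected traffic the fixed infrastructure dominates the bill, and at 10x the usage part grows tenfold while the fixed part does not move. Halving `db_reads_per_request` is then a design decision with a visible price.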

Observability for cost, not just performance

We instrument systems for latency and error rates, but do we instrument for cost drivers? Because when costs become observable, they become manageable.

Consider tracking things like:

DB requests per expensive API call.
Storage growth trends.
Compute hours by service.
Cost per tenant or feature.
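As a sketch of what "cost per tenant" can look like once usage is tagged, the following aggregates compute hours by tenant and by service. The usage records, the flat hourly rate, and the helper names are all hypothetical. Real billing data is messier.

```python
from collections import defaultdict

HOURLY_RATE = 0.5  # assumed flat price per compute hour

# Hypothetical tagged usage records: (tenant, service, compute_hours).
usage = [
    ("tenant-a", "api-gateway", 100),
    ("tenant-b", "api-gateway", 40),
    ("tenant-a", "analytics", 60),
]


def hours_by(records: list[tuple[str, str, int]], key_index: int) -> dict[str, int]:
    """Sum compute hours grouped by tenant (index 0) or service (index 1)."""
    totals: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)


def cost_by(
    records: list[tuple[str, str, int]], key_index: int, hourly_rate: float
) -> dict[str, float]:
    """Convert grouped hours into cost at a flat hourly rate."""
    return {key: hours * hourly_rate for key, hours in hours_by(records, key_index).items()}


cost_per_tenant = cost_by(usage, 0, HOURLY_RATE)
cost_per_service = cost_by(usage, 1, HOURLY_RATE)
```

The prerequisite is the tagging, not the arithmetic: if usage cannot be attributed to a tenant or service at the source, no report can recover it later.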

Economic discipline

At its core, cost awareness is about responsibility. Cloud makes it incredibly easy to provision resources. It also makes it incredibly easy to waste them.

Cost awareness doesn’t slow innovation; it makes it sustainable. Because the goal isn’t just to build something that works, it’s to build something that works and keeps working without turning into a financial surprise.

Root Cause Analysis — Report template released

Afkham Azeez — Thu, 19 Feb 2026 06:46:32 GMT

I have been talking and writing about RCA for a while now, trying to get everyone in general, and my teams in particular, to carry out proper, high-quality root cause analysis. One of the questions people have asked me is how the analysis and findings should be documented. Based on the work we have been doing in the past and the gaps I’ve identified, I’ve created a 1.0 version of an RCA template, which I’m releasing under the Apache License, Version 2.0.

https://medium.com/media/d91e1521555bbeaad8b3156aae420375/href

You are free to download, make copies and modify this as you see fit. You could also comment on the Google doc itself if you have suggestions.

The template is self-explanatory. Feedback welcome.

The Curious Case of the Leaking Land Rover

Afkham Azeez — Thu, 19 Feb 2026 06:45:31 GMT

The Art of Root Cause Analysis: Solving Problems at Their Source

We recently concluded our inaugural Customer Success kickoff session for the year 2025, themed “Pioneering Excellence”, where I conducted a session with the same title as this article. In my role heading the SRE team at WSO2, I’ve created guidelines for my team on conducting Root Cause Analysis sessions and have participated in many RCA review sessions. In our line of work, outages and incidents are part and parcel of life. However, we strive to learn from every such occurrence and ensure that we take every action possible to prevent recurrence. After all, we are in the business of making customers successful, and repeat incidents can only have negative consequences. Hence, every incident related to the deployments my team manages, which include WSO2 Choreo, WSO2 Asgardeo, and the managed and private clouds that we operate on behalf of our customers, requires an RCA once the incident is resolved. The inspiration behind this session came from my learnings and observations as a participant in such sessions. I believe that documenting it here will be helpful for others as well.

Tl;dr

Organizations aim for smooth operations and reliability, but incidents still happen, disrupting services and affecting customer trust; Root Cause Analysis (RCA) is a powerful tool that enables teams to identify the underlying causes of these issues and implement effective, long-lasting solutions. This blog explores the fundamentals of RCA, the principles of blameless analysis, and methodologies such as 5-Whys and Fishbone analysis, demonstrating how to identify actionable solutions that address the roots of the problem.

It’s a feature, not a bug!

A car owner regularly finds oil patches in his driveway and low engine oil levels. His mechanic repeatedly tops up the oil and replaces the oil filter gasket. The mechanic also cleans the engine to remove visible oil residue, giving the impression that the issue is resolved, but the issue persists and the owner is frustrated. This or something similar has happened to many of us, hasn’t it? Finally he decides to take his car to a different mechanic, who takes a step back, takes his time to analyze the problem, and uncovers that the source of the leak is a cracked valve cover gasket. Once that is replaced, the leak is permanently fixed. The owner is relieved, but he has needlessly spent time and money and suffered disappointment because the first mechanic treated the symptoms instead of spending a bit more time finding the root cause. The owner would never go back to that mechanic.

The Why Behind Every Problem

Root Cause Analysis is a systematic process to uncover the true source of a problem or incident. By addressing root causes, teams prevent recurrence and build robust systems. Unlike quick fixes that address symptoms, RCA drives meaningful change by targeting the underlying problems.

Engaging in Root Cause Analysis (RCA) offers numerous tangible advantages that significantly benefit organizations. By identifying and addressing underlying issues, RCA helps minimize repeated outages, effectively reducing downtime and ensuring smoother operations. It also enhances processes and reliability by optimizing workflows and systems, leading to improved efficiency and stability. Furthermore, RCA fosters a culture of continuous improvement, promoting accountability and encouraging teams to embrace proactive problem-solving. This holistic approach not only resolves current issues but also strengthens the foundation for long-term success.

RCA aims to achieve the following critical objectives:

Identify the True Source of the Problem

You may only see the symptoms, but the problem could run much deeper!

Determining the specific underlying causes, instead of addressing symptoms alone, is crucial. Many a time, people apply plasters to counter the symptoms without investing time in understanding why the symptoms manifested in the first place. Needless to say, these symptoms will rear their ugly heads from time to time unless the reasons behind them are addressed.

Implement Effective Corrective Actions

Developing actionable and lasting solutions that fix the root causes identified during the RCA is one of the fundamental objectives.

Prevent Similar Incidents in the Future

Introducing process improvements to eliminate the risk of recurrence is another important objective.

Fix It at the Top!

Living in Sri Lanka, we know that to eliminate corruption, it must be tackled at the top. However, what happens in reality is that the big fish go scot-free while the small fry get caught, leading to the situation we are in today.

Getting back to RCA, we have observed that once you start plotting a graph of the root causes, many incidents trace back to a handful of ultimate causes, as shown in the illustration above. Fixing the problems closer to the top eliminates problems and incidents further down the hierarchy. For example, incidents A, B and C are all due to a partial deployment outage. Further analysis uncovered a misconfiguration, which occurred due to a lack of awareness in the SRE team as well as poor documentation, which ultimately point to operational process issues and product management issues. This indicates that other problems could stem from those two roots, and hence, to avoid future problems, we should scrutinize and update these processes.
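
To make the "fix it at the top" idea concrete, here is a minimal sketch that walks a cause graph upwards and counts how many incidents end at each ultimate cause. The graph mirrors the example in the paragraph above, but the exact edges (for instance, which root each intermediate cause points to) and the function names are my assumptions for illustration, and the graph is assumed to have no cycles.

```python
from collections import Counter

# Hypothetical cause graph: each entry maps a problem to the causes directly behind it.
# The edge layout is an assumption based on the example in the text.
CAUSES = {
    "incident A": ["partial deployment outage"],
    "incident B": ["partial deployment outage"],
    "incident C": ["partial deployment outage"],
    "partial deployment outage": ["misconfiguration"],
    "misconfiguration": ["lack of awareness in SRE team", "poor documentation"],
    "lack of awareness in SRE team": ["operational process issues"],
    "poor documentation": ["product management issues"],
}


def ultimate_causes(node, graph):
    """Return the causes reachable from `node` that have no further cause behind them."""
    parents = graph.get(node, [])
    if not parents:
        return {node}
    found = set()
    for parent in parents:
        found |= ultimate_causes(parent, graph)
    return found


def rank_ultimate_causes(incidents, graph):
    """Count how many incidents trace back to each ultimate cause."""
    counts = Counter()
    for incident in incidents:
        counts.update(ultimate_causes(incident, graph))
    return counts


ranking = rank_ultimate_causes(["incident A", "incident B", "incident C"], CAUSES)
print(ranking.most_common())
```

The causes with the highest counts are the ones worth fixing first, because every incident below them in the graph depends on them.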

Conducting an RCA

Conducting an effective RCA requires a methodical approach. Here are the fundamental steps in detail:

Define the Problem Clearly: Frame the issue with a specific and concise statement to guide the analysis. Clearly articulating the problem is as good as solving half the problem.
Focus on Root Causes, Not Symptoms: Look beyond immediate effects to discover the deeper systemic issues.
Gather Data and Evidence: Collect logs, metrics, timelines and all relevant data to establish a factual foundation for the analysis.
Involve Relevant Stakeholders: Include people from different roles to ensure diverse perspectives.
Use Structured Tools and Techniques: Leverage methods like Fishbone Diagrams and 5-Whys for a systematic exploration of causes.
Develop Practical Solutions: Focus on realistic and actionable remedies that can be effectively implemented. Ivory tower solutions will not yield anticipated results.
Focus on Blameless RCA: Encourage open communication and analyze processes instead of blaming individuals.
Document and Share Findings: Compile and disseminate results to ensure organizational learning and transparency.
Implement and Verify Corrective Actions: Execute and monitor solutions to validate their effectiveness.
Promote Continuous Improvement: Use RCA outcomes to refine practices and foster a culture of learning.

Blameless RCA: Focus on systems and processes, not individuals

බඳුන් සෝදන අයගේ අතින් තමා පිඟන් කැඩෙන්නේ! (Those who wash the dishes are the ones most likely to break a plate!)

What would be the best way to avoid making mistakes? Not doing anything would ensure that you don’t make any. But humanity wouldn’t progress if everyone thought like that. Most people don’t make mistakes deliberately, and hence it is better to give individuals the benefit of the doubt.

Blameless RCA fosters an environment of psychological safety, where individuals feel secure sharing mistakes and insights. Instead of asking, “Who caused the problem?” teams focus on “What allowed this problem to happen?” This approach encourages collaboration, honesty, and innovation.

How to Conduct a Blameless RCA:

Establish Psychological Safety: Cultivate a culture where team members can openly discuss issues without fear of blame or retribution.
Focus on Systems and Processes: Examine workflows, tools, and systemic factors contributing to the problem instead of attributing fault to individuals.
Use Neutral Language: Frame discussions constructively, such as “The configuration check was missed” instead of “The engineer failed to check the configuration.”
Rely on Data: Base analysis on objective evidence like logs, metrics, and timelines rather than assumptions and biases.

Methodologies

Two popular tools for RCA are Fishbone Analysis and the 5-Whys technique. Let’s explore how they work.

5-Whys Analysis

The 5-Whys method is an iterative questioning technique used to drill down to the root cause. Here’s an example:

Why did the system go down? — A configuration file was missing.
Why was the configuration file missing? — It wasn’t included in the deployment process.
Why wasn’t it included? — The checklist didn’t cover this file.
Why didn’t the checklist cover it? — The checklist was outdated.
Why was the checklist outdated? — No process existed for regular updates.

By repeatedly asking “Why?” teams can uncover deeper systemic issues that might otherwise be overlooked. Although this is called the 5-Whys method, it is not mandatory to ask the question exactly five times. The depth can be more or fewer than five, as required.

Fishbone Analysis

My version of fishbone analysis which combines 5-whys with Ishikawa diagrams

The Fishbone Diagram, also known as the Ishikawa Diagram, is a visual tool that maps out cause-and-effect relationships. The diagram consists of:

The Head: Represents the defined problem or issue.
The Bones: Major categories of causes, such as People, Process, Equipment, Materials, Environment, and Management. This encourages a multi-dimensional analysis of the problem.
The Sub-Causes: Specific contributing factors branching out from the main categories. At each of these “bones”, we would conduct a 5-whys analysis or at least ask the question “why” as many times as appropriate.

By systematically breaking down causes, teams can comprehensively explore all potential contributors to the problem.

Hypothetical Scenario: WSO2 Gateway Timeout Issue

Let’s apply the above methodologies to a hypothetical scenario involving a WSO2 cloud deployment to get a better understanding.

Problem Statement

Multiple users report gateway timeout errors after a recent update. These errors are impacting API calls and causing disruptions.

Initial Investigation:

Following the incident run book, the SRE team reverted recent changes to mitigate the impact temporarily.

Stakeholders: feature developers from the product team, product leads, SRE members who handled the incident, SRE leads, relevant CS leads.

5-Why Analysis:

Why did the gateway timeout occur? The product change introduced a mandatory configuration parameter. Not setting it changes the behavior of the product.
Why was this not detected by the product team? The product team tested it with the parameter properly set. Why? The product team always tests fresh product builds with the latest config files and doesn’t test with older config files.
Why wasn’t this mandatory configuration change communicated to the SRE team? The product team forgot to update the documentation and the related change log. Why? The feature release checklist doesn’t mandate checking whether docs have to be updated.
Why wasn’t this parameter introduced with a sensible default, so that existing systems would not be impacted? The impact was overlooked during the design phase of the feature. Why was that? Impact on existing deployments is not considered as part of the feature design phase.
Process: Why didn’t SRE detect this issue until users reported it? The monitors were missing. Why? Along with the new feature deployment, the monitor was disabled and they forgot to enable it. Why? There is no process to keep track of temporarily disabled monitors.

Fishbone Analysis:

People: Lack of awareness about configuration changes.
Process: No checklist for updating configurations.
Equipment: Missing monitors for new deployments.
Materials: Outdated reference documentation.
Environment: High system load during the update.
Management: Lack of oversight on deployment procedures.

Root Causes:

Lack of a standardized process for updating and validating configurations.
No process to track temporarily disabled monitors.

Action Items:

Introduce sensible defaults for new configurations.
Test with older configuration files during development.
Mandate documentation updates in release checklists.
Implement a system to track temporarily disabled monitors.
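
As a rough illustration of the last action item, the sketch below keeps a registry of temporarily disabled monitors, each with a deadline for re-enabling it. The class names, fields and monitor names are hypothetical. A real setup would more likely sit on top of the monitoring tool's own API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DisabledMonitor:
    name: str
    disabled_by: str
    reason: str
    re_enable_by: datetime


class DisabledMonitorRegistry:
    """Remembers which monitors were switched off, by whom, and until when."""

    def __init__(self):
        self._entries = {}

    def disable(self, name, disabled_by, reason, now, max_duration):
        # Every disable must come with a deadline, so nothing stays off silently.
        entry = DisabledMonitor(name, disabled_by, reason, now + max_duration)
        self._entries[name] = entry
        return entry

    def enable(self, name):
        self._entries.pop(name, None)

    def overdue(self, now):
        """Monitors whose re-enable deadline has passed, oldest deadline first."""
        late = [e for e in self._entries.values() if e.re_enable_by <= now]
        return sorted(late, key=lambda e: e.re_enable_by)


registry = DisabledMonitorRegistry()
start = datetime(2026, 1, 5, 10, 0)
registry.disable("gateway-latency", "sre-oncall", "feature deployment", start, timedelta(hours=2))
registry.disable("gateway-5xx", "sre-oncall", "feature deployment", start, timedelta(hours=6))

overdue_names = [e.name for e in registry.overdue(start + timedelta(hours=3))]
print(overdue_names)
```

Running `overdue()` on a schedule and paging on a non-empty result turns "they forgot to enable it" into something the system itself catches.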

Post-RCA Actions

Effective RCA doesn’t end with identifying causes. It’s critical to:

Implement Recommendations: Ensure timely execution of corrective actions and monitor their success.
Update Processes: Revise and standardize workflows to eliminate the recurrence of similar issues.
Monitor Progress: Establish metrics and KPIs to assess the effectiveness of solutions over time.
Share Lessons Learned: Document findings comprehensively and share them across teams to promote organizational learning.

Common Pitfalls to Avoid

Inadequate Stakeholder Involvement: Failing to include all relevant parties can lead to incomplete analyses.
Vague Problem Statements: Ambiguous definitions of the issue hinder the identification of root causes.
Superficial Analysis: Avoiding deeper exploration due to time constraints or fear of blame results in recurring problems.
Overlooking Systemic Changes: Addressing only immediate issues without improving underlying processes leads to recurring incidents.

Key Takeaways

Fix the root, not just the symptom.
Focus on systems, not individuals.
Learn, improve, and prevent future issues.

Root Cause Analysis is more than just a problem-solving tool; it’s a mindset. By systematically identifying and addressing root causes, organizations can build resilient systems, foster collaboration, and drive continuous improvement. Remember, every problem is an opportunity to learn and grow.

The Curious Case of the Leaking Land Rover was originally published in operational-sympathy on Medium, where people are continuing the conversation by highlighting and responding to this story.

Cloud Mechanics: The Cost of Customer Involvement in Managed Cloud Services

Afkham Azeez — Thu, 19 Feb 2026 06:44:57 GMT

As someone fascinated by all things mechanics, I find inspiration in the humor of the sign displayed above from a mechanic’s shop. I must admit, I’ve been guilty of similar behavior in my role as a customer, and I wouldn’t blame the mechanic for wanting to charge me extra. In my role leading the SRE team, I often encounter situations where I wish we had a similar pricing board for our managed cloud services.

Don’t get me wrong. I myself enjoy tinkering with machines and sometimes mess up, and have to request the service of a career mechanic, at which point I end up spending more — to rectify some of the damages I may have done, in addition to rectifying the original problem.

So, what parallels can be drawn between customer interference in a mechanic’s work and the involvement of customer teams in how we install, manage, and operate managed clouds? In what ways does increased customer involvement drive up our costs?

As a business, it’s highly cost-effective for us to run deployments using standardized installation and monitoring scripts that adhere to well-defined, tried-and-tested processes. In other words, utilizing ‘cookie-cutter’ deployments allows us to take full advantage of established models and leverage the team’s deep familiarity with the tools, workflows, and methodologies. This streamlined approach not only reduces complexity but also enables us to achieve economies of scale, which translates into tangible benefits for our customers. By minimizing variations and complications, we can lower operational costs, ensure faster response times, and improve overall adherence to SLAs, ultimately providing a more efficient and cost-effective service.

Unfortunately, this streamlined approach isn’t always feasible. Many customers come with their own internal operations or cloud teams, along with specific standards, preferred technologies, governance frameworks, and security policies that they impose on our operations team. These internal requirements may be well-suited to the customer’s broader IT environment but often complicate the deployment and management of our managed cloud services. As a result, the ‘cookie-cutter’ solution that offers simplicity and efficiency must be set aside, and instead, we are required to make sometimes significant customizations to meet these specific demands.

These customizations can involve adopting entirely different governance policies, reconfiguring network setups, altering monitoring tools and strategies, and making changes at the process level to comply with customer-defined standards. This adds layers of complexity that deviate from our tried-and-tested methods, which are optimized for best practices, compliance, performance, cost efficiency, and scalability. The more deviations there are from our standard processes, the harder it becomes to leverage economies of scale, driving up both the time and cost involved in maintaining these environments.

Moreover, a key challenge arises when our team is given limited access to manage and operate the systems. Many customers want to retain a degree of control over their infrastructure, limiting our ability to make real-time decisions or automate certain processes. In such cases, the terms of operation are dictated by the customer’s teams, which can severely restrict our ability to respond quickly to incidents, roll out updates fast, improve continuously, or optimize performance. This restricted access also reduces the agility that our teams rely on to efficiently manage cloud infrastructure and maintain uptime.

To further complicate matters, the customer’s internal teams sometimes make changes to the deployment environment without notifying us in advance. These uncoordinated changes, whether it’s tweaking configurations or updating systems, can lead to unforeseen issues, system downtime, or performance degradation. When problems occur, it can take significantly longer to diagnose and resolve the root cause, especially when we have to backtrack through unauthorized changes. This type of scenario not only increases downtime but also requires us to dedicate extra resources for troubleshooting, which in turn inflates costs for the customer.

On top of these operational hurdles, the increased communication overhead between our team and the customer’s internal teams adds another layer of complexity. Constant back-and-forth discussions, meetings to align processes, and additional approvals slow down the overall speed at which we can execute tasks. Each decision, change, or issue requires more steps for validation, which not only prolongs implementation timelines but also diverts resources away from more critical tasks. This coordination overhead translates directly into higher operational costs, as more man-hours are spent on managing communication, mitigating risks, and rectifying issues, rather than focusing on the core service.

Conflicts between our team and the customer’s operations teams often arise when there are differing priorities, technical approaches, or misaligned expectations. These disagreements can lead to significant delays as both teams try to resolve issues, navigate organizational politics, and reach compromises. The back-and-forth process consumes time, money, and energy that could be better spent on continuous improvement, R&D or optimizing the cloud environment. Prolonged conflicts can also take a mental and human toll on the teams involved, causing frustration, burnout, and diminished morale. This not only impacts the quality of work but also strains relationships, making collaboration even more difficult moving forward.

All of these factors combined — customized requirements, restricted access, customer-induced changes, and communication bottlenecks — drive up costs considerably. What could have been a smooth, cost-effective, and high-performance deployment becomes bogged down by inefficiencies. While we are fully capable of adapting to these custom demands, the associated costs, both in time and resources, increase for everyone involved. For the customer, this translates into higher service costs, longer response times, and potentially less effective cloud infrastructure management, as the added complexity detracts from our ability to deliver an optimal solution.

Just like the mechanic’s banner that escalates rates based on the customer’s level of involvement, the same principle holds true in our managed cloud services. When customers allow us to operate with minimal interference, using standardized processes and well-defined models, we can deliver efficient, cost-effective solutions. However, when they step in — whether by imposing custom requirements, limiting our access, or making changes without coordination — the complexity rises, much like when a customer “helps” the mechanic.

As with the mechanic who charges more for having to undo or navigate around a customer’s input, the more involvement and customizations required by the customer’s team, the higher the costs. The additional communication, troubleshooting, and reconfiguration eat away at the efficiency that would otherwise lead to lower costs, faster response times, and better outcomes.

Ultimately, the key to achieving the best results, both for us and our customers, is trust. Just as a mechanic delivers the best service when left to do their job, we provide the most efficient and cost-effective cloud management when we can apply our expertise with minimal constraints. When customers give us the freedom to operate smoothly, everyone benefits — through lower costs, streamlined operations, and optimal performance.

Cloud Mechanics: The Cost of Customer Involvement in Managed Cloud Services was originally published in operational-sympathy on Medium, where people are continuing the conversation by highlighting and responding to this story.

Incident Fatigue: The Hidden Reliability Risk in SRE Teams

Wickram Bagawathinathan — Wed, 18 Feb 2026 09:31:58 GMT

If you’ve ever been on call, you already know it… Let’s be very honest, folks…

Most outages, at least for me, don’t start with “Ohh, the system is down!!!” They actually start with “Ughhhh, no… not another one, please…”

Your phone buzzes at 02:17 AM, when you try to land on the moon 😋.
You squint at the alert. It looks familiar, very familiar indeed. And your brain urghs, “didn’t we fix this thing last week???”

Welcome to “Incident Fatigue”, one of the most ignored reliability risks in SRE teams!!!

Let me first break down what incident fatigue is, in plain English:

Incident fatigue happens when engineers deal with too many alerts, too many incidents, and too much pressure, for too long without meaningful improvement. “Oh yeah, even that too many ‘too’ irritates me now!!!”

Well… It’s not about being “weak” or “bad at stress”.

This is exactly what happens when alerts keep firing, incidents keep repeating, and mostly “nothing” really changes.

Eventually, even the best engineers stop reacting with urgency, not because they don’t care but because their brains are tired of seeing and attending to the same repetitive alerts. And I believe tired brains make bad decisions in urgent situations!!!

Let’s see whether you have seen this already…

Imagine this:

Your monitoring system fires ~300 alerts per day. Let’s assume high memory and CPU alerts for a particular deployment.
99% of them do not actually impact a business use case or users, until you get that 1% when the system is completely down.
The same outage happens every month or so. Even the customer is aware of it, and whenever there’s an outage, the customer asks to restart the VM.
Postmortems exist… but the action items quietly die. How?

At first:

Alerts are attended to very quickly.
Incidents are handled with care.
Everyone wants to fix this issue “properly”.

Some time later:

The same alerts are muted.
Incidents are acknowledged but not immediately.
Now we know fixes are temporary.
And when on-call is assigned, we know what to expect.

Is this an engineer’s fault or failure? Not really. This is a system design problem.

Now let’s understand how incident fatigue becomes a reliability issue rather than a human issue.

Here’s the uncomfortable truth to swallow:

A fatigued SRE team is a part of your production system!!! If your system requires tired humans to “save” it repeatedly, then the system itself is fragile.

MTTA increases (alerts sit unattended, or are resolved without actually being attended to).
MTTR increases (decisions take longer).
Human errors increase.
Small issues turn into major outages.
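
If you want to see whether MTTA and MTTR are actually drifting, a small calculation over your incident records is enough. This is only a sketch: the `Incident` fields and the sample timestamps are made up, and I am measuring both values from the moment the alert triggered, which is one common convention, not the only one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    triggered: datetime
    acknowledged: datetime
    resolved: datetime


def mtta_minutes(incidents):
    """Mean time to acknowledge, in minutes, measured from the trigger time."""
    return mean([(i.acknowledged - i.triggered).total_seconds() / 60 for i in incidents])


def mttr_minutes(incidents):
    """Mean time to resolve, in minutes, measured from the trigger time."""
    return mean([(i.resolved - i.triggered).total_seconds() / 60 for i in incidents])


def make_incident(triggered, ack_after_min, resolved_after_min):
    return Incident(
        triggered,
        triggered + timedelta(minutes=ack_after_min),
        triggered + timedelta(minutes=resolved_after_min),
    )


# Invented sample data: the same alert, early on and a couple of months later.
january = [
    make_incident(datetime(2026, 1, 10, 2, 17), 2, 30),
    make_incident(datetime(2026, 1, 24, 2, 17), 4, 50),
]
march = [
    make_incident(datetime(2026, 3, 7, 2, 17), 20, 90),
    make_incident(datetime(2026, 3, 21, 2, 17), 40, 150),
]

print(mtta_minutes(january), mttr_minutes(january))
print(mtta_minutes(march), mttr_minutes(march))
```

A steady climb in both numbers for the same class of alert is a hint that fatigue, not skill, is what changed.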

Still, your dashboards may look green… until the day they don’t.

The Silent Partner of Incident Fatigue: Psychological Safety

It gets worse when psychological safety is missing in your SRE team. What exactly is that?

Psychological safety, in simple terms, is when your SRE team thinks “We can speak up without getting blamed” during any incident. This matters a lot when it comes to resolving the issue!!!

What does a lack of safety look like?

Engineers, especially juniors, hesitate to suggest ideas.
People avoid escalating things that bother them.
Engineers stay quiet even when something feels wrong, and all your sync calls are silent.
Postmortems turn into “who broke it?”. Your lead says, “let me find the person responsible for the change that caused the issue”.

What to expect as the outcome?

Problems escalate and always require intervention from seniors, as they are the ones answerable.
Signals are missed as the stress goes up.
Learning and motivation go down as your team feels unsafe.

Ironically, teams that blame individuals end up with more incidents, not fewer.

The Postmortem Trap

Most teams say they do blameless postmortems. Let’s test that now.

Answer the question — no cheating.

👉 Do people still hesitate to admit mistakes?
If yes, it’s not blameless. It’s just a document with rich words.

A good postmortem asks:

What signals did we miss?
Why did the system allow this failure?
What made the “wrong” action seem reasonable at the time?

A bad one asks:

Who created the change request?
Why didn’t they test enough?
Why was the test case missed?
Why wasn’t this caught?

Now think: which one reduces future incidents? The good one or the bad one?

Team Resilience: The Metric that Nobody Graphs

We SREs love metrics: latency, error rate, availability, et cetera, et cetera.

But almost no one graphs team resilience. Yet, it’s one of the best predictors of future outages.

Signs your team is losing resilience:

Same incidents repeating.
People try to escape from on-call rotations.
Engineers are hesitant to deploy changes.
“Hero engineers” always fixing things.
High turnover in Ops roles or sometimes, low performance as a team.

If people are burning out, reliability debt is accumulating.

What Should SRE Teams Focus On?

Treat alerts like production code:
- Alerts must be actionable.
- Alerts must represent user pain.
- Alerts are regularly reviewed and updated.
If an alert wakes someone up, it better deserve it.
Fix classes of problems, not symptoms:
- Instead of “Restart the pod when memory spikes” ask “Why does memory spike every Monday?”.
Permanent fixes reduce both incidents and fatigue.
Protect psychological safety on purpose:
- Anyone can call out concerns.
- Incident commanders rotate.
- Leads admit mistakes publicly.
- Postmortems are about learning, not judging.
Safe teams respond faster. Unsafe teams freeze.
Track human sustainability:
- Alerts per engineer.
- Incidents per on-call shift.
- After-hours pages.
- PTO after incidents.
Again… not to judge people, but to protect them.
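
Here is one way the first few of those numbers could be pulled from a paging log. Take it as a sketch only: the engineer names, the working hours and the shape of the input are all assumptions, and your paging tool probably exports something richer.

```python
from collections import Counter
from datetime import datetime

# Assumed working hours; adjust to your team's reality.
WORK_START_HOUR = 9
WORK_END_HOUR = 18


def is_after_hours(ts):
    """True on weekends, or outside the assumed working hours on weekdays."""
    return ts.weekday() >= 5 or not (WORK_START_HOUR <= ts.hour < WORK_END_HOUR)


def sustainability_report(pages):
    """Summarise pages per engineer. `pages` is a list of (engineer, datetime) tuples."""
    per_engineer = Counter(name for name, _ in pages)
    after_hours = Counter(name for name, ts in pages if is_after_hours(ts))
    return {
        name: {"pages": per_engineer[name], "after_hours": after_hours[name]}
        for name in per_engineer
    }


# Invented sample data.
pages = [
    ("asha", datetime(2026, 2, 16, 2, 17)),   # Monday, middle of the night
    ("asha", datetime(2026, 2, 16, 14, 0)),   # Monday, working hours
    ("ben", datetime(2026, 2, 21, 11, 0)),    # Saturday
]

report = sustainability_report(pages)
print(report)
```

The point of the report is to spot who is carrying the load, so the rotation or the alerts can be fixed, not to rank people.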

Final food for your thoughts…

Will incident fatigue announce itself?
Or does it sneak in quietly?

What could be the scariest part?
Is it that the best people have already started looking elsewhere?

If you want reliable systems, build reliable teams first.

Incident Fatigue: The Hidden Reliability Risk in SRE Teams was originally published in operational-sympathy on Medium, where people are continuing the conversation by highlighting and responding to this story.

The $180,000 Log Line: Hard Earned Lessons in Production Observability

Dilshan Fardil — Wed, 18 Feb 2026 09:17:16 GMT

Hard-earned lessons from the front lines: why one missing log statement can cost $180K and how simple decisions prevent production nightmares

Late night. Somewhere, an engineer’s phone lights up. The pager. They know before they even look: it’s bad.

They stumble to their laptop, heart racing. The dashboard loads. Payment processing is down. Customers can’t complete transactions. The revenue counter is frozen. Every minute costs thousands of dollars.

They check the metrics. CPU: 40%. Memory: normal. No error logs. The system looks… fine.

Except nothing is fine. Nothing works. And they have no idea why.

Eight hours later, after emergency war rooms, executive escalations, and five engineers pulled from their weekends, they find it. All worker threads, every single one, stuck waiting on a slow external API call. Thread pool exhausted. Not by high load. Not by a bug. Just… waiting.

The fix takes 5 minutes. The incident took 8 hours. Cost: $180,000 in lost revenue. Three enterprise customers escalating to executives. A team that spent their weekend fighting a fire.

All because someone decided “We’ll add thread pool monitoring later.”

Nothing fancy needed. One single log line. One simple metric. That’s all it would have taken.

This Isn’t Hypothetical

At WSO2, we’ve been in the trenches supporting production systems for years. We’ve seen these scenarios play out again and again, sometimes in systems we operate, often in systems our customers run. We’ve been on those late-night calls. We’ve watched how a missing log statement turns a 20-minute fix into an 8-hour nightmare.

My colleague Afkham Azeez recently wrote about operational sympathy: designing systems with explicit awareness of how they’ll behave and fail in production. It’s brilliant, and well worth reading.

Today, I want to talk about one specific piece of that: observability. Not the theory. Not the tools. The human decisions we make every day that determine whether the next production incident is a 20-minute hiccup or an 8-hour catastrophe.

The Debt That Hides Until Production

Technical debt slows down development. Everyone sees it. Everyone complains about it.

Observability debt? Silent. Invisible. Until production, until customers are affected, until dawn when someone is blind and desperate.

We’ve seen it accumulate every time teams say:

“It works on my machine.” But do you know how it fails under real load?
“We’ll add logging if we need it.” How will you know you need it when you’re blind?
“The ops team handles monitoring.” Can they monitor business logic they don’t understand?
“We’ll circle back after this sprint.” Will anyone remember what to instrument later?

Each decision feels small. Each sprint has urgent features. Each shortcut seems reasonable. Until production. Until customers. Until executives.

The Real Cost Isn’t Just Money

Yes, we’ve seen incidents cost $180,000 in lost revenue during thread pool exhaustion. $15,000 in SLA credits when database queries slow down. $230,000 in abandoned carts when payment systems cascade. These are real numbers from real production incidents.

But the spreadsheet doesn’t capture:

The engineer who couldn’t sleep for a week after being on-call during a major outage. The one who kept replaying it: “What if I’d added that one metric? What if I’d thought to check thread states?”

The customer who escalated to the CEO because their Black Friday sales were processing at a 13% success rate. The relationship that took a year to build and one incident to damage.

The team that spent three weeks chasing a memory leak, over-provisioning infrastructure, and trying random fixes, when one cache size metric would have pointed straight to the problem on day one.

The product launch delayed because the team was fighting production fires instead of building features.

This is the real cost. The human cost. The trust cost. The opportunity cost. And it all traces back to simple decisions made months earlier.

A Few Patterns We See Repeatedly

Let me share four patterns we’ve observed across many production systems. Not pointing fingers, just sharing what we’ve learned from the front lines. The human impact. The simple decision that would have changed everything.

The Thread Pool That Looked Fine

HTTP 503s everywhere. Service completely unresponsive. But dashboards show CPU at 40%, memory normal. From standard metrics, everything looks healthy.

The investigation: Hours of waiting for reproductions, manual thread dump captures. Multiple engineers pulled in.

The root cause: All threads blocked waiting on a slow external service. Not working. Not crashed. Just waiting.

What one log would have saved: Thread pool utilization + thread state. Would have shown “95% threads waiting” in the first 2 minutes.

The pattern: Threads can be “busy” without consuming CPU. If you only monitor CPU, you’re blind to blocking operations.
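
To show how little code this takes, here is a minimal sketch of a thread pool wrapper that can report how many of its workers are occupied. `InstrumentedPool` is a hypothetical name, not an existing library class, and the sketch only covers utilization. Thread state (running versus waiting) needs runtime-specific tooling, such as thread dumps on the JVM, and is not shown here.

```python
import logging
import threading
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("pool")


class InstrumentedPool:
    """Hypothetical wrapper around ThreadPoolExecutor that counts occupied workers."""

    def __init__(self, max_workers):
        self.max_workers = max_workers
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._busy = 0

    def submit(self, fn, *args, **kwargs):
        def tracked():
            with self._lock:
                self._busy += 1
            try:
                return fn(*args, **kwargs)
            finally:
                with self._lock:
                    self._busy -= 1

        return self._executor.submit(tracked)

    def utilization(self):
        """Fraction of workers currently inside a task, whether working or waiting."""
        with self._lock:
            return self._busy / self.max_workers

    def log_utilization(self):
        pct = self.utilization() * 100
        log.warning("thread pool: %.0f%% of %d workers occupied", pct, self.max_workers)

    def shutdown(self):
        self._executor.shutdown(wait=True)


# Simulate a slow external call that blocks every worker without using CPU.
release = threading.Event()
all_started = threading.Barrier(5)  # four workers plus the main thread


def slow_external_call():
    all_started.wait()
    release.wait()


pool = InstrumentedPool(max_workers=4)
futures = [pool.submit(slow_external_call) for _ in range(4)]
all_started.wait()            # every worker is now stuck inside the call
saturated = pool.utilization()
pool.log_utilization()        # this is the one log line
release.set()
for future in futures:
    future.result()
pool.shutdown()
```

Calling `log_utilization()` on a timer, or exporting the same value as a gauge, is what would have turned hours of thread dump hunting into a two-minute diagnosis.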

The Query Nobody Could Find

Response time: 200ms → 8 seconds. Random restarts, network checks, database restarts. Hours of guessing.

Root cause: One poorly-optimized query in a recent deployment causing table locks.

What one metric would have saved: Query execution time by type. Would have pointed directly to the culprit.
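
A sketch of what "query execution time by type" could look like in code. `QueryTimer` and the query type labels are hypothetical. In a real service the durations would go to your metrics system instead of an in-memory dictionary, and the clock is injectable here only so the example is repeatable.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class QueryTimer:
    """Hypothetical helper that records how long each type of query takes."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._durations = defaultdict(list)

    @contextmanager
    def track(self, query_type):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the query raised.
            self._durations[query_type].append(self._clock() - start)

    def slowest(self):
        """Return (query_type, worst duration in seconds), or None if nothing was recorded."""
        if not self._durations:
            return None
        worst = ((name, max(values)) for name, values in self._durations.items())
        return max(worst, key=lambda pair: pair[1])


# A fake clock keeps the example deterministic; drop the argument in real use.
fake_clock = iter([0.0, 0.2, 1.0, 9.0]).__next__
timer = QueryTimer(clock=fake_clock)

with timer.track("load_cart"):
    pass  # run the real query here

with timer.track("order_report"):
    pass  # run the real query here

culprit = timer.slowest()
print(culprit)
```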

The Cascade That Spread

Payment success rate: 99.9% → 87%. But only certain payment methods. Manual testing of each provider. Database examination.

Root cause: One payment provider timing out, causing retries that exhausted thread pools for ALL providers.

What one log would have saved: Timeout tracking per integration. Would have isolated the failing provider immediately.
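
Here is a minimal sketch of per-integration timeout tracking. `TimeoutTracker` and the provider names are made up, and I am assuming the client raises Python's built-in `TimeoutError`. Real client libraries often raise their own exception types, so the `except` clause would need adjusting.

```python
import logging
from collections import Counter

log = logging.getLogger("integrations")


class TimeoutTracker:
    """Hypothetical helper that counts calls and timeouts per external provider."""

    def __init__(self):
        self.calls = Counter()
        self.timeouts = Counter()

    def call(self, provider, fn, *args, **kwargs):
        self.calls[provider] += 1
        try:
            return fn(*args, **kwargs)
        except TimeoutError:
            self.timeouts[provider] += 1
            log.warning(
                "timeout calling provider=%s (%d of %d calls)",
                provider, self.timeouts[provider], self.calls[provider],
            )
            raise

    def timeout_rate(self, provider):
        calls = self.calls[provider]
        return self.timeouts[provider] / calls if calls else 0.0


def healthy_provider():
    return "ok"


def stalled_provider():
    raise TimeoutError("provider did not answer in time")


tracker = TimeoutTracker()
tracker.call("provider_a", healthy_provider)
for _ in range(3):
    try:
        tracker.call("provider_b", stalled_provider)
    except TimeoutError:
        pass  # the caller decides whether to retry

print(tracker.timeout_rate("provider_a"), tracker.timeout_rate("provider_b"))
```

With the provider name in every timeout log line, the failing integration stands out immediately instead of hiding behind an aggregate success rate.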

The Hidden Memory Leak

Pods dying every 6 hours. Memory climbing. Profiling. Heap dumps. Over-provisioning. Weeks of attempts.

Root cause: Unbounded cache storing session data indefinitely.

What one metric would have saved: Cache size monitoring. Would have shown unbounded growth from day one.
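
A small sketch of a cache that reports its own size. The size metric is the part that matters for this story. The entry limit and the least-recently-used eviction are my additions for illustration, and `MeasuredCache` is a hypothetical name.

```python
from collections import OrderedDict


class MeasuredCache:
    """Hypothetical cache that exposes its size and evicts the least recently used entry."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.evictions = 0
        self._data = OrderedDict()

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the least recently used entry
            self.evictions += 1

    def get(self, key, default=None):
        if key in self._data:
            self._data.move_to_end(key)
            return self._data[key]
        return default

    def metrics(self):
        """Values worth exporting to a dashboard."""
        return {
            "cache_size": len(self._data),
            "cache_max_entries": self.max_entries,
            "cache_evictions": self.evictions,
        }


cache = MeasuredCache(max_entries=3)
for i in range(5):
    cache.put(f"session-{i}", {"user": i})

snapshot = cache.metrics()
print(snapshot)
```

Even without the limit, graphing `cache_size` over time would have shown a line that only ever goes up.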

See the pattern?

Incidents that took hours or weeks. Fixes that took minutes. All preventable with simple, deliberate observability decisions made during development.

A Simple Framework with Five Questions Before You Ship

You don’t need a PhD in observability. You don’t need expensive tools. You need to pause and ask five questions before you ship:

If this breaks, will I know WHY? Not just that it’s broken, but why. What log or metric will tell me the root cause?
If this is slow, what will tell me WHERE? Database? External API? Lock contention? Thread pool exhaustion? Can I pinpoint it?
If this uses resources, am I tracking them? Thread pools, connection pools, caches, queues: if it can fill up or run out, monitor it.
If this calls external services, am I logging failures? Timeouts, retries, circuit breaker states. Not just “external call failed”, but which one and how.
Can I trace one user’s request end-to-end? Correlation IDs. Request IDs. Something that connects the dots across services.
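
For the last question, here is a minimal sketch of a request ID that follows a request through every log line, using only Python's standard library. The logger name, the log format and `handle_request` are illustrative assumptions. Passing the ID on to downstream services (for example, in a request header) is not shown.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled.
request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copies the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


def build_logger(name, handler):
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(logging.Formatter("%(request_id)s %(name)s %(message)s"))
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_id=None):
    # Reuse the caller's ID if one was sent, otherwise create a new one.
    token = request_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request received")
        logger.info("calling payment provider")
    finally:
        request_id.reset(token)


stream = io.StringIO()
logger = build_logger("checkout", logging.StreamHandler(stream))
handle_request(logger, incoming_id="req-123")
lines = stream.getvalue().splitlines()
print(lines)
```

Searching the logs for one ID then gives you the whole story of one user's request, in order.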

Two minutes of thinking. Could save weeks of pain.

The Operational Sympathy Scorecard

In his operational sympathy post, Azeez shared a scorecard we now use at WSO2 for assessing production readiness. It’s brilliant in its simplicity:

Nine dimensions of operational readiness, weighted by importance. Each scored 0–5. Takes 15 minutes. Built-In Observability is one of the highest-weighted dimensions (15%), right alongside failure handling and recovery.

The question for observability: “Are meaningful metrics, logs, traces, and actionable alerts designed into the system?”

See the full scorecard here: https://docs.google.com/spreadsheets/d/1jryXy-aNQDoDgjMC8T2D5grdgP5bxQr-DwKJkB2hNfE/edit

Using this assessment before major releases forces uncomfortable conversations. But uncomfortable in code review is way better than uncomfortable at midnight in production.

To My Fellow Engineers: This Is Changeable

I know the pressure you’re under. Product wants features yesterday. The sprint is packed. “Just one more thing” feels like too much.

But here’s what we’ve learned from years in production:

That log line you skip today? That metric you defer? That’s not saving time. That’s borrowing from your future self. And your future self will pay it back late at night, with interest, in front of executives, while customers are affected.

We’ve been there. Staring at dashboards that tell us nothing. Trying random fixes. Waiting for thread dumps. Explaining to executives why we don’t know what’s wrong. Supporting customers through painful incidents.

Nobody wants to be in that position. And you don’t have to be.

This is changeable.

Not by heroic effort. Not by expensive tools. By simple, deliberate choices. By asking “what will I wish I had logged?” By spending 2 extra minutes thinking before you ship.

Those 2 minutes? They buy back hours. They buy back weekends. They buy back your sleep. They buy back customer trust.

Start Today. Start Small.

Don’t try to fix everything at once. Pick one thing:

This week: Add one resource metric to your busiest service. Thread pool. Connection pool. Cache size. Pick one.
Next code review: Ask “if this breaks, can we diagnose it?” If the answer is no, add what’s missing.
Before your next release: Run through the operational sympathy scorecard. Score yourself honestly. Pick your lowest-scoring area. Fix that first.
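
If the thread pool is the resource you pick for this week, a first metric could be as small as the sketch below. `InstrumentedPool` and `saturation()` are hypothetical names; the wrapper uses only the public `concurrent.futures` API and counts tasks that have been submitted but have not finished yet.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InstrumentedPool:
    """Thread pool wrapper that tracks submitted-but-unfinished tasks."""

    def __init__(self, max_workers):
        self.max_workers = max_workers
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._in_flight = 0

    def submit(self, fn, *args, **kwargs):
        with self._lock:
            self._in_flight += 1
        try:
            future = self._pool.submit(fn, *args, **kwargs)
        except Exception:
            self._task_done()  # the task never started, so undo the count
            raise
        future.add_done_callback(self._task_done)
        return future

    def _task_done(self, _future=None):
        with self._lock:
            self._in_flight -= 1

    def saturation(self):
        """In-flight tasks divided by workers; above 1.0 means work is queueing."""
        with self._lock:
            return self._in_flight / self.max_workers

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

Exporting `saturation()` as a gauge is one way to see a pool filling up before requests start timing out.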

Small steps. Deliberate choices. Compound returns.

The Engineer Who Thanks You Won’t Be You

When you add that log line, that metric, that trace, you won’t get a thank you. It won’t show up in sprint demos. Product won’t celebrate it.

But somewhere, sometime, an engineer you’ve never met will be on-call. Something will break. They’ll open the dashboard. And they’ll see exactly what they need.

They’ll fix it in 20 minutes instead of 8 hours. They’ll go back to sleep. Their weekend won’t be ruined. Customers won’t be affected. Executives won’t get escalation emails.

They won’t know your name. They won’t know it was you who added that metric during development.

But they’ll be grateful anyway.

Be the engineer who designs for the crisis before it happens. Think about the incident when you’re writing the code, not after it’s burning in production. Because the next engineer on-call might be you.

Read more about operational sympathy: https://medium.com/@afkham-azeez/operational-sympathy-8a9c5dc26b5a

Use the scorecard: https://docs.google.com/spreadsheets/d/1jryXy-aNQDoDgjMC8T2D5grdgP5bxQr-DwKJkB2hNfE/edit

The $180,000 Log Line: Hard Earned Lessons in Production Observability was originally published in operational-sympathy on Medium.

Welcome!

Afkham Azeez — Wed, 18 Feb 2026 08:12:20 GMT

Welcome to the Operational Sympathy publication. We are a team of writers who are passionate about the subject and will share our thoughts and concepts. The blog on Operational Sympathy introduced the concept and you can read more about it there.

Definition

Operational Sympathy is a mindset and practice where software architects, designers, and developers intentionally design how their systems behave in production under load, during failures, and under security threats by planning failure modes, graceful degradation, built-in observability, and clear operational run books that enable early detection and fast recovery.

The thinking goes like this: if you, the developer, were the person woken at 2 AM by an alert, what would you do to make your life easier so that the incident could be resolved with the least effort and in the shortest possible time?

If you are a software architect, developer, SRE, or DevOps engineer, please subscribe to this publication to get notified about new articles in this area.

Welcome! was originally published in operational-sympathy on Medium.