Stories by Nikhil Ninawe on Medium

Automating the Hotfix Pipeline: How We Empowered Engineering Managers and Cut Release Bottlenecks

Nikhil Ninawe — Sun, 10 May 2026 05:28:03 GMT

A practical guide to reducing release team dependency, one Jenkins job at a time.

Every SaaS company that ships frequently eventually hits the same wall: the hotfix process becomes a bottleneck. In our org, we reached a point where every critical fix, no matter how small, required the release engineering team to manually create Jira versions, spin up Slack channels, create branches, trigger builds, and orchestrate deployments. Our engineers could write a one-line fix in 10 minutes, but getting it to production took hours of coordination.

Last week, I set out to change that. Here’s how we automated our hotfix pipeline end-to-end and gave engineering managers the power to ship critical fixes without waiting on the release team.

The Problem: Too Many Handoffs, Too Little Autonomy

Our hotfix process had seven discrete manual steps, most of which required the release team to act as a human router. A typical flow looked like this:

Release team creates a new hotfix version in Jira
Release team creates a Slack channel, adds the right people, posts a kickoff message
Release team creates the hotfix branch in Bitbucket
Developer opens a PR against the hotfix branch
Release team reviews and merges the PR
Release team triggers the build manually
Release team deploys to the smoke environment and coordinates QA verification

The engineering managers — the people with the most context on what needed to be fixed — were spectators in their own hotfix process. Every step required a ping to the release channel, a wait for someone to pick it up, and a confirmation loop. During off-hours or weekends, a simple bug fix could sit for hours waiting on a human in the loop.

The Solution: Automate the Boring Parts, Delegate the Rest

We broke the problem into three layers: setup automation, merge access delegation, and build + deploy automation.

Layer 1: One-Click Hotfix Setup

The first bottleneck was the administrative overhead of starting a hotfix. Creating the Jira version, the Slack channel, adding team members, and posting the kickoff message — all of this was manual and error-prone.

We consolidated all of these into a single parameterised Jenkins job. An operator provides the release version and hotfix number, and the job handles everything:

Creates the fix version in Jira using the Jira REST API
Creates a dedicated Slack channel with a standardized naming convention
Adds the relevant team members automatically
Posts a templated kickoff message with all the context a developer needs to get started

To make the fix version creation even more resilient, I wrote a Python script that programmatically creates and tags Jira fix versions.
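
A minimal sketch of what such a script could look like, assuming Jira's `POST /rest/api/2/version` endpoint with basic auth. The base URL, project key, credentials, and the `<release>-hotfix-<n>` naming scheme are hypothetical placeholders, and newer Jira Cloud instances may expect `projectId` instead of `project`.

```python
import base64
import json
import urllib.request


def build_version_payload(project_key: str, release: str, hotfix_number: int) -> dict:
    """Build the JSON body for a new Jira fix version (naming scheme is illustrative)."""
    return {
        "name": f"{release}-hotfix-{hotfix_number}",
        "project": project_key,
        "description": f"Hotfix {hotfix_number} for release {release}",
        "released": False,
    }


def create_fix_version(base_url: str, user: str, token: str, payload: dict) -> dict:
    """POST the payload to Jira and return the created version as a dict."""
    credentials = base64.b64encode(f"{user}:{token}".encode()).decode()
    request = urllib.request.Request(
        url=f"{base_url}/rest/api/2/version",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    # urlopen raises on HTTP errors, so a failed creation stops the job loudly
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Keeping the payload builder separate from the HTTP call makes the naming convention easy to unit test without touching Jira.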

Layer 2: Giving Managers the Keys

This was as much a cultural shift as a technical one. Historically, only the release team had merge permissions on hotfix branches in Bitbucket. This made sense as a guardrail early on, but it had become a bottleneck.

We raised a formal request to provide merge access to engineering managers across all repositories. The principle was simple: managers already approve the code in review — they should be able to click the merge button too.

To support this, we also built a self-service Jenkins job that lets managers create hotfix branches themselves. No more waiting for the release team to cut a branch. A manager sees a critical bug, creates the branch, and tells their developer to push a fix.

Layer 3: Automated Build and Deploy

With managers now able to merge, we needed the downstream pipeline to be fully automated. Here’s what we built:

Bitbucket Webhooks → Jenkins Builds: Every merge to a hotfix branch triggers an automated build via a Bitbucket webhook. No one needs to click “Build Now.” The moment code lands on the hotfix branch, the pipeline kicks in.

Automatic Deployment to Smoke: After a successful build, the artifact is automatically deployed to our smoke-bravo environment. QA can begin validating within minutes of a merge, not hours.

What’s Still on the Roadmap

The pipeline isn’t fully autonomous yet. Here’s what’s next:

Automated Smoke Stack Creation: Right now, the smoke environment needs to exist before a deployment. We want the pipeline to automatically provision a smoke stack if one isn’t already running.
Automated Regression Suite Triggers: Post-deployment, regression tests should be triggered automatically. We’re evaluating whether this should be a Jenkins downstream job or a webhook-triggered test harness.
Smart Verification Tagging: The biggest remaining manual step is QA verification. Our plan is to introduce a smoke-verified tag on Jira tickets. Once QA validates a fix, they tag the ticket and post a smoke-verification-complete message. The system will then automatically check that all tickets in the hotfix are tagged and all verification messages are posted. Only when both conditions are met will it push an automated "all green" message to the hotfix Slack channel, signalling readiness for production.

This last piece is particularly exciting because it replaces the most anxiety-inducing part of the process: the manual “did everyone verify their fix?” follow-up loop.
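
The readiness rule described above can be sketched as a small pure function. This is an illustration of the planned check, not something that exists yet: the `Ticket` shape and the way Slack confirmations are collected into a set of ticket keys are assumptions.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set

VERIFIED_LABEL = "smoke-verified"  # label name from the plan above


@dataclass
class Ticket:
    key: str
    labels: Set[str] = field(default_factory=set)


def pending_tickets(tickets: Iterable[Ticket], verified_in_slack: Set[str]) -> List[str]:
    """Return keys of tickets still missing the label or the Slack confirmation."""
    return [
        t.key
        for t in tickets
        if VERIFIED_LABEL not in t.labels or t.key not in verified_in_slack
    ]


def is_all_green(tickets: Iterable[Ticket], verified_in_slack: Set[str]) -> bool:
    """True only when there is at least one ticket and none are pending."""
    tickets = list(tickets)
    return bool(tickets) and not pending_tickets(tickets, verified_in_slack)
```

Returning the pending keys, rather than only a boolean, lets the bot post exactly which tickets are still blocking the release.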

Lessons Learned

Start with the highest-friction handoff. We didn’t try to automate everything at once. We started with the merge permission change because it was the single most frequent bottleneck. Everything else cascaded from there.

Trust, then verify. Giving managers merge access required trust. But we didn’t remove guardrails — branch protection rules, required reviewers, and automated builds all still apply. We just removed the “wait for the release team to click a button” step.

Automate the boring parts first. The Jenkins job that creates a Jira version and a Slack channel saves maybe 10 minutes per hotfix. But it happens dozens of times per release cycle, and eliminating it removed an entire category of “I’m blocked waiting for someone.”

The Numbers

Before this work, a typical hotfix cycle — from “bug identified” to “deployed to smoke” — required 4–5 handoffs to the release team and could take anywhere from 2 to 8 hours depending on availability.

With the new pipeline, once a manager creates the branch and a developer pushes the fix, the path from merge to smoke deployment is fully automated and takes under 15 minutes. The release team is no longer in the critical path for hotfixes.

Wrapping Up

Release engineering is often invisible work. When it’s done well, nobody notices — deployments just happen. When it’s done poorly, everyone notices — and they notice loudly, at 2 AM, in an incident channel.

The work we did last week wasn’t about building something flashy. It was about removing friction, one Jenkins job at a time. It was about recognising that the people closest to the code should be empowered to ship fixes without waiting in a queue. And it was about turning a seven-step manual process into a pipeline where humans only need to do what humans are good at: writing the fix and verifying it works.

If your release process still has a human acting as a router between “code is ready” and “code is in the environment,” consider this your sign to automate it.

From Database Chaos to In-Memory Speed: Optimizing Error Monitoring at Scale

Nikhil Ninawe — Mon, 04 May 2026 01:33:26 GMT

How we reduced error alert processing time by 90% while maintaining flexibility

The Problem: Death by a Thousand Database Queries

Picture this: You’re an SRE team managing a complex microservices architecture across multiple environments. Your error monitoring system is generating thousands of alerts daily, but most of them are noise — known issues, third-party library warnings, or infrastructure hiccups you’ve already cataloged.

You need to filter these errors, but every single error check requires a database query to see if it matches a known pattern. Your monitoring system is drowning, processing is slow, and legitimate alerts are getting buried in the noise.

This was our reality. And this is the story of how we fixed it.

The Evolution: Three Iterations to Success

V1: The Naive Approach 🐌

Our first implementation was straightforward but painful:

# Check EVERY error against EVERY pattern in the database
for error in new_errors:
    for pattern in get_patterns_from_db():  # Database query!
        if pattern in error:
            ignore_error()
            break

The Problem:

1000 errors × 200 patterns = 200,000 database queries per run
Processing time: 5–10 minutes
Database load: Unsustainable
Alert latency: Unacceptable

V2: The Hard-Coded Solution 🏃

Desperate for speed, we hard-coded patterns:

IGNORED_PATTERNS = [
    "connection timeout",
    "third-party API error",
    # ... 200 more patterns
]

for error in new_errors:
    if any(pattern in error for pattern in IGNORED_PATTERNS):
        continue

The Result:

Processing time: < 10 seconds ⚡
Database load: Zero
But… updating patterns required code deployment 😱

V3: The Sweet Spot 🎯

What if we could have both speed AND flexibility? Enter: database-backed in-memory caching.

The Architecture: Best of Both Worlds

Here’s the key insight: patterns change infrequently, but we check them constantly.

The Solution Design

from typing import Set


class ErrorFilterDB:
    """
    Database-backed error filtering with in-memory caching.
    Load once, use thousands of times.
    """
    def __init__(self, host: str, username: str, password: str):
        self.host = host
        self.username = username
        self.password = password
        # In-memory caches
        self._error_patterns_cache: Set[str] = set()
        # Load patterns once at startup
        self.refresh_cache()

    def refresh_cache(self):
        """Load all patterns from database into memory."""
        conn = self.get_connection()
        try:
            with conn.cursor() as cursor:
                query = """
                    SELECT error_pattern
                    FROM ignored_errors
                    WHERE active = 1
                """
                cursor.execute(query)
                results = cursor.fetchall()
                # Store in a set to deduplicate; matching is still a scan over all patterns
                self._error_patterns_cache = {
                    row['error_pattern'] for row in results
                }
                print(f"Loaded {len(self._error_patterns_cache)} patterns")
        finally:
            conn.close()

    def should_ignore_error(self, error_string: str) -> bool:
        """
        Check if error matches any pattern.
        Uses generator expression for early exit.
        """
        return any(
            pattern in error_string
            for pattern in self._error_patterns_cache
        )

Why This Works

One-Time Load: Patterns loaded once at startup
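Set Storage: Patterns are deduplicated in memory (the substring check still scans the patterns, so it is O(n) in the pattern count, not O(1))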
Generator Expression: Early exit on first match
No Database Overhead: Zero queries during processing
Easy Updates: Change patterns in DB, restart service

The Complete Pipeline

Here’s how it all comes together:

# Initialize once at startup
print("Initializing error filter...")
error_filter = ErrorFilterDB.create_from_vault()
print(f"Loaded {len(error_filter._error_patterns_cache)} patterns")
# Main processing loop
while True:
    new_error_string = redis_client.lpop('elastalert.new_errors')
    if not new_error_string:
        break
    # Fast in-memory check - no database query!
    if error_filter.should_ignore_error(new_error_string):
        continue  # Skip known errors
    # Process new/unknown errors
    process_and_alert(new_error_string)

Handling Distributed Systems Challenges

Real-world systems aren’t perfect. Network latency, temporary outages, and slow queries happen. Here’s how we made our system resilient:

OpenSearch Optimization

# OpenSearch client with battle-tested timeout settings
es = OpenSearch(
    hosts=[{'host': f"{environment}-analytics.domain.net", 'port': 80}],
    timeout=60,           # Increased from default 10s
    max_retries=3,        # Retry on transient failures
    retry_on_timeout=True # Don't fail on temporary slowdowns
)

Why these numbers matter:

60s timeout: Gives complex aggregation queries time to complete
3 retries: Handles temporary network blips
retry_on_timeout: Prevents false failures during high load

Multi-Environment Architecture

Supporting multiple environments (production, staging, rehearsal) required careful design:

ENVIRONMENT = os.getenv('ENVIRONMENT', 'rehearsal')
webhook_map = {
    "rehearsal": webhook_url,
    "stage": stage_webhook_url,
    "production": production_webhook_url
}
# Environment-aware connections
redis_client = redis.Redis(host=f"{ENVIRONMENT}-platform-cache.domain.net")
es_client = OpenSearch(hosts=[{'host': f"{ENVIRONMENT}-analytics.domain.net"}])

This pattern ensures:

Same code runs everywhere
Environment-specific routing
No accidental cross-environment contamination

The Results: Numbers That Matter

Here’s what we achieved with the optimized v3 implementation:

Performance Metrics

Metric | V1 (Database) | V2 (Hard-coded) | V3 (Cached)
Processing Time | 5–10 minutes | <10 seconds | <10 seconds
Database Queries | 200,000/run | 0 | 1 (startup only)
Pattern Updates | Instant | Requires deployment | Restart service
Memory Usage | Low | Low | ~1MB for 1000 patterns
Maintainability | Good | Poor | Excellent

Real-World Impact

90% reduction in alert processing latency
99.9% reduction in database load
Zero code deployments needed for pattern updates
100% flexibility retained for pattern management

Advanced Patterns and Techniques

1. Factory Pattern for Credential Management

Instead of scattering credential logic, we centralized it:

@staticmethod
def create_from_vault():
    """Factory method using vault credentials."""
    host = 'common-vault.domain.net'
    username = os.environ['MYSQL_USERNAME']
    password = os.environ['MYSQL_PASSWORD']
    return ErrorFilterDB(host, username, password)
# Usage
error_filter = ErrorFilterDB.create_from_vault()

Benefits:

Single source of truth for credentials
Easy to swap credential providers
Clean separation of concerns

2. Generator Expressions for Early Exit

This subtle optimization saves significant CPU:

# ❌ BAD: Checks ALL patterns even after finding a match
matches = [pattern in error for pattern in patterns]
if any(matches):
    return True
# ✅ GOOD: Stops at first match
return any(pattern in error for pattern in patterns)

With 1000 patterns and matches often in the first 10, this saves ~99% of checks.
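
To make the difference concrete, here is a small self-contained sketch that counts how many substring checks each variant performs. The pattern strings are made up, and a list is used so the position of the match is deterministic (a set has no guaranteed iteration order).

```python
def count_checks(patterns, error, early_exit):
    """Return (matched, number_of_substring_checks_performed)."""
    checks = 0

    def check(pattern):
        nonlocal checks
        checks += 1
        return pattern in error

    if early_exit:
        matched = any(check(p) for p in patterns)    # generator: stops at first hit
    else:
        matched = any([check(p) for p in patterns])  # list: evaluates everything first
    return matched, checks


patterns = [f"pattern-{i}" for i in range(1000)]
patterns[9] = "connection timeout"  # the match sits at position 10
error = "upstream connection timeout after 30s"

print(count_checks(patterns, error, early_exit=False))  # (True, 1000)
print(count_checks(patterns, error, early_exit=True))   # (True, 10)
```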

3. Smart Alert Routing

Different error severities go to different channels:

def send_alert_to_slack(error):
    total_hits = error['total_hits']
    if total_hits >= 20:
        # Frequent errors go to the error channel (20 itself used to fall through)
        color = "danger"
        webhook = webhook_error_url
    elif total_hits >= 2:
        color = "warning"
        webhook = webhook_warn_url
    else:
        return  # Don't alert on single occurrences
    # message and title are built elsewhere (not shown)
    slack.send_alert_with_title(webhook, message, title, color)

This prevents alert fatigue while ensuring critical issues get immediate attention.

Lessons Learned: The Hard Way

1. Measure Before Optimizing

We initially thought database connection pooling would solve our problem. It didn’t. Only after measuring did we realize the sheer number of queries was the issue, not connection overhead.

Takeaway: Profile first, optimize second.

2. Cache Invalidation is Still Hard

Our initial implementation had no cache refresh mechanism. When patterns were updated, services needed manual restart. We solved this with:

# Option 1: Periodic refresh
last_refresh = time.time()  # set once, before the main loop

# ...then inside the main loop:
if time.time() - last_refresh > 300:  # 5 minutes
    error_filter.refresh_cache()
    last_refresh = time.time()

# Option 2: Signal-based refresh
# Send SIGHUP to process to trigger refresh
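
The signal-based option can be sketched as follows, assuming a Unix-like host and an object with a `refresh_cache()` method like the filter class above. The function names are illustrative. The handler only sets a flag, so the database reload runs in the main loop rather than inside the signal handler.

```python
import signal

refresh_requested = False


def request_refresh(signum, frame):
    """Signal handler: only set a flag; the main loop does the actual reload."""
    global refresh_requested
    refresh_requested = True


def install_sighup_handler():
    """Call once at startup from the main thread (SIGHUP is Unix-only)."""
    signal.signal(signal.SIGHUP, request_refresh)


def maybe_refresh(error_filter) -> bool:
    """Call once per loop iteration; reloads patterns if a SIGHUP arrived."""
    global refresh_requested
    if not refresh_requested:
        return False
    refresh_requested = False
    error_filter.refresh_cache()
    return True
```

With this in place, `kill -HUP <pid>` triggers a reload on the next loop iteration without restarting the service.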

3. Observability is Non-Negotiable

Strategic logging saved us countless debugging hours:

print(f"Loaded {len(self._error_patterns_cache)} patterns")
print(f"Ignoring error (matched filter): {error[:100]}...")
print(f"Processing new error: {error_id}")

In distributed systems, you can’t debug what you can’t see.

4. Design for Multiple Environments from Day One

Adding multi-environment support later would have been painful. Building it in from the start made testing and deployment trivial.

Common Pitfalls to Avoid

1. Memory Leaks with Unbounded Caches

# ❌ BAD: Cache grows forever
cache[error_id] = error_data
# ✅ GOOD: Use TTL or LRU
if time.time() - error['timestamp'] > 86400:  # 24 hours
    continue  # Don't re-process old errors
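
2. Race Conditions with Cache Updates

# ❌ BAD: Mutating the live set in place; readers can see an empty or half-filled cache
self._error_patterns_cache.clear()
for pattern in fetch_patterns():
    self._error_patterns_cache.add(pattern)

# ✅ GOOD: Build the new set first, then swap the reference
new_cache = {pattern for pattern in fetch_patterns()}
self._error_patterns_cache = new_cache  # Rebinding an attribute is atomic in CPython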

3. Ignoring Edge Cases

What happens when:

Database is down during startup?
Pattern table is empty?
Environment variable is missing?

Handle these explicitly or fail loudly.
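
One way to fail loudly is sketched below. The exception class and helper names are hypothetical, and treating an empty pattern table as an error is a policy choice rather than a requirement. Empty patterns are dropped because `"" in error` is always true in Python, so a single blank row would silently suppress every alert.

```python
import os


class FilterConfigError(RuntimeError):
    """Raised when the filter cannot start with a usable configuration."""


def require_env(name: str, environ=os.environ) -> str:
    """Return a required environment variable or raise with a clear message."""
    value = environ.get(name, "").strip()
    if not value:
        raise FilterConfigError(f"Required environment variable {name} is not set")
    return value


def validate_patterns(patterns) -> set:
    """Drop blank patterns and refuse to run with nothing to match against."""
    cleaned = {p.strip() for p in patterns if p and p.strip()}
    if not cleaned:
        raise FilterConfigError("Pattern table returned no active patterns")
    return cleaned
```

For a database that is down at startup, letting the connection error propagate and stop the process is usually safer than starting with an empty cache.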

Beyond Error Monitoring: Broader Applications

The patterns we used apply to many real-time processing scenarios:

1. API Rate Limiting

Cache user quotas in-memory, refresh periodically from database

2. Feature Flags

Load flag configurations once, check thousands of times

3. Access Control Lists

Cache permissions, avoid database hits on every request

4. Content Filtering

Spam detection, profanity filters, content moderation

The core principle: When read-to-write ratio is high, cache aggressively.

Future Enhancements

Where we’re headed next:

1. Distributed Caching

Use Redis for cross-service cache sharing:

# Share cache across multiple service instances
redis_client.setex('error_patterns', 300, json.dumps(patterns))

2. Pattern Analytics

Track which patterns match most frequently to optimize pattern order
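
A rough sketch of how that tracking could work, using a hit counter and an ordered list instead of a set. The names are illustrative and none of this is built yet.

```python
from collections import Counter

hit_counts = Counter()


def first_match(error, patterns):
    """Return the first matching pattern (and record the hit), or None."""
    for pattern in patterns:
        if pattern in error:
            hit_counts[pattern] += 1
            return pattern
    return None


def ordered_by_hits(patterns):
    """Most frequently matched patterns first; ties keep their original order."""
    return sorted(patterns, key=lambda p: -hit_counts[p])
```

Re-sorting periodically would move hot patterns to the front, so the early-exit check finds them sooner.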

3. Machine Learning Integration

Auto-detect new error patterns using clustering algorithms

4. Self-Healing Patterns

Automatically add patterns for recurring errors

Conclusion: Speed AND Flexibility

The journey from V1 to V3 taught us that you don’t have to choose between performance and maintainability. With thoughtful architecture:

Database-backed storage gives you flexibility
In-memory caching gives you speed
Smart refresh strategies keep data fresh
Robust error handling keeps systems reliable

Whether you’re building error monitoring, rate limiting, or any high-throughput filtering system, these patterns will serve you well.

The code is in production, handling thousands of errors per minute across multiple environments. It’s fast, it’s maintainable, and it just works.

Key Takeaways

✅ Cache when read:write ratio is high — Our 1000:1 ratio was perfect for caching
✅ Use the right data structure (a set deduplicates patterns; matching is still a linear scan over them)
✅ Generator expressions for early exit — Stop checking after first match
✅ Design for multiple environments — Same code, different configs
✅ Build observability in from day one — You can’t fix what you can’t see
✅ Measure before and after — Know your baseline, prove your improvement

The Hidden Deployment Bug That Brought Down Our UI: A Tale of Cache, Load Balancers, and Racing…

Nikhil Ninawe — Wed, 29 Apr 2026 07:28:16 GMT

The 3 AM Wake-Up Call


A MongoDB Primary Switch Took Down a “Healthy” Service: Lessons from Stale Connection Pools and…

Nikhil Ninawe — Fri, 24 Apr 2026 16:01:07 GMT

A MongoDB Primary Switch Took Down a “Healthy” Service: Lessons from Stale Connection Pools and 2‑Second Timeouts

TL;DR

A planned MongoDB primary switch, combined with stale connection pools and an aggressively low 2‑second timeout, caused a core microservice to start throwing MongoTimeoutException errors. The database cluster was healthy, but the application wasn’t. The incident is a great case study in how application behavior during topology changes can be your real reliability bottleneck.

What Happened (in 2 Minutes)

One weekday morning, a backend microservice (let’s call it orders-api) began throwing MongoTimeoutException errors when trying to query MongoDB. Users saw:

Intermittent failures when listing orders
Reporting endpoints returning HTTP 500
Overall degraded performance and timeouts

The trigger was not a new deploy of orders-api.

Earlier, the infrastructure team had performed a planned MongoDB primary node switch within a replica set, mainly for cost/placement and resilience reasons (cross‑AZ optimization, maintenance, etc.).

Hours after that change, orders-api was still holding stale connections and failing to establish fresh ones within a strict 2‑second connection timeout. Under normal conditions, this might have been barely tolerable; during a topology change, it became a full-blown incident.

Roughly:

Time to detect (MTTD): a few minutes
Time to recover (MTTR): under an hour, after rotating instances and scaling the service

Why This Hurt Users

This wasn’t a background batch job failing silently. This service sat directly on live user paths:

Order/Shipment listing pages started failing
Reporting and analytics that depended on Mongo queries were intermittently down
Clients saw HTTP 500s and increased latency on critical endpoints

Internally, the team also noticed some instances experiencing heavy garbage collection (GC) during the incident window, which made those instances even slower and more fragile.

Bottom line: a database topology change that should have been a non-event for users turned into a visible outage.

Root Cause: Not Just “Mongo Failover”

On paper, MongoDB replica sets are built to handle elections, failovers, and primary changes seamlessly. The cluster itself was fine.

The real root cause lived in the interaction between the application and the cluster.

1. Stale Connection Pools After Primary Switch

The application used MongoDB connection pooling. When the primary node changed, the app:

Did not restart
Continued using pools built against the old primary topology
Failed to refresh effectively and struggled to create new, healthy connections

Under load, calls to Mongo began to fail with MongoTimeoutException because getting a usable connection within the timeout window became increasingly rare.

2. Over‑Aggressive Timeouts (2 Seconds)

The service had a 2‑second connection/operation timeout.

That sounds “strict and snappy,” but in production, 2 seconds can be too low under:

Brief network hiccups
JVM pauses (GC)
Connection pool churn after topology changes
TLS handshakes and reconnections
Peak traffic periods

Once the driver/pool was in a bad state, that 2‑second limit ensured that even minor latency or connection hiccups surfaced immediately as hard failures.

In other words, the timeout configuration amplified the fragility instead of containing it.

3. Runtime Instability (GC Pauses)

Some instances showed signs of Full GC during the incident.

Even if Mongo had been perfectly healthy, Full GC events reduce available CPU for request handling, stretch latencies, and make connection acquisition more unpredictable. Combined with a 2‑second timeout, this turned a transient condition into a stream of failures.

How the Team Mitigated It

The short‑term fix was operational:

Rotate instances: replace or restart the orders-api instances so that they boot up with fresh connection pools targeting the correct Mongo primary.
Scale up temporarily: increase the instance count during rotation to maintain some capacity while instances went through warm‑up, GC, and connection pool building.
Watch dashboards: keep an eye on error rates, latencies, and health checks until the fleet stabilized.

This worked because new processes built clean pools against the new primary. The incident closed once the rotated fleet was stable and error rates returned to baseline.

The Real Lesson: Infra Changes Aren’t Done Until Apps Prove It

A database primary switch is often treated as an infrastructure task:

“The cluster is healthy, failover succeeded, we’re done.”

This incident shows that’s only half of the job.

You’re not really “done” until:

Applications have reconnected and stabilized,
Key user flows are working end‑to‑end, and
Error rates and latency haven’t regressed.

In practice, that suggests a mindset shift:

“Database topology change + application validation” is a single operation.

What to Fix Next Time

Here are the improvements that fall naturally out of this incident.

1. Revisit MongoDB Timeouts

A flat 2‑second timeout may feel tough and “fast,” but in production it can:

Turn transient conditions into visible user failures
Offer almost no room for connection pool recovery after events like elections or primary switches

A more resilient approach:

Use more generous timeouts (e.g., 10–30 seconds for connection establishment), especially during failovers.
Use separate values for:
Connection acquisition timeout
Socket read/write timeout
Overall operation timeout
Monitor these metrics so you can tune down from a place of observed safety, not guesswork.
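
As an illustration of separate timeout values, here is a sketch that builds a MongoDB connection string from standard URI options. The host names, database name, and the specific millisecond values are assumptions chosen for the example, not recommendations from the incident review. An overall per-operation budget is driver-specific and left out.

```python
from urllib.parse import urlencode


def build_mongo_uri(hosts, replica_set, database="orders"):
    """Build a replica-set-aware URI with a distinct timeout per concern."""
    options = {
        "replicaSet": replica_set,           # discover members, do not pin one host
        "connectTimeoutMS": 10_000,          # establishing a new TCP/TLS connection
        "socketTimeoutMS": 30_000,           # reads and writes on an open socket
        "serverSelectionTimeoutMS": 30_000,  # waiting for a usable primary
        "waitQueueTimeoutMS": 5_000,         # waiting for a pooled connection
    }
    return f"mongodb://{','.join(hosts)}/{database}?{urlencode(options)}"


print(build_mongo_uri(["mongo-a:27017", "mongo-b:27017", "mongo-c:27017"], "rs0"))
```

Listing several replica set members plus `replicaSet` lets the driver discover the new primary after a switch instead of depending on one host.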

2. Make Failover Runbooks Include Application Behavior

When planning a MongoDB primary switch (or any similar infra change), your runbook shouldn’t end at “cluster looks healthy.”

It should contain explicit steps to validate:

Can your key services (orders-api, billing-api, etc.) still talk to Mongo?
Do user journeys like “list orders,” “run report,” and “create shipment” still work?
Are error rates, latency, and saturation metrics stable after the switch?
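
The last check can be automated with a simple before/after comparison. This is a sketch under assumptions: the metric names and the 20% tolerance are made up, and fetching the numbers from your monitoring system is left out.

```python
def regressions(baseline: dict, current: dict, tolerance: float = 0.2) -> list:
    """Return names of metrics that are missing or worse than baseline by more than `tolerance`."""
    worse = []
    for name, before in baseline.items():
        after = current.get(name)
        if after is None or after > before * (1 + tolerance):
            worse.append(name)
    return sorted(worse)


baseline = {"error_rate": 0.01, "p95_latency_ms": 200}
after_switch = {"error_rate": 0.05, "p95_latency_ms": 210}
print(regressions(baseline, after_switch))  # ['error_rate']
```

A non-empty result is a signal to hold the change open, or start a targeted rolling restart, rather than declaring the switch done.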

If some services consistently struggle to recover:

Treat that as a known risk.
Add a temporary mitigation (e.g., targeted rolling restart) until you fix the underlying behavior.

3. Improve Application Resilience to Topology Changes

Instead of relying on restarts forever, fix the deeper issues:

Confirm the MongoDB driver is configured to be replica‑set aware, not pinned to a single host.
Ensure the connection pool is allowed to:
Detect topology changes
Drop bad connections
Rebuild pools without needing an app restart
Add instrumentation:
Pool size
In‑use vs available connections
Connection acquisition latency
Timeout counts

This turns “mysterious MongoTimeoutExceptions” into something you can see and debug.

4. Watch JVM Health Alongside DB Health

Because GC pauses worsened this incident, it’s worth:

Tracking GC pauses (especially Full GC), heap usage, and allocation rates
Correlating spikes in GC with spikes in:
Latency
Connection timeouts
Error rates

Sometimes your “database problem” is really a JVM health problem that just manifests at the database boundary.

Closing Thoughts

This incident wasn’t about a flaky database. MongoDB did exactly what it was designed to do: hold elections and promote a new primary.

The real failure was in how the application handled a totally expected topology change, combined with timeouts tuned more for ideal conditions than for real‑world chaos.

If you operate microservices on top of MongoDB (or any distributed database), it’s worth asking:

“When the topology changes, does my application bend… or break?”

Designing for the former is where true reliability lives.

Debugging Production Issues: A Journey Through Exception Replay Bug Fixes

Nikhil Ninawe — Fri, 24 Apr 2026 07:15:19 GMT

When Your Exception Handler Becomes the Exception


Building a Production-Ready MongoDB Query Executor: Handling UUID Representations and ArrayFilters…

Nikhil Ninawe — Sun, 19 Apr 2026 06:31:59 GMT

Introduction


The $7,000/Year MongoDB Optimization Nobody Talks About: Strategic Primary Placement

Nikhil Ninawe — Thu, 16 Apr 2026 08:11:43 GMT

How we accidentally discovered that WHERE your MongoDB primary lives matters as much as HOW you configure it


Securing Sensitive Data in Logstash: Hashing Authentication Tokens in Access Logs

Nikhil Ninawe — Fri, 10 Apr 2026 05:31:56 GMT

Introduction


How We Saved $8,000/Year by Adding One MongoDB Connection String Parameter

Nikhil Ninawe — Wed, 08 Apr 2026 05:12:42 GMT

A deep dive into optimizing AWS Inter-Availability Zone data transfer costs


1,163 CloudWatch Alarms in Non-Production: A DevOps Horror Story

Nikhil Ninawe — Sat, 04 Apr 2026 09:22:40 GMT

How We Reduced Alarm Fatigue, Cut Costs by 85%, and Actually Started Trusting Our Monitoring Again

TL;DR: Our staging and rehearsal environments had 1,163 CloudWatch alarms. 50 were actively firing (all false positives). We were paying about $115/month for them. This is how we cut the noise and fixed our broken monitoring culture.

The Alert That Changed Everything

It was 3 AM when my phone buzzed for the 47th time that week.

“ALARM: rehearsal-accounts-bravo-LowCPUUtilization”

I ignored it. So did everyone else on the team.

By morning, we had 12 more Slack notifications, 8 PagerDuty alerts, and 3 emails. All from our rehearsal environment — a testing environment that barely gets traffic.

This was the moment I realized: We had an alarm problem.

The Audit: Opening Pandora’s Box

I wrote a quick Python script to inventory all our CloudWatch alarms:

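import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-west-2')
# describe_alarms returns results in pages, so walk all of them
paginator = cloudwatch.get_paginator('describe_alarms')
alarms = [a for page in paginator.paginate() for a in page['MetricAlarms']]
print(f"Total alarms: {len(alarms)}")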

The result made me do a double-take:

Total alarms: 4,989

4,989 CloudWatch alarms. For context, we had 632 EC2 instances.

That’s 7.9 alarms per instance. We were paying AWS to send us thousands of notifications we’d learned to ignore.

Breaking Down the Madness

When I filtered for just non-production environments:

Environment | Alarms | In ALARM State | Cost/Month
Production | 3,156 | 127 (4.0%) | $314.60
Rehearsal | 690 | 29 (4.2%) | $68.00
Stage | 473 | 21 (4.4%) | $46.30
Dev | 670 | 18 (2.7%) | $66.00

Our non-production environments alone had 1,833 alarms, costing $180.30/month.
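
A breakdown like this can be produced by grouping alarms on their name prefix. The sketch below assumes every alarm name starts with its environment (as in `rehearsal-accounts-bravo-LowCPUUtilization`); the `production-` and `dev-` prefixes are assumptions. It works on the list of alarm dicts returned by `describe_alarms`.

```python
from collections import Counter

ENVIRONMENTS = ("production", "rehearsal", "stage", "dev")


def summarize(alarms):
    """Return (total alarms per environment, alarms in ALARM state per environment)."""
    totals, firing = Counter(), Counter()
    for alarm in alarms:
        env = alarm["AlarmName"].split("-", 1)[0]
        if env not in ENVIRONMENTS:
            env = "other"  # names that do not follow the prefix convention
        totals[env] += 1
        if alarm["StateValue"] == "ALARM":
            firing[env] += 1
    return totals, firing
```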

But here’s the kicker: 68 alarms were actively firing in non-prod. When I investigated:

❌ Zero actual incidents
❌ Zero actionable alerts
❌ 100% noise

The Patterns of Failure

Pattern #1: The “Low CPU Utilization” Epidemic

35 alarms across rehearsal and stage were firing for “Low CPU Utilization.”

rehearsal-accounts-bravo-LowCPUUtilization: ALARM
rehearsal-api-bravo-LowCPUUtilization: ALARM
rehearsal-config-alpha-LowCPUUtilization: ALARM
... (32 more)

The problem: We’d set the threshold at <10% CPU for 5 minutes.

The reality: Rehearsal environments get maybe 10 requests per hour. Of course the CPU is idle!

The cost:

35 alarms × $0.10/month = $3.50/month
Wasted engineering time investigating: ~2 hours/month
Opportunity cost: Priceless (we stopped trusting our monitoring)

Pattern #2: The Copy-Paste Syndrome

Every service had exactly 9 alarms:

{service}-HighCPUUtilization
{service}-LowCPUUtilization
{service}-HighMemoryUtilization
{service}-LowMemoryUtilization
{service}-HighDiskSpace
{service}-HTTP4xxCount
{service}-HTTP5xxCount
{service}-UnhealthyHostCount
{service}-StatusCheckFailed

77 services × 9 alarms = 693 alarms

This was clearly from a Terraform module that someone wrote once and we copy-pasted forever.

The problems:

Not all services need all alarms (a database doesn’t have HTTP 4xx codes)
Thresholds weren’t adjusted per service (calendar service ≠ auth service)
No differentiation between prod and non-prod

Pattern #3: The Zombie Alarms

35 alarms in INSUFFICIENT_DATA state — meaning the metrics they’re watching don’t exist anymore.

stage-mongo-07-HighDiskSpace: INSUFFICIENT_DATA
rehearsal-old-api-HighMemory: INSUFFICIENT_DATA

What happened:

Instances terminated 6 months ago
Services renamed or decommissioned
Metrics stopped being published

What we did:

Keep paying $0.10/month per alarm
Ignore the noise
Assume “INSUFFICIENT_DATA” is normal

What we should have done: Delete them.

Pattern #4: The Disk Space Time Bombs

3 MongoDB instances in stage had real disk space issues:

stage-mongo-04: 92% disk usage
stage-mongo-05: 94% disk usage
stage-mongo-06: 91% disk usage

But guess what? Nobody noticed because we’d trained ourselves to ignore all stage alarms.

This is the real cost of alarm fatigue: When everything is on fire, nothing is on fire.

The True Cost of Alarm Sprawl

Direct Costs:

1,163 non-prod alarms
- 10 free tier alarms
= 1,153 billable alarms

1,153 × $0.10/month = $115.30/month
Annual cost: $1,383.60
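The arithmetic above is easy to turn into a helper you can rerun as the alarm count changes. A small sketch, assuming the same $0.10 per standard alarm and 10 free alarms used in this post; the function name is made up for illustration.

```python
def monthly_alarm_cost(alarm_count, free_alarms=10, price_per_alarm=0.10):
    """Monthly cost in dollars for standard-resolution metric alarms."""
    billable = max(0, alarm_count - free_alarms)
    return round(billable * price_per_alarm, 2)

monthly = monthly_alarm_cost(1163)
print(f"Monthly: ${monthly:.2f}, annual: ${monthly * 12:.2f}")
```

Check your own bill before trusting the defaults: pricing varies by alarm type, so treat both parameters as inputs rather than constants.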

Indirect Costs (Much Worse):

1. Team Burnout:

Average 15 false alerts per day
2–3 hours/week investigating noise
On-call engineers ignoring pages

2. Missed Real Issues:

3 MongoDB instances at 90%+ disk (critical!)
2 unhealthy instances (degraded performance)
Actual production incidents buried in noise

3. Tool Mistrust:

“Just ignore stage alerts”
“PagerDuty notifications? Probably nothing”
CloudWatch became a joke

You can’t put a price on broken trust in your monitoring.

The Cleanup Plan

Phase 1: Stop the Bleeding (Week 1)

1. Delete Zombie Alarms

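from datetime import datetime, timedelta, timezone
import boto3
cw = boto3.client('cloudwatch')
cutoff = datetime.now(timezone.utc) - timedelta(days=7)
def is_zombie(alarm):
    # One possible check: the alarm has sat in INSUFFICIENT_DATA for >7 days
    return alarm['StateUpdatedTimestamp'] < cutoff
# describe_alarms is paginated, so walk every page
paginator = cw.get_paginator('describe_alarms')
for page in paginator.paginate(StateValue='INSUFFICIENT_DATA'):
    for alarm in page['MetricAlarms']:
        # Verify it's actually a zombie before deleting
        if is_zombie(alarm):
            print(f"Deleting: {alarm['AlarmName']}")
            cw.delete_alarms(AlarmNames=[alarm['AlarmName']])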

Result:

Deleted 35 zombie alarms
Savings: $3.50/month ($42/year)
Time: 30 minutes

2. Fix the Obvious Issues

Those MongoDB servers? Yeah, we actually fixed them:

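# Clear old MongoDB logs (run the cleanup on the remote host, not locally)
ssh stage-inventory-mongo-04 'find /var/log/mongodb -name "*.gz" -mtime +30 -delete'
# Increase volume size (the filesystem still has to be grown on the host afterwards)
aws ec2 modify-volume --volume-id vol-xxx --size 200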

Result:

3 critical issues resolved
Alarms stopped firing
Time: 1 hour

3. Adjust Non-Prod Thresholds

For non-production environments, we changed:

# Before
LowCPUThreshold: 10%   # Too sensitive!
HighCPUThreshold: 80%
EvaluationPeriods: 1   # Too quick!

# After (Non-Prod)
LowCPUThreshold: 5%    # More realistic
HighCPUThreshold: 90%  # Higher tolerance
EvaluationPeriods: 3   # 15 mins instead of 5

Or better yet: Disabled “Low CPU” alarms entirely for non-prod.

Rationale: Non-prod environments are supposed to be idle most of the time. That’s not a problem; it’s expected.
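If you go the "remove them entirely" route, the cleanup can be sketched as two small steps: pick the Low CPU alarms that live in non-prod, then delete them in batches. This assumes the `rehearsal-`/`stage-` prefixes and the `-LowCPUUtilization` suffix seen earlier; the helper names are hypothetical, and `cw` is expected to be a `boto3.client('cloudwatch')`.

```python
NON_PROD_PREFIXES = ("rehearsal-", "stage-")  # assumed naming convention

def low_cpu_alarms_in_non_prod(alarm_names):
    """Pick the Low CPU alarms that belong to non-production environments."""
    return [
        name for name in alarm_names
        if name.startswith(NON_PROD_PREFIXES)
        and name.endswith("-LowCPUUtilization")
    ]

def delete_in_batches(cw, alarm_names, batch_size=100):
    """Delete alarms in chunks so no single delete_alarms call gets too large."""
    for start in range(0, len(alarm_names), batch_size):
        cw.delete_alarms(AlarmNames=alarm_names[start:start + batch_size])
```

Print the selected names and eyeball them before calling `delete_in_batches`; a prefix match is only as good as your naming discipline.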

Result:

35 noisy alarms silenced
Savings: $3.50/month
Time: 2 hours (Terraform updates)

Phase 2: Consolidate with Composite Alarms (Week 2–3)

This is where the magic happened.

Old approach: 9 alarms per service
New approach: 1 composite alarm per service

import boto3
cw = boto3.client('cloudwatch')
# Create composite alarm
cw.put_composite_alarm(
    AlarmName='rehearsal-accounts-service-health',
    AlarmRule='ALARM(rehearsal-accounts-cpu-high) OR '
              'ALARM(rehearsal-accounts-memory-high) OR '
              'ALARM(rehearsal-accounts-unhealthy) OR '
              'ALARM(rehearsal-accounts-5xx-high)',
    AlarmActions=['arn:aws:sns:us-west-2:xxx:ops-alerts'],
    AlarmDescription='Composite health check for accounts service'
)
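# Delete only the individual alarms the rule above does NOT reference.
# The child alarms named in AlarmRule have to stay: the composite alarm
# evaluates their state, so remove their notification actions instead.
redundant_alarms = [
    'rehearsal-accounts-cpu-low',
    # ... the other alarms the rule does not reference
]
cw.delete_alarms(AlarmNames=redundant_alarms)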

Benefits:

✅ One alert instead of 9
✅ Only fires if something is ACTUALLY wrong
✅ Cleaner alert messages
✅ 88% cost reduction per service

Scaling this across 77 services:

Before: 77 services × 9 alarms = 693 alarms
After:  77 services × 1 alarm  = 77 alarms
Reduction: 616 alarms
Savings: 616 × $0.10 = $61.60/month ($739/year)
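Writing 77 rule strings by hand is how typos sneak in, so it helps to generate them. Here is a rough sketch that builds the `AlarmRule` expression from a list of child alarm names; it assumes the `<env>-<service>-<check>` naming from the example above, and both function names are illustrative.

```python
def build_alarm_rule(child_alarm_names, operator="OR"):
    """Join child alarms into a composite AlarmRule expression."""
    if operator not in ("OR", "AND"):
        raise ValueError("operator must be 'OR' or 'AND'")
    if not child_alarm_names:
        raise ValueError("a composite alarm needs at least one child alarm")
    return f" {operator} ".join(f'ALARM("{name}")' for name in child_alarm_names)

def service_rule(environment, service,
                 checks=("cpu-high", "memory-high", "unhealthy", "5xx-high")):
    """Build the rule for one service from its assumed child alarm names."""
    return build_alarm_rule([f"{environment}-{service}-{check}" for check in checks])

print(service_rule("rehearsal", "accounts"))
```

The returned string can be passed as `AlarmRule` to `put_composite_alarm`; looping over your service list then gives one composite alarm per service.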

Phase 3: Right-Size the Infrastructure (Week 3–4)

Those 35 “Low CPU” alarms? They were telling us something:

We were over-provisioned.

Current: t3.large (2 vCPU, 8GB RAM)
Usage:   5-10% CPU, 2GB RAM
Right-size: t3.medium (2 vCPU, 4GB RAM)
Savings: ~$15/instance/month

For 35 instances:

EC2 Savings: $525/month ($6,300/year)
Alarm Reduction: 35 alarms (now they run at 10–20% CPU)
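The right-sizing decision itself can be written down as a rule: take observed memory, add headroom, and pick the smallest instance that still fits. This is a sketch under assumptions, not our actual process: the 1.5× headroom factor and the function name are made up, and it only looks at memory within the t3 family.

```python
# Memory per t3 size in GiB
T3_MEMORY_GIB = {"t3.micro": 1, "t3.small": 2, "t3.medium": 4, "t3.large": 8}

def right_size(observed_memory_gib, headroom=1.5):
    """Return the smallest t3 size whose memory covers usage plus headroom."""
    needed = observed_memory_gib * headroom
    for name, memory in sorted(T3_MEMORY_GIB.items(), key=lambda item: item[1]):
        if memory >= needed:
            return name
    return None  # nothing in this table is large enough

instances = 35
saving_per_instance = 15  # dollars/month, the estimate quoted above
print(right_size(2.0), f"${instances * saving_per_instance}/month")
```

With 2 GB of observed usage this lands on t3.medium, the same choice we made by hand.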

This is the real win: The alarms weren’t just noise — they were telling us we were wasting money on compute.

The Final Results

Cost Impact:

CloudWatch Alarms:
  Before: $115.30/month
  After:  $16.00/month
  Savings: $99.30/month ($1,192/year)
Right-Sizing EC2 (bonus):
  35 instances: t3.large → t3.medium
  Savings: $525/month ($6,300/year)
TOTAL: $7,492/year

Operational Impact (Priceless):

✅ On-call engineers sleep better

Before: 15 false alerts/day
After: 0–1 false alerts/week

✅ We trust our monitoring again

When an alarm fires, we investigate
Actually caught 2 real issues in first month

✅ Faster incident response

Before: “Which of these 8 alerts is real?”
After: “One alert = one problem = one action”

✅ Better use of engineering time

Before: 2–3 hours/week on false alerts
After: 0.5 hours/week on real incidents
Reclaimed: ~100 hours/year per engineer

Lessons Learned

1. Non-Prod ≠ Prod (Obviously?)

We applied the same alarm strategy to all environments. That’s like using the same security for your house and your shed.

Better approach:

Prod: Aggressive monitoring, low thresholds, page immediately
Stage: Moderate monitoring, email/Slack only
Dev: Minimal monitoring, maybe just health checks

2. Composite Alarms Are a Game-Changer

Creating 9 alarms per service is lazy automation. It’s the monitoring equivalent of:

try:
    # entire application
except Exception as e:
    print("Something went wrong!")

Composite alarms let you say: “Alert me if CPU is high AND memory is high AND we’re getting 5xx errors.”

Not: “Alert me if CPU is 11% for 6 minutes on a Sunday at 3 AM.”
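The difference between the two styles comes down to how child states are combined. The toy model below treats a composite rule as a plain AND or OR over "is this child in ALARM?"; it is a simplification for illustration (real rules can also nest and use NOT, OK, and INSUFFICIENT_DATA), and the function name is hypothetical.

```python
def composite_state(child_states, operator):
    """Return True if a flat AND/OR rule over ALARM(...) terms would fire."""
    firing = [state == "ALARM" for state in child_states.values()]
    return all(firing) if operator == "AND" else any(firing)

# One noisy child, everything else healthy
states = {"cpu-high": "ALARM", "memory-high": "OK", "5xx-high": "OK"}
print("OR rule fires: ", composite_state(states, "OR"))
print("AND rule fires:", composite_state(states, "AND"))
```

An OR rule still pages on a single noisy child, while an AND rule only pages when the symptoms line up, which is usually what "something is actually wrong" looks like.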

3. Alarm Fatigue Kills

The MongoDB disk space issues were REAL problems that could’ve caused outages. But we missed them because they were buried in 50 false positives.

This is the hidden cost of bad monitoring: When everything’s an emergency, nothing’s an emergency.

4. Alarms Should Drive Action

We had alarms like “Low CPU Utilization” that had no runbook, no action, no owner.

Ask yourself: “If this alarm fires at 2 AM, what should the on-call engineer do?”

If the answer is “nothing” or “I don’t know,” delete the alarm.
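A cheap first pass at this question is to list alarms that cannot notify anyone at all. The sketch below filters the alarm dictionaries returned by `describe_alarms` using their `ActionsEnabled` and `AlarmActions` fields; the function name and sample data are illustrative.

```python
def alarms_without_action(alarms):
    """Return names of alarms that cannot notify anyone when they fire."""
    return [
        alarm["AlarmName"]
        for alarm in alarms
        if not alarm.get("ActionsEnabled", True) or not alarm.get("AlarmActions")
    ]

# Hypothetical sample shaped like describe_alarms()['MetricAlarms']
sample = [
    {"AlarmName": "accounts-high-cpu", "ActionsEnabled": True,
     "AlarmActions": ["arn:aws:sns:us-west-2:xxx:ops-alerts"]},
    {"AlarmName": "accounts-low-cpu", "ActionsEnabled": True, "AlarmActions": []},
    {"AlarmName": "accounts-low-memory", "ActionsEnabled": False,
     "AlarmActions": ["arn:aws:sns:us-west-2:xxx:ops-alerts"]},
]
print(alarms_without_action(sample))
```

An alarm with an action can still be useless, so this only finds the obvious cases; the runbook question still needs a human.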

5. Infrastructure as Code Can Go Wrong

Our Terraform module for “standard service monitoring” was copy-pasted 77 times. Each service got:

High CPU alarm ✅ (useful)
Low CPU alarm ❌ (useless noise)
HTTP 4xx alarm ✅ (useful for APIs)
HTTP 4xx alarm ❌ (useless for databases)

Better approach:

# Terraform module with environment-aware thresholds
module "monitoring" {
  source = "./modules/service-monitoring"
  
  service_name = "accounts"
  environment  = "production"  # or "staging"
  service_type = "api"         # or "database", "worker"
  
  # Module adjusts alarms based on these inputs
}
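To make "module adjusts alarms based on these inputs" concrete, here is the selection logic sketched in Python rather than Terraform. Everything in it is an assumption for illustration: the alarm sets per service type, the evaluation periods per environment, and the function name are not taken from our real module.

```python
# Hypothetical alarm sets per service type
ALARMS_BY_SERVICE_TYPE = {
    "api": ["HighCPUUtilization", "HighMemoryUtilization",
            "HTTP5xxCount", "UnhealthyHostCount"],
    "database": ["HighCPUUtilization", "HighMemoryUtilization", "HighDiskSpace"],
    "worker": ["HighCPUUtilization", "HighMemoryUtilization"],
}
# Hypothetical sensitivity per environment (more periods = slower to fire)
EVALUATION_PERIODS = {"production": 1, "staging": 3}

def alarms_for(service_name, environment, service_type):
    """List the alarms one service should get, with environment-aware settings."""
    periods = EVALUATION_PERIODS[environment]
    return [
        {"name": f"{environment}-{service_name}-{kind}", "evaluation_periods": periods}
        for kind in ALARMS_BY_SERVICE_TYPE[service_type]
    ]

for alarm in alarms_for("accounts", "production", "api"):
    print(alarm)
```

The point is that a database never receives an HTTP alarm and staging never pages as eagerly as production, because the inputs decide the output instead of a fixed list of nine.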

Your Action Plan: Fix This in One Sprint

Week 1: Audit (2 hours)

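#!/usr/bin/env python3
from collections import defaultdict
from datetime import datetime, timezone
import boto3
cw = boto3.client('cloudwatch')
def days_since_state_change(alarm):
    # StateUpdatedTimestamp is the timezone-aware time of the last state change
    return (datetime.now(timezone.utc) - alarm['StateUpdatedTimestamp']).days
def is_constantly_firing(alarm, min_days=7):
    # One possible heuristic: stuck in ALARM for a week means nobody acts on it
    return days_since_state_change(alarm) >= min_days
# Get all metric alarms (describe_alarms is paginated, so walk every page)
alarms = []
for page in cw.get_paginator('describe_alarms').paginate():
    alarms.extend(page['MetricAlarms'])
# Group by state
by_state = defaultdict(list)
for alarm in alarms:
    by_state[alarm['StateValue']].append(alarm)
print(f"Total alarms: {len(alarms)}")
print(f"  OK: {len(by_state['OK'])}")
print(f"  ALARM: {len(by_state['ALARM'])}")
print(f"  INSUFFICIENT_DATA: {len(by_state['INSUFFICIENT_DATA'])}")
# Find zombies (INSUFFICIENT_DATA for >7 days)
zombies = [a for a in by_state['INSUFFICIENT_DATA']
           if days_since_state_change(a) > 7]
print(f"\n🧟 Zombie alarms to delete: {len(zombies)}")
# Find alarm spam (firing constantly)
spam = [a for a in by_state['ALARM']
        if is_constantly_firing(a)]
print(f"📢 Noisy alarms to fix: {len(spam)}")
# Calculate cost (first 10 alarms are free, then $0.10 per alarm per month)
billable = max(0, len(alarms) - 10)
cost = billable * 0.10
print(f"\n💰 Monthly cost: ${cost:.2f}")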

Week 2: Quick Wins (4 hours)

Delete zombies → Instant $3–10/month savings
Fix obvious issues → Stop real alarms from firing
Adjust non-prod thresholds → Reduce noise by 50%

Week 3–4: Consolidate (8–16 hours)

Create composite alarms for top 20 services
Delete redundant individual alarms
Document the new alarm strategy

Expected Results:

Alarm reduction: 60–85%
Cost savings: $50–150/month
Time savings: 2–4 hours/week (no more false alert investigations)
Better monitoring: Actually trust your alarms again

Common Objections

“But we need low CPU alarms to catch cost waste!”

Counter: Use AWS Cost Anomaly Detection or Trusted Advisor instead. Don’t wake someone up at 3 AM because a dev server is idle.

Better approach: Weekly cost reports, monthly right-sizing reviews.

“What if we delete an alarm and then need it?”

Reality check: When’s the last time you actually acted on that alarm?

If it’s been firing for 6 months and nobody’s fixed it, it’s not important.

“Composite alarms are too complex!”

They’re actually simpler:

# Complex (9 alarms)
accounts-high-cpu
accounts-low-cpu
accounts-high-memory
accounts-low-memory
...

# Simple (1 alarm)
accounts-service-health

One alarm. One action. One page.

“This will take too long!”

Our timeline:

Week 1: 2 hours (audit)
Week 2: 4 hours (quick wins)
Week 3–4: 12 hours (consolidation)
Total: 18 hours

ROI:

Time saved: 2–4 hours/week = 100+ hours/year
Cost saved: $1,192–7,492/year
Sleep quality: Priceless

The Bottom Line

We had 1,163 CloudWatch alarms in non-production environments. They cost us:

💰 $115/month in direct costs
⏰ 2–3 hours/week in false alert investigations
🎯 Complete loss of trust in our monitoring
😴 Burned-out on-call engineers

After one sprint of focused cleanup:

✅ 85% reduction in alarms (1,163 → 170)
✅ Savings: $99/month ($1,192/year)
✅ 96% reduction in false alerts (50 → 2)
✅ Monitoring we actually trust

The real win wasn’t the money. It was getting our monitoring back.

Now when an alarm fires, the team investigates. Because we know it’s real.

What’s Next for You?

Take 2 hours this week and run the audit script. I guarantee you’ll find:

❌ Zombie alarms (INSUFFICIENT_DATA)
❌ Noisy alarms (constantly firing in non-prod)
❌ Copy-pasted alarms (every service has identical thresholds)
❌ Money being wasted ($50–200/month minimum)

Start small:

Day 1: Run the audit
Day 2: Delete 10 zombie alarms
Day 3: Fix one noisy alarm
Week 2: Create your first composite alarm

Or go big:

Dedicate one sprint to alarm cleanup
Target 50%+ reduction
Save thousands per year
Sleep better

Either way, stop paying AWS to send you noise.

Have you dealt with alarm fatigue in your infrastructure? What strategies worked for you? Drop a comment — I’d love to hear your war stories.

If this helped, share it with your DevOps/SRE team. Every engineering org has an alarm problem; most just haven’t measured it yet.