<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Nikhil Ninawe on Medium]]></title>
        <description><![CDATA[Stories by Nikhil Ninawe on Medium]]></description>
        <link>https://medium.com/@nikhil.ninawe?source=rss-b4b8a657c2b------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*e3SjVyFFju2tPo3JSOL7oQ.jpeg</url>
            <title>Stories by Nikhil Ninawe on Medium</title>
            <link>https://medium.com/@nikhil.ninawe?source=rss-b4b8a657c2b------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 24 May 2026 02:00:46 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@nikhil.ninawe/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Automating the Hotfix Pipeline: How We Empowered Engineering Managers and Cut Release Bottlenecks]]></title>
            <link>https://medium.com/@nikhil.ninawe/automating-the-hotfix-pipeline-how-we-empowered-engineering-managers-and-cut-release-bottlenecks-4b648c9ad5ee?source=rss-b4b8a657c2b------2</link>
            <guid isPermaLink="false">https://medium.com/p/4b648c9ad5ee</guid>
            <category><![CDATA[release-engineering]]></category>
            <category><![CDATA[sre]]></category>
            <category><![CDATA[automation]]></category>
            <category><![CDATA[jenkins]]></category>
            <category><![CDATA[devops]]></category>
            <dc:creator><![CDATA[Nikhil Ninawe]]></dc:creator>
            <pubDate>Sun, 10 May 2026 05:28:03 GMT</pubDate>
            <atom:updated>2026-05-10T05:28:30.418Z</atom:updated>
            <content:encoded><![CDATA[<p><em>A practical guide to reducing release team dependency, one Jenkins job at a time.</em></p><p>Every SaaS company that ships frequently eventually hits the same wall: the hotfix process becomes a bottleneck. In our org, we reached a point where every critical fix — no matter how small — required the release engineering team to manually create Jira versions, spin up Slack channels, create branches, trigger builds, and orchestrate deployments. Our engineers could write a one-line fix in 10 minutes, but getting it to production took hours of coordination.</p><p>Last week, I set out to change that. Here’s how we automated our hotfix pipeline end-to-end and gave engineering managers the power to ship critical fixes without waiting on the release team.</p><h3>The Problem: Too Many Handoffs, Too Little Autonomy</h3><p>Our hotfix process had seven discrete manual steps, most of which required the release team to act as a human router. A typical flow looked like this:</p><ol><li>Release team creates a new hotfix version in Jira</li><li>Release team creates a Slack channel, adds the right people, posts a kickoff message</li><li>Release team creates the hotfix branch in Bitbucket</li><li>Developer opens a PR against the hotfix branch</li><li>Release team reviews and merges the PR</li><li>Release team triggers the build manually</li><li>Release team deploys to the smoke environment and coordinates QA verification</li></ol><p>The engineering managers — the people with the most context on what needed to be fixed — were spectators in their own hotfix process. Every step required a ping to the release channel, a wait for someone to pick it up, and a confirmation loop. 
During off-hours or weekends, a simple bug fix could sit for hours waiting on a human in the loop.</p><h3>The Solution: Automate the Boring Parts, Delegate the Rest</h3><p>We broke the problem into three layers: <strong>setup automation</strong>, <strong>merge access delegation</strong>, and <strong>build + deploy automation</strong>.</p><h3>Layer 1: One-Click Hotfix Setup</h3><p>The first bottleneck was the administrative overhead of starting a hotfix. Creating the Jira version, the Slack channel, adding team members, and posting the kickoff message — all of this was manual and error-prone.</p><p>We consolidated all of these into a single parameterised Jenkins job. An operator provides the release version and hotfix number, and the job handles everything:</p><ul><li>Creates the fix version in Jira using the Jira REST API</li><li>Creates a dedicated Slack channel with a standardized naming convention</li><li>Adds the relevant team members automatically</li><li>Posts a templated kickoff message with all the context a developer needs to get started</li></ul><p>To make the fix version creation even more resilient, I wrote a Python script that programmatically creates and tags Jira fix versions.</p><h3>Layer 2: Giving Managers the Keys</h3><p>This was a cultural shift as much as a technical one. Historically, only the release team had merge permissions on hotfix branches in Bitbucket. This made sense as a guardrail early on, but it had become a bottleneck.</p><p>We raised a formal request to provide merge access to engineering managers across all repositories. The principle was simple: managers already approve the code in review — they should be able to click the merge button too.</p><p>To support this, we also built a self-service Jenkins job that lets managers create hotfix branches themselves. No more waiting for the release team to cut a branch. 
A manager sees a critical bug, creates the branch, and tells their developer to push a fix.</p><h3>Layer 3: Automated Build and Deploy</h3><p>With managers now able to merge, we needed the downstream pipeline to be fully automated. Here’s what we built:</p><p><strong>Bitbucket Webhooks → Jenkins Builds:</strong> Every merge to a hotfix branch triggers an automated build via a Bitbucket webhook. No one needs to click “Build Now.” The moment code lands on the hotfix branch, the pipeline kicks in.</p><p><strong>Automatic Deployment to Smoke:</strong> After a successful build, the artifact is automatically deployed to our smoke-bravo environment. QA can begin validating within minutes of a merge, not hours.</p><h3>What’s Still on the Roadmap</h3><p>The pipeline isn’t fully autonomous yet. Here’s what’s next:</p><ol><li><strong>Automated Smoke Stack Creation:</strong> Right now, the smoke environment needs to exist before a deployment. We want the pipeline to automatically provision a smoke stack if one isn’t already running.</li><li><strong>Automated Regression Suite Triggers:</strong> Post-deployment, regression tests should be triggered automatically. We’re evaluating whether this should be a Jenkins downstream job or a webhook-triggered test harness.</li><li><strong>Smart Verification Tagging:</strong> The biggest remaining manual step is QA verification. Our plan is to introduce a smoke-verified tag on Jira tickets. Once QA validates a fix, they tag the ticket and post a smoke-verification-complete message. The system will then automatically check that all tickets in the hotfix are tagged and all verification messages are posted. 
Only when both conditions are met will it push an automated &quot;all green&quot; message to the hotfix Slack channel, signalling readiness for production.</li></ol><p>This last piece is particularly exciting because it replaces the most anxiety-inducing part of the process: the manual “did everyone verify their fix?” follow-up loop.</p><h3>Lessons Learned</h3><p><strong>Start with the highest-friction handoff.</strong> We didn’t try to automate everything at once. We started with the merge permission change because it was the single most frequent bottleneck. Everything else cascaded from there.</p><p><strong>Trust, then verify.</strong> Giving managers merge access required trust. But we didn’t remove guardrails — branch protection rules, required reviewers, and automated builds all still apply. We just removed the “wait for the release team to click a button” step.</p><p><strong>Automate the boring parts first.</strong> The Jenkins job that creates a Jira version and a Slack channel saves maybe 10 minutes per hotfix. But it happens dozens of times per release cycle, and eliminating it removed an entire category of “I’m blocked waiting for someone.”</p><h3>The Numbers</h3><p>Before this work, a typical hotfix cycle — from “bug identified” to “deployed to smoke” — required 4–5 handoffs to the release team and could take anywhere from 2 to 8 hours depending on availability.</p><p>With the new pipeline, once a manager creates the branch and a developer pushes the fix, the path from merge to smoke deployment is <strong>fully automated and takes under 15 minutes</strong>. The release team is no longer in the critical path for hotfixes.</p><h3>Wrapping Up</h3><p>Release engineering is often invisible work. When it’s done well, nobody notices — deployments just happen. When it’s done poorly, everyone notices — and they notice loudly, at 2 AM, in an incident channel.</p><p>The work we did last week wasn’t about building something flashy. 
It was about removing friction, one Jenkins job at a time. It was about recognising that the people closest to the code should be empowered to ship fixes without waiting in a queue. And it was about turning a seven-step manual process into a pipeline where humans only need to do what humans are good at: writing the fix and verifying it works.</p><p>If your release process still has a human acting as a router between “code is ready” and “code is in the environment,” consider this your sign to automate it.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Database Chaos to In-Memory Speed: Optimizing Error Monitoring at Scale]]></title>
            <link>https://medium.com/@nikhil.ninawe/from-database-chaos-to-in-memory-speed-optimizing-error-monitoring-at-scale-8226aac19510?source=rss-b4b8a657c2b------2</link>
            <guid isPermaLink="false">https://medium.com/p/8226aac19510</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[observability]]></category>
            <category><![CDATA[system-design-concepts]]></category>
            <category><![CDATA[sre]]></category>
            <category><![CDATA[devops]]></category>
            <dc:creator><![CDATA[Nikhil Ninawe]]></dc:creator>
            <pubDate>Mon, 04 May 2026 01:33:26 GMT</pubDate>
            <atom:updated>2026-05-04T01:34:09.928Z</atom:updated>
            <content:encoded><![CDATA[<h3>How we reduced error alert processing time by 90% while maintaining flexibility</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qk34DhRovo__F0tf" /></figure><h3>The Problem: Death by a Thousand Database Queries</h3><p>Picture this: You’re an SRE team managing a complex microservices architecture across multiple environments. Your error monitoring system is generating thousands of alerts daily, but most of them are noise — known issues, third-party library warnings, or infrastructure hiccups you’ve already cataloged.</p><p>You need to filter these errors, but every single error check requires a database query to see if it matches a known pattern. Your monitoring system is drowning, processing is slow, and legitimate alerts are getting buried in the noise.</p><p>This was our reality. And this is the story of how we fixed it.</p><h3>The Evolution: Three Iterations to Success</h3><h3>V1: The Naive Approach 🐌</h3><p>Our first implementation was straightforward but painful:</p><pre># Check EVERY error against EVERY pattern in the database<br>for error in new_errors:<br>    for pattern in get_patterns_from_db():  # Database query!<br>        if pattern in error:<br>            ignore_error()<br>            break</pre><p><strong>The Problem:</strong></p><ul><li>1000 errors × 200 patterns = 200,000 database queries per run</li><li>Processing time: 5–10 minutes</li><li>Database load: Unsustainable</li><li>Alert latency: Unacceptable</li></ul><h3>V2: The Hard-Coded Solution 🏃</h3><p>Desperate for speed, we hard-coded patterns:</p><pre>IGNORED_PATTERNS = [<br>    &quot;connection timeout&quot;,<br>    &quot;third-party API error&quot;,<br>    # ... 
200 more patterns<br>]<br><br>for error in new_errors:<br>    if any(pattern in error for pattern in IGNORED_PATTERNS):<br>        continue</pre><p><strong>The Result:</strong></p><ul><li>Processing time: &lt; 10 seconds ⚡</li><li>Database load: Zero</li><li>But… updating patterns required code deployment 😱</li></ul><h3>V3: The Sweet Spot 🎯</h3><p>What if we could have both speed AND flexibility? Enter: database-backed in-memory caching.</p><h3>The Architecture: Best of Both Worlds</h3><p>Here’s the key insight: <strong>patterns change infrequently, but we check them constantly</strong>.</p><h3>The Solution Design</h3><pre>from typing import Set<br><br>class ErrorFilterDB:<br>    &quot;&quot;&quot;<br>    Database-backed error filtering with in-memory caching.<br>    Load once, use thousands of times.<br>    &quot;&quot;&quot;<br>    def __init__(self, host: str, username: str, password: str):<br>        self.host = host<br>        self.username = username<br>        self.password = password<br>        # In-memory caches<br>        self._error_patterns_cache: Set[str] = set()<br>        # Load patterns once at startup<br>        self.refresh_cache()<br><br>    def refresh_cache(self):<br>        &quot;&quot;&quot;Load all patterns from database into memory.&quot;&quot;&quot;<br>        # get_connection() is not shown here; it is assumed to return a<br>        # DB-API connection whose cursor yields dict-style rows<br>        conn = self.get_connection()<br>        try:<br>            with conn.cursor() as cursor:<br>                query = &quot;&quot;&quot;<br>                    SELECT error_pattern<br>                    FROM ignored_errors<br>                    WHERE active = 1<br>                &quot;&quot;&quot;<br>                cursor.execute(query)<br>                results = cursor.fetchall()<br>                # Store in a set: deduplicates patterns and keeps them in memory<br>                self._error_patterns_cache = {<br>                    row[&#39;error_pattern&#39;] for row in results<br>                }<br>                print(f&quot;Loaded {len(self._error_patterns_cache)} patterns&quot;)<br>        finally:<br>            conn.close()<br><br>    def should_ignore_error(self, error_string: str) -&gt; bool:<br>        &quot;&quot;&quot;<br>        Check if error matches any pattern.<br>        Uses generator expression for early exit.<br>        &quot;&quot;&quot;<br>        return any(<br>            pattern in error_string<br>            for pattern in self._error_patterns_cache<br>        )</pre><h3>Why This Works</h3><ol><li><strong>One-Time Load:</strong> Patterns loaded once at startup</li><li><strong>In-Memory Set:</strong> Patterns are deduplicated and held in memory; each check is a substring scan over the cached patterns (linear in the number of patterns, not O(1)), with no I/O</li><li><strong>Generator Expression:</strong> Early exit on first match</li><li><strong>No Database Overhead:</strong> Zero queries during processing</li><li><strong>Easy Updates:</strong> Change patterns in DB, restart service</li></ol><h3>The Complete Pipeline</h3><p>Here’s how it all comes together:</p><pre># Initialize once at startup<br>print(&quot;Initializing error filter...&quot;)<br>error_filter = ErrorFilterDB.create_from_vault()<br>print(f&quot;Loaded {len(error_filter._error_patterns_cache)} patterns&quot;)<br># Main processing loop<br>while True:<br>    new_error_string = redis_client.lpop(&#39;elastalert.new_errors&#39;)<br>    if not new_error_string:<br>        break<br>    # Fast in-memory check - no database query!<br>    if error_filter.should_ignore_error(new_error_string):<br>        continue  # Skip known errors<br>    # Process new/unknown errors<br>    process_and_alert(new_error_string)</pre><h3>Handling Distributed Systems Challenges</h3><p>Real-world systems aren’t perfect. Network latency, temporary outages, and slow queries happen. 
Here’s how we made our system resilient:</p><h3>OpenSearch Optimization</h3><pre># OpenSearch client with battle-tested timeout settings<br>es = OpenSearch(<br>    hosts=[{&#39;host&#39;: f&quot;{environment}-analytics.domain.net&quot;, &#39;port&#39;: 80}],<br>    timeout=60,           # Increased from default 10s<br>    max_retries=3,        # Retry on transient failures<br>    retry_on_timeout=True # Don&#39;t fail on temporary slowdowns<br>)</pre><p><strong>Why these numbers matter:</strong></p><ul><li><strong>60s timeout:</strong> Gives complex aggregation queries time to complete</li><li><strong>3 retries:</strong> Handles temporary network blips</li><li><strong>retry_on_timeout:</strong> Prevents false failures during high load</li></ul><h3>Multi-Environment Architecture</h3><p>Supporting multiple environments (production, staging, rehearsal) required careful design:</p><pre>ENVIRONMENT = os.getenv(&#39;ENVIRONMENT&#39;, &#39;rehearsal&#39;)<br>webhook_map = {<br>    &quot;rehearsal&quot;: webhook_url,<br>    &quot;stage&quot;: stage_webhook_url,<br>    &quot;production&quot;: production_webhook_url<br>}<br># Environment-aware connections<br>redis_client = redis.Redis(host=f&quot;{ENVIRONMENT}-platform-cache.domain.net&quot;)<br>es_client = OpenSearch(hosts=[{&#39;host&#39;: f&quot;{ENVIRONMENT}-analytics.domain.net&quot;}])</pre><p>This pattern ensures:</p><ul><li>Same code runs everywhere</li><li>Environment-specific routing</li><li>No accidental cross-environment contamination</li></ul><h3>The Results: Numbers That Matter</h3><p>Here’s what we achieved with the optimized v3 implementation:</p><h3>Performance Metrics</h3><table><thead><tr><th>Metric</th><th>V1 (Database)</th><th>V2 (Hard-coded)</th><th>V3 (Cached)</th></tr></thead><tbody><tr><td><strong>Processing Time</strong></td><td>5–10 minutes</td><td>&lt;10 seconds</td><td>&lt;10 seconds</td></tr><tr><td><strong>Database Queries</strong></td><td>200,000/run</td><td>0</td><td>1 (startup only)</td></tr><tr><td><strong>Pattern Updates</strong></td><td>Instant</td><td>Requires deployment</td><td>Restart service</td></tr><tr><td><strong>Memory Usage</strong></td><td>Low</td><td>Low</td><td>~1MB for 1000 patterns</td></tr><tr><td><strong>Maintainability</strong></td><td>Good</td><td>Poor</td><td>Excellent</td></tr></tbody></table><h3>Real-World Impact</h3><ul><li><strong>90% reduction</strong> in alert processing latency</li><li><strong>99.9% reduction</strong> in database load</li><li><strong>Zero code deployments</strong> needed for pattern updates</li><li><strong>100% flexibility</strong> retained for pattern management</li></ul><h3>Advanced Patterns and Techniques</h3><h3>1. Factory Pattern for Credential Management</h3><p>Instead of scattering credential logic, we centralized it:</p><pre>@staticmethod<br>def create_from_vault():<br>    &quot;&quot;&quot;Factory method using vault credentials.&quot;&quot;&quot;<br>    host = &#39;common-vault.domain.net&#39;<br>    username = os.environ[&#39;MYSQL_USERNAME&#39;]<br>    password = os.environ[&#39;MYSQL_PASSWORD&#39;]<br>    return ErrorFilterDB(host, username, password)<br># Usage<br>error_filter = ErrorFilterDB.create_from_vault()</pre><p><strong>Benefits:</strong></p><ul><li>Single source of truth for credentials</li><li>Easy to swap credential providers</li><li>Clean separation of concerns</li></ul><h3>2. Generator Expressions for Early Exit</h3><p>This subtle optimization saves significant CPU:</p><pre># ❌ BAD: Checks ALL patterns even after finding a match<br>matches = [pattern in error for pattern in patterns]<br>if any(matches):<br>    return True<br># ✅ GOOD: Stops at first match<br>return any(pattern in error for pattern in patterns)</pre><p>With 1000 patterns and matches often in the first 10, this saves ~99% of checks.</p><h3>3. 
Smart Alert Routing</h3><p>Different error severities go to different channels:</p><pre>def send_alert_to_slack(error):<br>    total_hits = error[&#39;total_hits&#39;]<br>    if 1 &lt;= total_hits &lt; 20:<br>        color = &quot;warning&quot;<br>        webhook = webhook_warn_url<br>    elif total_hits &gt;= 20:  # include exactly 20, which the old check dropped<br>        color = &quot;danger&quot;<br>        webhook = webhook_error_url<br>    else:<br>        return  # Zero hits: nothing to alert on<br>    # message and title are built elsewhere (not shown in this excerpt)<br>    slack.send_alert_with_title(webhook, message, title, color)</pre><p>This prevents alert fatigue while ensuring critical issues get immediate attention.</p><h3>Lessons Learned: The Hard Way</h3><h3>1. Measure Before Optimizing</h3><p>We initially thought database connection pooling would solve our problem. It didn’t. Only after measuring did we realize the sheer number of queries was the issue, not connection overhead.</p><p><strong>Takeaway:</strong> Profile first, optimize second.</p><h3>2. Cache Invalidation is Still Hard</h3><p>Our initial implementation had no cache refresh mechanism. When patterns were updated, services needed manual restart. We solved this with:</p><pre># Option 1: Periodic refresh<br>last_refresh = time.time()  # set once, before the main loop<br># ...then, inside the main loop:<br>if time.time() - last_refresh &gt; 300:  # 5 minutes<br>    error_filter.refresh_cache()<br>    last_refresh = time.time()<br># Option 2: Signal-based refresh<br># Send SIGHUP to process to trigger refresh</pre><h3>3. Observability is Non-Negotiable</h3><p>Strategic logging saved us countless debugging hours:</p><pre>print(f&quot;Loaded {len(self._error_patterns_cache)} patterns&quot;)<br>print(f&quot;Ignoring error (matched filter): {error[:100]}...&quot;)<br>print(f&quot;Processing new error: {error_id}&quot;)</pre><p>In distributed systems, you can’t debug what you can’t see.</p><h3>4. Design for Multiple Environments from Day One</h3><p>Adding multi-environment support later would have been painful. 
Building it in from the start made testing and deployment trivial.</p><h3>Common Pitfalls to Avoid</h3><h3>1. Memory Leaks with Unbounded Caches</h3><pre># ❌ BAD: Cache grows forever<br>cache[error_id] = error_data<br># ✅ GOOD: Use TTL or LRU<br>if time.time() - error[&#39;timestamp&#39;] &gt; 86400:  # 24 hours<br>    continue  # Don&#39;t re-process old errors</pre><h3>2. Race Conditions with Cache Updates</h3><pre># ❌ BAD: Mutating the live set in place exposes a half-filled cache during refresh<br>self._error_patterns_cache.clear()<br>self._error_patterns_cache.update(new_patterns)<br><br># ✅ GOOD: Build the new set first, then swap the reference<br>new_cache = {pattern for pattern in fetch_patterns()}<br>self._error_patterns_cache = new_cache  # Attribute rebinding is atomic in CPython</pre><h3>3. Ignoring Edge Cases</h3><p>What happens when:</p><ul><li>Database is down during startup?</li><li>Pattern table is empty?</li><li>Environment variable is missing?</li></ul><p>Handle these explicitly or fail loudly.</p><h3>Beyond Error Monitoring: Broader Applications</h3><p>The patterns we used apply to many real-time processing scenarios:</p><h3>1. API Rate Limiting</h3><p>Cache user quotas in-memory, refresh periodically from database</p><h3>2. Feature Flags</h3><p>Load flag configurations once, check thousands of times</p><h3>3. Access Control Lists</h3><p>Cache permissions, avoid database hits on every request</p><h3>4. Content Filtering</h3><p>Spam detection, profanity filters, content moderation</p><p>The core principle: <strong>When read-to-write ratio is high, cache aggressively</strong>.</p><h3>Future Enhancements</h3><p>Where we’re headed next:</p><h3>1. Distributed Caching</h3><p>Use Redis for cross-service cache sharing:</p><pre># Share cache across multiple service instances<br>redis_client.setex(&#39;error_patterns&#39;, 300, json.dumps(patterns))</pre><h3>2. Pattern Analytics</h3><p>Track which patterns match most frequently to optimize pattern order</p><h3>3. Machine Learning Integration</h3><p>Auto-detect new error patterns using clustering algorithms</p><h3>4. 
Self-Healing Patterns</h3><p>Automatically add patterns for recurring errors</p><h3>Conclusion: Speed AND Flexibility</h3><p>The journey from V1 to V3 taught us that you don’t have to choose between performance and maintainability. With thoughtful architecture:</p><ul><li><strong>Database-backed storage</strong> gives you flexibility</li><li><strong>In-memory caching</strong> gives you speed</li><li><strong>Smart refresh strategies</strong> keep data fresh</li><li><strong>Robust error handling</strong> keeps systems reliable</li></ul><p>Whether you’re building error monitoring, rate limiting, or any high-throughput filtering system, these patterns will serve you well.</p><p>The code is in production, handling thousands of errors per minute across multiple environments. It’s fast, it’s maintainable, and it just works.</p><h3>Key Takeaways</h3><ol><li>✅ <strong>Cache when read:write ratio is high</strong> — Our 1000:1 ratio was perfect for caching</li><li>✅ <strong>Use the right data structure</strong> — An in-memory set of patterns, scanned per error with no database round trip</li><li>✅ <strong>Generator expressions for early exit</strong> — Stop checking after first match</li><li>✅ <strong>Design for multiple environments</strong> — Same code, different configs</li><li>✅ <strong>Build observability in from day one</strong> — You can’t fix what you can’t see</li><li>✅ <strong>Measure before and after</strong> — Know your baseline, prove your improvement</li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Hidden Deployment Bug That Brought Down Our UI: A Tale of Cache, Load Balancers, and Racing…]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-snippet">The 3 AM Wake-Up Call</p><p class="medium-feed-link"><a href="https://medium.com/@nikhil.ninawe/the-hidden-deployment-bug-that-brought-down-our-ui-a-tale-of-cache-load-balancers-and-racing-83b8600ed872?source=rss-b4b8a657c2b------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@nikhil.ninawe/the-hidden-deployment-bug-that-brought-down-our-ui-a-tale-of-cache-load-balancers-and-racing-83b8600ed872?source=rss-b4b8a657c2b------2</link>
            <guid isPermaLink="false">https://medium.com/p/83b8600ed872</guid>
            <category><![CDATA[load-balancing]]></category>
            <category><![CDATA[apache-httpd]]></category>
            <category><![CDATA[html]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Nikhil Ninawe]]></dc:creator>
            <pubDate>Wed, 29 Apr 2026 07:28:16 GMT</pubDate>
            <atom:updated>2026-04-29T07:28:16.725Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[A MongoDB Primary Switch Took Down a “Healthy” Service: Lessons from Stale Connection Pools and…]]></title>
            <link>https://medium.com/@nikhil.ninawe/a-mongodb-primary-switch-took-down-a-healthy-service-lessons-from-stale-connection-pools-and-ba42d23016b9?source=rss-b4b8a657c2b------2</link>
            <guid isPermaLink="false">https://medium.com/p/ba42d23016b9</guid>
            <category><![CDATA[mongodb]]></category>
            <category><![CDATA[incident-response]]></category>
            <category><![CDATA[production]]></category>
            <category><![CDATA[sre]]></category>
            <category><![CDATA[java]]></category>
            <dc:creator><![CDATA[Nikhil Ninawe]]></dc:creator>
            <pubDate>Fri, 24 Apr 2026 16:01:07 GMT</pubDate>
            <atom:updated>2026-04-24T16:01:07.395Z</atom:updated>
            <content:encoded><![CDATA[<h3>A MongoDB Primary Switch Took Down a “Healthy” Service: Lessons from Stale Connection Pools and 2‑Second Timeouts</h3><h3>TL;DR</h3><p>A planned MongoDB primary switch, combined with stale connection pools and an aggressively low 2‑second timeout, caused a core microservice to start throwing MongoTimeoutException errors. The database cluster was healthy, but the application wasn’t. The incident is a great case study in how <strong>application behavior during topology changes</strong> can be your real reliability bottleneck.</p><h3>What Happened (in 2 Minutes)</h3><p>One weekday morning, a backend microservice (let’s call it orders-api) began throwing MongoTimeoutException errors when trying to query MongoDB. Users saw:</p><ul><li>Intermittent failures when listing orders</li><li>Reporting endpoints returning HTTP 500</li><li>Overall degraded performance and timeouts</li></ul><p>The trigger was <strong>not</strong> a new deploy of orders-api.</p><p>Earlier, the infrastructure team had performed a <strong>planned MongoDB primary node switch</strong> within a replica set, mainly for cost/placement and resilience reasons (cross‑AZ optimization, maintenance, etc.).</p><p>Hours after that change, orders-api was still holding <strong>stale connections</strong> and failing to establish fresh ones within a strict <strong>2‑second connection timeout</strong>. Under normal conditions, this might have been barely tolerable; during a topology change, it became a full-blown incident.</p><p>Roughly:</p><ul><li>Time to detect (MTTD): a few minutes</li><li>Time to recover (MTTR): under an hour, after rotating instances and scaling the service</li></ul><h3>Why This Hurt Users</h3><p>This wasn’t a background batch job failing silently. 
This service sat directly on live user paths:</p><ul><li><strong>Order/Shipment listing</strong> pages started failing</li><li><strong>Reporting and analytics</strong> that depended on Mongo queries were intermittently down</li><li>Clients saw HTTP 500s and increased latency on critical endpoints</li></ul><p>Internally, the team also noticed some instances experiencing <strong>heavy garbage collection (GC)</strong> during the incident window, which made those instances even slower and more fragile.</p><p>Bottom line: a database topology change that should have been a <strong>non-event for users</strong> turned into a <strong>visible outage</strong>.</p><h3>Root Cause: Not Just “Mongo Failover”</h3><p>On paper, MongoDB replica sets are built to handle elections, failovers, and primary changes seamlessly. The cluster itself was fine.</p><p>The real root cause lived in the interaction between the <strong>application</strong> and the <strong>cluster</strong>.</p><h3>1. Stale Connection Pools After Primary Switch</h3><p>The application used MongoDB connection pooling. When the primary node changed, the app:</p><ul><li>Did not restart</li><li>Continued using pools built against the old primary topology</li><li>Failed to refresh effectively and struggled to create new, healthy connections</li></ul><p>Under load, calls to Mongo began to fail with MongoTimeoutException because <strong>getting a usable connection within the timeout window</strong> became increasingly rare.</p><h3>2. 
Over‑Aggressive Timeouts (2 Seconds)</h3><p>The service had a <strong>2‑second connection/operation timeout</strong>.</p><p>That sounds “strict and snappy,” but in production, 2 seconds can be too low under:</p><ul><li>Brief network hiccups</li><li>JVM pauses (GC)</li><li>Connection pool churn after topology changes</li><li>TLS handshakes and reconnections</li><li>Peak traffic periods</li></ul><p>Once the driver/pool was in a bad state, that <strong>2‑second limit</strong> ensured that even minor latency or connection hiccups surfaced immediately as hard failures.</p><p>In other words, the timeout configuration <strong>amplified</strong> the fragility instead of containing it.</p><h3>3. Runtime Instability (GC Pauses)</h3><p>Some instances showed signs of <strong>Full GC</strong> during the incident.</p><p>Even if Mongo had been perfectly healthy, Full GC events reduce available CPU for request handling, stretch latencies, and make connection acquisition more unpredictable. Combined with a 2‑second timeout, this turned a transient condition into a stream of failures.</p><h3>How the Team Mitigated It</h3><p>The short‑term fix was operational:</p><ul><li><strong>Rotate instances</strong>: replace or restart the orders-api instances so that they boot up with fresh connection pools targeting the correct Mongo primary.</li><li><strong>Scale up temporarily</strong>: increase the instance count during rotation to maintain some capacity while instances went through warm‑up, GC, and connection pool building.</li><li><strong>Watch dashboards</strong>: keep an eye on error rates, latencies, and health checks until the fleet stabilized.</li></ul><p>This worked because <strong>new processes built clean pools</strong> against the new primary. 
The incident closed once the rotated fleet was stable and error rates returned to baseline.</p><h3>The Real Lesson: Infra Changes Aren’t Done Until Apps Prove It</h3><p>A database primary switch is often treated as an infrastructure task:</p><blockquote><em>“The cluster is healthy, failover succeeded, we’re done.”</em></blockquote><p>This incident shows that’s only <strong>half of the job</strong>.</p><p>You’re not really “done” until:</p><ol><li>Applications have reconnected and stabilized,</li><li>Key user flows are working end‑to‑end, and</li><li>Error rates and latency haven’t regressed.</li></ol><p>In practice, that suggests a mindset shift:</p><blockquote><strong><em>“Database topology change + application validation” is a single operation.</em></strong></blockquote><h3>What to Fix Next Time</h3><p>Here are the improvements that fall naturally out of this incident.</p><h3>1. Revisit MongoDB Timeouts</h3><p>A flat 2‑second timeout may feel tough and “fast,” but in production it can:</p><ul><li>Turn transient conditions into visible user failures</li><li>Offer almost no room for connection pool recovery after events like elections or primary switches</li></ul><p>A more resilient approach:</p><ul><li>Use <strong>more generous timeouts</strong> (e.g., 10–30 seconds for connection establishment), especially during failovers.</li><li>Use <strong>separate values</strong> for:<br>Connection acquisition timeout<br>Socket read/write timeout<br>Overall operation timeout</li><li>Monitor these metrics so you can tune down from a place of <strong>observed safety</strong>, not guesswork.</li></ul><h3>2. Make Failover Runbooks Include Application Behavior</h3><p>When planning a MongoDB primary switch (or any similar infra change), your runbook shouldn’t end at “cluster looks healthy.”</p><p>It should contain explicit steps to validate:</p><ul><li>Can your key services (orders-api, billing-api, etc.) 
still talk to Mongo?</li><li>Do user journeys like “list orders,” “run report,” and “create shipment” still work?</li><li>Are error rates, latency, and saturation metrics stable <strong>after</strong> the switch?</li></ul><p>If some services consistently struggle to recover:</p><ul><li>Treat that as a <strong>known risk</strong>.</li><li>Add a <strong>temporary mitigation</strong> (e.g., targeted rolling restart) until you fix the underlying behavior.</li></ul><h3>3. Improve Application Resilience to Topology Changes</h3><p>Instead of relying on restarts forever, fix the deeper issues:</p><ul><li>Confirm the MongoDB driver is configured to be <strong>replica‑set aware</strong>, not pinned to a single host.</li><li>Ensure the connection pool is allowed to:<br>Detect topology changes<br>Drop bad connections<br>Rebuild pools without needing an app restart</li><li>Add instrumentation:</li><li>Pool size</li><li>In‑use vs available connections</li><li>Connection acquisition latency</li><li>Timeout counts</li></ul><p>This turns “mysterious MongoTimeoutExceptions” into something you can see and debug.</p><h3>4. Watch JVM Health Alongside DB Health</h3><p>Because GC pauses worsened this incident, it’s worth:</p><ul><li>Tracking GC pauses (especially Full GC), heap usage, and allocation rates</li><li>Correlating spikes in GC with spikes in:</li><li>Latency</li><li>Connection timeouts</li><li>Error rates</li></ul><p>Sometimes your “database problem” is really a <strong>JVM health</strong> problem that just manifests at the database boundary.</p><h3>Closing Thoughts</h3><p>This incident wasn’t about a flaky database. 
MongoDB did exactly what it was designed to do: hold elections and promote a new primary.</p><p>The real failure was in <strong>how the application handled a totally expected topology change</strong>, combined with <strong>timeouts tuned more for ideal conditions than for real‑world chaos</strong>.</p><p>If you operate microservices on top of MongoDB (or any distributed database), it’s worth asking:</p><blockquote><em>“When the topology changes, does my application bend… or break?”</em></blockquote><p>Designing for the former is where true reliability lives.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ba42d23016b9" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Debugging Production Issues: A Journey Through Exception Replay Bug Fixes]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-snippet">When Your Exception Handler Becomes the Exception</p><p class="medium-feed-link"><a href="https://medium.com/@nikhil.ninawe/debugging-production-issues-a-journey-through-exception-replay-bug-fixes-bc415f1b9935?source=rss-b4b8a657c2b------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@nikhil.ninawe/debugging-production-issues-a-journey-through-exception-replay-bug-fixes-bc415f1b9935?source=rss-b4b8a657c2b------2</link>
            <guid isPermaLink="false">https://medium.com/p/bc415f1b9935</guid>
            <category><![CDATA[spring]]></category>
            <category><![CDATA[exception-handling]]></category>
            <category><![CDATA[java]]></category>
            <category><![CDATA[spring-boot]]></category>
            <dc:creator><![CDATA[Nikhil Ninawe]]></dc:creator>
            <pubDate>Fri, 24 Apr 2026 07:15:19 GMT</pubDate>
            <atom:updated>2026-04-24T07:15:19.394Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Building a Production-Ready MongoDB Query Executor: Handling UUID Representations and ArrayFilters…]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-snippet">Introduction</p><p class="medium-feed-link"><a href="https://medium.com/@nikhil.ninawe/building-a-production-ready-mongodb-query-executor-handling-uuid-representations-and-arrayfilters-b6038ecfe827?source=rss-b4b8a657c2b------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@nikhil.ninawe/building-a-production-ready-mongodb-query-executor-handling-uuid-representations-and-arrayfilters-b6038ecfe827?source=rss-b4b8a657c2b------2</link>
            <guid isPermaLink="false">https://medium.com/p/b6038ecfe827</guid>
            <category><![CDATA[mongodb]]></category>
            <category><![CDATA[java]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[spring-boot]]></category>
            <category><![CDATA[engineering]]></category>
            <dc:creator><![CDATA[Nikhil Ninawe]]></dc:creator>
            <pubDate>Sun, 19 Apr 2026 06:31:59 GMT</pubDate>
            <atom:updated>2026-04-19T06:31:59.565Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[The $7,000/Year MongoDB Optimization Nobody Talks About: Strategic Primary Placement]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-snippet">How we accidentally discovered that WHERE your MongoDB primary lives matters as much as HOW you configure it</p><p class="medium-feed-link"><a href="https://medium.com/@nikhil.ninawe/the-7-000-year-mongodb-optimization-nobody-talks-about-strategic-primary-placement-9108a685d3be?source=rss-b4b8a657c2b------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@nikhil.ninawe/the-7-000-year-mongodb-optimization-nobody-talks-about-strategic-primary-placement-9108a685d3be?source=rss-b4b8a657c2b------2</link>
            <guid isPermaLink="false">https://medium.com/p/9108a685d3be</guid>
            <category><![CDATA[cost-optimization]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[mongodb]]></category>
            <category><![CDATA[cloud-architecture]]></category>
            <category><![CDATA[devops]]></category>
            <dc:creator><![CDATA[Nikhil Ninawe]]></dc:creator>
            <pubDate>Thu, 16 Apr 2026 08:11:43 GMT</pubDate>
            <atom:updated>2026-04-16T08:11:43.746Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Securing Sensitive Data in Logstash: Hashing Authentication Tokens in Access Logs]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@nikhil.ninawe/securing-sensitive-data-in-logstash-hashing-authentication-tokens-in-access-logs-9600b5e8e2ce?source=rss-b4b8a657c2b------2"><img src="https://cdn-images-1.medium.com/max/1242/1*fj7e-K0KU5u2NKOW73BSAA.png" width="1242"></a></p><p class="medium-feed-snippet">Introduction</p><p class="medium-feed-link"><a href="https://medium.com/@nikhil.ninawe/securing-sensitive-data-in-logstash-hashing-authentication-tokens-in-access-logs-9600b5e8e2ce?source=rss-b4b8a657c2b------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@nikhil.ninawe/securing-sensitive-data-in-logstash-hashing-authentication-tokens-in-access-logs-9600b5e8e2ce?source=rss-b4b8a657c2b------2</link>
            <guid isPermaLink="false">https://medium.com/p/9600b5e8e2ce</guid>
            <category><![CDATA[elasticsearch]]></category>
            <category><![CDATA[security]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[platform-engineering]]></category>
            <category><![CDATA[logstash]]></category>
            <dc:creator><![CDATA[Nikhil Ninawe]]></dc:creator>
            <pubDate>Fri, 10 Apr 2026 05:31:56 GMT</pubDate>
            <atom:updated>2026-04-10T05:31:56.651Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[How We Saved $8,000/Year by Adding One MongoDB Connection String Parameter]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-snippet">A deep dive into optimizing AWS Inter-Availability Zone data transfer costs</p><p class="medium-feed-link"><a href="https://medium.com/@nikhil.ninawe/how-we-saved-8-000-year-by-adding-one-mongodb-connection-string-parameter-891987f0083b?source=rss-b4b8a657c2b------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@nikhil.ninawe/how-we-saved-8-000-year-by-adding-one-mongodb-connection-string-parameter-891987f0083b?source=rss-b4b8a657c2b------2</link>
            <guid isPermaLink="false">https://medium.com/p/891987f0083b</guid>
            <category><![CDATA[optimization]]></category>
            <category><![CDATA[mongodb]]></category>
            <category><![CDATA[java]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Nikhil Ninawe]]></dc:creator>
            <pubDate>Wed, 08 Apr 2026 05:12:42 GMT</pubDate>
            <atom:updated>2026-04-08T05:12:42.448Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[1,163 CloudWatch Alarms in Non-Production: A DevOps Horror Story]]></title>
            <link>https://medium.com/@nikhil.ninawe/1-163-cloudwatch-alarms-in-non-production-a-devops-horror-story-0df6f0bdcf16?source=rss-b4b8a657c2b------2</link>
            <guid isPermaLink="false">https://medium.com/p/0df6f0bdcf16</guid>
            <category><![CDATA[cost-optimization]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[infrastructure-as-code]]></category>
            <category><![CDATA[finops]]></category>
            <category><![CDATA[devops]]></category>
            <dc:creator><![CDATA[Nikhil Ninawe]]></dc:creator>
            <pubDate>Sat, 04 Apr 2026 09:22:40 GMT</pubDate>
            <atom:updated>2026-04-04T09:22:40.155Z</atom:updated>
            <content:encoded><![CDATA[<h3>How We Reduced Alarm Fatigue, Cut Costs by 85%, and Actually Started Trusting Our Monitoring Again</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wrCIJXKdrB6lcBPa" /></figure><p><em>TL;DR: Our staging and rehearsal environments had 1,163 CloudWatch alarms. 50 were actively firing (all false positives). We were paying $115/month for them. We cut the alarm count by 85%, saved about $1,192/year, and fixed our broken monitoring culture.</em></p><h3>The Alert That Changed Everything</h3><p>It was 3 AM when my phone buzzed for the 47th time that week.</p><p><strong>“ALARM: rehearsal-accounts-bravo-LowCPUUtilization”</strong></p><p>I ignored it. So did everyone else on the team.</p><p>By morning, we had 12 more Slack notifications, 8 PagerDuty alerts, and 3 emails. All from our <strong>rehearsal environment</strong> — a testing environment that barely gets traffic.</p><p><strong>This was the moment I realized: We had an alarm problem.</strong></p><h3>The Audit: Opening Pandora’s Box</h3><p>I wrote a quick Python script to inventory all our CloudWatch alarms:</p><pre>import boto3<br><br>cloudwatch = boto3.client(&#39;cloudwatch&#39;, region_name=&#39;us-west-2&#39;)<br># describe_alarms returns at most 100 alarms per call, so paginate<br>paginator = cloudwatch.get_paginator(&#39;describe_alarms&#39;)<br>alarms = [a for page in paginator.paginate() for a in page[&#39;MetricAlarms&#39;]]<br>print(f&quot;Total alarms: {len(alarms)}&quot;)</pre><p><strong>The result made me do a double-take:</strong></p><pre>Total alarms: 4,989</pre><p><strong>4,989 CloudWatch alarms.</strong> For context, we had 632 EC2 instances.</p><p>That’s <strong>7.9 alarms per instance.</strong> We were paying AWS to send us thousands of notifications we’d learned to ignore.</p><h3>Breaking Down the Madness</h3><p>When I broke the count down by environment:</p><table><thead><tr><th>Environment</th><th>Alarms</th><th>In ALARM State</th><th>Cost/Month</th></tr></thead><tbody><tr><td><strong>Production</strong></td><td>3,156</td><td>127 (4.0%)</td><td>$314.60</td></tr><tr><td><strong>Rehearsal</strong></td><td>690</td><td>29 (4.2%)</td><td>$68.00</td></tr><tr><td><strong>Stage</strong></td><td>473</td><td>21 (4.4%)</td><td>$46.30</td></tr><tr><td><strong>Dev</strong></td><td>670</td><td>18 (2.7%)</td><td>$66.00</td></tr></tbody></table><p><strong>Our non-production 
environments alone had 1,833 alarms, costing $180.30/month.</strong></p><p>But here’s the kicker: <strong>68 alarms were actively firing</strong> in non-prod. When I investigated:</p><ul><li>❌ Zero actual incidents</li><li>❌ Zero actionable alerts</li><li>❌ 100% noise</li></ul><h3>The Patterns of Failure</h3><h3>Pattern #1: The “Low CPU Utilization” Epidemic</h3><p><strong>35 alarms across rehearsal and stage were firing for “Low CPU Utilization.”</strong></p><pre>rehearsal-accounts-bravo-LowCPUUtilization: ALARM<br>rehearsal-api-bravo-LowCPUUtilization: ALARM<br>rehearsal-config-alpha-LowCPUUtilization: ALARM<br>... (32 more)</pre><p><strong>The problem:</strong> We’d set the threshold at &lt;10% CPU for 5 minutes.</p><p><strong>The reality:</strong> Rehearsal environments get maybe 10 requests per hour. Of course the CPU is idle!</p><p><strong>The cost:</strong></p><ul><li>35 alarms × $0.10/month = <strong>$3.50/month</strong></li><li>Wasted engineering time investigating: <strong>~2 hours/month</strong></li><li>Opportunity cost: <strong>Priceless</strong> (we stopped trusting our monitoring)</li></ul><h3>Pattern #2: The Copy-Paste Syndrome</h3><p>Every service had exactly 9 alarms:</p><pre>{service}-HighCPUUtilization<br>{service}-LowCPUUtilization<br>{service}-HighMemoryUtilization<br>{service}-LowMemoryUtilization<br>{service}-HighDiskSpace<br>{service}-HTTP4xxCount<br>{service}-HTTP5xxCount<br>{service}-UnhealthyHostCount<br>{service}-StatusCheckFailed</pre><p><strong>77 services × 9 alarms = 693 alarms</strong></p><p>This was clearly from a Terraform module that someone wrote once and we copy-pasted forever.</p><p><strong>The problems:</strong></p><ol><li>Not all services need all alarms (a database doesn’t have HTTP 4xx codes)</li><li>Thresholds weren’t adjusted per service (calendar service ≠ auth service)</li><li>No differentiation between prod and non-prod</li></ol><h3>Pattern #3: The Zombie Alarms</h3><p><strong>35 alarms in INSUFFICIENT_DATA state</strong> — meaning the 
metrics they’re watching don’t exist anymore.</p><pre>stage-mongo-07-HighDiskSpace: INSUFFICIENT_DATA<br>rehearsal-old-api-HighMemory: INSUFFICIENT_DATA</pre><p><strong>What happened:</strong></p><ul><li>Instances terminated 6 months ago</li><li>Services renamed or decommissioned</li><li>Metrics stopped being published</li></ul><p><strong>What we did:</strong></p><ul><li>Keep paying $0.10/month per alarm</li><li>Ignore the noise</li><li>Assume “INSUFFICIENT_DATA” is normal</li></ul><p><strong>What we should have done:</strong> Delete them.</p><h3>Pattern #4: The Disk Space Time Bombs</h3><p><strong>3 MongoDB instances in stage had real disk space issues:</strong></p><pre>stage-mongo-04: 92% disk usage<br>stage-mongo-05: 94% disk usage<br>stage-mongo-06: 91% disk usage</pre><p>But guess what? <strong>Nobody noticed because we’d trained ourselves to ignore all stage alarms.</strong></p><p><strong>This is the real cost of alarm fatigue:</strong> When everything is on fire, nothing is on fire.</p><h3>The True Cost of Alarm Sprawl</h3><h3>Direct Costs:</h3><pre>1,163 non-prod alarms<br>- 10 free tier alarms<br>= 1,153 billable alarms</pre><pre>1,153 × $0.10/month = $115.30/month<br>Annual cost: $1,383.60</pre><h3>Indirect Costs (Much Worse):</h3><ol><li><strong>Team Burnout:</strong></li></ol><ul><li>Average 15 false alerts per day</li><li>2–3 hours/week investigating noise</li><li>On-call engineers ignoring pages</li></ul><p><strong>2. Missed Real Issues:</strong></p><ul><li>3 MongoDB instances at 90%+ disk (critical!)</li><li>2 unhealthy instances (degraded performance)</li><li>Actual production incidents buried in noise</li></ul><p><strong>3. Tool Mistrust:</strong></p><ul><li>“Just ignore stage alerts”</li><li>“PagerDuty notifications? Probably nothing”</li><li>CloudWatch became a joke</li></ul><p><strong>You can’t put a price on broken trust in your monitoring.</strong></p><h3>The Cleanup Plan</h3><h3>Phase 1: Stop the Bleeding (Week 1)</h3><h4>1. 
Delete Zombie Alarms</h4><pre>import boto3<br>from datetime import datetime, timedelta, timezone<br><br>cw = boto3.client(&#39;cloudwatch&#39;)<br><br>def is_zombie(alarm, days=7):<br>    # Assumption: a zombie is an alarm stuck in INSUFFICIENT_DATA for more than `days` days<br>    age = datetime.now(timezone.utc) - alarm[&#39;StateUpdatedTimestamp&#39;]<br>    return age &gt; timedelta(days=days)<br><br># Page through every alarm currently in INSUFFICIENT_DATA<br>paginator = cw.get_paginator(&#39;describe_alarms&#39;)<br>for page in paginator.paginate(StateValue=&#39;INSUFFICIENT_DATA&#39;):<br>    for alarm in page[&#39;MetricAlarms&#39;]:<br>        if is_zombie(alarm):<br>            print(f&quot;Deleting: {alarm[&#39;AlarmName&#39;]}&quot;)<br>            cw.delete_alarms(AlarmNames=[alarm[&#39;AlarmName&#39;]])</pre><p><strong>Result:</strong></p><ul><li>Deleted 35 zombie alarms</li><li><strong>Savings: $3.50/month</strong> ($42/year)</li><li><strong>Time: 30 minutes</strong></li></ul><h4>2. Fix the Obvious Issues</h4><p>Those MongoDB servers? Yeah, we actually fixed them:</p><pre># Clear old MongoDB logs<br>ssh stage-inventory-mongo-04<br>find /var/log/mongodb -name &quot;*.gz&quot; -mtime +30 -delete<br># Increase volume size<br>aws ec2 modify-volume --volume-id vol-xxx --size 200</pre><p><strong>Result:</strong></p><ul><li>3 critical issues resolved</li><li>Alarms stopped firing</li><li><strong>Time: 1 hour</strong></li></ul><h4>3. Adjust Non-Prod Thresholds</h4><p>For non-production environments, we changed:</p><pre># Before<br>LowCPUThreshold: 10%   # Too sensitive!<br>HighCPUThreshold: 80%<br>EvaluationPeriods: 1   # Too quick!<br><br># After (Non-Prod)<br>LowCPUThreshold: 5%    # More realistic<br>HighCPUThreshold: 90%  # Higher tolerance<br>EvaluationPeriods: 3   # 15 mins instead of 5</pre><p>Or better yet: <strong>Disabled “Low CPU” alarms entirely for non-prod.</strong></p><p><strong>Rationale:</strong> Non-prod environments are supposed to be idle most of the time. 
That’s not a problem; it’s expected.</p><p><strong>Result:</strong></p><ul><li>35 noisy alarms silenced</li><li><strong>Savings: $3.50/month</strong></li><li><strong>Time: 2 hours (Terraform updates)</strong></li></ul><h3>Phase 2: Consolidate with Composite Alarms (Week 2–3)</h3><p>This is where the magic happened.</p><p><strong>Old approach:</strong> 9 alarms per service <strong>New approach:</strong> 1 composite alarm per service</p><pre>import boto3<br>cw = boto3.client(&#39;cloudwatch&#39;)<br># Create composite alarm<br>cw.put_composite_alarm(<br>    AlarmName=&#39;rehearsal-accounts-service-health&#39;,<br>    AlarmRule=&#39;ALARM(rehearsal-accounts-cpu-high) OR &#39;<br>              &#39;ALARM(rehearsal-accounts-memory-high) OR &#39;<br>              &#39;ALARM(rehearsal-accounts-unhealthy) OR &#39;<br>              &#39;ALARM(rehearsal-accounts-5xx-high)&#39;,<br>    AlarmActions=[&#39;arn:aws:sns:us-west-2:xxx:ops-alerts&#39;],<br>    AlarmDescription=&#39;Composite health check for accounts service&#39;<br>)<br># Delete the 9 individual alarms<br>individual_alarms = [<br>    &#39;rehearsal-accounts-cpu-high&#39;,<br>    &#39;rehearsal-accounts-cpu-low&#39;,<br>    &#39;rehearsal-accounts-memory-high&#39;,<br>    # ... 6 more<br>]<br>cw.delete_alarms(AlarmNames=individual_alarms)</pre><p><strong>Benefits:</strong></p><ol><li>✅ One alert instead of 9</li><li>✅ Only fires if something is ACTUALLY wrong</li><li>✅ Cleaner alert messages</li><li>✅ 88% cost reduction per service</li></ol><p><strong>Scaling this across 77 services:</strong></p><pre>Before: 77 services × 9 alarms = 693 alarms<br>After:  77 services × 1 alarm  = 77 alarms<br>Reduction: 616 alarms<br>Savings: 616 × $0.10 = $61.60/month ($739/year)</pre><h3>Phase 3: Right-Size the Infrastructure (Week 3–4)</h3><p>Those 35 “Low CPU” alarms? 
They were telling us something:</p><p><strong>We were over-provisioned.</strong></p><pre>Current: t3.large (2 vCPU, 8GB RAM)<br>Usage:   5-10% CPU, 2GB RAM<br>Right-size: t3.medium (2 vCPU, 4GB RAM)<br>Savings: ~$15/instance/month</pre><p>For 35 instances:</p><ul><li><strong>EC2 Savings: $525/month</strong> ($6,300/year)</li><li><strong>Alarm Reduction: 35 alarms</strong> (now they run at 10–20% CPU)</li></ul><p><strong>This is the real win:</strong> The alarms weren’t just noise — they were telling us we were wasting money on compute.</p><h3>The Final Results</h3><h3>Cost Impact:</h3><pre>CloudWatch Alarms:<br>  Before: $115.30/month<br>  After:  $16.00/month<br>  Savings: $99.30/month ($1,192/year)<br>Right-Sizing EC2 (bonus):<br>  35 instances: t3.large → t3.medium<br>  Savings: $525/month ($6,300/year)<br>TOTAL: $7,492/year</pre><h3>Operational Impact (Priceless):</h3><ol><li>✅ <strong>On-call engineers sleep better</strong></li></ol><ul><li>Before: 15 false alerts/day</li><li>After: 0–1 false alerts/week</li></ul><ol><li>✅ <strong>We trust our monitoring again</strong></li></ol><ul><li>When an alarm fires, we investigate</li><li>Actually caught 2 real issues in first month</li></ul><ol><li>✅ <strong>Faster incident response</strong></li></ol><ul><li>Before: “Which of these 8 alerts is real?”</li><li>After: “One alert = one problem = one action”</li></ul><ol><li>✅ <strong>Better use of engineering time</strong></li></ol><ul><li>Before: 2–3 hours/week on false alerts</li><li>After: 0.5 hours/week on real incidents</li><li><strong>Reclaimed: ~100 hours/year per engineer</strong></li></ul><h3>Lessons Learned</h3><h3>1. Non-Prod ≠ Prod (Obviously?)</h3><p>We applied the same alarm strategy to all environments. 
That’s like using the same security for your house and your shed.</p><p><strong>Better approach:</strong></p><ul><li><strong>Prod:</strong> Aggressive monitoring, low thresholds, page immediately</li><li><strong>Stage:</strong> Moderate monitoring, email/Slack only</li><li><strong>Dev:</strong> Minimal monitoring, maybe just health checks</li></ul><h3>2. Composite Alarms Are a Game-Changer</h3><p>Creating 9 alarms per service is lazy automation. It’s the monitoring equivalent of:</p><pre>try:<br>    # entire application<br>except Exception as e:<br>    print(&quot;Something went wrong!&quot;)</pre><p><strong>Composite alarms let you say:</strong> “Alert me if CPU is high AND memory is high AND we’re getting 5xx errors.”</p><p>Not: “Alert me if CPU is 11% for 6 minutes on a Sunday at 3 AM.”</p><h3>3. Alarm Fatigue Kills</h3><p>The MongoDB disk space issues were REAL problems that could’ve caused outages. But we missed them because they were buried in 50 false positives.</p><p><strong>This is the hidden cost of bad monitoring:</strong> When everything’s an emergency, nothing’s an emergency.</p><h3>4. Alarms Should Drive Action</h3><p>We had alarms like “Low CPU Utilization” that had no runbook, no action, no owner.</p><p><strong>Ask yourself:</strong> “If this alarm fires at 2 AM, what should the on-call engineer do?”</p><p>If the answer is “nothing” or “I don’t know,” <strong>delete the alarm.</strong></p><h3>5. Infrastructure as Code Can Go Wrong</h3><p>Our Terraform module for “standard service monitoring” was copy-pasted 77 times. 
Each service got:</p><ul><li>High CPU alarm ✅ (useful)</li><li>Low CPU alarm ❌ (useless noise)</li><li>HTTP 4xx alarm ✅ (useful for APIs)</li><li>HTTP 4xx alarm ❌ (useless for databases)</li></ul><p><strong>Better approach:</strong></p><pre># Terraform module with environment-aware thresholds<br>module &quot;monitoring&quot; {<br>  source = &quot;./modules/service-monitoring&quot;<br>  <br>  service_name = &quot;accounts&quot;<br>  environment  = &quot;production&quot;  # or &quot;staging&quot;<br>  service_type = &quot;api&quot;         # or &quot;database&quot;, &quot;worker&quot;<br>  <br>  # Module adjusts alarms based on these inputs<br>}</pre><h3>Your Action Plan: Fix This in One Sprint</h3><h3>Week 1: Audit (2 hours)</h3><pre>#!/usr/bin/env python3<br>import boto3<br>from collections import defaultdict<br>from datetime import datetime, timezone<br><br>cw = boto3.client(&#39;cloudwatch&#39;)<br><br>def days_since_state_change(alarm):<br>    # Whole days since the alarm last changed state<br>    return (datetime.now(timezone.utc) - alarm[&#39;StateUpdatedTimestamp&#39;]).days<br><br>def is_constantly_firing(alarm, days=1):<br>    # Assumption: &quot;constantly firing&quot; means stuck in ALARM for more than `days` days<br>    return days_since_state_change(alarm) &gt; days<br><br># Get all alarms (describe_alarms returns at most 100 per call, so paginate)<br>paginator = cw.get_paginator(&#39;describe_alarms&#39;)<br>alarms = [a for page in paginator.paginate() for a in page[&#39;MetricAlarms&#39;]]<br># Group by state<br>by_state = defaultdict(list)<br>for alarm in alarms:<br>    by_state[alarm[&#39;StateValue&#39;]].append(alarm)<br>print(f&quot;Total alarms: {len(alarms)}&quot;)<br>print(f&quot;  OK: {len(by_state[&#39;OK&#39;])}&quot;)<br>print(f&quot;  ALARM: {len(by_state[&#39;ALARM&#39;])}&quot;)<br>print(f&quot;  INSUFFICIENT_DATA: {len(by_state[&#39;INSUFFICIENT_DATA&#39;])}&quot;)<br># Find zombies (INSUFFICIENT_DATA for &gt;7 days)<br>zombies = [a for a in by_state[&#39;INSUFFICIENT_DATA&#39;]<br>           if days_since_state_change(a) &gt; 7]<br>print(f&quot;\n🧟 Zombie alarms to delete: {len(zombies)}&quot;)<br># Find alarm spam (firing constantly)<br>spam = [a for a in by_state[&#39;ALARM&#39;]<br>        if is_constantly_firing(a)]<br>print(f&quot;📢 Noisy alarms to fix: {len(spam)}&quot;)<br># Calculate cost<br>billable = max(0, len(alarms) - 10)<br>cost = billable * 0.10<br>print(f&quot;\n💰 Monthly cost: ${cost:.2f}&quot;)</pre><h3>Week 2: Quick Wins (4 hours)</h3><ol><li><strong>Delete zombies</strong> → Instant 
$3–10/month savings</li><li><strong>Fix obvious issues</strong> → Stop real alarms from firing</li><li><strong>Adjust non-prod thresholds</strong> → Reduce noise by 50%</li></ol><h3>Week 3–4: Consolidate (8–16 hours)</h3><ol><li><strong>Create composite alarms</strong> for top 20 services</li><li><strong>Delete redundant individual alarms</strong></li><li><strong>Document the new alarm strategy</strong></li></ol><h3>Expected Results:</h3><ul><li><strong>Alarm reduction:</strong> 60–85%</li><li><strong>Cost savings:</strong> $50–150/month</li><li><strong>Time savings:</strong> 2–4 hours/week (no more false alert investigations)</li><li><strong>Better monitoring:</strong> Actually trust your alarms again</li></ul><h3>Common Objections</h3><h3>“But we need low CPU alarms to catch cost waste!”</h3><p><strong>Counter:</strong> Use AWS Cost Anomaly Detection or Trusted Advisor instead. Don’t wake someone up at 3 AM because a dev server is idle.</p><p><strong>Better approach:</strong> Weekly cost reports, monthly right-sizing reviews.</p><h3>“What if we delete an alarm and then need it?”</h3><p><strong>Reality check:</strong> When’s the last time you actually acted on that alarm?</p><p>If it’s been firing for 6 months and nobody’s fixed it, it’s not important.</p><h3>“Composite alarms are too complex!”</h3><p>They’re actually simpler:</p><pre># Complex (9 alarms)<br>accounts-high-cpu<br>accounts-low-cpu<br>accounts-high-memory<br>accounts-low-memory<br>...</pre><pre># Simple (1 alarm)<br>accounts-service-health</pre><p>One alarm. One action. 
One page.</p><h3>“This will take too long!”</h3><p><strong>Our timeline:</strong></p><ul><li>Week 1: 2 hours (audit)</li><li>Week 2: 4 hours (quick wins)</li><li>Week 3–4: 12 hours (consolidation)</li><li><strong>Total: 18 hours</strong></li></ul><p><strong>ROI:</strong></p><ul><li>Time saved: 2–4 hours/week = <strong>100+ hours/year</strong></li><li>Cost saved: <strong>$1,192–7,492/year</strong></li><li>Sleep quality: <strong>Priceless</strong></li></ul><h3>The Bottom Line</h3><p>We had <strong>1,163 CloudWatch alarms in non-production environments</strong>. They cost us:</p><ul><li>💰 <strong>$115/month in direct costs</strong></li><li>⏰ <strong>2–3 hours/week in false alert investigations</strong></li><li>🎯 <strong>Complete loss of trust in our monitoring</strong></li><li>😴 <strong>Burned-out on-call engineers</strong></li></ul><p>After one sprint of focused cleanup:</p><ul><li>✅ <strong>85% reduction in alarms</strong> (1,163 → 170)</li><li>✅ <strong>$99/month saved</strong> ($1,192/year)</li><li>✅ <strong>96% reduction in false alerts</strong> (50 → 2)</li><li>✅ <strong>Monitoring we actually trust</strong></li></ul><p><strong>The real win wasn’t the money.</strong> It was getting our monitoring back.</p><p>Now when an alarm fires, the team investigates. Because we know it’s real.</p><h3>What’s Next for You?</h3><p>Take 2 hours this week and run the audit script. 
I guarantee you’ll find:</p><ol><li>❌ Zombie alarms (INSUFFICIENT_DATA)</li><li>❌ Noisy alarms (constantly firing in non-prod)</li><li>❌ Copy-pasted alarms (every service has identical thresholds)</li><li>❌ Money being wasted ($50–200/month minimum)</li></ol><p><strong>Start small:</strong></p><ul><li>Day 1: Run the audit</li><li>Day 2: Delete 10 zombie alarms</li><li>Day 3: Fix one noisy alarm</li><li>Week 2: Create your first composite alarm</li></ul><p><strong>Or go big:</strong></p><ul><li>Dedicate one sprint to alarm cleanup</li><li>Target 50%+ reduction</li><li>Save thousands per year</li><li>Sleep better</li></ul><p>Either way, <strong>stop paying AWS to send you noise.</strong></p><p><em>Have you dealt with alarm fatigue in your infrastructure? What strategies worked for you? Drop a comment — I’d love to hear your war stories.</em></p><p><em>If this helped, share it with your DevOps/SRE team. Every engineering org has an alarm problem; most just haven’t measured it yet.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=0df6f0bdcf16" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>