<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Mohamed Hussain S on Medium]]></title>
        <description><![CDATA[Stories by Mohamed Hussain S on Medium]]></description>
        <link>https://medium.com/@mohhddhassan?source=rss-1f4643fe3841------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*sZ00NWC_7CGyCqTcnoUZvw.png</url>
            <title>Stories by Mohamed Hussain S on Medium</title>
            <link>https://medium.com/@mohhddhassan?source=rss-1f4643fe3841------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 23 May 2026 16:24:20 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@mohhddhassan/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[PostgreSQL Restore Failures: Stop Fighting Recovery Logic]]></title>
            <link>https://blog.stackademic.com/postgresql-restore-failures-stop-fighting-recovery-logic-79c3301d4e00?source=rss-1f4643fe3841------2</link>
            <guid isPermaLink="false">https://medium.com/p/79c3301d4e00</guid>
            <category><![CDATA[postgresql]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[pgbackrest]]></category>
            <category><![CDATA[sre]]></category>
            <dc:creator><![CDATA[Mohamed Hussain S]]></dc:creator>
            <pubDate>Mon, 18 May 2026 06:30:14 GMT</pubDate>
            <atom:updated>2026-05-18T06:30:14.200Z</atom:updated>
            <content:encoded><![CDATA[<h3>PostgreSQL Restore Failures: It Wasn’t pgBackRest, It Was My Recovery Logic</h3><p>I was building what I thought was a foolproof PostgreSQL backup and restore workflow using <strong>pgBackRest</strong>. The plan was simple: automate backups, automate restores, validate the data, and make recovery predictable.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*B8hZ7Uh7LIzlHQAp.png" /></figure><p>Instead, I ended up repeatedly breaking PostgreSQL recovery itself.</p><p>At one point, the database refused to start entirely. Applications crashed, authentication failed, and my logs were a graveyard of cryptic errors:</p><pre>invalid checkpoint record<br>could not locate a valid checkpoint record at 0/DEAD</pre><p>Shortly after, I hit timeline mismatch errors that made my head spin: ERROR: [058]: target timeline 3 forked from backup timeline 2.</p><p>My first instinct? <strong>“pgBackRest is corrupting my restores.”</strong> I was wrong. The real culprit wasn’t the tool — it was the way I was handling the recovery logic.</p><h3>The Architecture of a Failure</h3><p>I moved the testing to my local machine to isolate the issue. I wanted a clean environment where I could trigger restores via API workflows. My logic followed a standard script:</p><ul><li><strong>Download</strong> the backup repository.</li><li><strong>Stop</strong> PostgreSQL.</li><li><strong>Restore</strong> the backup.</li><li><strong>Start</strong> PostgreSQL.</li><li><strong>Validate</strong> the database.</li></ul><p>It sounds straightforward, but this is where the “hidden” mechanics of PostgreSQL began to clash with my automation.</p><h3>My Fatal Assumption: “Cleaning Up” the Garbage</h3><p>Every time a restore failed, I noticed a few files sitting in my data directory: backup_label, recovery.signal, and standby.signal.</p><p>To my untrained eye, these looked like leftover artifacts from a crashed process — debris that needed to be cleared for a “clean” start. I added a line to my automation that I thought was clever:</p><pre># The line that broke everything<br>rm -f recovery.signal standby.signal backup_label</pre><p>I genuinely believed I was helping PostgreSQL start with a blank slate. In reality, <strong>I was deleting the exact recovery metadata PostgreSQL needed to survive.</strong></p><h3>What PostgreSQL Was Actually Trying To Do</h3><p>This was my turning point. I realized pgBackRest wasn’t writing “junk” files; it was orchestrating a delicate handoff.</p><ul><li><strong>backup_label</strong>: This tells PostgreSQL exactly where the recovery WAL (Write-Ahead Log) replay should begin.</li><li><strong>recovery.signal</strong>: This acts as a flag telling PostgreSQL to enter recovery mode rather than normal startup.</li><li><strong>WAL Replay</strong>: This is the process of reconstructing a consistent state from logs.</li></ul><p>By deleting these files, I was essentially blindfolding PostgreSQL and asking it to run through a minefield. It didn’t know where it was in time, so it panicked.</p><h3>The “Timeline Mismatch” Rabbit Hole</h3><p>As I kept “fixing” the restores, I hit a new wall: <strong>Timeline Mismatches.</strong></p><pre>ERROR: [058]: target timeline 3 forked from backup timeline 2</pre><p>PostgreSQL backups aren’t just static snapshots; they are living histories. Because my earlier failed attempts had partially started the database, I had inadvertently created “forked” histories. I wasn’t just debugging a restore anymore; I was debugging a fractured timeline of events.</p><blockquote><strong><em>The Lesson:</em></strong><em> PostgreSQL backups are tightly coupled with WAL history. If you break the continuity of the recovery, you break the future of your backups.</em></blockquote><h3>The Fix: Stop Fighting the Engine</h3><p>The biggest realization was simple: <strong>Stop trying to “help” PostgreSQL recover.</strong></p><p>Instead of manual cleanup, I pivoted to a “scorched earth” approach for the data directory and let the official tools do their jobs.</p><h3>The Corrected Flow:</h3><ul><li><strong>Stop PostgreSQL</strong> cleanly.</li><li><strong>Completely empty</strong> the data directory (don’t just overwrite; delete everything).</li><li>Run pgBackRest restore with --type=immediate.</li><li>Let PostgreSQL start and manage its own internal recovery.</li><li><strong>Promote</strong> the instance only after it confirms readiness.</li></ul><p>The addition of --type=immediate was the final piece of the puzzle. It told pgBackRest to restore to the latest consistent point available, ensuring the WAL replay stayed valid and predictable.</p><pre># The stable approach<br>self._empty_directory(self.pg_data_dir)<br>self._run_pgbackrest(&quot;restore&quot;, &quot;--type=immediate&quot;)</pre><h3>Hard-Learned Takeaways</h3><p>If you are managing PostgreSQL restores, keep these rules in mind to avoid the week of debugging I endured:</p><ul><li><strong>backup_label is sacred:</strong> Never delete it manually during a restore.</li><li><strong>Isolation is key:</strong> Never test restore automation against a live application environment. The “blast radius” of a failed database startup is huge.</li><li><strong>Clean starts only:</strong> A “delta” restore is fast, but if you’re seeing corruption, wipe the directory and start fresh.</li><li><strong>Trust the Tool:</strong> pgBackRest is designed to orchestrate recovery. If you find yourself writing rm commands in your data folder, you’re probably doing it wrong.</li></ul><h3><strong>Final Thought:</strong></h3><p>The system wasn’t broken. <br>My mental model of how PostgreSQL recovers was.</p><p>Once I stopped fighting the recovery logic and started supporting it, the “corrupt” restores vanished.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=79c3301d4e00" width="1" height="1" alt=""><hr><p><a href="https://blog.stackademic.com/postgresql-restore-failures-stop-fighting-recovery-logic-79c3301d4e00">PostgreSQL Restore Failures: Stop Fighting Recovery Logic</a> was originally published in <a href="https://blog.stackademic.com">Stackademic</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Debugging a Pipeline That “Should Have Worked”]]></title>
            <link>https://blog.stackademic.com/debugging-a-pipeline-that-should-have-worked-fa277955240b?source=rss-1f4643fe3841------2</link>
            <guid isPermaLink="false">https://medium.com/p/fa277955240b</guid>
            <category><![CDATA[debugging]]></category>
            <category><![CDATA[deployment-pipelines]]></category>
            <category><![CDATA[vector]]></category>
            <category><![CDATA[clickhouse]]></category>
            <category><![CDATA[devops]]></category>
            <dc:creator><![CDATA[Mohamed Hussain S]]></dc:creator>
            <pubDate>Sat, 09 May 2026 06:12:35 GMT</pubDate>
            <atom:updated>2026-05-12T03:33:38.916Z</atom:updated>
            <content:encoded><![CDATA[<p><em>Part 4 of a series on building a metrics pipeline into ClickHouse</em></p><h3>When Everything Looks Correct</h3><p>At this stage, everything was in place.</p><ul><li>The pipeline model was clear</li><li>Transforms were working</li><li>Data matched the schema</li></ul><p>By all accounts, it should have worked.</p><p>But it didn’t.</p><h3>The Kind of Error That Breaks Your Flow</h3><p>Then I saw this:</p><pre>There exists no table monitoring.cpu in database monitoring</pre><p>That didn’t make sense.</p><p>I hadn’t:</p><ul><li>Created that table</li><li>Referenced it</li><li>Used it anywhere in my pipeline</li></ul><p>And yet, it kept appearing.</p><h3>When Debugging Goes in the Wrong Direction</h3><p>At first, I assumed the problem was in my Vector configuration.</p><p>So I checked:</p><ul><li>Transforms</li><li>Sinks</li><li>Table mappings</li></ul><p>Everything looked correct.<br>Which made it worse.<br>Because now there was no clear direction.</p><h3>The Turning Point: Logs</h3><p>This is where I stopped guessing and started looking at the system.</p><pre>sudo tail -f /var/log/clickhouse-server/clickhouse-server.err.log</pre><p>Logs don’t lie.</p><p>They don’t explain — but they point.</p><h3>The Actual Problem</h3><p>The issue wasn’t in my pipeline.</p><p>It was outside of it.</p><p>A previously used Telegraf process was still running.</p><p>Even after:</p><ul><li>Removing configs</li><li>Switching to Vector</li><li>Rebuilding everything</li></ul><p>The old process was still alive.</p><p>And still sending data.</p><h3>The Invisible Interference</h3><p>That meant two things were happening at once:</p><ul><li>My new Vector pipeline</li><li>An old Telegraf pipeline</li></ul><p>Both writing to the same system.<br>Which explained everything.<br>The error wasn’t coming from what I built.<br>It was coming from what I forgot to stop.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*B_Ss7wFlq1d3nM52rbH-ug.png" /></figure><h3>The Fix</h3><p>Simple.</p><p>But easy to miss.</p><pre>ps aux | grep telegraf</pre><p>And then:</p><pre>sudo systemctl stop telegraf</pre><p>That was it.</p><p>The error disappeared.</p><h3>The Real Lesson</h3><p>This wasn’t a configuration issue.</p><p>It was a system issue.</p><blockquote><em>You were debugging the wrong pipeline.</em></blockquote><p>And that’s more common than it sounds.</p><h3>What This Changes</h3><p>After this, my approach to debugging changed completely.</p><p>I stopped asking:</p><blockquote><em>“What’s wrong with my config?”</em></blockquote><p>And started asking:</p><blockquote><em>“What is actually running right now?”</em></blockquote><h3>The Bigger Picture</h3><p>Data pipelines don’t exist in isolation.</p><p>They live inside systems.</p><p>And systems have:</p><ul><li>History</li><li>State</li><li>Residue</li></ul><p>Ignoring that leads to confusion.</p><h3>Series Recap</h3><ul><li><a href="https://blog.dataengineerthings.org/metrics-clickhouse-and-the-simple-setup-that-wasnt-fc123e6b190f">Part 1: Metrics, ClickHouse, and the “Simple” Setup That Wasn’t</a></li><li>Part 2: Vector Pipelines and the Mental Shift from Configs to Flow</li><li>Part 3: Writing transforms and handling data correctly (this post)</li><li>Part 4: Debugging the pipeline and making it work</li></ul><h3>Final Thought</h3><p>The hardest part wasn’t learning a new tool.</p><p>It was learning how to think about the system.</p><blockquote><em>Tools don’t break pipelines. Assumptions do.</em></blockquote><h3>A message from our Founder</h3><p>Hey, <a href="https://linkedin.com/in/sunilsandhu">Sunil</a> here. I wanted to take a moment to thank you for reading until the end and for being a part of this community. Did you know that our team run these publications as a volunteer effort to over 3.5m monthly readers? We don’t receive any funding, we do this to support the community.</p><p>If you want to show some love, please take a moment to follow me on <a href="https://linkedin.com/in/sunilsandhu">LinkedIn</a>, <a href="https://tiktok.com/@messyfounder">TikTok</a>, <a href="https://instagram.com/sunilsandhu">Instagram</a>. You can also subscribe to our <a href="https://newsletter.plainenglish.io/">weekly newsletter</a>. And before you go, don’t forget to clap and follow the writer️!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fa277955240b" width="1" height="1" alt=""><hr><p><a href="https://blog.stackademic.com/debugging-a-pipeline-that-should-have-worked-fa277955240b">Debugging a Pipeline That “Should Have Worked”</a> was originally published in <a href="https://blog.stackademic.com">Stackademic</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Writing Transforms in Vector: Where the Pipeline Actually Breaks]]></title>
            <link>https://blog.stackademic.com/writing-transforms-in-vector-where-the-pipeline-actually-breaks-0bff5bd283e3?source=rss-1f4643fe3841------2</link>
            <guid isPermaLink="false">https://medium.com/p/0bff5bd283e3</guid>
            <category><![CDATA[vector]]></category>
            <category><![CDATA[clickhouse]]></category>
            <category><![CDATA[observability]]></category>
            <category><![CDATA[vrl]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Mohamed Hussain S]]></dc:creator>
            <pubDate>Wed, 06 May 2026 07:06:35 GMT</pubDate>
            <atom:updated>2026-05-09T12:08:39.160Z</atom:updated>
            <content:encoded><![CDATA[<p><em>Part 3 of a series on building a metrics pipeline into ClickHouse<br></em><a href="https://medium.com/p/ac85e9087570"><em>Read Part 2: Vector Pipelines and the Mental Shift from Configs to Flow</em></a></p><h3>Where the Architecture Stops Helping</h3><p>By this point, things finally made sense.</p><p>I had moved away from the “Telegraf trap.”<br>I understood the Vector pipeline model.<br>Sources, transforms, sinks — it all felt clean.</p><p>For the first time, the system didn’t feel like a black box.</p><p>And yet, the pipeline still wasn’t working.</p><p>That’s when it became obvious:</p><blockquote><em>The structure was correct. The data wasn’t.</em></blockquote><h3>The Reality of Raw Metrics</h3><p>Up until now, I had been thinking at a structural level:</p><ul><li>Where data comes from</li><li>Where it goes</li><li>How it flows</li></ul><p>But ClickHouse doesn’t care about your architecture.</p><p>It cares about <strong>data shape</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eWwR2RaEzeBIBoZ47SZxnw.png" /></figure><p>And raw metrics are messy.</p><p>They come in:</p><ul><li>Different types</li><li>Different formats</li><li>Different structures</li></ul><p>Even when everything <em>looks</em> correct, a single mismatch can break ingestion.</p><p>No errors. No warnings. Just… nothing.</p><h3>Enter the Transformation Layer</h3><p>This is where Vector becomes both powerful — and unforgiving.</p><p>Transforms are written using <strong>Vector Remap Language</strong>.</p><p>At a glance, VRL feels simple.</p><p>In practice, it behaves more like a strict compiler than a scripting language.</p><ul><li>Types must match</li><li>Fields must exist</li><li>Conversions must be explicit</li></ul><p>There is no “close enough.”</p><blockquote><em>Either the data is correct, or it doesn’t flow.</em></blockquote><h3>The Timestamp Problem</h3><p>If there was one issue that kept breaking everything, it was this:</p><p><strong>timestamps.</strong></p><p>It sounds trivial. It wasn’t.</p><p>The incoming metrics had timestamps.<br>ClickHouse expects timestamps.</p><p>But not <em>those</em> timestamps.</p><p>They needed to be:</p><ul><li>Parsed correctly</li><li>Converted explicitly</li><li>Stored in the expected format</li></ul><p>The fix looked like this:</p><pre>.timestamp = to_unix_timestamp!(parse_timestamp!(.timestamp, &quot;%+&quot;))</pre><p>One line.</p><p>Hours of debugging.</p><h3>When Data Doesn’t Match the System</h3><p>The deeper issue wasn’t just timestamps.</p><p>It was the realization that:</p><blockquote><em>Raw metrics are not designed for your database.</em></blockquote><p>They are designed for collection — not storage.</p><p>So everything had to be reshaped.</p><p>For host metrics, that meant:</p><ul><li>Filtering only relevant metrics</li><li>Renaming fields to match schema</li><li>Converting values into consistent numeric types</li></ul><p>For GPU metrics, it was worse.</p><h3>The GPU Problem</h3><p>GPU metrics didn’t come as structured data.</p><p>They came as a CSV string from nvidia-smi.</p><p>Something like:</p><pre>0, Tesla T4, 35, 20, 4000, 15109, 11109, 65</pre><p>Which meant:</p><ul><li>Parse the string</li><li>Split it into fields</li><li>Clean whitespace</li><li>Convert everything manually</li></ul><p>And then, reshape it.</p><p>Because even after parsing, the data still wasn’t usable.</p><h3>From Wide Data to Rows</h3><p>One of the biggest shifts was how metrics were stored.</p><p>Initially, it was tempting to keep everything in a single record:</p><pre>cpu, memory, disk, temperature</pre><p>But that doesn’t scale well in ClickHouse.</p><p>So the data had to be transformed into a row-based format:</p><pre>metric_name = &quot;cpu&quot;, value = ...<br>metric_name = &quot;memory&quot;, value = ...</pre><p>Each metric becomes its own row.</p><p>This wasn’t just a formatting decision.</p><p>It changed:</p><ul><li>How queries are written</li><li>How aggregations work</li><li>How flexible the system becomes</li></ul><h3>The Iteration Loop</h3><p>Most of the real work happened here:</p><pre>Write transform → Run → Nothing works → Check → Fix → Repeat</pre><p>Not once.<br>Not twice.<br>Multiple times.</p><p>Each iteration uncovered something new:</p><ul><li>A missing field</li><li>A wrong type</li><li>A format mismatch</li></ul><p>This wasn’t configuration anymore.</p><p>This was <strong>data engineering</strong>.</p><h3>When It Finally Clicked</h3><p>At some point, things stopped breaking.</p><p>Data started flowing.</p><p>ClickHouse stopped rejecting inserts.</p><p>Queries started returning meaningful results.</p><p>Nothing “magical” happened.</p><p>The data just… matched the system.</p><h3>What This Changes</h3><p>This part of the pipeline taught me something important:</p><blockquote><em>Pipelines don’t fail at collection. They fail at transformation.</em></blockquote><p>You can have:</p><ul><li>The right tools</li><li>The right architecture</li><li>The right setup</li></ul><p>And still fail — because the data isn’t shaped correctly.</p><h3>What’s Next</h3><p>At this point, the pipeline was <em>almost</em> working.</p><p>But there was still one problem left.<br>A confusing one.<br>An error that didn’t make sense.<br>And it had nothing to do with Vector.</p><p>In the final part, I’ll walk through:</p><ul><li>How I debugged the pipeline</li><li>What ClickHouse logs revealed</li><li>And a mistake that had nothing to do with the code — but broke everything</li></ul><h3>Series Overview</h3><ul><li>Part 1: Metrics, ClickHouse, and the “Simple” Setup That Wasn’t</li><li>Part 2: Vector Pipelines and the Mental Shift from Configs to Flow</li><li>Part 3: Writing transforms and handling data correctly (this post)</li><li><a href="https://medium.com/p/fa277955240b">Part 4: Debugging the pipeline and making it work</a></li></ul><h3>Final Thought</h3><p>The pipeline didn’t start working when I changed tools.</p><p>It started working when I understood the data.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=0bff5bd283e3" width="1" height="1" alt=""><hr><p><a href="https://blog.stackademic.com/writing-transforms-in-vector-where-the-pipeline-actually-breaks-0bff5bd283e3">Writing Transforms in Vector: Where the Pipeline Actually Breaks</a> was originally published in <a href="https://blog.stackademic.com">Stackademic</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Vector Pipelines and the Mental Shift from Configs to Flow]]></title>
            <link>https://blog.stackademic.com/vector-pipelines-and-the-mental-shift-from-configs-to-flow-ac85e9087570?source=rss-1f4643fe3841------2</link>
            <guid isPermaLink="false">https://medium.com/p/ac85e9087570</guid>
            <category><![CDATA[observability]]></category>
            <category><![CDATA[clickhouse]]></category>
            <category><![CDATA[backend]]></category>
            <category><![CDATA[vector]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Mohamed Hussain S]]></dc:creator>
            <pubDate>Mon, 04 May 2026 07:41:41 GMT</pubDate>
            <atom:updated>2026-05-09T12:03:46.439Z</atom:updated>
            <content:encoded><![CDATA[<p><em>This post is Part 2 of a series on building a metrics pipeline into ClickHouse. Read </em><a href="https://medium.com/p/fc123e6b190f"><em>Part 1: Metrics, ClickHouse, and the “Simple” Setup That Wasn’t.</em></a></p><h3><strong>Picking Up Where Things Broke</strong></h3><p>In the previous part, I talked about the “Telegraf Trap” — the moment I realized that writing a configuration file isn’t the same as building a resilient system. I was fighting the tool rather than focusing on the data.</p><p>The biggest issue wasn’t just a lack of plugins. It was a lack of visibility. I had reached a point where I didn’t have enough control over how data moved from point A to point B.</p><p>I needed to move from a “run-and-hope” mindset to a <strong>pipeline mindset.</strong></p><h3><strong>Why Vector?</strong></h3><p>I came across Vector while looking for something that treated data as a first-class citizen.</p><p>If Telegraf felt like a Swiss Army knife, Vector felt like a <strong>factory floor.</strong> Instead of a single monolithic configuration that tries to do everything at once, Vector is built around an explicit <strong>pipeline model.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6QjAprFxG6JUCrRNtG3smA.png" /></figure><p>This shift in architecture changed how I approached the problem.</p><h3>The Core Idea: The Directed Acyclic Graph (DAG)</h3><p>In Part 1, I mentioned that I needed a model that looked like:</p><blockquote><em>Data source → Transformation → Destination</em></blockquote><p>Vector formalizes this into three distinct components:</p><ul><li><strong>Sources:</strong> Where the data begins its life.</li><li><strong>Transforms:</strong> The logic layer where you shape, filter, and fix data.</li><li><strong>Sinks:</strong> The final destination (in my case, ClickHouse).</li></ul><p>By separating these, Vector makes the flow of data explicit. There’s no guessing what happens to a metric between the collector and the database you have to define it.</p><h3>From Config Files to Data Flow</h3><p>With Telegraf, my workflow was a repetitive loop of frustration:</p><p><strong><em>Write config -&gt; Run</em> -&gt; <em>Debug errors.</em></strong></p><p>With Vector, the process became an architectural design:</p><p><strong><em>Collect -&gt; Transform -&gt; Route -&gt; Store.</em></strong></p><p>The focus moved from: <em>“What parameters does this plugin need?”</em> To: <em>“How must this data evolve at each stage to be useful?”</em></p><h3>The New Learning Curve</h3><p>Of course, a new mental model doesn’t mean zero friction. Moving to Vector introduced its own set of challenges. It wasn’t just about learning a new tool; it was about learning to be an architect.</p><p>More importantly, I realized that <strong>the pipeline only works if every stage is defined with absolute precision.</strong> Early on, I hit the “Silent Failure” wall:</p><ul><li>Data would enter the source but never reach the sink.</li><li>A single type mismatch in a transform would stall the entire flow.</li><li>The pipeline felt like a high-performance engine: when it works, it’s beautiful; when a single bolt is loose, nothing moves.</li></ul><h3>The Realization: Everything Is Connected</h3><p>One lesson I learned quickly: <strong>In a pipeline, you cannot treat components as independent islands.</strong></p><p>In my previous setup, I viewed the “collector” and the “database” as two separate problems. In Vector, they are part of the same nervous system. A bad transform doesn’t just produce bad data — it breaks the end-to-end flow.</p><p>This forced me to stop thinking about individual “tools” and start thinking about <strong>the integrity of the stream.</strong></p><h3>What Improved Immediately</h3><p>Despite the initial configuration hurdles, three things became clear:</p><ol><li><strong>Observability:</strong> I could finally see exactly where data was being dropped.</li><li><strong>Granularity:</strong> I had surgical control over the metrics before they hit ClickHouse.</li><li><strong>Flexibility:</strong> I could finally shape the data to fit my schema, rather than forcing my schema to fit the tool.</li></ol><h3>The Hardest Part Was Still Coming</h3><p>I had the structure. I had the pipeline. But one massive hurdle remained: <strong>The Data Shape.</strong></p><p>ClickHouse is a powerful beast, but it is a fastidious one. It demands data in a very specific format, and this is where the “Transformation” stage of the pipeline becomes the most critical part of the entire architecture.</p><h3>What’s Next</h3><p>In Part 3, we get into the “heavy lifting”:</p><ul><li>A first look at Vector Remap Language (VRL) for data shaping</li><li>Solving the strict data type requirements of ClickHouse.</li><li>Fixing timestamp issues and shaping metrics for high-speed ingestion.</li></ul><h3>Series Overview</h3><ul><li>Part 1: <a href="https://medium.com/p/fc123e6b190f">Metrics, ClickHouse, and the “Simple” Setup That Wasn’t</a></li><li><strong>Part 2: Vector Pipelines and the Mental Shift from Configs to Flow (This post)</strong></li><li><a href="https://blog.stackademic.com/writing-transforms-in-vector-where-the-pipeline-actually-breaks-0bff5bd283e3?postPublishedType=repub">Part 3: Writing transforms and handling data correctly</a></li><li><a href="https://medium.com/p/fa277955240b">Part 4: Debugging the pipeline and making it work</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ac85e9087570" width="1" height="1" alt=""><hr><p><a href="https://blog.stackademic.com/vector-pipelines-and-the-mental-shift-from-configs-to-flow-ac85e9087570">Vector Pipelines and the Mental Shift from Configs to Flow</a> was originally published in <a href="https://blog.stackademic.com">Stackademic</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Metrics, ClickHouse, and the “Simple” Setup That Wasn’t]]></title>
            <link>https://blog.stackademic.com/metrics-clickhouse-and-the-simple-setup-that-wasnt-fc123e6b190f?source=rss-1f4643fe3841------2</link>
            <guid isPermaLink="false">https://medium.com/p/fc123e6b190f</guid>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[observability]]></category>
            <category><![CDATA[clickhouse]]></category>
            <dc:creator><![CDATA[Mohamed Hussain S]]></dc:creator>
            <pubDate>Fri, 10 Apr 2026 00:22:20 GMT</pubDate>
            <atom:updated>2026-05-09T11:48:07.270Z</atom:updated>
            <content:encoded><![CDATA[<blockquote>Part 1: Why I walked away from Telegraf and shifted my mental model on observability.</blockquote><p>Collecting metrics is easy. Shipping them to an analytical database without losing your mind? That’s the hard part.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7nmyRyNNVfA6gvevEvTKcg.png" /></figure><p>We’ve all been there. You start with a straightforward requirement: <em>“Collect system metrics (CPU, memory, GPU) and store them in ClickHouse for analysis.”</em> It’s a classic observability use case. You collect, you send, you query. Simple, right?</p><p>But as I quickly learned, there is a massive gap between “running a tool” and “building a system.”</p><h3>The Initial Approach: The Telegraf Trap</h3><p>I started where most people do: <strong>Telegraf</strong>.</p><p>It’s the industry standard for a reason. It’s plugin-based, mature, and has a massive community. This was also my formal introduction to TOML configuration. At first, the mindset was: <em>“I just need to write a config and hit run.”</em></p><p>But configuration isn’t just syntax — it’s the DNA of your system’s behavior.</p><p>My goals were modest:</p><ol><li>Capture host-level metrics.</li><li>Monitor GPU performance.</li><li>Pipe it all into <strong>ClickHouse</strong> for high-speed OLAP queries.</li><li>Prepare it for dashboard.</li></ol><h3>Where the Gears Started Grinding</h3><p>On paper, Telegraf is a Swiss Army knife. In practice, I found myself trying to cut down a tree with a corkscrew. I hit four specific walls:</p><ul><li><strong>The ClickHouse Gap:</strong> There wasn’t a native, “it-just-works” output plugin for ClickHouse that handled the schema the way I envisioned.</li><li><strong>The “Black Box” Problem:</strong> Debugging why a metric didn’t arrive where it should have felt unintuitive.</li><li><strong>Rigidity:</strong> As my requirements grew, my TOML file became a brittle monolith.</li><li><strong>Tool vs. Task:</strong> I was spending 80% of my time fighting the tool and only 20% thinking about my data.</li></ul><h3>The Shift: From Configs to Pipelines</h3><p>This is where the lightbulb finally flickered on. I had been thinking in a linear, static way:</p><p>Write config → Run tool → Expect output</p><p>That’s a recipe for brittle infrastructure. I needed to stop thinking about “tools” and start thinking about <strong>Data Flow</strong>. I needed a mental model that looked like this:</p><p>Data source → Transformation → Destination</p><p>The problem wasn’t that Telegraf was “bad” — it was that I lacked control over the <strong>Transformation</strong> step. I needed a system that treated data like a moving stream, not a static delivery.</p><h3>Why I Switched to Vector</h3><p>I realized that for a ClickHouse-backed pipeline, I needed three things: <strong>Granular control, visibility, and flexibility.</strong></p><p>That search led me to <strong>Vector</strong>. Unlike traditional collectors, Vector doesn’t just “send” data; it builds a directed acyclic graph (DAG) of your information. It treats configuration as a <strong>pipeline</strong>.</p><h3>Coming up in Part 2…</h3><p>In the next post, I’ll dive into the technical implementation of Vector, why the <em>sources → transforms → sinks</em> model is a game-changer for ClickHouse, and how my architecture finally stabilized.</p><p><strong>The lesson so far?</strong> Tools don’t solve problems - understanding systems does.</p><p><a href="https://medium.com/p/ac85e9087570">Part 2 — Vector Pipelines and the Mental Shift from Configs to Flow</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fc123e6b190f" width="1" height="1" alt=""><hr><p><a href="https://blog.stackademic.com/metrics-clickhouse-and-the-simple-setup-that-wasnt-fc123e6b190f">Metrics, ClickHouse, and the “Simple” Setup That Wasn’t</a> was originally published in <a href="https://blog.stackademic.com">Stackademic</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[ClickHouse Full-Text Search in 2026: What Actually Works (and What Doesn’t)]]></title>
            <link>https://blog.stackademic.com/clickhouse-full-text-search-in-2026-what-actually-works-and-what-doesnt-dbb507cf5eee?source=rss-1f4643fe3841------2</link>
            <guid isPermaLink="false">https://medium.com/p/dbb507cf5eee</guid>
            <category><![CDATA[log-analytics]]></category>
            <category><![CDATA[modern-data-architecture]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[clickhouse]]></category>
            <category><![CDATA[full-text-search]]></category>
            <dc:creator><![CDATA[Mohamed Hussain S]]></dc:creator>
            <pubDate>Fri, 03 Apr 2026 08:55:41 GMT</pubDate>
            <atom:updated>2026-04-09T07:33:33.299Z</atom:updated>
            <content:encoded><![CDATA[<p>For years, the answer was simple:</p><p><em>Don’t use ClickHouse for full-text search.</em></p><p>If you needed “real” search, you ended up syncing your data to Elasticsearch or OpenSearch anyway. It was just part of the architecture.</p><p>But that assumption is starting to change.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2R6WBH2X29Kr-g-bn95FvQ.png" /></figure><p>With the introduction of inverted indices and native ranking capabilities, the question is no longer <em>if</em> ClickHouse can handle search — but <strong>how far you can push it before needing a dedicated search engine</strong>.</p><h3>What Do We Mean by Full-Text Search?</h3><p>Full-text search is not just about matching strings.</p><p>In a typical search system, it involves:</p><ul><li>Breaking text into tokens (words)</li><li>Mapping those tokens efficiently (inverted index)</li><li>Ranking results based on relevance (BM25 or similar)</li><li>Handling variations, ordering, and importance</li></ul><p>For example, searching:</p><pre>error connecting database</pre><p>A search engine doesn’t just look for exact matches. It understands relationships between words, ranks results, and returns the most relevant entries first.</p><p>Historically, this is where ClickHouse fell short.</p><h3>What Changed in 2026</h3><p>ClickHouse now provides native capabilities that significantly close this gap.</p><h3>Inverted Indices</h3><p>Instead of scanning entire datasets with LIKE queries, ClickHouse can now use inverted indices to map tokens directly to data locations.</p><pre>ALTER TABLE logs <br>ADD INDEX inv_idx message TYPE inverted(0) GRANULARITY 1;</pre><p>This allows queries to skip irrelevant data blocks and focus only on matching segments.</p><h3>Relevance Scoring (BM25)</h3><p>Search is not just about matching it’s about ranking.</p><p>ClickHouse now supports BM25-style scoring, allowing you to prioritize more relevant results.</p><pre>SELECT *,<br>       BM25(message) AS score<br>FROM logs<br>WHERE hasToken(message, &#39;error&#39;)<br>ORDER BY score DESC<br>LIMIT 10;</pre><p>This brings it closer to how traditional search engines behave.</p><h3>Token-Based Matching</h3><p>Functions like hasToken() introduce awareness of word boundaries.</p><p>This avoids issues like:</p><ul><li>matching “log” inside “biological”</li><li>or partial, unintended matches</li></ul><p>It’s a small detail, but an important one.</p><h3>Where ClickHouse Now Works Really Well</h3><p>There are specific scenarios where ClickHouse is no longer just “good enough” - it’s actually a strong choice.</p><h3>Log Analytics and Observability</h3><p>This is where things get interesting.</p><p>You can:</p><ul><li>Search logs for a specific error</li><li>Filter by time</li><li>Aggregate results</li></ul><p>All in a single query.</p><pre>SELECT count(*)<br>FROM logs<br>WHERE hasToken(message, &#39;error&#39;)<br>AND timestamp &gt;= now() - INTERVAL 1 HOUR;</pre><p>In many real-world setups, maintaining both ClickHouse and a search engine introduces unnecessary complexity — duplicate pipelines, higher costs, and operational overhead.</p><p>Consolidating these into a single system can simplify things significantly.</p><h3>Hybrid Queries (Search + Analytics)</h3><p>This is something traditional search engines struggle with.</p><pre>SELECT service, count(*)<br>FROM logs<br>WHERE hasToken(message, &#39;timeout&#39;)<br>GROUP BY service;</pre><p>Combining search with aggregations is where ClickHouse has a clear advantage.</p><h3>Architectural Simplicity</h3><p>Running multiple systems for analytics and search has always been an “accepted cost.”</p><p>But it comes with trade-offs:</p><ul><li>Data duplication</li><li>Pipeline complexity</li><li>Increased infrastructure overhead</li></ul><p>With newer capabilities, ClickHouse can now handle a larger share of that workload.</p><h3>Where It Still Falls Short</h3><p>Despite the progress, there are still clear limitations.</p><h3>Complex Linguistic Processing</h3><p>ClickHouse does not provide deep language features like:</p><ul><li>stemming</li><li>synonym expansion</li><li>advanced language models</li></ul><p>Dedicated search engines still handle this better.</p><h3>Fuzzy Matching and Auto-Correct</h3><p>Handling typos or approximate matches is still limited.</p><p>While functions like edit distance exist, they are not optimized for large-scale, real-time fuzzy search.</p><h3>Search-First Applications</h3><p>If search is your <strong>primary product</strong> — for example:</p><ul><li>e-commerce search</li><li>document discovery systems</li><li>user-facing search platforms</li></ul><p>Then specialized search engines still provide better control and tuning.</p><h3>ClickHouse vs Search Engines (A Practical View)</h3><p>A simple way to think about it:</p><p><strong>ClickHouse</strong></p><ul><li>Analytics-first</li><li>Strong aggregation performance</li><li>Now supports structured search</li></ul><p><strong>Search engines (Elasticsearch / OpenSearch)</strong></p><ul><li>Search-first</li><li>Advanced relevance tuning</li><li>Mature text processing ecosystem</li></ul><h3>So… Can ClickHouse Replace a Search Engine?</h3><p>In some cases, yes.</p><p>If your workload looks like:</p><ul><li>log analytics</li><li>internal tools</li><li>analytics-heavy search</li></ul><p>ClickHouse is now capable enough to handle both search and aggregation in one place.</p><p>But if your system depends heavily on:</p><ul><li>relevance tuning</li><li>fuzzy matching</li><li>language-aware search</li></ul><p>Then a dedicated search engine still makes sense.</p><h3>Final Thoughts</h3><p>The boundary between analytics and search is no longer as clear as it used to be.</p><p>ClickHouse has evolved from being “not suitable for search” to supporting meaningful full-text search capabilities for many real-world workloads.</p><p>It may not replace search engines everywhere.</p><p>But for a growing set of use cases, it can simplify your architecture — while still delivering the performance and flexibility you need.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=dbb507cf5eee" width="1" height="1" alt=""><hr><p><a href="https://blog.stackademic.com/clickhouse-full-text-search-in-2026-what-actually-works-and-what-doesnt-dbb507cf5eee">ClickHouse Full-Text Search in 2026: What Actually Works (and What Doesn’t)</a> was originally published in <a href="https://blog.stackademic.com">Stackademic</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[When Synthetic Data Lies: A Hidden Correlation Problem]]></title>
            <link>https://blog.stackademic.com/when-synthetic-data-lies-a-hidden-correlation-problem-13dd4a2611f5?source=rss-1f4643fe3841------2</link>
            <guid isPermaLink="false">https://medium.com/p/13dd4a2611f5</guid>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[clickhouse]]></category>
            <category><![CDATA[synthetic-data-generation]]></category>
            <dc:creator><![CDATA[Mohamed Hussain S]]></dc:creator>
            <pubDate>Wed, 01 Apr 2026 10:26:44 GMT</pubDate>
            <atom:updated>2026-04-02T07:30:23.053Z</atom:updated>
            <content:encoded><![CDATA[<p>While building a small analytics setup with ClickHouse and Superset, I generated synthetic data to test queries and dashboards.</p><p>At first, everything looked fine.</p><p>The distributions seemed reasonable.<br>The dashboards behaved as expected.</p><p>Nothing looked suspicious.</p><p>But as I increased the dataset size, something started to feel… off.</p><p>Revenue appeared heavily concentrated in a single country.<br>Some countries, in certain cases, had no purchases at all.</p><p>At a glance, it looked like a simple imbalance.</p><p>But the pattern was too consistent to ignore.</p><h3>It Didn’t Look Like a Query Problem</h3><p>The obvious assumption was that something was wrong with the queries.</p><p>So I checked everything:</p><ul><li>query logic</li><li>filters</li><li>materialized views</li><li>dashboard configurations</li></ul><p>Everything looked correct.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*D9mnfIqS5dzeKoNW.png" /></figure><p>Which left only one possibility:</p><blockquote><em>The issue wasn’t in how the data was queried — it was in how the data was generated.</em></blockquote><h3>Going Back to the Data</h3><p>Instead of relying on dashboards, I went back to the raw data.</p><p>A simple aggregation made things clearer:</p><ul><li>one country dominating purchases</li><li>another missing entirely</li></ul><p>Interestingly, the overall event distribution still looked normal.</p><p>Which made the issue harder to spot at first.</p><h3>The First Problem: Randomness That Wasn’t Random</h3><p>The initial data generation logic used rand() like this:</p><pre>multiIf(<br>    rand() % 100 &lt; 40, &#39;India&#39;,<br>    rand() % 100 &lt; 65, &#39;US&#39;,<br>    rand() % 100 &lt; 80, &#39;UK&#39;,<br>    rand() % 100 &lt; 90, &#39;Germany&#39;,<br>    &#39;UAE&#39;<br>)</pre><p>At a glance, this seems reasonable.</p><p>But there’s a subtle problem.</p><p>Each rand() call is evaluated independently.</p><p>So instead of generating a single random value and assigning a category, the logic evaluates a <em>new</em> random value at every step.</p><p>That small detail introduces unintended bias.</p><h3>Fixing That… and Creating a New Problem</h3><p>To make the data more stable, I switched to a deterministic approach:</p><pre>(number * 17) % 100 AS event_rand<br>(number * 29) % 100 AS country_rand</pre><p>This made the distributions predictable and easier to reason about.</p><p>But it also introduced a different issue — one that wasn’t obvious immediately.</p><p>Some countries now had <strong>zero purchases</strong>.</p><h3>The Real Problem: Hidden Correlation</h3><p>The issue wasn’t randomness anymore.</p><p>It was <strong>correlation</strong>.</p><p>Both event_type and country were derived from the same base value: number.</p><p>Even with different multipliers, they were still mathematically related.</p><p>Which meant:</p><ul><li>certain combinations could never occur</li><li>some values were implicitly “locked together”</li></ul><p>In this case, rows that produced &quot;purchase&quot; never aligned with rows that produced &quot;UAE&quot;.</p><p>So UAE ended up with zero purchases.</p><p>Not because of probability.</p><p>But because of how the data was constructed.</p><h3>What Was Actually Wrong</h3><p>The core issue wasn’t randomness.</p><p>It was <strong>lack of independence</strong>.</p><p>The variables in the dataset weren’t independent of each other.</p><p>And that’s enough to produce misleading analytics.</p><h3>Fixing the Data Properly</h3><p>To fix this, I changed how the values were generated:</p><ul><li>used different transformations</li><li>added offsets</li><li>ensured variables were independently distributed</li></ul><p>For example:</p><pre>(number * 17) % 100 AS event_rand<br>(number * 31 + 13) % 100 AS country_rand<br>(number * 47 + 23) % 100 AS device_rand</pre><p>This breaks the alignment between variables and restores independence.</p><p>After this change, the distributions behaved as expected.</p><h3>Why This Matters</h3><p>At smaller scales, the issue wasn’t obvious.</p><p>The data looked fine.<br>The dashboards looked correct.</p><p>But as the dataset grew:</p><ul><li>patterns became more consistent</li><li>biases became more visible</li><li>incorrect assumptions started to look like real insights</li></ul><h3>Key Takeaway</h3><blockquote><em>Synthetic data can look correct - and still be wrong.</em></blockquote><p>The problem wasn’t query performance.</p><p>It was data correctness.</p><p>Scaling the data didn’t create the issue.</p><p>It revealed it.</p><h3>Final Thoughts</h3><p>This was a useful reminder that:</p><ul><li>data generation deserves as much attention as querying</li><li>small assumptions can lead to large inconsistencies</li><li>“reasonable-looking” data isn’t always reliable</li></ul><p>It’s easy to trust what dashboards show.</p><p>But sometimes, it’s worth questioning how that data was created in the first place.</p><h3>Related</h3><p>This issue surfaced while building an analytics setup using ClickHouse and Superset, where I was also comparing raw tables and materialized views.</p><p>If you’re interested in that setup:</p><p>👉 <a href="https://dev.to/mohhddhassan/how-clickhouse-superset-work-together-for-analytics-and-what-actually-matters-597n">ClickHouse + Superset</a></p><h3>A message from our Founder</h3><p>Hey, <a href="https://linkedin.com/in/sunilsandhu">Sunil</a> here. I wanted to take a moment to thank you for reading until the end and for being a part of this community. Did you know that our team run these publications as a volunteer effort to over 3.5m monthly readers? We don’t receive any funding, we do this to support the community.</p><p>If you want to show some love, please take a moment to follow me on <a href="https://linkedin.com/in/sunilsandhu">LinkedIn</a>, <a href="https://tiktok.com/@messyfounder">TikTok</a>, <a href="https://instagram.com/sunilsandhu">Instagram</a>. You can also subscribe to our <a href="https://newsletter.plainenglish.io/">weekly newsletter</a>. And before you go, don’t forget to clap and follow the writer️!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=13dd4a2611f5" width="1" height="1" alt=""><hr><p><a href="https://blog.stackademic.com/when-synthetic-data-lies-a-hidden-correlation-problem-13dd4a2611f5">When Synthetic Data Lies: A Hidden Correlation Problem</a> was originally published in <a href="https://blog.stackademic.com">Stackademic</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Managing Large PostgreSQL Tables with Native Partitioning and pg_partman]]></title>
            <link>https://blog.stackademic.com/managing-large-postgresql-tables-with-native-partitioning-and-pg-partman-4e8138a68c39?source=rss-1f4643fe3841------2</link>
            <guid isPermaLink="false">https://medium.com/p/4e8138a68c39</guid>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[postgresql]]></category>
            <category><![CDATA[partitioning]]></category>
            <dc:creator><![CDATA[Mohamed Hussain S]]></dc:creator>
            <pubDate>Sun, 22 Mar 2026 04:18:10 GMT</pubDate>
            <atom:updated>2026-03-22T04:18:10.673Z</atom:updated>
            <content:encoded><![CDATA[<p>As databases grow, tables that store large volumes of time-based data tend to become harder to manage.</p><p>What starts as a simple table gradually turns into a bottleneck. Queries slow down, indexes grow larger, maintenance operations like VACUUM take longer, and even simple tasks like managing old data become cumbersome.</p><p>PostgreSQL provides a built-in solution for this problem in the form of <strong>table partitioning</strong>.</p><p>Partitioning helps break a large table into smaller, more manageable pieces. But while it improves performance and data organization, managing partitions manually can introduce its own set of operational challenges.</p><p>In this article, we’ll walk through how PostgreSQL’s native partitioning works, where it starts to become difficult to manage at scale, and how <strong>pg_partman</strong> helps automate the process.</p><h3>Understanding Native PostgreSQL Partitioning</h3><p>At its core, partitioning allows a single logical table to be split into multiple physical tables, known as partitions.</p><p>PostgreSQL supports three types of partitioning: range, list, and hash. For most real-world systems dealing with time-based data, <strong>range partitioning</strong> is the most commonly used approach.</p><p>Consider a simple example where we store application events with a timestamp.</p><pre>CREATE TABLE events (<br>    id BIGSERIAL,<br>    created_at TIMESTAMP,<br>    data JSONB<br>) PARTITION BY RANGE (created_at);</pre><p>Here, events acts as the parent table. The actual data does not live in this table directly. Instead, it is stored in child tables, each representing a specific range of time.</p><p>For example:</p><pre>CREATE TABLE events_2026_03_20<br>PARTITION OF events<br>FOR VALUES FROM (&#39;2026-03-20&#39;) TO (&#39;2026-03-21&#39;);</pre><pre>CREATE TABLE events_2026_03_21<br>PARTITION OF events<br>FOR VALUES FROM (&#39;2026-03-21&#39;) TO (&#39;2026-03-22&#39;);</pre><p>Each of these partitions is a real, physical table inside PostgreSQL. However, from the application’s perspective, there is still just one table: events.</p><h3>How Data Flows into Partitions</h3><p>One of the powerful aspects of PostgreSQL partitioning is that applications don’t need to be aware of individual partitions.</p><p>Data is always inserted into the parent table.</p><pre>INSERT INTO events (created_at, data)<br>VALUES (&#39;2026-03-21 10:15:00&#39;, &#39;{&quot;event&quot;:&quot;login&quot;}&#39;);</pre><p>PostgreSQL automatically determines which partition the row belongs to and routes it accordingly. In this case, the data would be written into the events_2026_03_21 partition.</p><p>Queries also continue to run against the parent table.</p><pre>SELECT * FROM events<br>WHERE created_at &gt;= now() - interval &#39;7 days&#39;;</pre><p>Behind the scenes, PostgreSQL uses a technique called <strong>partition pruning</strong>, which ensures that only the relevant partitions are scanned. Instead of scanning an entire large table, it limits the work to just the partitions that match the query conditions.</p><h3>Why Partitioning Makes a Difference</h3><p>The performance benefits of partitioning come from reducing the amount of data PostgreSQL needs to process.</p><p>If a table contains years of data, but a query only needs the last seven days, PostgreSQL can skip older partitions entirely. This significantly reduces query time and improves efficiency.</p><p>Partitioning also simplifies operational tasks. Instead of deleting millions of rows, old data can be removed instantly by dropping an entire partition.</p><pre>DROP TABLE events_2024_03_01;</pre><p>This is much faster and more efficient than running large delete operations on a single table.</p><p>Over time, this approach helps keep tables smaller, indexes more manageable, and maintenance operations faster.</p><h3>Where Native Partitioning Becomes Difficult</h3><p>While native partitioning is powerful, managing it manually introduces operational complexity.</p><p>Partitions need to be created ahead of time. If a partition is missing, inserts can fail. Old partitions need to be cleaned up regularly, and retention policies must be implemented manually.</p><p>A simple but common scenario looks like this:</p><pre>                   application inserts event at midnight<br>                                    ↓<br>                       new partition does not exist<br>                                    ↓<br>                              insert fails</pre><p>As systems scale and data grows, keeping track of partitions manually becomes error-prone and difficult to maintain.</p><p>This is where automation becomes important.</p><h3>Automating Partition Management with pg_partman</h3><p>This is where <strong>pg_partman</strong> comes in.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8tMf8htUS3-nCPH51wfAFw.png" /></figure><p>pg_partman is a PostgreSQL extension designed to automate the management of partitioned tables. Instead of manually creating and maintaining partitions, pg_partman handles it for you.</p><p>It builds on top of PostgreSQL’s native partitioning and provides features such as:</p><ul><li>automatic creation of future partitions</li><li>automatic retention and cleanup of old partitions</li><li>simplified maintenance workflows</li></ul><p>Rather than relying on external scripts or manual intervention, partition management becomes part of the database itself.</p><h3>How pg_partman Fits into the Workflow</h3><p>With pg_partman in place, the workflow becomes much simpler.</p><p>A parent table is configured once, and pg_partman takes care of maintaining its partitions over time.</p><p>For a table partitioned daily, pg_partman ensures that upcoming partitions are always created in advance.</p><pre>events_p2026_03_22<br>events_p2026_03_23<br>events_p2026_03_24</pre><p>This guarantees that new data always has a valid destination and prevents insert failures caused by missing partitions.</p><p>It can also handle retention policies, automatically removing older partitions when they are no longer needed.</p><h3>Why pg_partman Makes a Difference</h3><p>The biggest advantage of pg_partman is that it removes the operational burden of managing partitions manually.</p><p>Instead of worrying about creating partitions, scheduling cleanup jobs, or handling edge cases, the system becomes self-maintaining.</p><p>This leads to:</p><ul><li>fewer operational errors</li><li>consistent partition management</li><li>simpler database workflows</li></ul><p>It allows teams to focus on the data itself rather than the mechanics of maintaining partitions.</p><h3>When This Approach Works Best</h3><p>Partitioning combined with pg_partman is especially useful for workloads that deal with continuously growing, time-based data.</p><p>This includes:</p><ul><li>event logs</li><li>analytics pipelines</li><li>application activity tracking</li><li>time-series datasets</li></ul><p>In these systems, new data is constantly written, while older data gradually becomes less relevant. Automating partition management helps handle this lifecycle efficiently.</p><h3>Final Thoughts</h3><p>PostgreSQL’s native partitioning provides a strong foundation for managing large datasets.</p><p>However, as systems grow, the operational overhead of managing partitions manually becomes increasingly difficult.</p><p>Extensions like <strong>pg_partman</strong> build on top of PostgreSQL’s capabilities, make partition management much more practical in real-world systems.</p><p>By combining native partitioning with automation, it becomes possible to handle large, time-based datasets in a way that is both efficient and reliable.</p><h3>A message from our Founder</h3><p>Hey, <a href="https://linkedin.com/in/sunilsandhu">Sunil</a> here. I wanted to take a moment to thank you for reading until the end and for being a part of this community. Did you know that our team run these publications as a volunteer effort to over 3.5m monthly readers? We don’t receive any funding, we do this to support the community.</p><p>If you want to show some love, please take a moment to follow me on <a href="https://linkedin.com/in/sunilsandhu">LinkedIn</a>, <a href="https://tiktok.com/@messyfounder">TikTok</a>, <a href="https://instagram.com/sunilsandhu">Instagram</a>. You can also subscribe to our <a href="https://newsletter.plainenglish.io/">weekly newsletter</a>. And before you go, don’t forget to clap and follow the writer️!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4e8138a68c39" width="1" height="1" alt=""><hr><p><a href="https://blog.stackademic.com/managing-large-postgresql-tables-with-native-partitioning-and-pg-partman-4e8138a68c39">Managing Large PostgreSQL Tables with Native Partitioning and pg_partman</a> was originally published in <a href="https://blog.stackademic.com">Stackademic</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PostgreSQL Backups and Point-in-Time Recovery with pgBackRest]]></title>
            <link>https://blog.stackademic.com/postgresql-backups-and-point-in-time-recovery-with-pgbackrest-a215c1294d90?source=rss-1f4643fe3841------2</link>
            <guid isPermaLink="false">https://medium.com/p/a215c1294d90</guid>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[postgresql]]></category>
            <category><![CDATA[backend]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[database]]></category>
            <dc:creator><![CDATA[Mohamed Hussain S]]></dc:creator>
            <pubDate>Fri, 13 Mar 2026 15:11:53 GMT</pubDate>
            <atom:updated>2026-03-23T13:38:54.259Z</atom:updated>
            <content:encoded><![CDATA[<p>In the previous articles of this series, we explored how PostgreSQL records every change using <strong>Write-Ahead Logging (WAL)</strong> and how systems like Debezium can read those WAL records to stream database changes in real time.</p><p>But WAL is not only useful for streaming data changes.</p><p>It also enables one of the most powerful recovery mechanisms in PostgreSQL: <strong>Point-in-Time Recovery (PITR)</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5eceDZmIV3PojPFxCVIv3Q.png" /></figure><p>Instead of restoring a database only to the moment when a backup was taken, PITR allows PostgreSQL to reconstruct the database state <strong>at any specific point in time</strong>.</p><p>This capability becomes possible by combining two things:</p><ul><li>a <strong>base backup</strong> of the database</li><li>a continuous archive of <strong>WAL segments</strong></li></ul><p>In this article, we’ll look at how PostgreSQL backups work and how tools like pgBackRest simplify backup management and recovery workflows.</p><h3>Why Backups Alone Are Not Enough</h3><p>A common backup strategy is to take periodic full backups of the database.</p><p>Imagine a system where a full backup is taken every night at 02:00.</p><p>Later in the day, something goes wrong.</p><pre>02:00  → backup taken<br>14:00  → accidental DELETE executed</pre><p>If we restore the backup taken at 02:00, the database returns to that state.</p><p>But everything that happened between <strong>02:00 and 14:00 is lost</strong>.</p><p>This is the limitation of relying only on periodic backups.</p><p>PostgreSQL solves this problem by continuously recording changes in <strong>WAL</strong>. By archiving WAL segments, the database can replay changes and rebuild its state up to a specific moment.</p><p>This is the idea behind <strong>Point-in-Time Recovery</strong>.</p><h3>WAL and the Recovery Process</h3><p>When PostgreSQL restores a backup, it does not simply copy database files and start running again.</p><p>Instead, the recovery process happens in stages.</p><pre>                              Restore base backup<br>                                      ↓<br>                          Replay archived WAL segments<br>                                      ↓<br>                       Reach a consistent database state</pre><p>Because WAL records every change in chronological order, PostgreSQL can replay those changes to rebuild the database state exactly as it existed at a particular moment.</p><p>Backup tools such as pgBackRest manage this entire process.</p><h3>Managing Backups with pgBackRest</h3><p>pgBackRest is a widely used PostgreSQL backup tool designed for reliability and automation.</p><p>It integrates directly with PostgreSQL’s WAL archiving system and supports storing backups in multiple storage backends, including object storage.</p><p>In the setup used for testing, the architecture consisted of two containers:</p><pre>                         PostgreSQL + pgBackRest<br>                                    │<br>                                    ▼<br>                                  MinIO</pre><p>The PostgreSQL container runs the database along with the pgBackRest binary, while MinIO acts as an object storage system where backups and WAL archives are stored.</p><p>These options allow systems to balance backup frequency, storage usage, and recovery speed.</p><h3>Restoring the Database</h3><p>Restoring a PostgreSQL backup requires the database to be stopped.</p><p>However, in this Docker-based setup PostgreSQL and pgBackRest were running inside the <strong>same container</strong>. When PostgreSQL stops, the container stops as well, which means pgBackRest inside that container becomes inaccessible.</p><p>To work around this, a <strong>temporary container</strong> was started with the same PostgreSQL data volume. This allowed the restore operation to run while the database remained stopped.</p><p>The recovery workflow looked like this:</p><pre>                              Restore base backup<br>                                      ↓<br>                              Replay archived WAL<br>                                      ↓<br>                       Database reaches consistent state</pre><p>After the restore finished, the original PostgreSQL container was started again.</p><p>At first glance, everything seemed to work. But verifying the data revealed something unexpected.</p><h3>A PITR Behavior Observed During Testing</h3><p>The following sequence had occurred:</p><pre>table created<br>↓<br>backup taken<br>↓<br>table deleted<br>↓<br>restore base backup<br>↓<br>WAL replay<br>↓<br>table still deleted</pre><p>Even though the backup was taken <strong>before the table was deleted</strong>, the restored database still showed the table as deleted.</p><p>This happens because PostgreSQL automatically continues replaying <strong>all available WAL records</strong> during recovery.</p><p>Since the DROP TABLE operation had been recorded in WAL after the backup was taken, that change was replayed during recovery.</p><p>To prevent this behavior, the restore command used the option:</p><pre>--type=immediate</pre><p>This tells PostgreSQL to stop recovery <strong>immediately after reaching the backup point</strong>, rather than replaying all WAL records.</p><p>Once recovery stopped at the correct point, the table appeared again as expected.</p><p>Understanding how WAL replay works is essential when performing <strong>Point-in-Time Recovery</strong>, because PostgreSQL will always try to bring the database to the most recent consistent state unless instructed otherwise.</p><h3>Automating Backups</h3><p>Backups are typically automated using scheduled jobs.</p><p>In this setup, the following cron schedule was used:</p><pre>0 2 * * 0 pgbackrest --stanza=demo backup --type=full<br>0 2 * * 1-6 pgbackrest --stanza=demo backup --type=diff</pre><p>This configuration creates:</p><ul><li>a <strong>full backup every Sunday</strong></li><li><strong>differential backups during the rest of the week</strong></li></ul><p>This approach balances storage usage with recovery efficiency.</p><h3>Final Thoughts</h3><p>Throughout this series we’ve seen how <strong>Write-Ahead Logging (WAL)</strong> sits at the center of PostgreSQL’s design.</p><p>The same WAL mechanism powers several critical capabilities:</p><ul><li>crash recovery</li><li>replication</li><li>change data capture</li><li>backups and point-in-time recovery</li></ul><p>Tools like pgBackRest build on top of these mechanisms to provide reliable backup and restore workflows.</p><p>Understanding WAL not only helps explain how PostgreSQL ensures durability, but also how modern data systems build reliable pipelines and recovery strategies on top of it.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a215c1294d90" width="1" height="1" alt=""><hr><p><a href="https://blog.stackademic.com/postgresql-backups-and-point-in-time-recovery-with-pgbackrest-a215c1294d90">PostgreSQL Backups and Point-in-Time Recovery with pgBackRest</a> was originally published in <a href="https://blog.stackademic.com">Stackademic</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Debezium Uses PostgreSQL WAL for Change Data Capture]]></title>
            <link>https://blog.stackademic.com/how-debezium-uses-postgresql-wal-for-change-data-capture-a365b4ec4221?source=rss-1f4643fe3841------2</link>
            <guid isPermaLink="false">https://medium.com/p/a365b4ec4221</guid>
            <category><![CDATA[backend]]></category>
            <category><![CDATA[postgres]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[postgresql]]></category>
            <dc:creator><![CDATA[Mohamed Hussain S]]></dc:creator>
            <pubDate>Fri, 13 Mar 2026 08:57:31 GMT</pubDate>
            <atom:updated>2026-03-17T07:31:19.443Z</atom:updated>
            <content:encoded><![CDATA[<p>In the previous article, we explored what <strong>Write-Ahead Logging (WAL)</strong> is and why it sits at the center of PostgreSQL’s architecture.</p><p>Every change that happens inside PostgreSQL is first written to WAL before being applied to the actual table files.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QVxKrOk-xpglJ1-LfSvHHA.png" /></figure><p>This design ensures that the database can recover to a consistent state after a failure.</p><p>Because WAL records every change happening inside the database, it can also be used to <strong>stream those changes to other systems in real time</strong>.</p><p>This is exactly the idea behind <strong>Change Data Capture (CDC)</strong>.</p><p>Instead of repeatedly querying database tables to detect updates, CDC systems read the database’s change log and convert those changes into events that other systems can consume.</p><p>One of the most widely used CDC platforms for PostgreSQL is Debezium.</p><p>Debezium reads PostgreSQL’s WAL and turns database operations into a continuous stream of events that can be processed by downstream systems.</p><h3>Why Applications Need Change Data Capture</h3><p>Modern applications rarely operate in isolation.</p><p>A single database update might need to trigger actions across several systems.</p><p>Consider a simple update:</p><pre>UPDATE users<br>SET name = &#39;Alice Cooper&#39;<br>WHERE id = 1;</pre><p>This change might need to propagate to multiple places.</p><p>For example, an application might need to:</p><ul><li>update a search index</li><li>refresh analytics dashboards</li><li>synchronize data into a data warehouse</li><li>trigger downstream workflows</li></ul><p>A common approach is to periodically query the database to detect changes.</p><p>But this approach quickly runs into problems.</p><p>Polling databases repeatedly increases load, introduces delays, and becomes difficult to scale as systems grow.</p><p>Change Data Capture solves this problem by <strong>streaming database changes as they happen</strong>.</p><h3>Why WAL Makes CDC Possible</h3><p>PostgreSQL records every modification inside WAL before writing the change to the actual data files.</p><p>This means WAL effectively contains a <strong>chronological record of everything that happens inside the database</strong>.</p><p>Because of this, CDC systems do not need to read tables directly.</p><p>Instead, they simply read the WAL stream.</p><p>A simplified view of the process looks like this:</p><pre>                              Application query<br>                                    ↓<br>                        PostgreSQL writes change to WAL<br>                                    ↓<br>                            CDC system reads WAL<br>                                    ↓<br>                           Change event generated</pre><p>This approach is both efficient and reliable because the CDC system processes changes <strong>in the exact order they occurred</strong>.</p><p>Debezium follows this approach.</p><p>Rather than querying tables, it reads PostgreSQL’s WAL and converts the changes into structured events.</p><h3>Enabling Logical Decoding in PostgreSQL</h3><p>For external systems to interpret WAL records, PostgreSQL must include enough information about row-level changes.</p><p>This is controlled by the wal_level configuration.</p><pre>wal_level = logical</pre><p>PostgreSQL supports several WAL levels.</p><ul><li><strong>minimal - </strong>used only for crash recovery</li><li><strong>replica - </strong>supports physical replication</li><li><strong>logical - </strong>includes additional metadata required for logical decoding</li></ul><p>When the WAL level is set to <strong>logical</strong>, PostgreSQL records enough information for external systems to reconstruct database changes.</p><p>This capability is known as <strong>logical decoding</strong>.</p><h3>Replication Slots and Tracking WAL Consumption</h3><p>When Debezium connects to PostgreSQL, it creates something called a <strong>replication slot</strong>.</p><p>A replication slot is a mechanism that allows PostgreSQL to track how far a consumer has read the WAL stream.</p><p>This ensures that WAL segments are not removed before the CDC system has processed them.</p><p>The process works like this:</p><pre>                            Debezium reads WAL<br>                                    ↓<br>                  PostgreSQL records last consumed position<br>                                    ↓<br>                     WAL segments retained until consumed</pre><p>The position in WAL is identified using a value called the <strong>Log Sequence Number (LSN)</strong>.</p><p>Because the replication slot tracks the last processed LSN, Debezium can resume streaming changes from the correct position even if the connector restarts.</p><h3>Logical Decoding Plugins</h3><p>WAL records are stored in a binary format that is not directly readable.</p><p>To convert these records into logical database changes, PostgreSQL uses <strong>logical decoding plugins</strong>.</p><p>These plugins interpret WAL records and transform them into a structured format.</p><p>Some commonly used plugins include:</p><ul><li>pgoutput</li><li>wal2json</li><li>decoderbufs</li></ul><p>The default plugin used by PostgreSQL is <strong>pgoutput</strong>, which is also commonly used with Debezium.</p><h3>How Debezium Streams Database Changes</h3><p>Once everything is configured, the change data capture pipeline works as follows.</p><p>An application writes data to PostgreSQL.</p><p>PostgreSQL records the change in WAL.</p><p>The logical decoding plugin interprets the WAL records.</p><p>Debezium reads those decoded changes through PostgreSQL’s replication protocol.</p><p>The connector then publishes the changes as events to a messaging system such as Kafka.</p><p>A simplified architecture might look like this:</p><pre>                                 PostgreSQL<br>                                     ↓<br>                                 Debezium<br>                                     ↓<br>                                   Kafka<br>                                     ↓<br>                                 Consumers </pre><p>Downstream systems can then consume these events and react accordingly.</p><p>This architecture enables event-driven systems where database changes propagate automatically to other services.</p><h3>Example Change Event</h3><p>When Debezium captures a change from WAL, it converts that change into a structured event.</p><p>A simplified example might look like this:</p><pre>{<br>  &quot;op&quot;: &quot;c&quot;,<br>  &quot;table&quot;: &quot;users&quot;,<br>  &quot;before&quot;: null,<br>  &quot;after&quot;: {<br>    &quot;id&quot;: 1,<br>    &quot;name&quot;: &quot;Alice&quot;<br>  }<br>}</pre><p>The op field represents the operation type:</p><ul><li><strong>c -</strong> create</li><li><strong>u - </strong>update</li><li><strong>d -</strong> delete</li></ul><p>Applications consuming these events can use them to trigger additional processing.</p><h3>WAL as the Foundation for Multiple Systems</h3><p>One interesting aspect of PostgreSQL’s design is that <strong>many different systems rely on the same WAL stream</strong>.</p><p>The same WAL records used for CDC are also used for:</p><ul><li>database replication</li><li>backup systems</li><li>point-in-time recovery</li></ul><p>In many ways, WAL acts as the <strong>complete history of changes in a PostgreSQL database</strong>.</p><h3>Final Thoughts</h3><p>Change Data Capture allows systems to react to database changes without constantly querying tables.</p><p>PostgreSQL makes this possible because every modification is recorded in WAL.</p><p>Tools like Debezium simply read that stream of changes and convert it into events that other systems can consume.</p><p>This makes it possible to build scalable, event-driven architectures directly on top of relational databases.</p><h3>Next in This Series</h3><p>In the next article, we’ll explore how PostgreSQL WAL is used for <strong>database backups and point-in-time recovery</strong> using tools like pgBackRest.</p><p>We’ll see how WAL archiving enables restoring a database to a specific moment in time.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a365b4ec4221" width="1" height="1" alt=""><hr><p><a href="https://blog.stackademic.com/how-debezium-uses-postgresql-wal-for-change-data-capture-a365b4ec4221">How Debezium Uses PostgreSQL WAL for Change Data Capture</a> was originally published in <a href="https://blog.stackademic.com">Stackademic</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>