Stories by Mohamed Hussain S on Medium

PostgreSQL Restore Failures: Stop Fighting Recovery Logic

Mohamed Hussain S — Mon, 18 May 2026 06:30:14 GMT

PostgreSQL Restore Failures: It Wasn’t pgBackRest, It Was My Recovery Logic

I was building what I thought was a foolproof PostgreSQL backup and restore workflow using pgBackRest. The plan was simple: automate backups, automate restores, validate the data, and make recovery predictable.

Instead, I ended up repeatedly breaking PostgreSQL recovery itself.

At one point, the database refused to start entirely. Applications crashed, authentication failed, and my logs were a graveyard of cryptic errors:

invalid checkpoint record
could not locate a valid checkpoint record at 0/DEAD

Shortly after, I hit timeline mismatch errors that made my head spin: ERROR: [058]: target timeline 3 forked from backup timeline 2.

My first instinct? “pgBackRest is corrupting my restores.” I was wrong. The real culprit wasn’t the tool — it was the way I was handling the recovery logic.

The Architecture of a Failure

I moved the testing to my local machine to isolate the issue. I wanted a clean environment where I could trigger restores via API workflows. My logic followed a standard script:

Download the backup repository.
Stop PostgreSQL.
Restore the backup.
Start PostgreSQL.
Validate the database.

It sounds straightforward, but this is where the “hidden” mechanics of PostgreSQL began to clash with my automation.

My Fatal Assumption: “Cleaning Up” the Garbage

Every time a restore failed, I noticed a few files sitting in my data directory: backup_label, recovery.signal, and standby.signal.

To my untrained eye, these looked like leftover artifacts from a crashed process — debris that needed to be cleared for a “clean” start. I added a line to my automation that I thought was clever:

# The line that broke everything
rm -f recovery.signal standby.signal backup_label

I genuinely believed I was helping PostgreSQL start with a blank slate. In reality, I was deleting the exact recovery metadata PostgreSQL needed to survive.

What PostgreSQL Was Actually Trying To Do

This was my turning point. I realized pgBackRest wasn’t writing “junk” files; it was orchestrating a delicate handoff.

backup_label: This tells PostgreSQL exactly where the recovery WAL (Write-Ahead Log) replay should begin.
recovery.signal: This acts as a flag telling PostgreSQL to enter recovery mode rather than normal startup.
WAL Replay: This is the process of reconstructing a consistent state from logs.

By deleting these files, I was essentially blindfolding PostgreSQL and asking it to run through a minefield. It didn’t know where it was in time, so it panicked.
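
A safer habit than deleting these files is to look at them and log what is there. The snippet below is a minimal sketch of that idea, not part of my original automation: the helper name recovery_markers is hypothetical, and it only reports which marker files exist without touching them.

```python
from pathlib import Path

# Marker files named in the text; PostgreSQL and pgBackRest manage these.
MARKERS = ("backup_label", "recovery.signal", "standby.signal")


def recovery_markers(data_dir):
    """Return the marker files present in data_dir, without touching them."""
    root = Path(data_dir)
    return [name for name in MARKERS if (root / name).is_file()]
```

Logging the result before starting PostgreSQL makes it obvious when a restore has left recovery metadata in place on purpose.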

The “Timeline Mismatch” Rabbit Hole

As I kept “fixing” the restores, I hit a new wall: Timeline Mismatches.

ERROR: [058]: target timeline 3 forked from backup timeline 2

PostgreSQL backups aren’t just static snapshots; they are living histories. Because my earlier failed attempts had partially started the database, I had inadvertently created “forked” histories. I wasn’t just debugging a restore anymore; I was debugging a fractured timeline of events.

The Lesson: PostgreSQL backups are tightly coupled with WAL history. If you break the continuity of the recovery, you break the future of your backups.

The Fix: Stop Fighting the Engine

The biggest realization was simple: Stop trying to “help” PostgreSQL recover.

Instead of manual cleanup, I pivoted to a “scorched earth” approach for the data directory and let the official tools do their jobs.

The Corrected Flow:

Stop PostgreSQL cleanly.
Completely empty the data directory (don’t just overwrite; delete everything).
Run pgBackRest restore with --type=immediate.
Let PostgreSQL start and manage its own internal recovery.
Promote the instance only after it confirms readiness.

The addition of --type=immediate was the final piece of the puzzle. It told pgBackRest to stop recovery as soon as the database reached a consistent state, which is the end of the backup, instead of replaying every archived WAL segment. That kept the WAL replay valid and predictable.

# The stable approach
self._empty_directory(self.pg_data_dir)
self._run_pgbackrest("restore", "--type=immediate")
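
For readers who want to see the shape of those helpers, here is a rough sketch under stated assumptions. The function names, the stanza name passed by the caller, and the use of pg_ctl to stop and start the server are all illustrative choices (many installations use systemctl instead); the commands are only built, not executed.

```python
import shutil
from pathlib import Path


def empty_directory(path):
    """Delete everything inside path but keep the directory itself."""
    for entry in Path(path).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()


def restore_commands(stanza, data_dir):
    """Build the commands for the corrected flow, in order, without running them."""
    return [
        ["pg_ctl", "stop", "-D", str(data_dir), "-m", "fast"],
        ["pgbackrest", f"--stanza={stanza}", "--type=immediate", "restore"],
        ["pg_ctl", "start", "-D", str(data_dir)],
    ]
```

A caller would run the first command, call empty_directory on the data directory, and then run the remaining two, for example with subprocess.run(cmd, check=True) so that any failure stops the flow immediately.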

Hard-Learned Takeaways

If you are managing PostgreSQL restores, keep these rules in mind to avoid the week of debugging I endured:

backup_label is sacred: Never delete it manually during a restore.
Isolation is key: Never test restore automation against a live application environment. The “blast radius” of a failed database startup is huge.
Clean starts only: A “delta” restore is fast, but if you’re seeing corruption, wipe the directory and start fresh.
Trust the Tool: pgBackRest is designed to orchestrate recovery. If you find yourself writing rm commands in your data folder, you’re probably doing it wrong.

Final Thought:

The system wasn’t broken.
My mental model of how PostgreSQL recovers was.

Once I stopped fighting the recovery logic and started supporting it, the “corrupt” restores vanished.

PostgreSQL Restore Failures: Stop Fighting Recovery Logic was originally published in Stackademic on Medium, where people are continuing the conversation by highlighting and responding to this story.

Debugging a Pipeline That “Should Have Worked”

Mohamed Hussain S — Sat, 09 May 2026 06:12:35 GMT

Part 4 of a series on building a metrics pipeline into ClickHouse

When Everything Looks Correct

At this stage, everything was in place.

The pipeline model was clear
Transforms were working
Data matched the schema

By all accounts, it should have worked.

But it didn’t.

The Kind of Error That Breaks Your Flow

Then I saw this:

There exists no table monitoring.cpu in database monitoring

That didn’t make sense.

I hadn’t:

Created that table
Referenced it
Used it anywhere in my pipeline

And yet, it kept appearing.

When Debugging Goes in the Wrong Direction

At first, I assumed the problem was in my Vector configuration.

So I checked:

Transforms
Sinks
Table mappings

Everything looked correct.
Which made it worse.
Because now there was no clear direction.

The Turning Point: Logs

This is where I stopped guessing and started looking at the system.

sudo tail -f /var/log/clickhouse-server/clickhouse-server.err.log

Logs don’t lie.

They don’t explain — but they point.

The Actual Problem

The issue wasn’t in my pipeline.

It was outside of it.

A previously used Telegraf process was still running.

Even after:

Removing configs
Switching to Vector
Rebuilding everything

The old process was still alive.

And still sending data.

The Invisible Interference

That meant two things were happening at once:

My new Vector pipeline
An old Telegraf pipeline

Both writing to the same system.
Which explained everything.
The error wasn’t coming from what I built.
It was coming from what I forgot to stop.

The Fix

Simple.

But easy to miss.

ps aux | grep telegraf

And then:

sudo systemctl stop telegraf

That was it.

The error disappeared.
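
The same check can be scripted so it runs before every pipeline change. This is a small sketch, assuming process listings in the shape produced by ps -eo pid,comm; the function name and the sample listing are made up for illustration.

```python
def find_processes(ps_output, names):
    """Return (pid, command) pairs whose command matches one of names.

    ps_output is text in the shape produced by `ps -eo pid,comm`.
    """
    wanted = set(names)
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip() in wanted:
            matches.append((int(parts[0]), parts[1].strip()))
    return matches


# Made-up sample listing, not real output from my machine.
SAMPLE = """  PID COMMAND
  101 vector
  202 telegraf
  303 clickhouse-serv
"""

leftovers = find_processes(SAMPLE, ["telegraf"])
```

Here leftovers holds the one telegraf entry, which is exactly the kind of forgotten writer that caused the error.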

The Real Lesson

This wasn’t a configuration issue.

It was a system issue.

I was debugging the wrong pipeline.

And that’s more common than it sounds.

What This Changes

After this, my approach to debugging changed completely.

I stopped asking:

“What’s wrong with my config?”

And started asking:

“What is actually running right now?”

The Bigger Picture

Data pipelines don’t exist in isolation.

They live inside systems.

And systems have:

History
State
Residue

Ignoring that leads to confusion.

Series Recap

Part 1: Metrics, ClickHouse, and the “Simple” Setup That Wasn’t
Part 2: Vector Pipelines and the Mental Shift from Configs to Flow
Part 3: Writing transforms and handling data correctly
Part 4: Debugging the pipeline and making it work (this post)

Final Thought

The hardest part wasn’t learning a new tool.

It was learning how to think about the system.

Tools don’t break pipelines. Assumptions do.

Debugging a Pipeline That “Should Have Worked” was originally published in Stackademic on Medium, where people are continuing the conversation by highlighting and responding to this story.

Writing Transforms in Vector: Where the Pipeline Actually Breaks

Mohamed Hussain S — Wed, 06 May 2026 07:06:35 GMT

Part 3 of a series on building a metrics pipeline into ClickHouse
Read Part 2: Vector Pipelines and the Mental Shift from Configs to Flow

Where the Architecture Stops Helping

By this point, things finally made sense.

I had moved away from the “Telegraf trap.”
I understood the Vector pipeline model.
Sources, transforms, sinks — it all felt clean.

For the first time, the system didn’t feel like a black box.

And yet, the pipeline still wasn’t working.

That’s when it became obvious:

The structure was correct. The data wasn’t.

The Reality of Raw Metrics

Up until now, I had been thinking at a structural level:

Where data comes from
Where it goes
How it flows

But ClickHouse doesn’t care about your architecture.

It cares about data shape.

And raw metrics are messy.

They come in:

Different types
Different formats
Different structures

Even when everything looks correct, a single mismatch can break ingestion.

No errors. No warnings. Just… nothing.

Enter the Transformation Layer

This is where Vector becomes both powerful — and unforgiving.

Transforms are written using Vector Remap Language (VRL).

At a glance, VRL feels simple.

In practice, it behaves more like a strict compiler than a scripting language.

Types must match
Fields must exist
Conversions must be explicit

There is no “close enough.”

Either the data is correct, or it doesn’t flow.

The Timestamp Problem

If there was one issue that kept breaking everything, it was this:

timestamps.

It sounds trivial. It wasn’t.

The incoming metrics had timestamps.
ClickHouse expects timestamps.

But not those timestamps.

They needed to be:

Parsed correctly
Converted explicitly
Stored in the expected format

The fix looked like this:

.timestamp = to_unix_timestamp(parse_timestamp!(.timestamp, "%+"))

One line.

Hours of debugging.
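
To make the two steps visible outside VRL, here is a rough Python equivalent. It is only an illustration of the same idea (parse an RFC 3339 string, then convert to Unix seconds); the function name is hypothetical and this is not code from the pipeline.

```python
from datetime import datetime


def to_unix_seconds(text):
    """Parse an RFC 3339 / ISO 8601 timestamp string into Unix seconds."""
    # Older Python versions do not accept a trailing "Z", so normalise it.
    parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        raise ValueError("timestamp has no UTC offset: " + text)
    return int(parsed.timestamp())
```

Rejecting timestamps without an offset is a deliberate choice in this sketch: a value that silently falls back to local time is the kind of mismatch that is hard to spot later.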

When Data Doesn’t Match the System

The deeper issue wasn’t just timestamps.

It was the realization that:

Raw metrics are not designed for your database.

They are designed for collection — not storage.

So everything had to be reshaped.

For host metrics, that meant:

Filtering only relevant metrics
Renaming fields to match schema
Converting values into consistent numeric types

For GPU metrics, it was worse.

The GPU Problem

GPU metrics didn’t come as structured data.

They came as a CSV string from nvidia-smi.

Something like:

0, Tesla T4, 35, 20, 4000, 15109, 11109, 65

Which meant:

Parse the string
Split it into fields
Clean whitespace
Convert everything manually

And then, reshape it.

Because even after parsing, the data still wasn’t usable.
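
The parsing steps above can be sketched in a few lines of Python. The column names below are guesses made for illustration; the real order depends on the --query-gpu fields passed to nvidia-smi, which I have not listed here.

```python
# Hypothetical column names; the real order depends on the --query-gpu
# fields passed to nvidia-smi, which the text does not list.
FIELDS = [
    ("gpu_index", int),
    ("gpu_name", str),
    ("temperature_c", float),
    ("utilization_pct", float),
    ("memory_used_mib", float),
    ("memory_total_mib", float),
    ("memory_free_mib", float),
    ("power_w", float),
]


def parse_gpu_line(line):
    """Split one CSV line, trim whitespace, and convert each field."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return {name: cast(value) for (name, cast), value in zip(FIELDS, parts)}
```

Failing loudly on an unexpected field count is intentional: a silent partial parse would only push the error further down the pipeline.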

From Wide Data to Rows

One of the biggest shifts was how metrics were stored.

Initially, it was tempting to keep everything in a single record:

cpu, memory, disk, temperature

But that doesn’t scale well in ClickHouse.

So the data had to be transformed into a row-based format:

metric_name = "cpu", value = ...
metric_name = "memory", value = ...

Each metric becomes its own row.

This wasn’t just a formatting decision.

It changed:

How queries are written
How aggregations work
How flexible the system becomes
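
As a sketch of that reshaping, assuming each wide record carries a timestamp and a host field (both names are hypothetical) next to its metric values:

```python
def to_rows(record, key_fields=("timestamp", "host")):
    """Turn one wide record into one row per metric."""
    keys = {name: record[name] for name in key_fields}
    return [
        {**keys, "metric_name": name, "value": float(value)}
        for name, value in record.items()
        if name not in key_fields
    ]
```

A record with cpu and memory values becomes two rows that share the same timestamp and host, so adding a new metric later means new rows rather than a schema change.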

The Iteration Loop

Most of the real work happened here:

Write transform → Run → Nothing works → Check → Fix → Repeat

Not once.
Not twice.
Multiple times.

Each iteration uncovered something new:

A missing field
A wrong type
A format mismatch

This wasn’t configuration anymore.

This was data engineering.

When It Finally Clicked

At some point, things stopped breaking.

Data started flowing.

ClickHouse stopped rejecting inserts.

Queries started returning meaningful results.

Nothing “magical” happened.

The data just… matched the system.

What This Changes

This part of the pipeline taught me something important:

Pipelines don’t fail at collection. They fail at transformation.

You can have:

The right tools
The right architecture
The right setup

And still fail — because the data isn’t shaped correctly.

What’s Next

At this point, the pipeline was almost working.

But there was still one problem left.
A confusing one.
An error that didn’t make sense.
And it had nothing to do with Vector.

In the final part, I’ll walk through:

How I debugged the pipeline
What ClickHouse logs revealed
And a mistake that had nothing to do with the code — but broke everything

Series Overview

Part 1: Metrics, ClickHouse, and the “Simple” Setup That Wasn’t
Part 2: Vector Pipelines and the Mental Shift from Configs to Flow
Part 3: Writing transforms and handling data correctly (this post)
Part 4: Debugging the pipeline and making it work

Final Thought

The pipeline didn’t start working when I changed tools.

It started working when I understood the data.

Writing Transforms in Vector: Where the Pipeline Actually Breaks was originally published in Stackademic on Medium, where people are continuing the conversation by highlighting and responding to this story.

Vector Pipelines and the Mental Shift from Configs to Flow

Mohamed Hussain S — Mon, 04 May 2026 07:41:41 GMT

This post is Part 2 of a series on building a metrics pipeline into ClickHouse. Read Part 1: Metrics, ClickHouse, and the “Simple” Setup That Wasn’t.

Picking Up Where Things Broke

In the previous part, I talked about the “Telegraf Trap” — the moment I realized that writing a configuration file isn’t the same as building a resilient system. I was fighting the tool rather than focusing on the data.

The biggest issue wasn’t just a lack of plugins. It was a lack of visibility. I had reached a point where I didn’t have enough control over how data moved from point A to point B.

I needed to move from a “run-and-hope” mindset to a pipeline mindset.

Why Vector?

I came across Vector while looking for something that treated data as a first-class citizen.

If Telegraf felt like a Swiss Army knife, Vector felt like a factory floor. Instead of a single monolithic configuration that tries to do everything at once, Vector is built around an explicit pipeline model.

This shift in architecture changed how I approached the problem.

The Core Idea: The Directed Acyclic Graph (DAG)

In Part 1, I mentioned that I needed a model that looked like:

Data source → Transformation → Destination

Vector formalizes this into three distinct components:

Sources: Where the data begins its life.
Transforms: The logic layer where you shape, filter, and fix data.
Sinks: The final destination (in my case, ClickHouse).

By separating these, Vector makes the flow of data explicit. There’s no guessing what happens to a metric between the collector and the database; you have to define it.
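
The three-stage model is easy to mimic in plain Python, which helped me reason about it. This is a toy illustration of the idea only, not Vector configuration; the event fields and metric names are invented.

```python
def source():
    """Source: emit raw events (hard-coded here for illustration)."""
    yield {"name": "cpu", "value": "12.5"}
    yield {"name": "debug_noise", "value": "n/a"}
    yield {"name": "memory", "value": "64"}


def transform(events):
    """Transform: keep wanted metrics and make the value numeric."""
    for event in events:
        if event["name"] in {"cpu", "memory"}:
            yield {"name": event["name"], "value": float(event["value"])}


def sink(events):
    """Sink: collect rows; a real sink would write to ClickHouse."""
    return list(events)


rows = sink(transform(source()))
```

Every event has to pass through each stage in order, so anything the transform drops or fails to convert never reaches the sink.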

From Config Files to Data Flow

With Telegraf, my workflow was a repetitive loop of frustration:

Write config -> Run -> Debug errors.

With Vector, the process became an architectural design:

Collect -> Transform -> Route -> Store.

The focus moved from “What parameters does this plugin need?” to “How must this data evolve at each stage to be useful?”

The New Learning Curve

Of course, a new mental model doesn’t mean zero friction. Moving to Vector introduced its own set of challenges. It wasn’t just about learning a new tool; it was about learning to be an architect.

More importantly, I realized that the pipeline only works if every stage is defined with absolute precision. Early on, I hit the “Silent Failure” wall:

Data would enter the source but never reach the sink.
A single type mismatch in a transform would stall the entire flow.
The pipeline felt like a high-performance engine: when it works, it’s beautiful; when a single bolt is loose, nothing moves.

The Realization: Everything Is Connected

One lesson I learned quickly: In a pipeline, you cannot treat components as independent islands.

In my previous setup, I viewed the “collector” and the “database” as two separate problems. In Vector, they are part of the same nervous system. A bad transform doesn’t just produce bad data — it breaks the end-to-end flow.

This forced me to stop thinking about individual “tools” and start thinking about the integrity of the stream.

What Improved Immediately

Despite the initial configuration hurdles, three things became clear:

Observability: I could finally see exactly where data was being dropped.
Granularity: I had surgical control over the metrics before they hit ClickHouse.
Flexibility: I could finally shape the data to fit my schema, rather than forcing my schema to fit the tool.

The Hardest Part Was Still Coming

I had the structure. I had the pipeline. But one massive hurdle remained: The Data Shape.

ClickHouse is a powerful beast, but it is a fastidious one. It demands data in a very specific format, and this is where the “Transformation” stage of the pipeline becomes the most critical part of the entire architecture.

What’s Next

In Part 3, we get into the “heavy lifting”:

A first look at Vector Remap Language (VRL) for data shaping.
Solving the strict data type requirements of ClickHouse.
Fixing timestamp issues and shaping metrics for high-speed ingestion.

Series Overview

Part 1: Metrics, ClickHouse, and the “Simple” Setup That Wasn’t
Part 2: Vector Pipelines and the Mental Shift from Configs to Flow (This post)
Part 3: Writing transforms and handling data correctly
Part 4: Debugging the pipeline and making it work

Vector Pipelines and the Mental Shift from Configs to Flow was originally published in Stackademic on Medium, where people are continuing the conversation by highlighting and responding to this story.

Metrics, ClickHouse, and the “Simple” Setup That Wasn’t

Mohamed Hussain S — Fri, 10 Apr 2026 00:22:20 GMT

Part 1: Why I walked away from Telegraf and shifted my mental model on observability.

Collecting metrics is easy. Shipping them to an analytical database without losing your mind? That’s the hard part.

We’ve all been there. You start with a straightforward requirement: “Collect system metrics (CPU, memory, GPU) and store them in ClickHouse for analysis.” It’s a classic observability use case. You collect, you send, you query. Simple, right?

But as I quickly learned, there is a massive gap between “running a tool” and “building a system.”

The Initial Approach: The Telegraf Trap

I started where most people do: Telegraf.

It’s the industry standard for a reason. It’s plugin-based, mature, and has a massive community. This was also my formal introduction to TOML configuration. At first, the mindset was: “I just need to write a config and hit run.”

But configuration isn’t just syntax — it’s the DNA of your system’s behavior.

My goals were modest:

Capture host-level metrics.
Monitor GPU performance.
Pipe it all into ClickHouse for high-speed OLAP queries.
Prepare it for dashboards.

Where the Gears Started Grinding

On paper, Telegraf is a Swiss Army knife. In practice, I found myself trying to cut down a tree with a corkscrew. I hit four specific walls:

The ClickHouse Gap: There wasn’t a native, “it-just-works” output plugin for ClickHouse that handled the schema the way I envisioned.
The “Black Box” Problem: Debugging why a metric didn’t arrive where it should have felt unintuitive.
Rigidity: As my requirements grew, my TOML file became a brittle monolith.
Tool vs. Task: I was spending 80% of my time fighting the tool and only 20% thinking about my data.

The Shift: From Configs to Pipelines

This is where the lightbulb finally flickered on. I had been thinking in a linear, static way:

Write config → Run tool → Expect output

That’s a recipe for brittle infrastructure. I needed to stop thinking about “tools” and start thinking about Data Flow. I needed a mental model that looked like this:

Data source → Transformation → Destination

The problem wasn’t that Telegraf was “bad” — it was that I lacked control over the Transformation step. I needed a system that treated data like a moving stream, not a static delivery.

Why I Switched to Vector

I realized that for a ClickHouse-backed pipeline, I needed three things: Granular control, visibility, and flexibility.

That search led me to Vector. Unlike traditional collectors, Vector doesn’t just “send” data; it builds a directed acyclic graph (DAG) of your information. It treats configuration as a pipeline.

Coming up in Part 2…

In the next post, I’ll dive into the technical implementation of Vector, why the sources → transforms → sinks model is a game-changer for ClickHouse, and how my architecture finally stabilized.

The lesson so far? Tools don’t solve problems - understanding systems does.

Part 2 — Vector Pipelines and the Mental Shift from Configs to Flow

Metrics, ClickHouse, and the “Simple” Setup That Wasn’t was originally published in Stackademic on Medium, where people are continuing the conversation by highlighting and responding to this story.

ClickHouse Full-Text Search in 2026: What Actually Works (and What Doesn’t)

Mohamed Hussain S — Fri, 03 Apr 2026 08:55:41 GMT

For years, the answer was simple:

Don’t use ClickHouse for full-text search.

If you needed “real” search, you ended up syncing your data to Elasticsearch or OpenSearch anyway. It was just part of the architecture.

But that assumption is starting to change.

With the introduction of inverted indices and native ranking capabilities, the question is no longer if ClickHouse can handle search — but how far you can push it before needing a dedicated search engine.

What Do We Mean by Full-Text Search?

Full-text search is not just about matching strings.

In a typical search system, it involves:

Breaking text into tokens (words)
Mapping those tokens efficiently (inverted index)
Ranking results based on relevance (BM25 or similar)
Handling variations, ordering, and importance

For example, searching:

error connecting database

A search engine doesn’t just look for exact matches. It understands relationships between words, ranks results, and returns the most relevant entries first.
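
To make the first two steps concrete, here is a toy inverted index in Python. It is a sketch for intuition only: it covers tokenizing and token-to-document mapping, leaves out ranking entirely, and says nothing about how ClickHouse implements its index internally.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token in the query."""
    sets = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*sets) if sets else set()
```

Looking up a query becomes a few set intersections instead of a scan over every document, which is the whole point of the structure.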

Historically, this is where ClickHouse fell short.

What Changed in 2026

ClickHouse now provides native capabilities that significantly close this gap.

Inverted Indices

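Instead of scanning entire datasets with LIKE queries, ClickHouse can now use inverted indices to map tokens directly to data locations. The index type name depends on the server version (inverted in early releases, later full_text and text), so check it against the version you run.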

ALTER TABLE logs 
ADD INDEX inv_idx message TYPE inverted(0) GRANULARITY 1;

This allows queries to skip irrelevant data blocks and focus only on matching segments.

Relevance Scoring (BM25)

Search is not just about matching; it’s about ranking.

ClickHouse now supports BM25-style scoring, allowing you to prioritize more relevant results.

SELECT *,
       BM25(message) AS score
FROM logs
WHERE hasToken(message, 'error')
ORDER BY score DESC
LIMIT 10;

This brings it closer to how traditional search engines behave.

Token-Based Matching

Functions like hasToken() introduce awareness of word boundaries.

This avoids issues like:

matching “log” inside “biological”
or partial, unintended matches

It’s a small detail, but an important one.
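
The difference is easy to see in a few lines of Python. This only approximates the behaviour described above by splitting on non-alphanumeric characters; the function name is hypothetical and this is not ClickHouse's implementation.

```python
import re


def has_token(text, token):
    """True if token appears as a whole token, split on non-alphanumerics."""
    return token in re.split(r"[^A-Za-z0-9]+", text)
```

A plain substring test finds "log" inside "biological sample", while has_token("biological sample", "log") does not, because "log" never appears there as a separate token.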

Where ClickHouse Now Works Really Well

There are specific scenarios where ClickHouse is no longer just “good enough” - it’s actually a strong choice.

Log Analytics and Observability

This is where things get interesting.

You can:

Search logs for a specific error
Filter by time
Aggregate results

All in a single query.

SELECT count(*)
FROM logs
WHERE hasToken(message, 'error')
AND timestamp >= now() - INTERVAL 1 HOUR;

In many real-world setups, maintaining both ClickHouse and a search engine introduces unnecessary complexity — duplicate pipelines, higher costs, and operational overhead.

Consolidating these into a single system can simplify things significantly.

Hybrid Queries (Search + Analytics)

This is something traditional search engines struggle with.

SELECT service, count(*)
FROM logs
WHERE hasToken(message, 'timeout')
GROUP BY service;

Combining search with aggregations is where ClickHouse has a clear advantage.

Architectural Simplicity

Running multiple systems for analytics and search has always been an “accepted cost.”

But it comes with trade-offs:

Data duplication
Pipeline complexity
Increased infrastructure overhead

With newer capabilities, ClickHouse can now handle a larger share of that workload.

Where It Still Falls Short

Despite the progress, there are still clear limitations.

Complex Linguistic Processing

ClickHouse does not provide deep language features like:

stemming
synonym expansion
advanced language models

Dedicated search engines still handle this better.

Fuzzy Matching and Auto-Correct

Handling typos or approximate matches is still limited.

While functions such as editDistance() exist, they are not optimized for large-scale, real-time fuzzy search.

Search-First Applications

If search is your primary product — for example:

e-commerce search
document discovery systems
user-facing search platforms

Then specialized search engines still provide better control and tuning.

ClickHouse vs Search Engines (A Practical View)

A simple way to think about it:

ClickHouse

Analytics-first
Strong aggregation performance
Now supports structured search

Search engines (Elasticsearch / OpenSearch)

Search-first
Advanced relevance tuning
Mature text processing ecosystem

So… Can ClickHouse Replace a Search Engine?

In some cases, yes.

If your workload looks like:

log analytics
internal tools
analytics-heavy search

ClickHouse is now capable enough to handle both search and aggregation in one place.

But if your system depends heavily on:

relevance tuning
fuzzy matching
language-aware search

Then a dedicated search engine still makes sense.

Final Thoughts

The boundary between analytics and search is no longer as clear as it used to be.

ClickHouse has evolved from being “not suitable for search” to supporting meaningful full-text search capabilities for many real-world workloads.

It may not replace search engines everywhere.

But for a growing set of use cases, it can simplify your architecture — while still delivering the performance and flexibility you need.

ClickHouse Full-Text Search in 2026: What Actually Works (and What Doesn’t) was originally published in Stackademic on Medium, where people are continuing the conversation by highlighting and responding to this story.

When Synthetic Data Lies: A Hidden Correlation Problem

Mohamed Hussain S — Wed, 01 Apr 2026 10:26:44 GMT

While building a small analytics setup with ClickHouse and Superset, I generated synthetic data to test queries and dashboards.

At first, everything looked fine.

The distributions seemed reasonable.
The dashboards behaved as expected.

Nothing looked suspicious.

But as I increased the dataset size, something started to feel… off.

Revenue appeared heavily concentrated in a single country.
Some countries, in certain cases, had no purchases at all.

At a glance, it looked like a simple imbalance.

But the pattern was too consistent to ignore.

It Didn’t Look Like a Query Problem

The obvious assumption was that something was wrong with the queries.

So I checked everything:

query logic
filters
materialized views
dashboard configurations

Everything looked correct.

Which left only one possibility:

The issue wasn’t in how the data was queried — it was in how the data was generated.

Going Back to the Data

Instead of relying on dashboards, I went back to the raw data.

A simple aggregation made things clearer:

one country dominating purchases
another missing entirely

Interestingly, the overall event distribution still looked normal.

Which made the issue harder to spot at first.

The First Problem: Randomness That Wasn’t Random

The initial data generation logic used rand() like this:

multiIf(
    rand() % 100 < 40, 'India',
    rand() % 100 < 65, 'US',
    rand() % 100 < 80, 'UK',
    rand() % 100 < 90, 'Germany',
    'UAE'
)

At a glance, this seems reasonable.

But there’s a subtle problem.

Each rand() call is evaluated independently.

So instead of generating a single random value and assigning a category, the logic evaluates a new random value at every step.

That small detail introduces unintended bias.
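
The size of that bias can be worked out exactly. The sketch below assumes each branch really does compare a fresh, independent draw, as described above; whether a given engine evaluates repeated calls separately is worth checking on your own setup. The function names are made up for illustration.

```python
from fractions import Fraction

# Cumulative thresholds from the multiIf above, as fractions of 100.
THRESHOLDS = [("India", 40), ("US", 65), ("UK", 80), ("Germany", 90)]
FALLBACK = "UAE"


def single_draw():
    """Intended distribution: one draw compared against cumulative bounds."""
    result, previous = {}, 0
    for name, bound in THRESHOLDS:
        result[name] = Fraction(bound - previous, 100)
        previous = bound
    result[FALLBACK] = Fraction(100 - previous, 100)
    return result


def independent_draws():
    """Distribution when every branch compares a fresh, independent draw."""
    result, remaining = {}, Fraction(1)
    for name, bound in THRESHOLDS:
        hit = Fraction(bound, 100)
        result[name] = remaining * hit
        remaining *= 1 - hit
    result[FALLBACK] = remaining
    return result
```

Under that assumption the share for US rises from the intended 25% to 39%, and the share for UAE collapses from 10% to 0.42%.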

Fixing That… and Creating a New Problem

To make the data more stable, I switched to a deterministic approach:

(number * 17) % 100 AS event_rand
(number * 29) % 100 AS country_rand

This made the distributions predictable and easier to reason about.

But it also introduced a different issue — one that wasn’t obvious immediately.

Some countries now had zero purchases.

The Real Problem: Hidden Correlation

The issue wasn’t randomness anymore.

It was correlation.

Both event_type and country were derived from the same base value: number.

Even with different multipliers, they were still mathematically related.

Which meant:

certain combinations could never occur
some values were implicitly “locked together”

In this case, rows that produced "purchase" never aligned with rows that produced "UAE".

So UAE ended up with zero purchases.

Not because of probability.

But because of how the data was constructed.
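
The lock-in is easy to reproduce. The sketch below uses the two multipliers from the generator; the thresholds for "purchase" and "UAE" are assumptions chosen for illustration, since the exact cut-offs are not shown here.

```python
def pairs(event_mult, country_mult, rows=10_000):
    """Collect the (event_rand, country_rand) pairs that actually occur."""
    return {
        ((n * event_mult) % 100, (n * country_mult) % 100)
        for n in range(rows)
    }


observed = pairs(17, 29)

# Hypothetical thresholds, not taken from the original generator:
# event_rand >= 95 means "purchase", country_rand >= 90 means "UAE".
uae_purchases = [p for p in observed if p[0] >= 95 and p[1] >= 90]
```

Only 100 of the 10,000 possible pairs ever occur, because both values depend on number modulo 100, and under those assumed thresholds none of them is a UAE purchase.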

What Was Actually Wrong

The core issue wasn’t randomness.

It was lack of independence.

The variables in the dataset weren’t independent of each other.

And that’s enough to produce misleading analytics.

Fixing the Data Properly

To fix this, I changed how the values were generated:

used different transformations
added offsets
ensured variables were independently distributed

For example:

(number * 17) % 100 AS event_rand
(number * 31 + 13) % 100 AS country_rand
(number * 47 + 23) % 100 AS device_rand

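This breaks the specific alignment that left UAE without purchases. Note that all three values are still derived from number, so they remain deterministically linked rather than statistically independent; for truly independent columns, derive each one from a separate random draw or from a hash of number with a different salt per column.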

After this change, the distributions behaved as expected.

Why This Matters

At smaller scales, the issue wasn’t obvious.

The data looked fine.
The dashboards looked correct.

But as the dataset grew:

patterns became more consistent
biases became more visible
incorrect assumptions started to look like real insights

Key Takeaway

Synthetic data can look correct - and still be wrong.

The problem wasn’t query performance.

It was data correctness.

Scaling the data didn’t create the issue.

It revealed it.

Final Thoughts

This was a useful reminder that:

data generation deserves as much attention as querying
small assumptions can lead to large inconsistencies
“reasonable-looking” data isn’t always reliable

It’s easy to trust what dashboards show.

But sometimes, it’s worth questioning how that data was created in the first place.

This issue surfaced while building an analytics setup using ClickHouse and Superset, where I was also comparing raw tables and materialized views.

If you’re interested in that setup:

👉 ClickHouse + Superset

When Synthetic Data Lies: A Hidden Correlation Problem was originally published in Stackademic on Medium, where people are continuing the conversation by highlighting and responding to this story.

Managing Large PostgreSQL Tables with Native Partitioning and pg_partman

Mohamed Hussain S — Sun, 22 Mar 2026 04:18:10 GMT

As databases grow, tables that store large volumes of time-based data tend to become harder to manage.

What starts as a simple table gradually turns into a bottleneck. Queries slow down, indexes grow larger, maintenance operations like VACUUM take longer, and even simple tasks like managing old data become cumbersome.

PostgreSQL provides a built-in solution for this problem in the form of table partitioning.

Partitioning helps break a large table into smaller, more manageable pieces. But while it improves performance and data organization, managing partitions manually can introduce its own set of operational challenges.

In this article, we’ll walk through how PostgreSQL’s native partitioning works, where it starts to become difficult to manage at scale, and how pg_partman helps automate the process.

Understanding Native PostgreSQL Partitioning

At its core, partitioning allows a single logical table to be split into multiple physical tables, known as partitions.

PostgreSQL supports three types of partitioning: range, list, and hash. For most real-world systems dealing with time-based data, range partitioning is the most commonly used approach.

Consider a simple example where we store application events with a timestamp.

CREATE TABLE events (
    id BIGSERIAL,
    created_at TIMESTAMP,
    data JSONB
) PARTITION BY RANGE (created_at);

Here, events acts as the parent table. The actual data does not live in this table directly. Instead, it is stored in child tables, each representing a specific range of time.

For example:

CREATE TABLE events_2026_03_20
PARTITION OF events
FOR VALUES FROM ('2026-03-20') TO ('2026-03-21');

CREATE TABLE events_2026_03_21
PARTITION OF events
FOR VALUES FROM ('2026-03-21') TO ('2026-03-22');

Each of these partitions is a real, physical table inside PostgreSQL. However, from the application’s perspective, there is still just one table: events.

How Data Flows into Partitions

One of the powerful aspects of PostgreSQL partitioning is that applications don’t need to be aware of individual partitions.

Data is always inserted into the parent table.

INSERT INTO events (created_at, data)
VALUES ('2026-03-21 10:15:00', '{"event":"login"}');

PostgreSQL automatically determines which partition the row belongs to and routes it accordingly. In this case, the data would be written into the events_2026_03_21 partition.
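
To make the routing rule concrete, here is a minimal Python sketch of the idea. It is a toy model, not PostgreSQL's implementation: the `PARTITIONS` list and the `route` function are names made up for this illustration, and the bounds simply mirror the two partitions created earlier (lower bound inclusive, upper bound exclusive).

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical in-memory model of the two partitions defined above.
# Each range is half-open: FROM is inclusive, TO is exclusive.
PARTITIONS = [
    ("events_2026_03_20", datetime(2026, 3, 20), datetime(2026, 3, 21)),
    ("events_2026_03_21", datetime(2026, 3, 21), datetime(2026, 3, 22)),
]


def route(created_at: datetime) -> str:
    """Return the name of the partition whose range contains created_at."""
    lower_bounds = [lower for _, lower, _ in PARTITIONS]
    # Index of the last partition whose lower bound is <= created_at.
    idx = bisect_right(lower_bounds, created_at) - 1
    if idx >= 0:
        name, _, upper = PARTITIONS[idx]
        if created_at < upper:
            return name
    # Without a matching (or default) partition the row has nowhere to go.
    raise LookupError(f"no partition found for {created_at}")


print(route(datetime(2026, 3, 21, 10, 15)))  # events_2026_03_21
```

A timestamp of exactly midnight on 2026-03-21 lands in `events_2026_03_21`, while 2026-03-22 raises an error because no partition covers it yet.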

Queries also continue to run against the parent table.

SELECT * FROM events
WHERE created_at >= now() - interval '7 days';

Behind the scenes, PostgreSQL uses a technique called partition pruning, which ensures that only the relevant partitions are scanned. Instead of scanning an entire large table, it limits the work to just the partitions that match the query conditions.
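
The effect of pruning can be sketched in a few lines of Python. This is only an illustration of the overlap test, assuming daily partitions and a fixed stand-in for now(); `daily_partitions` and `prune` are hypothetical helpers, not anything PostgreSQL exposes.

```python
from datetime import datetime, timedelta


def daily_partitions(first_day, days):
    """Build (name, lower, upper) tuples for consecutive daily partitions."""
    parts = []
    for i in range(days):
        lower = first_day + timedelta(days=i)
        parts.append((f"events_{lower:%Y_%m_%d}", lower, lower + timedelta(days=1)))
    return parts


def prune(partitions, query_from, query_to=None):
    """Keep only partitions whose [lower, upper) range can hold matching rows.

    query_to=None means the predicate has no upper limit.
    """
    kept = []
    for name, lower, upper in partitions:
        if upper <= query_from:
            continue  # every row in this partition is older than the filter
        if query_to is not None and lower >= query_to:
            continue  # every row in this partition is newer than the filter
        kept.append(name)
    return kept


parts = daily_partitions(datetime(2026, 3, 1), 21)  # 2026-03-01 .. 2026-03-21
now = datetime(2026, 3, 21, 12, 0)                  # fixed stand-in for now()
scanned = prune(parts, now - timedelta(days=7))
print(len(parts), len(scanned))  # 21 8
```

Out of 21 daily partitions, only the 8 that can contain rows from the last seven days are scanned.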

Why Partitioning Makes a Difference

The performance benefits of partitioning come from reducing the amount of data PostgreSQL needs to process.

If a table contains years of data, but a query only needs the last seven days, PostgreSQL can skip older partitions entirely. This significantly reduces query time and improves efficiency.

Partitioning also simplifies operational tasks. Instead of deleting millions of rows, old data can be removed instantly by dropping an entire partition.

DROP TABLE events_2024_03_01;

This is much faster and more efficient than running large delete operations on a single table.

Over time, this approach helps keep tables smaller, indexes more manageable, and maintenance operations faster.

Where Native Partitioning Becomes Difficult

While native partitioning is powerful, managing it manually introduces operational complexity.

Partitions need to be created ahead of time. If a partition is missing and the table has no default partition, inserts for that range fail. Old partitions need to be cleaned up regularly, and retention policies must be implemented manually.

A simple but common scenario looks like this:

                   application inserts event at midnight
                                    ↓
                       new partition does not exist
                                    ↓
                              insert fails

As systems scale and data grows, keeping track of partitions manually becomes error-prone and difficult to maintain.

This is where automation becomes important.

Automating Partition Management with pg_partman

This is where pg_partman comes in.

pg_partman is a PostgreSQL extension designed to automate the management of partitioned tables. Instead of manually creating and maintaining partitions, pg_partman handles it for you.

It builds on top of PostgreSQL’s native partitioning and provides features such as:

automatic creation of future partitions
automatic retention and cleanup of old partitions
simplified maintenance workflows

Rather than relying on external scripts or manual intervention, partition management becomes part of the database itself.

How pg_partman Fits into the Workflow

With pg_partman in place, the workflow becomes much simpler.

A parent table is configured once, and pg_partman takes care of maintaining its partitions over time.

For a table partitioned daily, pg_partman ensures that upcoming partitions are always created in advance.

events_p2026_03_22
events_p2026_03_23
events_p2026_03_24

This guarantees that new data always has a valid destination and prevents insert failures caused by missing partitions.

It can also handle retention policies, automatically removing older partitions when they are no longer needed.
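
The two jobs described above, creating partitions ahead of time and dropping expired ones, can be sketched as a single maintenance pass. This is a rough Python illustration of the idea rather than pg_partman's actual code; the `maintenance` function and its `premake` and `retention_days` parameters are stand-ins chosen for this example.

```python
from datetime import date, timedelta


def partition_name(parent: str, day: date) -> str:
    # Mirrors the naming shown above (events_p2026_03_22); illustrative only.
    return f"{parent}_p{day:%Y_%m_%d}"


def maintenance(parent, existing, today, premake=3, retention_days=30):
    """Return (to_create, to_drop) for one maintenance pass.

    existing: set of dates that already have a daily partition.
    premake / retention_days: hypothetical knobs for this sketch.
    """
    # Today plus the next `premake` days must always have a partition.
    wanted = [today + timedelta(days=i) for i in range(premake + 1)]
    to_create = [partition_name(parent, d) for d in wanted if d not in existing]
    # Anything older than the retention window is eligible for DROP TABLE.
    cutoff = today - timedelta(days=retention_days)
    to_drop = [partition_name(parent, d) for d in sorted(existing) if d < cutoff]
    return to_create, to_drop


today = date(2026, 3, 21)
existing = {date(2026, 2, 18), date(2026, 2, 19), date(2026, 3, 20), date(2026, 3, 21)}
create, drop = maintenance("events", existing, today)
print(create)  # ['events_p2026_03_22', 'events_p2026_03_23', 'events_p2026_03_24']
print(drop)    # ['events_p2026_02_18']
```

Running a pass like this on a schedule is what keeps the midnight insert from ever meeting a missing partition.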

Why pg_partman Makes a Difference

The biggest advantage of pg_partman is that it removes the operational burden of managing partitions manually.

Instead of worrying about creating partitions, scheduling cleanup jobs, or handling edge cases, the system becomes self-maintaining.

This leads to:

fewer operational errors
consistent partition management
simpler database workflows

It allows teams to focus on the data itself rather than the mechanics of maintaining partitions.

When This Approach Works Best

Partitioning combined with pg_partman is especially useful for workloads that deal with continuously growing, time-based data.

This includes:

event logs
analytics pipelines
application activity tracking
time-series datasets

In these systems, new data is constantly written, while older data gradually becomes less relevant. Automating partition management helps handle this lifecycle efficiently.

Final Thoughts

PostgreSQL’s native partitioning provides a strong foundation for managing large datasets.

However, as systems grow, the operational overhead of managing partitions manually becomes increasingly difficult.

Extensions like pg_partman build on top of PostgreSQL’s capabilities and make partition management much more practical in real-world systems.

By combining native partitioning with automation, it becomes possible to handle large, time-based datasets in a way that is both efficient and reliable.

Managing Large PostgreSQL Tables with Native Partitioning and pg_partman was originally published in Stackademic on Medium, where people are continuing the conversation by highlighting and responding to this story.

PostgreSQL Backups and Point-in-Time Recovery with pgBackRest

Mohamed Hussain S — Fri, 13 Mar 2026 15:11:53 GMT

In the previous articles of this series, we explored how PostgreSQL records every change using Write-Ahead Logging (WAL) and how systems like Debezium can read those WAL records to stream database changes in real time.

But WAL is not only useful for streaming data changes.

It also enables one of the most powerful recovery mechanisms in PostgreSQL: Point-in-Time Recovery (PITR).

Instead of restoring a database only to the moment when a backup was taken, PITR allows PostgreSQL to reconstruct the database state at any later point in time, as long as the WAL covering that period has been archived.

This capability becomes possible by combining two things:

a base backup of the database
a continuous archive of WAL segments

In this article, we’ll look at how PostgreSQL backups work and how tools like pgBackRest simplify backup management and recovery workflows.

Why Backups Alone Are Not Enough

A common backup strategy is to take periodic full backups of the database.

Imagine a system where a full backup is taken every night at 02:00.

Later in the day, something goes wrong.

02:00  → backup taken
14:00  → accidental DELETE executed

If we restore the backup taken at 02:00, the database returns to that state.

But everything that happened between 02:00 and 14:00 is lost.

This is the limitation of relying only on periodic backups.

PostgreSQL solves this problem by continuously recording changes in WAL. By archiving WAL segments, the database can replay changes and rebuild its state up to a specific moment.

This is the idea behind Point-in-Time Recovery.

WAL and the Recovery Process

When PostgreSQL restores a backup, it does not simply copy database files and start running again.

Instead, the recovery process happens in stages.

                              Restore base backup
                                      ↓
                          Replay archived WAL segments
                                      ↓
                       Reach a consistent database state

Because WAL records every change in chronological order, PostgreSQL can replay those changes to rebuild the database state exactly as it existed at a particular moment.
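
The replay idea fits in a short Python sketch. It is a deliberately simplified model: the "database" is a dict of row counts, WAL records are tuples, and `recover` and `at` are hypothetical names for this example. It reuses the 02:00 backup and 14:00 DELETE from the scenario above.

```python
from datetime import datetime


def recover(base_backup, wal, target_time=None):
    """Replay WAL records on top of a copy of the base backup.

    base_backup: dict of table name -> row count at backup time.
    wal: list of (timestamp, action, table, rows) in commit order.
    target_time: stop before the first record newer than this;
                 None replays every record.
    """
    state = dict(base_backup)
    for ts, action, table, rows in wal:
        if target_time is not None and ts > target_time:
            break
        if action == "insert":
            state[table] = state.get(table, 0) + rows
        elif action == "delete":
            state[table] = state.get(table, 0) - rows
    return state


def at(hour, minute=0):
    return datetime(2026, 3, 13, hour, minute)


backup = {"orders": 1000}                  # state captured by the 02:00 backup
wal = [
    (at(9, 30), "insert", "orders", 50),
    (at(14, 0), "delete", "orders", 1050),  # the accidental DELETE
]

print(recover(backup, wal))                # {'orders': 0}
print(recover(backup, wal, at(13, 59)))    # {'orders': 1050}
```

Replaying everything reproduces the mistake, while stopping at 13:59 keeps the morning's inserts and skips the DELETE.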

Backup tools such as pgBackRest manage this entire process.

Managing Backups with pgBackRest

pgBackRest is a widely used PostgreSQL backup tool designed for reliability and automation.

It integrates directly with PostgreSQL’s WAL archiving system and supports storing backups in multiple storage backends, including object storage.

In the setup used for testing, the architecture consisted of two containers:

                         PostgreSQL + pgBackRest
                                    │
                                    ▼
                                  MinIO

The PostgreSQL container runs the database along with the pgBackRest binary, while MinIO acts as an object storage system where backups and WAL archives are stored.

pgBackRest supports full, differential, and incremental backups. These options allow systems to balance backup frequency, storage usage, and recovery speed.

Restoring the Database

Restoring a PostgreSQL backup requires the database to be stopped.

However, in this Docker-based setup, PostgreSQL and pgBackRest were running inside the same container. When PostgreSQL stops, the container stops as well, which means pgBackRest inside that container becomes inaccessible.

To work around this, a temporary container was started with the same PostgreSQL data volume. This allowed the restore operation to run while the database remained stopped.

The recovery workflow looked like this:

                              Restore base backup
                                      ↓
                              Replay archived WAL
                                      ↓
                       Database reaches consistent state

After the restore finished, the original PostgreSQL container was started again.

At first glance, everything seemed to work. But verifying the data revealed something unexpected.

A PITR Behavior Observed During Testing

The following sequence had occurred:

table created
↓
backup taken
↓
table deleted
↓
restore base backup
↓
WAL replay
↓
table still deleted

Even though the backup was taken before the table was deleted, the restored database still showed the table as deleted.

This happens because PostgreSQL automatically continues replaying all available WAL records during recovery.

Since the DROP TABLE operation had been recorded in WAL after the backup was taken, that change was replayed during recovery.

To prevent this behavior, the restore command used the option:

--type=immediate

This tells PostgreSQL to end recovery as soon as the restored backup reaches a consistent state, rather than replaying every archived WAL record that follows it.

Once recovery stopped at the correct point, the table appeared again as expected.

Understanding how WAL replay works is essential when performing Point-in-Time Recovery, because by default recovery replays WAL to the end of the archive unless a recovery target tells it to stop earlier.
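
When restores are scripted, it helps to make the recovery target explicit instead of relying on the default. The sketch below only builds the command line and does not run anything. `restore_command` is a hypothetical helper; the stanza name `demo` is taken from the cron entries in this article, and the `--type`, `--target`, and `--target-action` options are used as I understand them from the pgBackRest documentation, so check them against your installed version.

```python
import shlex


def restore_command(stanza, restore_type="default", target=None, target_action=None):
    """Build the argv for a pgBackRest restore without executing it."""
    argv = ["pgbackrest", f"--stanza={stanza}", f"--type={restore_type}"]
    # Target-based restore types need an explicit target value.
    if restore_type in {"time", "lsn", "xid", "name"}:
        if target is None:
            raise ValueError(f"--type={restore_type} needs a --target value")
        argv.append(f"--target={target}")
    if target_action is not None:
        argv.append(f"--target-action={target_action}")
    argv.append("restore")
    return argv


# Stop as soon as the backup is consistent, then promote.
print(shlex.join(restore_command("demo", "immediate", target_action="promote")))
# pgbackrest --stanza=demo --type=immediate --target-action=promote restore

# Stop just before the accidental change instead.
print(shlex.join(restore_command("demo", "time", target="2026-03-13 13:59:00")))
# pgbackrest --stanza=demo --type=time '--target=2026-03-13 13:59:00' restore
```

Keeping the target in one function makes it harder for an automated restore to silently fall back to replaying everything.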

Automating Backups

Backups are typically automated using scheduled jobs.

In this setup, the following cron schedule was used:

0 2 * * 0 pgbackrest --stanza=demo backup --type=full
0 2 * * 1-6 pgbackrest --stanza=demo backup --type=diff

This configuration creates:

a full backup every Sunday
differential backups during the rest of the week

This approach balances storage usage with recovery efficiency.
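
To see why this schedule keeps restores short, here is a small Python sketch of which backups a restore would need on a given day. The helper names are made up for this example; the only assumption carried over from the cron entries is that Sunday is the full-backup day.

```python
from datetime import date, timedelta


def backup_type(day: date) -> str:
    """Sunday gets a full backup; every other day gets a differential."""
    return "full" if day.isoweekday() == 7 else "diff"


def restore_chain(day: date) -> list:
    """Backups needed to restore the backup taken on `day`.

    A differential depends only on the most recent full backup,
    so the chain is never longer than two entries.
    """
    if backup_type(day) == "full":
        return [(day, "full")]
    last_sunday = day - timedelta(days=day.isoweekday() % 7)
    return [(last_sunday, "full"), (day, "diff")]


# 2026-03-11 is a Wednesday; the previous Sunday is 2026-03-08.
print(restore_chain(date(2026, 3, 11)))
# [(datetime.date(2026, 3, 8), 'full'), (datetime.date(2026, 3, 11), 'diff')]
```

Whatever the day, a restore reads at most one full and one differential backup before WAL replay begins.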

Final Thoughts

Throughout this series we’ve seen how Write-Ahead Logging (WAL) sits at the center of PostgreSQL’s design.

The same WAL mechanism powers several critical capabilities:

crash recovery
replication
change data capture
backups and point-in-time recovery

Tools like pgBackRest build on top of these mechanisms to provide reliable backup and restore workflows.

Understanding WAL not only helps explain how PostgreSQL ensures durability, but also how modern data systems build reliable pipelines and recovery strategies on top of it.

PostgreSQL Backups and Point-in-Time Recovery with pgBackRest was originally published in Stackademic on Medium, where people are continuing the conversation by highlighting and responding to this story.

How Debezium Uses PostgreSQL WAL for Change Data Capture

Mohamed Hussain S — Fri, 13 Mar 2026 08:57:31 GMT

In the previous article, we explored what Write-Ahead Logging (WAL) is and why it sits at the center of PostgreSQL’s architecture.

Every change that happens inside PostgreSQL is recorded in WAL before the modified pages are written to the actual table files.

This design ensures that the database can recover to a consistent state after a failure.

Because WAL records every change happening inside the database, it can also be used to stream those changes to other systems in real time.

This is exactly the idea behind Change Data Capture (CDC).

Instead of repeatedly querying database tables to detect updates, CDC systems read the database’s change log and convert those changes into events that other systems can consume.

One of the most widely used CDC platforms for PostgreSQL is Debezium.

Debezium reads PostgreSQL’s WAL and turns database operations into a continuous stream of events that can be processed by downstream systems.

Why Applications Need Change Data Capture

Modern applications rarely operate in isolation.

A single database update might need to trigger actions across several systems.

Consider a simple update:

UPDATE users
SET name = 'Alice Cooper'
WHERE id = 1;

This change might need to propagate to multiple places.

For example, an application might need to:

update a search index
refresh analytics dashboards
synchronize data into a data warehouse
trigger downstream workflows

A common approach is to periodically query the database to detect changes.

But this approach quickly runs into problems.

Polling databases repeatedly increases load, introduces delays, and becomes difficult to scale as systems grow.

Change Data Capture solves this problem by streaming database changes as they happen.

Why WAL Makes CDC Possible

PostgreSQL records every modification inside WAL before writing the change to the actual data files.

This means WAL effectively contains a chronological record of everything that happens inside the database.

Because of this, CDC systems do not need to read tables directly.

Instead, they simply read the WAL stream.

A simplified view of the process looks like this:

                              Application query
                                    ↓
                        PostgreSQL writes change to WAL
                                    ↓
                            CDC system reads WAL
                                    ↓
                           Change event generated

This approach is both efficient and reliable because the CDC system processes changes in the exact order they occurred.

Debezium follows this approach.

Rather than querying tables, it reads PostgreSQL’s WAL and converts the changes into structured events.

Enabling Logical Decoding in PostgreSQL

For external systems to interpret WAL records, PostgreSQL must include enough information about row-level changes.

This is controlled by the wal_level configuration.

wal_level = logical

PostgreSQL supports several WAL levels.

minimal - used only for crash recovery
replica - supports physical replication
logical - includes additional metadata required for logical decoding

When the WAL level is set to logical, PostgreSQL records enough information for external systems to reconstruct database changes.

This capability is known as logical decoding.

Replication Slots and Tracking WAL Consumption

When Debezium connects to PostgreSQL, it creates something called a replication slot.

A replication slot is a mechanism that allows PostgreSQL to track how far a consumer has read the WAL stream.

This ensures that WAL segments are not removed before the CDC system has processed them.

The process works like this:

                            Debezium reads WAL
                                    ↓
                  PostgreSQL records last consumed position
                                    ↓
                     WAL segments retained until consumed

The position in WAL is identified using a value called the Log Sequence Number (LSN).

Because the replication slot tracks the last processed LSN, Debezium can resume streaming changes from the correct position even if the connector restarts.
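
A toy model makes the slot's two jobs easier to see: remembering the consumer's position and holding back WAL cleanup. The classes below are invented for this sketch, and the LSNs are plain integers, whereas real LSNs are byte positions in the WAL stream.

```python
class ReplicationSlot:
    """Toy model: the slot remembers the last LSN its consumer confirmed."""

    def __init__(self, name):
        self.name = name
        self.confirmed_lsn = 0


class Wal:
    def __init__(self):
        self.records = []  # list of (lsn, change)
        self.next_lsn = 1

    def append(self, change):
        self.records.append((self.next_lsn, change))
        self.next_lsn += 1

    def read_after(self, lsn):
        return [r for r in self.records if r[0] > lsn]

    def recycle(self, slot):
        # Only records the slot has already confirmed may be discarded.
        self.records = [r for r in self.records if r[0] > slot.confirmed_lsn]


wal, slot = Wal(), ReplicationSlot("debezium")
for change in ["insert users 1", "update users 1", "delete users 1"]:
    wal.append(change)

# The consumer processes two changes, confirms them, then "restarts".
for lsn, change in wal.read_after(slot.confirmed_lsn)[:2]:
    slot.confirmed_lsn = lsn
wal.recycle(slot)

print(wal.records)                         # [(3, 'delete users 1')]
print(wal.read_after(slot.confirmed_lsn))  # [(3, 'delete users 1')]
```

The flip side is worth keeping in mind: if the consumer never confirms progress, nothing can be recycled and retained WAL keeps growing.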

Logical Decoding Plugins

WAL records are stored in a binary format that is not directly readable.

To convert these records into logical database changes, PostgreSQL uses logical decoding plugins.

These plugins interpret WAL records and transform them into a structured format.

Some commonly used plugins include:

pgoutput
wal2json
decoderbufs

pgoutput is built into PostgreSQL (version 10 and later) as the output plugin for native logical replication, so it needs no extra installation and is commonly used with Debezium.

How Debezium Streams Database Changes

Once everything is configured, the change data capture pipeline works as follows.

An application writes data to PostgreSQL.

PostgreSQL records the change in WAL.

The logical decoding plugin interprets the WAL records.

Debezium reads those decoded changes through PostgreSQL’s replication protocol.

The connector then publishes the changes as events to a messaging system such as Kafka.

A simplified architecture might look like this:

                                Application
                                     ↓
                             PostgreSQL (WAL)
                                     ↓
                          Logical decoding plugin
                                     ↓
                                  Debezium
                                     ↓
                                   Kafka
                                     ↓
                            Downstream systems

Downstream systems can then consume these events and react accordingly.

This architecture enables event-driven systems where database changes propagate automatically to other services.
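
For a sense of what wiring this up involves, here is a hedged Python sketch that builds a connector registration payload. The property names follow the Debezium PostgreSQL connector documentation as I understand it for the 2.x line (older releases use database.server.name where this uses topic.prefix). Every value, including the host, database, user, slot name, and the `connector_payload` helper itself, is a placeholder for illustration.

```python
import json


def connector_payload(name, host, dbname, tables):
    """Build the JSON body for registering a Debezium PostgreSQL connector."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": host,
            "database.port": "5432",
            "database.user": "debezium",       # placeholder credentials
            "database.password": "change-me",  # placeholder credentials
            "database.dbname": dbname,
            "topic.prefix": "app",             # placeholder; prefixes topic names
            "slot.name": "debezium_slot",      # placeholder replication slot name
            "table.include.list": ",".join(tables),
        },
    }


payload = connector_payload("users-connector", "postgres", "appdb", ["public.users"])
print(json.dumps(payload, indent=2))
```

A body like this is typically sent to the Kafka Connect REST API to create the connector, after which changes to public.users would be published to a topic derived from the prefix, schema, and table name.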

Example Change Event

When Debezium captures a change from WAL, it converts that change into a structured event.

A simplified example might look like this:

{
  "op": "c",
  "table": "users",
  "before": null,
  "after": {
    "id": 1,
    "name": "Alice"
  }
}

The op field represents the operation type:

c - create
u - update
d - delete

Applications consuming these events can use them to trigger additional processing.
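
As an example of such processing, the sketch below applies events shaped like the simplified one above to an in-memory copy of the table, the way a search index or cache updater might. `apply_event` is a hypothetical handler, and real Debezium events carry more fields than this simplified format.

```python
def apply_event(index: dict, event: dict) -> None:
    """Apply one simplified change event to an in-memory copy keyed by id."""
    op = event["op"]
    if op in ("c", "u"):
        row = event["after"]          # new row state for create and update
        index[row["id"]] = row
    elif op == "d":
        index.pop(event["before"]["id"], None)  # deleted row is in "before"
    else:
        raise ValueError(f"unknown op: {op!r}")


events = [
    {"op": "c", "table": "users", "before": None,
     "after": {"id": 1, "name": "Alice"}},
    {"op": "u", "table": "users", "before": {"id": 1, "name": "Alice"},
     "after": {"id": 1, "name": "Alice Cooper"}},
    {"op": "d", "table": "users", "before": {"id": 1, "name": "Alice Cooper"},
     "after": None},
]

search_index = {}
for e in events[:2]:
    apply_event(search_index, e)
print(search_index)  # {1: {'id': 1, 'name': 'Alice Cooper'}}

apply_event(search_index, events[2])
print(search_index)  # {}
```

Because events arrive in commit order, applying them one by one keeps the copy in step with the source table.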

WAL as the Foundation for Multiple Systems

One interesting aspect of PostgreSQL’s design is that many different systems rely on the same WAL stream.

The same WAL records used for CDC are also used for:

database replication
backup systems
point-in-time recovery

In many ways, WAL acts as the change history of a PostgreSQL database, for as long as its segments are retained or archived.

Final Thoughts

Change Data Capture allows systems to react to database changes without constantly querying tables.

PostgreSQL makes this possible because every modification is recorded in WAL.

Tools like Debezium simply read that stream of changes and convert it into events that other systems can consume.

This makes it possible to build scalable, event-driven architectures directly on top of relational databases.

Next in This Series

In the next article, we’ll explore how PostgreSQL WAL is used for database backups and point-in-time recovery using tools like pgBackRest.

We’ll see how WAL archiving enables restoring a database to a specific moment in time.

How Debezium Uses PostgreSQL WAL for Change Data Capture was originally published in Stackademic on Medium, where people are continuing the conversation by highlighting and responding to this story.