Stories by Mohitkundu on Medium

DataPortal: Empowering Everyone to Build and Manage Code-Free Data Pipelines

Mohitkundu — Tue, 19 Aug 2025 12:04:21 GMT

In today’s fast-paced, data-driven world, teams spend far too much time manually building pipelines, managing access, and integrating data across fragmented tools. Most engineering teams end up reinventing the wheel — spending weeks stitching together orchestration frameworks, execution environments, and data connectors just to move data from Point A to Point B. This approach becomes increasingly difficult to scale when the number of pipelines grows beyond a hundred. At Zepto, we faced the same challenge. So, we built DataPortal — a no-code data platform that empowers analysts, engineers, and business users to create and manage end-to-end data pipelines without writing complex code. It doesn’t just move data — it unifies orchestration, execution, and governance into a single platform.

Today, Zepto’s DataPortal powers:

6,000+ active pipelines
2,000+ table syncs across teams
200+ TB of data processed daily
300 daily active users building and running data workflows

Why DataPortal?

Modern companies rely on dozens of systems — from Google Sheets, Databricks, and S3 to Kafka, Slack, and internal databases. Connecting these sources for analytics and automation typically involves:

Writing and maintaining custom ETL scripts
Setting up orchestration frameworks (Airflow, Prefect, etc.)
Provisioning and tuning clusters for execution (Spark, Databricks)
Managing manual governance and access control

This creates engineering bottlenecks, scaling challenges, and significant maintenance overhead.

DataPortal simplifies this by offering:

No-code pipeline creation (visual workflow builder, with optional SQL or Python)
Seamless execution on Databricks Spark clusters
Unified orchestration powered by Airflow
Ready-made connectors for syncing data across your ecosystem
Centralized governance to manage datasets, pipelines, and compute access

Key Features

1. Unified No-Code Pipelines & Data Sync

Visual drag-and-drop interface to connect sources like Google Sheets, Databricks, Kafka, S3, Slack, and more.
Define transformations and schedules without code, with optional SQL or Python hooks for advanced users.
Bidirectional sync between systems, including:
• Databricks ↔ Google Sheets
• S3 ↔ Databricks
• Kafka ↔ Data Lakes
• Google Drive ↔ Data Warehouses
• StarRocks ↔ Databricks
Support for pipeline and workflow-level dependencies, so processes can trigger downstream actions such as Slack notifications, emails, or ML model training after data refreshes.

2. Seamless Orchestration with Airflow

Pipelines are automatically translated into Airflow DAGs — no manual DAG coding needed.
Built-in scheduling, retry mechanisms, and monitoring streamline operations.

3. Scalable Execution on Databricks

Pipelines run on Databricks-managed Spark clusters, ensuring speed and scalability.
Uses ephemeral, auto-terminating clusters to keep costs low.
Handles both batch and streaming workloads.

4. Collaboration, Governance, and Monitoring

Shared workspaces for teams to co-build and review pipelines.
Role-based access control (RBAC) to grant and revoke access to datasets, pipelines, and compute.
Real-time monitoring to track pipeline health, execution metrics, and ETAs, with automated alerts for failures and performance issues.

5. Native Streaming with Flink

Build real-time streaming pipelines using Flink.
Apply transformations and move data from one source to another with low latency and high reliability.

Architectural Overview

Figure: DataPortal Architecture

The Four Pillars of DataPortal

1. Web UI — The Control Plane

An interactive interface to create, configure, and schedule pipelines.
Handles authentication, user management, and pipeline configurations.

2. Airflow — The Brain of Orchestration

Each Workflow compiles into an Airflow DAG.
Manages dependencies, scheduling, and retries, hiding that complexity from users.

3. Databricks — The Muscle for Execution

Spark jobs run on ephemeral clusters, ensuring cost efficiency and scalability.
Handles everything from ETL to aggregations, streaming, and machine learning.

4. Connector Layer — The Workhorses

Custom-built connectors for Google Sheets, S3, Kafka, Slack, Databricks, and internal databases.
Provide robust read/write capabilities across all supported systems.

Additionally, DataPortal includes:

Governance Layer: Centralized resource access management, audit logs, and policy enforcement.
Observability Layer: Real-time monitoring, job tracking, and alerting to ensure system reliability and transparency.

The Workflow and Pipeline Concept

To keep the platform scalable and decoupled, DataPortal organizes all data operations into two layers: Workflows and Pipelines.

Workflow

A Workflow represents the overall job configuration and orchestration metadata.
It includes:

Airflow DAG details (for scheduling and orchestration)
Owner and SPOC information
Alerting channels for monitoring
Compute configurations (cluster size, job type, etc.)

Each Workflow can run multiple pipelines, with support for pipeline-level and workflow-level dependencies. Ultimately, a Workflow maps to one Airflow DAG and one Databricks job, acting as the container for execution.

Pipelines

Pipelines are the core units of data processing. Each defines:

The source and destination systems
Any transformations to be applied
The connections and data flow between systems

DataPortal supports two types of pipelines, based on the nature of the data:

1. Gold Table Pipelines (Aggregated Data)

Used by analysts, data scientists, and ML engineers.
These pipelines sync processed or aggregated datasets between systems such as Databricks, Google Sheets, S3, Kafka, or data warehouses.
Ideal for analytics, reporting, and model training.

2. Silver Table Pipelines (Raw Data)

Designed for raw, centrally managed data ingested directly from application and microservice databases or app/web events.
The flow includes:

Figure: Silver Tables Syncing Flow

A user raises a request for specific tables.
The Data Team approves or rejects the request.
Upon approval, Source and Sink connectors load the requested tables into S3.
S3 connectors deduplicate and transform the data before writing it into Delta tables.
Alerts and monitoring automatically report failures to maintain reliability.

This ensures clean, reliable raw data is always available for downstream analytics and ML use cases.

By separating Workflows (orchestration) from Pipelines (data processing) and introducing Gold and Silver data tiers, DataPortal achieves flexibility, scalability, and maintainability while meeting the diverse needs of analytics, ML, and application-driven use cases.
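To make the two layers concrete, here is a minimal sketch of how a Workflow and its Pipelines could be modeled, with pipeline-level dependencies resolved into a run order. This is not DataPortal's actual code: the class fields, the example names, and the use of a topological sort are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Pipeline:
    """One unit of data processing: a source, a sink, optional transforms."""
    name: str
    source: str
    destination: str
    transformations: list[str] = field(default_factory=list)
    depends_on: list[str] = field(default_factory=list)  # upstream pipeline names


@dataclass
class Workflow:
    """Orchestration metadata that wraps a set of pipelines."""
    name: str
    owner: str
    schedule: str        # cron expression (hypothetical field)
    alert_channel: str
    cluster_size: str
    pipelines: list[Pipeline] = field(default_factory=list)

    def execution_order(self) -> list[str]:
        """Return pipeline names so each one runs after its dependencies."""
        graph = {p.name: set(p.depends_on) for p in self.pipelines}
        return list(TopologicalSorter(graph).static_order())


workflow = Workflow(
    name="daily_sales",
    owner="analytics-team",
    schedule="0 6 * * *",
    alert_channel="#data-alerts",
    cluster_size="small",
    pipelines=[
        Pipeline("notify_slack", "databricks", "slack",
                 depends_on=["sales_to_sheets"]),
        Pipeline("sales_to_sheets", "databricks", "google_sheets",
                 depends_on=["s3_to_sales"]),
        Pipeline("s3_to_sales", "s3", "databricks", ["dedupe"]),
    ],
)
order = workflow.execution_order()
print(order)
```

A scheduler could walk `order` to emit one task per pipeline inside a single DAG, which matches the one-Workflow-to-one-DAG mapping described above.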

Key Challenges and How We Solved Them

1. Balancing Simplicity and Flexibility

Users wanted no-code simplicity, but advanced teams needed SQL and Python hooks. Supporting both required careful design.

2. Connector Reliability

APIs like Google Sheets, Slack, and Kafka have rate limits and latency quirks.
We built batching, retry logic, and fault-tolerant sync mechanisms to handle these.
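As a rough illustration of the batching and retry idea (not the connectors' real implementation), the sketch below writes rows in fixed-size batches and retries each batch with exponential backoff. The function names, batch size, and delays are assumptions; `write_batch` stands in for whatever API call a connector makes.

```python
import random
import time
from typing import Callable, Iterator, Sequence


def batches(rows: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def with_retry(call: Callable[[], None], attempts: int = 5,
               base_delay: float = 0.5, sleep=time.sleep) -> int:
    """Run `call`, retrying with exponential backoff plus jitter.

    Returns the number of attempts used; re-raises after the last failure.
    """
    for attempt in range(1, attempts + 1):
        try:
            call()
            return attempt
        except Exception:
            if attempt == attempts:
                raise
            # Waits 0.5s, 1s, 2s, ... plus up to 100 ms of random jitter.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))


def sync(rows: Sequence, write_batch: Callable[[Sequence], None],
         batch_size: int = 500, sleep=time.sleep) -> int:
    """Write rows in batches, retrying each batch independently."""
    written = 0
    for batch in batches(rows, batch_size):
        with_retry(lambda b=batch: write_batch(b), sleep=sleep)
        written += len(batch)
    return written
```

Retrying per batch rather than per sync means a rate-limit error only repeats the rows that failed, not the whole transfer.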

3. Cost Optimization for Databricks

Running Spark clusters for every job can be expensive.
We leveraged ephemeral, auto-terminating clusters to minimize idle costs.

4. Multi-Tenant Governance

Implemented role-based access control (RBAC) with fine-grained permissions and audit logging for compliance.

Future Scope

1. AI-Assisted Pipeline Creation

We’ve already integrated AI agents to help with pipeline metadata and error resolution, reducing engineering intervention.
Soon, users will be able to describe a pipeline in plain English, and DataPortal will build it automatically.

2. Self-Optimizing Pipelines

Automatic Spark job optimization based on execution history.
Auto-tuning SQL queries using execution plans and metadata.
Dynamic query redirection to optimal compute environments (Databricks job clusters, SQL warehouses, StarRocks, or ClickHouse nodes).

Final Thoughts

DataPortal is built to democratize data engineering — allowing any team member, regardless of technical background, to build and manage data workflows at scale while maintaining governance, observability, and cost efficiency.

By bridging orchestration (Airflow), execution (Databricks), and governance, we’ve created a system that helps organizations save time, reduce complexity, and focus on insights rather than infrastructure.

Our team was battling manual ETL, fragmented scripts, and access chaos. DataPortal has proved to be the unified solution we needed.

DataPortal: Empowering Everyone to Build and Manage Code-Free Data Pipelines was originally published in Zepto TechXPress on Medium, where people are continuing the conversation by highlighting and responding to this story.

Stop Cleaning Data in Your Delta Lake: Transform It on the Fly with Debezium & Kafka Connect

Mohitkundu — Sun, 20 Jul 2025 16:49:55 GMT

Data engineers often face the same headache: dirty, incomplete, or inconsistent data landing in their data lake.
From inconsistent timestamp formats to missing columns (thanks, Postgres TOAST!), duplicates, PII leaks, and even tombstone events, it feels like every dataset needs a Spark “janitor job” before analysts can use it.

But what if you could skip the cleaning stage entirely?

By using Debezium, Kafka Connect, and a few powerful Single Message Transforms (SMTs) and post-processors, you can ensure that data lands in your Delta Lake clean, complete, and analytics-ready — without manual cleanup.

Here’s how we built a zero-cleaning ingestion pipeline.

1. Normalize Timestamps (Oryon Moose TimestampConverter)

Debezium can produce uneven timestamp formats depending on the source database:

MongoDB and MySQL emit epoch-style timestamps (numeric).
Postgres can include timezones or use different formats altogether for date and timestamp types.

This inconsistency can wreak havoc on your data lake and make downstream SQL painful.

To fix this, we use the Oryon Moose TimestampConverter SMT to normalize everything to ISO-8601 UTC strings.

Config:

"transforms": "convertTS",
"transforms.convertTS.type": "com.github.oryonmoose.kafka.connect.smt.TimestampConverter$Value",
"transforms.convertTS.field": "created_at",
"transforms.convertTS.target.type": "string",
"transforms.convertTS.format": "yyyy-MM-dd'T'HH:mm:ss'Z'"

Now, regardless of whether the source was Mongo, MySQL, or Postgres, every timestamp lands in S3 or Delta Lake in a consistent UTC format — no downstream conversion needed.
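To show what the normalization amounts to, here is a small Python sketch of the same idea outside Kafka Connect. It is not the converter's code. It assumes numeric inputs are epoch milliseconds and that strings without an offset are already in UTC; both assumptions should be checked against your sources.

```python
from datetime import datetime, timezone


def to_iso_utc(value) -> str:
    """Normalize an epoch-millisecond number or ISO-8601 string to UTC."""
    if isinstance(value, (int, float)):
        # Assumption: numeric values are epoch milliseconds.
        moment = datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    else:
        moment = datetime.fromisoformat(value)
        if moment.tzinfo is None:
            # Assumption: strings without an offset are already UTC.
            moment = moment.replace(tzinfo=timezone.utc)
        moment = moment.astimezone(timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_iso_utc(1753030195000))                # epoch millis
print(to_iso_utc("2025-07-20T22:19:55+05:30"))  # string with an offset
```

Both calls describe the same instant, so they print the same UTC string.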

*For more information, see the converter’s documentation.

2. Aurora/Postgres Reselection Post-Processor (Fixing TOAST, Unavailable Values & Ensuring Completeness)

Postgres (and Aurora PostgreSQL) can emit events with missing or null columns for two reasons:

When the table’s replica identity is not set to FULL, Debezium does not capture the full value of large (TOASTed) columns when only other columns change. It puts the placeholder __debezium_unavailable_value__ in such columns.
If a column’s value is not captured from the logs, the value of that column can be null.

This leads to incomplete records landing in your lake, requiring costly backfill jobs.

The fix? Use Debezium’s Reselect Columns Post-Processor.
It queries the database during ingestion to fetch any missing or unavailable column values before writing the record downstream (Kafka).

Sample Connector Configuration:

"post.processors": "rsc",
"rsc.type": "io.debezium.processors.reselect.ReselectColumnsPostProcessor",
"rsc.reselect.columns.include.list": "public.my_table:large_blob,public.my_table:description",

Extra recommended properties

"rsc.reselect.use.event.key": "false",
"rsc.reselect.unavailable.values": "true",
"rsc.reselect.null.values": "false"

What This Does

reselect.columns.include.list limits reselection to the listed critical columns (like large_blob), written as schema.table:column. Here public.my_table is a placeholder for your own table.
reselect.unavailable.values=true triggers reselection for any field Debezium marks as __debezium_unavailable_value__.
reselect.null.values=false avoids reselection for truly null fields (avoiding wasted DB queries).
reselect.use.event.key=false makes reselection look up the row by the table’s primary key columns rather than by the event key.

The result? Every record Debezium emits is complete and analytics-ready, even when replica identity isn’t FULL.
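Conceptually, reselection is a lookup that fills placeholders from the current row. The sketch below models that logic in plain Python. It is not Debezium's implementation, and `fetch_row` is a hypothetical callable standing in for a query against the source database.

```python
UNAVAILABLE = "__debezium_unavailable_value__"


def reselect(after: dict, fetch_row, key: str = "id") -> dict:
    """Replace placeholder values in a change event's `after` image.

    `fetch_row` is a hypothetical callable returning the current row for
    a primary-key value; a real system would query the database here.
    """
    missing = [col for col, val in after.items() if val == UNAVAILABLE]
    if not missing:
        return after  # nothing to reselect, so no database round trip
    current = fetch_row(after[key])
    return {**after, **{col: current[col] for col in missing}}


# A dict plays the role of the source table in this example.
table = {7: {"id": 7, "status": "shipped", "description": "long text"}}
event = {"id": 7, "status": "shipped", "description": UNAVAILABLE}
print(reselect(event, table.__getitem__))
```

One consequence worth keeping in mind: the reselected value is the row's current value at lookup time, which may be newer than the change being processed.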

*Read more in the Debezium post-processor documentation.

3. Capture Kafka Offset and Partition for Lineage (and Easier Deduplication & Deletes)

Beyond lineage tracking, Kafka metadata like offset, partition, and an ingestion timestamp can also simplify:

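Deduplication: Offsets increase monotonically within a partition, so ordering on partition and offset lets you keep only the latest version of each row without adding custom version columns.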
Handling Deletes: Track exactly which delete events to keep or ignore based on event ordering.

This metadata makes data reconciliation, replay, and incremental processing far easier.

Sink Connector Config Example (Debezium → S3):

"transforms": "insertMeta",
"transforms.insertMeta.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.insertMeta.offset.field": "kafka_offset",
"transforms.insertMeta.partition.field": "kafka_partition",
"transforms.insertMeta.timestamp.field": "ingestion_ts"

With this, every record in S3 or Delta has the context needed to replay, deduplicate, or track deletes reliably.
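As an illustration of offset-based deduplication, the sketch below keeps the highest-offset record per primary key. It is a simplified stand-in for what would normally be a Spark or SQL job, and it assumes all events for a given key land in the same partition (the usual case when the row key is the message key), so their offsets are comparable.

```python
def latest_per_key(records: list[dict], key: str = "id") -> list[dict]:
    """Keep, for each key, the record with the highest Kafka offset.

    Assumes every event for one key comes from a single partition, so
    comparing `kafka_offset` values for that key is meaningful.
    """
    latest: dict = {}
    for rec in records:
        seen = latest.get(rec[key])
        if seen is None or rec["kafka_offset"] > seen["kafka_offset"]:
            latest[rec[key]] = rec
    return list(latest.values())


rows = [
    {"id": 1, "name": "a", "kafka_partition": 0, "kafka_offset": 10},
    {"id": 1, "name": "b", "kafka_partition": 0, "kafka_offset": 12},
    {"id": 2, "name": "c", "kafka_partition": 1, "kafka_offset": 3},
    {"id": 1, "name": "b", "kafka_partition": 0, "kafka_offset": 12},  # replayed
]
deduped = latest_per_key(rows)
print(deduped)
```

The replayed record with offset 12 is dropped because an identical offset is not strictly greater than the one already kept.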

4. Prefer the Mongo Source Connector Over Debezium Mongo

While Debezium works for MongoDB, its CDC events are often fragmented (updates to sub-documents become separate events).
For simpler ingestion, the native Mongo Source Connector can ingest full documents and apply filters before they hit Kafka.

Also, the Debezium MongoDB connector emits timestamps as epoch values, which need to be cleaned in later steps.

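Config Example:

"connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
"connection.uri": "mongodb+srv://user:pass@cluster0",
"database": "appdb",
"collection": "events",
"pipeline": "[{\"$match\": {\"fullDocument.status\": \"active\"}}]",
"change.stream.full.document": "updateLookup",
"publish.full.document.only": true

The pipeline runs against change stream events, so the filter reaches document fields through fullDocument, and updateLookup makes update events carry the full document. Now you get entire, filtered Mongo documents, with no downstream joins and no change in datatypes.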

Read more about the configuration parameters in the Mongo Source Connector documentation.

5. Other SMTs That Eliminate Data Cleaning

Beyond the big fixes, these SMTs can save hours of cleanup:

a. RegexRouter (Rename Topics Dynamically)

Make topic names predictable for your Delta Lake directory structure.

"transforms": "route",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
"transforms.route.replacement": "$1_$2_$3"

db1.public.users becomes db1_public_users.
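To check a routing regex before deploying it, the same rename can be reproduced with a few lines of Python. This is only a test harness, not connector code: `route` is a hypothetical helper, and Python writes group references as \1 where the connector config uses $1.

```python
import re


def route(topic: str, regex: str, replacement: str) -> str:
    """Rename a topic when the whole name matches; otherwise leave it as is."""
    match = re.fullmatch(regex, topic)
    return match.expand(replacement) if match else topic


PATTERN = r"([^.]+)\.([^.]+)\.([^.]+)"
print(route("db1.public.users", PATTERN, r"\1_\2_\3"))  # db1_public_users
print(route("heartbeat", PATTERN, r"\1_\2_\3"))         # heartbeat
```

Topics that do not match the pattern pass through unchanged, so it is worth testing a few non-table topics as well.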

b. ReplaceField (Remove PII or Unnecessary Columns)

Scrub sensitive fields in-flight to avoid compliance headaches.

"transforms": "removePII",
"transforms.removePII.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.removePII.exclude": "ssn,email,phone"

To mask PII columns with a custom value instead of removing them, use MaskField:

"transforms": "maskPII",
"transforms.maskPII.type": "org.apache.kafka.connect.transforms.MaskField$Value",
"transforms.maskPII.fields": "email,phone,ssn",
"transforms.maskPII.replacement": "REDACTED"

c. InsertField (Add Metadata for Partitioning)

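Add ingestion metadata for Delta partitioning. A static value is fixed in the connector config, so every record gets the same load_date until the config is updated.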

"transforms": "insertDate",
"transforms.insertDate.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.insertDate.static.field": "load_date",
"transforms.insertDate.static.value": "2025-07-20"

d. Flatten (Handle Nested JSON)

Flatten complex JSON into simple fields for analytics engines like Trino.

"transforms": "flatten",
"transforms.flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
"transforms.flatten.delimiter": "_"

{"user": {"id": 1, "name": "John"}} becomes {"user_id": 1, "user_name": "John"}.
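For readers who want to see the flattening rule itself, here is a short Python sketch of the same transformation. It mirrors the behavior described above for nested objects only; how lists and other types are treated here is an assumption of this sketch, not a statement about the SMT.

```python
def flatten(record: dict, delimiter: str = "_", prefix: str = "") -> dict:
    """Recursively merge nested dict keys into single-level field names."""
    flat = {}
    for name, value in record.items():
        path = f"{prefix}{delimiter}{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, delimiter, path))
        else:
            flat[path] = value  # lists and scalars are kept as they are
    return flat


print(flatten({"user": {"id": 1, "name": "John"}}))
```

Choosing a delimiter that never appears in field names avoids collisions such as a top-level user_id clashing with a nested user.id.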

e. Drop Tombstone Events (Debezium Deletes)

Debezium emits a tombstone (a record with a null value) after each delete event.
We can turn tombstones off at the source connector to keep S3 and Delta clean.

"tombstones.on.delete": "false"

Or use a predicate-based SMT:

"transforms": "dropTombstone",
"transforms.dropTombstone.type": "org.apache.kafka.connect.transforms.Filter",
"transforms.dropTombstone.predicate": "isTombstone",
"predicates": "isTombstone",
"predicates.isTombstone.type": "org.apache.kafka.connect.transforms.predicates.RecordIsTombstone"

The Payoff

By applying these transformations and post-processors, we made our data lake ingestion “clean by default”:

Timestamps standardized (even across Mongo, MySQL, Postgres)
TOASTed/unavailable fields auto-reselected from DB
Kafka metadata captured for lineage, deduplication, and delete handling
Nested JSON flattened
Tombstones removed
PII scrubbed in-flight

We eliminated nearly all post-load Spark cleanup jobs, simplified Trino queries, and delivered ready-to-query datasets immediately.

Final Thoughts

Instead of wasting time cleaning your Delta Lake, let Debezium, Kafka Connect, and SMTs do the heavy lifting before the data even lands.

With these techniques, your pipeline becomes clean, auditable, and analytics-ready by design — so you can spend less time fixing, and more time analyzing.

Is Debezium Eating Your Disk Space? Here Are 5 Ways to Fix It

Mohitkundu — Sun, 06 Jul 2025 10:27:39 GMT

Intro: Debezium is a powerful open-source tool for Change Data Capture (CDC), allowing real-time streaming of changes from PostgreSQL (and other databases) into Kafka. Under the hood, Debezium relies on logical decoding via replication slots to read WAL (Write-Ahead Log) records.

While this architecture enables efficient change tracking, it introduces a hidden risk: if a Debezium connector slows down, becomes idle, or is misconfigured, PostgreSQL retains WAL files — leading to replication slot bloat, increased disk usage, and potential database outages.

In this article, we’ll explore 5 effective strategies to prevent replication slot bloat when using Debezium with PostgreSQL.

1. Use a Heartbeat Table in Low-Update Environments

PostgreSQL’s WAL cleanup depends on the movement of the confirmed_flush_lsn, which advances only when Debezium acknowledges new changes. If a connector is watching low-update tables, PostgreSQL holds onto old WAL logs, thinking they’re still needed — causing disk space usage to spike.

💡 When This Happens:

The connector doesn’t include tables with frequent changes.
Another database on the same PostgreSQL instance is high-traffic, while the database Debezium tracks is idle.
On AWS RDS, even system writes that don’t generate visible events can cause WAL accumulation if the connector doesn’t produce change events regularly.

✅ What to do:

Enable Debezium’s heartbeat mechanism to periodically emit change events and move the LSN forward.

Step-by-Step:

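1. Create a heartbeat table and seed one row:

CREATE TABLE public.heartbeat_table (
  id SERIAL PRIMARY KEY,
  last_updated TIMESTAMP DEFAULT now()
);
-- Seed the row (id = 1) that the heartbeat query updates.
INSERT INTO public.heartbeat_table DEFAULT VALUES;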

2. Add it to the replication publication:

ALTER PUBLICATION your_publication ADD TABLE public.heartbeat_table;

3. Enable heartbeats in Debezium config:

{
  "heartbeat.interval.ms": "10000"
}

4. Trigger updates with heartbeat.action.query:

{
  "heartbeat.action.query": "UPDATE public.heartbeat_table SET last_updated = now() WHERE id = 1"
}

🧠 Why This Works:

Keeps confirmed_flush_lsn moving even in low-traffic schemas.
Prevents WAL from piling up due to inactivity.
Essential in multi-database PostgreSQL instances where WAL is shared, but Debezium is only reading from the quieter database.
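To verify that the heartbeat is actually moving the slot forward, it helps to turn LSNs into numbers and watch the gap. The sketch below does only the arithmetic. It assumes you have already read the current WAL position (for example from pg_current_wal_lsn()) and the slot's confirmed_flush_lsn from pg_replication_slots; the function names are made up for this example.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def slot_lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL written since the slot last confirmed a flush."""
    return lsn_to_int(current_wal_lsn) - lsn_to_int(confirmed_flush_lsn)


lag = slot_lag_bytes("0/3000000", "0/1000000")
print(f"{lag / 1024 / 1024:.0f} MiB behind")
```

If this number keeps growing while the connector is running, the heartbeat is probably not reaching the publication or the action query is not updating any row.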

2. Use slot.drop.on.stop and slot.drop.on.delete

Debezium provides configuration options to automatically drop replication slots when a connector is stopped or deleted.

✅ What to do:

Set these properties in your connector config:

{
  "slot.drop.on.stop": "true",
  "slot.drop.on.delete": "true"
}

This ensures replication slots don’t linger and hold WAL indefinitely after the connector is no longer in use.

⚠️ Caution: Use these options carefully in production. Dropping a slot requires re-snapshotting the data on restart, which might be expensive or unacceptable for large tables.

3. Separate High-Frequency Tables Into Dedicated Connectors

Grouping both high- and low-frequency tables in a single Debezium connector causes trouble when one table becomes a bottleneck. Because a single connector processes changes sequentially, lag in one table delays the processing of others — and the WAL keeps growing.

✅ What to do:

Move high-update tables into their own connectors.
For multiple high-frequency tables, divide them across connectors to maximize parallelism.

🔍 Key Benefits:

No single connector has to keep up with every table under load.
Multiple connectors process changes in parallel, so each replication slot’s LSN advances faster and PostgreSQL can reclaim WAL space sooner.

🛠 Example Setup:

connector_orders: tracks high-volume orders and transactions
connector_users: tracks medium-activity users and payments
connector_config: tracks rarely changed reference tables

This architectural separation improves reliability and WAL cleanup efficiency.
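One way to decide the grouping is to balance tables by observed change rate. The sketch below is a simple greedy heuristic, not a Debezium feature: the change rates are hypothetical numbers you would measure yourself, and the generated fragment omits every other required connector setting.

```python
def assign_tables(changes_per_min: dict[str, int],
                  connectors: int) -> list[list[str]]:
    """Greedy split: busiest tables first, each to the least-loaded group."""
    groups = [[] for _ in range(connectors)]
    load = [0] * connectors
    for table, rate in sorted(changes_per_min.items(), key=lambda kv: -kv[1]):
        target = load.index(min(load))
        groups[target].append(table)
        load[target] += rate
    return groups


def connector_fragment(name: str, tables: list[str]) -> dict:
    """Per-connector settings only; each connector needs its own slot."""
    return {
        "name": name,
        "slot.name": f"{name}_slot",
        "table.include.list": ",".join(tables),
    }


rates = {  # hypothetical changes per minute
    "public.orders": 9000,
    "public.transactions": 7000,
    "public.users": 400,
    "public.payments": 300,
    "public.config": 1,
}
groups = assign_tables(rates, 2)
for index, tables in enumerate(groups):
    print(connector_fragment(f"connector_{index}", tables))
```

Each extra connector also means an extra replication slot to monitor, so the number of groups is a trade-off rather than a case of more being better.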

4. Remove Unused SMTs (Single Message Transforms)

Debezium supports SMTs to transform messages before sending them to Kafka. While powerful, excessive or unnecessary SMTs slow down processing, leading to connector lag and delayed WAL consumption.

✅ What to do:

Review your SMT configuration.
Remove default or unused SMTs (e.g., ExtractNewRecordState) unless they are strictly required.
Avoid chaining complex transformations in the connector pipeline — move them downstream if possible.

🧠 Example:

Instead of masking fields in Debezium, consider doing it in your Kafka consumer to reduce connector overhead and keep CDC flow fast.

5. Drop Inactive Replication Slots

In some environments, connectors might crash or be deleted without properly cleaning up their replication slots. These orphaned slots remain inactive but still retain WAL, bloating disk usage.

✅ What to do:

1. List inactive replication slots:

SELECT * FROM pg_replication_slots WHERE active = false;

2. Drop unused slots safely:

SELECT pg_drop_replication_slot('your_slot_name');

🔁 Automate It:

Set up a cron job, Airflow task, or monitoring alert to periodically detect and remove stale replication slots that are no longer used.
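The decision step of such a job can be kept separate from the database calls, which makes it easy to test. The sketch below shows only that decision logic under stated assumptions: the rows mirror pg_replication_slots, `retained_bytes` is a hypothetical value the caller computes per slot, and the prefix allow-list is a safety measure added for this example so that slots owned by other systems are never touched.

```python
def slots_to_drop(slots: list[dict], allow_prefixes: tuple[str, ...],
                  min_retained_bytes: int) -> list[str]:
    """Pick inactive slots that match the allow-list and retain enough WAL.

    `slots` mirrors rows from pg_replication_slots, extended with a
    hypothetical `retained_bytes` value computed by the caller.
    """
    return [
        s["slot_name"] for s in slots
        if not s["active"]
        and s["slot_name"].startswith(allow_prefixes)
        and s["retained_bytes"] >= min_retained_bytes
    ]


observed = [
    {"slot_name": "debezium_orders", "active": True, "retained_bytes": 10},
    {"slot_name": "debezium_old_users", "active": False,
     "retained_bytes": 8 * 1024 ** 3},
    {"slot_name": "replica_standby", "active": False,
     "retained_bytes": 9 * 1024 ** 3},
]
stale = slots_to_drop(observed, ("debezium_",), 1024 ** 3)
print(stale)
```

A cautious rollout would alert on the returned names first and only call pg_drop_replication_slot automatically once the rule has proven safe.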

Conclusion

Debezium makes real-time syncing with PostgreSQL seamless, but it demands careful attention to replication slot and WAL management. Left unchecked, WAL bloat can bring down your database — especially in high-throughput or idle environments.

By following these 5 strategies:

Use heartbeat tables to simulate change events
Configure auto-cleanup with slot.drop.on.stop
Split high-frequency tables into dedicated connectors
Remove redundant SMTs
Periodically clean up inactive replication slots

— you can keep your system lean, performant, and production-ready.

Understanding how PostgreSQL shares and releases WAL across databases is key to operating Debezium reliably — especially in mixed workloads and multi-tenant setups.