<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Mohitkundu on Medium]]></title>
        <description><![CDATA[Stories by Mohitkundu on Medium]]></description>
        <link>https://medium.com/@mohitkundu1212?source=rss-2d97d637ff0c------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*dEhpZBpzk1fk7M_q</url>
            <title>Stories by Mohitkundu on Medium</title>
            <link>https://medium.com/@mohitkundu1212?source=rss-2d97d637ff0c------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 26 May 2026 22:56:54 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@mohitkundu1212/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[DataPortal: Empowering Everyone to Build and Manage Code-Free Data Pipelines]]></title>
            <link>https://blog.zeptonow.com/dataportal-empowering-everyone-to-build-and-manage-code-free-data-pipelines-fe017fb09bb1?source=rss-2d97d637ff0c------2</link>
            <guid isPermaLink="false">https://medium.com/p/fe017fb09bb1</guid>
            <dc:creator><![CDATA[Mohitkundu]]></dc:creator>
            <pubDate>Tue, 19 Aug 2025 12:04:21 GMT</pubDate>
            <atom:updated>2025-08-19T12:04:21.955Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*htIVS9bZ2xKgB3Nsy0dnbA.png" /></figure><p>In today’s fast-paced, data-driven world, teams spend far too much time manually <strong>building pipelines, managing access, and integrating data across fragmented tools</strong>. Most engineering teams end up reinventing the wheel — spending weeks stitching together orchestration frameworks, execution environments, and data connectors just to move data from Point A to Point B. This approach becomes increasingly difficult to scale when the number of pipelines grows beyond a hundred. At Zepto, we faced the same challenge. So, we built <strong>DataPortal</strong> — <strong>a no-code data platform</strong> that empowers analysts, engineers, and business users to <strong>create and manage end-to-end data pipelines</strong> <strong>without writing complex code.</strong> It doesn’t just move data — it unifies <strong>orchestration, execution, and governance</strong> into a single platform.</p><p>Today, Zepto’s DataPortal powers:</p><ul><li><strong>6,000+ active pipelines</strong></li><li><strong>2,000+ table syncs</strong> across teams</li><li><strong>200+ TBs of data processed on a daily basis</strong></li><li><strong>300 daily active users</strong> building and running data workflows</li></ul><h3>Why DataPortal?</h3><p>Modern companies rely on <strong>dozens of systems</strong> — from <strong>Google Sheets, Databricks, and S3</strong> to <strong>Kafka, Slack, and internal databases</strong>. Connecting these sources for analytics and automation typically involves:</p><ul><li>Writing and maintaining custom ETL scripts</li><li>Setting up orchestration frameworks (Airflow, Prefect, etc.)</li><li>Provisioning and tuning clusters for execution (Spark, Databricks)</li><li>Managing manual governance and access control</li></ul><p>This creates <strong>engineering bottlenecks, scaling challenges, and significant maintenance overhead</strong>.</p><p><strong>DataPortal simplifies this by offering:</strong></p><ul><li><strong>No-code pipeline creation</strong> (visual workflow builder, with optional SQL or Python)</li><li><strong>Seamless execution on Databricks Spark clusters</strong></li><li><strong>Unified orchestration powered by Airflow</strong></li><li><strong>Ready-made connectors</strong> for syncing data across your ecosystem</li><li><strong>Centralized governance</strong> to manage <strong>datasets, pipelines, and compute access</strong></li></ul><h3>Key Features</h3><h4>1. <strong>Unified No-Code Pipelines &amp; Data Sync</strong></h4><ul><li>Visual drag-and-drop interface to connect sources like <strong>Google Sheets, Databricks, Kafka, S3, Slack</strong>, and more.</li><li>Define <strong>transformations and schedules without code</strong>, with optional <strong>SQL or Python hooks</strong> for advanced users.</li><li><strong>Bidirectional sync</strong> between systems, including:<br>• Databricks ↔ Google Sheets<strong><br></strong>• S3 ↔ Databricks<br>• Kafka ↔ Data Lakes<br>• Google Drive ↔ Data Warehouses<br>• Starrocks ↔ Databricks</li><li>Support for <strong>pipeline and workflow-level dependencies</strong>, so processes can trigger <strong>downstream actions</strong> such as <strong>Slack notifications, emails, or ML model training</strong> after data refreshes.</li></ul><h4>2. Seamless Orchestration with Airflow</h4><ul><li>Pipelines are <strong>automatically translated into Airflow DAGs</strong> — no manual DAG coding needed.</li><li>Built-in <strong>scheduling, retry mechanisms, and monitoring</strong> streamline operations.</li></ul><h4>3. Scalable Execution on Databricks</h4><ul><li>Pipelines run on <strong>Databricks-managed Spark clusters</strong>, ensuring <strong>speed and scalability</strong>.</li><li>Uses <strong>ephemeral, auto-terminating clusters</strong> to keep costs low.</li><li>Handles both <strong>batch and streaming workloads</strong>.</li></ul><h4>4. Collaboration, Governance, and Monitoring</h4><ul><li><strong>Shared workspaces</strong> for teams to co-build and review pipelines.</li><li><strong>Role-based access management (RBAC)</strong> to grant and revoke access to datasets, pipelines, and compute.</li><li><strong>Real-time monitoring</strong> to track pipeline health, execution metrics, and ETAs, with <strong>automated alerts</strong> for failures and performance issues.</li></ul><h4>5. Native Streaming with Flink</h4><ul><li>Build <strong>real-time streaming pipelines</strong> using Flink.</li><li>Apply transformations and move data from one source to another with <strong>low latency and high reliability</strong>.</li></ul><h3>Architectural Overview</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0uifNRqu5tXBFVsX5jPz5w.png" /><figcaption>DataPortal Architecture</figcaption></figure><h3>The Four Pillars of DataPortal</h3><h4><strong>1. Web UI — The Control Plane</strong></h4><ul><li>An interactive interface to <strong>create, configure, and schedule pipelines</strong>.</li><li>Handles <strong>authentication, user management, and pipeline configurations</strong>.</li></ul><h4><strong>2. Airflow — The Brain of Orchestration</strong></h4><ul><li>Each Workflow compiles into an <strong>Airflow DAG</strong>.</li><li>Manages <strong>dependencies, scheduling, and retries</strong>, removing complexity from users.</li></ul><h4><strong>3. Databricks — The Muscle for Execution</strong></h4><ul><li>Spark jobs run on <strong>ephemeral clusters</strong>, ensuring <strong>cost efficiency and scalability</strong>.</li><li>Handles everything from <strong>ETL to aggregations, streaming, and machine learning</strong>.</li></ul><h4><strong>4. Connector Layer — The Workhorses</strong></h4><ul><li>Custom-built connectors for <strong>Google Sheets, S3, Kafka, Slack, Databricks, and internal databases</strong>.</li><li>Provide robust <strong>read/write capabilities</strong> across all supported systems.</li></ul><p>Additionally, DataPortal includes:</p><ul><li><strong>Governance Layer</strong>: Centralized <strong>resource access management, audit logs, and policy enforcement</strong>.</li><li><strong>Observability Layer</strong>: Real-time <strong>monitoring</strong>, job tracking, and <strong>alerting</strong> to ensure system reliability and transparency.</li></ul><h3>The Workflow and Pipeline Concept</h3><p>To keep the platform scalable and decoupled, <strong>DataPortal organizes all data operations into two layers: Workflows and Pipelines</strong>.</p><h4>Workflow</h4><p>A Workflow represents the <strong>overall job configuration and orchestration metadata</strong>.<br> It includes:</p><ul><li><strong>Airflow DAG details</strong> (for scheduling and orchestration)</li><li><strong>Owner and SPOC information</strong></li><li><strong>Alerting channels</strong> for monitoring</li><li><strong>Compute configurations</strong> (cluster size, job type, etc.)</li></ul><p>Each Workflow can run <strong>multiple pipelines</strong>, with support for <strong>pipeline-level and workflow-level dependencies</strong>. Ultimately, a Workflow maps to <strong>one Airflow DAG and one Databricks job</strong>, acting as the container for execution.</p><h4>Pipelines</h4><p>Pipelines are the <strong>core units of data processing</strong>. Each defines:</p><ul><li>The <strong>source and destination systems</strong></li><li>Any <strong>transformations</strong> to be applied</li><li>The <strong>connections and data flow</strong> between systems</li></ul><p>DataPortal supports <strong>two types of pipelines</strong>, based on the nature of the data:</p><ol><li><strong>Gold Table Pipelines (Aggregated Data)</strong></li></ol><ul><li>Used by analysts, data scientists, and ML engineers.</li><li>These pipelines sync <strong>processed or aggregated datasets</strong> between systems such as <strong>Databricks, Google Sheets, S3, Kafka, or data warehouses</strong>.</li><li>Ideal for <strong>analytics, reporting, and model training</strong>.</li></ul><p><strong>2. Silver Table Pipelines (Raw Data)</strong></p><ul><li>Designed for <strong>raw, centrally managed data</strong> ingested directly from <strong>application and microservice databases or app/web events</strong>.</li><li>The flow includes:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KHu2XWfwqJWTU-ZqhMvi3A.png" /><figcaption>Silver Tables Syncing Flow</figcaption></figure><ol><li>A user raises a request for specific tables.</li><li>The <strong>Data Team approves or rejects</strong> the request.</li><li>Upon approval, <strong>Source and Sink connectors</strong> load the requested tables into <strong>S3</strong>.</li><li><strong>S3 connectors deduplicate and transform</strong> the data before writing it into <strong>Delta tables</strong>.</li><li><strong>Alerts and monitoring</strong> automatically report failures to maintain reliability.</li></ol><p>This ensures <strong>clean, reliable raw data</strong> is always available for downstream analytics and ML use cases.</p><p>By separating <strong>Workflows (orchestration)</strong> from <strong>Pipelines (data processing)</strong> and introducing <strong>Gold and Silver data tiers</strong>, DataPortal achieves <strong>flexibility, scalability, and maintainability</strong> while meeting the diverse needs of analytics, ML, and application-driven use cases.</p><h3>Key Challenges and How We Solved Them</h3><h4><strong>1. Balancing Simplicity and Flexibility</strong></h4><ul><li>Users wanted <strong>no-code simplicity</strong>, but advanced teams needed <strong>SQL and Python hooks</strong>. Supporting both required careful design.</li></ul><h4><strong>2. Connector Reliability</strong></h4><ul><li>APIs like Google Sheets, Slack, and Kafka have <strong>rate limits and latency quirks</strong>.</li><li>We built <strong>batching, retry logic, and fault-tolerant sync mechanisms</strong> to handle these.</li></ul><h4><strong>3. Cost Optimization for Databricks</strong></h4><ul><li>Running Spark clusters for every job can be expensive.</li><li>We leveraged <strong>ephemeral, auto-terminating clusters</strong> to minimize idle costs.</li></ul><h4><strong>4. Multi-Tenant Governance</strong></h4><ul><li>Implemented <strong>role-based access control (RBAC)</strong> with fine-grained permissions and <strong>audit logging</strong> for compliance.</li></ul><h3>Future Scope</h3><h4><strong>1. AI-Assisted Pipeline Creation</strong></h4><ul><li>We’ve already integrated <strong>AI agents</strong> to help with <strong>pipeline metadata and error resolution</strong>, reducing engineering intervention.</li><li>Soon, users will be able to <strong>describe a pipeline in plain English</strong>, and DataPortal will build it automatically.</li></ul><h4><strong>2. Self-Optimizing Pipelines</strong></h4><ul><li><strong>Automatic Spark job optimization</strong> based on execution history.</li><li><strong>Auto-tuning SQL queries</strong> using execution plans and metadata.</li><li><strong>Dynamic query redirection</strong> to optimal compute environments (Databricks job clusters, SQL warehouses, StarRocks, or ClickHouse nodes).</li></ul><h3>Final Thoughts</h3><p><strong>DataPortal is built to democratize data engineering</strong> — allowing any team member, regardless of technical background, to build and manage data workflows at scale while maintaining <strong>governance, observability, and cost efficiency</strong>.</p><p>By bridging <strong>orchestration (Airflow)</strong>, <strong>execution (Databricks)</strong>, and <strong>governance</strong>, we’ve created a system that helps organizations <strong>save time, reduce complexity, and focus on insights rather than infrastructure</strong>.</p><p>Our team was battling <strong>manual ETL, fragmented scripts, and access chaos</strong>, DataPortal has proved to be the <strong>unified solution we need</strong>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fe017fb09bb1" width="1" height="1" alt=""><hr><p><a href="https://blog.zeptonow.com/dataportal-empowering-everyone-to-build-and-manage-code-free-data-pipelines-fe017fb09bb1">DataPortal: Empowering Everyone to Build and Manage Code-Free Data Pipelines</a> was originally published in <a href="https://blog.zeptonow.com">Zepto TechXPress</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Stop Cleaning Data in Your Delta Lake: Transform It on the Fly with Debezium & Kafka Connect]]></title>
            <link>https://medium.com/@mohitkundu1212/stop-cleaning-data-in-your-delta-lake-transform-it-on-the-fly-with-debezium-kafka-connect-b087b67430af?source=rss-2d97d637ff0c------2</link>
            <guid isPermaLink="false">https://medium.com/p/b087b67430af</guid>
            <category><![CDATA[real-time-streaming-data]]></category>
            <category><![CDATA[smt]]></category>
            <category><![CDATA[debezium]]></category>
            <category><![CDATA[etl]]></category>
            <category><![CDATA[kafka-connect]]></category>
            <dc:creator><![CDATA[Mohitkundu]]></dc:creator>
            <pubDate>Sun, 20 Jul 2025 16:49:55 GMT</pubDate>
            <atom:updated>2025-07-20T16:49:55.521Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3vCqmZObG9nACtTjjzjs_Q.jpeg" /></figure><p>Data engineers often face the same headache: <strong>dirty, incomplete, or inconsistent data landing in their data lake</strong>.<br> From inconsistent timestamp formats to missing columns (thanks, Postgres TOAST!), duplicates, PII leaks, and even tombstone events, it feels like every dataset needs a Spark “janitor job” before analysts can use it.</p><p>But what if you could <strong>skip the cleaning stage entirely</strong>?</p><p>By using <strong>Debezium</strong>, <strong>Kafka Connect</strong>, and a few powerful <strong>Single Message Transforms (SMTs)</strong> and <strong>post-processors</strong>, you can ensure that <strong>data lands in your Delta Lake clean, complete, and analytics-ready</strong> — without manual cleanup.</p><p>Here’s how we built a <strong>zero-cleaning ingestion pipeline</strong>.</p><h3>1. Normalize Timestamps (Oryon Moose TimestampConverter)</h3><p>Debezium can produce <strong>uneven timestamp formats</strong> depending on the source database:</p><ul><li><strong>MongoDB, MySQL </strong>emits epoch-style timestamps (numeric).</li><li><strong>Postgres</strong> can include timezones or different formats altogether for date and timestamp types</li></ul><p>This inconsistency can wreak havoc on your data lake and make downstream SQL painful.</p><p>To fix this, we use the <strong>Oryon Moose TimestampConverter SMT</strong> to normalize everything to <strong>ISO-8601 UTC strings</strong>.</p><p><strong>Config:</strong></p><pre>&quot;transforms&quot;: &quot;convertTS&quot;,<br>&quot;transforms.convertTS.type&quot;: &quot;com.github.oryonmoose.kafka.connect.smt.TimestampConverter$Value&quot;,<br>&quot;transforms.convertTS.field&quot;: &quot;created_at&quot;,<br>&quot;transforms.convertTS.target.type&quot;: &quot;string&quot;,<br>&quot;transforms.convertTS.format&quot;: &quot;yyyy-MM-dd&#39;T&#39;HH:mm:ss&#39;Z&#39;&quot;</pre><p>Now, regardless of whether the source was Mongo, MySQL, or Postgres, every timestamp lands in S3 or Delta Lake in a <strong>consistent UTC format</strong> — no downstream conversion needed.</p><p>*For more info on this check <a href="https://github.com/lensesio/kafka-connect-smt/blob/main/TimestampConverter.md">this</a> link</p><h3>2. Aurora/Postgres Reselection Post-Processor (Fixing TOAST, Unavailable Values &amp; Ensuring Completeness)</h3><p>Postgres (and Aurora PostgreSQL) can emit events with <strong>missing or null columns</strong> for two reasons:</p><ol><li>When the table’s <strong>replica identity is not set to FULL</strong> in Postgres then for columns having large values debezium doesn’t capture complete values when there is change in other columns. Debezium put <strong><em>__debezium_unavailable_value__ </em></strong>value placeholder for such colums.</li><li>If any columns values is not captured from logs. Value of that column could in null.</li></ol><p>This leads to incomplete records landing in your lake, requiring costly backfill jobs.</p><p>The fix? Use <strong>Debezium’s Reselect Columns Post-Processor</strong>.<br> It <strong>queries the database during ingestion</strong> to fetch any missing or unavailable column values before writing the record downstream(Kafka).</p><p><strong>Sample Connector Configuration:</strong></p><pre>&quot;post.processors&quot;: &quot;rsc&quot;,<br>&quot;post.processor.rsc.type&quot;: &quot;io.debezium.connector.postgresql.transforms.ReselectColumnsPostProcessor&quot;,<br>&quot;post.processor.rsc.reselect.columns&quot;: &quot;large_blob,description&quot;,<br>&quot;post.processor.rsc.reselect.timeout.ms&quot;: &quot;5000&quot;,</pre><p>Extra recommended properties</p><pre>&quot;post.processor.rsc.reselect.use.event.key&quot;: &quot;false&quot;,<br>&quot;post.processor.rsc.reselect.unavailable.values&quot;: &quot;true&quot;,<br>&quot;post.processor.rsc.reselect.null.values&quot;: &quot;false&quot;</pre><h3>What This Does</h3><ul><li><strong>reselect.columns</strong> ensures critical columns (like large_blob) are always fetched fresh.</li><li><strong>unavailable.values=true</strong> triggers reselection for any field Debezium marks as __debezium_unavailable_value__.</li><li><strong>null.values=false</strong> avoids reselection for truly null fields (avoiding wasted DB queries).</li><li><strong>use.event.key=false</strong> makes sure reselection uses the proper row identifier, not just the Kafka key.</li></ul><p>The result? <strong>Every record Debezium emits is complete, deduplicated, and analytics-ready</strong>, even when replica identity isn’t FULL.</p><p>*Read <a href="https://debezium.io/documentation/reference/stable/post-processors/index.html">here</a> more about Debezium Post Processors.</p><h3>3. Capture Kafka Offset and Partition for Lineage (and Easier Deduplication &amp; Deletes)</h3><p>Beyond lineage tracking, <strong>Kafka metadata</strong> like offset, partition, and an ingestion timestamp can also simplify:</p><ul><li><strong>Deduplication</strong>: By sorting on offset, you can drop duplicates without custom keys.</li><li><strong>Handling Deletes</strong>: Track exactly which delete events to keep or ignore based on event ordering.</li></ul><p>This metadata makes <strong>data reconciliation, replay, and incremental processing far easier</strong>.</p><p><strong>Sink Connector Config Example (Debezium → S3):</strong></p><pre>&quot;transforms&quot;: &quot;insertMeta&quot;,<br>&quot;transforms.insertMeta.type&quot;: &quot;org.apache.kafka.connect.transforms.InsertField$Value&quot;,<br>&quot;transforms.insertMeta.static.field&quot;: &quot;kafka_offset&quot;,<br>&quot;transforms.insertMeta.static.value&quot;: &quot;${offset}&quot;,<br>&quot;transforms.insertMeta.static.field2&quot;: &quot;kafka_partition&quot;,<br>&quot;transforms.insertMeta.static.value2&quot;: &quot;${partition}&quot;,<br>&quot;transforms.insertMeta.static.field3&quot;: &quot;ingestion_ts&quot;,<br>&quot;transforms.insertMeta.static.value3&quot;: &quot;${timestamp}&quot;</pre><p>With this, every record in S3 or Delta has the <strong>context needed to replay, deduplicate, or track deletes reliably</strong>.</p><h3>4. Prefer the Mongo Source Connector Over Debezium Mongo</h3><p>While Debezium works for MongoDB, its CDC events are often fragmented (updates to sub-documents become separate events).<br> For simpler ingestion, the <strong>native Mongo Source Connector</strong> can ingest <strong>full documents</strong> and apply filters before they hit Kafka.</p><p>Also Debezium MongoDB Connector converts timestamps in epoch-type timestamps which needs to be cleaned in further steps.</p><p><strong>Config Example:</strong></p><pre>&quot;connector.class&quot;: &quot;com.mongodb.kafka.connect.MongoSourceConnector&quot;,<br>&quot;connection.uri&quot;: &quot;mongodb+srv://user:pass@cluster0&quot;,<br>&quot;database&quot;: &quot;appdb&quot;,<br>&quot;collection&quot;: &quot;events&quot;,<br>&quot;pipeline&quot;: &quot;[{ $match: { status: &#39;active&#39; } }]&quot;,<br>&quot;publish.full.document.only&quot;: true</pre><p>Now you get <strong>entire, filtered Mongo documents</strong> — no downstream joins, no change in datatypes.</p><p>Read about more config paaramters for Mongo Source Connector <a href="https://www.mongodb.com/docs/kafka-connector/current/source-connector/">here</a></p><h3>5. Other SMTs That Eliminate Data Cleaning</h3><p>Beyond the big fixes, these SMTs can save hours of cleanup:</p><h3>a. RegexRouter (Rename Topics Dynamically)</h3><p>Make topic names predictable for your Delta Lake directory structure.</p><pre>&quot;transforms&quot;: &quot;route&quot;,<br>&quot;transforms.route.type&quot;: &quot;org.apache.kafka.connect.transforms.RegexRouter&quot;,<br>&quot;transforms.route.regex&quot;: &quot;([^.]+)\\.([^.]+)\\.([^.]+)&quot;,<br>&quot;transforms.route.replacement&quot;: &quot;$1_$2_$3&quot;</pre><p>db1.public.users becomes db1_public_users.</p><h3>b. ReplaceField (Remove PII or Unnecessary Columns)</h3><p>Scrub sensitive fields in-flight to avoid compliance headaches.</p><pre>&quot;transforms&quot;: &quot;removePII&quot;,<br>&quot;transforms.removePII.type&quot;: &quot;org.apache.kafka.connect.transforms.ReplaceField$Value&quot;,<br>&quot;transforms.removePII.blacklist&quot;: &quot;ssn,email,phone&quot;</pre><p>or</p><p>Use below for handling mask pii columns with customs values.</p><pre>&quot;transforms&quot;: &quot;maskPII&quot;,<br>&quot;transforms.maskPII.type&quot;: &quot;org.apache.kafka.connect.transforms.ReplaceField$Value&quot;,<br>&quot;transforms.maskPII.renames&quot;: &quot;email:REDACTED,phone:REDACTED,ssn:REDACTED&quot;</pre><h3>c. InsertField (Add Metadata for Partitioning)</h3><p>Add ingestion metadata for Delta partitioning.</p><pre>&quot;transforms&quot;: &quot;insertDate&quot;,<br>&quot;transforms.insertDate.type&quot;: &quot;org.apache.kafka.connect.transforms.InsertField$Value&quot;,<br>&quot;transforms.insertDate.static.field&quot;: &quot;load_date&quot;,<br>&quot;transforms.insertDate.static.value&quot;: &quot;2025-07-20&quot;</pre><h3>d. Flatten (Handle Nested JSON)</h3><p>Flatten complex JSON into simple fields for analytics engines like Trino.</p><pre>&quot;transforms&quot;: &quot;flatten&quot;,<br>&quot;transforms.flatten.type&quot;: &quot;org.apache.kafka.connect.transforms.Flatten$Value&quot;,<br>&quot;transforms.flatten.delimiter&quot;: &quot;_&quot;</pre><p>{&quot;user&quot;: {&quot;id&quot;: 1, &quot;name&quot;: &quot;John&quot;}} becomes {&quot;user_id&quot;: 1, &quot;user_name&quot;: &quot;John&quot;}.</p><h3>e. Drop Tombstone Events (Debezium Deletes)</h3><p>Debezium emits <strong>tombstone (null) messages</strong> for deletes.<br> We can drop them to keep S3 and Delta clean.</p><pre>&quot;drop.tombstones&quot;: &quot;true&quot;</pre><p>Or use a predicate-based SMT:</p><pre>&quot;transforms&quot;: &quot;dropTombstone&quot;,<br>&quot;transforms.dropTombstone.type&quot;: &quot;org.apache.kafka.connect.transforms.Filter$Value&quot;,<br>&quot;transforms.dropTombstone.predicate&quot;: &quot;isTombstone&quot;,<br>&quot;predicates&quot;: &quot;isTombstone&quot;,<br>&quot;predicates.isTombstone.type&quot;: &quot;org.apache.kafka.connect.transforms.predicates.IsNull&quot;,<br>&quot;predicates.isTombstone.field&quot;: &quot;after&quot;</pre><h3>The Payoff</h3><p>By applying these transformations and post-processors, we made our <strong>data lake ingestion “clean by default”</strong>:</p><ul><li>Timestamps standardized (even across Mongo, MySQL, Postgres)</li><li>TOASTed/unavailable fields auto-reselected from DB</li><li>Kafka metadata captured for lineage, <strong>deduplication, and delete handling</strong></li><li>Nested JSON flattened</li><li>Tombstones removed</li><li>PII scrubbed in-flight</li></ul><p>We <strong>eliminated nearly all post-load Spark cleanup jobs</strong>, simplified Trino queries, and delivered <strong>ready-to-query datasets immediately</strong>.</p><h3>Final Thoughts</h3><p>Instead of wasting time cleaning your Delta Lake, let <strong>Debezium, Kafka Connect, and SMTs</strong> do the heavy lifting <strong>before the data even lands</strong>.</p><p>With these techniques, your pipeline becomes <strong>clean, auditable, and analytics-ready by design</strong> — so you can spend less time fixing, and more time analyzing.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b087b67430af" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Is Debezium Eating Your Disk Space? Here Are 5 Ways to Fix It]]></title>
            <link>https://medium.com/@mohitkundu1212/is-debezium-eating-your-disk-space-here-are-5-ways-to-fix-it-f18b9c007d02?source=rss-2d97d637ff0c------2</link>
            <guid isPermaLink="false">https://medium.com/p/f18b9c007d02</guid>
            <category><![CDATA[postgres]]></category>
            <category><![CDATA[kafka-connect]]></category>
            <category><![CDATA[replication-slot]]></category>
            <category><![CDATA[debezium]]></category>
            <category><![CDATA[wal]]></category>
            <dc:creator><![CDATA[Mohitkundu]]></dc:creator>
            <pubDate>Sun, 06 Jul 2025 10:27:39 GMT</pubDate>
            <atom:updated>2025-07-06T10:27:39.268Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cBgresOyUQ_w0PxOJ2qHoQ.jpeg" /></figure><p><strong>Intro: </strong><a href="https://debezium.io">Debezium</a> is a powerful open-source tool for <strong>Change Data Capture (CDC)</strong>, allowing real-time streaming of changes from PostgreSQL (and other databases) into Kafka. Under the hood, Debezium relies on <strong>logical decoding</strong> via <strong>replication slots</strong> to read WAL (Write-Ahead Log) records.</p><p>While this architecture enables efficient change tracking, it introduces a hidden risk: if a Debezium connector slows down, becomes idle, or is misconfigured, PostgreSQL <strong>retains WAL files</strong> — leading to <strong>replication slot bloat</strong>, increased disk usage, and potential database outages.</p><p>In this article, we’ll explore <strong>5 effective strategies</strong> to prevent replication slot bloat when using Debezium with PostgreSQL.</p><h3>1. Use a Heartbeat Table in Low-Update Environments</h3><p>PostgreSQL’s WAL cleanup depends on the movement of the <strong>confirmed_flush_lsn</strong>, which advances only when Debezium acknowledges new changes. If a connector is watching <strong>low-update tables</strong>, PostgreSQL holds onto old WAL logs, thinking they’re still needed — causing disk space usage to spike.</p><h3>💡 When This Happens:</h3><ul><li>The connector <strong>doesn’t include</strong> tables with frequent changes.</li><li>Another <strong>database on the same PostgreSQL instance</strong> is high-traffic, while the database Debezium tracks is idle.</li><li>On <strong>AWS RDS</strong>, even system writes that don’t generate visible events can cause WAL accumulation if the connector doesn’t produce change events regularly.</li></ul><h3>✅ What to do:</h3><p>Enable Debezium’s <strong>heartbeat mechanism</strong> to periodically emit change events and move the LSN forward.</p><p><strong>Step-by-Step:</strong></p><ol><li>Create a heartbeat table:</li></ol><pre>CREATE TABLE public.heartbeat_table (   id SERIAL PRIMARY KEY,   last_updated TIMESTAMP DEFAULT now() );</pre><p>2. Add it to the replication publication:</p><pre>ALTER PUBLICATION your_publication ADD TABLE public.heartbeat_table</pre><p>3. Enable heartbeats in Debezium config:</p><pre>{   &quot;heartbeat.interval.ms&quot;: &quot;10000&quot; }</pre><p>4. Trigger updates with heartbeat.action.query:</p><pre>{   &quot;heartbeat.action.query&quot;: &quot;UPDATE public.heartbeat_table SET last_updated = now() WHERE id = 1&quot; }</pre><h3>🧠 Why This Works:</h3><ul><li>Keeps confirmed_flush_lsn moving even in low-traffic schemas.</li><li>Prevents WAL from piling up due to inactivity.</li><li>Essential in <strong>multi-database PostgreSQL instances</strong> where WAL is <strong>shared</strong>, but Debezium is only reading from the quieter database.</li></ul><h3>2. Use slot.drop.on.stop and slot.drop.on.delete</h3><p>Debezium provides configuration options to automatically drop replication slots when a connector is stopped or deleted.</p><h3>✅ What to do:</h3><p>Set these properties in your connector config:</p><pre>{<br>  &quot;slot.drop.on.stop&quot;: &quot;true&quot;,<br>  &quot;slot.drop.on.delete&quot;: &quot;true&quot;<br>}</pre><p>This ensures replication slots don’t linger and hold WAL indefinitely after the connector is no longer in use.</p><p><strong>⚠️ Caution:</strong> Use these options <strong>carefully in production</strong>. Dropping a slot requires re-snapshotting the data on restart, which might be expensive or unacceptable for large tables.</p><h3>3. Separate High-Frequency Tables Into Dedicated Connectors</h3><p>Grouping both high- and low-frequency tables in a single Debezium connector causes trouble when one table becomes a bottleneck. Because a single connector processes changes <strong>sequentially</strong>, lag in one table delays the processing of others — and the WAL keeps growing.</p><h3>✅ What to do:</h3><ul><li>Move <strong>high-update tables</strong> into <strong>their own connectors</strong>.</li><li>For multiple high-frequency tables, divide them across connectors to <strong>maximize parallelism</strong>.</li></ul><h3>🔍 Key Benefits:</h3><ul><li><strong>Single connectors are slow</strong> under load.</li><li><strong>Multiple connectors process aggressively in parallel</strong>, advancing the LSN faster and allowing PostgreSQL to reclaim WAL space.</li></ul><h3>🛠 Example Setup:</h3><ul><li>connector_orders: tracks high-volume orders and transactions</li><li>connector_users: tracks medium-activity users and payments</li><li>connector_config: tracks rarely changed reference tables</li></ul><p>This architectural separation improves reliability and WAL cleanup efficiency.</p><h3>4. Remove Unused SMTs (Single Message Transforms)</h3><p>Debezium supports SMTs to transform messages before sending them to Kafka. While powerful, excessive or unnecessary SMTs <strong>slow down processing</strong>, leading to connector lag and delayed WAL consumption.</p><h3>✅ What to do:</h3><ul><li>Review your SMT configuration.</li><li>Remove default or unused SMTs (e.g., ExtractNewRecordState) unless they are strictly required.</li><li>Avoid chaining complex transformations in the connector pipeline — move them downstream if possible.</li></ul><h3>🧠 Example:</h3><p>Instead of masking fields in Debezium, consider doing it in your Kafka consumer to reduce connector overhead and keep CDC flow fast.</p><h3>5. Drop Inactive Replication Slots</h3><p>In some environments, connectors might crash or be deleted without properly cleaning up their replication slots. These orphaned slots remain <strong>inactive but still retain WAL</strong>, bloating disk usage.</p><h3>✅ What to do:</h3><ol><li>List inactive replication slots:</li></ol><pre>SELECT * FROM pg_replication_slots WHERE active = false;</pre><p>2. Drop unused slots safely:</p><pre>SELECT pg_drop_replication_slot(&#39;your_slot_name&#39;);</pre><h3>🔁 Automate It:</h3><p>Set up a cron job, Airflow task, or monitoring alert to periodically detect and remove stale replication slots that are no longer used.</p><h3>Conclusion</h3><p>Debezium makes real-time syncing with PostgreSQL seamless, but it demands careful attention to <strong>replication slot and WAL management</strong>. Left unchecked, WAL bloat can bring down your database — especially in high-throughput or idle environments.</p><p>By following these 5 strategies:</p><ul><li>Use <strong>heartbeat tables</strong> to simulate change events</li><li>Configure <strong>auto-cleanup</strong> with slot.drop.on.stop</li><li><strong>Split large tables</strong> into dedicated connectors</li><li>Remove <strong>redundant SMTs</strong></li><li>Periodically <strong>clean up inactive replication slots</strong></li></ul><p>— you can keep your system lean, performant, and production-ready.</p><p>Understanding <strong>how PostgreSQL shares and releases WAL across databases</strong> is key to operating Debezium reliably — especially in mixed workloads and multi-tenant setups.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f18b9c007d02" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>