Stories by Mansi Bhadani on Medium

Watermarks in Spark Structured Streaming: What They Actually Do

Mansi Bhadani — Sun, 05 Apr 2026 19:22:13 GMT

A practical guide to event-time watermarks — when late data gets dropped, why stateful aggregation memory grows, and how to tune window size for your SLA.

Watermarks are one of those Spark concepts that look simple in the documentation and turn confusing the moment you hit production.

The official description is something like: “a watermark defines how late data can arrive before being dropped.” That’s technically correct and practically useless until you’ve watched Spark silently discard events and spent an afternoon figuring out why.

This is the guide I wish I’d had when I built the NJ Transit streaming pipeline. I’ll explain what watermarks actually do to Spark’s internal state, when late data gets dropped (and when it doesn’t), why memory grows if you tune watermarks wrong, and how to pick the right window and watermark configuration for a given latency SLA.

Why Event Time Is Different From Processing Time

Before watermarks make sense, you need to internalize the event time vs. processing time distinction — because the entire watermark mechanism exists to handle the gap between them.

Processing time is when Spark sees the event. Event time is when the event actually occurred, embedded in the payload.

In a transit pipeline, a vehicle position update might be generated at 14:32:07 on the vehicle, transmitted over a cellular network, buffered in Kafka, and consumed by Spark at 14:32:45. The event time is 14:32:07. The processing time is 14:32:45. The gap is 38 seconds.

That gap is usually small. But networks fail, vehicles go underground, Kafka consumers fall behind. Sometimes that gap is 5 minutes. Sometimes it’s 2 hours.

If you’re computing windowed aggregations (e.g., average delay per route over 5-minute windows), you need to decide: when is a 5-minute window “done” and ready to emit? With processing time, you just wait 5 minutes of wall-clock time. With event time, you don’t know when all the events for a window have arrived — because some of them might still be in transit.

Watermarks are Spark’s answer to that problem.

What a Watermark Actually Does

When you define a watermark in Spark:

df.withWatermark("event_timestamp", "2 minutes")

You’re telling Spark: “Expect events to arrive up to 2 minutes out of order. Once an event falls more than 2 minutes behind the latest event time you have seen, you may drop it.”

The watermark itself is computed as:

watermark = max(event_timestamp seen so far) - delay_threshold

So if the latest event Spark has seen has a timestamp of 14:40:00, and the watermark delay is 2 minutes, the current watermark is 14:38:00.

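Any event with an event_timestamp earlier than 14:38:00 is now considered too late. For a windowed aggregation, that means an event is dropped once the end of its window is at or before 14:38:00. Spark recomputes the watermark at the end of each micro-batch and applies it to the next one.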

This is the part that surprises people: the watermark isn’t a fixed time threshold. It’s a moving threshold driven by the maximum event time Spark has observed. If your pipeline stalls and no new events arrive, the watermark doesn’t advance — it stays frozen, and no windows close.
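A minimal pure-Python sketch of that moving threshold may help. It is a toy model, not Spark's implementation: the `WatermarkTracker` name is hypothetical, and it updates per event, whereas Spark updates the watermark at micro-batch boundaries.

```python
from datetime import datetime, timedelta
from typing import Optional


class WatermarkTracker:
    """Toy model: watermark = max event time seen so far - delay threshold."""

    def __init__(self, delay: timedelta):
        self.delay = delay
        self.max_event_time: Optional[datetime] = None

    def observe(self, event_time: datetime) -> None:
        # Only a newer event time can move the watermark forward.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    @property
    def watermark(self) -> Optional[datetime]:
        if self.max_event_time is None:
            return None  # no events yet, so no watermark
        return self.max_event_time - self.delay


tracker = WatermarkTracker(delay=timedelta(minutes=2))
tracker.observe(datetime(2026, 4, 5, 14, 40, 0))
print(tracker.watermark)  # 2026-04-05 14:38:00

# An older event arriving afterwards does not move the watermark backwards.
tracker.observe(datetime(2026, 4, 5, 14, 35, 0))
print(tracker.watermark)  # 2026-04-05 14:38:00
```

If no further events are observed, `watermark` keeps returning the same value, which is the frozen-watermark behavior described above.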

The State Store: Why Memory Grows

Here’s the mechanism that bites people in production.

Spark Structured Streaming maintains a state store for stateful operations like windowed aggregations. For every open window, Spark keeps partial aggregation state in memory (and on disk as a checkpoint) until the window is finalized and emitted.

A window is finalized when the watermark advances past the window’s end time.

If your watermark delay is set too generously (say, 30 minutes), Spark keeps all windows open for at least 30 minutes. If you have 5-minute tumbling windows and 30 minutes of watermark delay, that’s at minimum 7 windows open simultaneously, each accumulating state.

For a high-volume topic with many distinct grouping keys (route × vehicle × direction), this state can be substantial:

from pyspark.sql.functions import avg, col, window

# This configuration creates a LOT of state
df.withWatermark("event_timestamp", "30 minutes") \
  .groupBy(
      window("event_timestamp", "5 minutes"),
      col("route_id"),
      col("vehicle_id"),
      col("direction")
  ) \
  .agg(avg("delay_seconds").alias("avg_delay"))

With 300 routes × 50 vehicles × 2 directions = 30,000 grouping key combinations, and 7 open windows, you’re maintaining state for up to 210,000 partial aggregations simultaneously.

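Increase the watermark delay or the number of distinct grouping keys, and that number grows proportionally. Shrinking the window size or switching to sliding windows has the same effect, because more windows are open at once for the same delay.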
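The arithmetic above can be sketched as a rough estimator. This assumes tumbling windows and that every key is active in every window; `estimate_state_entries` is a hypothetical helper, not a Spark API, and real state size also depends on the aggregation and the state store backend.

```python
import math


def estimate_state_entries(watermark_delay_s: int, window_size_s: int, distinct_keys: int) -> int:
    """Upper-bound estimate of partial aggregations held in state (tumbling windows)."""
    # Windows still waiting on the watermark, plus the window currently filling.
    open_windows = math.ceil(watermark_delay_s / window_size_s) + 1
    return open_windows * distinct_keys


keys = 300 * 50 * 2  # routes x vehicles x directions

print(estimate_state_entries(30 * 60, 5 * 60, keys))  # 210000
print(estimate_state_entries(2 * 60, 5 * 60, keys))   # 60000
```

Dropping the delay from 30 minutes to 2 minutes cuts the estimate from 7 open windows to 2 for the same key space.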

When Late Data Gets Dropped vs. Included

The behavior here is more nuanced than “late = dropped.”

For a windowed aggregation, an event is dropped if the end of its window is already at or before the current watermark when it arrives. The current watermark is max_event_time_seen - delay_threshold.

An event is included if it arrives before the watermark advances past its window’s end time — even if it arrives “late” relative to processing time.

Example with a 5-minute window and a 2-minute watermark delay:

Window: [14:30:00, 14:35:00)
Event arrives at processing time 14:37:30 with event_time 14:34:45

Current watermark = max_seen_event_time - 2min

If max_seen_event_time = 14:36:30:
  watermark = 14:34:30
  event_time 14:34:45 > watermark → EVENT INCLUDED ✓

If max_seen_event_time = 14:37:30:
  watermark = 14:35:30
  window end 14:35:00 < watermark → WINDOW CLOSED
  event_time 14:34:45 → EVENT DROPPED ✗

The event’s fate depends not on how late it arrived, but on where the watermark is when it arrives.
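The two cases in the example can be checked with a few lines of plain Python. This is a sketch of the rule as described here, not Spark code; `event_fate` is a hypothetical name, and the date is arbitrary.

```python
from datetime import datetime, timedelta


def event_fate(window_end: datetime, max_seen_event_time: datetime, delay: timedelta) -> str:
    """Return 'included' while the event's window is still open, else 'dropped'."""
    watermark = max_seen_event_time - delay
    # The window is closed once the watermark has reached its end time.
    return "dropped" if window_end <= watermark else "included"


window_end = datetime(2026, 4, 5, 14, 35, 0)  # window [14:30:00, 14:35:00)
delay = timedelta(minutes=2)

print(event_fate(window_end, datetime(2026, 4, 5, 14, 36, 30), delay))  # included
print(event_fate(window_end, datetime(2026, 4, 5, 14, 37, 30), delay))  # dropped
```

Note that the event's own timestamp never appears in the check: only its window's end and the watermark matter.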

Tuning Watermarks for Your SLA

Here’s a practical framework for picking watermark delay values:

Step 1: Measure Your Actual Event Delay Distribution

Before guessing a watermark value, instrument your pipeline to measure the real delay distribution. In the transit pipeline, I added this to a monitoring job:

from pyspark.sql.functions import (
    col, current_timestamp, max as max_, percentile_approx, unix_timestamp
)

# Delay = when Spark processed the event minus when the event occurred
delay_stats = df \
    .withColumn("processing_delay_seconds",
                unix_timestamp(current_timestamp()) -
                unix_timestamp(col("event_timestamp"))) \
    .agg(
        percentile_approx("processing_delay_seconds", 0.50).alias("p50_delay"),
        percentile_approx("processing_delay_seconds", 0.95).alias("p95_delay"),
        percentile_approx("processing_delay_seconds", 0.99).alias("p99_delay"),
        max_("processing_delay_seconds").alias("max_delay")
    )

Run this for a week. The 99th percentile of your event delay distribution is your starting watermark value. If p99 is 45 seconds, use "1 minute". If p99 is 4 minutes, use "5 minutes".
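Turning a measured p99 into a starting value could look like the sketch below. The nearest-rank percentile, the 25% headroom, and the rounding to whole minutes are all assumptions chosen to reproduce the two rules of thumb above; `suggest_watermark_seconds` is a hypothetical helper.

```python
import math


def suggest_watermark_seconds(delays_s, percentile_pct=99, headroom=1.25, round_to_s=60):
    """Nearest-rank percentile of observed delays, padded and rounded up."""
    if not delays_s:
        raise ValueError("need at least one delay sample")
    ordered = sorted(delays_s)
    # Nearest-rank: smallest sample with at least percentile_pct% of samples at or below it.
    rank = max(1, math.ceil(percentile_pct * len(ordered) / 100))
    padded = ordered[rank - 1] * headroom
    return math.ceil(padded / round_to_s) * round_to_s


# 100 samples whose p99 is 45 seconds (the 600 s outlier sits above p99).
samples = [5] * 90 + [30] * 8 + [45, 600]
print(suggest_watermark_seconds(samples))  # 60

# 100 samples whose p99 is 4 minutes.
slow_samples = [20] * 98 + [240, 900]
print(suggest_watermark_seconds(slow_samples))  # 300
```

The outliers above p99 are deliberately ignored: covering them would mean a much longer watermark and correspondingly more state.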

Step 2: Balance Latency vs. Completeness

Watermark delay is a direct tradeoff between output latency and result completeness:

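| Watermark Delay | Output Latency | Late Data Included | State Size |
|-----------------|----------------|--------------------|------------|
| Short (30s)     | Low            | ~95%               | Small      |
| Medium (2min)   | Medium         | ~99%               | Medium     |
| Long (10min)    | High           | ~99.9%             | Large      |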

For the transit pipeline, the SLA was sub-60 second end-to-end latency. That meant I couldn’t use a watermark longer than about 90 seconds — otherwise windows would never close within the SLA window.

The measured p99 delay was ~45 seconds (mostly Kafka consumer lag plus network). I used a 90-second watermark, which covered ~99.2% of events and kept state size manageable.

transit_agg = df \
    .withWatermark("event_timestamp", "90 seconds") \
    .groupBy(
        window("event_timestamp", "5 minutes", "1 minute"),  # 5min window, 1min slide
        col("route_id")
    ) \
    .agg(
        avg("delay_seconds").alias("avg_delay"),
        count("*").alias("event_count")
    )

Step 3: Monitor Window Closure Rate

Add a metric to track how many windows are closing per micro-batch. If windows aren’t closing, the watermark isn’t advancing — usually because no new events are arriving, or because you’re seeing a flood of very old events.

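# Pass this to .writeStream.outputMode("append").foreachBatch(process_batch).
# In append mode a micro-batch only contains window rows the watermark has
# finalized, so the batch row count is the number of windows closed in it.
def process_batch(df, epoch_id):
    print(f"Epoch {epoch_id}: {df.count()} windows closed")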

Common Watermark Bugs

Bug 1: Watermark Never Advances

Symptom: State store grows without bound. Windows never emit.

Cause: No new events arriving with recent event timestamps. If your source has a dead period (overnight, weekends), the watermark freezes and nothing closes.

Fix: Either use processing time for windows that must close on a schedule, or send periodic heartbeat events with current timestamps to advance the watermark.

Bug 2: All Late Data Gets Dropped

Symptom: Aggregation counts are consistently 20–30% lower than expected.

Cause: Watermark delay is set shorter than the actual p99 event delay.

Fix: Instrument your actual delay distribution (see Step 1 above). Increase the watermark delay to cover the p99.

Bug 3: Out-of-Order Events Causing Wrong Aggregations

Symptom: Window aggregations are correct in total but wrong per-window.

Cause: The windows are keyed on processing time rather than event time, so an event that occurred during window A but arrived while window B was being processed is counted in window B. The overall total is unaffected, but the per-window figures are wrong.

Fix: Always use event_timestamp (not processing timestamp) as the window column. This is the whole point of event-time windows.

# WRONG - uses processing time
df.groupBy(window(current_timestamp(), "5 minutes"))

# RIGHT - uses event time
df.groupBy(window(col("event_timestamp"), "5 minutes"))

Output Modes and Watermarks

One more thing that trips people up: not all output modes work with watermarks.

Append mode: Only emits rows once their window is finalized (past the watermark). Best for downstream sinks that can’t handle updates (Kafka, Snowflake via COPY). Streaming aggregations in append mode require a watermark.
Update mode: Emits rows every time they change. Doesn’t require a watermark, but state never gets cleaned up without one.
Complete mode: Emits the full result table every batch. The watermark is not used to clean up state, because every result has to be retained. Only practical for small result sets.

For the transit pipeline, I used append mode — Snowflake COPY INTO doesn’t handle upserts efficiently, and I wanted to emit finalized aggregations only once.

# Parentheses instead of backslashes: a comment after a trailing "\" is a syntax error
query = (
    transit_agg.writeStream
    .outputMode("append")  # Only emit finalized windows
    .format("snowflake")
    .option("sfURL", snowflake_url)
    .option("dbtable", "fct_transit_aggregations")
    .option("checkpointLocation", "/checkpoints/transit-agg")
    .start()
)

Takeaway

Watermarks aren’t just a “drop late data” switch. They’re Spark’s mechanism for deciding when windowed state is safe to finalize and clean up. Get them wrong in one direction and you drop events you needed. Get them wrong in the other direction and your memory grows until the job crashes.

The right watermark value is determined empirically: measure your actual event delay distribution, set the delay to cover p99, and monitor window closure rate in production. Then adjust.

The goal isn’t zero late data. The goal is a predictable, bounded tradeoff between latency and completeness — and watermarks are the knob.

Building a Dead-Letter Queue for Your Kafka Pipeline the Right Way

Mansi Bhadani — Sun, 05 Apr 2026 19:21:26 GMT

Every streaming pipeline drops records. Here’s how to route them to a dead-letter topic with structured failure reasons, and build a Slack alert before your downstream notices.

Every Kafka pipeline drops records. The question isn’t if — it’s whether you know when it happens, why it happened, and whether you can recover.

Most pipelines I’ve seen handle this one of two ways: they silently swallow failures (log an error, increment a counter, move on), or they crash the consumer entirely and page someone at 2 AM. Neither is good.

A dead-letter queue (DLQ) — a dedicated Kafka topic where failed records land with structured failure metadata — is the third option. It gives you visibility into failures, preserves the original event for replay, and keeps your main pipeline running while you investigate.

Here’s how to build one properly.

What a DLQ Solves (and Doesn’t)

A DLQ handles processing failures — events that arrive at your consumer but can’t be processed due to:

Schema validation failures (malformed Avro, missing required fields)
Deserialization errors (corrupted bytes, wrong schema version)
Business logic rejections (null trip_id on a transit event, negative loan amount)
Downstream write failures (Snowflake temporarily unavailable)

A DLQ does not handle events that never arrived (network partition between producer and broker) or events that were produced to the wrong topic. Those are upstream problems.

The DLQ Topic Design

The DLQ is a regular Kafka topic. What makes it useful is the structure of the messages you write to it.

A DLQ message should contain three things:

The original event bytes — exact bytes from the source topic, unmodified
Failure metadata — what failed, why, which consumer, which offset
Routing information — which source topic and partition the event came from

Here’s the Avro schema I use for DLQ messages:

{
  "type": "record",
  "name": "DeadLetterRecord",
  "namespace": "com.pipeline.dlq",
  "fields": [
    {"name": "original_topic", "type": "string"},
    {"name": "original_partition", "type": "int"},
    {"name": "original_offset", "type": "long"},
    {"name": "original_timestamp", "type": "long"},
    {"name": "original_key", "type": ["null", "bytes"], "default": null},
    {"name": "original_value", "type": "bytes"},
    {"name": "failure_timestamp", "type": "long"},
    {"name": "failure_reason", "type": {
      "type": "enum",
      "name": "FailureReason",
      "symbols": [
        "SCHEMA_VALIDATION_FAILED",
        "DESERIALIZATION_ERROR",
        "BUSINESS_RULE_VIOLATION",
        "DOWNSTREAM_WRITE_FAILED",
        "UNKNOWN"
      ]
    }},
    {"name": "failure_message", "type": "string"},
    {"name": "consumer_group", "type": "string"},
    {"name": "pipeline_version", "type": "string"}
  ]
}

The failure_reason enum is important. It lets you filter the DLQ topic by failure type — schema failures are usually upstream producer bugs; downstream write failures are usually transient and worth retrying; business rule violations need human review.
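As a rough illustration of that triage, the sketch below counts DLQ records per suggested action. The action names are hypothetical, and routing deserialization errors to the producer team is an assumption on top of the three cases described above.

```python
from collections import Counter

# Hypothetical action names; only the first, third and fourth mappings come from the text.
TRIAGE_ACTIONS = {
    "SCHEMA_VALIDATION_FAILED": "notify_producer_team",
    "DESERIALIZATION_ERROR": "notify_producer_team",  # assumption
    "DOWNSTREAM_WRITE_FAILED": "retry",
    "BUSINESS_RULE_VIOLATION": "human_review",
}


def triage(records):
    """Count DLQ records per suggested action; unmapped reasons go to human review."""
    return Counter(
        TRIAGE_ACTIONS.get(record["failure_reason"], "human_review")
        for record in records
    )


dlq_records = [
    {"failure_reason": "DOWNSTREAM_WRITE_FAILED"},
    {"failure_reason": "DOWNSTREAM_WRITE_FAILED"},
    {"failure_reason": "SCHEMA_VALIDATION_FAILED"},
    {"failure_reason": "UNKNOWN"},
]
summary = triage(dlq_records)
print(summary["retry"], summary["notify_producer_team"], summary["human_review"])  # 2 1 1
```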

The Python Implementation

Here’s a production-ready DLQ producer that wraps your consumer logic:

import base64
import json
import logging
import time

from confluent_kafka import Consumer, KafkaError, Producer

logger = logging.getLogger(__name__)


def _b64(raw):
    # JSON cannot carry raw bytes, so binary fields are stored as base64 text.
    # (An Avro serializer using the schema above would carry bytes natively.)
    return base64.b64encode(raw).decode("ascii") if raw is not None else None


class DLQProducer:
    def __init__(self, bootstrap_servers: str, dlq_topic: str, consumer_group: str):
        self.dlq_topic = dlq_topic
        self.consumer_group = consumer_group
        self.producer = Producer({"bootstrap.servers": bootstrap_servers})

    def send_to_dlq(
        self,
        original_message,
        failure_reason: str,
        failure_message: str,
        pipeline_version: str = "1.0.0"
    ):
        dlq_record = {
            "original_topic": original_message.topic(),
            "original_partition": original_message.partition(),
            "original_offset": original_message.offset(),
            "original_timestamp": original_message.timestamp()[1],
            "original_key": _b64(original_message.key()),
            "original_value": _b64(original_message.value()),
            "failure_timestamp": int(time.time() * 1000),
            "failure_reason": failure_reason,
            "failure_message": str(failure_message)[:2000],  # cap message length
            "consumer_group": self.consumer_group,
            "pipeline_version": pipeline_version,
        }

        self.producer.produce(
            topic=self.dlq_topic,
            key=original_message.key(),
            value=json.dumps(dlq_record).encode("utf-8"),
            on_delivery=self._delivery_callback,
        )
        self.producer.poll(0)

    def _delivery_callback(self, err, msg):
        if err:
            logger.error(f"DLQ delivery failed: {err}")
        else:
            logger.debug(f"DLQ record delivered: {msg.topic()}[{msg.partition()}]@{msg.offset()}")

    def flush(self):
        self.producer.flush()

Now wrap your consumer’s processing logic:

def process_transit_event(message, dlq_producer: DLQProducer):
    try:
        # Deserialize
        try:
            event = deserialize_avro(message.value())
        except Exception as e:
            dlq_producer.send_to_dlq(
                message,
                failure_reason="DESERIALIZATION_ERROR",
                failure_message=str(e)
            )
            return

        # Validate business rules
        validation_errors = validate_transit_event(event)
        if validation_errors:
            dlq_producer.send_to_dlq(
                message,
                failure_reason="BUSINESS_RULE_VIOLATION",
                failure_message="; ".join(validation_errors)
            )
            return

        # Write to Snowflake
        try:
            write_to_snowflake(event)
        except SnowflakeWriteError as e:
            dlq_producer.send_to_dlq(
                message,
                failure_reason="DOWNSTREAM_WRITE_FAILED",
                failure_message=str(e)
            )
            return

    except Exception as e:
        # Catch-all for unexpected failures
        dlq_producer.send_to_dlq(
            message,
            failure_reason="UNKNOWN",
            failure_message=str(e)
        )
        logger.exception(f"Unexpected error processing message: {e}")

def validate_transit_event(event: dict) -> list[str]:
    errors = []
    if not event.get("trip_id"):
        errors.append("trip_id is null or empty")
    if not event.get("route_id"):
        errors.append("route_id is null or empty")
    if event.get("delay_seconds") is not None and abs(event["delay_seconds"]) > 3600:
        errors.append(f"delay_seconds {event['delay_seconds']} exceeds plausible range")
    return errors

The Consumer Loop

def run_consumer(
    bootstrap_servers: str,
    source_topic: str,
    dlq_topic: str,
    consumer_group: str
):
    consumer = Consumer({
        "bootstrap.servers": bootstrap_servers,
        "group.id": consumer_group,
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,  # Manual commit ONLY after processing
    })

    dlq_producer = DLQProducer(bootstrap_servers, dlq_topic, consumer_group)

    consumer.subscribe([source_topic])

    try:
        while True:
            msg = consumer.poll(timeout=1.0)

            if msg is None:
                continue

            if msg.error():
                if msg.error().code() == KafkaError._PARTITION_EOF:
                    continue
                else:
                    logger.error(f"Consumer error: {msg.error()}")
                    continue

            process_transit_event(msg, dlq_producer)

            # Commit AFTER processing (or sending to DLQ)
            # This ensures at-least-once delivery
            consumer.commit(asynchronous=False)

    finally:
        dlq_producer.flush()
        consumer.close()

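The key detail: commit only after the message has been either successfully processed or sent to the DLQ. If your process crashes mid-handling, the message will be redelivered, which is correct behavior. One caveat: produce() is asynchronous, so a DLQ record is only durable once its delivery callback has fired. If losing a DLQ record is unacceptable, call flush() before committing the offset.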

Slack Alerting on DLQ Volume Spikes

A DLQ is only useful if someone knows when it’s filling up. I use a separate monitoring job that tails the DLQ topic and sends Slack alerts when failure rate exceeds a threshold.

import requests
from collections import defaultdict, deque
from datetime import datetime, timedelta

class DLQMonitor:
    def __init__(self, slack_webhook_url: str, alert_threshold: int = 10):
        self.slack_webhook_url = slack_webhook_url
        self.alert_threshold = alert_threshold
        self.failure_counts = defaultdict(lambda: deque(maxlen=100))
        self.last_alert_time = {}

    def record_failure(self, failure_reason: str, source_topic: str):
        key = f"{source_topic}:{failure_reason}"
        self.failure_counts[key].append(datetime.utcnow())
        self._check_alert(key, failure_reason, source_topic)

    def _check_alert(self, key: str, failure_reason: str, source_topic: str):
        # Count failures in the last 5 minutes
        cutoff = datetime.utcnow() - timedelta(minutes=5)
        recent_failures = sum(
            1 for ts in self.failure_counts[key] if ts > cutoff
        )

        # Alert if threshold exceeded and no alert sent in last 15 minutes
        last_alert = self.last_alert_time.get(key, datetime.min)
        if (recent_failures >= self.alert_threshold and
                datetime.utcnow() - last_alert > timedelta(minutes=15)):
            self._send_slack_alert(failure_reason, source_topic, recent_failures)
            self.last_alert_time[key] = datetime.utcnow()

    def _send_slack_alert(self, failure_reason: str, source_topic: str, count: int):
        message = {
            "text": f"🚨 *DLQ Alert* — `{source_topic}`",
            "attachments": [{
                "color": "danger",
                "fields": [
                    {"title": "Failure Reason", "value": failure_reason, "short": True},
                    {"title": "Count (last 5 min)", "value": str(count), "short": True},
                    {"title": "Time", "value": datetime.utcnow().strftime("%Y-%m-%d %H:%M UTC"), "short": True},
                    {"title": "Action", "value": "Check DLQ topic for details", "short": False},
                ]
            }]
        }

        try:
            response = requests.post(
                self.slack_webhook_url,
                json=message,
                timeout=5
            )
            response.raise_for_status()
        except Exception as e:
            logger.error(f"Failed to send Slack alert: {e}")

Replaying DLQ Events

The whole point of the DLQ is that you can replay events after fixing the underlying issue. A replay script reads from the DLQ, filters by failure reason or time range, and re-produces the original events to the source topic:

import base64
import json
from typing import Optional


def _from_b64(text):
    # original_key / original_value are assumed to be stored as base64 text,
    # because raw bytes cannot be placed in a JSON document
    return base64.b64decode(text) if text is not None else None


def replay_dlq_events(
    bootstrap_servers: str,
    dlq_topic: str,
    source_topic: str,
    failure_reason_filter: Optional[str] = None,
    since_timestamp: Optional[int] = None
):
    consumer = Consumer({
        "bootstrap.servers": bootstrap_servers,
        "group.id": "dlq-replay-job",
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,
    })

    producer = Producer({"bootstrap.servers": bootstrap_servers})
    consumer.subscribe([dlq_topic])

    replayed = 0
    skipped = 0

    try:
        while True:
            msg = consumer.poll(timeout=5.0)
            if msg is None:
                break  # Nothing for 5 seconds: treat the DLQ as drained
            if msg.error():
                logger.error(f"Replay consumer error: {msg.error()}")
                continue

            dlq_record = json.loads(msg.value())

            # Apply filters
            if failure_reason_filter and dlq_record["failure_reason"] != failure_reason_filter:
                skipped += 1
                continue

            if since_timestamp and dlq_record["failure_timestamp"] < since_timestamp:
                skipped += 1
                continue

            # Re-produce original event to source topic
            producer.produce(
                topic=source_topic,
                key=_from_b64(dlq_record.get("original_key")),
                value=_from_b64(dlq_record.get("original_value")),
            )
            producer.poll(0)  # Serve delivery callbacks so the local queue drains
            replayed += 1

    finally:
        producer.flush()
        consumer.close()
        logger.info(f"Replay complete: {replayed} events replayed, {skipped} skipped")

Operational Tips

Use a separate consumer group for DLQ monitoring. Don’t reuse the main consumer group — you want to be able to tail the DLQ independently without affecting main pipeline offsets.

Set a retention policy on the DLQ topic. 7–14 days is usually sufficient. Use compaction only if your events have meaningful keys; otherwise, time-based retention is cleaner.

Alert on DLQ rate, not just DLQ size. A spike in failures in the last 5 minutes is more actionable than “the DLQ has 10,000 records.” The size accumulates over time; the rate tells you something is actively wrong.

Never auto-replay from DLQ without human review. The whole point of routing to a DLQ is that something needs investigation. Automated replay without a fix in place just moves the problem around.
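To make the rate-versus-size tip concrete, here is a small sliding-window counter in plain Python. It is a sketch with an injected clock (timestamps in seconds) so it can be exercised without Kafka; `FailureRateWindow` is a hypothetical name, and it assumes failures are recorded in non-decreasing time order.

```python
from collections import deque


class FailureRateWindow:
    """Counts failures seen in the trailing window; total size is tracked separately."""

    def __init__(self, window_seconds: float = 300):
        self.window_seconds = window_seconds
        self.timestamps = deque()
        self.total = 0  # everything ever recorded, i.e. "DLQ size"

    def record(self, ts: float) -> None:
        self.timestamps.append(ts)
        self.total += 1

    def rate(self, now: float) -> int:
        # Evict timestamps that have aged out of the trailing window.
        cutoff = now - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps)


window = FailureRateWindow(window_seconds=300)
for ts in (0, 10, 20, 400):
    window.record(ts)

print(window.total)          # 4
print(window.rate(now=410))  # 1
```

The size says four records have ever landed in the DLQ; the rate says only one of them is recent, which is the number worth alerting on.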

Takeaway

A dead-letter queue isn’t just a safety net — it’s observability infrastructure. It transforms “the pipeline is dropping records” from a mystery into a structured log you can query, alert on, and replay from.

Build it before you need it. Wire the Slack alert. The first time your DLQ fires and you can tell a stakeholder exactly what failed, why, and when you’ll have it fixed, you’ll be glad you did.

dbt Tests That Actually Catch Real Data Quality Issues (Not Just Null Checks)

Mansi Bhadani — Wed, 01 Apr 2026 20:21:52 GMT

Beyond not_null and unique: building custom dbt tests for range validation, cross-table referential integrity, and statistical drift detection.

Every dbt project I’ve seen starts the same way: someone adds not_null and unique tests to the primary key column, runs dbt test, sees green, and calls it "tested."

Then three months later a downstream dashboard shows negative revenue or a count that’s off by 30%, and the investigation reveals the dbt tests were checking the wrong things.

I’ve built data quality frameworks on top of dbt for two projects now — a mortgage risk pipeline and the NJ Transit streaming system — and the tests that actually caught real production issues weren’t the generic ones. They were custom tests built around the specific failure modes of those pipelines.

This is a walkthrough of those tests, how they work, and when to use them.

The Tests That Don’t Catch Real Issues

Let’s be clear about what generic tests can catch, so we know what gap we’re filling.

not_null catches missing required fields. unique catches duplicate primary keys. accepted_values catches invalid enum values. relationships catches broken foreign keys. These are all worth running. They catch a class of issues.

What they don’t catch:

A value that’s present, unique, and valid — but statistically wrong (LTV ratio of 0.002 instead of 2.0)
A column whose distribution has drifted significantly between runs
A referential relationship that’s technically valid but logically broken (matching IDs across tables that were joined on the wrong key)
Aggregations that are technically non-null but wrong by 40%

That’s the gap. Here’s how to fill it.

Test 1: Range Validation with Context

The most common real-world data quality failure I’ve seen isn’t nulls — it’s values outside their expected business range. A loan-to-value ratio of 400 is technically a float, passes not_null, but represents a data pipeline bug (usually a unit conversion error or a denominator-zero edge case).

dbt core doesn’t ship a built-in range test (the dbt_utils package offers accepted_range), but it’s trivial to write your own:

-- tests/generic/assert_column_in_range.sql
{% test assert_column_in_range(model, column_name, min_value, max_value) %}

SELECT *
FROM {{ model }}
WHERE {{ column_name }} IS NOT NULL
  AND (
    {{ column_name }} < {{ min_value }}
    OR {{ column_name }} > {{ max_value }}
  )

{% endtest %}

Apply it in your schema.yml:

models:
  - name: mortgage_features
    columns:
      - name: loan_to_value_ratio
        tests:
          - assert_column_in_range:
              min_value: 0.01
              max_value: 2.0
      - name: interest_rate
        tests:
          - assert_column_in_range:
              min_value: 0.001
              max_value: 0.25
      - name: credit_score
        tests:
          - assert_column_in_range:
              min_value: 300
              max_value: 850

The test fails if any row has a value outside the range. The query returns those rows, so you can inspect them directly.
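To see what "the query returns those rows" means in practice, the sketch below runs the compiled form of the range check against an in-memory SQLite table. The table contents and the `loan_id` column are made up for illustration; dbt itself would run the query in your warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mortgage_features (loan_id TEXT, loan_to_value_ratio REAL)")
conn.executemany(
    "INSERT INTO mortgage_features VALUES (?, ?)",
    [("a", 0.80), ("b", 0.0085), ("c", None), ("d", 4.0)],
)

# Same shape as the generic test with min_value=0.01 and max_value=2.0 substituted in.
failing_rows = conn.execute(
    """
    SELECT loan_id, loan_to_value_ratio
    FROM mortgage_features
    WHERE loan_to_value_ratio IS NOT NULL
      AND (loan_to_value_ratio < 0.01 OR loan_to_value_ratio > 2.0)
    ORDER BY loan_id
    """
).fetchall()

# A dbt test passes when the query returns zero rows and fails otherwise.
print(failing_rows)  # [('b', 0.0085), ('d', 4.0)]
```

Row "c" is not returned: the test deliberately leaves NULL handling to not_null.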

Why this catches real issues: In the mortgage pipeline, this test caught a batch where LTV ratios had been accidentally divided by 100 during a feature transformation refactor. The values were present, non-null, and unique — but completely wrong. not_null would never have caught it.

Test 2: Cross-Table Referential Integrity with Business Logic

dbt’s built-in relationships test checks that a foreign key exists in the referenced table. That's necessary but not sufficient. What you often need is a join that validates business logic, not just key existence.

Example: in the transit pipeline, every trip event should have a corresponding schedule entry. A foreign key check ensures the trip_id exists in the schedule table — but it doesn't check that the event timestamp falls within the scheduled window for that trip.

-- tests/generic/assert_event_within_schedule_window.sql
{% test assert_event_within_schedule_window(model, trip_id_col, event_time_col, buffer_minutes=30) %}

SELECT e.*
FROM {{ model }} e
LEFT JOIN {{ ref('dim_schedule') }} s
  ON e.{{ trip_id_col }} = s.trip_id
WHERE s.trip_id IS NULL
   OR e.{{ event_time_col }} < DATEADD('minute', -{{ buffer_minutes }}, s.scheduled_departure)
   OR e.{{ event_time_col }} > DATEADD('minute', {{ buffer_minutes }}, s.scheduled_arrival)

{% endtest %}

This returns events that either have no matching schedule or fall outside the expected time window by more than the buffer. A standard relationships test would pass; this one catches the logical mismatch.

Test 3: Row Count Drift Detection

One of the most useful tests I’ve built is a row count sanity check. For tables that are loaded incrementally, a batch that’s 50% smaller than the previous batch is usually a bug — a partition filter issue, a source system outage, or a pipeline logic error.

-- tests/generic/assert_row_count_within_threshold.sql
{% test assert_row_count_within_threshold(model, date_column, lookback_days=7, min_ratio=0.5, max_ratio=2.0) %}

WITH daily_counts AS (
    SELECT
        DATE_TRUNC('day', {{ date_column }}) AS load_date,
        COUNT(*) AS row_count
    FROM {{ model }}
    WHERE {{ date_column }} >= DATEADD('day', -({{ lookback_days }} + 1), CURRENT_DATE)
    GROUP BY 1
),
with_lag AS (
    SELECT
        load_date,
        row_count,
        LAG(row_count) OVER (ORDER BY load_date) AS prev_row_count
    FROM daily_counts
)
SELECT *
FROM with_lag
WHERE prev_row_count IS NOT NULL
  AND (
    row_count < prev_row_count * {{ min_ratio }}
    OR row_count > prev_row_count * {{ max_ratio }}
  )

{% endtest %}

Apply it to any incrementally loaded model:

- name: fct_transit_events
  tests:
    - assert_row_count_within_threshold:
        date_column: event_timestamp
        lookback_days: 7
        min_ratio: 0.6
        max_ratio: 1.5

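This test fails if any day in the past week had a row count less than 60% or more than 150% of the previous day. Adjust the ratios for tables with known weekend/weekday variance. Because the query also counts the current day, run it after the daily load has finished; a partially loaded day would otherwise look like a drop.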
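The day-over-day comparison is easy to reason about in plain Python. The sketch below mirrors the LAG logic on a hand-written list of counts; the dates and numbers are invented, and `find_count_anomalies` is a hypothetical helper.

```python
def find_count_anomalies(daily_counts, min_ratio=0.6, max_ratio=1.5):
    """Return (day, count, previous_count) for days outside the allowed ratio band."""
    anomalies = []
    # Pair each day with the day before it, like LAG(row_count) OVER (ORDER BY load_date).
    for (_, prev_count), (day, count) in zip(daily_counts, daily_counts[1:]):
        if count < prev_count * min_ratio or count > prev_count * max_ratio:
            anomalies.append((day, count, prev_count))
    return anomalies


counts = [
    ("2026-03-23", 1000),
    ("2026-03-24", 1050),
    ("2026-03-25", 480),   # drop: below 60% of the previous day
    ("2026-03-26", 1020),  # recovery: above 150% of the (bad) previous day
]
print(find_count_anomalies(counts))
# [('2026-03-25', 480, 1050), ('2026-03-26', 1020, 480)]
```

One consequence worth knowing: a single bad day produces two failing rows, the drop itself and the recovery the day after.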

Why this matters: This test caught two separate issues in the transit pipeline — once when a Kafka consumer group fell behind and batches were being dropped, and once when a Spark watermark was configured too aggressively and late-arriving events were being discarded.

Test 4: Statistical Distribution Drift

For ML feature tables, row count drift isn’t enough. You need to know if the distribution of a column has shifted significantly, which can indicate a data quality issue, an upstream schema change, or genuine drift in the underlying data.

This is a simplified Z-score test that compares the current period’s mean against a historical baseline mean and standard deviation:

-- tests/generic/assert_column_distribution_stable.sql
{% test assert_column_distribution_stable(model, column_name, date_column, zscore_threshold=3.0, lookback_days=30) %}

WITH historical_stats AS (
    SELECT
        AVG({{ column_name }}) AS hist_mean,
        STDDEV({{ column_name }}) AS hist_stddev
    FROM {{ model }}
    -- Half-open range so the boundary day is not counted in both periods
    WHERE {{ date_column }} >= DATEADD('day', -({{ lookback_days }} + 7), CURRENT_DATE)
      AND {{ date_column }} < DATEADD('day', -7, CURRENT_DATE)
),
current_stats AS (
    SELECT
        AVG({{ column_name }}) AS curr_mean
    FROM {{ model }}
    WHERE {{ date_column }} >= DATEADD('day', -7, CURRENT_DATE)
),
zscore_calc AS (
    SELECT
        ABS(c.curr_mean - h.hist_mean) / NULLIF(h.hist_stddev, 0) AS zscore
    FROM current_stats c, historical_stats h
)
SELECT *
FROM zscore_calc
WHERE zscore > {{ zscore_threshold }}

{% endtest %}

Apply to key feature columns:

- name: mortgage_features
  tests:
    - assert_column_distribution_stable:
        column_name: debt_to_income_ratio
        date_column: origination_date
        zscore_threshold: 3.0
        lookback_days: 30

If the current week’s mean is more than 3 standard deviations from the 30-day historical mean, the test fails. This catches upstream data issues that would silently degrade model performance.
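
The same calculation can be sketched in plain Python to sanity-check a threshold before wiring it into dbt. This is an illustrative sketch, not the test itself: the function name and sample values are hypothetical, and it assumes STDDEV in your warehouse is the sample standard deviation.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def distribution_zscore(
    historical: Sequence[float], current: Sequence[float]
) -> Optional[float]:
    """Z-score of the current mean against the historical mean and stddev.

    Returns None when the historical stddev is zero, mirroring NULLIF(..., 0).
    """
    hist_stddev = stdev(historical)  # sample standard deviation (assumption)
    if hist_stddev == 0:
        return None
    return abs(mean(current) - mean(historical)) / hist_stddev


# Hypothetical debt-to-income values: a stable baseline and a shifted week.
baseline = [0.30, 0.32, 0.28, 0.31, 0.29]
this_week = [0.40, 0.42, 0.38]

zscore = distribution_zscore(baseline, this_week)
print(zscore is not None and zscore > 3.0)
```

Because the denominator is the row-level standard deviation rather than the standard error of the mean, this is a fairly conservative check. It tends to fire only on large shifts.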

Test 5: Aggregation Reconciliation

For financial pipelines, you often need to ensure that a fact table’s aggregations reconcile with a source-of-truth. This is a source-to-target reconciliation test:

-- tests/generic/assert_aggregate_reconciles.sql
{% test assert_aggregate_reconciles(model, agg_column, source_model, source_agg_column, join_column, tolerance=0.01) %}

WITH model_agg AS (
    SELECT
        {{ join_column }},
        SUM({{ agg_column }}) AS model_total
    FROM {{ model }}
    GROUP BY {{ join_column }}
),
source_agg AS (
    SELECT
        {{ join_column }},
        SUM({{ source_agg_column }}) AS source_total
    FROM {{ source_model }}
    GROUP BY {{ join_column }}
)
SELECT m.{{ join_column }},
       m.model_total,
       s.source_total,
       ABS(m.model_total - s.source_total) / NULLIF(s.source_total, 0) AS pct_diff
FROM model_agg m
JOIN source_agg s USING ({{ join_column }})
WHERE ABS(m.model_total - s.source_total) / NULLIF(s.source_total, 0) > {{ tolerance }}

{% endtest %}

This fails if any grouping key has a discrepancy greater than the tolerance (default 1%). Useful for ensuring that a transformed fact table’s totals reconcile with the raw source.
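
Here is a small Python sketch of the comparison the test performs, with hypothetical totals keyed by month. The function name and data are made up for illustration; the logic mirrors the SQL, including its inner-join behavior.

```python
from typing import Dict, List, Tuple


def reconcile_totals(
    model_totals: Dict[str, float],
    source_totals: Dict[str, float],
    tolerance: float = 0.01,
) -> List[Tuple[str, float, float, float]]:
    """Return (key, model_total, source_total, pct_diff) for keys over tolerance."""
    failures = []
    # Inner-join semantics: only keys present on both sides are compared.
    for key in sorted(model_totals.keys() & source_totals.keys()):
        model_total = model_totals[key]
        source_total = source_totals[key]
        if source_total == 0:
            # NULLIF(source_total, 0) yields NULL in SQL, so no row is returned.
            continue
        pct_diff = abs(model_total - source_total) / source_total
        if pct_diff > tolerance:
            failures.append((key, model_total, source_total, pct_diff))
    return failures


# Hypothetical totals: October is off by 2.5%, November exists only in the model.
model = {"2025-09": 1000.0, "2025-10": 2050.0, "2025-11": 10.0}
source = {"2025-09": 1000.0, "2025-10": 2000.0}

for failure in reconcile_totals(model, source):
    print(failure)
```

Note that the November key is never flagged, because it has no counterpart in the source. If missing keys matter for your pipeline, a FULL OUTER JOIN variant of the test would be worth considering.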

Putting It Together: A Testing Strategy

Generic tests (not_null, unique, relationships) should cover your staging models — catching issues at the source layer before they propagate.

Custom tests (range_validation, distribution_drift, row_count_drift) should cover your mart and feature models — catching semantic issues that only become visible after transformation.

A practical structure:

# models/staging/schema.yml
models:
  - name: stg_loan_applications
    columns:
      - name: application_id
        tests: [not_null, unique]
      - name: applicant_ssn_hash
        tests: [not_null]

# models/marts/schema.yml  
models:
  - name: fct_loan_features
    tests:
      - assert_row_count_within_threshold:
          date_column: load_date
    columns:
      - name: loan_to_value_ratio
        tests:
          - assert_column_in_range:
              min_value: 0.01
              max_value: 2.0
          - assert_column_distribution_stable:
              date_column: origination_date

Run Tests in CI, Alert on Failure

None of this matters if you only run tests manually. Wire dbt test into your Airflow DAG after every model run:

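# dags/dbt_pipeline.py
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

# dag_id and start_date are placeholders; add the schedule your pipeline needs
with DAG(
    dag_id="dbt_pipeline",
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    run_models = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --select marts.+",
    )

    test_models = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --select marts.+ --store-failures",
    )

    # Tests run only after the models have been built
    run_models >> test_models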

With --store-failures, dbt writes failing rows to tables in an audit schema in your warehouse (by default, your target schema with a dbt_test__audit suffix). When a test fails, you can query those rows directly to understand what happened.

Takeaway

The not_null and unique tests are table stakes. They catch a real class of issues and you should run them. But they don't catch the issues that actually hurt you in production — wrong values, drifting distributions, aggregation discrepancies.

Build custom tests around your pipeline’s specific failure modes. Start with range validation (catches unit errors) and row count drift (catches pipeline failures). Add distribution drift if you’re feeding ML models. Add aggregation reconciliation if you’re working with financial data.

The tests that matter are the ones that would have caught the last bug.

How I Passed the SnowPro Core Exam in 3 Weeks While in Grad School

Mansi Bhadani — Wed, 01 Apr 2026 20:21:20 GMT

Study plan, resource list, the 5 topic areas that actually appear on the exam, and the one practice test worth paying for.

I passed the SnowPro Core Certification exam in November 2024 while taking 12 credits at Pace University and working on the NJ Transit capstone pipeline. Total study time: 3 weeks, roughly 1–2 hours per evening.

This guide is what I wish someone had handed me before I started. Not the “read all 800 pages of Snowflake documentation” advice — the actual focused plan that gets you through the exam without burning out.

What the SnowPro Core Exam Actually Tests

The exam is 100 questions, 115 minutes, passing score 750/1000. Snowflake publishes an official exam guide, but it’s broad enough to be almost useless for prioritization.

Here’s what actually shows up based on my experience and the experience of other candidates I’ve talked to:

1. Snowflake Architecture (~20–25% of questions): Virtual warehouses, micro-partitioning, clustering keys, the storage/compute/cloud services layer separation. You need to understand how each layer works and why this architecture enables independent scaling.

2. Virtual Warehouses and Performance (~15–20%): Warehouse sizes, auto-suspend and auto-resume, multi-cluster warehouses, query acceleration service, result caching vs. local disk caching vs. remote disk caching. The three caching layers are a common question area.

3. Data Loading and Unloading (~15%): COPY INTO, stages (internal vs. external), file formats, VARIANT for semi-structured data, FLATTEN for JSON/Parquet arrays. Know the difference between COPY INTO a table (loading) and COPY INTO a location (unloading).

4. Data Sharing and Collaboration (~10–15%): Secure data sharing, data marketplace, reader accounts, direct share vs. listing. This gets more questions than you’d expect; it’s a Snowflake differentiator, so they emphasize it.

5. Security and Access Control (~15–20%): Role-based access control, DAC vs. MAC, column-level security, row access policies, dynamic data masking, network policies. This is heavily tested and people underestimate it.

The remaining questions are spread across: account administration, performance optimization (query profiling, clustering), Time Travel and Fail-safe, and Snowflake editions.

The 3-Week Study Plan

Week 1: Architecture and Core Concepts

Goal: Understand Snowflake’s architecture deeply enough to answer “why” questions, not just “what” questions.

What to study:

Snowflake Architecture documentation (the official docs on multi-cluster shared data architecture)
Virtual warehouses: sizes, credit consumption, auto-suspend behavior
Micro-partitioning: how Snowflake stores data, natural clustering vs. explicit clustering keys
The three caching layers: result cache (24hr, query-level), local SSD cache (warehouse-level), remote storage cache

A question you should be able to answer cold: A user runs the same query twice in 5 minutes. The second query returns instantly. Which cache is responsible?

Answer: Result cache — query results are cached for 24 hours if the underlying data hasn’t changed.

Time commitment: 45–60 min/day, 5 days

Week 2: Loading, Security, and Data Sharing

Goal: Know the operational details that appear as specific scenario questions.

Data Loading — what to know:

Stages: user stage (@~), table stage (@%tablename), named stage (@stagename)
File formats: CSV options that affect COPY INTO behavior (SKIP_HEADER, NULL_IF, EMPTY_FIELD_AS_NULL)
Semi-structured: VARIANT column, PARSE_JSON, FLATTEN, the : and :: operators

-- Know how to query nested JSON in VARIANT columns
SELECT
    src:event_type::string AS event_type,
    src:payload:trip_id::string AS trip_id,
    f.value::float AS delay_seconds
FROM transit_events,
LATERAL FLATTEN(input => src:delays) f

Security — what to know:

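RBAC hierarchy: ACCOUNTADMIN sits at the top and inherits both SECURITYADMIN and SYSADMIN; SECURITYADMIN inherits USERADMIN; every role inherits PUBLIC. ORGADMIN is a separate organization-level role, not part of this chain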
The difference between a role and a privilege
Column-level security (column masking policies) vs. row-level security (row access policies)
Network policies: IP allow/block lists at account and user level

Data Sharing — what to know:

A data share is a named object containing database objects to be shared
Consumers don’t copy data — they query the provider’s storage directly
Reader accounts: Snowflake-managed accounts for non-Snowflake customers
Data marketplace vs. private listing

Time commitment: 60 min/day, 5 days

Week 3: Performance, Time Travel, and Practice Exams

Goal: Solidify weaker areas, burn through practice questions, identify gaps.

Performance — what to know:

Query profiling in the web UI: which operators consume the most time
Clustering keys vs. natural clustering: when to add an explicit clustering key
Materialized views vs. regular views vs. dynamic tables
Query acceleration service: for large, irregular queries with partial scans

Time Travel and Fail-safe:

Time Travel: 0–90 days depending on edition (Standard = 1 day max, Enterprise = 90 days max)
AT and BEFORE clauses for querying historical data
UNDROP TABLE/SCHEMA/DATABASE within Time Travel window
Fail-safe: additional 7-day recovery period managed by Snowflake (not user-accessible)

The key distinction: Time Travel is for you. Fail-safe is for Snowflake. You can query Time Travel data yourself; you need to contact Snowflake Support to recover data from Fail-safe.
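
The two windows stack one after the other, which is easy to get wrong under exam pressure. The sketch below works through the date arithmetic for a permanent table. The function name is hypothetical, and it assumes the 7-day Fail-safe period described above (transient and temporary tables have no Fail-safe).

```python
from datetime import datetime, timedelta

FAIL_SAFE_DAYS = 7  # Fail-safe period for permanent tables


def recovery_path(changed_at: datetime, now: datetime, retention_days: int) -> str:
    """Classify how data changed or dropped at `changed_at` can be recovered at `now`."""
    time_travel_ends = changed_at + timedelta(days=retention_days)
    fail_safe_ends = time_travel_ends + timedelta(days=FAIL_SAFE_DAYS)
    if now <= time_travel_ends:
        return "time_travel"  # self-service: AT / BEFORE / UNDROP
    if now <= fail_safe_ends:
        return "fail_safe"  # recoverable only through Snowflake Support
    return "unrecoverable"


dropped = datetime(2025, 1, 1)
# Standard edition default: 1 day of Time Travel.
print(recovery_path(dropped, datetime(2025, 1, 1, 12), retention_days=1))
print(recovery_path(dropped, datetime(2025, 1, 5), retention_days=1))
print(recovery_path(dropped, datetime(2025, 1, 10), retention_days=1))
```

With a 1-day retention, Time Travel ends on January 2 and Fail-safe ends on January 9, so the three calls land in the three different categories.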

Time commitment: 30–45 min content, 30–45 min practice questions, 5 days

Resources: What to Use and What to Skip

Use These

Snowflake Official Documentation: The actual source of truth. For topics you don’t understand from practice questions, go here. Don’t try to read it end-to-end; use it as a reference.

Udemy: SnowPro Core Certification Course (Nikolai Schuler): This is the one paid resource worth buying. ~$15–20 on sale (they’re almost always on sale). The video explanations of the architecture layers and caching hierarchy are genuinely clearer than the official docs.

Snowflake’s Sample Questions (Official): Snowflake publishes sample questions on their certification page. Do these first to understand the question style.

ExamTopics SnowPro Core: Free, community-sourced practice questions. Quality varies; some questions are outdated or debated in the comments. Use for volume practice, but don’t treat every answer as authoritative.

Skip These

Any study guide that’s more than 18 months old — Snowflake moves fast and some features have changed significantly (Dynamic Tables, Cortex, data marketplace changes).

Video courses that are 20+ hours. You don’t have time, and the marginal value past 8–10 hours of video is low.

The Practice Test Worth Paying For

SnowPro Core Practice Exams by David Fradin on Udemy (~$15 on sale)

This is the one I used the week before the exam. 300+ questions across multiple practice exams, explanations for every answer, questions that are close to actual exam difficulty and style.

Don’t buy it to pass — buy it to identify gaps. If you’re getting 70–75% on these practice exams, you’re in the range to pass the real exam. If you’re under 65%, you have specific areas to shore up.

The 10 Concepts Most Likely to Appear

If I had to bet on what shows up on your exam, I’d put chips on these:

Caching layers — result cache vs. warehouse cache vs. storage cache, and what invalidates each
Time Travel retention periods — which edition gets how many days, and the AT/BEFORE syntax
Fail-safe — 7 days, Snowflake-managed, not user-accessible
Micro-partitioning — ~16MB compressed, automatic, metadata-driven pruning
Clustering keys — when to add them, what columns make good clustering keys, cost implications
COPY INTO options — ON_ERROR behavior (ABORT_STATEMENT, CONTINUE, SKIP_FILE), PURGE
Role hierarchy — which default role does what, USERADMIN vs. SYSADMIN responsibilities
Data sharing — share object model, consumer doesn’t copy data, reader accounts
Multi-cluster warehouses — scaling policy (Economy vs. Standard), when to use vs. larger single warehouse
VARIANT and FLATTEN — querying semi-structured data, the lateral flatten pattern

Exam Day Tips

Flag and move on. If you’re unsure, flag the question and keep going. The exam is 115 minutes for 100 questions — you have time to come back.

Eliminate obviously wrong answers. Snowflake exam questions often have two plausible answers and two clearly wrong ones. Getting to a 50/50 and guessing is better than burning 5 minutes trying to be certain.

Watch for edition-specific features. Many questions have “this feature requires X edition” as the distinguishing factor. Know that Dynamic Data Masking, Row Access Policies, and multi-cluster warehouses require Enterprise edition or higher.

Trust your first instinct on architecture questions. These have clear right answers rooted in how Snowflake actually works. If you’ve studied the architecture well, your gut is usually right.

After the Exam

The exam result is shown immediately after submission. If you pass, your badge is issued within 48 hours via Credly.

The certification is valid for 2 years, after which you can take a shorter renewal exam.

My Actual Study Schedule (Week by Week)

Week 1 (Architecture): Schuler Udemy course chapters 1–5 (architecture, virtual warehouses, storage). 1 hour/day on my commute or after class.

Week 2 (Operations): Schuler Udemy chapters 6–10 (loading, security, sharing). Official Snowflake docs for anything unclear.

Week 3 (Practice): Fradin practice exams. Every wrong answer → look up the official doc. 30 minutes of new content, 30 minutes of practice questions. Last 2 days: full practice exam under timed conditions.

Total hours: ~25–30 hours over 3 weeks.

Takeaway

The SnowPro Core is not a hard exam if you study the right things. The architecture and caching layers are the foundation — if you understand why Snowflake’s design choices exist, a lot of the specific feature behavior becomes logical rather than memorizable.

Don’t study everything. Study architecture deeply, security and data sharing thoroughly, and use practice exams to find your gaps in Week 3.

Good luck — and yes, the SnowPro Core is worth putting on your resume.

Apache Iceberg’s Hidden Superpower: Time-Travel Queries in Production

Mansi Bhadani — Tue, 31 Mar 2026 19:28:44 GMT

How I migrated 10TB of mortgage data from Snowflake to Iceberg on S3, what I gained, what broke, and why snapshot isolation changed how analysts work.

Most people hear “Apache Iceberg” and think: open table format, better than Parquet, replaces Hive. That’s true, but it undersells the feature that actually changed how the analysts I worked with queried data: time-travel.

Not the marketing version — “query historical data!” — but the production reality: snapshot isolation, zero-copy branching, and the ability to surgically rewind a table to any point in its history without ETL reruns or backup restores.

This is the story of migrating 10TB of mortgage risk data from Snowflake to Apache Iceberg on S3, what broke, what we gained, and the specific moment time-travel went from a demo feature to a workflow dependency.

Why We Left Snowflake for Iceberg

The mortgage risk pipeline ran batch feature engineering on 1M+ loan records using PySpark, producing outputs consumed by ML models for prepayment risk prediction. It lived in Snowflake, which worked fine — until it didn’t.

Three pain points drove the migration:

1. Storage costs at scale. Snowflake’s compressed columnar storage is efficient, but at 10TB with 12-month retention requirements, the cost was significant. S3 with Iceberg cut storage costs by 60% — the same data, same query performance, fraction of the price.

2. Vendor lock-in on the lakehouse layer. The ML team wanted to query the same tables directly from PySpark without routing through Snowflake’s Spark connector. Iceberg on S3 with the Iceberg REST catalog meant any engine (Spark, Trino, Athena, DuckDB) could query the same tables natively.

3. Schema evolution in production. Snowflake handles schema evolution, but not gracefully across external tools. Adding a new feature column to a 1M-row table in Snowflake meant a full table copy. Iceberg handles it with metadata-only operations.

How Iceberg Time-Travel Actually Works

Before getting into the migration story, it’s worth understanding the mechanics — because the “hidden superpower” only makes sense once you see what’s under the hood.

Every Iceberg write operation creates a new snapshot. A snapshot is an immutable record of the table state at that point in time — which files belong to the table, which have been deleted, what the schema was.

Table metadata pointer
  └── snapshot-004 (current)  ← latest write
        └── manifest-list
              ├── manifest-A (data-file-01.parquet, data-file-02.parquet)
              └── manifest-B (data-file-03.parquet)
  └── snapshot-003
  └── snapshot-002
  └── snapshot-001  ← table creation

Time-travel works by pointing your query at a historical snapshot instead of the current one. No data is copied, no backup is restored — you’re just reading from a different snapshot in the same metadata chain.

# Read the table as it existed at a specific timestamp.
# as-of-timestamp takes epoch milliseconds: 1754006400000 is 2025-08-01T00:00:00 UTC
df = spark.read \
    .option("as-of-timestamp", "1754006400000") \
    .format("iceberg") \
    .load("s3://my-bucket/mortgage-features")

# Or by snapshot ID
df = spark.read \
    .option("snapshot-id", 4847894586449951480) \
    .format("iceberg") \
    .load("s3://my-bucket/mortgage-features")

That’s it. No special infrastructure. Just a read option.
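
To show what "reading from a different snapshot" means, here is a toy Python sketch of resolving a timestamp to a snapshot. It is not Iceberg's implementation: the snapshot IDs and commit times are hypothetical, and the history is assumed to be linear. It also includes a helper that converts an ISO timestamp to epoch milliseconds, which is the form Iceberg's as-of-timestamp read option is documented to take.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from typing import List, Optional, Tuple


def to_epoch_millis(iso_timestamp: str) -> int:
    """Convert an ISO-8601 timestamp (treated as UTC if naive) to epoch milliseconds."""
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


def snapshot_as_of(
    snapshots: List[Tuple[int, int]], as_of_millis: int
) -> Optional[int]:
    """Return the ID of the latest snapshot committed at or before as_of_millis.

    snapshots is a list of (committed_at_millis, snapshot_id) sorted by commit time.
    """
    commit_times = [committed_at for committed_at, _ in snapshots]
    index = bisect_right(commit_times, as_of_millis)
    if index == 0:
        return None  # no snapshot existed yet at that time
    return snapshots[index - 1][1]


# Hypothetical snapshot log for illustration only.
snapshots = [
    (to_epoch_millis("2025-07-01T00:00:00"), 1001),
    (to_epoch_millis("2025-07-20T00:00:00"), 1002),
    (to_epoch_millis("2025-08-10T00:00:00"), 1003),
]
as_of = to_epoch_millis("2025-08-01T00:00:00")
print(as_of, snapshot_as_of(snapshots, as_of))
```

The query for August 1 resolves to snapshot 1002, the last commit before that date. Snapshot 1003 is invisible to it even though it is the table's current state.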

The Migration: What Broke

The migration itself was a full snapshot export from Snowflake and re-ingestion into Iceberg on S3. Here’s what we hit:

Problem 1: Timestamp Precision Mismatch

Snowflake stores TIMESTAMP_NTZ with nanosecond precision internally. When we exported to Parquet and re-read in Spark, timestamps were truncating to microseconds. Downstream models that used timestamp features as inputs started producing slightly different outputs.

Fix: Explicitly cast timestamps during migration:

from pyspark.sql.functions import col

# Cast explicitly so every run lands on Spark's microsecond-precision timestamp type
df = spark.read.parquet("s3://export-bucket/snowflake-dump/") \
    .withColumn("origination_date",
                col("origination_date").cast("timestamp"))

Problem 2: Null Handling in Partition Columns

Iceberg supports null values in partition columns; Hive-style partitioning does not. When we migrated a table partitioned by loan_state, the ~0.3% of records with null states caused silent failures — Spark wrote them to a __HIVE_DEFAULT_PARTITION__ path that Iceberg's reader didn't recognize.

Fix: Either filter nulls before writing, or use Iceberg’s native null partition handling:

# Write with Iceberg's partition spec that handles nulls correctly
spark.sql("""
    CREATE TABLE iceberg_catalog.mortgage.features
    USING iceberg
    PARTITIONED BY (loan_state)
    AS SELECT * FROM staging_table
""")

Problem 3: Snapshot Accumulation

Three weeks into production, metadata queries started slowing down. The reason: we were creating dozens of snapshots per day (append + overwrite operations) and never running the expire_snapshots procedure. The manifest list had grown to thousands of entries.

Fix: Schedule regular maintenance:

spark.sql("""
    CALL iceberg_catalog.system.expire_snapshots(
        table => 'mortgage.features',
        older_than => TIMESTAMP '2025-07-01 00:00:00',
        retain_last => 10
    )
""")
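
The interaction between older_than and retain_last is easy to misread, so here is a simplified Python model of which snapshots such a call would select. It assumes a linear history and hypothetical snapshot IDs. The real procedure also handles branches, tags, and file cleanup, which this sketch ignores.

```python
from typing import List, Tuple


def snapshots_to_expire(
    snapshots: List[Tuple[int, int]], older_than_millis: int, retain_last: int
) -> List[int]:
    """Pick snapshot IDs to expire from a linear history.

    snapshots is a list of (committed_at_millis, snapshot_id), oldest first.
    A snapshot expires only if it is older than the cutoff AND it is not
    among the most recent `retain_last` snapshots.
    """
    # Guard retain_last == 0, because snapshots[-0:] would select everything.
    protected = {sid for _, sid in snapshots[-retain_last:]} if retain_last > 0 else set()
    return [
        sid
        for committed_at, sid in snapshots
        if committed_at < older_than_millis and sid not in protected
    ]


# Hypothetical history: four snapshots committed at t=100..400.
history = [(100, 1), (200, 2), (300, 3), (400, 4)]
print(snapshots_to_expire(history, older_than_millis=350, retain_last=2))
print(snapshots_to_expire(history, older_than_millis=500, retain_last=3))
```

In the second call every snapshot is older than the cutoff, yet only the oldest one is selected, because retain_last keeps the three most recent regardless of age.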

The Moment Time-Travel Became a Workflow Dependency

Six weeks post-migration, the ML team ran a model refresh and noticed prepayment risk predictions had shifted significantly on a segment of loans. Not a model bug — the feature table had been updated with corrected LTV ratios from the data provider, and the old features were gone.

Before Iceberg: this would have required a backup restore or a full re-ingestion of the historical snapshot. With Iceberg:

# Find the snapshot before the LTV correction landed
snapshots_df = spark.sql("""
    SELECT snapshot_id, committed_at, operation 
    FROM iceberg_catalog.mortgage.features.snapshots
    ORDER BY committed_at DESC
""")

# Read features as they existed before the correction.
# as-of-timestamp takes epoch milliseconds: 1755151200000 is 2025-08-14T06:00:00 UTC
historical_features = spark.read \
    .option("as-of-timestamp", "1755151200000") \
    .format("iceberg") \
    .load("iceberg_catalog.mortgage.features")

The ML team could retrain against the exact feature set the original model had seen, reproduce the prediction delta, and validate the correction’s impact — all without involving the data engineering team or waiting for a restore.

That’s when time-travel stopped being a “nice to have” and became a first-class requirement.

Snapshot Isolation Changed How Analysts Queried

Beyond debugging, snapshot isolation changed the analytics workflow in a subtler way. When analysts ran long-running queries (sometimes 20–30 minutes for full-portfolio aggregations), they used to get inconsistent results if a background write job ran mid-query — a classic dirty read problem.

With Iceberg, every query runs against a snapshot. Once your query starts, the snapshot it’s reading is frozen. Background writes create new snapshots; your query never sees them.

-- This query will read a consistent snapshot even if 
-- a background job writes new data during execution
SELECT 
    loan_state,
    AVG(predicted_risk_score) as avg_risk,
    COUNT(*) as loan_count
FROM iceberg_catalog.mortgage.features
GROUP BY loan_state

No read locks. No query failures from concurrent writes. Consistent results every time.
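
The mechanism can be shown with a toy model in Python: every write produces a new immutable snapshot, and a reader keeps using the snapshot it pinned when it started. The class and file names below are hypothetical and far simpler than Iceberg's metadata tree; the sketch only illustrates why a concurrent write cannot change what a running query sees.

```python
from typing import Dict, List, Tuple


class ToyTable:
    """A toy table where every write produces a new immutable snapshot."""

    def __init__(self) -> None:
        self._snapshots: Dict[int, Tuple[str, ...]] = {0: ()}
        self.current_snapshot_id = 0

    def append(self, files: List[str]) -> int:
        """Commit a new snapshot that adds files; older snapshots stay untouched."""
        previous = self._snapshots[self.current_snapshot_id]
        self.current_snapshot_id += 1
        self._snapshots[self.current_snapshot_id] = previous + tuple(files)
        return self.current_snapshot_id

    def files_at(self, snapshot_id: int) -> Tuple[str, ...]:
        """Return the data files that belong to the given snapshot."""
        return self._snapshots[snapshot_id]


table = ToyTable()
table.append(["data-file-01.parquet", "data-file-02.parquet"])

# A long-running query pins the snapshot that is current when it starts.
pinned = table.current_snapshot_id

# A background job commits while the query is still running.
table.append(["data-file-03.parquet"])

files_seen_by_query = table.files_at(pinned)
files_seen_by_new_query = table.files_at(table.current_snapshot_id)
print(len(files_seen_by_query), len(files_seen_by_new_query))
```

The running query still sees two files, while a query started after the commit sees three. Neither needs a lock, because nothing either of them reads is ever modified in place.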

Schema Evolution Without Downtime

The other migration win worth highlighting: schema evolution.

Three times during the project, the feature pipeline needed new columns. In Snowflake, adding a column to a 1M-row table meant an ALTER TABLE that locked writes while Snowflake re-materialized the metadata. In practice: 2-5 minutes of pipeline downtime per schema change.

In Iceberg, adding a column is a metadata-only operation:

ALTER TABLE iceberg_catalog.mortgage.features
ADD COLUMNS (
    debt_to_income_ratio DOUBLE,
    appraisal_method STRING
)

Execution time: under a second. Old data files aren’t touched — they just return null for the new columns when read. New files include the columns. No downtime, no data copy.
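
A toy Python sketch of that read path: rows from files written before the change are projected onto the current schema, and any column they lack comes back as None. The function and sample rows are hypothetical. Iceberg tracks columns by field ID rather than by name, which this sketch skips for brevity.

```python
from typing import Any, Dict, List


def read_with_schema(
    rows: List[Dict[str, Any]], table_schema: List[str]
) -> List[Dict[str, Any]]:
    """Project stored rows onto the current schema, filling missing columns with None."""
    return [{column: row.get(column) for column in table_schema} for row in rows]


# Rows from a data file written before the ALTER TABLE.
old_file_rows = [{"loan_id": "L-1", "loan_state": "NJ"}]
# Rows from a data file written after the new column was added.
new_file_rows = [{"loan_id": "L-2", "loan_state": "NY", "debt_to_income_ratio": 0.35}]

current_schema = ["loan_id", "loan_state", "debt_to_income_ratio"]
result = read_with_schema(old_file_rows + new_file_rows, current_schema)
for row in result:
    print(row)
```

The old file is never rewritten. The schema change lives entirely in metadata, and the reader fills the gap at query time.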

What I’d Do Differently

A few things I’d change with hindsight:

Set a snapshot retention policy on day one. We didn’t, and the snapshot accumulation problem was entirely preventable.

Use the Iceberg REST catalog from the start. We initially used a Hadoop catalog, which required HDFS. Switching to the REST catalog mid-project was painful. REST catalog is engine-agnostic and trivially configurable.

Test the partition evolution path before you need it. Iceberg supports changing partition specs without rewriting data (partition evolution), but the behavior during the transition window is subtle. Test it in staging before you need it in production.

Takeaway

The 60% storage cost reduction was the reason we migrated. The schema evolution improvements were the expected win. But time-travel queries and snapshot isolation were the features that changed how the team actually worked with the data.

If you’re on Snowflake and considering a lakehouse migration, the cost story is often what drives the conversation. Don’t undersell the operational story — consistent reads, schema evolution without downtime, and the ability to surgically rewind a table are workflow improvements that compound over time.

Why I Chose Kafka Over Kinesis for the NJ Transit Real-Time Pipeline

Mansi Bhadani — Tue, 31 Mar 2026 19:21:17 GMT

A breakdown of the architectural decision — partition semantics, consumer group replay, and why Confluent’s Schema Registry was the deciding factor over AWS Kinesis.

When I started building the NJ Transit Smart Management System for my capstone at Pace University, one of the first real decisions I had to make was: Kafka or Kinesis?

Both are battle-tested streaming platforms. Both can handle millions of events per day. Both have managed cloud offerings. On paper, either would have worked. But the more I dug into the architecture requirements of a real-time transit pipeline, the clearer it became that Kafka — specifically Confluent Cloud — was the right call.

Here’s exactly why.

The Problem I Was Solving

The pipeline needed to ingest GTFS-RT (General Transit Feed Specification — Realtime) transit event data, process it through Spark Structured Streaming, apply a 5-point data quality gate, and land clean records into Snowflake — all within a sub-60 second latency window.

The design constraints that shaped the streaming decision:

Multiple independent consumers — Spark Streaming, a DQ monitoring service, and an alerting layer all needed to read the same events independently
Schema evolution — GTFS-RT feeds aren’t perfectly stable; fields get added, types shift
Consumer replay — when a Spark job fails mid-batch, I needed to reprocess from a known offset, not lose events
Local development parity — I needed to run the exact same stack on my laptop that runs in the cloud

Where Kinesis Falls Short

Kinesis is a great product if you’re already deep in the AWS ecosystem and need a managed, low-ops streaming bus. But three specific limitations ruled it out for this project.

1. Shard Semantics vs. Partition Semantics

Kinesis uses shards with a fixed capacity model: each shard handles 1 MB/s ingest and 2 MB/s reads. When you need more throughput, you split shards — but you can’t unsplit them cleanly, and the resharding process interrupts consumers.

Kafka partitions are far more flexible. You set the partition count upfront, and Kafka handles rebalancing across consumer group members automatically. For a transit pipeline where event volume varies significantly between rush hour and overnight, Kafka’s partition model gave me cleaner elasticity without manual intervention.
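
The fixed capacity model turns shard planning into simple arithmetic. The sketch below uses only the per-shard limits quoted above (1 MB/s in, 2 MB/s out) and hypothetical traffic numbers. It ignores other limits such as per-shard record counts, and it assumes consumers share the shard's read throughput rather than using enhanced fan-out.

```python
import math

SHARD_INGEST_MB_PER_S = 1.0  # per-shard write limit quoted above
SHARD_READ_MB_PER_S = 2.0  # per-shard shared read limit quoted above


def required_shards(ingest_mb_per_s: float, read_mb_per_s: float) -> int:
    """Smallest shard count that satisfies both the ingest and the read limit."""
    by_ingest = math.ceil(ingest_mb_per_s / SHARD_INGEST_MB_PER_S)
    by_read = math.ceil(read_mb_per_s / SHARD_READ_MB_PER_S)
    return max(1, by_ingest, by_read)


# Hypothetical rush-hour load: 2.5 MB/s in, read by 3 independent consumers.
ingest = 2.5
consumers = 3
print(required_shards(ingest, ingest * consumers))
```

With three consumers each reading the full stream, the read side (7.5 MB/s) rather than the ingest side determines the shard count. That is the kind of coupling between consumer count and capacity that Kafka's partition model avoids.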

2. Consumer Group Replay

This was the dealbreaker.

Kinesis retains data for 24 hours by default; extending that (up to 365 days) costs extra. More critically, Kinesis doesn’t support consumer group offset management the way Kafka does. Each shard iterator is stateless on the Kinesis side: you manage offsets yourself, or you use something like DynamoDB to track them.

Kafka’s consumer group protocol tracks committed offsets per group, per partition. When my Spark job crashed during a watermark calculation (it happened — twice), I could reset the consumer group offset to exactly the checkpoint before the failure and replay without writing a single line of custom offset management code.

# Reset a consumer group offset to a specific timestamp
kafka-consumer-groups.sh \
  --bootstrap-server "$BOOTSTRAP_SERVER" \
  --group spark-transit-consumer \
  --topic gtfs-rt-events \
  --reset-offsets \
  --to-datetime 2025-09-01T14:00:00.000 \
  --execute

Try doing that cleanly in Kinesis. You can’t — not without custom DynamoDB logic.
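
Conceptually, a timestamp-based reset looks up, for each partition, the earliest offset whose record timestamp is at or after the target time. The Python sketch below models that lookup on a hypothetical in-memory partition log. It assumes timestamps are non-decreasing within the partition, which real topics do not strictly guarantee, and the CLI may handle the no-match case differently.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def offset_for_timestamp(
    partition_log: List[Tuple[int, int]], target_millis: int
) -> Optional[int]:
    """Return the earliest offset whose timestamp is >= target_millis.

    partition_log is a list of (timestamp_millis, offset) in offset order.
    Returns None when every record is older than the target.
    """
    timestamps = [timestamp for timestamp, _ in partition_log]
    index = bisect_left(timestamps, target_millis)
    if index == len(partition_log):
        return None
    return partition_log[index][1]


# Hypothetical partition contents: (timestamp_millis, offset).
log = [(1000, 0), (2000, 1), (2000, 2), (3000, 3)]
print(offset_for_timestamp(log, 2000))
print(offset_for_timestamp(log, 2500))
```

After the reset, the consumer group resumes from the returned offset on each partition, so everything produced from the target time onward is replayed.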

3. No Native Schema Registry

Kinesis has no native schema registry. You’d need AWS Glue Schema Registry, which requires additional IAM plumbing and only supports Avro and JSON Schema (no Protobuf in all regions at the time of this project).

Confluent’s Schema Registry is tightly integrated with the Kafka ecosystem, supports Avro/JSON/Protobuf, and enforces compatibility modes (BACKWARD, FORWARD, FULL) so that a producer publishing a new schema version can’t silently break a downstream consumer.

Why Confluent Kafka Won

Schema Registry Was Non-Negotiable

GTFS-RT data is typed but not rigid. Over the course of the project, I encountered three cases where the feed added optional fields or changed enum values. With Schema Registry enforcing BACKWARD compatibility, I could evolve the Avro schema without touching my Spark deserialization code.

{
  "type": "record",
  "name": "TransitEvent",
  "fields": [
    {"name": "trip_id", "type": "string"},
    {"name": "route_id", "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "delay_seconds", "type": ["null", "int"], "default": null},
    {"name": "vehicle_id", "type": ["null", "string"], "default": null}
  ]
}

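New nullable fields with defaults are a backward-compatible change: consumers that upgrade to the new schema can still read older messages, because the default fills in the missing field. In Avro the change is forward compatible too, since consumers still on the old schema simply ignore the added field. This saved me from a class of bugs that would have been brutal to debug at 2 AM.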
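
The role of defaults can be sketched in a few lines of Python. This is a simplified model of Avro schema resolution, not the Avro library: the function name is hypothetical, records are plain dicts, and only the "missing field takes the reader's default" and "unknown field is ignored" rules are modeled.

```python
from typing import Any, Dict, List


def resolve_record(
    record: Dict[str, Any], reader_fields: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Project a decoded record onto the reader's schema, applying defaults."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]  # field absent in the writer's schema
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return resolved  # fields unknown to the reader are dropped


# Reader schema: the TransitEvent fields shown above (None stands in for null).
new_fields = [
    {"name": "trip_id", "type": "string"},
    {"name": "route_id", "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "delay_seconds", "type": ["null", "int"], "default": None},
    {"name": "vehicle_id", "type": ["null", "string"], "default": None},
]
old_fields = new_fields[:3]  # hypothetical earlier version without the optional fields

old_message = {"trip_id": "T1", "route_id": "R7", "timestamp": 1756735200000}
new_message = dict(old_message, delay_seconds=120, vehicle_id="V42")

upgraded_reader_view = resolve_record(old_message, new_fields)
old_reader_view = resolve_record(new_message, old_fields)
print(upgraded_reader_view)
print(old_reader_view)
```

Both directions work only because the added fields carry defaults. Adding a required field without a default would make resolve_record raise for every older message, which is the class of change a BACKWARD compatibility check is meant to reject at registration time.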

Docker Parity

Running Confluent locally is one docker-compose.yml away:

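services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  broker:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Two listeners: one for containers on the compose network, one for the host
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      # Single broker, so internal topics cannot use the default replication factor of 3
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  schema-registry:
    image: confluentinc/cp-schema-registry:7.5.0
    depends_on: [broker]
    ports:
      - "8081:8081"
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      # Reach the broker over the compose network, not via localhost
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: broker:29092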

Local dev was identical to Confluent Cloud. No “works on my machine” surprises when deploying.

Consumer Groups at Scale

The pipeline ultimately had three independent consumer groups:

Consumer Group | Purpose
spark-transit-consumer | Main Spark Structured Streaming job
dq-monitor-consumer | Real-time data quality checks
alert-consumer | Delay threshold alerting

Each reads the same topic at its own pace, maintains its own offset, and can replay independently. Kinesis’s enhanced fan-out gets close to this, but the per-shard $0.015/hour cost would have added up across 3 consumers reading a high-volume topic.
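
A back-of-the-envelope version of that cost argument, as a Python sketch. The $0.015 rate is the figure quoted above; the shard count and the 720-hour month are hypothetical inputs, and the estimate leaves out any per-GB data retrieval charges. Check current AWS pricing before relying on the number.

```python
def fan_out_hourly_charges(
    shards: int,
    consumers: int,
    hours: float,
    rate_per_consumer_shard_hour: float = 0.015,  # figure quoted in this article
) -> float:
    """Estimate the consumer-shard-hour portion of an enhanced fan-out bill."""
    return shards * consumers * hours * rate_per_consumer_shard_hour


# Hypothetical setup: 4 shards, the 3 consumer groups above, a 720-hour month.
monthly = fan_out_hourly_charges(shards=4, consumers=3, hours=720)
print(round(monthly, 2))
```

The charge scales with shards times consumers, so adding a fourth consumer group or doubling the shard count for rush hour multiplies the bill rather than adding to it.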

The Numbers

The pipeline processed over 2 million simulated transit events per day with sub-60 second end-to-end latency. During load testing, consumer lag stayed under 500ms at peak throughput.

Kinesis could have handled the throughput. But it couldn’t have given me schema evolution safety, clean consumer replay, or Docker-local parity at the same operational cost.

When I’d Choose Kinesis Instead

To be fair: Kinesis makes sense when:

You’re fully AWS-native and want zero additional managed services
Your consumers are Lambda functions triggered by Kinesis (the integration is tight and cheap)
You don’t need consumer group semantics and are okay managing offsets in DynamoDB
Schema evolution isn’t a concern or you’re using Glue Schema Registry already

For greenfield projects where the team is already deep in AWS and streaming requirements are simple, Kinesis is a totally reasonable choice. It’s not the wrong tool — it just wasn’t the right tool for this pipeline.

Takeaway

The decision wasn’t about Kafka being “better” in some abstract sense. It was about three concrete requirements:

Partition-level consumer groups with managed offsets → Kafka wins
Schema evolution with backward compatibility enforcement → Confluent Schema Registry wins
Local dev/prod parity without cloud costs → Docker Compose wins

If you’re building a multi-consumer streaming pipeline where schema stability matters and you need reliable replay, reach for Kafka. You’ll thank yourself when the first schema change rolls in at week three.