Striim - Medium

Beyond Materialized Views: Using DuckDB for In-Process Columnar Caching

John Kutay — Wed, 02 Apr 2025 17:35:33 GMT

Written by John Kutay (Director of Product & Engineering at Striim) and Kim Dang (Senior Software Engineer at Striim)

This post covers using DuckDB as the operational analytics store for the control plane of Striim Developer, a serverless Stream Processing and Change Data Capture service. By moving analytical queries from PostgreSQL to an in-process cache in DuckDB, we measured a 5–10x performance improvement with zero additional infrastructure or cost.

In this post, we explore how DuckDB was integrated as the analytical backend for the control plane of Striim Developer — a serverless Stream Processing and Change Data Capture (CDC) service. The Striim Developer Control Plane handles critical tasks such as managing tenant assignments, monitoring free-tier usage thresholds, orchestrating asynchronous jobs, and triggering operational alerts. Initially, this control plane relied on direct queries to PostgreSQL, which serves as the metadata repository. However, this design presented challenges in maintaining performance and meeting SLAs without inflating the cost of offering a free-tier service.

For example, when usage exceeds capacity for an individual tenant, we need to alert and orchestrate a response immediately. Yet our previous implementation, based on periodic analytical jobs, was constrained to 15–20-minute intervals, and running it more often would have required PostgreSQL query optimization and materialized views. This latency was a significant bottleneck for operational analytics in a serverless platform that needs near real-time responses.

After evaluating alternatives, we adopted DuckDB as an in-process, columnar caching layer. This decision was driven by DuckDB’s balance of performance, simplicity, and minimal overhead, addressing our need for high-frequency analytics while offloading PostgreSQL. In this post, we detail the challenges, implementation, and benefits of this architecture and discuss whether this approach qualifies as Hybrid Transactional Analytical Processing (spoiler: it’s not, and I’ll explain why).

The Challenge: Optimizing Operational Analytical Workloads

The Striim Developer control plane frequently queries PostgreSQL to perform:

Aggregated usage calculations
Threshold monitoring for user and tenant consumption.
Operational alerting
Asynchronous calls to cloud services that trigger workflows (in-product messages, cloud resource management)

These queries are analytical: they scan and join usage data normalized across many tables, which leads to slower execution times and increased load on our PostgreSQL instance. However, how often we could run them was bottlenecked by disk-bound reads in PostgreSQL. We need these analytical jobs to run as frequently as possible so we can respond quickly to scenarios like resource over-utilization or exceeded usage thresholds.

Database Application Behind Striim Developer

A Word About Buffer Caches

Before we went down this rabbit hole, we of course considered the stark reality that database buffer caches exist. Our AWS-managed database also has some higher-level interfaces for managing the buffer cache that add some convenience for DBAs.

PostgreSQL’s buffer cache can improve query performance by storing frequently accessed data pages in memory, minimizing disk I/O. While tuning shared_buffers and leveraging tools like pg_prewarm could help keep critical data in cache, this approach is inherently reactive. The cache is populated dynamically based on usage patterns, which means frequent refreshes or large scans may still involve costly disk reads. Additionally, tuning buffer sizes in a multi-tenant environment adds complexity and can strain memory resources shared with other PostgreSQL processes.

It’s also important to note that the Striim application is the software most frequently accessing the PostgreSQL metadata repository with frequent, write-heavy operations that naturally dominate the PostgreSQL buffer cache. It handles critical transactional operations like updating metering metrics and managing the Striim component DDL as streaming applications are implemented by users. This competition for cache space means analytical queries, such as aggregations on usage data, are less likely to benefit from cached pages, resulting in increased disk I/O and slower performance. Given this dynamic, relying solely on PostgreSQL’s buffer cache to optimize both OLTP and OLAP workloads would have been inefficient, reinforcing our decision to offload analytical queries to DuckDB. By decoupling these workloads, we ensured that transactional operations in PostgreSQL remain fast while analytical workloads benefit from DuckDB’s optimized in-memory columnar engine.

Why Not Materialized Views?

Materialized views in PostgreSQL provide precomputed query results, and we considered them for aggregating our usage data. However, refreshing MVs frequently (e.g., every minute for usage data) imposes significant write overhead on the database. Even with REFRESH MATERIALIZED VIEW CONCURRENTLY, the performance cost grows as data scales. Managing MVs for our multi-tenant architecture, with differing refresh cadences for namespaces, applications, and usage data, would have added operational burden.

We also had to weigh the following factors:

Analytical Queries: Our queries involve dynamic aggregations, joins and filters on frequently changing usage data, which are cumbersome to maintain with materialized views. Our team prefers working with imperative languages for cache maintenance logic, thus maintaining the views in PostgreSQL was not ideal.
Infrastructure Overhead: Refreshing materialized views for frequently updated data (e.g., usage metrics) every minute would increase PostgreSQL’s load rather than alleviate it. PostgreSQL read replicas would add more complexity and infrastructure overhead without obvious performance benefits.
Limited Flexibility: PostgreSQL materialized views speed up complex queries but require manual refreshes, lacking incremental updates for frequent changes.

The ideal solution for us would require no additional infrastructure (replicas, database cores, storage) at the current size of the workload while allowing us to tightly control the cache maintenance logic in Python — the language we use in our Control Plane.

Our Solution: DuckDB as an OLAP Cache with PostgreSQL Extension

DuckDB, a lightweight, in-process SQL analytics engine, provided the perfect middle ground between raw PostgreSQL queries and materialized views. Here’s why:

High Performance for Analytical Workloads: DuckDB is automatically optimized for analytical and columnar processing, offering built-in performance gains for our use case.
Simple Integration with Python: Its seamless embedded Python API made it easy to implement into our existing control logic without additional infrastructure.
Dynamic Caching: We cache infrequently changing data (e.g., users, tenants) every 24 hours and refresh frequently changing data (e.g., usage metrics) every minute. DuckDB makes this easy to manage with minimal code.
PostgreSQL Extension: DuckDB’s PostgreSQL extension lets us run queries against PostgreSQL and fetch the results from within the DuckDB runtime.

DuckDB’s in-process instantiation made this insanely easy to implement in the control plane application…

Instantiating DuckDB In-Memory
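
As a minimal sketch of what in-process instantiation can look like, assuming the duckdb Python package and DuckDB's postgres extension: the connection string, the pg alias, and the helper names attach_statements and open_cache are hypothetical illustrations, not our production code.

```python
from typing import List


def attach_statements(pg_dsn: str, alias: str = "pg") -> List[str]:
    """SQL that loads DuckDB's postgres extension and attaches PostgreSQL read-only."""
    return [
        "INSTALL postgres",
        "LOAD postgres",
        f"ATTACH '{pg_dsn}' AS {alias} (TYPE postgres, READ_ONLY)",
    ]


def open_cache(pg_dsn: str):
    """Create an in-memory DuckDB database inside the current Python process."""
    import duckdb  # third-party dependency: pip install duckdb

    con = duckdb.connect(":memory:")  # no server process, no files on disk
    for stmt in attach_statements(pg_dsn):
        con.execute(stmt)
    return con
```

With a reachable PostgreSQL instance, something like con = open_cache("dbname=metadata host=localhost") followed by con.execute("CREATE OR REPLACE TABLE usage_metrics AS SELECT * FROM pg.public.usage_metrics") would copy a table into DuckDB's columnar storage; the database and table names here are placeholders.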

Implementation Overview

Data Refresh Pipeline:

Daily Refresh: Less frequently updated metadata loaded into DuckDB every 24 hours.
Minute-Level Refresh: Frequently queried usage metrics are updated every minute.

Query Execution:

Operational tasks and alerts query the DuckDB cache for aggregated results.
Async cloud operations work with fresh, fast data.
PostgreSQL remains the source of truth, used only for updates and less frequent transactional queries.

The architecture ensures that DuckDB serves as a high-speed cache for read-heavy operations, while PostgreSQL is offloaded from expensive analytical workloads.
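
The two refresh cadences above can be sketched as a small scheduler step. This is an illustration under assumptions rather than our actual controller code: CachedDataset, refresh_due, and the dataset names are hypothetical, and each load callable stands in for a query that pulls rows from PostgreSQL into DuckDB.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CachedDataset:
    name: str
    refresh_every_s: float                # refresh cadence in seconds
    load: Callable[[], None]              # reloads this dataset into the cache
    last_refresh: Optional[float] = None  # None means never loaded


def refresh_due(datasets: List[CachedDataset], now: float) -> List[str]:
    """Reload every dataset whose cadence has elapsed; return the names refreshed."""
    refreshed = []
    for ds in datasets:
        if ds.last_refresh is None or now - ds.last_refresh >= ds.refresh_every_s:
            ds.load()
            ds.last_refresh = now
            refreshed.append(ds.name)
    return refreshed


# Example: slow-changing metadata daily, usage metrics every minute.
calls: List[str] = []
datasets = [
    CachedDataset("tenants", 24 * 3600, lambda: calls.append("tenants")),
    CachedDataset("usage_metrics", 60, lambda: calls.append("usage_metrics")),
]
refresh_due(datasets, now=0.0)   # first pass loads both datasets
refresh_due(datasets, now=60.0)  # one minute later only usage_metrics is due
```

In a long-running process, the now argument would typically come from time.monotonic(), with refresh_due called from a periodic loop or async task.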

Performance Benchmarks

Our performance benchmarks highlight the substantial improvements gained by integrating DuckDB into our architecture. A Striim Controller instance operates on 4 vCPUs and 7 GB of RAM, emphasizing efficiency in constrained environments. Below are the results from the latest test runs:

Results with DuckDB Caching:

Throughput of jobs: Improved from 3.95 to 11.71 tasks/sec (cached + PostgreSQL replication) and up to 11.71 tasks/sec (full analytical query caching).
Average Latency: Consistent at 0.19–0.2 seconds per task.
Memory Usage: Stabilized at ~141 MB (cached namespace scenario) and ~120 MB (full analytical query caching).
Execution Time: Reduced to as low as 0.8 seconds per job.

To ensure the accuracy and consistency of our benchmarks, we conducted multiple runs for each configuration (no caching, caching data with SQLAlchemy into native Python data structures, and caching with DuckDB), with explicit cache warming captured by mocking multiple runs from cold-cache to hot-cache scenarios. By allowing the cache to warm up, we ensured that frequently accessed data was preloaded, highlighting the true impact of caching mechanisms on throughput, execution time, and memory stabilization under high-concurrency workloads. This approach provided a clear view of how caching progressively improves performance in real-world scenarios, and showed that DuckDB provides high-performance analytical queries out of the box.
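
A cold-to-hot measurement loop of this kind can be sketched as follows. This is a simplified, sequential harness written for illustration; the benchmark function and its result keys are hypothetical, and it does not reproduce the concurrency of our actual test runs.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def benchmark(job: Callable[[], None], runs: int) -> Dict[str, float]:
    """Run `job` repeatedly and report latency and throughput.

    The first run is the cold-cache case; later runs hit a warmed cache.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    latencies: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        job()
        latencies.append(time.perf_counter() - start)
    total = sum(latencies)
    return {
        "cold_latency_s": latencies[0],
        "warm_avg_latency_s": mean(latencies[1:]) if runs > 1 else latencies[0],
        "throughput_per_s": runs / total if total > 0 else float("inf"),
    }
```

Running the same harness once per configuration (no cache, native Python structures, DuckDB) gives numbers that can be compared like for like.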

Key Takeaways:

5–10x Execution Time Reduction: Analytical workloads experienced dramatic speed-ups without impacting transactional performance.
High-Throughput Performance: Throughput improved significantly for tasks, maintaining low latency.
Zero Additional Infrastructure Costs: DuckDB runs in-process, avoiding the need for more servers or storage.
DuckDB-based caching with the PostgreSQL extension was faster than the native Python solution.

This setup is designed for the price-performance balance that makes our Striim Developer offering scalable while remaining cost-effective. While scaling CPU and RAM might have solved the problem, this approach aligns with our mission to deliver generous free-tier usage without compromising on performance.

Memory management

While performance improved by 5–10x, we need to account for the increased memory usage and how that may scale over time.

Memory Management & Cache Eviction Strategy

We will monitor and log our cache hit rate over time while managing a simple heuristic based memory management process. Rather than using fixed memory limits, we designed a percentage-based approach where the cache can use up to 75% of system RAM total, with individual caches limited to 25%. This provides flexibility across different deployment environments while preventing memory exhaustion. We designed an LRU (Least Recently Used) eviction strategy that automatically removes older entries when memory limits are approached, along with a fallback mechanism to persist data to disk if needed. The cache monitors system memory usage and can trigger eviction or disk fallback (DuckDB PG extension queries) automatically.
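
The percentage-based budgets and LRU eviction described above can be sketched like this. It is a simplified illustration under assumptions: LruCache and budgets are hypothetical names, entry sizes are supplied by the caller rather than measured, and the disk fallback is reduced to returning None so the caller can query PostgreSQL instead.

```python
from collections import OrderedDict
from typing import Any, Optional, Tuple


def budgets(system_ram_bytes: int) -> Tuple[int, int]:
    """Return (total cache budget, per-cache budget): 75% and 25% of system RAM."""
    return system_ram_bytes * 75 // 100, system_ram_bytes * 25 // 100


class LruCache:
    """Byte-budgeted cache that evicts the least recently used entries first."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self._entries: "OrderedDict[str, Tuple[Any, int]]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._entries:
            return None  # cache miss: caller falls back to the source of truth
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def put(self, key: str, value: Any, size_bytes: int) -> None:
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        self._entries[key] = (value, size_bytes)
        self.used_bytes += size_bytes
        # Evict oldest entries until under budget; the newest entry is always kept.
        while self.used_bytes > self.budget_bytes and len(self._entries) > 1:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size
```

For a controller with 8 GiB of RAM, budgets(8 * 1024**3) yields a 6 GiB total budget and a 2 GiB limit per individual cache.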

Why DuckDB Cache is Superior

Compared to materialized views or other caching alternatives (e.g., Redis, Native Python Data Structures), DuckDB offers:

Performance Gains Without Complexity:

No additional servers, clusters, or infrastructure were required.
No need to change or provision additional components in PostgreSQL.
All cache maintenance logic is implemented in Python.

Flexibility and Simplicity:

Easy to define and refresh datasets based on different refresh cadences (e.g., 24 hours for static data, 1 minute for dynamic data).
Dynamic and on-the-fly SQL queries supported without materialized view limitations.
SQL interface for aggregations and analytical queries

Cost-Effective Scaling:

DuckDB runs within the existing Python application with optimized query execution and memory management, avoiding the need for external caching systems or compute resources at the size of the current workload.

So… is this HTAP?

While this approach leverages DuckDB’s fast analytical performance and offers many benefits that you would get from an HTAP system, it doesn’t qualify as full HTAP in my mind. HTAP systems unify OLTP and OLAP workloads within a single platform, providing real-time analytics on live transactional data, an example being Oracle’s embedded column store. In our case, PostgreSQL handles transactions, while DuckDB serves as a pluggable, in-memory column store for read-heavy analytical queries. The DuckDB PostgreSQL extension and our controller logic glue OLTP and OLAP together by copying the data. This separation allows us to periodically refresh cached data without adding complexity or infrastructure.

Source: Oracle

Hybrid transactional/analytical processing (HTAP) architectures integrate OLTP (transactional) and OLAP (analytical) workloads in a single system, eliminating the need for ETL pipelines, reducing data duplication, and offering real-time consistency. Oracle’s In-Memory Column Store is a sophisticated example of HTAP, maintaining dual-format storage — data resides simultaneously in a row format (optimized for OLTP) in the buffer cache and in a columnar format (optimized for OLAP) in memory. Oracle ensures transactional consistency by synchronizing updates between the row and column formats using metadata and transaction journals. This allows analytic and transactional queries to run on the same dataset seamlessly, without duplication or delays.

Our DuckDB PostgreSQL extension approach, while not HTAP, delivers basic hybrid OLAP functionality by enabling analytical queries to run in a high-performance columnar format, separate from the OLTP workload in PostgreSQL. Unlike Oracle’s dual-format architecture, this solution doesn’t maintain real-time consistency or simultaneous access to row and columnar data. However, it provides us with a lightweight, pluggable option for analytical workloads with minimal setup. Given these are analytical operational queries, we can live without transactional consistency. DuckDB’s columnar processing and in-memory execution make it ideal for scenarios where transactional workloads can tolerate slightly delayed analytics without requiring a full HTAP system.

DuckDB works great as a modular, lightweight columnar engine that complements any database, delivering OLAP-like performance without needing a full HTAP database.

Scaling Considerations

While our DuckDB caching layer delivers excellent performance for individual clusters, you might wonder how we’ll scale as Striim Developer adoption grows. The answer lies in horizontal partitioning — we’ll distribute users across multiple independent Striim clusters, each with its own PostgreSQL database and controller instance running our DuckDB cache.

We leverage Striim Deployment Groups (DG) to create separation between logical and physical resources. A Deployment Group represents some unit of compute that a Striim pipeline can be deployed to on one or multiple nodes. That unit can be a single server, a shared pool, or a dedicated multi-node cluster of EC2 instances in AWS. Each user has an assigned DG that maps to some physical resource. The physical resources can be scaled up vertically or horizontally, but the DG’s association to these resources remains the same. When a Striim app is deployed to a DG, it will take whatever compute is associated with it.

A global routing layer using consistent hashing will direct users to their assigned cluster based on their tenant ID, with the ability to dynamically add new clusters as needed. This approach is particularly clean because Striim Developer users operate independently, meaning there’s no cross-user data sharing or multi-tenant querying that would complicate the architecture. The beauty of this design is that our DuckDB caching layer requires no modifications — it continues to work as designed within each cluster’s scope, while we achieve horizontal scale through simple user partitioning.
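
A routing layer based on consistent hashing can be sketched as follows. This is an illustrative sketch, not the routing layer itself: ClusterRing, the cluster names, the SHA-256 hash, and the choice of 64 virtual nodes per cluster are all assumptions.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(key: str) -> int:
    """Stable 64-bit hash (unlike Python's built-in hash, which varies per process)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class ClusterRing:
    """Consistent-hash ring that maps tenant IDs to clusters."""

    def __init__(self, clusters: List[str], vnodes: int = 64) -> None:
        self.vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (hash, cluster) pairs
        for cluster in clusters:
            self.add_cluster(cluster)

    def add_cluster(self, cluster: str) -> None:
        # Several virtual nodes per cluster spread tenants more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{cluster}#{i}"), cluster))

    def cluster_for(self, tenant_id: str) -> str:
        if not self._ring:
            raise ValueError("no clusters registered")
        # First ring position at or after the tenant's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(tenant_id), ""))
        return self._ring[idx % len(self._ring)][1]
```

The useful property is that adding a cluster only moves tenants onto the new cluster; every other tenant keeps its existing assignment, so existing per-cluster caches stay valid.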

Conclusion

By adopting DuckDB as a caching layer for our PostgreSQL database, we achieved a substantial improvement in performance while simplifying our architecture. This approach highlights the power of embedding lightweight, high-performance solutions like DuckDB into existing systems. For anyone dealing with analytical workloads on operational data, DuckDB offers an elegant, cost-effective alternative to materialized views or external caching systems.

The best part of this? It was easy to implement thanks to DuckDB’s simplicity. Zero infrastructure changes. Zero additional spend. Just some well-written Python and the DuckDB PostgreSQL extension.

These types of performance improvements allow us to provide Striim Developer as a free, serverless experience for data engineers to explore Striim’s data movement and stream processing capabilities.

Beyond Materialized Views: Using DuckDB for In-Process Columnar Caching was originally published in Striim on Medium.

Microsoft Fabric Open Mirroring — Mirror once query anywhere — Striim

karthikeyan G — Sun, 30 Mar 2025 23:58:19 GMT

Open Mirroring with Microsoft Fabric: Mirror Once, Query Anywhere with Striim

Exploring Microsoft Fabric Open Mirroring: Striim’s SQL2Fabric-Mirroring Integration, Use Cases, and Data Access and Interoperability across DuckDB, Databricks, Snowflake, Spark Engines, and More

∘ Audience
∘ How Microsoft Fabric Open Mirroring Works
∘ What is the Cost Structure?
∘ What Are the Pain Points?
∘ Open mirroring — powered by Striim
∘ Use Cases after mirroring
∘ Access data within fabric ecosystem
∘ Access from Snowflake
∘ Access from Databricks
∘ Access from Spark Engines
∘ Access from DuckDB
∘ Access from Google BigQuery
∘ Conclusion :

Audience

This blog is aimed at individuals with interest in the Microsoft Fabric ecosystem and general Data Engineering experience.

How Microsoft Fabric Open Mirroring Works

Microsoft Fabric Open Mirroring is a powerful feature that allows organizations to seamlessly replicate change data from multiple sources into OneLake. This functionality enables client applications to write both change data and initial load data from various source systems — such as databases, data warehouses, or files — directly into a designated Landing Zone in Parquet file format. Once the data is stored in Parquet, the Fabric Replicator Service processes the files and converts them into DeltaLake format within OneLake. These Delta tables are then readily accessible by applications such as Power BI, SQL Analytics, and other query engines that work with OneLake.

Designed with interoperability and scalability in mind, Open Mirroring supports data replication from a wide range of source types, including:

OLTP sources (e.g., Oracle, PostgreSQL)
OLAP sources (e.g., Snowflake, BigQuery, Databricks)
NoSQL sources (e.g., MongoDB, CosmosDB, Couchbase)
Files and other unstructured data sources

Additionally, Open Mirroring includes features to handle schema evolution, which is essential for maintaining data integrity and consistency. However, managing certain types of schema changes can be complex.

For detailed specifications on handling Change Data and Schema Evolution, please refer to the documentation [here].

What is the Cost Structure?

Microsoft Fabric offers free storage for mirrored data in OneLake based on the capacity units in use. For example, with an F2 capacity, users receive 2 TB of free storage in OneLake. If data usage exceeds this limit, additional charges will apply for the excess storage.

The compute resources involved in replicating data from the Landing Zone to Delta tables are provided by Fabric, and the compute is separate from any user-purchased resources. As a result, Mirroring is generally a low-cost approach to replication.

What Are the Pain Points?

Organizations often operate multiple storage systems for different use cases. For instance, they may have one system for live applications (often an OLTP or NoSQL system) and another for storing historical data used in analytics and reporting (such as data warehouses, data lakes, or lakehouse architectures). Streaming Change Data Capture (CDC) and initial load data from these diverse storage systems can be challenging.

Furthermore, managing schema evolution and propagating schema changes to the Landing Zone can be particularly complex for certain types of schema changes.

Another key challenge is maintaining data type integrity between source and target systems. The source data types must be compatible with Avro and Parquet format’s data types, as the Fabric Replicator Service uses the schema of Parquet files to create Delta Tables.

Finally, handling data recovery and deduplication in the event of a pipeline failure is a critical concern and can be complex to address.

To resolve these challenges, and to let users sit back and relax while the pipeline runs, we need a reliable and robust system that handles them automatically.

Open mirroring — powered by Striim

Striim’s SQL2Fabric-Mirroring solution offers automated data pipelines that enable the seamless streaming of real-time changes from various data sources to OneLake’s Landing Zone. The platform efficiently handles:

Initial load replication
Change Data Capture
Schema evolution and changes
Data type integrity
Pipeline failure recovery

Through its partnership with Microsoft (check here), Striim has introduced SQL2Fabric X - Mirroring, a fully managed SaaS solution that provides automated data pipelines to mirror both initial load and Change Data Capture (CDC) streaming from SQL Server (whether on-premises, on RDS, or hosted on MS Azure or GCP instances) to Fabric OneLake and Azure Databricks.

The pipeline is engineered to handle large data loads, manage schema evolution, and ensure robust recovery in the event of pipeline failures.

SQL2Fabric X can also be launched directly from the Fabric UI.

Striim Automated Pipelines in Microsoft Fabric

Now let us set up a pipeline in minutes:

https://medium.com/media/1863ca0ba16bee7b3778f04ab6637324/href

Azure Marketplace : https://azuremarketplace.microsoft.com/en-us/marketplace/apps/striim.sql2fabric-mirroring?tab=Overview

Microsoft — Striim document : https://learn.microsoft.com/en-us/fabric/database/mirrored-database/open-mirroring-partners-ecosystem#striim

Use Cases after mirroring

One of the most exciting benefits of OneLake is its ability to mitigate vendor lock-in, especially in the context of data access and integration. The Delta tables stored within the Lakehouse container are now accessible across multiple query engines that support the Delta Lake format. This cross-vendor accessibility not only enables greater flexibility but also empowers organizations to leverage the strengths of different platforms without the need to duplicate data.

By eliminating the need to copy data between systems, organizations can significantly reduce both storage and reverse ETL costs. This seamless interoperability facilitates more efficient data workflows, enhances scalability, and allows businesses to implement vendor-specific features without being constrained by the limitations of any single provider.

Below are a few POCs that demonstrate the accessibility of mirrored Delta Tables across popular querying engines:

Access data within fabric ecosystem

Data mirrored in OneLake is seamlessly accessible by applications and query engines within the Fabric ecosystem. This enables users to leverage a wide range of powerful tools for Business Intelligence, Reporting, Data Science, Machine Learning, and Monitoring to gain insights and drive decision-making using the data.

Access from Snowflake

Snowflake has introduced a public preview feature that allows users to create Iceberg tables from Delta Lake files stored in external storage. This enables users to access Delta tables in OneLake as Iceberg tables in Snowflake, though these tables are read-only.

To integrate Delta Lake tables with Snowflake, follow these steps:

Step 1: Create a shortcut in Fabric LakeHouse.

Navigate to Fabric LakeHouse
Go to the LakeHouse page in the Fabric UI, right-click the Files section, and select New Shortcut.
Choose Onelake as the Source
Under the Internal Sources category, click Onelake. This will display a list of available data entities in your Fabric ecosystem.
Select the Mirrored Delta Tables
Pick the mirrored database created by Striim and click Next.
Choose the Tables for Snowflake Access
In the Tables section, select the specific Delta tables you wish to access in Snowflake and click Next.
Create the Shortcut
Review the selections and click Create to establish the shortcut.

This will create a shortcut in the Files section of your Fabric LakeHouse, allowing Snowflake to access the Delta tables. It’s important to note that this process does not copy the data; it simply references the original Delta files.

Step 2 : Configure access requirements for Snowflake

Create an external volume pointing to the onelake container.

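-- <workspace> and <lakehouse> are placeholders for your Fabric workspace and lakehouse names;
-- <tenant_id> is a placeholder for the tenant ID of your Fabric account.
CREATE OR REPLACE EXTERNAL VOLUME FabricExVol4
STORAGE_LOCATIONS =
(
  (
    NAME = 'Name your volume'
    STORAGE_PROVIDER = 'AZURE'
    STORAGE_BASE_URL = 'azure://onelake.dfs.fabric.microsoft.com/<workspace>/<lakehouse>.Lakehouse/Files/'
    AZURE_TENANT_ID = '<tenant_id>'
  )
);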

Follow this document from Step 2 through point number 3 (until you provide the consent).
Navigate to the workspace in the Fabric portal where you created the lakehouse, click Manage Access, and add the AZURE_MULTI_TENANT_APP_NAME as a Contributor to the workspace.
You have now configured the access requirements for Snowflake.

Step 3 : create a catalog integration

-- <catalog_integration_name> is a placeholder; choose your own name.
CREATE OR REPLACE CATALOG INTEGRATION <catalog_integration_name>
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = DELTA
  ENABLED = TRUE;

Step 4 : Create a Table

BaseLocationPath is the path of the shortcut relative to the Files folder in the lakehouse. In my case the complete path of the shortcut is “https://onelake.dfs.fabric.microsoft.com/eastusspace/DLH.Lakehouse/Files/dbo_ORDER_LIST”, so the BaseLocationPath is “dbo_ORDER_LIST”.

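-- Placeholders: the catalog integration and external volume created in the
-- previous steps, and the BaseLocationPath of the shortcut.
CREATE OR REPLACE ICEBERG TABLE my_delta_iceberg_table
CATALOG = '<catalog_integration_name>'
EXTERNAL_VOLUME = '<external_volume_name>'
BASE_LOCATION = '<BaseLocationPath>';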

Step 5 : Query and refresh the table.

The table has to be refreshed for getting the latest snapshot as the data resides outside snowflake.
Or schedule a task to refresh tables automatically.

select count(*) from my_delta_iceberg_table;

alter iceberg table my_delta_iceberg_table refresh;

CREATE OR REPLACE TASK refresh_delta_tables
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
  SCHEDULE = 'USING CRON  0 * * * * America/Los_Angeles'
  AS
    BEGIN
        alter iceberg table  my_delta_iceberg_table refresh;
    END;

ALTER TASK refresh_delta_tables resume;

Access from Databricks

Databricks enables you to create external tables that refer directly to OneLake paths, allowing seamless access to your data. This process is straightforward and can be achieved with the support of Spark compute in Databricks.

Step 1: Create a Shortcut in Fabric LakeHouse

Create a LakeHouse or use an existing one to set up a shortcut in the Files section pointing to the mirrored Delta tables.
In the Fabric UI, navigate to the LakeHouse page, right-click the Files section, and select New Shortcut.
Under the Internal Sources category, choose Onelake.
You’ll now see a list of data entities within your Fabric ecosystem. Select the mirrored database created by Striim and click Next.
In the Tables section, choose the specific Delta tables you wish to access from Databricks and click Next.
Review your selections and click Create. This will create a shortcut in the Files section of your Fabric LakeHouse.
Note: This action does not copy any data; it simply creates a reference to the original Delta files.
Right-click the shortcut folder, click Properties, and note the ABFSS path for future use.

Step 2: Set Up Authentication

Register an Azure Entra ID app in the Azure portal. Provide a name for the app; no additional configurations are required.
Create a client secret for the app and save the value.
In the Fabric portal, go to your workspace, click Manage Access, and add the Entra app (created earlier) as a Contributor to the workspace.

Step 3 : Open notebook and create a Spark Session with the below configs.

Replace <client_id> with the Client ID (Application ID) of the Entra app, <tenant_id> with the tenant ID of your Fabric account, and <client_secret> with the secret value noted previously.

from pyspark.sql import SparkSession

# In a Databricks notebook this returns the session that already exists.
spark = SparkSession.builder.getOrCreate()

# Authenticate to OneLake with the Entra app's client credentials (OAuth).
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<client_id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<client_secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<tenant_id>/oauth2/token")

Step 4 : Create an external table in databricks.

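Replace <abfss_path> with the ABFSS path copied earlier.

# Register the OneLake Delta files as an external table (no data is copied).
spark.sql("""
    CREATE TABLE default.my_delta_table
    USING DELTA
    LOCATION '<abfss_path>'
""")

# Refresh to pick up the latest snapshot, then query.
spark.sql("REFRESH TABLE default.my_delta_table")

spark.sql("SELECT * FROM default.my_delta_table").show()

The table has to be refreshed to get the latest snapshot.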

Access from Spark Engines

Spark engines from Local machine, AWS EMR, Azure HDInsight and Google DataProc can be used to effectively query and compute the mirrored tables inside OneLake.

Step 1: Create a Shortcut in Fabric LakeHouse

Create a LakeHouse or use an existing one to set up a shortcut in the Files section pointing to the mirrored Delta tables.
In the Fabric UI, navigate to the LakeHouse page, right-click the Files section, and select New Shortcut.
Under the Internal Sources category, choose Onelake.
You’ll now see a list of data entities within your Fabric ecosystem. Select the mirrored database created by Striim and click Next.
In the Tables section, choose the specific Delta tables you wish to access from Spark and click Next.
Review your selections and click Create. This will create a shortcut in the Files section of your Fabric LakeHouse.
Note: This action does not copy any data; it simply creates a reference to the original Delta files.
Right-click the shortcut folder, click Properties, and note the ABFSS path for future use.

Step 2: Set Up Authentication

Register an Azure Entra ID app in the Azure portal. Provide a name for the app; no additional configurations are required.
Create a client secret for the app and save the value.
In the Fabric portal, go to your workspace, click Manage Access, and add the Entra app (created earlier) as a Contributor to the workspace.

Step 3 : Prepare the code

Replace <client_id> with the Client ID (Application ID) of the Entra app, <tenant_id> with the tenant ID of your Fabric account, and <client_secret> with the secret value noted previously.

Maven dependencies :

   
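<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.12</artifactId>
  <version>2.4.0</version>
</dependency>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-azure</artifactId>
  <version>3.4.1</version>
</dependency>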

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.SparkSession;

public class Onelake {
    public static void main(String[] args) {
        SparkSession session = getSession();
        // Replace "ABFSS Path" with the ABFSS path noted in Step 1.
        session.read().format("delta").load("ABFSS Path").show();
    }

    public static SparkSession getSession() {
        Map<String, Object> sparkConfigMap = new HashMap<>();
        // Enable Delta Lake support in Spark SQL.
        sparkConfigMap.put("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension");
        // Version kept in line with the Maven dependency above.
        sparkConfigMap.put("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0");
        sparkConfigMap.put("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog");
        // Authenticate to OneLake with the Entra app's client credentials (OAuth).
        sparkConfigMap.put("fs.azure.account.auth.type", "OAuth");
        sparkConfigMap.put("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider");
        sparkConfigMap.put("fs.azure.account.oauth2.client.id", "<client_id>");
        sparkConfigMap.put("fs.azure.account.oauth2.client.secret", "<client_secret>");
        sparkConfigMap.put("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<tenant_id>/oauth2/token");

        // config(Map) is available from Spark 3.4 onward.
        SparkSession session = SparkSession.builder()
                .master("local[*]")
                .config(sparkConfigMap)
                .getOrCreate();

        return session;
    }

}

This code can be run on a local machine, or it can be submitted to Spark engines on any cloud service provider (CSP) using the provider’s native job submission APIs or Apache Livy (a REST gateway for Spark).

Analytics with DuckDB

Mimoune Djouallah’s exploration of DuckDB and OneLake in running the TPC-DS benchmark demonstrates the robustness and adaptability of DuckDB combined with the openness of OneLake. The integration with Python notebooks highlights its efficiency for analytical tasks. Using OneLake for scalable Delta table storage underscores its strong performance, comparable to local SSDs in specific setups. This experiment effectively showcases the openness, versatility, and potential of both tools for lightweight and cost-effective analytical workflows.

How to run DuckDB with Fabric LakeHouse.

Step 1: Create a Shortcut in Fabric LakeHouse

Create a LakeHouse or use an existing one to set up a shortcut in the Files section pointing to the mirrored Delta tables.
In the Fabric UI, navigate to the LakeHouse page, right-click the Files section, and select New Shortcut.
Under the Internal Sources category, choose Onelake.
You’ll now see a list of data entities within your Fabric ecosystem. Select the mirrored database created by Striim and click Next.
In the Tables section, choose the specific Delta tables you wish to access from DuckDB and click Next.
Review your selections and click Create. This will create a shortcut in the Files section of your Fabric LakeHouse.
Note: This action does not copy any data; it simply creates a reference to the original Delta files.
Right-click the shortcut folder, click Properties, and note the ABFSS path for future use.

Step 2: Set Up Authentication in OneLake

Register an Azure Entra ID app in the Azure portal. Provide a name for the app; no additional configurations are required.
Create a client secret for the app and save the value.
In the Fabric portal, go to your workspace, click Manage Access, and add the Entra app (created earlier) as a Contributor to the workspace.

Step 3: Set Up Authentication in DuckDB

Before setting up authentication, install and load the delta and azure extensions:

install azure;
load azure;

install delta;
load delta;

Create secret

CREATE SECRET azure_spn (
      TYPE AZURE,
      PROVIDER SERVICE_PRINCIPAL,
      TENANT_ID '',
      CLIENT_ID '',
      CLIENT_SECRET ''
  );

Start querying, passing the ABFSS path noted in Step 1 to delta_scan:

SELECT * FROM 
delta_scan('');

DuckDB is an extremely fast, user-friendly, and feature-packed system that excels at processing complex analytical queries on massive datasets. It’s widely appreciated by users and definitely worth trying out.

Access from Google BigQuery

This is not possible for now: Google BigQuery doesn’t accept OneLake endpoints when creating BigLake external tables that reference Delta Lake tables on OneLake.

Conclusion

Open Mirroring has significantly empowered organizations to efficiently stream data from a variety of sources to Fabric OneLake, addressing critical aspects such as Change Data Capture (CDC) and Schema Evolution. By mirroring data and leveraging DeltaLake format, it ensures a high level of cross-vendor compatibility, thus enabling seamless access and improved interoperability across different systems.

This approach, however, introduces engineering challenges related to building a reliable data pipeline.

A leader in this space, Striim provides a robust solution that addresses these challenges. By focusing on critical aspects of data integration, Striim ensures that organizations can rely on a stable, scalable pipeline that facilitates smooth, continuous data replication, synchronization, and transformation across a diverse range of storage systems and platforms.

In conclusion, open mirroring provides organizations with the scalability, flexibility, and reliability needed to manage complex data ecosystems while optimizing costs and accelerating time-to-insight.

Microsoft Fabric Open Mirroring — Mirror once query anywhere — Striim was originally published in Striim on Medium, where people are continuing the conversation by highlighting and responding to this story.

How Change Data Capture Works: Understanding the Impact on Databases

John Kutay — Sat, 11 May 2024 06:52:28 GMT

Change Data Capture (CDC) is a process that extracts database transactional data in real-time. Data Engineers and Analytics Engineers adopt CDC platforms like Striim, Oracle GoldenGate, Debezium, IBM InfoSphere or Qlik Replicate to replicate operational data from databases to event buses (like Kafka, PubSub and EventHubs) or data warehouses such as BigQuery, Databricks, Snowflake, Fabric and others. Change Data Capture enables businesses to process real-time data without impacting production operational systems, which is crucial for event-driven architectures and analytics. This post examines the impact of CDC on operational databases, from minimal to significant levels, analyzing techniques that range from redo log mining to the maintenance of shadow tables.

Based on some in-person discussions I had in the data community, I observed that Change Data Capture is often perceived exclusively as a requirement for real-time analytics.

While CDC certainly enables real-time analytics if downstream consumers are not a bottleneck, the main value proposition of Change Data Capture is providing the most performant, least impactful way of replicating data from an operational database, regardless of the downstream latency requirements for analytics.

In a practical sense for data engineers, the beginning of a database replication project should not alarm DBAs and software engineers with potentially disruptive requests like polling the production database or adding triggers. Instead, data engineers need to present themselves as knowledgeable, considerate of potential concerns, and proactive in their collaboration to ensure efficient replication.

Why we care about database performance

Databases are critical for both business operations and customer-facing applications, where performance directly influences user experience, operational efficiency, and ultimately, business outcomes. High-performance databases ensure that applications run smoothly, data retrieval is swift, and transactions are processed quickly, which is vital for maintaining customer satisfaction and competitive advantage. For instance, in e-commerce, faster database responses can lead to quicker page loads and transaction processing, directly affecting sales conversions and customer retention. This is why we don’t want internal analytics workloads competing for database resources.

Implementing Change Data Capture with direct queries or shadow tables can significantly affect how the database manages its cache, particularly which objects its Least Recently Used (LRU) policy keeps in memory. The database cache is designed to keep frequently accessed data in memory, optimizing performance for repetitive queries. However, the operations involved in these CDC methods, like frequent querying or updating shadow tables, can disrupt this optimization.

These CDC methods might populate the cache with data that is less frequently accessed by operational applications, potentially evicting more critical data from the cache and degrading overall application performance. This scenario underscores why minimal impact CDC methods, such as log mining, are often preferable from a performance optimization perspective. These methods are less likely to interfere with the normal operation of the database’s query cache, maintaining a better balance between data capture and operational efficiency.
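To make the eviction effect concrete, here is a minimal sketch using a toy LRU cache built on Java's LinkedHashMap. It is not how any particular database implements its buffer cache; the class name, the capacity of 100 pages, and the page IDs are all illustrative assumptions. It loads a set of "hot" application pages, simulates a polling query that reads cold pages, and counts how many hot pages survive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy LRU page cache: an illustration only, not a real database buffer cache.
final class LruPollutionDemo {

    // A LinkedHashMap in access order evicts the least recently used entry first.
    static final class LruCache extends LinkedHashMap<Integer, String> {
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true); // true = iterate in access order
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
            return size() > capacity;
        }
    }

    // Loads the hot pages, then simulates a polling query that reads `scannedPages` cold pages.
    // Returns how many hot pages are still cached afterwards.
    static int hotPagesAfterScan(int capacity, int hotPages, int scannedPages) {
        LruCache cache = new LruCache(capacity);
        for (int page = 0; page < hotPages; page++) {
            cache.put(page, "hot");
        }
        for (int page = 0; page < scannedPages; page++) {
            cache.put(1_000 + page, "scanned"); // pages touched only by the CDC query
        }
        int stillCached = 0;
        for (int page = 0; page < hotPages; page++) {
            if (cache.containsKey(page)) { // containsKey does not change access order
                stillCached++;
            }
        }
        return stillCached;
    }

    public static void main(String[] args) {
        System.out.println("no scan:     " + hotPagesAfterScan(100, 80, 0));
        System.out.println("small scan:  " + hotPagesAfterScan(100, 80, 20));
        System.out.println("medium scan: " + hotPagesAfterScan(100, 80, 50));
        System.out.println("large scan:  " + hotPagesAfterScan(100, 80, 100));
    }
}
```

In this toy model a scan that fits in the spare cache capacity leaves the hot set intact, while a scan as large as the cache pushes every hot page out, so the application pays for cache misses it did not cause.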

Long story short, we want the database to be tuned for application purposes, not for the convenience of internal analytics workloads.

Varying Degrees of Impact on Databases

I ran a poll on both LinkedIn and Twitter to gauge the data community’s assumptions about the impact of Log-Based Change Data Capture, and the results were polarizing to say the least!

To answer my respective polls on Twitter and LinkedIn: the correct answer is YES. There’s always some memory overhead when performing Change Data Capture on an operational database. However, the impact can vary from minimal, such as the memory for a thread tailing a file on disk, to extreme, such as periodically running queries on your database. That can be the difference between some minor memory tuning of your database to support Change Data Capture and doubling the size and cost of your database to support batch queries.

To help visualize this, I created a matrix of performance impact from various methods of Change Data Capture.

Here we show how Striim performs Change Data Capture from Oracle. You can see some of the memory usage on the database is ‘house money’ so to speak: Oracle’s System Global Area maintains a redo log buffer regardless of external Change Data Capture consumers for downstream analytics. Striim simply subscribes to the changes published in the online redo log and buffers them in-memory and on-disk. We’ve done extensive work to support long-running transactions off-heap in Striim’s system layer, which also offloads processing from the operational database.

Minimal Impact: Redo or Write Ahead Log Mining

At the minimal end of the spectrum is log mining. Database systems like Oracle use the redo log files for log mining, which is a relatively low-impact method of implementing CDC. The Oracle database maintains a redo log to record all changes made to the database. This log is essential for data recovery and is also used in log mining to track changes.

The actual memory impact here is primarily on the System Global Area (SGA) in Oracle. The SGA is a shared memory region that contains data and control information for a database instance. It is used to process SQL statements and to manage the data as it moves through the system. In the context of CDC, when Oracle’s log mining feature (LogMiner) is used, it reads from the redo logs and uses the SGA to store the session’s private SQL area for processing the mined data. Oracle’s native log mining can also cause out-of-memory errors.

The SGA size can vary based on the workload and the specific configuration of the Oracle instance. However, since log mining processes only read the redo logs and do not require maintaining a separate physical structure for the changes, the memory overhead is relatively controlled. The key to minimizing impact in such configurations is to ensure that SGA memory is sized adequately to handle the peak workload without causing significant performance degradation.

Oracle also supports Archive Logs, which can serve as a backup of the redo log for long-term storage and recovery purposes. You can also mine data from the Archive Logs with minimal impact, given that the work of generating the redo log was already done by the database, and you’re just spinning up some threads to tail a file on the database host’s file system.
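The idea of tailing a file can be sketched in a few lines. The example below is an assumption-laden illustration, not Striim's or Oracle's implementation: real redo and archive logs are binary and need vendor-specific parsing, and the LogTailer class, the temp file, and the text records are hypothetical. What it shows is the access pattern: the reader remembers a byte offset and only reads what was appended since its last visit, so the writer does no extra work on the reader's behalf.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: real redo and archive logs are binary and need vendor-specific parsing.
final class LogTailer {
    private final Path file;
    private long offset = 0; // bytes already consumed by this reader

    LogTailer(Path file) {
        this.file = file;
    }

    // Returns only the bytes appended since the previous call.
    String readNew() throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            if (length <= offset) {
                return ""; // nothing new was written
            }
            byte[] buffer = new byte[(int) (length - offset)];
            raf.seek(offset);
            raf.readFully(buffer);
            offset = length;
            return new String(buffer, StandardCharsets.UTF_8);
        }
    }

    // Appends two records to a temp file (standing in for the database writer) and tails them.
    static List<String> demo() {
        try {
            Path log = Files.createTempFile("redo-demo", ".log");
            try {
                LogTailer tailer = new LogTailer(log);
                List<String> reads = new ArrayList<>();
                Files.writeString(log, "INSERT id=1\n", StandardOpenOption.APPEND);
                reads.add(tailer.readNew());
                Files.writeString(log, "UPDATE id=1\n", StandardOpenOption.APPEND);
                reads.add(tailer.readNew());
                reads.add(tailer.readNew()); // no new bytes, so this read is empty
                return reads;
            } finally {
                Files.deleteIfExists(log);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach(chunk -> System.out.print("[" + chunk + "]"));
    }
}
```

Each call to readNew costs the reader one file open and one sequential read; nothing is queried from, or written to, the system producing the log.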

Low Impact: Binary Log CDC in MySQL and Logical Replication in PostgreSQL

MySQL BinLog

MySQL’s binary log-based CDC is a low impact method to extract changes from a database. MySQL’s binary log records all changes to the database, both data and structure, as a series of events. This method is similar to Oracle’s log mining but includes some operational differences that may influence memory usage.

The binary log itself is a file-based log, not directly impacting the database’s memory under typical operations. However, reading these logs for CDC purposes, especially when using external tools or custom scripts, can increase memory usage depending on how the changes are processed and staged before they are consumed or replicated.

PostgreSQL’s logical replication offers a balance between performance impact and real-time data synchronization capabilities. Logical replication in PostgreSQL involves streaming changes at the logical level, rather than copying physical data blocks, allowing for more selective replication and lower overhead compared to physical replication methods. This method captures changes to the database schema and data in the form of logical change records, which are then transmitted to subscriber databases.

The impact on memory and overall database performance with logical replication is generally low-to-moderate. Unlike methods that require frequent direct queries or the maintenance of shadow tables, logical replication leverages a publish-subscribe model, which minimizes the disruption to the main database operations. This approach allows PostgreSQL to maintain high performance by not overly taxing the database’s cache or CPU resources, making it particularly suitable for applications that require real-time or near-real-time data updates without a significant performance trade-off. Logical replication is highly configurable and can be tuned to replicate entire databases, specific tables, or even specific rows, offering flexibility that is valuable for maintaining efficient database operations and ensuring data consistency across distributed systems.

PostgreSQL Write-Ahead Log with Logical Replication

Mining the PostgreSQL Write-Ahead Log (WAL) can be an intensive operation, especially when done frequently or on large databases. Here are some of the disk-related issues that can occur from this process:

Increased Disk I/O: The WAL records every change made to the database, so mining it means reading through these records. This can lead to increased disk I/O, which might strain the storage system, especially if it’s not equipped with high-performance drives like SSDs.

Disk Space Consumption: The WAL can grow significantly in size, especially in a busy database system with lots of transactions. If the WAL files are not managed properly (e.g., archived or cleaned up regularly), they can consume a substantial amount of disk space, potentially filling up the disk.

Performance Degradation: As the disk begins to fill up, or as I/O operations increase, you might notice a degradation in overall system performance. This can affect not just database operations but other applications that rely on the same disk resources.

Fragmentation: Over time, continuous writing and deleting of WAL files can lead to disk fragmentation. This can degrade the performance of the disk as it requires more time to read scattered pieces of data.

Risk of Data Loss: In extreme cases, such as when the disk is full or nearly full, new transactions might fail or the system might not be able to write new WAL entries. This could lead to transaction failures or, in worst cases, data corruption if the system behaves unexpectedly due to disk space issues.

To mitigate these issues, it’s important to:

Regularly monitor disk space and I/O metrics.
Implement proper WAL archiving and cleanup strategies.
Use high-performance disks for databases that require intensive I/O operations.
Consider scaling out your storage or using a dedicated storage system for WAL files to isolate the impact from other operations.

Moderate Impact: Maintaining Shadow Tables and Change Stream Implementations

SQLServer Change Data Capture with Shadow Tables

On the more significant end of the spectrum is the use of shadow tables for CDC, as implemented by SQL Server’s Change Data Capture feature. This method involves creating and maintaining additional tables (shadow tables) that mirror the structure of the monitored tables and hold all changes made to the data. Each insert, update, or delete operation on the target table is reflected in the shadow table, capturing the old and new values of the affected rows.

This approach has a more pronounced impact on memory usage for several reasons:

Increased Storage Requirements: Shadow tables increase the storage requirement as they duplicate a significant amount of data.
Increased I/O Operations: Manipulating shadow tables requires additional read and write operations, which can lead to increased memory usage as data pages are loaded into and evicted from the cache.
Overhead of Trigger-Based Tracking: In some implementations, triggers are used to populate shadow tables. Triggers themselves consume memory and CPU resources, further adding to the overhead.

MongoDB Change Streams is another Change Data Capture (CDC) implementation that generally has a minimal to moderate impact on database performance, depending on the scale and configuration of the deployment. Change Streams allow applications to access real-time data changes without the complexity of tailing the oplog (operations log) directly. They provide a more streamlined and scalable approach to reacting to database changes, making them particularly useful for applications that need to trigger actions or update external systems in response to data modifications within MongoDB.

Performance Impact of MongoDB Change Streams

Moderate Overhead: Change Streams utilize MongoDB’s built-in replication capabilities. They operate by listening to changes in the oplog, a special capped collection that logs all operations that modify the data stored in databases. Since the oplog is already an integral part of MongoDB’s replication infrastructure, tapping into this system adds only a little overhead. However, the oplog listener does consume resources from the database.
Scalability: MongoDB’s horizontal scalability through sharding means that Change Streams can also scale by distributing the load across multiple servers. This scalability helps in maintaining performance even as the volume of changes increases.
Selective Listening: One of the key features of Change Streams is their ability to filter and only listen to specific changes, which can significantly reduce the amount of data that needs to be processed and transmitted. This selective approach minimizes the memory and network bandwidth used, thereby mitigating the impact on the overall database performance.

Optimizing Performance with Change Streams

To optimize the performance when using MongoDB Change Streams, it’s crucial to:

Filter Changes: Apply filters to only subscribe to the relevant changes needed by the application, reducing unnecessary data processing.
Monitor Load: Keep an eye on the replication window and the impact of Change Streams on the primary database operations, especially in high-throughput environments.
Adjust Oplog Size: Ensure the oplog is appropriately sized to handle the volume of changes without frequent rollovers, which could lead to missed changes or higher latency in Change Streams.
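The oplog sizing advice comes down to simple arithmetic, sketched below. The sketch assumes a steady write rate, which real workloads rarely have, and the OplogWindow class, its method names, and the example figures (a 10 GiB oplog written at 1 MiB per second) are hypothetical. It estimates how long an entry survives before being overwritten and checks whether a consumer that was offline for a given time could still resume, since a change stream can only resume while its last-seen event is still in the oplog.

```java
import java.time.Duration;

// Back-of-the-envelope oplog sizing; assumes a constant oplog write rate.
final class OplogWindow {

    // Approximate time until the oldest oplog entry is overwritten.
    static Duration window(long oplogSizeBytes, long oplogBytesPerSecond) {
        if (oplogBytesPerSecond <= 0) {
            throw new IllegalArgumentException("write rate must be positive");
        }
        return Duration.ofSeconds(oplogSizeBytes / oplogBytesPerSecond);
    }

    // A consumer can resume only if its downtime is shorter than the oplog window.
    static boolean canResumeAfter(Duration consumerDowntime, Duration window) {
        return consumerDowntime.compareTo(window) < 0;
    }

    public static void main(String[] args) {
        long tenGiB = 10L * 1024 * 1024 * 1024;
        long oneMiBPerSecond = 1024L * 1024;
        Duration window = window(tenGiB, oneMiBPerSecond);
        System.out.println("window: " + window);
        System.out.println("resume after 2h outage: " + canResumeAfter(Duration.ofHours(2), window));
        System.out.println("resume after 3h outage: " + canResumeAfter(Duration.ofHours(3), window));
    }
}
```

With these example figures the window is a little under three hours, so a two-hour consumer outage is recoverable and a three-hour outage is not. Sizing against peak write rate rather than the average gives a more conservative window.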

Highest Impact: Periodic Batch Jobs with Query-Based CDC

The highest memory impact in CDC operations often occurs when CDC is implemented through periodic queries against the database. This approach involves running full or incremental queries at regular intervals to detect changes in the data. This method can be highly resource-intensive, particularly for large databases or databases with high transaction volumes.

The main challenges with batch query-based CDC include:

High Resource Utilization: Running complex queries to detect changes can consume significant CPU and memory resources, as these queries often involve scanning large portions of tables or joining multiple tables.
Impact on Database Performance: Frequent and resource-intensive queries can degrade the performance of the primary database operations, potentially leading to slower response times for other applications using the same database.
Data Freshness Issues: Since this method captures changes at intervals, there is a latency in data capture, which may not be suitable for scenarios requiring real-time or near-real-time data syncing.
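The challenges above can be sketched with a small in-memory simulation. This is a hypothetical model rather than any product's implementation: the Row record, the Poller class, and the updatedAt watermark column are assumptions, and the loop stands in for a query such as "select rows where updated_at is greater than the last watermark" running without a supporting index. The simulation counts how many rows the source has to scan and how many changes the poller actually captures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of watermark-based polling; not any vendor's implementation.
final class PollingCdcDemo {

    record Row(int id, String value, long updatedAt) {}

    static final class Poller {
        private long watermark = 0; // highest updatedAt captured so far
        long rowsScanned = 0;       // work performed by the source on the poller's behalf

        // Stands in for a periodic query filtering on updatedAt > watermark (no index assumed).
        List<Row> poll(List<Row> table) {
            List<Row> changes = new ArrayList<>();
            for (Row row : table) {
                rowsScanned++;
                if (row.updatedAt() > watermark) {
                    changes.add(row);
                }
            }
            for (Row row : changes) {
                watermark = Math.max(watermark, row.updatedAt());
            }
            return changes;
        }
    }

    // Returns {changes captured by the second poll, total rows scanned across both polls}.
    static long[] simulate() {
        List<Row> table = new ArrayList<>();
        for (int id = 1; id <= 1000; id++) {
            table.add(new Row(id, "v0", 1L));
        }
        Poller poller = new Poller();
        poller.poll(table); // initial load captures all 1000 rows

        // Three changes happen between polls: two updates to row 1 and one delete.
        table.set(0, new Row(1, "v1", 2L));
        table.set(0, new Row(1, "v2", 3L)); // overwrites v1 before the poller ever sees it
        table.remove(999);                  // deleted row 1000 leaves nothing to select

        List<Row> captured = poller.poll(table);
        return new long[] {captured.size(), poller.rowsScanned};
    }

    public static void main(String[] args) {
        long[] result = simulate();
        System.out.println("changes captured by second poll: " + result[0]);
        System.out.println("rows scanned in total: " + result[1]);
    }
}
```

In this model the second poll scans all 999 remaining rows to capture a single change, and it misses both the intermediate update and the delete. A log-based reader would see all three events without scanning the table at all.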

How we designed Striim

Change Data Capture is a powerful technique for enabling real-time data processing and synchronization. The choice of CDC method should consider not only the operational requirements but also the impact on database performance, especially memory usage. From minimal impact techniques like Oracle’s log mining to more intensive methods like SQL Server’s shadow tables, and the highest impact method of periodic query-based CDC, each approach has its trade-offs that need to be managed to maintain overall system efficiency and performance.

At Striim, we’ve optimized CDC to offer both high performance and ease-of-use, supporting log-based CDC and CDC from change tracking tables, thus simplifying data movement and reducing total cost of ownership as highlighted by our clients like American Airlines: “Striim is a fully managed service that reduces our total cost of ownership while providing a simple drag and drop UI. There’s no maintenance overhead for American Airlines to maintain the infrastructure.”

By optimizing Change Data Capture for low impact data capture and low latency delivery to consumers, enterprises like UPS are able to build real-time AI applications to battle porch pirates.

You can learn more about Striim’s leading performance in Change Data Capture in these partner-endorsed blogs:

Striim Oracle to BigQuery Benchmark

Striim Oracle to Snowflake Benchmark

If you want to start developing real-time, low impact Change Data Capture pipelines you can try Striim’s free trial and free community version.

How Change Data Capture Works: Understanding the Impact on Databases was originally published in Striim on Medium, where people are continuing the conversation by highlighting and responding to this story.
