Onehouse Offers First Turnkey End-to-End CDC Ingestion Lakehouse

Published in onehouse-blogs · 9 min read · Jan 5, 2024

by Kyle Weller
November 30th, 2023

Data engineers need a new approach for using change data capture (CDC) to extract updates from transactional databases and keep analytics tables up to date, with end-to-end times measured in seconds. The data warehouse is flexible and can be updated fast enough, but is proprietary, expensive, and creates an extra, captive copy of incoming data. Traditional data lakes can accommodate the data volumes involved on affordable object storage, such as Amazon S3 or Google Cloud Storage. But data lakes operate best in batch mode, making real-time and near real-time updates difficult or impossible to achieve.

(This article is a repost from the Onehouse blog.)

Now there is a new solution that offers the best of both worlds: the flexibility and update speed of a data warehouse and the openness, capacity, and affordability of a data lake. This solution is the universal data lakehouse, based on Hudi technology and offered as a managed service by Onehouse.

A New Solution for CDC

Our previous blog post described the benefits of CDC and why it is more and more frequently used in modern data architectures. However, CDC only captures changes and sets them in motion toward a destination. For the data warehouse, updating the destination is a solved problem, though it comes at a price. But there’s no uniform, easy-to-implement, performant, and reliable way to update destination tables on a data lake architecture.

In this blog post we describe the problem, the current solutions, and why those solutions are unsatisfactory. We then describe how the Onehouse managed service, based on the Hudi open source project, offers a fast, easy-to-use, serverless solution to this challenge, running on inexpensive object storage in the cloud.

In addition, using Onehouse to create and maintain the target data table doesn’t only solve the problem for the original use case. Once the target data table is created, it’s in open table formats and open data formats on object storage in the cloud, in your own virtual private cloud (VPC) account. From there, you can:

  • Query the table directly
  • Transform it further with open compute services
  • Transform it further with proprietary compute services, such as a data warehouse

The initial data table can serve as a bronze data table in a medallion architecture. Open compute services can then be used to create a cleansed and deduped silver data table, still resident on the data lake. In addition, data from CDC can be augmented with data from multiple sources. The bronze and silver data tables serve as a source of truth for multiple purposes, simplifying the data architecture, as shown in Figure 1.

Figure 1. Using the Universal Data Lakehouse as a source of truth for CDC.

Silver data tables can then be processed to gold, using open or proprietary compute services, on the data lake or within one or more data warehouses. This approach reserves expensive proprietary compute services for the use cases where they add the most value.
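The bronze-to-silver step described above can be sketched in plain Python (an illustrative assumption, not a Spark job or Onehouse code): deduplicate raw bronze records by keeping only the latest version of each record key.

```python
# Hypothetical sketch: promote a raw "bronze" batch to a deduplicated
# "silver" table by keeping the latest version of each record key.
def to_silver(bronze_rows, key="id", version="updated_at"):
    latest = {}
    for row in bronze_rows:
        k = row[key]
        # Keep the row with the highest version (e.g., update timestamp).
        if k not in latest or row[version] > latest[k][version]:
            latest[k] = row
    return list(latest.values())

bronze = [
    {"id": 1, "updated_at": 1, "status": "new"},
    {"id": 1, "updated_at": 3, "status": "shipped"},
    {"id": 2, "updated_at": 2, "status": "new"},
]
silver = to_silver(bronze)
# silver holds two rows; record 1 survives at its latest version
```

In practice this logic runs as a distributed job over the bronze Hudi table, but the key/version comparison is the essence of the deduplication.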

Why is End-to-End CDC a Challenging Problem?

CDC is largely a solved problem when using change logs from transactional databases to update data warehouse destinations. The problem here is that data warehouses are not an ideal solution for many use cases, for reasons that include:

  • Data warehouses are mostly closed, proprietary solutions
  • Data warehouses use relatively expensive structured databases
  • To use a data warehouse, you copy your data into infrastructure controlled by the provider, where further processing is expensive to run, and egress charges apply if you want to get the data back out
  • In using a data warehouse, you are likely to end up with multiple copies (or near-copies) of your data: one or more versions in a data warehouse and one or more versions in open storage

Ideally, we would use a data lake solution instead. Data lakes tend to be open and to run on inexpensive object stores, making it practical to store very large amounts of data and to keep versions of data tables for re-use and for governance purposes, as is done with the medallion architecture.

However, keeping a data lake updated against the changes delivered by CDC has, until now, been far more difficult than with a data warehouse. To understand why it’s so challenging, we need to break the process down into its component parts.

The term CDC only refers to capturing changes from the source database, but that’s actually only the first part of the problem that data engineers and developers face. To move changes through your data infrastructure efficiently, you need to complete several steps. For log-based CDC, these steps are:

  • Connect directly to the source database
  • Extract change logs into a scalable event buffer where they can be stored and processed
  • Interpret the log entries to establish the specific database operations they represent — inserts, updates, and deletes
  • Apply these changes to the target data table for use in analytics

The result of these steps is that the analytics-ready destination data table is a consistent, exact, and reliable representation of the current state of the transactional database that it reflects.
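The last of the four steps, applying interpreted changes to a target table, can be sketched in a few lines of Python (a conceptual assumption, not Onehouse or Hudi code): each change event names an operation, a record key, and the new row, and the target is updated in order.

```python
# Minimal sketch of the "apply" step: replay ordered CDC events
# (op, key, row) against an in-memory target table keyed by primary key.
def apply_changes(table, changes):
    for op, key, row in changes:
        if op in ("insert", "update"):
            table[key] = row      # upsert semantics: insert or overwrite
        elif op == "delete":
            table.pop(key, None)  # idempotent delete
    return table

target = {1: {"name": "Ada"}}
apply_changes(target, [
    ("insert", 2, {"name": "Grace"}),
    ("update", 1, {"name": "Ada L."}),
    ("delete", 2, None),
])
# target is now {1: {"name": "Ada L."}}
```

On a data warehouse this replay is trivial because rows are mutable; the rest of this post is about why the same step is hard on immutable object storage.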

This problem is challenging for data lakes for several reasons:

  • Data lake object storage is immutable by nature at the file level, requiring time-consuming and costly file rewrites for any modifications.
  • Data lake solutions tend to use columnar file formats, such as Apache Parquet, which are efficient for analytics, but slow and expensive for write operations.
  • Data lakes don’t traditionally support ACID guarantees, so at a given point in time, the target data table is not a reliable reflection of the source operational database.

This leads to serious operational problems when trying to implement a CDC process on the data lake. Until recently, it’s been impossible to get the best of both worlds — the openness, flexibility, lower cost, and ease of use of the data lake, and the ability to work with mutable data of the data warehouse.

Where Existing Solutions Fall Short

It’s well known that distributed databases are limited by the CAP theorem: an ideal data store would offer consistency, availability, and partition tolerance, yet the theorem shows that no distributed data store can guarantee all three of these attributes at once.

Similarly, a data lake-based solution for CDC needs to be open, flexible, and easy-to-use; but available solutions have offered one or two of these attributes, not all three.

Figure 2. Integration solutions vs. data warehouse and data lake storage.

Figure 2 shows some of the important operational databases we want to use CDC for on the left, and the kind of targets that different solutions support on the right: either data warehouses or data lakes.

Solutions tend to fall into five categories with regard to their openness, flexibility, and ease of use:

  • No direct database connection: The first kind of solution lacks the ability to connect directly to a database, requiring the use of tools such as Debezium and Kafka to connect to databases and stream data; this limits the ability to support CDC use cases. Tabular falls into this category.
  • CDC only to data warehouse locations: The next group only works, or works best, with a data warehouse, because these solutions don’t have a practical way to deal with the immutable nature of object storage, as used by a data lake.
  • Raw changelogs copied to data lakes: The third group fully supports data warehouse targets but only supports the writing of changelogs to a data lake — there is no ability to insert, update, or delete existing records, unless the user writes their own updating code, a massive undertaking.
  • Changes materialized into data lake tables using proprietary technology: When a vendor is able to materialize changes into a table on a data lake, this usually does not use open source technology, and the proprietary technology used lacks optimizations for the lakehouse.
  • Full end-to-end data lakehouse CDC: Only one type of solution is able to update data tables on a data lake so they fully reflect the source operational databases, including the ability to insert, update, and delete existing records. This is accomplished by the Hudi lakehouse project and the Onehouse managed service.

How Hudi Implements Upserts and Deletions

Why are the Hudi open source project, and the Onehouse managed service based on Hudi, the only solutions able to support inserts, updates, and deletions on a data lake?

Hudi is the original lakehouse open source project and was designed from the beginning to support mutable data. Hudi accomplishes this by supporting a rich set of services that include the needed update and deletion capabilities, with performance that approaches that of a transactional database.

What’s different about Hudi? The core difference is that Hudi uses metadata in a clever and original way to work around the append-only nature of updates to data lakes residing on object storage. Data files, which are immutable, are kept up to date in four steps:

  • When a data table is first created, it’s stored in an immutable file on object storage.
  • When updates arrive — for instance, via CDC — they can be stored using copy on write (CoW), in which each write causes an updated version of a Parquet file to be created; or merge on read (MoR), in which updates are kept as metadata accompanying the relevant Parquet file until compaction. Figure 3 shows these two approaches.
  • Between compactions, reads accommodate updates (stored as metadata) by triggering a data merge of a Parquet file and its accompanying update metadata for each query.
  • Compactions run on a schedule or by request, which has a cost, but which restores read performance, as it is no longer necessary to merge metadata for a given Parquet file until the next update to that file occurs.
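The four steps above can be sketched as a toy merge-on-read file group in pure Python (a conceptual assumption, not Hudi internals): the base file is never modified in place; writes append to a log, reads merge base and log, and compaction rewrites the base and clears the log.

```python
# Conceptual sketch of merge on read (MoR): an immutable base file plus
# a log of pending updates, merged at read time or at compaction.
class MorFileGroup:
    def __init__(self, base_rows):
        self.base = dict(base_rows)  # stands in for an immutable Parquet file
        self.log = []                # pending updates kept alongside the base

    def write(self, key, row):
        self.log.append((key, row))  # cheap append; no base-file rewrite

    def read(self):
        merged = dict(self.base)     # each query merges base + log
        for key, row in self.log:
            merged[key] = row
        return merged

    def compact(self):
        self.base = self.read()      # rewrite the base file once
        self.log = []                # later reads need no merge

fg = MorFileGroup({1: "a", 2: "b"})
fg.write(2, "b2")
assert fg.read() == {1: "a", 2: "b2"}  # read reflects the pending update
fg.compact()                           # restores read performance
```

Copy on write, by contrast, would perform the `compact` step on every write: each batch of updates produces a fresh base file, making writes heavier but reads cheap.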

Figure 3. Hudi and Onehouse use additional metadata to implement transactions on the data lake.

Figure 3 shows the tradeoffs between using the CoW and MoR updating approaches. Adroit use of CoW and MoR allows for an optimal balance between write and read performance, maintaining a high level of overall system performance. Overall performance approaches the performance of a relational database, at far less cost.
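For the curious, the CoW/MoR choice surfaces directly in Hudi's Spark datasource write options. The sketch below uses real Hudi option keys, but the table name and field names are illustrative assumptions for a hypothetical "orders" table, not a prescribed configuration.

```python
# Hedged sketch: typical Apache Hudi write options for a CDC upsert target.
# Option keys are from the Hudi Spark datasource; the values are
# assumptions for a hypothetical "orders" table.
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",    # primary key
    "hoodie.datasource.write.precombine.field": "updated_at", # picks latest version
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",    # or "COPY_ON_WRITE"
    "hoodie.datasource.write.operation": "upsert",            # insert/update/delete semantics
}

# In a Spark job, the incoming change batch would be written roughly as:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

Switching between the two table types is a one-line change, which is what makes it practical to tune the write/read balance per table.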

Hudi has specific advantages over other lakehouse projects that enable Hudi users (and the Onehouse managed service) to achieve these feature and performance benefits. The differences are described in our blog post, Apache Hudi vs Delta Lake vs Apache Iceberg — Data Lakehouse Feature Comparison, and explained in our webinar on the same topic.

How Onehouse Solves the Problem

Creating a data lakehouse with any of the major lakehouse open source projects, including Hudi, is a large undertaking. Onehouse delivers a data lakehouse as a managed service, so you avoid the challenges of implementing a data lakehouse yourself.

Onehouse also makes full use of the new Onetable project, which provides interoperability across Hudi, Iceberg, and Delta Lake. As a result, the Onehouse user gets all the capabilities of Hudi, which is highly open, highly performant, and has a rich services layer, while maintaining full interoperability with all lakehouse table formats.

With Onehouse, the user saves time and effort in two important (and related) ways:

  • The data lakehouse is created through the Onehouse user interface, a process which takes a day or two of work by one person rather than many months of work by a team.
  • Managing a DIY data lakehouse and handling change requests back to the lakehouse implementation team is a major DevOps challenge, while managing the Onehouse service is a part-time job managed through a point-and-click interface (see Figure 4).

Figure 4. Mapping a database to a data lake and assigning a catalog are easy in Onehouse.

Onehouse offers end-to-end CDC as part of a managed service. The Onehouse offering is serverless; you don’t have to instantiate, provision, and operate servers, nor do you have to scale, deal with errors and faults, or handle security issues. You simply call the services needed to extract data and load the changes into an analytics-ready data table.

The Hudi project and the Onehouse managed service treat CDC as a first-class problem. Onehouse dedicates development and operational resources to supporting end-to-end CDC, including a close partnership with Confluent for managed Kafka and continuous interaction with relevant open source projects. Onehouse customers frequently use the service for end-to-end CDC, which is a primary driver for many customers to begin an engagement with Onehouse.

A Onehouse-type solution to end-to-end CDC is not new; in fact, it’s a widely used architecture that is implemented in hundreds of companies, mostly enterprise-scale organizations with large engineering staffs. These organizations have spent a great deal of engineering time and effort to implement semi-custom solutions based on open source lakehouse software.

Onehouse is gaining traction as a solution that organizations seriously consider when implementing CDC. If you have such a workload on the horizon — or if you want to save time, money, and hassle on existing workloads — contact Onehouse today.
