A Glimpse into Flink CDC 3.0

Giannis Polyzos
7 min read · May 21, 2024


The Next Generation of Streaming Data Integration

Apache Flink CDC 3.1.0 was released last week, marking the first official release since the project was donated to the Apache Software Foundation.

Disclaimer: This article was co-written with Leonard Xu, Apache Flink PMC member and Flink CDC project lead.

This blog post aims to provide some insight into the goals of Flink CDC 3.0. Flink CDC 2.0 set the bar high as a CDC solution and has been adopted in many large-scale production environments, addressing problems users would typically come across with alternative solutions.

The hope is to help engineers and architects better understand why they should consider this solution, regardless of scale.

Introduction

Flink CDC is a real-time data integration framework based on the Change Data Capture (CDC) technology of database changelogs.

It provides multiple advanced features, such as full and incremental data synchronization, lock-free reading, parallel reading, automatic synchronization of schema changes, and distributed architecture, on top of Apache Flink’s excellent processing capability and robust ecosystem.

Full and incremental data synchronization refers to the process of reading all the historical data within the database and then automatically switching to reading the incremental data.
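
For the MySQL connector, this behavior is controlled by the startup mode of the source. The snippet below is only a sketch; hostnames, credentials, and table names are placeholders, and the exact option keys can differ between connector versions.

    source:
      type: mysql
      hostname: mysql-host       # placeholder
      port: 3306
      username: flink_cdc        # placeholder credentials
      password: "****"
      tables: app_db.orders
      # "initial" (the default) first reads a consistent snapshot of the
      # existing data, then switches to reading the incremental binlog.
      scan.startup.mode: initial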

Background

Flink CDC 2.0 gained a lot of traction among users. However, even though it provided many benefits compared to existing CDC solutions and saw strong adoption, users still experienced the following pain points:

  • User experience: Flink CDC provides only source connectors and does not support end-to-end data integration, so users have to build synchronization jobs themselves via SQL or the DataStream API.
  • Frequent maintenance: Table creation and deletion operations have to be repeated because schemas in source databases change frequently.
  • Scalability: Large amounts of resources are required to synchronize data from thousands of tables and ingest tens of thousands of tables into data lakes or data warehouses. In addition, scaling cannot be automatically performed to handle different resource requirements for the full synchronization and incremental synchronization stages.

To tackle these challenges, Flink CDC 3.0 was introduced.

Flink CDC was donated to the Apache Software Foundation, aspiring to become a complete streaming data integration framework based on the following design principles:

  • End-to-end experience: As an end-to-end data integration framework, Flink CDC 3.0 provides high-level abstractions for setting up data movement pipelines easily.
  • Schema Synchronization: It can automatically synchronize schema changes from upstream to downstream systems, allowing users to also add tables to existing jobs at any time.
  • Elasticity: Idle resources can be automatically reclaimed, and a single sink instance can write to multiple tables simultaneously.
  • Large data volume: Users' legacy databases can be large, commonly containing over 100 TB of data.
  • Real-time processing of incremental data: The business value of incremental data is higher than that of historical data but decreases over time, which leads to high requirements for data freshness; new incoming events need to be processed as soon as possible.
  • Data ordering: Support for global preservation of data ordering to ensure the consistency of processed data.

Design of Flink CDC 3.0

Architecture

The architecture of Flink CDC 3.0 is divided into four layers:

  • Flink CDC API: A YAML-based API is provided to help end users configure data synchronization pipelines. Pipeline definitions are then submitted through the Flink CDC CLI.
  • Flink CDC Connect: Source and sink connectors are provided to interact with external systems. Flink CDC 3.0 wraps the connectors of Apache Flink and Flink CDC to read data from and write data to external systems.
  • Flink CDC Composer: This layer translates data synchronization tasks into Flink DataStream jobs.
  • Flink CDC Runtime: Custom Flink operators are provided for different data synchronization scenarios to implement advanced features, such as schema changes, routing, and transformations.

User-Friendly API Design

Flink CDC 3.0 is tailored for seamless streaming data integration scenarios. Users do not need to worry about the implementation details of the framework.

They can easily create data synchronization pipelines by using a YAML file, configuring data sources, sinks, and intermediate transformations or routes.

The following figure shows a sample YAML file for synchronizing data from a MySQL database to Apache Kafka or Paimon.

Ingestion Pipeline from MySQL to Kafka or Paimon
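
Since the figure may not render here, the following is a minimal sketch of what such a pipeline definition can look like for the MySQL-to-Paimon case (a Kafka sink would follow the same shape with a different sink type and options). Hostnames, credentials, and the warehouse path are placeholders, and the exact option keys depend on the connector versions in use.

    source:
      type: mysql
      hostname: localhost
      port: 3306
      username: flink_cdc              # placeholder credentials
      password: "****"
      # One regular expression captures every table in the app_db database
      tables: app_db.\.*

    sink:
      type: paimon
      # Assumed filesystem-backed Paimon catalog; point it at your warehouse
      catalog.properties.metastore: filesystem
      catalog.properties.warehouse: /tmp/paimon

    pipeline:
      name: Sync app_db from MySQL to Paimon
      parallelism: 2

A definition like this is submitted through the Flink CDC CLI, which translates it into a regular Flink job and runs it on an existing Flink cluster.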

Pipeline Connector API

To facilitate the integration of external systems into data synchronization pipelines, Flink CDC 3.0 introduced the Pipeline Connector API.

  • DataSource: It is used to collect change events from external systems and pass them to downstream operators. It is composed of EventSourceProvider and MetadataAccessor. EventSourceProvider builds Flink sources, whereas MetadataAccessor accesses metadata.
  • DataSink: It is used to apply schema changes received from upstream operators and write the changed data to external systems. It is composed of EventSinkProvider and MetadataApplier. EventSinkProvider builds Flink sinks, whereas MetadataApplier applies metadata changes (such as table schema changes) to the destination system.

To ensure compatibility with the Flink ecosystem, the design of DataSource and DataSink follows the same logic as Apache Flink. Developers can easily integrate external systems with Flink CDC 3.0 by using Flink connectors.

Core Features of Flink CDC 3.0

To achieve high performance in scenarios such as schema changes, full database synchronization, and table merging, Flink CDC 3.0 integrates the capabilities of Apache Flink and provides multiple custom Flink operators to support various synchronization modes.

Schema Evolution

Schema evolution is a common but challenging feature of data synchronization frameworks. Flink CDC 3.0 introduces a SchemaRegistry that coordinates schema information for the job, and a SchemaOperator that manages schema changes inside the job topology.

Here’s how Flink CDC 3.0 handles schema changes:

  • When a schema change is detected in a data source, SchemaRegistry issues a pause request to SchemaOperator. After receiving the request, SchemaOperator pauses the streaming ingestion and flushes the data to maintain schema consistency.
  • Once the schema change is synchronized to the external system, SchemaRegistry issues a resume request to SchemaOperator. After receiving the request, SchemaOperator resumes the streaming ingestion.
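
From the user's perspective, how a pipeline reacts to upstream schema changes is configured in the pipeline definition rather than in code. The snippet below is a sketch under the assumption that the schema.change.behavior option (with values such as evolve) is available in the release you are running; it may be absent or named differently in older versions.

    pipeline:
      name: Sync app_db with schema evolution
      # Assumed option: "evolve" applies upstream DDL changes
      # (for example, newly added columns) to the sink automatically.
      schema.change.behavior: evolve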

Full Database Synchronization

Users can specify a multi-table or full database synchronization task by configuring the DataSource in the configuration file of Flink CDC 3.0.

The schema evolution feature enables automatic synchronization for the entire database. When new tables are detected, SchemaRegistry automatically creates the corresponding tables in the destination system.
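
Concretely, multi-table and whole-database synchronization come down to how the tables option of the source is written. Continuing the earlier sketch (names are still placeholders):

    source:
      type: mysql
      # ...connection options as in the earlier sketch...
      # Whole-database synchronization: one pattern captures every table in app_db
      tables: app_db.\.*
      # Multi-table synchronization would instead list the tables explicitly,
      # for example: tables: app_db.orders, app_db.customers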

Table Merging

Another common use case of Flink CDC 3.0 is merging multiple source tables into a single sink table. Flink CDC 3.0 employs a Route mechanism to implement table merging and synchronization. Users can define routing rules in the configuration file of Flink CDC 3.0 by using regular expressions to specify the source tables and the sink table.
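
As a sketch (database and table names are placeholders), a route block merging sharded order tables into a single destination table could look like this:

    route:
      # Every source table whose name matches the regular expression
      # is written into the single ods_db.orders_all sink table.
      - source-table: app_db.orders_shard.*
        sink-table: ods_db.orders_all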

High-performance Data Structure

To reduce serialization overhead during data transmission, Flink CDC 3.0 adopts a high-performance data structure.

  • Schemaless deserialization: Schemaless deserialization decouples schema information from changed data. Before sending changed data, DataSource sends the schema description, which is tracked by the framework. This way, schema information does not need to be bound to each changed record, and the serialization cost for wide tables is significantly reduced.
  • Binary storage format: Data is stored in a binary format during synchronization. Deserialization is performed only when the detailed data of a field is read (such as when the table is partitioned by the primary key) to reduce serialization costs.

In addition to fundamental data synchronization capabilities, Flink CDC 3.0 provides multiple advanced features, such as automatic synchronization of schema changes, full database synchronization, and table merging and synchronization, to cater to complex data integration scenarios.

The automatic synchronization of schema changes frees users from manual intervention when schema changes occur in a data source, greatly reducing operational costs.

Moreover, only a few operations are needed to configure a multi-table or multi-database synchronization task, facilitating users’ development.

Conclusion

Apache Flink CDC 3.0 sets a new direction and the future looks quite promising.

It already supports a rich variety of connectors; as of version 3.1.0, the streaming data integration framework supports MySQL, Apache Doris, StarRocks, Apache Kafka, and Apache Paimon out of the box.

Apache Flink CDC 3.0 is part of a unified streaming stack, and it also powers Ververica’s streaming data movement framework.

Unified Ingestion refers to the process of being able to read all the historical data within the database (batch reads) and then without locking the database, automatically switch to reading the incremental data (streaming reads).

At the same time, the framework needs to ensure data consistency, scale resources down when possible, and avoid putting pressure on the source system.

Make sure to keep an eye on the project, give it a try and if you like it, don’t forget to give it some ❤️ via ⭐ on GitHub.

