Apache ShardingSphere 5.4.0 Release: Global Table, CDC Improvement, and Enhanced Metadata Loading

Apache ShardingSphere
CodeX
Published in
8 min readJul 27, 2023

Apache ShardingSphere has just launched version 5.4.0 following two months of development. This release includes 1271 PRs from global contributors, bringing you major enhancements in functionality, performance, testing, documentation, and examples. We aim to provide an improved user experience with the following highlights:

  • Broadcast table now replaced by global table
  • Improved Change Data Capture (CDC)
  • Optimized metadata loading logic for single table

Global Table Replacing Broadcast Table

Broadcast table refers to the table that exists in all sharding data sources. The data and structure within each database must be consistent. Broadcast tables are useful when dealing with a small amount of data and performing associated queries on large tables with massive data, such as a dictionary table.

Currently, broadcast tables can only exist in the data source form a sharding table. Query association would fail if a single table and a broadcast table were not in the same storage node. To address this, we need to use all the data in the schema. Therefore, broadcast tables should be removed from the sharding rule and changed into global tables.

API Changes

Meta Data

  • Added BroadcastRule.
  • The rules Collection in ShardingSphereRuleMetaData saves the BroadcastRule.
  • Removed broadcastTables from ShardingRuleConfiguration.
  • Removed broadcastTables from ShardingRule.
  • Removed broadcastTables from YamlShardingRuleConfiguration.

DistSQL

  • Adjusted the logic of CREATE BROADCAST TABLE RULE, with no change in SQL statements.
  • Adjusted the logic of DROP BROADCAST TABLE RULE, with no change in SQL statements.
  • Adjusted the logic of SHOW BROADCAST TABLE RULES, with no change in SQL statements.
  • Removed broadcast_table in the result of COUNT SHARDING RULE FROM sharding_db, with no change in SQL statements.
mysql> COUNT SHARDING RULE FROM sharding_db;
+--------------------------+----------------+-------+
| rule_name | database | count |
+--------------------------+----------------+-------+
| sharding_table | sharding_db | 2 |
| sharding_table_reference | sharding_db | 2 |
| broadcast_table | sharding_db | 0 |
+--------------------------+----------------+-------+
3 rows in set (0.00 sec)
  • Added COUNT BROADCAST RULE (FROM sharding_db) statement.

YAML

  • Removed broadcastTables config from the original SHARDING RULE in config-sharding.yaml config.
  • Added BROADCAST RULE config broadcast tables in config-sharding.yaml config.

JAVA Config

  • Added YamlBroadcastRuleConfiguration.
  • Added BroadcastRuleConfiguration.

SPI Interface

  • Added class YamlBroadcastRuleConfigurationSwapper to implement the YamlRuleConfigurationSwapper.
  • Added class BroadcastSQLRouter to implement the SQLRouter.

CDC Improvement

Change Data Capture (CDC), as a common database function, is provided by most relational databases such as MySQL, PostgreSQL, openGauss, etc.

CDC works by monitoring the record of modifications (insertions, updates, deletions, or structural changes). After changes are identified by CDC, they then are relayed to downstream systems or stored in a separate system such as OLTP, OLAP, or MQ systems. CDC can be used for data synchronization as well as ETL.

ShardingSphere 5.4.0 has the following optimization in terms of CDC:

CDC Now Supports Pure Incremental Mode

You can specify using pure incremental mode when creating CDCClient. In this case, CDC will skip full data synchronization and initialize the incremental position as early as possible.

StartCDCClientParameter parameter = new StartCDCClientParameter();
// The default value of full is false, meaning subscribing to incremental data only
parameter.setFull(false);
......
new CDCClient(parameter, records -> {
}).start();

CDC Export is Now Transaction-Based

To ensure that the data of the same transaction can be completely exported to the clients, we have optimized the incremental phase of CDC, aiming to ensure the complete transaction output at the physical level.

Scenario A: support all databases.

Scenario B: an XA transaction and multiple physical databases are independent of each other. Currently, only openGuass is supported.

openGauss provides a GLT component to ensure that the CSNs of multiple database instances are the same in the same XA transaction. When CDC exports data, it sorts by CSN to ensure the order of XA transactions.

  • CSN (Commit Sequence Number): the sequence number of the transaction to be committed. CSN is used as a logical timestamp inside openGauss to simulate the internal timing of the database.
  • GLT: maintains the next CSN as the global logical clock. ShardingSphere will issue the CSN to the openGauss instances to ensure that the visibility between any two transactions is consistent on each shard.

Improvements in the Metadata Loading Logic in Single Table

Starting from version 5.0, ShardingSphere can automatically scan a single table (non-sharded table), record metadata from the single table, and perform correct routing when users execute the relevant SQL statements. However, a large number of single tables can appear in certain user scenarios, leading to the following problems:

  • Slow startup (long metadata scanning).
  • Increased memory usage (a large number of objects from single tables and metadata).

The main reasons for the large number of single tables might be:

  • ShardingSphere manages a large number of data sources, but some tables are irrelevant to the current business.
  • A large number of single tables are created during project iteration.

To address these prevalent issues, we have designed a new mechanism that allows users to selectively import single tables, thereby avoiding the automatic loading of excessive single-table metadata during startup or when executing the REGISTER STORAGE UNIT.

Starting from ShardingSphere 5.4.0, you can manage single table loading as needed. The new YAML config and DistSQL statements are outlined below:

  • YAML
# YAML configuration
databaseName: sharding_db

- !SINGLE
tables:
# MySQL mode
- ds_0.t_single
- ds_2.*
- "*.*"
# PostgreSQL, openGauss support specifying schema
- ds_1.public.t_config
- ds_1.public.*
- ds_1.*.*
- "*.*.*"
# Control CREATE TABLE routing
defaultDataSource: ds_0
  • DistSQL
-- MySQL
LOAD SINGLE TABLE ds_0.t_single;
LOAD SINGLE TABLE ds_0.*;
LOAD SINGLE TABLE *.*;

-- PostgreSQL, openGauss
LOAD SINGLE TABLE ds_0.public.t_single;
LOAD SINGLE TABLE ds_0.public.*;
LOAD SINGLE TABLE ds_0.*.*;
LOAD SINGLE TABLE *.*.*;

SHOW UNLOADED SINGLE TABLES;

SHOW SINGLE (TABLES | TABLE tableName);

UNLOAD SINGLE TABLE ds_0.t_single;

Note:

  • Table names that are already configured in rules such as Encrypt, Mask, etc. will be automatically loaded if they exist in the single table.
  • When users execute CREATE TABLE, if the table is a single table, it will be automatically loaded.

Release Notes

API Changes

  • Metadata: Changed sharding broadcast tables to global broadcast tables.
  • JDBC: Removed exclamation mark (!) from global rules.
  • DistSQL: Simplified keywords ASSISTED_QUERY and LIKE_QUERY in Encrypt DistSQL.
  • DistSQL: Updated SQL_PARSER RULE syntax.
  • Encryption: Adjusted encryption API YAML config to differentiate between encryption, LIKE, and assisted query columns.
  • Encryption: Removed plain column and queryWithCipherColumn toggle.
  • Read/write split: API optimization.
  • Proxy: Removed proxy-instance-type config.
  • Proxy: Removed proxy-backend-executor-suitable.
  • Proxy: Removed proxy-mysql-default-version.
  • Scaling: Refactored commit rollback streaming to drop streaming.
  • (Experimental) Merged ShardingCacheRule into ShardingRule.

New Features

  • DistSQL: Added new syntax to manage SQL_FEDERATION rule.
  • Proxy: Now supports Unix Domain Socket.

Enhancements

  • Scaling: CDC now supports incremental mode only.
  • Scaling: CDC incremental data is now exported by transactions.
  • Scaling: CDC now supports MySQL and PostgreSQL.
  • Scaling: CDC now supports the single table.
  • Scaling: CDC now supports all data types of openGauss.
  • Scaling: CDC now supports openGauss incremental replication reconnection.
  • Scaling: Removed DataConsistencyCalculateAlgorithmChooser which is incompatible with encryption rules.
  • Scaling: Optimized integer unique key tables inventory data splitting.
  • Scaling: Adjusted the default value for process config to optimize resource consumption.
  • Scaling: Auto refresh table metadata for data migration.
  • Scaling: When creating a slot in PostgreSQL/openGauss, check if the corresponding database for the slot exists.
  • Scaling: The result of the data consistency check status is no longer false when it is not completed.
  • Scaling: Enabled concurrent CRC32_MATCH match on source and target.
  • Scaling: Pipeline job compatible with sharding audit strategies.
  • Metadata: Refactored the ShardingSphere metadata system architecture.
  • Metadata: Optimized the logic for loading metadata of a single table.
  • Metadata: Now supports the empty query of the system table for MySQL/PostgreSQL/openGauss.
  • DistSQL: Added support for transactionalReadQueryStrategy for read/write split.
  • DistSQL: Enhanced algorithms property check.
  • Transaction: Added permission check.
  • Transaction: Removed TransactionTypeHolder and only created the current transaction manager.
  • SQL Support: Support MySQL LOAD DATA and LOAD XML statements for single table and broadcast table.
  • SQL Support: Improved the high-priority SQL in MySQL test program results.
  • SQL Support: Oracle SQL parser now supports Chinese commas.
  • Encryption: Support query of encrypt column in projection subquery.
  • Kernel: Added metadata verification for table existence to INSERT, DELETE, UPDATE, and SELECT statements.
  • JDBC: Implemented batch execution for ShardingSphereStatement.
  • Proxy: Added TLS support for the frontend.
  • Proxy: Support Flush messages in PostgreSQL/openGauss proxy.
  • Proxy: Added support for bit and bool data types in PostgreSQL proxy.

Bug Fixes

  • Scaling: Fixed the issue of resuming due to job preparation failure.
  • Scaling: Fixed the issue where CDC deletes event record.beforeList is null.
  • Scaling: Fixed the issue of the single quote in openGauss mpp decoding plugin.
  • Scaling: Fixed execute engine not closed after job stopping.
  • Scaling: Fixed stop job before task starting.
  • Metadata: Fixed case sensitivity issue when loading schema metadata with the H2 database.
  • Metadata: Fixed “object not found” exception when using PostgreSQL/openGauss schema name as the logical database name.
  • DistSQL: Fixed the wrong result of check_table_metadata_enabled when executing SHOW DIST VARIABLE.
  • Encryption: Fixed the rewriting exception caused by lowercase in PostgreSQL/openGauss Encrypt LIKE.
  • Sharding: Support passing empty shard conditions to the sharding algorithm, allowing users to control null value routes.
  • SQL support: Support BETWEEN AND expression parsing in MySQL Projection.
  • Data masking: Fixed incorrect result when configuring the same value of from-x and to-y with KEEP_FROM_X_TO_Y.
  • Infra: Fixed ClassNotFound error when missing pgjdbc JAR.
  • Proxy: Fixed the error of Sequence ID of pipelining requests in MySQL proxy.

Overall, this latest version of Apache ShardingSphere offers significant enhancements and features that help users solve their problems more efficiently. The Apache ShardingSphere 5.4.0 release has been made possible by the hard work of our community with 1271 PRs made by 53 contributors. We are grateful for your support!

We kindly invite you to join the Apache ShardingSphere community where you can improve your skills with global technology experts from diverse backgrounds. Here are some ways to engage with our community:

  • Join the ShardingSphere Slack discussions.
  • Contribute to our project on GitHub. New to coding? No problem — check out our beginner-friendly issue list.
  • Create content: submit your ShardingSphere-related content such as tutorials, experience, case studies, and more.
  • Help improve our documentation: review and optimize existing docs, or assist with updates.
  • Does none of the above work for you? Consider signing up for our events, becoming a volunteer, providing suggestions, or simply promoting ShardingSphere — we sincerely appreciate any support you can offer!

Relevant Links

--

--

Apache ShardingSphere
CodeX
Writer for

Distributed SQL transaction & query engine for data sharding, scaling, encryption, and more - on any database. https://linktr.ee/ApacheShardingSphere