Using OneTable to translate a Hudi table to Iceberg format and sync with Glue Catalog

Sagar Lakshmipathy
2 min read · Jan 25, 2024


It dawned on me that we don’t have a step-by-step guide on the OneTable site for syncing OneTable-translated tables directly to Glue Catalog. At work, I was doing some data migration to create a bunch of Iceberg tables, so I figured I’d write about it so other folks can benefit from it.

Without much fluff, I’ll get to the point.

1. Create/Download bundled OneTable jar

a. You could download the OneTable jar (utilities-0.1.0-beta1-bundled.jar) directly from GitHub

b. Or build one yourself following these steps

2. Write Hudi tables to storage

a. Start a pyspark shell. This works locally if you have a matching Spark version installed. Alternatively, choose an EMR cluster, which comes with Hudi pre-installed, so you can access Hudi directly from pyspark.

b. Write tables to S3. If you’re working locally, you’ll have to use s3a:// in place of s3:// in the table path.
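Here's a minimal sketch of what such a write could look like from the pyspark shell. The sample rows, the column names (id, name, age, create_ts, city), and the choice of id as the record key and create_ts as the precombine field are assumptions for illustration only. The table name people, the city partition column and the S3 path are the ones used in the config in step 4.

```python
# Sketch only: the sample rows and column names are made up for illustration.
TABLE_NAME = "people"
BASE_PATH = "s3://bucket-name/hudi-dataset/people"  # use s3a:// when running locally

# Column order: id, name, age, create_ts, city
RECORDS = [
    (1, "John", 25, "2023-09-28 00:00:00", "NYC"),
    (2, "Emily", 30, "2023-09-28 00:00:00", "SFO"),
    (3, "Michael", 35, "2023-09-28 00:00:00", "ORD"),
]
COLUMNS = ["id", "name", "age", "create_ts", "city"]


def hudi_options(table_name):
    """Hudi write options; the table is partitioned by the city column."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "id",          # assumed key column
        "hoodie.datasource.write.precombine.field": "create_ts",  # assumed ordering column
        "hoodie.datasource.write.partitionpath.field": "city",
        "hoodie.datasource.write.hive_style_partitioning": "true",
    }


def write_people(spark, base_path=BASE_PATH):
    """Write the sample rows as a Hudi table; `spark` is the shell's SparkSession."""
    df = spark.createDataFrame(RECORDS, COLUMNS)
    (df.write.format("hudi")
        .options(**hudi_options(TABLE_NAME))
        .mode("overwrite")
        .save(base_path))
```

In the pyspark shell, where the spark session already exists, calling write_people(spark) would write the table to BASE_PATH.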

3. Create Glue Database

Use the command below to create a Glue database from the CLI. You could also do this from the UI, which is straightforward.

aws glue create-database --database-input "{\"Name\":\"icebergdb\"}"

4. Create required yaml files for OneTable

Now to the good part. Create the necessary .yaml files to work with OneTable. Remember, you can also use the OneTable classes directly to run the sync, but config files make it easier to get the point across, so I’ll stick to them.

a. Create config.yaml like below

sourceFormat: HUDI
targetFormats:
  - ICEBERG
datasets:
  -
    tableBasePath: s3://bucket-name/hudi-dataset/people
    partitionSpec: city:VALUE
    tableName: people
    namespace: icebergdb
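If you're migrating a bunch of tables, writing one dataset entry per table by hand gets repetitive. The snippet below is a rough sketch that renders the same config.yaml layout from a list of tables using plain string formatting. The helper name render_config and the tuple layout are hypothetical, not part of OneTable.

```python
def render_config(tables, source_format="HUDI", target_formats=("ICEBERG",)):
    """Render a dataset config in the layout shown above.

    `tables` is a list of (table_base_path, partition_spec, table_name, namespace).
    """
    lines = [f"sourceFormat: {source_format}", "targetFormats:"]
    lines += [f"  - {fmt}" for fmt in target_formats]
    lines.append("datasets:")
    for base_path, partition_spec, table_name, namespace in tables:
        lines += [
            "  -",
            f"    tableBasePath: {base_path}",
            f"    partitionSpec: {partition_spec}",
            f"    tableName: {table_name}",
            f"    namespace: {namespace}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Prints the config for the single table used in this post.
    print(render_config([
        ("s3://bucket-name/hudi-dataset/people", "city:VALUE", "people", "icebergdb"),
    ]), end="")
```

Redirecting the output to config.yaml gives a file you can pass to the sync step as-is.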

b. Create catalog.yaml like below

catalogImpl: org.apache.iceberg.aws.glue.GlueCatalog
catalogName: onetable
catalogOptions:
  io-impl: org.apache.iceberg.aws.s3.S3FileIO
  warehouse: s3://bucket-name/warehouse

5. Run Sync

Finally, run the sync process with the necessary jars on the classpath

java -cp utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar:/Users/sagarl/Downloads/iceberg-aws-1.3.1.jar:/Users/sagarl/Downloads/bundle-2.23.9.jar io.onetable.utilities.RunSync  --datasetConfig config.yaml --icebergCatalogConfig catalog.yaml
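The classpath above is easy to get wrong, since it joins three jars with colons on a single line. As a small convenience, this sketch assembles the same command from a list of jar paths and runs it with subprocess. The function names build_sync_command and run_sync are hypothetical. The main class and the two flags are the ones from the command above.

```python
import subprocess

MAIN_CLASS = "io.onetable.utilities.RunSync"


def build_sync_command(jars, dataset_config, catalog_config):
    """Return the java invocation as an argument list (no shell quoting needed)."""
    classpath = ":".join(jars)  # ':' is the classpath separator on Linux/macOS
    return [
        "java", "-cp", classpath, MAIN_CLASS,
        "--datasetConfig", dataset_config,
        "--icebergCatalogConfig", catalog_config,
    ]


def run_sync(jars, dataset_config="config.yaml", catalog_config="catalog.yaml"):
    """Run the sync; raises CalledProcessError if java exits with a non-zero code."""
    subprocess.run(build_sync_command(jars, dataset_config, catalog_config), check=True)
```

Calling run_sync with the three jar paths from the command above would launch the same sync.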

6. Optional: Validate the table creation

You could now check the S3 path to make sure the metadata/ folder was created with Iceberg-specific metadata files. You’d find it under s3://bucket-name/hudi-dataset/people/metadata/

You could also check the table’s schema in AWS Glue > Data Catalog > Databases > Tables > people

And also query the table in Amazon Athena with something like

SELECT * FROM icebergdb.people;
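The Glue check can also be scripted. The sketch below assumes boto3 is installed and AWS credentials are configured. It relies on the Iceberg Glue catalog recording table_type and metadata_location in the Glue table parameters. The helper name iceberg_metadata_location is hypothetical.

```python
def iceberg_metadata_location(glue_client, database="icebergdb", table="people"):
    """Return the Iceberg metadata file location registered in Glue for the table.

    `glue_client` is expected to be a boto3 Glue client, e.g. boto3.client("glue").
    Raises ValueError if the Glue entry is not marked as an Iceberg table.
    """
    response = glue_client.get_table(DatabaseName=database, Name=table)
    params = response["Table"].get("Parameters", {})
    if params.get("table_type", "").upper() != "ICEBERG":
        raise ValueError(f"{database}.{table} is not registered as an Iceberg table")
    return params["metadata_location"]
```

If the sync worked, iceberg_metadata_location(boto3.client("glue")) should return a path under s3://bucket-name/hudi-dataset/people/metadata/.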

While this example focused simply on Hudi -> OneTable Iceberg -> AWS Glue, the steps are similar for other formats and catalogs as well. For example, you’d find very similar steps if you want to go from Hudi/Delta to Iceberg and catalog in BigLake Metastore, as shown in the OneTable docs here.

This blog represents my own viewpoints and not those of my employer, Onehouse. All product names, logos, and brands are the property of their respective owners.
