A Java toolkit for the Apache Iceberg open table format

Brajesh Pandey
4 min read · May 14, 2023


(Brajesh Pandey, Shabana Baig)

Introduction

The data and AI industry is trending toward adopting the open data lakehouse ecosystem, built on technologies such as Apache Iceberg. However, there are not enough open-source toolkits available for accessing an open data lakehouse.

“java-iceberg-toolkit” is a Java implementation for performing operations on Apache Iceberg and Hive tables, enabling open data lakehouse access for developers, data scientists, and database users.

The toolkit aims to overcome one of the major challenges we faced: finding a unified tool for interacting with Iceberg tables.

Apache Iceberg — Java toolkit

If you don’t know about “Apache Iceberg”, you are not alone :) You can find more information about it here (https://iceberg.apache.org/). In short, it is an open table format for huge analytic datasets.

“Apache Iceberg” provides many capabilities; the one that caught our eye was ACID (atomicity, consistency, isolation, and durability) support for tables. Given the boom in open data formats (Parquet, Avro, ORC, etc.), we wanted to explore this.

When we started exploring Iceberg, we quickly realized that we could not find enough APIs or examples to perform some of the basic table-level operations. To help ourselves and others, we created an open-source toolkit called java-iceberg-toolkit (https://github.com/IBM/java-iceberg-toolkit).

This toolkit supports the following:

  • A simple and easy-to-use interface to interact with Apache Iceberg and Hive tables.
  • APIs to perform operations on Iceberg tables and Hive tables.
  • Support for all primitive data types.
  • Support for most of the operations on Iceberg and Hive tables:
      • Create namespace
      • Create table
      • Get schema of a plan
      • Get plan task
      • Read table
      • Write table

(more details: https://github.com/IBM/java-iceberg-toolkit#supported-operations)

java-iceberg-toolkit provides the following interfaces:

A CLI, ready to use once the code is packaged and all configurations are in place (https://github.com/IBM/java-iceberg-toolkit#cli-2):

$ java -jar <jar_name> --help
usage: java -jar <jar_name> [options] command [args]
    --format <iceberg|hive>         The format of the table we want to display
 -h,--help                          Show this help message
 -o,--output <console|csv|json>     Show output in this format
    --snapshot <snapshot ID>        Snapshot ID to use
 -u,--uri <value>                   Hive metastore to use
 -w,--warehouse <value>             Table location

Commands:
drop Drop a table or a namespace
schema Fetch schema of a table
read Read from a table
commit Commit file(s) to a table
list List tables or namespaces
type Fetch table type
uuid Fetch uuid of a table
spec Fetch partition spec of a table
rename Rename a table
create Create a table or a namespace
files List data files of a table
location Fetch table location
describe Get details of a table or a namespace
write Write to a table
snapshot Fetch latest or all snapshot(s) of a table

Programmable APIs

For example:

  • Get schema of a Hive table:
import iceberg.HiveConnector;

HiveConnector connector = new HiveConnector(uri, warehouse, namespace, table);
Schema schema = connector.getTableSchema();
  • Get schema of an Iceberg table:
import iceberg.IcebergConnector;
import org.apache.iceberg.Schema;

IcebergConnector connector = new IcebergConnector(uri, warehouse, namespace, table);

Schema schema = connector.getTableSchema();
  • Create an unpartitioned Iceberg table:
import iceberg.IcebergConnector;

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

IcebergConnector connector = new IcebergConnector(uri, warehouse, namespace, null);
Schema schema = new Schema(
    Types.NestedField.required(1, "ID", Types.IntegerType.get()),
    Types.NestedField.required(2, "Name", Types.StringType.get()),
    Types.NestedField.required(3, "Price", Types.DoubleType.get()),
    Types.NestedField.required(4, "Purchase_date", Types.TimestampType.withoutZone())
);
PartitionSpec spec = PartitionSpec.unpartitioned();
boolean overwrite = false;
connector.createTable(schema, spec, overwrite);

These APIs are powerful for solving certain use cases, for example, bulk-load ingestion of files directly into an Iceberg table.

Use-Case

Description: Bulk copy of Parquet files to a data lake in Iceberg table format.

Steps:

  • Create an Iceberg table
import iceberg.IcebergConnector;

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

IcebergConnector connector = new IcebergConnector(uri, warehouse, namespace, null);
Schema schema = new Schema(
    Types.NestedField.required(1, "ID", Types.IntegerType.get()),
    Types.NestedField.required(2, "Name", Types.StringType.get()),
    Types.NestedField.required(3, "Price", Types.DoubleType.get()),
    Types.NestedField.required(4, "Purchase_date", Types.TimestampType.withoutZone())
);
// Partition by the year of the Purchase_date timestamp column
PartitionSpec spec = PartitionSpec.builderFor(schema)
    .year("Purchase_date")
    .build();
boolean overwrite = false;
connector.createTable(schema, spec, overwrite);
  • Write data to Iceberg table
import iceberg.IcebergConnector;

IcebergConnector connector = new IcebergConnector(uri, warehouse, namespace, table);

String record = "{\"records\":[{\"ID\":1,\"Name\":\"Testing\",\"Price\": 1000,\"Purchase_date\":\"2022-11-09T12:13:54.480\"}]}";
String outputFile = null;
String dataFiles = connector.writeTable(record, outputFile);

Note: If you already have data files stored somewhere, you can skip ahead to the commit phase.
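As a side note, the single-record JSON payload passed to writeTable above can be assembled in plain Java. The sketch below is a hypothetical helper with no toolkit dependency (RecordPayload and buildRecord are not part of java-iceberg-toolkit); it only builds the string, and a real JSON library would be safer for values that need escaping:

```java
public class RecordPayload {
    // Hypothetical helper: formats one row into the {"records":[...]} shape
    // shown above. Price is passed pre-formatted as a string to avoid
    // locale-dependent double formatting.
    static String buildRecord(int id, String name, String price, String purchaseDate) {
        return String.format(
            "{\"records\":[{\"ID\":%d,\"Name\":\"%s\",\"Price\":%s,\"Purchase_date\":\"%s\"}]}",
            id, name, price, purchaseDate);
    }

    public static void main(String[] args) {
        System.out.println(buildRecord(1, "Testing", "1000", "2022-11-09T12:13:54.480"));
    }
}
```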

  • Commit to the Iceberg table
import iceberg.IcebergConnector;

IcebergConnector connector = new IcebergConnector(uri, warehouse, namespace, table);
String dataFiles = "{\"files\":[{\"file_path\":\"path_a\"}]}";
connector.commitTable(dataFiles);
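The files payload for commitTable can likewise be built from a list of data-file paths. This is a hypothetical helper, independent of the toolkit, and it assumes the paths themselves need no JSON escaping:

```java
import java.util.List;
import java.util.stream.Collectors;

public class FilePayload {
    // Hypothetical helper: turns data-file paths into the
    // {"files":[{"file_path":...}, ...]} payload shown above.
    static String buildFileList(List<String> paths) {
        return paths.stream()
            .map(p -> "{\"file_path\":\"" + p + "\"}")
            .collect(Collectors.joining(",", "{\"files\":[", "]}"));
    }

    public static void main(String[] args) {
        System.out.println(buildFileList(List.of("path_a", "path_b")));
    }
}
```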
  • Read the Iceberg table
import iceberg.IcebergConnector;

import java.util.List;

IcebergConnector connector = new IcebergConnector(uri, warehouse, namespace, table);
List<List<String>> records = connector.readTable();
for (List<String> record : records) {
    for (int x = 0; x < record.size(); x++) {
        String comma = x == record.size() - 1 ? "" : ", ";
        System.out.print(record.get(x) + comma);
    }
    System.out.println();
}
```
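The comma-joining loop above can be written more compactly with String.join, assuming the same List<List<String>> row shape that readTable returns (the RowPrinter class here is just a self-contained wrapper for illustration):

```java
import java.util.List;

public class RowPrinter {
    // Joins one row's columns with ", ", matching the loop's output.
    static String formatRow(List<String> record) {
        return String.join(", ", record);
    }

    public static void main(String[] args) {
        List<List<String>> records =
            List.of(List.of("1", "Testing", "1000.0", "2022-11-09T12:13:54.480"));
        for (List<String> record : records) {
            System.out.println(formatRow(record));
        }
    }
}
```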

Conclusion

The open-source “java-iceberg-toolkit” is a Java implementation for performing operations on Apache Iceberg and Hive tables. It was created to help developers understand low-level Iceberg APIs and can be integrated seamlessly with other products.

Please stay tuned for the next blog, “How to write a Java app server to improve the performance of Iceberg APIs”.
