A Java toolkit for the Apache Iceberg open table format

Brajesh Pandey
4 min read · May 14, 2023


(Brajesh Pandey, Shabana Baig)

Introduction

The data and AI industry is trending toward adopting the open data lakehouse ecosystem, built on technologies such as Apache Iceberg. However, there are not enough open-source toolkits available for accessing an open data lakehouse.

“java-iceberg-toolkit” is a Java implementation for performing operations on Apache Iceberg and Hive tables, enabling open data lakehouse access for developers, data scientists, and database users.

The toolkit aims to overcome one of the major challenges we faced: finding a unified tool for interacting with Iceberg tables.

Apache Iceberg — Java toolkit

If you don’t know about “Apache Iceberg”, you are not alone :) You can find more information about it here (https://iceberg.apache.org/). In short, it is an open table format for huge analytic datasets.

“Apache Iceberg” provides many capabilities; the one that caught our eye was ACID (atomicity, consistency, isolation, and durability) support for tables. Given the boom in open data formats (Parquet, Avro, ORC, etc.), we wanted to explore this.

When we started exploring Iceberg, we quickly realized that we could not find enough APIs or examples to perform some of the basic table-level operations. To help ourselves and others, we created an open-source toolkit called java-iceberg-toolkit (https://github.com/IBM/java-iceberg-toolkit).

This toolkit supports the following:

  • A simple and easy-to-use interface to interact with Apache Iceberg and Hive tables.
  • APIs to perform operations on Iceberg tables and Hive tables.
  • Support for all primitive data types.
  • Support for most of the operations on Iceberg and Hive tables:
      • Create namespace
      • Create table
      • Get schema of a plan
      • Get plan task
      • Read table
      • Write table

(more details: https://github.com/IBM/java-iceberg-toolkit#supported-operations)

java-iceberg-toolkit provides the following interfaces:

A CLI, ready to use once the code is packaged and all configurations are in place (https://github.com/IBM/java-iceberg-toolkit#cli-2):

$ java -jar <jar_name> --help
usage: java -jar <jar_name> [options] command [args]
    --format <iceberg|hive>         The format of the table we want to display
 -h,--help                          Show this help message
 -o,--output <console|csv|json>     Show output in this format
    --snapshot <snapshot ID>        Snapshot ID to use
 -u,--uri <value>                   Hive metastore to use
 -w,--warehouse <value>             Table location

Commands:
drop Drop a table or a namespace
schema Fetch schema of a table
read Read from a table
commit Commit file(s) to a table
list List tables or namespaces
type Fetch table type
uuid Fetch uuid of a table
spec Fetch partition spec of a table
rename Rename a table
create Create a table or a namespace
files List data files of a table
location Fetch table location
describe Get details of a table or a namespace
write Write to a table
snapshot Fetch latest or all snapshot(s) of a table

Programmable APIs

For example:

  • Get schema of a Hive table:
import iceberg.HiveConnector;

HiveConnector connector = new HiveConnector(uri, warehouse, namespace, table);
Schema schema = connector.getTableSchema();
  • Get schema of an Iceberg table:
import iceberg.IcebergConnector;
import org.apache.iceberg.Schema;

IcebergConnector connector = new IcebergConnector(uri, warehouse, namespace, table);

Schema schema = connector.getTableSchema();
  • Create an unpartitioned Iceberg table:
import iceberg.IcebergConnector;

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

IcebergConnector connector = new IcebergConnector(uri, warehouse, namespace, null);
Schema schema = new Schema(
    Types.NestedField.required(1, "ID", Types.IntegerType.get()),
    Types.NestedField.required(2, "Name", Types.StringType.get()),
    Types.NestedField.required(3, "Price", Types.DoubleType.get()),
    Types.NestedField.required(4, "Purchase_date", Types.TimestampType.withoutZone())
);
PartitionSpec spec = PartitionSpec.unpartitioned();
boolean overwrite = false;
connector.createTable(schema, spec, overwrite);

These APIs are powerful for solving certain use cases, for example, bulk-load ingestion of files directly into an Iceberg table.

Use-Case

Description: Bulk copy of Parquet files to a data lake in Iceberg table format.

Steps:

  • Create an Iceberg table
import iceberg.IcebergConnector;

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

IcebergConnector connector = new IcebergConnector(uri, warehouse, namespace, null);
Schema schema = new Schema(
    Types.NestedField.required(1, "ID", Types.IntegerType.get()),
    Types.NestedField.required(2, "Name", Types.StringType.get()),
    Types.NestedField.required(3, "Price", Types.DoubleType.get()),
    Types.NestedField.required(4, "Purchase_date", Types.TimestampType.withoutZone())
);
// Partition by the year of the Purchase_date timestamp column
PartitionSpec spec = PartitionSpec.builderFor(schema)
    .year("Purchase_date")
    .build();
boolean overwrite = false;
connector.createTable(schema, spec, overwrite);
  • Write data to Iceberg table
import iceberg.IcebergConnector;

IcebergConnector connector = new IcebergConnector(uri, warehouse, namespace, table);

String record = "{\"records\":[{\"ID\":1,\"Name\":\"Testing\",\"Price\": 1000,\"Purchase_date\":\"2022-11-09T12:13:54.480\"}]}";
String outputFile = null;
String dataFiles = connector.writeTable(record, outputFile);

Note: If you already have data files stored somewhere, you can skip ahead to the commit phase.
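As a side note, the single-record JSON payload passed to writeTable above can be assembled in plain Java. The sketch below is a hypothetical helper with no toolkit dependency (RecordPayload and buildRecord are not part of java-iceberg-toolkit); it only builds the string, and a real JSON library would be safer for values that need escaping:

```java
public class RecordPayload {
    // Hypothetical helper: formats one row into the {"records":[...]} shape
    // shown above. Price is passed pre-formatted as a string to avoid
    // locale-dependent double formatting.
    static String buildRecord(int id, String name, String price, String purchaseDate) {
        return String.format(
            "{\"records\":[{\"ID\":%d,\"Name\":\"%s\",\"Price\":%s,\"Purchase_date\":\"%s\"}]}",
            id, name, price, purchaseDate);
    }

    public static void main(String[] args) {
        System.out.println(buildRecord(1, "Testing", "1000", "2022-11-09T12:13:54.480"));
    }
}
```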

  • Commit to the Iceberg table
import iceberg.IcebergConnector;

IcebergConnector connector = new IcebergConnector(uri, warehouse, namespace, table);
String dataFiles = "{\"files\":[{\"file_path\":\"path_a\"}]}";
connector.commitTable(dataFiles);
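The files payload for commitTable can likewise be built from a list of data-file paths. This is a hypothetical helper, independent of the toolkit, and it assumes the paths themselves need no JSON escaping:

```java
import java.util.List;
import java.util.stream.Collectors;

public class FilePayload {
    // Hypothetical helper: turns data-file paths into the
    // {"files":[{"file_path":...}, ...]} payload shown above.
    static String buildFileList(List<String> paths) {
        return paths.stream()
            .map(p -> "{\"file_path\":\"" + p + "\"}")
            .collect(Collectors.joining(",", "{\"files\":[", "]}"));
    }

    public static void main(String[] args) {
        System.out.println(buildFileList(List.of("path_a", "path_b")));
    }
}
```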
  • Read the Iceberg table
import iceberg.IcebergConnector;

import java.util.List;

IcebergConnector connector = new IcebergConnector(uri, warehouse, namespace, table);
List<List<String>> records = connector.readTable();
for (List<String> record : records) {
    for (int x = 0; x < record.size(); x++) {
        String comma = x == record.size() - 1 ? "" : ", ";
        System.out.print(record.get(x) + comma);
    }
    System.out.println();
}
```
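The comma-joining loop above can be written more compactly with String.join, assuming the same List<List<String>> row shape that readTable returns (the RowPrinter class here is just a self-contained wrapper for illustration):

```java
import java.util.List;

public class RowPrinter {
    // Joins one row's columns with ", ", matching the loop's output.
    static String formatRow(List<String> record) {
        return String.join(", ", record);
    }

    public static void main(String[] args) {
        List<List<String>> records =
            List.of(List.of("1", "Testing", "1000.0", "2022-11-09T12:13:54.480"));
        for (List<String> record : records) {
            System.out.println(formatRow(record));
        }
    }
}
```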

Conclusion

The open-source “java-iceberg-toolkit” is a Java implementation for performing operations on Apache Iceberg and Hive tables. It was created to help developers understand low-level Iceberg APIs and can be integrated seamlessly with other products.

Please stay tuned for the next blog, “How to write a Java app server to improve the performance of Iceberg APIs”.
