Apache Iceberg

Quickstart Iceberg with Spark and Docker Compose

Sree Vaddi
3 min read · Apr 24, 2023


Introduction

Apache Iceberg is an open table format (a way to organize data files) for huge (petabyte-scale) analytic datasets. It was created at Netflix and open sourced through the Apache Software Foundation (ASF). It is in extensive use at Netflix, Apple, and many other companies. Tabular.io is built on Apache Iceberg tables, and Dremio.com's Arctic is built for Apache Iceberg.

Its open table format comprises two components (see the layout sketch below):

  • metadata files (table metadata files, manifest lists, manifest files)
  • data files (the data itself)
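
As an illustration only (the exact names and layout vary by catalog, file format, and Iceberg version; these paths are hypothetical), a small Iceberg table lands on storage roughly like this:

warehouse/nyc/taxis/
├── data/
│   └── vendor_id=1/
│       └── 00000-0-<uuid>.parquet    <- data file
└── metadata/
    ├── v1.metadata.json              <- table metadata file
    ├── snap-<snapshot-id>.avro       <- manifest list, one per snapshot
    └── <uuid>-m0.avro                <- manifest file, tracking data files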

The most popular query engine/framework for Iceberg is Apache Spark; others include Snowflake, Trino, Starburst, …

Next Step: Start Iceberg and Spark

There are multiple ways to try Apache Iceberg. If you prefer the cloud, you can try the hosted offering from Tabular.io.

I will use docker-compose.yml. I am running these steps on my fully loaded MacBook Pro. I will skip a few descriptions to save you time.

% git clone https://github.com/tabular-io/docker-spark-iceberg
% cd docker-spark-iceberg
% docker compose up

...
spark-iceberg | [I 20:22:37.095 NotebookApp] Serving notebooks from local directory: /home/iceberg/notebooks
spark-iceberg | [I 20:22:37.098 NotebookApp] Jupyter Notebook 6.5.4 is running at:
spark-iceberg | [I 20:22:37.098 NotebookApp] http://0bfdb4bae85e:8888/
spark-iceberg | [I 20:22:37.099 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
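
Before moving on, you can optionally confirm the stack came up. The exact container list depends on the version of the repo you cloned, but you should see spark-iceberg in a running state alongside its companions (recent versions include a REST catalog and MinIO for storage):

% docker compose ps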

Open a new terminal window and get to the spark-sql prompt.

% docker exec -it spark-iceberg spark-sql
...
Spark master: local[*], Application Id: local-1682287035887
spark-sql>

Create a table.

spark-sql> CREATE TABLE demo.nyc.taxis
> (
> vendor_id bigint,
> trip_id bigint,
> trip_distance float,
> fare_amount double,
> store_and_fwd_flag string
> )
> PARTITIONED BY (vendor_id);
Time taken: 14.935 seconds
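
To confirm what was created, DESCRIBE EXTENDED is plain Spark SQL and should work here as-is, showing the columns along with the table's partition spec and provider:

spark-sql> DESCRIBE EXTENDED demo.nyc.taxis;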

Insert data.

spark-sql> INSERT INTO demo.nyc.taxis
> VALUES (1, 1000371, 1.8, 15.32, 'N'), (2, 1000372, 2.5, 22.15, 'N'), (2, 1000373, 0.9, 9.01, 'N'), (1, 1000374, 8.4, 42.13, 'Y');
Time taken: 11.659 seconds
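
Every Iceberg write commits a new snapshot, so the table already has queryable history through Iceberg's metadata tables. For example, with the column list abbreviated:

spark-sql> SELECT committed_at, snapshot_id, operation FROM demo.nyc.taxis.snapshots;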

Read data.

spark-sql> SELECT * FROM demo.nyc.taxis;
1 1000371 1.8 15.32 N
1 1000374 8.4 42.13 Y
2 1000372 2.5 22.15 N
2 1000373 0.9 9.01 N
Time taken: 1.804 seconds, Fetched 4 row(s)
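
With a snapshot_id from the metadata query above, you can also time travel from the SQL shell. A sketch, where <snapshot_id> is a placeholder to fill in from your own table (VERSION AS OF requires Spark 3.3 or later, which recent builds of this image ship):

spark-sql> SELECT * FROM demo.nyc.taxis VERSION AS OF <snapshot_id>;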

Next Up: Run the notebook

Browse to http://localhost:8888.

[Screenshot: Jupyter Notebook home page]

Click to open Iceberg — Getting Started.ipynb and run the first cell, In [1]:

[Screenshot: the first cell of Iceberg — Getting Started.ipynb]

Keep running the next lines until the end of the notebook.

  • We will create a database and a table, and load Parquet data into the table (count(*) = 2171187).
  • Then we evolve the schema: alter the table, add a column, and populate it by computing values from two other columns.
  • Then we make row-level changes, deleting rows from the table (count(*) = 17703).
  • Then we do in-place partition evolution.
  • Then we query the metadata of this table.
  • And we end with time travel, creating and querying snapshots of this table (count(*) = 2171187). The same ideas are sketched in spark-sql after this list.
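
If you prefer the SQL shell to the notebook, the same ideas look roughly like this against the small demo.nyc.taxis table from earlier. This is a sketch, not the notebook's exact cells: fare_per_distance is a made-up column, and ADD PARTITION FIELD relies on Iceberg's Spark SQL extensions, which this image enables.

spark-sql> -- schema evolution: add a column, then fill it from two existing columns
spark-sql> ALTER TABLE demo.nyc.taxis ADD COLUMN fare_per_distance double;
spark-sql> UPDATE demo.nyc.taxis SET fare_per_distance = fare_amount / trip_distance;
spark-sql> -- row-level changes: delete rows in place
spark-sql> DELETE FROM demo.nyc.taxis WHERE fare_per_distance > 4.0;
spark-sql> -- in-place partition evolution: add a partition field without rewriting data
spark-sql> ALTER TABLE demo.nyc.taxis ADD PARTITION FIELD store_and_fwd_flag;
spark-sql> -- table metadata: inspect the commit history
spark-sql> SELECT * FROM demo.nyc.taxis.history;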

Conclusion

Apache Iceberg is engine/framework agnostic. It separates and maintains the metadata and data files, and it provides transactional consistency. It has been successfully deployed in production at multiple companies, at loads of tens of petabytes and over a million partitions. Combining it with Apache Ozone brings the best of both.


Sree Vaddi

Startups Advisor, Strategic Solutions Architect, Apache PPMC & Committer.