Learning Apache Iceberg — storing the data to Minio S3

Marin Aglić
Jan 30, 2024

Bridging the gap between what I knew and what I wanted to learn. This is the third in a series of articles about learning Apache Iceberg. In this article, we’ll store the data in a Minio S3 bucket. You can find the previous two articles here: first and second.

Image created with draw.io | Spark logo taken from wikimedia commons

Introduction

In the first article that I wrote about Apache Iceberg, I stated a number of questions that I would like to answer. Just for reference, here are the questions once again:

  1. How does Iceberg fit into Spark? Is it a framework on top of Spark? Something else?
  2. Can it run remotely from Spark? Like another service that can communicate over a port?
  3. How is the data ingested into these so-called Iceberg tables? What happens, then?
  4. What would it mean to modify the data in these tables?
  5. Can I use the standalone cluster that I already had prepared?

The first article that I wrote focused on figuring out what Apache Iceberg is and how to set up the project. It showed how to set up a project using docker compose with minimal effort and store the data on the local file system. Apache Iceberg is an open table format specification, and a set of APIs and libraries for engines to interact with tables that follow that specification (see reference [1]).

The second article:

  1. shows how to store the catalog in Postgres;
  2. provides an in-depth look at how the metadata is stored in metadata, manifest list, and manifest files.

In both articles, both the data and metadata were stored on the file system.

In this article, I hope to demonstrate how to connect to Minio S3 and store the files there.

We’ll start by making the required modifications to the same project that I used for the previous two articles. The source code can be found here. Note that I also updated the packages since the last article.

NOTE: this story actually started as part of a much larger one, but I decided to split it in two. This first part covers connecting Iceberg to Minio S3. The second looks into appending, deleting, and updating data.

Modifications to where we left off

So our goal is to use Minio S3 as the storage for metadata and data. I will continue using the Postgres database for the catalog.

Firstly, I made some changes to the Dockerfile. You can find most of the changes in the commit here (later I just updated Iceberg to 1.4.3).

New (updated) spark-defaults

Let’s see the modifications required for the spark-defaults.conf configuration file. For the previous story, we used the following configuration (stored in spark-defaults-pg-catalog.conf):

spark.master                           spark://spark-iceberg:7077
spark.eventLog.enabled true
spark.eventLog.dir /opt/spark/spark-events
spark.history.fs.logDirectory /opt/spark/spark-events
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.data org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.data.warehouse /home/iceberg/warehouse
spark.sql.catalog.data.catalog-impl org.apache.iceberg.jdbc.JdbcCatalog
spark.sql.catalog.data.uri jdbc:postgresql://pg-catalog:5432/iceberg
spark.sql.catalog.data.jdbc.user iceberg
spark.sql.catalog.data.jdbc.password iceberg
spark.sql.defaultCatalog data
spark.sql.catalogImplementation in-memory

We add a new file to the project, spark-defaults-minio.conf, with the following content:

spark.master                           spark://spark-iceberg:7077
spark.eventLog.enabled true
spark.eventLog.dir /opt/spark/spark-events
spark.history.fs.logDirectory /opt/spark/spark-events
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.data org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.data.warehouse s3://iceberg-data
spark.sql.catalog.data.s3.endpoint http://minio-s3:9000
spark.sql.catalog.data.io-impl org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.data.catalog-impl org.apache.iceberg.jdbc.JdbcCatalog
spark.sql.catalog.data.uri jdbc:postgresql://pg-catalog:5432/iceberg
spark.sql.catalog.data.jdbc.user iceberg
spark.sql.catalog.data.jdbc.password iceberg
spark.sql.defaultCatalog data
spark.sql.catalogImplementation in-memory

The changes are as follows:

  • change the spark.sql.catalog.data.warehouse location to s3://iceberg-data;
  • add the s3 endpoint: http://minio-s3:9000;
  • set S3FileIO as the custom FileIO implementation to use in the catalog by setting:
spark.sql.catalog.data.io-impl         org.apache.iceberg.aws.s3.S3FileIO
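
As a side note, the same catalog settings can also be passed programmatically. Here is a minimal PySpark sketch that mirrors the config file above; it assumes the Iceberg runtime, AWS bundle, and PostgreSQL JDBC jars are already on the classpath (which the Docker image provides):

from pyspark.sql import SparkSession

# Minimal sketch: the catalog settings from spark-defaults-minio.conf,
# set on the session builder instead of in the config file.
spark = (
    SparkSession.builder
    .appName("iceberg-minio")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.data", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.data.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog")
    .config("spark.sql.catalog.data.uri", "jdbc:postgresql://pg-catalog:5432/iceberg")
    .config("spark.sql.catalog.data.jdbc.user", "iceberg")
    .config("spark.sql.catalog.data.jdbc.password", "iceberg")
    .config("spark.sql.catalog.data.warehouse", "s3://iceberg-data")
    .config("spark.sql.catalog.data.s3.endpoint", "http://minio-s3:9000")
    .config("spark.sql.catalog.data.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.defaultCatalog", "data")
    .getOrCreate()
)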

Daemonization changes

We’re going to need to add some environment variables for Minio S3 that the Spark containers will need access to. I chose to add these environment variables to the .env file.

The base docker compose file defines the following services:

  • spark-iceberg — our master node and the one running Jupyter lab for testing
  • spark-worker — the worker node
  • spark-history-server — the history server

The latter two containers used the .env file, which, up to this point, contained only a single setting: SPARK_NO_DAEMONIZE=true. By default, the Spark services run in the background as daemons; this setting brings them to the foreground and thereby keeps the containers from exiting. However, if we were to use this setting with the master node, the master would never start Jupyter Lab. If you take a look at entrypoint.sh you will see why:

if [ "$SPARK_WORKLOAD" == "master" ];
then
start-master.sh -p 7077
eval notebook
...

Therefore, we need to move this option out of the shared .env file. For the spark-worker and spark-history-server services, I add it directly in the compose file:

environment:
- SPARK_NO_DAEMONIZE=true

And, to the master (spark-iceberg):

env_file:
- spark/.env

Here is the new .env file (for now 😊):

__SPARK_NO_DAEMONIZE=true

By prefixing it with __, we basically “removed” the option not to daemonize the services, so the master can start in the background and go on to launch Jupyter Lab.

Minio S3

I already added the minio folder to the repo when I made the initial commit. However, I never commented on it. So, the contents of the directory are:

  • .env file for the environment variables
  • .env.backup same as .env but unused
  • Dockerfile — to run a custom script that I prepared for creating the required buckets, and possibly copy some files if the user wants (used to build minio-init)
  • entrypoint.sh — the shell script that accepts some options. It can either create an empty bucket, a bucket with some files, or mirror a directory (if I didn’t introduce any bugs 😅)
  • Readme.md — the file that should describe how to use the entrypoint script and what it does.

The entrypoint script is copied across projects where I use Minio S3. Making it work for this story required some minor changes. The most notable change was to update the endpoint for the service.

I added the following options to Minio’s env file:

USER=user
PASSWORD=password

MINIO_ROOT_USER=user
MINIO_ROOT_PASSWORD=password
MINIO_DOMAIN=minio-s3
MINIO_REGION=us-east-1

I just define the user, the Minio root user, and the region. As far as I could tell, when running Minio S3 in my setup, the only region I could use was us-east-1 (the default).

I also needed to set MINIO_DOMAIN. Without it, I kept getting the following exception, most likely because the S3 client addresses the bucket as part of the hostname (iceberg-data.minio-s3), and MINIO_DOMAIN is what enables this virtual-hosted-style addressing on the Minio side:

Suppressed: software.amazon.awssdk.services.s3.model.S3Exception: The specified bucket is not valid. (Service: S3, Status Code: 400, Request ID: 17AE3098AF602ECE, Extended Request ID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8)
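
An alternative I did not try here would be to force path-style requests, which Iceberg's S3FileIO exposes as the s3.path-style-access property; with it, the bucket name stays in the URL path instead of becoming a hostname, so MINIO_DOMAIN (and the network alias we add below) should not be needed. A hypothetical sketch, added on top of the catalog configuration shown earlier:

from pyspark.sql import SparkSession

# Hypothetical, untested alternative: ask S3FileIO to use path-style requests
# (http://minio-s3:9000/iceberg-data/...) instead of virtual-hosted-style ones
# (http://iceberg-data.minio-s3:9000/...).
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.data.s3.path-style-access", "true")
    .getOrCreate()
)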

New docker compose

Now, we’re going to add a new docker compose file just like we did in the previous story. The new file is called docker-compose-minio.yml.

Here it is:

version: '3.8'

services:
  pg-catalog:
    image: postgres:15-alpine
    container_name: pg_catalog
    networks:
      iceberg_net:
    environment:
      - POSTGRES_USER=iceberg
      - POSTGRES_PASSWORD=iceberg
      - POSTGRES_DB=iceberg
    healthcheck:
      test: [ "CMD", "pg_isready", "-U", "iceberg" ]
      interval: 5s
      retries: 5
    ports:
      - "5432:5432"
  minio-s3:
    image: minio/minio
    container_name: iceberg_s3
    ports:
      - "9000:9000"
      - "9001:9001"
    env_file:
      - ./minio/.env
    command: server --console-address ":9001" /data
    networks:
      iceberg_net:
        aliases:
          - iceberg-data.minio-s3
    volumes:
      - minio-s3-data:/minio/data
    healthcheck:
      test: ["CMD", "mc", "ready", "local"]
      interval: 30s
      timeout: 20s
      retries: 3
  minio-s3-init:
    build: ./minio/
    networks:
      iceberg_net:
    env_file:
      - ./minio/.env
    volumes:
      - ./minio/data/:/data
    environment:
      - USER=user
      - COPY_DIR=false
      - INPUT_BUCKETS=iceberg-data
    depends_on:
      - minio-s3
    entrypoint: /bin/sh ./entrypoint.sh

  spark-iceberg:
    build:
      context: ./spark
      args:
        SPARK_DEFAULTS_CONF: spark-defaults-minio.conf
    networks:
      iceberg_net:
    depends_on:
      pg-catalog:
        condition: service_healthy
      minio-s3-init:
        condition: service_completed_successfully
  spark-worker:
    networks:
      iceberg_net:
  spark-history-server:
    networks:
      iceberg_net:

volumes:
  minio-s3-data:

networks:
  iceberg_net:

We define the two new services and a volume for minio:

  • minio-s3 — which is the container running our S3 storage;
  • minio-s3-init — which is the container used to initialise the S3 storage;
  • minio-s3-data — the volume for the minio-s3 container.

We set the master (spark-iceberg) to depend on the successful completion of the Minio init container. Furthermore, we need to set the alias iceberg-data.minio-s3 for the minio-s3 container, because the S3 client addresses the bucket as a subdomain of the endpoint, and that hostname has to resolve inside the Docker network. For this, I defined a new network, iceberg_net, and assigned all of the containers to it.

We also need to update spark’s env file. Here is the new content:

__SPARK_NO_DAEMONIZE=true

MINIO_USER=user
MINIO_PASSWORD=password
MINIO_REGION=us-east-1

AWS_ACCESS_KEY_ID=user
AWS_SECRET_ACCESS_KEY=password
AWS_REGION=us-east-1

As can be seen, we define the user and the AWS access settings. These values match the ones in Minio's .env file.
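
If you would rather not rely on environment variables, the credentials can, as far as I can tell, also be handed to S3FileIO directly as catalog properties. A hypothetical sketch (not what this repo does); keeping secrets out of the Spark config is the reason I stuck with the .env file:

from pyspark.sql import SparkSession

# Hypothetical alternative to the AWS_* environment variables: pass the
# credentials and region as catalog properties on the "data" catalog.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.data.s3.access-key-id", "user")
    .config("spark.sql.catalog.data.s3.secret-access-key", "password")
    .config("spark.sql.catalog.data.client.region", "us-east-1")
    .getOrCreate()
)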

Running the getting started notebook

In the Makefile, I added a number of commands that either start only Minio S3 or the entire Iceberg + Postgres catalog + S3 demo.

To run the full example, just run:

make run-iceberg-minio
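
In the getting-started notebook, df is a Spark DataFrame built from the notebook's own sample data. As a minimal stand-in, a hypothetical sketch like this would exercise the same write path (the session picks up spark-defaults-minio.conf, so the data catalog already points at Postgres and Minio):

from pyspark.sql import SparkSession

# The defaults from spark-defaults-minio.conf are applied automatically.
spark = SparkSession.builder.appName("iceberg-minio-test").getOrCreate()

# Make sure the "db" namespace exists in the Iceberg catalog.
spark.sql("CREATE DATABASE IF NOT EXISTS db")

# A tiny DataFrame to write; the real notebook uses a larger sample dataset.
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])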

After executing this line from the notebook:

df.writeTo("db.test").createOrReplace()

The objects should be stored on Minio S3.

Iceberg data stored on Minio S3 | captured by Giphy Capture

We can see the files stored on S3 storage 🥳🥳.
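
If you prefer checking from code rather than the Minio console, here is a sketch using boto3 (an assumption on my part; it is not part of the repo), run from the host with port 9000 published as in the compose file:

import boto3

# Connect to Minio with the same credentials as in the .env files.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="user",
    aws_secret_access_key="password",
    region_name="us-east-1",
)

# List what the write created under db/test/ (metadata and data files).
response = s3.list_objects_v2(Bucket="iceberg-data", Prefix="db/test/")
for obj in response.get("Contents", []):
    print(obj["Key"])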

And here is the row that we have in our Postgres table:

+------------+---------------+----------+-------------------------------------------------------------------------------------------+--------------------------+
|catalog_name|table_namespace|table_name|metadata_location |previous_metadata_location|
+------------+---------------+----------+-------------------------------------------------------------------------------------------+--------------------------+
|data |db |test |s3://iceberg-data/db/test/metadata/00000-4436ded6-fe79-4b64-aa47-914729f1343d.metadata.json|null |
+------------+---------------+----------+-------------------------------------------------------------------------------------------+--------------------------+
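
One way to reproduce this output from the same notebook is to read the iceberg_tables table (the table Iceberg's JdbcCatalog creates) over JDBC. A sketch, assuming the PostgreSQL driver is on the classpath (it has to be, for the JDBC catalog to work):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the JDBC catalog's table directly from Postgres.
catalog_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://pg-catalog:5432/iceberg")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "iceberg_tables")
    .option("user", "iceberg")
    .option("password", "iceberg")
    .load()
)
catalog_df.show(truncate=False)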

Summary

In this story, we continued from the previous one by storing the data and metadata files in Minio S3. In the next story, we’ll take a look at what metadata and data files are created when running append, delete, and update operations.

Hope you found this story useful 😊.

The code is available on GitHub here.

References

  1. https://www.dremio.com/resources/guides/apache-iceberg-an-architectural-look-under-the-covers/
  2. Tabular github repo — https://github.com/tabular-io/docker-spark-iceberg
