Learning Apache Iceberg — an introspection

Marin Aglić
10 min read · Mar 10, 2023


Bridging the gap between what I knew and what I wanted to learn. This is the first story I decided to write about my process of learning Apache Iceberg.

Image created with draw.io | Spark logo taken from wikimedia commons

Motivation

For some time now I’ve been interested in Data Engineering. Although my position is officially titled “software engineer”, most of the work I do involves writing ETLs, SQL queries, Terraform, and DQ checks. I’m not claiming to be a Data Engineer, but I am interested in the field.

One of the first tools I used is Airflow, and I’m currently learning (Py)Spark.

I’d heard the terms Data Lake, Data Mesh, and Data Lakehouse, and they were quite confusing at first. I guess everyone wants to describe what they do as concisely as possible? Anyway, at some point I came to learn about Apache Iceberg. And the main idea that caught my interest is:

Hey, we have a bunch of files, let’s treat them as tables

So, I started learning Iceberg. Coming from a software engineering background, I had a lot of questions. Maybe more than there should have been. In this post, I’ll try to present those questions and how I arrived at the answers.

First time reaching https://iceberg.apache.org/

When you land on the Apache Iceberg homepage, the first thing you see is this:

The open table format for analytic datasets.

Ok…. what? If you scroll down just a little, you get to the description:

Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.

Being a software engineer, I had even more questions at this point. What is an open table format? Ok, if I need SQL with my big data, shouldn’t I just use some database like PostgreSQL? What’s the catch? And what are all of these tools listed?

I started by entering “what is open table format” into Google. A nice explanation is provided here. Basically, you want to have database-like features in your Data Lake; for example, ACID support. Ok, now what is a Data Lake?

A data lake is a repository for large amounts of data in whatever format you store it: structured, semi-structured, or unstructured (e.g. files). So, basically, you want to be able to query your unstructured data as if it were ingested into a database.
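To make the idea concrete, here is a rough sketch of “files as tables” with plain Spark (no Iceberg yet): point Spark at a directory of Parquet files and query it with SQL. The path and column names are made up for illustration, and this plain approach gives you queries but none of the ACID guarantees that a table format adds.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

# a directory of Parquet files sitting in the "lake" (hypothetical path)
events = spark.read.parquet("/data/lake/events/")
events.createOrReplaceTempView("events")

# query the files as if they were a table (hypothetical column)
spark.sql("SELECT count(*) FROM events WHERE event_type = 'click'").show()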

Prerequisites

Ok, so to use Iceberg we need to know something about the compute engine it runs on. Since Spark is one of the more popular tools and is the engine that supports the most Iceberg features, let’s go with Spark. I have some experience with Spark from reading a book (Data Analysis with Python and PySpark), and from trying to set up a Spark standalone cluster (here) and Hadoop YARN (here).

I know of Spark as a big data analytics engine. Basically, if you have large amounts of data that need to be analysed, Spark is a tested and safe bet. But how does Iceberg fit into the picture with Spark?

If you’re more knowledgeable or impatient than me, you should probably just start up the docker setup provided in this GitHub repo and get into it. However, in these situations I always remember the quote from the paper Notional Machines and Introductory Programming Education by Mr. Sorva:

… students need an understanding “one level beneath” the primary targeted level (of program code) that is viable for the purpose of explaining phenomena at the targeted level.

With that in mind, I decided to try and setup the environment by myself 😊.

Setting up the environment

I would like to organise this section into a couple of phases. Also, you can find the code on GitHub here.

“Copying” docker files from GitHub

I wanted to try to incorporate what I learn into the knowledge I already have. I wanted to know:

  1. how does Iceberg fit into Spark? Is it a framework on top of Spark? Something else?
  2. can it run remotely from Spark? Like another service that can communicate over a port?
  3. how is the data ingested into these so-called Iceberg tables? What happens then?
  4. what would it mean to modify the data in these tables?
  5. can I use the standalone cluster that I already had prepared?

I started by going to the GitHub repo I mentioned and went through the docker compose and Dockerfile files. I decided to copy the parts that I thought I needed, i.e.

FROM pyspark as spark-iceberg

# Download iceberg spark runtime
RUN curl https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.1.0/iceberg-spark-runtime-3.3_2.12-1.1.0.jar -Lo /opt/spark/jars/iceberg-spark-runtime-3.3_2.12-1.1.0.jar

ENV TABULAR_VERSION=0.50.4

RUN curl https://tabular-repository-public.s3.amazonaws.com/releases/io/tabular/tabular-client-runtime/${TABULAR_VERSION}/tabular-client-runtime-${TABULAR_VERSION}.jar -Lo /opt/spark/jars/tabular-client-runtime-${TABULAR_VERSION}.jar

# Download Java AWS SDK
RUN curl https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.17.257/bundle-2.17.257.jar -Lo /opt/spark/jars/bundle-2.17.257.jar

# Download URL connection client required for S3FileIO
RUN curl https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.17.257/url-connection-client-2.17.257.jar -Lo /opt/spark/jars/url-connection-client-2.17.257.jar

# Install AWS CLI
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" \
&& unzip awscliv2.zip \
&& sudo ./aws/install \
&& rm awscliv2.zip \
&& rm -rf aws/

# Add iceberg spark runtime jar to IJava classpath
ENV IJAVA_CLASSPATH=/opt/spark/jars/*

RUN mkdir -p /home/iceberg/localwarehouse /home/iceberg/notebooks /home/iceberg/warehouse /home/iceberg/spark-events /home/iceberg

ARG jupyterlab_version=3.6.1

RUN apt-get update -y && \
apt-get install -y python3-pip python3-dev && \
pip3 install --upgrade pip && \
pip3 install wget jupyterlab==${jupyterlab_version}

# Add a notebook command
RUN echo '#! /bin/sh' >> /bin/notebook \
&& echo 'export PYSPARK_DRIVER_PYTHON=jupyter' >> /bin/notebook \
&& echo "export PYSPARK_DRIVER_PYTHON_OPTS=\"lab --notebook-dir=/home/iceberg/notebooks --ip='0.0.0.0' --NotebookApp.token='' --port=8888 --no-browser --allow-root\"" >> /bin/notebook \
&& echo 'pyspark' >> /bin/notebook \
&& chmod u+x /bin/notebook


ENTRYPOINT ["./entrypoint.sh"]
CMD ["notebook"]

Compared to the reference repo, I chose to use JupyterLab instead of Jupyter Notebook. I also removed the downloading of files and some other commands. The spark-defaults.conf file looked like this:

spark.master                           spark://spark-iceberg:7077
spark.eventLog.enabled true
spark.eventLog.dir /opt/spark/spark-events
spark.history.fs.logDirectory /opt/spark/spark-events
spark.sql.catalogImplementation in-memory
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Obviously, I didn’t configure everything that was mentioned in the docs. I merely added the iceberg extension. And this, along with the following line from the Quickstart docs:

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0

told me that Iceberg is actually a package for Spark. So, back to the questions:

  1. how does Iceberg fit into Spark? Is it a framework on top of Spark? Something else? — it’s a package
  2. can it run remotely from Spark? Like another service that can communicate over a port? — I don’t think it can.
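As an aside, the same “Iceberg is a package” idea can be expressed from PySpark rather than spark-sql. This is only a sketch, not what my setup does, since the Dockerfile above bakes the runtime jar into the image instead:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-as-a-package")
    # pull the Iceberg Spark runtime in as a package at session start
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0")
    # enable the Iceberg SQL extensions, same as in spark-defaults.conf
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)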

The docker compose file at this point:

version: '3.8'

services:
  spark-iceberg:
    image: spark-iceberg
    container_name: spark-iceberg
    build: ./spark
    entrypoint: [ './entrypoint.sh', 'master' ]
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks
      - ./data:/opt/spark/data
      - ./spark_apps:/opt/spark/apps
      - spark-logs:/opt/spark/spark-events
    ports:
      - '8888:8888'
      - '8080:8080'
      - '10000:10000'
      - '10001:10001'

  spark-worker:
    container_name: spark-worker
    image: spark-iceberg
    entrypoint: [ './entrypoint.sh', 'worker' ]
    depends_on:
      - spark-iceberg
    env_file:
      - spark/.env
    volumes:
      - ./data:/opt/spark/data
      - ./spark_apps:/opt/spark/apps
      - spark-logs:/opt/spark/spark-events
    ports:
      - '8081:8081'

  spark-history-server:
    container_name: spark-history
    image: spark-iceberg
    entrypoint: [ './entrypoint.sh', 'history' ]
    depends_on:
      - spark-iceberg
    env_file:
      - spark/.env
    volumes:
      - spark-logs:/opt/spark/spark-events
    ports:
      - '18080:18080'

volumes:
  spark-logs:

Running the first example

Build the docker images using docker compose build and run them using docker compose up. Then navigate to http://localhost:8888. I prepared a small notebook to try and save some data as a table. Here are the relevant parts of the app:

data = [("James","","Smith","36636","M",3000),
("Michael","Rose","","40288","M",4000),
("Robert","","Williams","42114","M",4000),
("Maria","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","","F",-1)
]

schema = StructType([ \
StructField("firstname",StringType(),True), \
StructField("middlename",StringType(),True), \
StructField("lastname",StringType(),True), \
StructField("id", StringType(), True), \
StructField("gender", StringType(), True), \
StructField("salary", IntegerType(), True) \
])

df = spark.createDataFrame(data=data, schema=schema)
df.writeTo("db.test").create()

The approach suggested by the Iceberg documentation for writing tables is the V2 DataFrame API, which is what the example above uses. Upon executing the line df.writeTo("db.test").create() I got the following output:

23/03/05 17:23:04 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.

AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);
‘CreateTable `db`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, ErrorIfExists
+- LogicalRDD [firstname#0, middlename#1, lastname#2, id#3, gender#4, salary#5], false

I have no experience with Hive. Soooo… I Googled a bit about Hive and concluded that I didn’t want to learn it at this time. It’s basically a data warehouse layer on top of Hadoop with its own query language [3]. I found out that in Spark 3.0.0-preview2 spark.sql.legacy.createHiveTableByDefault was set to false, as witnessed by the documentation here. However, when I ran a Spark shell inside docker and checked the value of the configuration, it turned out to be true:

>>> spark.conf.get("spark.sql.legacy.createHiveTableByDefault")
'true'

It seems the default value had changed at some point. Therefore, Spark tries to create a Hive table, but we don’t want Hive, and digging deep into it would be out of scope for this story.
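For completeness, the warning itself suggests a one-line workaround: setting the legacy flag in spark-defaults.conf so that a plain CREATE TABLE uses the native data source instead of Hive. I didn’t go down that path, since the proper fix for Iceberg turned out to be configuring a catalog, described next.

spark.sql.legacy.createHiveTableByDefault false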

At the end of this part I created a git commit called: Medium — setting up environment.

Configuring the catalog

Ok, so the previous attempt didn’t work… surprise, surprise. My next instinct was to try to configure a catalog for Iceberg to use. According to the documentation here, Iceberg supports a directory-based catalog. That seemed simple enough, so I changed my spark-defaults.conf:

spark.master                           spark://spark-iceberg:7077
spark.eventLog.enabled true
spark.eventLog.dir /opt/spark/spark-events
spark.history.fs.logDirectory /opt/spark/spark-events
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.data org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.data.type hadoop
spark.sql.catalog.data.warehouse /home/iceberg/warehouse
spark.sql.defaultCatalog data
spark.sql.catalogImplementation in-memory

I added a new catalog named data that is a SparkCatalog of type hadoop, which allows storing the data on the filesystem. I also configured the warehouse directory where the data will be stored and set the default catalog to data.
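As an aside, for anyone who prefers configuration in code over spark-defaults.conf, here is a minimal, untested sketch of the same settings applied while building the session; I stuck with the conf file myself.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-hadoop-catalog")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # a filesystem ("hadoop") catalog named "data" with its warehouse directory
    .config("spark.sql.catalog.data", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.data.type", "hadoop")
    .config("spark.sql.catalog.data.warehouse", "/home/iceberg/warehouse")
    .config("spark.sql.defaultCatalog", "data")
    .getOrCreate()
)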

When we run our notebook now, namely the last cell, everything seems to complete successfully:

Executing create table in notebook

The volume /home/iceberg/warehouse is mapped to the location ./warehouse in docker compose, so when we look at this directory we can see this:

Structure after trying to save with Iceberg 1

So, the metadata is saved, but the data seems to be missing. If we kill the running kernel or app and run the other notebook, iceberg-read-data, we can see that we get the data back:

Data is returned although we don’t see it in directory explorer
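The read notebook doesn’t do anything fancy; it essentially boils down to reading the table back through the configured catalog, something like this (spark being the session that the pyspark launcher creates for the notebook):

# the default catalog is "data", so db.test resolves to data.db.test
df = spark.table("db.test")
df.show()

# or, equivalently, via SQL
spark.sql("SELECT * FROM db.test").show()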

Now, how can this be? If we enter the worker docker container and execute the following:

root@e80874902d50:/opt/spark# ls -al /home/iceberg/warehouse/db/test/data/
total 40
drwxr-xr-x 2 root root 4096 Mar 7 20:33 .
drwxr-xr-x 3 root root 4096 Mar 7 20:33 ..
-rw-r--r-- 1 root root 24 Mar 7 20:33 .00000-0-8eff3bba-21dc-46e8-b20e-58b81b1640ea-00001.parquet.crc
-rw-r--r-- 1 root root 24 Mar 7 20:33 .00001-1-7e33edcb-f5e4-486f-9f9b-3f811375aa2f-00001.parquet.crc
-rw-r--r-- 1 root root 24 Mar 7 20:33 .00002-2-de082f5c-147b-49a4-b316-6fcfa8acbd92-00001.parquet.crc
-rw-r--r-- 1 root root 24 Mar 7 20:33 .00003-3-9511efc6-1658-4a99-8d01-72b9184025df-00001.parquet.crc
-rw-r--r-- 1 root root 1662 Mar 7 20:33 00000-0-8eff3bba-21dc-46e8-b20e-58b81b1640ea-00001.parquet
-rw-r--r-- 1 root root 1670 Mar 7 20:33 00001-1-7e33edcb-f5e4-486f-9f9b-3f811375aa2f-00001.parquet
-rw-r--r-- 1 root root 1691 Mar 7 20:33 00002-2-de082f5c-147b-49a4-b316-6fcfa8acbd92-00001.parquet
-rw-r--r-- 1 root root 1705 Mar 7 20:33 00003-3-9511efc6-1658-4a99-8d01-72b9184025df-00001.parquet

we see that the files are in the worker container. What I forgot to do at this point was to map the worker’s warehouse directory to the same warehouse directory that the master node uses (the docker compose change is shown after the screenshot below). If we try the first notebook again (with the slight change of using createOrReplace instead of create), we can see that the data files land in the directory structure:

Data files located in directory
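The fix itself is one extra volume line under the spark-worker service in docker compose, so that the worker writes into the same host directory as the master. Roughly (only the volumes block shown):

spark-worker:
  # ... everything else unchanged ...
  volumes:
    - ./warehouse:/home/iceberg/warehouse   # added: share the warehouse with the master
    - ./data:/opt/spark/data
    - ./spark_apps:/opt/spark/apps
    - spark-logs:/opt/spark/spark-events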

For this part I created the commit: Medium — configuring catalog. If your editor can read .parquet files, you can open one of the data files and see something like this:

(Single) Parquet data file contents

Additional note

While I was working on this story, I ran the example from GitHub alongside my setup. I noticed that the Spark session started by the Jupyter notebook hogs all of the cluster’s resources. That means you can’t submit other jobs or run other notebooks while that one is running. And I was asking myself: why is this?

Well, when the notebook runs against the Spark standalone cluster with worker containers, as already said, the driver’s application automatically hogs all of the resources. The master is set with the URL spark://spark-iceberg:7077. There is a way to run the notebook in local mode instead. This way, the notebook won’t hog all of the cluster’s resources, but you might need to start a worker process in the master container, and the notebook won’t use the worker containers.

You can do this by changing the line in the Dockerfile to the following:

# Add a notebook command
RUN echo '#! /bin/sh' >> /bin/notebook \
&& echo 'export PYSPARK_DRIVER_PYTHON=jupyter' >> /bin/notebook \
&& echo "export PYSPARK_DRIVER_PYTHON_OPTS=\"lab --notebook-dir=/home/iceberg/notebooks --ip='0.0.0.0' --NotebookApp.token='' --port=8888 --no-browser --allow-root\"" >> /bin/notebook \
&& echo 'pyspark --master local[*]' >> /bin/notebook \
&& chmod u+x /bin/notebook

This will run spark locally with “as many worker threads as there are logical cores” when a notebook is started [5]. The application won’t appear on the spark master UI.
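A quick way to confirm which mode the notebook session is actually in is to check the master URL on the SparkContext:

# 'local[*]' with the change above, 'spark://spark-iceberg:7077' otherwise
print(spark.sparkContext.master)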

Summary

This story started out as an attempt to explain the troubles I had with understanding what Apache Iceberg actually is.

Apache Iceberg is a package that can be installed alongside Spark to provide an open table format for data lakes. This gives us ACID support, among other features you might want in a data lake. For more, read here.

In this story we’ve set up a data catalog and used Apache Iceberg with minimal configuration to store data on the local file system.

I haven’t answered all of the questions that I had. However, I plan to continue playing with Apache Iceberg to answer them step by step.

The code is on GitHub.

References

  1. https://github.com/tabular-io/docker-spark-iceberg
  2. https://iceberg.apache.org/docs/latest/spark-configuration/#catalogs
  3. https://jaceklaskowski.medium.com/why-is-spark-sql-so-obsessed-with-hive-after-just-a-single-day-with-hive-289e75fa6f2b
  4. https://spark.apache.org/docs/3.0.0-preview2/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30
  5. https://stackoverflow.com/questions/32356143/what-does-setmaster-local-mean-in-spark
