Developing with Apache Iceberg & PySpark

Thomas Lawless
5 min read · Jun 17, 2024



Apache Iceberg and PySpark are powerful tools for managing and analyzing large datasets. Setting up a local development environment is crucial for leveraging these technologies effectively. In this blog post, we’ll explore how to create a simple but productive development environment using Visual Studio Code (VSCode), Poetry, and Apache Iceberg.

Project Creation

The following steps outline how to install Poetry and initialize a Poetry project. They assume you already have a recent version of Python 3 installed.

Why use Poetry?

  • Simplified Dependency Management: Poetry simplifies the process of adding, removing, and managing dependencies. It automatically handles version constraints and ensures that all dependencies are compatible with each other, reducing the likelihood of conflicts and compatibility issues.
  • Enhanced Project Isolation: By default, Poetry creates a virtual environment for each project, ensuring that dependencies are isolated and do not interfere with other projects. This is particularly important for PySpark developers who often work with multiple projects that may require different versions of libraries.
  • Streamlined Development Workflow: Poetry provides an easy-to-use CLI for managing project dependencies and configurations. It handles complex tasks such as dependency resolution, virtual environment management, and package publishing with simple commands, allowing developers to focus on writing code rather than managing dependencies.
  • Development Dependency Management: For PySpark developers, Poetry allows the inclusion of development dependencies that are not needed in production, such as testing frameworks or PySpark itself for local development. This keeps the production environment lean and avoids unnecessary bloating of the deployment package.

1. Install Poetry

First, install Poetry using pip if it is not already installed.

# Install Poetry
pip install poetry

2. Set Up Your Project using Poetry

Create a new directory for your project and initialize it with Poetry.

# Create a new directory for your project
mkdir iceberg_pyspark_project
cd iceberg_pyspark_project

# Initialize a new Poetry project
poetry init

This command starts an interactive session that walks you through the project setup. Answer the prompts with values appropriate for your project.
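
Once the prompts are answered, Poetry writes a pyproject.toml file in the project directory. A rough sketch of what it might contain (the name, Python constraint, and author values here are illustrative, not prescribed):

# Illustrative example only; your values will differ based on your answers.
[tool.poetry]
name = "iceberg-pyspark-project"
version = "0.1.0"
description = ""
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.10"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"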

3. Create a Virtual Environment

Create a virtual environment within the project and use Poetry to spawn a shell inside it.

# Create a virtual environment.
python -m venv .venv

# Spawn a shell within the new virtual environment.
poetry shell

4. Add Dependencies

Add PySpark and the Jupyter IPython kernel (ipykernel) as development dependencies. This way, both tools are available locally but are not included in the project’s production dependencies.

# Add PySpark and ipykernel as development dependencies
poetry add --group=dev pyspark ipykernel

VSCode & the Jupyter Extension

The steps below outline how to install VSCode and configure it with the required extensions.

Why use the Jupyter Extension?

  • Seamless Integration: The Jupyter extension integrates Jupyter notebooks directly into VSCode, providing a cohesive environment that combines the interactive capabilities of Jupyter with the powerful features of VSCode, such as version control, debugging, and code refactoring tools.
  • Advanced Editing and Debugging: VSCode’s advanced code editing and debugging features, such as IntelliSense, code navigation, and error highlighting, are available within Jupyter notebooks, increasing efficiency and reducing the likelihood of errors.
  • Streamlined Workflow: The extension supports integration with other VSCode features and extensions, such as source control management and terminal access, creating a streamlined workflow that consolidates all development activities in one place, improving overall productivity.

1. Install VSCode

Download and install VSCode by following the instructions for your operating system, available at https://code.visualstudio.com/.

2. Install Extensions

Our local Apache Iceberg development environment needs both the Python and Jupyter extensions for VSCode. The commands below install these extensions from the command line.

code --install-extension ms-python.python
code --install-extension ms-toolsai.jupyter
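
Optionally, you can confirm that both extensions are installed by listing them from the same CLI (on macOS or Linux):

code --list-extensions | grep -E 'ms-python.python|ms-toolsai.jupyter'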

3. Configure the Jupyter Notebook Environment

The next step is to create a Jupyter Notebook and configure VSCode to use the Python interpreter from the project’s virtual environment, which is now managed by Poetry. Using VSCode, create a new file named iceberg_pyspark.ipynb. When you open this file in VSCode, you are given the option to select a Python interpreter. Choose the interpreter from the project’s virtual environment located in the .venv directory.
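
As a quick sanity check, the first notebook cell can confirm that the selected interpreter comes from the project’s .venv and that PySpark is importable (this assumes the development dependencies added earlier are installed):

import sys
import pyspark

# The interpreter path should point inside the project's .venv directory.
print(sys.executable)

# PySpark should be importable from the development dependencies.
print(pyspark.__version__)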

Developing with a Local Catalog

With our tools installed and our environment configured, we can now create a simple Apache Iceberg catalog for local development. The steps below use the Jupyter Notebook we created in the last section.

Why use Apache Iceberg?

  • Schema Evolution: Iceberg supports schema evolution without compromising read consistency, allowing for seamless updates to data structures over time without interrupting ongoing queries or data access.
  • Transactional Consistency: It offers ACID (Atomicity, Consistency, Isolation, Durability) transactions, ensuring data integrity and reliability, crucial for complex analytics workloads.
  • Time Travel Queries: Iceberg allows querying data at different points in time, facilitating historical analysis and ensuring that past states of data can be accurately retrieved and analyzed.
  • Optimized for Large-scale Analytics: It is designed for large-scale analytics workloads, with efficient file format support (like Parquet and ORC), partition pruning, and metadata management, enabling fast query performance even on massive datasets.
  • Open Format and Ecosystem Integration: Iceberg is an open-source project with broad ecosystem support (e.g., Spark, Hive, Presto), making it easy to integrate into existing data pipelines and enabling compatibility with various analytics tools and frameworks.

1. Configure PySpark for Apache Iceberg using a Hadoop Catalog

Create a Spark session configured to use Iceberg and the local Hadoop catalog. Add the following code to your Jupyter notebook:

from pyspark.sql import SparkSession

# Initialize Spark session with Iceberg configurations
spark = SparkSession.builder \
    .appName("IcebergLocalDevelopment") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "spark-warehouse/iceberg") \
    .getOrCreate()

# Verify Spark session creation
spark.sql("SHOW DATABASES").show()

Note: A Hadoop catalog is great for local development, but it is not a good choice for a production environment.
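
For contrast, a production deployment typically points the same Iceberg SparkCatalog at a shared catalog service instead of the local file system. A rough sketch using Iceberg’s REST catalog type (the catalog name "prod" and the endpoint URL below are placeholders, not part of this tutorial’s setup):

# Hypothetical production-style catalog configuration using Iceberg's REST catalog type.
# The catalog name "prod" and the URI are placeholders.
spark = SparkSession.builder \
    .appName("IcebergProduction") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.prod", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.prod.type", "rest") \
    .config("spark.sql.catalog.prod.uri", "https://your-catalog-service.example.com") \
    .getOrCreate()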

2. Create and Query an Iceberg Table

In your Jupyter notebook, create a simple Iceberg table and run some queries to verify everything is set up correctly:

# Create an Iceberg table
spark.sql("""
CREATE TABLE local.schema.users (
id INT,
name STRING,
age INT
) USING iceberg""")

# Insert some sample data
spark.sql("""
INSERT INTO local.schema.users VALUES
(1, 'Alice', 30),
(2, 'Bob', 25),
(3, 'Charlie', 35)""")

# Query the data
result = spark.sql("SELECT * FROM local.schema.users")
result.show()
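
Because time travel is one of the features highlighted above, it is also worth peeking at the table’s snapshot history from the same notebook. A minimal sketch using Iceberg’s snapshots metadata table (the snapshot id in the commented-out query is a placeholder you would copy from the first result):

# List the snapshots Iceberg has recorded for the table so far.
spark.sql("SELECT committed_at, snapshot_id, operation FROM local.schema.users.snapshots").show()

# Query the table as of a specific snapshot (substitute a snapshot_id from the result above).
# spark.sql("SELECT * FROM local.schema.users VERSION AS OF <snapshot_id>").show()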

Conclusion

Setting up a productive local development environment for Apache Iceberg and PySpark involves integrating several tools to streamline your workflow. Using VSCode with the Jupyter extension, Poetry for dependency management, and configuring Iceberg with a Hadoop catalog provides a solid foundation for developing, testing, and experimenting with data pipelines and analytical workflows. Start leveraging the power of Apache Iceberg and PySpark in your local development environment today, and streamline your data engineering and analysis tasks with ease!


Thomas Lawless

Distinguished Engineer, IBM CIO Data, AI, and Automation Platform