Unleashing the Power of Polaris: Revolutionize Your Data Lakehouse With Polaris OSS, Apache Iceberg and StarRocks

Fabiano Pena
8 min read · Aug 10, 2024


The world of data lakehouse management is evolving rapidly, and the recent open-source release of Polaris is set to redefine how organizations manage and query their massive datasets. Polaris, combined with Apache Iceberg’s table format and StarRocks’ low-latency SQL engine, offers an unparalleled solution for scalable, secure, and efficient data lakehouse management. In this article, we’ll guide you through deploying Polaris in a Kubernetes environment, setting up catalogs and roles, and leveraging the power of Spark and StarRocks for data processing and querying.

Introducing Polaris: The New Open-Source Game Changer

Polaris is an innovative, open-source tool designed to manage data catalogs and principals, providing a robust framework for controlling access and organizing data within a data lakehouse. The integration of Polaris with Apache Iceberg, an advanced table format designed for large-scale data lakehouses, and StarRocks, a blazing-fast SQL engine optimized for low-latency queries, creates a powerful ecosystem that enhances both data management and accessibility.

With Polaris, you can now ensure that your data is not only well-organized but also easily accessible and secure, thanks to its role-based access control and seamless integration with existing data processing tools.

A Brief Overview of Polaris Architecture

Polaris is designed around a few key concepts that together form a comprehensive system for managing data access within a data lakehouse:

  • Catalogs: At the heart of Polaris, a catalog is a logical namespace that organizes datasets within the data lakehouse. It manages metadata and provides a structured way to access and manage large amounts of data.
  • Principals: These are entities (such as users or services) that interact with the catalog. Each principal has a unique identifier and is responsible for specific actions within the Polaris ecosystem.
  • Principal Roles: These are roles assigned to principals that define what actions a principal can perform within the catalog. For example, a principal role might allow a user to read from or write to a specific dataset.
  • Catalog Roles: These are roles specific to catalogs that define permissions for accessing and managing data within a catalog. Catalog roles are often assigned to principal roles, creating a hierarchy of permissions that ensures secure data access.

These elements together create a flexible and secure system for managing data access, allowing organizations to control who can do what within their data lakehouse. For a more in-depth understanding of Polaris’ architecture, including the intricacies of its catalog structure, please refer to this detailed article on Medium.
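
To make the hierarchy concrete, here is a minimal, self-contained sketch of how these pieces relate. It is a conceptual model only, not the Polaris client API; the names simply mirror the example used later in this guide:

# A minimal, self-contained model of the Polaris permission chain.
# Conceptual illustration only; this is not the Polaris client API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogRole:
    name: str
    privileges: List[str] = field(default_factory=list)  # permissions scoped to one catalog

@dataclass
class PrincipalRole:
    name: str
    catalog_roles: List[CatalogRole] = field(default_factory=list)

@dataclass
class Principal:
    name: str
    principal_roles: List[PrincipalRole] = field(default_factory=list)

# Chain: privileges -> catalog role -> principal role -> principal
reader_role = CatalogRole("polariscatalogrole", privileges=["TABLE_READ_DATA"])
user_role = PrincipalRole("polarisuserrole", catalog_roles=[reader_role])
user = Principal("polarisuser", principal_roles=[user_role])
print(user)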

Step 1: Deploying Polaris on Kubernetes

The full source code for deploying Polaris, along with the Python client and additional configurations, is available in my GitHub repository. You can clone the repository and follow along with the steps outlined in this guide.

Deploying Polaris within a Kubernetes environment offers several advantages. Kubernetes is renowned for its scalability, resilience, and ease of management, making it the ideal platform for running complex, distributed applications like Polaris. By leveraging Kubernetes, you can deploy Polaris in a way that is both scalable and secure, ensuring high availability and reliability for your data lakehouse operations.

Setting Up the Environment

Before you begin, ensure that your environment is equipped with the following prerequisites:

  • Kubernetes Cluster: A running Kubernetes cluster with Helm and kubectl configured.
  • AWS Credentials: Necessary for integrating Polaris with Amazon S3, which will serve as the storage backend for your data catalogs.
  • Docker: Polaris will be deployed as a Docker container, so Docker should be installed on your system.
  • Helm: Helm is essential for managing the Kubernetes deployment of Polaris.

Deploying Polaris with Helm

With the prerequisites in place, deploy Polaris using Helm. This step will set up Polaris as a service within your Kubernetes cluster, allowing it to manage your data lakehouse catalogs and principals.

cd deployment/

export AWS_ACCESS_KEY_ID=your-access-key-id
export AWS_SECRET_ACCESS_KEY=your-secret-access-key

# Install (or upgrade) the Polaris release into the metastore namespace
# referenced by the kubectl commands below.
helm upgrade --install polaris . \
  --namespace metastore \
  --create-namespace \
  -f values.yaml \
  --debug \
  --wait=false \
  --set=awsAccessKeyId=${AWS_ACCESS_KEY_ID} \
  --set=awsSecretAccessKey=${AWS_SECRET_ACCESS_KEY}

Retrieving Polaris Credentials from Logs

To interact with Polaris, you’ll need the root principal credentials, which are logged when Polaris starts. Retrieve these credentials by checking the Polaris pod’s logs.

kubectl logs -f polaris-<pod-id> -n metastore

Look for a log entry similar to this:

realm: default-realm root principal credentials: c9145a923a0f25de:340232a65999a503c8fc1f91b4a2f553

These credentials (client_id:client_secret) will be used to authenticate with the Polaris API in the next steps.
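
If you want to verify the credentials before moving on, the sketch below exchanges them for a bearer token. It assumes the standard Iceberg REST OAuth token endpoint that Polaris exposes under /api/catalog and reuses the example credentials from the log line above:

# Optional sanity check: exchange the root credentials for a bearer token.
# Assumes the Iceberg REST OAuth token endpoint exposed by Polaris.
import requests

POLARIS_URI = "http://localhost:8181/api/catalog"
CLIENT_ID, CLIENT_SECRET = "c9145a923a0f25de", "340232a65999a503c8fc1f91b4a2f553"

resp = requests.post(
    f"{POLARIS_URI}/v1/oauth/tokens",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "scope": "PRINCIPAL_ROLE:ALL",
    },
)
resp.raise_for_status()
print(resp.json()["access_token"])  # bearer token for subsequent API calls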

Step 2: Harnessing the Power of the Polaris Python Client

With Polaris up and running, the next step is to use the Polaris Python client to create and manage catalogs, principals, and roles. This is where the true power of Polaris shines, allowing you to define and control access to your data lakehouse with precision.

The Polaris Python client has been meticulously crafted following the principles of Clean Code Architecture and SOLID design concepts. This architectural approach ensures that the client is modular, maintainable, and scalable, making it easier to extend and adapt to future requirements. The separation of concerns, achieved through layers such as entities, use cases, interfaces, and infrastructure, ensures that the codebase remains clean and easy to navigate.
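
As an illustration of that layering (the class names below are hypothetical, not the repository's actual structure), a Clean Architecture split for a catalog client might look like this:

# Hypothetical sketch of Clean Architecture layering for a catalog client.
# Names are illustrative only; see the repository for the real structure.
from dataclasses import dataclass
from typing import List, Protocol

# Entities: plain domain objects with no framework dependencies
@dataclass
class Catalog:
    name: str
    s3_location: str

# Interfaces: abstractions the use cases depend on
class CatalogGateway(Protocol):
    def create(self, catalog: Catalog) -> None: ...

# Use cases: application logic, independent of HTTP and CLI details
class CreateCatalogUseCase:
    def __init__(self, gateway: CatalogGateway) -> None:
        self.gateway = gateway

    def execute(self, name: str, s3_location: str) -> None:
        self.gateway.create(Catalog(name, s3_location))

# Infrastructure: a real gateway would call the Polaris REST API;
# this in-memory stand-in keeps the sketch runnable.
class InMemoryCatalogGateway:
    def __init__(self) -> None:
        self.created: List[Catalog] = []

    def create(self, catalog: Catalog) -> None:
        self.created.append(catalog)

gateway = InMemoryCatalogGateway()
CreateCatalogUseCase(gateway).execute("demo", "s3://demo-polariscatalog/")
print(gateway.created)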

[Image: Polaris permission model]

Setting Up Your Local and Development Environment

Before diving into the Polaris Python client, it's essential to set up your local and development environments. This setup ensures that you have all the necessary tools and dependencies to work with the client effectively.

Start by navigating to the appropriate directory and setting up a virtual environment:

cd polaris/regtests/client/python
python3 -m venv .venv
source .venv/bin/activate

Once your virtual environment is activated, install the necessary development dependencies using poetry:

pip install poetry==1.5.0
poetry install && pip install -e .
cd ../../../../polaris_client

Creating a Catalog and Managing Roles with Argparse

The Polaris Python client uses the argparse module to handle command-line arguments, making it easy to pass in the necessary parameters for interacting with the Polaris API.

Here’s how you can use the client to create a catalog, principal, principal role, and catalog role, and assign the necessary roles and privileges:

python run.py \
--client_id c9145a923a0f25de \
--client_secret 340232a65999a503c8fc1f91b4a2f553 \
--host http://localhost:8181/api/catalog \
--catalog_name demo \
--s3_location s3://demo-polariscatalog/ \
--role_arn arn:aws:iam::123456789876:role/polaris-storage-role \
--principal_name polarisuser \
--principal_role_name polarisuserrole \
--catalog_role_name polariscatalogrole \
--role_type admin

Understanding Argparse

In the Polaris client, argparse lets you specify parameters such as client_id, client_secret, and host directly from the command line; a minimal parser sketch follows the argument list below.

  • --client_id: The unique identifier for the principal making the API request.
  • --client_secret: The secret key associated with the client_id, used for authentication.
  • --host: The base URL of the Polaris API.
  • --catalog_name: The name of the catalog you wish to create or manage.
  • --s3_location: The S3 bucket location where the catalog data will be stored.
  • --role_arn: The Amazon Resource Name (ARN) of the IAM role with necessary permissions to access the S3 bucket.
  • --principal_name: The name of the principal (user or service) you wish to create.
  • --principal_role_name: The name of the role to assign to the principal.
  • --catalog_role_name: The name of the role to assign within the catalog.
  • --role_type: Specifies the type of role (e.g., admin, reader) that dictates the level of access.
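
For reference, here is a minimal sketch of what such a parser might look like. It is illustrative only; run.py in the repository is the source of truth for the actual arguments and defaults:

# Minimal argparse sketch mirroring the arguments described above.
# Illustrative only; run.py in the repository defines the real interface.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Create a Polaris catalog, principal, and roles"
    )
    parser.add_argument("--client_id", required=True, help="Principal client ID")
    parser.add_argument("--client_secret", required=True, help="Principal client secret")
    parser.add_argument("--host", default="http://localhost:8181/api/catalog",
                        help="Polaris API base URL")
    parser.add_argument("--catalog_name", required=True)
    parser.add_argument("--s3_location", required=True, help="S3 bucket backing the catalog")
    parser.add_argument("--role_arn", required=True, help="IAM role ARN with access to the bucket")
    parser.add_argument("--principal_name", required=True)
    parser.add_argument("--principal_role_name", required=True)
    parser.add_argument("--catalog_role_name", required=True)
    parser.add_argument("--role_type", choices=["admin", "reader"], default="admin")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))  # hand the parsed arguments off to the client's use cases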

Writing Data to an Iceberg Table with Spark

With your Polaris setup complete and your principal roles configured, you can now use Spark to interact with your data lakehouse. The following steps demonstrate how to write data to an Iceberg table within the demo_catalog using the credentials assigned to your principal.

This example assumes you have a Spark environment ready and configured to communicate with your Polaris instance.

Spark Application for Writing Data to Iceberg

Here’s a Spark application that connects to the demo_catalog catalog in Polaris, creates a new namespace and table, inserts data, and then queries that data:

import pyspark
from pyspark.sql import SparkSession

POLARIS_URI = 'http://localhost:8181/api/catalog'
POLARIS_CATALOG_NAME = 'demo_catalog'
POLARIS_CREDENTIALS = '7f67b9fe1d16a122:aa12267c73e5edb5c83a57e98e0aa4fb'
POLARIS_SCOPE = 'PRINCIPAL_ROLE:ALL'
AWS_REGION = 'us-east-2'

conf = (
    pyspark.SparkConf()
    .setAppName('app_name')
    # SQL Extensions
    .set('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog')
    .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    # Configuring Catalog
    .set('spark.sql.catalog.polaris', 'org.apache.iceberg.spark.SparkCatalog')
    .set('spark.sql.catalog.polaris.type', 'rest')
    .set('spark.sql.catalog.polaris.uri', POLARIS_URI)
    .set('spark.sql.catalog.polaris.token-refresh-enabled', 'true')
    .set('spark.sql.catalog.polaris.credential', POLARIS_CREDENTIALS)
    .set('spark.sql.catalog.polaris.warehouse', POLARIS_CATALOG_NAME)
    .set('spark.sql.catalog.polaris.scope', POLARIS_SCOPE)
    .set('spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation', 'true')
    .set('spark.sql.catalog.polaris.io-impl', 'org.apache.iceberg.io.ResolvingFileIO')
    .set('spark.sql.catalog.polaris.s3.region', AWS_REGION)
)

# Start Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark Running")

# Run a Query to create a namespace
spark.sql("CREATE NAMESPACE IF NOT EXISTS polaris.bronze").show()

# Create an Iceberg table within the namespace
spark.sql("CREATE TABLE IF NOT EXISTS polaris.bronze.table2 (id INT, name STRING) USING iceberg").show()

# Insert data into the table
spark.sql("INSERT INTO polaris.bronze.table2 VALUES (1, 'Fabiano Pena'), (2, 'John Doe')").show()

# Query the table to retrieve data
spark.sql("SELECT * FROM polaris.bronze.table2").show()

# Stop Spark Session
spark.stop()
print("Spark Session Stopped")

[Image: Spark job logs]
[Image: Iceberg table created on S3]
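
If you want to confirm the write from Spark itself, run a couple of extra queries before the spark.stop() call in the script above. Iceberg exposes per-table metadata tables through the same catalog:

# Optional checks, run before spark.stop() in the script above.
# Iceberg exposes per-table metadata tables through the same catalog.

# Snapshots produced by the INSERT
spark.sql("SELECT snapshot_id, committed_at, operation FROM polaris.bronze.table2.snapshots").show()

# Data files written to S3 for this table
spark.sql("SELECT file_path, record_count FROM polaris.bronze.table2.files").show(truncate=False)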

Time for Querying: Exploring Data with StarRocks

After successfully setting up your data lakehouse with Polaris and inserting data using Spark, the next step is to explore this data using StarRocks, a high-performance SQL engine designed for real-time analytics. StarRocks’ ability to handle low-latency queries makes it an ideal tool for quickly accessing and analyzing large datasets stored in your Iceberg tables.

Configuring the External Catalog in StarRocks

To begin querying data, you first need to configure an external catalog in StarRocks that connects to your Polaris-managed Iceberg catalog. This configuration allows StarRocks to interact with the data stored in Polaris.

Here’s how to set up the external catalog in StarRocks:

CREATE EXTERNAL CATALOG demo_catalog
PROPERTIES
(
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "iceberg.catalog.uri" = "http://polaris.metastore.svc.cluster.local:8181/api/catalog",
    "iceberg.catalog.credential" = "c9145a923a0f25de:340232a65999a503c8fc1f91b4a2f553",
    "iceberg.catalog.scope" = "PRINCIPAL_ROLE:ALL",
    "iceberg.catalog.warehouse" = "demo_catalog"
);

Once the external catalog is configured, you can start exploring the data stored in Polaris through StarRocks.
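
Because StarRocks speaks the MySQL wire protocol, you can run these queries from any MySQL-compatible client. Below is a minimal sketch that assumes a StarRocks frontend reachable at starrocks-fe on the default query port 9030 and the table written earlier with Spark; adjust the connection details for your cluster:

# Minimal sketch: querying the Polaris-backed Iceberg table through StarRocks.
# Assumes a StarRocks FE at starrocks-fe:9030 (MySQL protocol); adjust
# host, user, and password for your environment.
import pymysql

conn = pymysql.connect(host="starrocks-fe", port=9030, user="root", password="")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW DATABASES FROM demo_catalog")           # namespaces appear as databases
        print(cur.fetchall())
        cur.execute("SELECT * FROM demo_catalog.bronze.table2")   # the table written by Spark
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()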

Summary

By configuring StarRocks with an external catalog that connects to your Polaris-managed Iceberg catalog, you can seamlessly query and explore your data with low latency. This setup allows you to harness the full power of your data lakehouse, enabling quick and efficient data analysis. Whether you’re running complex queries or performing simple data retrieval, StarRocks combined with Polaris and Iceberg provides a robust solution for modern data management and exploration.
