Spark Operator, Delta Lake, and Hive Metastore with Postgres backend on Kubernetes (K8s)

5 min readMar 19, 2024

In today’s data-driven world, organizations are constantly seeking efficient and scalable solutions for managing and processing vast amounts of data. Apache Spark has emerged as a leading framework for big data processing, offering speed, versatility, and ease of use. However, as data volumes grow, so do the challenges of managing and optimizing data pipelines.

In this article, we’ll explore how to set up and configure Spark operator with Delta Lake integration and an external Hive metastore running on PostgreSQL in a Kubernetes environment.

Docker image for Spark and Delta

Refer to my previous post, Docker — Spark, Delta-Lake, External Hive-Metastore on Postgres, for complete details on running all these services on Docker. Build the docker image

docker build -t osds-spark8s -f DockerfileSpark .

Kubernetes Cluster on Local — using Docker Desktop

For this demo, using Kubernetes Cluster on local using Docker Desktop. You may use any Kubernetes Cluster of your choice.

To get a Kubernetes cluster using Docker Desktop, follow these steps:

Install Docker Desktop: If you haven’t already, download and install Docker Desktop for your operating system from the official Docker website.
Enable Kubernetes: After installing Docker Desktop, open the application and navigate to its settings/preferences. Look for the Kubernetes tab and click on it. Then, check the box to enable Kubernetes and click on the “Apply & Restart” button. This will start the process of setting up a Kubernetes cluster on your local machine.
Wait for Kubernetes Cluster Setup: Docker Desktop will automatically set up and configure a Kubernetes cluster on your local machine. This process may take a few minutes depending on your system’s resources and internet speed. You can monitor the progress of the cluster setup in the Docker Desktop interface.
Verify Kubernetes Installation: Once the setup process is complete, you can verify that Kubernetes is running by opening a terminal or command prompt and running the following command:

kubectl version

This command should display the version information for both the Kubernetes client and server.

Interact with Kubernetes Cluster: With the Kubernetes cluster up and running, you can start deploying and managing applications using kubectl commands or by using Kubernetes manifests (YAML files). You can also access the Kubernetes dashboard to visualize and manage your cluster using a web browser.

Refer to this post to install and configure Kubernetes Dashboard — Setting up Kubernetes Dashboard

Install Helm

Depending on your operating system, you can download and install Helm from the Helm website or package managers like Homebrew (for macOS) or Chocolatey (for Windows). Follow the installation instructions provided for your specific OS.

Verify helm installation

helm version
helm repo list

Postgres on Kubernetes — backend for Hive Metastore

Follow this post to launch Postgres on K8s with persistent volume - Postgres on Kubernetes(K8s)

Spark applications to Kubernetes

Submitting Spark applications to Kubernetes can be approached in two ways: through spark-submit or utilizing the spark-operator.

spark-submit:

Bundled with Spark, the spark-submit method is a straightforward way to submit Spark applications to Kubernetes.
However, subsequent operations on the Spark application require direct interaction with Kubernetes pod objects.

spark-operator:

Developed and open-sourced by GCP, the spark-operator project offers an alternative approach.
It operates by running a single pod on the cluster, which then converts Spark applications into custom Kubernetes resources.
These custom resources can be defined, configured, and described like other Kubernetes objects, providing more flexibility.
Additionally, the spark-operator offers additional features such as support for mounting ConfigMaps and Volumes directly from the Spark app configuration, enhancing its usability and versatility.

In this demonstration, the spark-operator is utilized. It is recommended to thoroughly review the official documentation for this operator. You can find the relevant links provided below.

Create a namespace, service account, cluster role

kubectl create namespace spark-apps
kubectl create serviceaccount spark --namespace=spark-apps
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-apps:spark --namespace=spark-apps

Install the spark operator

helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install osds-release spark-operator/spark-operator --namespace spark-operator --create-namespace --set sparkJobNamespace=spark-apps

The sparkJobNamespace parameter is used by the Spark Operator to determine the namespace in which to create Spark jobs.

Set context to spark-apps namespace

kubectl config set-context --current --namespace=spark-apps

spark-k8s.yaml — Spec for SparkApplication

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: osds-spark-app
  namespace: spark-apps
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "osds-spark8s"
  mainApplicationFile: local:///opt/spark/work-dir/scripts/basic_spark.py
  sparkVersion: "3.5.0"
  restartPolicy:
    type: Never
  driver:
    coreLimit: "1"
    coreRequest: "1m"
    memory: "512m"
    labels:
      version: "3.5.0"
    serviceAccount: spark
    volumeMounts:
      - name: "spark-local-dir-spark-data-volume"
        mountPath: "/opt/spark/work-dir/data"
      - name: "spark-local-dir-spark-scripts-volume"
        mountPath: "/opt/spark/work-dir/scripts"
      - name: "spark-local-dir-spark-source-files-volume"
        mountPath: "/opt/spark/work-dir/source_files"
  executor:
    coreLimit: "1"
    coreRequest: "1m"
    instances: 1
    memory: "512m"
    labels:
      version: "3.5.0"
    volumeMounts:
      - name: "spark-local-dir-spark-data-volume"
        mountPath: "/opt/spark/work-dir/data"
      - name: "spark-local-dir-spark-scripts-volume"
        mountPath: "/opt/spark/work-dir/scripts"
      - name: "spark-local-dir-spark-source-files-volume"
        mountPath: "/opt/spark/work-dir/source_files"
  volumes:
    - name: "spark-local-dir-spark-data-volume"
      hostPath:
        path: "/Users/user/Documents/docker/osdsk8s/data"
        type: Directory
    - name: "spark-local-dir-spark-scripts-volume"
      hostPath:
        path: "/Users/user/Documents/docker/osdsk8s/scripts"
        type: Directory
    - name: "spark-local-dir-spark-source-files-volume"
      hostPath:
        path: "/Users/user/Documents/docker/osdsk8s/source_files"
        type: Directory

This YAML configuration defines a SparkApplication resource for Kubernetes using the spark-operator. Let’s break down the key components:

apiVersion and kind: These fields specify the API version and resource type, respectively. Here, we're using the SparkApplication resource provided by the spark-operator, which is in the v1beta2 version.
metadata: This section contains metadata about the SparkApplication, including its name and namespace.
spec: This section defines the specifications for the SparkApplication.
type: Specifies the type of Spark application, which is set to Python in this case.
pythonVersion: Specifies the Python version to use.
mode: Specifies the execution mode for the Spark application, which is set to cluster.
image: Specifies the Docker image to use for the Spark application.
mainApplicationFile: Specifies the main Python file for the Spark application.
sparkVersion: Specifies the version of Apache Spark to use.
restartPolicy: Specifies the restart policy for the Spark application. Here, it's set to Never, meaning the application won't be automatically restarted if it fails.
driver and executor: These sections define configurations for the Spark driver and executor pods, respectively, including resource limits, memory, and volume mounts.
volumes: Defines the volumes to mount into the Spark driver and executor pods. These volumes are sourced from the host machine and mounted into the containers running the Spark application. Note: The volume names should start with spark-local-dir as per the documentation.

Details about the data, scripts, and source files are provided in the previous post — Docker — Spark, Delta-Lake, External Hive-Metastore on Postgres

The only change in scripts/basic_spark.py will be the spark session builder:

builder = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.warehouse.dir","/opt/spark/work-dir/data/delta/osdp/spark-warehouse") \
    .config("hive.metastore.warehouse.dir","/opt/spark/work-dir/data/delta/osdp/spark-warehouse") \
    .config("javax.jdo.option.ConnectionURL", "jdbc:postgresql://postgres:5432/hive_metastore") \
    .config("spark.sql.catalogImplementation", "hive") \
    .config("javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver") \
    .config("javax.jdo.option.ConnectionUserName", "hive") \
    .config("javax.jdo.option.ConnectionPassword", "hivepass123") \
    .config("datanucleus.schema.autoCreateTables", "true") \
    .config("hive.metastore.schema.verification", "false") \
    .master("local[2]") \
    .enableHiveSupport()