Spark, Delta Lake, and External Hive-Metastore with Postgres — in Docker


In the realm of big data processing, Apache Spark has emerged as a powerful and versatile framework, providing developers with a unified analytics engine for large-scale data processing. Combined with Delta Lake, an open-source storage layer that brings ACID transactions, scalable metadata handling, and data versioning to Apache Spark, it forms a potent combination for building robust data pipelines and performing advanced analytics tasks.

Running Spark and Delta Lake in Docker containers offers several advantages, including portability, scalability, and ease of deployment. Docker containers encapsulate the entire runtime environment, including dependencies, libraries, and configurations, making it straightforward to spin up consistent development, testing, and production environments across different systems and platforms.

In this article, we’ll explore how to set up and run Spark and Delta Lake in Docker containers, covering topics such as containerization, Dockerfile configurations, Docker Compose orchestration, and best practices for managing data volumes and network communication. Whether you’re a data engineer, data scientist, or DevOps practitioner, understanding how to harness the power of Docker to run Spark and Delta Lake can streamline your workflow and accelerate your data-driven projects. Join us as we delve into the world of containerized big data processing with Apache Spark and Delta Lake.

Docker image for Spark

We use the official Spark image from https://hub.docker.com/_/spark. The installation tasks are executed as root:

FROM spark:3.5.0-scala2.12-java11-python3-ubuntu
USER root

The work directory is set to /opt/spark/work-dir. Copy the requirements file that contains the delta-spark package and install it with pip; this package provides the Python delta module (including the configure_spark_with_delta_pip helper used later in basic_spark.py).

WORKDIR /opt/spark/work-dir
COPY requirements.txt .

RUN pip install --upgrade pip \
    && pip install --no-cache-dir -r requirements.txt \
    && rm -f requirements.txt

Set build arguments and environment variables, create the non-root user, and install a couple of utilities:

ARG osdsuser=osdsuser
ARG GROUP=osdsuser
ARG WORKDIR=/opt/spark/work-dir
# Delta version matches requirements.txt and the jars copied below
ARG DELTA_SPARK_VERSION=3.1.0
ENV DELTA_PACKAGE_VERSION=delta-spark_2.12:${DELTA_SPARK_VERSION}
ENV PATH="/opt/spark/sbin:/opt/spark/bin:${PATH}"
ENV SPARK_HOME="/opt/spark"
ENV PYSPARK_PYTHON=python3
RUN groupadd -r ${GROUP} && useradd -r -m -g ${GROUP} ${osdsuser}
RUN apt-get -qq update && apt-get -qq -y install vim curl

Finally, copy the Delta Lake and PostgreSQL JDBC jars, along with the hive-site.xml file, from the local machine into the image. Download the jars from the official Delta Lake and PostgreSQL JDBC release pages and adjust the local paths below to match your machine. The hive-site.xml file is provided below.

COPY --chown=${osdsuser} ./source_files/spark_delta/spark_venv/spark_home/jars/delta-storage-3.1.0.jar /opt/spark/jars/delta-storage-3.1.0.jar
COPY --chown=${osdsuser} ./source_files/spark_delta/spark_venv/spark_home/jars/delta-spark_2.12-3.1.0.jar /opt/spark/jars/delta-spark_2.12-3.1.0.jar
COPY --chown=${osdsuser} ./source_files/spark_delta/spark_venv/spark_home/jars/postgresql-42.2.24.jar /opt/spark/jars/postgresql-42.2.24.jar
COPY --chown=${osdsuser} ./source_files/spark_delta/spark_venv/spark_home/conf/hive-site.xml /opt/spark/conf/hive-site.xml
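If you prefer to script the jar downloads instead of fetching them by hand, something like the following works. This is a minimal sketch: the Maven Central URLs and the local target directory are assumptions that should be checked against the official Delta Lake and PostgreSQL JDBC release pages and your own directory layout.

# download_jars.py - fetch the jars referenced in the Dockerfile COPY steps.
# NOTE: the Maven Central URLs below are assumptions based on the standard repository layout;
# verify them against the official Delta Lake and PostgreSQL JDBC download pages.
import pathlib
import urllib.request

JARS_DIR = pathlib.Path("./source_files/spark_delta/spark_venv/spark_home/jars")
JARS_DIR.mkdir(parents=True, exist_ok=True)

JAR_URLS = [
    "https://repo1.maven.org/maven2/io/delta/delta-spark_2.12/3.1.0/delta-spark_2.12-3.1.0.jar",
    "https://repo1.maven.org/maven2/io/delta/delta-storage/3.1.0/delta-storage-3.1.0.jar",
    "https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.24/postgresql-42.2.24.jar",
]

for url in JAR_URLS:
    target = JARS_DIR / url.rsplit("/", 1)[-1]
    print(f"Downloading {url} -> {target}")
    urllib.request.urlretrieve(url, target)  # simple blocking download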

Putting it all together

requirements.txt

delta-spark==3.1.0

DockerfileSpark

FROM spark:3.5.0-scala2.12-java11-python3-ubuntu

USER root

WORKDIR /opt/spark/work-dir

COPY requirements.txt .

RUN pip install --upgrade pip \
    && pip install --no-cache-dir -r requirements.txt \
    && rm -f requirements.txt


ARG osdsuser=osdsuser
ARG GROUP=osdsuser
ARG WORKDIR=/opt/spark/work-dir
# Delta version matches requirements.txt and the jars copied below
ARG DELTA_SPARK_VERSION=3.1.0
ENV DELTA_PACKAGE_VERSION=delta-spark_2.12:${DELTA_SPARK_VERSION}
ENV PATH="/opt/spark/sbin:/opt/spark/bin:${PATH}"
ENV SPARK_HOME="/opt/spark"
ENV PYSPARK_PYTHON=python3

RUN groupadd -r ${GROUP} && useradd -r -m -g ${GROUP} ${osdsuser}
RUN apt-get -qq update && apt-get -qq -y install vim curl

COPY --chown=${osdsuser} ./source_files/spark_delta/spark_venv/spark_home/jars/delta-storage-3.1.0.jar /opt/spark/jars/delta-storage-3.1.0.jar
COPY --chown=${osdsuser} ./source_files/spark_delta/spark_venv/spark_home/jars/delta-spark_2.12-3.1.0.jar /opt/spark/jars/delta-spark_2.12-3.1.0.jar
COPY --chown=${osdsuser} ./source_files/spark_delta/spark_venv/spark_home/jars/postgresql-42.2.24.jar /opt/spark/jars/postgresql-42.2.24.jar
COPY --chown=${osdsuser} ./source_files/spark_delta/spark_venv/spark_home/conf/hive-site.xml /opt/spark/conf/hive-site.xml


#ENTRYPOINT ["spark-submit --master local", "basic_spark.py"]
#COPY ./entrypoint.sh .
#RUN chmod +x entrypoint.sh
#ENTRYPOINT ["./entrypoint.sh"]

hive-site.xml

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://postgres:5432/hive_metastore</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepass123</value>
    <description>Password to use against metastore database</description>
  </property>
  <property>
    <name>datanucleus.schema.autoCreateTables</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
</configuration>
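Once the stack is up (see the Compose file below), the metastore schema that DataNucleus auto-creates in Postgres can be inspected directly, which is a handy sanity check that this hive-site.xml is being picked up. The following is a minimal sketch run from the host; it assumes the psycopg2-binary package is installed locally, which is not part of this article's setup.

# check_metastore.py - list the tables created in the hive_metastore database.
# Assumes: pip install psycopg2-binary (not part of this article's images), and that the
# Postgres service from the Compose file below is published on localhost:5432.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="hive_metastore",
    user="hive",
    password="hivepass123",
)
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public' ORDER BY table_name"
    )
    # Expect Hive metastore tables such as DBS and TBLS once the schema has been initialized.
    for (table_name,) in cur.fetchall():
        print(table_name)
conn.close()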

Below is a Docker Compose file that orchestrates the Docker containers as services on a shared network. It uses DockerfileSpark to build the image for the Spark service. Notably, it mounts volumes to ensure data persistence by linking container paths to local directories; otherwise, the data would be lost when a container is removed. Persistence is particularly essential for Postgres, since it stores the Hive metastore data.

docker-compose.yml

version: '3.8'

services:
  spark:
    build:
      context: .
      dockerfile: DockerfileSpark
    networks:
      - osds
    volumes:
      - ./data:/opt/spark/work-dir/data
      - ./scripts:/opt/spark/work-dir/scripts
      - ./source_files:/opt/spark/work-dir/source_files
    env_file:
      - .env.spark
    ports:
      - "4040:4040" # Spark application UI
      - "7077:7077" # Spark master
      - "8080:8080" # Spark master web UI
    #entrypoint: "/opt/spark/work-dir/entrypoint.sh"
    depends_on:
      - hive-metastore

  postgres:
    networks:
      - osds
    image: postgres:16
    environment:
      - POSTGRES_HOST_AUTH_METHOD=md5
      - POSTGRES_DB=hive_metastore
      - POSTGRES_USER=hive
      - POSTGRES_PASSWORD=hivepass123
      - PGDATA=/var/lib/postgresql/data/pgdata
    ports:
      - "5432:5432" # PostgreSQL default port
    volumes:
      - ./postgres-data:/var/lib/postgresql/data # bind mount a local directory for persistent data

  hive-metastore:
    image: apache/hive:4.0.0-alpha-2
    networks:
      - osds
    environment:
      - SERVICE_NAME=metastore
      - DB_DRIVER=postgres
      - SERVICE_OPTS=-Djavax.jdo.option.ConnectionDriverName=org.postgresql.Driver -Djavax.jdo.option.ConnectionURL=jdbc:postgresql://postgres:5432/hive_metastore -Djavax.jdo.option.ConnectionUserName=hive -Djavax.jdo.option.ConnectionPassword=hivepass123
    ports:
      - "9083:9083"
    volumes:
      - ./data/delta/osdp/spark-warehouse:/opt/spark/work-dir/data/delta/osdp/spark-warehouse
    depends_on:
      - postgres

volumes:
  postgres-data:
  warehouse:

networks:
  osds:

Breakdown of the Docker Compose file:

Version Specification:

  • version: '3.8': Specifies the version of the Docker Compose file format being used.

Services Definition:

  • services: Defines the services that will be created and run.

Spark Service Configuration:

  • spark: Defines a service named spark.
  • build: Specifies how to build the Docker image for the service.
  • context: Specifies the build context, which is the current directory.
  • dockerfile: Specifies the Dockerfile to use for building the image.
  • networks: Specifies the network(s) the service will connect to.
  • volumes: Mounts host volumes into the container for data persistence.
  • env_file: Specifies the environment file to use for defining environment variables.
  • ports: Maps container ports to host ports for accessing services.
  • depends_on: Specifies the services this service depends on.

Postgres Service Configuration:

  • postgres: Defines a service named postgres using the official PostgreSQL Docker image.
  • networks: Specifies the network(s) the service will connect to.
  • image: Specifies the Docker image to use for the service.
  • environment: Sets environment variables for the PostgreSQL container.
  • ports: Maps container ports to host ports for accessing PostgreSQL.
  • volumes: Mounts a host volume for persistent PostgreSQL data storage.

Hive Metastore Service Configuration:

  • hive-metastore: Defines a service named hive-metastore using the official Hive Docker image.
  • image: Specifies the Docker image to use for the service.
  • networks: Specifies the network(s) the service will connect to.
  • environment: Sets environment variables for the Hive Metastore container.
  • ports: Maps container ports to host ports for accessing the Hive Metastore service.
  • volumes: Mounts the shared spark-warehouse directory so the warehouse paths written by Spark also resolve inside the metastore container.
  • depends_on: Specifies the services this service depends on.

Volumes and Networks:

  • volumes: Defines named volumes for data persistence.
  • networks: Defines a network named osds for inter-service communication.

Build the images with Docker Compose and start the services:

docker-compose build

docker-compose up
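Once the services are up, docker-compose ps should list the spark, postgres, and hive-metastore containers, and docker-compose logs hive-metastore (or postgres) is a quick way to confirm that the metastore reached Postgres before moving on.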

To test that everything works, the files below read sample CSV files in Spark and save them as Delta tables using the external Hive metastore backed by Postgres. The source files and PySpark scripts are made available inside the container through the volumes defined in the Docker Compose file.

sample source files in csv

basic_spark.py

This Python script utilizes PySpark to work with data and Delta Lake, a storage layer for Apache Spark that offers ACID transactions. It configures the SparkSession to enable Delta Lake support and sets up the warehouse directory. Then, it initializes the SparkSession and executes a SQL query to display existing databases. Additionally, it defines a function, IngestDeltaCSVHeader, to ingest CSV data into Delta Lake tables. Finally, the script calls this function multiple times to ingest different datasets into corresponding tables, demonstrating efficient data management within a Spark environment.

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.types import *
from delta import *

builder = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.warehouse.dir", "/opt/spark/work-dir/data/delta/osdp/spark-warehouse") \
    .master("local[2]") \
    .enableHiveSupport()

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sql("show databases").show()

def IngestDeltaCSVHeader(iDBSchema, iTable, iFilePath):
    menu_csv = spark.read.option("header", True).csv(iFilePath)
    menu_csv.show()
    spark.sql("create schema if not exists " + iDBSchema)
    menu_csv.write.format("delta").mode("append").saveAsTable(iDBSchema + "." + iTable)

IngestDeltaCSVHeader("restaurant", "menu", "/opt/spark/work-dir/data/source_data/menu_items.csv")
IngestDeltaCSVHeader("restaurant", "orders", "/opt/spark/work-dir/data/source_data/order_details.csv")
IngestDeltaCSVHeader("restaurant", "db_dictionary", "/opt/spark/work-dir/data/source_data/restaurant_db_data_dictionary.csv")

Get a shell inside the Spark service container and execute the script:

docker-compose run spark bash
spark-submit --master local scripts/basic_spark.py

This creates the corresponding Delta Lake tables. Since the warehouse directory is bind-mounted, the Delta files end up on the host under ./data/delta/osdp/spark-warehouse.

The metadata for these Delta tables is stored in the Hive metastore, which runs on a Postgres database backed by a persistent volume. This setup ensures that the tables can be queried at any time from the Spark container, even if the containers are terminated or the Docker services are restarted, so they remain available for analysis or processing. Query the tables from the Spark container's bash terminal as shown below:

docker-compose up
docker-compose run spark bash

Initialize a spark-sql session with Delta enabled and spark.sql.warehouse.dir pointing at the path where the Delta tables were saved:

spark-sql --packages io.delta:delta-spark_2.12:3.1.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" --conf "spark.sql.warehouse.dir=/opt/spark/work-dir/data/delta/osdp/spark-warehouse"
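From this prompt, statements such as SHOW DATABASES; or SELECT * FROM restaurant.menu LIMIT 10; should resolve through the external metastore and return the data ingested earlier, even after the containers have been recreated.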

In conclusion, setting up Spark, Delta Lake, and an external Hive Metastore in Docker with Postgres may sound daunting, but fear not! With the right tools, a sprinkle of patience, and perhaps a cup of coffee (or two), you’ll be orchestrating data magic in no time. Now, go forth and dockerize your dreams — the world of big data awaits!
