Building a custom Apache Spark Docker image with AWS Glue Data Catalog support as metastore

Sebastian Daberdaku
Towards Data Engineering
7 min read · Jun 8, 2024
[Figure: A Data Lakehouse architecture on AWS]

Introduction

The AWS Glue Data Catalog is a fully managed, Apache Hive Metastore-compatible metadata repository that offers seamless integration with Amazon EMR, as well as with third-party solutions such as Databricks, TrinoDB, and StarRocks. It can be used as a centralized metastore when implementing a Data Lakehouse solution on AWS, enabling users to access tables created in EMR from Databricks or Trino, and vice versa.

Being a managed service, it simplifies operations for the end user. Moreover, it is inexpensive: the first million objects (tables, partitions, databases) are stored at no charge, and each additional 100,000 objects costs $1 per month. Finally, users can implement fine-grained access control with IAM policies to determine who can create, access, or delete AWS Glue databases and tables. All these features make the AWS Glue Data Catalog an extremely valuable tool for building a cost-effective Data Lakehouse on the AWS cloud.

Motivation

Although the AWS Glue Data Catalog is advertised as Hive-compatible, Apache Spark cannot use it as a metastore out of the box. One could use the Amazon EMR Docker images to achieve this integration, but this approach has its limitations. Notably, the Apache Spark versions available on EMR often lag behind the latest releases, which can restrict access to newer features and improvements.

To enable users to build their own Glue-compatible Apache Spark distribution, AWS Labs has released an open-source implementation of the Apache Hive Metastore client used on Amazon EMR clusters, which relies on the AWS Glue Data Catalog as an external Hive Metastore. It serves as a reference implementation for building a Hive Metastore-compatible client that connects to the AWS Glue Data Catalog.

At the time of writing (June 8th, 2024), the last commit on the AWS Labs repository https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore.git is dated July 18th, 2023, and the latest supported Apache Spark version is 3.4.0. Moreover, some Maven dependencies have become unavailable because of the now-defunct Conjars repository. Finally, Apache Spark versions 3.5.0 and 3.5.1 are both affected by a class-loading issue that results in a NoClassDefFoundError when starting the Spark Connect server (https://issues.apache.org/jira/browse/SPARK-45201).

For all these reasons, I decided to build my own Apache Spark 3.5.1 release with support for AWS Glue as the metastore. This has allowed me to run custom Spark-powered workloads on Kubernetes while leveraging the centralized AWS Glue metastore.

Implementation

To guarantee portability across systems, I decided to use a Docker image to build my Apache Spark distribution. I am adopting a multi-stage Docker build, where the first image (spark-with-glue-builder) is only used to build the custom Apache Spark distribution, while subsequent images (spark-glue-python) copy the built distribution, along with other required dependencies, from it.

Builder image

The full code for the spark-with-glue-builder Docker image is available at: https://github.com/sebastiandaberdaku/spark-with-glue-builder.git.

The https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore repository follows a branch-based approach to managing different versions and has renamed or deleted older branches in the past. To avoid future issues, I forked the repository's branch-3.4.0 branch into my GitHub account and created a specific tag, which is referenced in this Docker image.

This Docker image also works around the missing Conjars Maven repository, which is required when building the Glue Data Catalog Client.

Finally, this image builds Spark with Spark Connect support. The repository also includes a patch (see https://issues.apache.org/jira/browse/SPARK-45201) that fixes a sporadic NoClassDefFoundError: InternalFutureFailureAccess that can show up when building Spark from source.

Let's now take a closer look at the Dockerfile:

# I want to build Spark with PySpark support for Python 3.10, so I need a docker image with both Python and Java.
# It is faster to start from an image with Python and install the JDK later.
FROM python:3.10.14-bookworm

# Install packages
RUN echo "deb http://ftp.de.debian.org/debian sid main" >> /etc/apt/sources.list; \
apt-get update; \
apt-get install -y --no-install-recommends openjdk-8-jdk wget patch; \
rm -rf /var/lib/apt/lists/*

# Install maven
ARG MAVEN_VERSION=3.8.8
RUN wget --quiet -O /opt/maven.tar.gz "https://apache.org/dyn/closer.lua/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz?action=download"; \
mkdir -p /opt/maven; \
tar zxf /opt/maven.tar.gz --strip-components=1 --directory=/opt/maven; \
rm /opt/maven.tar.gz

ENV MAVEN_HOME=/opt/maven
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
ENV PATH=$PATH:${MAVEN_HOME}/bin

WORKDIR /opt
# Download and extract the Glue Data Catalog Client
ARG SPARK_VERSION=3.5.1
RUN wget --quiet -O /opt/glue.tar.gz "https://github.com/sebastiandaberdaku/aws-glue-data-catalog-client-for-apache-hive-metastore/archive/refs/tags/v${SPARK_VERSION}.tar.gz"; \
mkdir -p /opt/glue; \
tar zxf /opt/glue.tar.gz --strip-components=1 --directory=/opt/glue; \
rm /opt/glue.tar.gz

## Patching Apache Hive and Installing It Locally
# Download and extract Apache Hive2 sources
ARG HIVE2_VERSION=2.3.9
RUN wget --quiet -O /opt/hive2.tar.gz "https://github.com/apache/hive/archive/rel/release-${HIVE2_VERSION}.tar.gz"; \
mkdir -p /opt/hive2; \
tar zxf /opt/hive2.tar.gz --strip-components=1 --directory=/opt/hive2; \
rm /opt/hive2.tar.gz
# Add the 2.3 version patch file
COPY ./HIVE-12679.branch-2.3.patch /opt/hive2
# conjars repository is dead, mirroring to another repo to download jars
COPY ./.mvn/ /opt/hive2/.mvn/
RUN cd /opt/hive2; \
patch -p0 <HIVE-12679.branch-2.3.patch; \
mvn -T $(nproc) clean install -DskipTests

# Download and extract Apache Hive3 sources
ARG HIVE3_VERSION=3.1.3
RUN wget --quiet -O /opt/hive3.tar.gz "https://github.com/apache/hive/archive/rel/release-${HIVE3_VERSION}.tar.gz"; \
mkdir -p /opt/hive3; \
tar zxf /opt/hive3.tar.gz --strip-components=1 --directory=/opt/hive3; \
rm /opt/hive3.tar.gz
# conjars repository is dead, mirroring to another repo to download jars
COPY ./.mvn/ /opt/hive3/.mvn/
# Continue with patching the 3.1 branch:
RUN cp /opt/glue/branch_3.1.patch /opt/hive3; \
cd /opt/hive3; \
patch -p1 --merge <branch_3.1.patch; \
mvn -T $(nproc) clean install -DskipTests

## Building the Glue Data Catalog Client
# Now with Hive patched and installed, build the glue client
# Adding the .mvn folder content fixes the missing conjars repository.
COPY ./.mvn/ /opt/glue/.mvn/
# All clients must be built from the root directory of the AWS Glue Data Catalog Client repository.
# This will build both the Hive and Spark clients and necessary dependencies.
ARG HADOOP_VERSION=3.3.4
RUN cd /opt/glue; \
mvn -T $(nproc) clean install \
-DskipTests \
-Dspark-hive.version="${HIVE2_VERSION}" \
-Dhive3.version="${HIVE3_VERSION}" \
-Dhadoop.version="${HADOOP_VERSION}"

## Build Spark
# Fetch the Spark sources
RUN wget --quiet -O /opt/spark.tar.gz "https://github.com/apache/spark/archive/refs/tags/v${SPARK_VERSION}.tar.gz"; \
mkdir -p /opt/spark; \
tar zxf /opt/spark.tar.gz --strip-components=1 --directory=/opt/spark; \
rm /opt/spark.tar.gz

# Setting up Maven's Memory Usage
ENV MAKEFLAGS="-j$(nproc)"
ENV MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g"
# Patch (see: https://issues.apache.org/jira/browse/SPARK-45201) and build a runnable Spark distribution
COPY "./spark-${SPARK_VERSION}.patch" /opt/spark/
ARG SCALA_VERSION=2.12
RUN cd /opt/spark; \
patch -p1 <"spark-${SPARK_VERSION}.patch"; \
./dev/make-distribution.sh \
--name spark \
--pip \
-P"scala-${SCALA_VERSION}" \
-Pconnect \
-Pkubernetes \
-Phive \
-Phive-thriftserver \
-P"hadoop-${HADOOP_VERSION%%.*}" \
-Dhadoop.version="${HADOOP_VERSION}" \
-Dhive.version="${HIVE2_VERSION}" \
-Dhive23.version="${HIVE2_VERSION}" \
-Dhive.version.short="${HIVE2_VERSION%.*}"

ARG SPARK_DIST_DIR=/opt/spark/dist

# IMPORTANT! We must delete the spark-connect-common jar from the jars directory!
# see: https://issues.apache.org/jira/browse/SPARK-45201
RUN rm "${SPARK_DIST_DIR}/jars/spark-connect-common_${SCALA_VERSION}-${SPARK_VERSION}.jar"

# Copy the glue client jars to the spark jars directory
# We are only interested in the AWS Glue Spark Client
RUN cp "/opt/glue/aws-glue-datacatalog-spark-client/target/aws-glue-datacatalog-spark-client-${SPARK_VERSION}.jar" "${SPARK_DIST_DIR}/jars/"

# The following steps are optional
# I am downloading these jars directly to the docker image in order to avoid having to download them when Spark starts up.

# Download the other jars
# AWS Java SDK bundle library
ARG AWS_JAVA_SDK_VERSION=1.12.262
RUN wget --quiet -P "${SPARK_DIST_DIR}/jars/" "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_JAVA_SDK_VERSION}/aws-java-sdk-bundle-${AWS_JAVA_SDK_VERSION}.jar"
# Hadoop AWS library
RUN wget --quiet -P "${SPARK_DIST_DIR}/jars/" "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar"
RUN wget --quiet -P "${SPARK_DIST_DIR}/jars/" "https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar"
# PostgreSQL library
ARG POSTGRES_VERSION=42.6.0
RUN wget --quiet -P "${SPARK_DIST_DIR}/jars/" "https://repo1.maven.org/maven2/org/postgresql/postgresql/${POSTGRES_VERSION}/postgresql-${POSTGRES_VERSION}.jar"
RUN wget --quiet -P "${SPARK_DIST_DIR}/jars/" "https://repo1.maven.org/maven2/org/checkerframework/checker-qual/3.31.0/checker-qual-3.31.0.jar"
# Delta IO libraries
ARG DELTA_VERSION=3.2.0
RUN wget --quiet -P "${SPARK_DIST_DIR}/jars/" "https://repo1.maven.org/maven2/io/delta/delta-spark_${SCALA_VERSION}/${DELTA_VERSION}/delta-spark_${SCALA_VERSION}-${DELTA_VERSION}.jar"
RUN wget --quiet -P "${SPARK_DIST_DIR}/jars/" "https://repo1.maven.org/maven2/org/antlr/antlr4-runtime/4.9.3/antlr4-runtime-4.9.3.jar"
RUN wget --quiet -P "${SPARK_DIST_DIR}/jars/" "https://repo1.maven.org/maven2/io/delta/delta-storage/${DELTA_VERSION}/delta-storage-${DELTA_VERSION}.jar"
RUN wget --quiet -P "${SPARK_DIST_DIR}/jars/" "https://repo1.maven.org/maven2/io/delta/delta-storage-s3-dynamodb/${DELTA_VERSION}/delta-storage-s3-dynamodb-${DELTA_VERSION}.jar"

# Download and install Hadoop native libraries
ARG HADOOP_HOME=/opt/hadoop
RUN wget "https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz" -O /opt/hadoop.tar.gz; \
mkdir -p ${HADOOP_HOME}; \
tar zxf /opt/hadoop.tar.gz --strip-components=1 --directory="${HADOOP_HOME}"; \
rm /opt/hadoop.tar.gz

Note

This image is not intended to be run directly! The resulting image will be very large. You should use it as the source in a multi-stage Docker build to create your final Apache Spark images.
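
For reference, the builder image can be built with a plain docker build from the root of the spark-with-glue-builder repository; a minimal sketch (the tag matches the one used in the next section, and the build arguments shown are simply the defaults already declared in the Dockerfile):

# Build the builder image; everything is fetched from public sources, no AWS credentials needed.
docker build \
  --build-arg SPARK_VERSION=3.5.1 \
  --build-arg HADOOP_VERSION=3.3.4 \
  --tag sdaberdaku/spark-with-glue-builder:v3.5.1 \
  .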

The final Apache Spark image

By adopting a multi-stage Docker build approach, we can finally build the final Apache Spark image with AWS Glue metastore support. In the following example I am also building the image with Python 3.10 support. The complete code is available at: https://github.com/sebastiandaberdaku/spark-glue-python.git.

# I am using the image defined in https://github.com/sebastiandaberdaku/spark-with-glue-builder/releases/tag/spark-v3.5.1
FROM sdaberdaku/spark-with-glue-builder:v3.5.1 AS builder

# Starting with a clean image
FROM python:3.10.14-slim-bookworm

ARG spark_uid=185

RUN groupadd --system --gid=${spark_uid} spark; \
useradd --system --uid=${spark_uid} --gid=spark --create-home spark

# INSTALL Java and other packages
RUN apt-get update; \
apt-get install -y --no-install-recommends openjdk-17-jre tini procps gettext-base; \
rm -rf /var/lib/apt/lists/*

ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
ENV SPARK_HOME=/opt/spark
ENV HADOOP_HOME=/opt/hadoop
ENV HADOOP_COMMON_LIB_NATIVE_DIR="${HADOOP_HOME}/lib/native"
ENV HADOOP_OPTS="${HADOOP_OPTS} -Djava.library.path=${HADOOP_HOME}/lib/native"
ENV PATH="${PATH}:/home/spark/.local/bin:${JAVA_HOME}/bin:${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${HADOOP_HOME}/bin"

COPY --from=builder /opt/spark/dist/ ${SPARK_HOME}/
COPY --from=builder /opt/hadoop/ ${HADOOP_HOME}/

RUN chown -R spark:spark ${SPARK_HOME}/; \
chown -R spark:spark ${HADOOP_HOME}/

RUN cp ${SPARK_HOME}/kubernetes/dockerfiles/spark/entrypoint.sh /opt/entrypoint.sh; \
chmod a+x /opt/entrypoint.sh; \
cp ${SPARK_HOME}/kubernetes/dockerfiles/spark/decom.sh /opt/decom.sh; \
chmod a+x /opt/decom.sh

# switch to spark user
USER spark
WORKDIR /home/spark

COPY ./requirements.txt .
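# Install the Python dependencies from requirements.txt together with the bundled PySpark sources (editable mode)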
RUN pip install --no-cache-dir --trusted-host pypi.python.org --editable ${SPARK_HOME}/python -r requirements.txt

ENTRYPOINT ["/opt/entrypoint.sh"]
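
As before, the final image can be built with a standard docker build from the root of the spark-glue-python repository; a minimal sketch (the image name and tag are purely illustrative):

# Build the final Spark image; the builder stage is pulled automatically from the FROM reference above.
docker build --tag my-registry/spark-glue-python:3.5.1 .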

Final considerations

This article introduced a custom Apache Spark 3.5.1 build with support for the AWS Glue Data Catalog as the metastore. The AWS Glue metastore is enabled by setting the following Spark configs:

spark.sql.catalogImplementation=hive
spark.hive.imetastoreclient.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
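
For example, these settings can be passed as --conf options when launching a job from the final image; a minimal sketch, assuming AWS credentials and region are already available in the environment (the script name is just a placeholder):

# Submit a PySpark job that uses the AWS Glue Data Catalog as its metastore.
spark-submit \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hive.imetastoreclient.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
  my_job.py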

The provided image is built with Kubernetes and Spark Connect support, as well as PySpark. The following JARs are also included, providing support for Amazon S3, PostgreSQL, and the Delta Lake format (a short usage example follows the list below):

1. AWS Glue Data Catalog Spark Client JAR: `aws-glue-datacatalog-spark-client-3.5.1.jar`
2. AWS Java SDK bundle library: `aws-java-sdk-bundle-1.12.262.jar`
3. Hadoop AWS library: `hadoop-aws-3.3.4.jar`
4. Wildfly OpenSSL library: `wildfly-openssl-1.0.7.Final.jar`
5. PostgreSQL library: `postgresql-42.6.0.jar`
6. Checker Qual: `checker-qual-3.31.0.jar`
7. delta-spark: `delta-spark_2.12-3.2.0.jar`
8. antlr4-runtime: `antlr4-runtime-4.9.3.jar`
9. delta-storage: `delta-storage-3.2.0.jar`
10. delta-storage-s3-dynamodb: `delta-storage-s3-dynamodb-3.2.0.jar`

Hadoop native libraries are downloaded and installed in the /opt/hadoop directory.
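
As a quick smoke test of the bundled Delta Lake and S3 support, something along these lines could be run inside the final image; a sketch, assuming valid AWS credentials, a placeholder bucket name, and the standard Delta Lake extension and catalog settings:

# Create, populate and read back a Glue-registered Delta table stored on S3.
spark-sql \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hive.imetastoreclient.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  -e "CREATE TABLE IF NOT EXISTS default.delta_demo (id INT) USING DELTA LOCATION 's3a://my-bucket/delta_demo';
      INSERT INTO default.delta_demo VALUES (1);
      SELECT * FROM default.delta_demo;"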

Conclusion

By building a custom Apache Spark 3.5.1 Docker image with AWS Glue Data Catalog support as the metastore, we can leverage the latest Spark features and maintain a centralized metadata repository.
