Testing Apache Spark locally: docker-compose and Kubernetes deployment.
This is part 1/3 of the tutorial.
At the time of writing, the latest Spark version is 3.4.0.
- Part 1: Deploy Spark using Docker-compose (this article)
- Part 2: Deploying Apache Spark on a Local Kubernetes Cluster: A Comprehensive Guide
- Part 3: Deploy Spark on Kubernetes using Helm Charts
Introduction
This three-part tutorial series is designed to guide you through different deployment methods for Apache Spark, starting with Docker-compose, progressing to deploying on a Kubernetes cluster using a custom binary Spark image, and finally exploring the convenience of deploying Spark on Kubernetes using Helm charts.
- Part 1: Deploy Spark using Docker-compose
In the first part of this tutorial series, we will explore deploying Apache Spark using Docker-compose. Docker-compose simplifies the process of managing multi-container applications and provides an ideal solution for setting up Spark clusters in a development environment. We will cover the creation of a Dockerfile for building the Spark image, configuring the entrypoint script to manage Spark workloads, and creating a Docker Compose file to define the Spark cluster’s services. By the end of this part, you will have a working Spark cluster running on your local machine, ready to process data and run Spark jobs.
- Part 2: Deploy Spark on Kubernetes Cluster using Custom Binary Spark Image
In the second part, we will delve into deploying Apache Spark on a Kubernetes cluster using a custom binary Spark image. Kubernetes provides a powerful container orchestration platform that enables efficient resource management, scalability, and fault-tolerance. We will guide you through preparing the Spark binary distribution for Kubernetes deployment, creating the necessary Kubernetes deployment YAML files, and configuring the Spark deployment for the Kubernetes cluster. By the end of this part, you will have a Spark cluster up and running on a Kubernetes cluster, ready to handle large-scale data processing tasks.
- Part 3: Deploy Spark on Kubernetes using Helm Charts
The third and final part of this tutorial series will focus on deploying Apache Spark on Kubernetes using Helm charts. Helm is a package manager for Kubernetes that simplifies the deployment and management of complex applications. We will introduce Helm and guide you through the process of installing Helm, setting up the environment, creating a Helm chart for Spark deployment, and customizing it to your specific requirements. By the end of this part, you will have a streamlined and repeatable process for deploying Spark on Kubernetes using Helm charts.
By the end of this tutorial, you will have a clearer understanding of how to leverage Docker-compose and a Kubernetes cluster to build your Spark development environment.
Please note that this tutorial is not about setting up a production-ready Spark cluster. Instead, it focuses on providing you with a hands-on experience of exploring Spark, Docker images, and Kubernetes to gain a deeper understanding of how they work together.
Prerequisites
Before proceeding with this tutorial, please ensure that you have Docker and Docker Compose installed on your local machine.
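If you want to double-check your setup, you can print the installed versions from a terminal (depending on your installation, Compose is available either as the standalone docker-compose binary or as the docker compose plugin):
docker --version
docker-compose --version   # or: docker compose version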
The easiest way to set up a Spark cluster from a single Docker image is to use docker-compose, describing the master and the workers as services based on that same image.
Docker-compose method
To deploy Apache Spark using Docker-compose, we have to create a docker image, then use it in a docker-compose file that describes the local cluster.
1. Create a Dockerfile and build the image:
For this purpose, we first create a working folder, for instance ‘docker-compose-way’.
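From a terminal, that amounts to:
mkdir docker-compose-way
cd docker-compose-way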
Inside that folder, create a Dockerfile with the following content:
# builder step used to download and configure spark environment
FROM openjdk:11.0.11-jre-slim-buster as builder
# Add Dependencies for PySpark
RUN apt-get update && apt-get install -y curl vim wget software-properties-common ssh net-tools ca-certificates python3 python3-pip python3-numpy python3-matplotlib python3-scipy python3-pandas python3-simpy
RUN update-alternatives --install "/usr/bin/python" "python" "$(which python3)" 1
# Fix the value of PYTHONHASHSEED
# Note: this is needed when you use Python 3.3 or greater
ENV SPARK_VERSION=3.4.0 \
HADOOP_VERSION=3 \
SPARK_HOME=/opt/spark \
PYTHONHASHSEED=1
# Download and uncompress spark from the apache archive
RUN wget --no-verbose -O apache-spark.tgz "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
&& mkdir -p /opt/spark \
&& tar -xf apache-spark.tgz -C /opt/spark --strip-components=1 \
&& rm apache-spark.tgz
# Apache spark environment
FROM builder as apache-spark
WORKDIR /opt/spark
ENV SPARK_MASTER_PORT=7077 \
SPARK_MASTER_WEBUI_PORT=8080 \
SPARK_LOG_DIR=/opt/spark/logs \
SPARK_MASTER_LOG=/opt/spark/logs/spark-master.out \
SPARK_WORKER_LOG=/opt/spark/logs/spark-worker.out \
SPARK_WORKER_WEBUI_PORT=8080 \
SPARK_WORKER_PORT=7000 \
SPARK_MASTER="spark://spark-master:7077" \
SPARK_WORKLOAD="master"
EXPOSE 8080 7077 6066
RUN mkdir -p $SPARK_LOG_DIR && \
touch $SPARK_MASTER_LOG && \
touch $SPARK_WORKER_LOG && \
ln -sf /dev/stdout $SPARK_MASTER_LOG && \
ln -sf /dev/stdout $SPARK_WORKER_LOG
COPY start-spark.sh /
CMD ["/bin/bash", "/start-spark.sh"]
Note: the Dockerfile refers to a /start-spark.sh script that has not been created yet.
This script will act as the entrypoint of the Spark Docker image.
Let’s create it before building the image:
#!/bin/bash
# start-spark.sh
. "/opt/spark/bin/load-spark-env.sh"
# When SPARK_WORKLOAD is master, run the org.apache.spark.deploy.master.Master class
if [ "$SPARK_WORKLOAD" == "master" ];
then
export SPARK_MASTER_HOST=`hostname`
cd /opt/spark/bin && ./spark-class org.apache.spark.deploy.master.Master --ip $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT >> $SPARK_MASTER_LOG
elif [ "$SPARK_WORKLOAD" == "worker" ];
then
# When SPARK_WORKLOAD is worker, run the org.apache.spark.deploy.worker.Worker class
cd /opt/spark/bin && ./spark-class org.apache.spark.deploy.worker.Worker --webui-port $SPARK_WORKER_WEBUI_PORT $SPARK_MASTER >> $SPARK_WORKER_LOG
elif [ "$SPARK_WORKLOAD" == "submit" ];
then
echo "SPARK SUBMIT"
else
echo "Undefined Workload Type $SPARK_WORKLOAD, must specify: master, worker, submit"
fi
Now we have all the files needed to build the image. Let’s build it and give it an explicit name and tag (our-own-apache-spark:3.4.0) so we can spot it easily.
Open a terminal in the working directory and run:
docker build -t our-own-apache-spark:3.4.0 .
This will build the Docker image as defined in the Dockerfile. After the build completes, we can check the image by running ‘docker images‘. Now all we have to do is create a docker-compose.yml file:
version: "3.3"
services:
spark-master:
image: our-own-apache-spark:3.4.0
ports:
- "9090:8080"
- "7077:7077"
volumes:
- ./apps:/opt/spark-apps
- ./data:/opt/spark-data
environment:
- SPARK_LOCAL_IP=spark-master
- SPARK_WORKLOAD=master
spark-worker-a:
image: our-own-apache-spark:3.4.0
ports:
- "9091:8080"
- "7000:7000"
depends_on:
- spark-master
environment:
- SPARK_MASTER=spark://spark-master:7077
- SPARK_WORKER_CORES=1
- SPARK_WORKER_MEMORY=1G
- SPARK_DRIVER_MEMORY=1G
- SPARK_EXECUTOR_MEMORY=1G
- SPARK_WORKLOAD=worker
- SPARK_LOCAL_IP=spark-worker-a
volumes:
- ./apps:/opt/spark-apps
- ./data:/opt/spark-data
spark-worker-b:
image: our-own-apache-spark:3.4.0
ports:
- "9092:8080"
- "7001:7000"
depends_on:
- spark-master
environment:
- SPARK_MASTER=spark://spark-master:7077
- SPARK_WORKER_CORES=1
- SPARK_WORKER_MEMORY=1G
- SPARK_DRIVER_MEMORY=1G
- SPARK_EXECUTOR_MEMORY=1G
- SPARK_WORKLOAD=worker
- SPARK_LOCAL_IP=spark-worker-b
volumes:
- ./apps:/opt/spark-apps
- ./data:/opt/spark-data
This will create a Spark cluster composed of one master and two workers.
We have also defined the mount points ./apps and ./data; these folders (created below) should hold the application code and data you want to use within the cluster.
Note the usage of ‘SPARK_WORKLOAD=master’ and ‘SPARK_WORKLOAD=worker’ in the environment section of the master and the workers: this is what the start-spark.sh script created previously uses to decide whether to start a master or a worker.
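If the ./apps and ./data folders do not exist yet, create them next to the docker-compose.yml file before starting the cluster:
mkdir -p apps data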
2. Run Spark with docker-compose:
This is the final step. All we have to do is run
docker-compose up
This will boot the services described in the docker-compose file.
You will see the logs of the master and worker services in your terminal.
Now check that Spark is running by accessing the master web UI at http://localhost:9090 (the workers’ UIs are exposed on ports 9091 and 9092, as mapped in the docker-compose file).
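From another terminal you can also check that the three containers are up and follow the master logs (if you prefer to keep your terminal free, start the cluster with docker-compose up -d instead):
docker-compose ps
docker-compose logs -f spark-master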
Running a job:
Let’s run a job on the cluster (the Spark Pi example).
First, get into the master container; for that, you need its container name or ID.
To get all running containers, run
docker ps
The one running the master has spark-master in its name.
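If many containers are running, you can narrow the list down by filtering on the container name:
docker ps --filter "name=spark-master"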
Grab its container ID (in my case 78a6fbac7acf) and open a shell inside it
docker exec -i -t 78a6fbac7acf /bin/bash
# REPLACE WITH YOUR CONTAINER ID
Then go to the /opt/spark/bin folder and run
./spark-submit --master spark://spark-master:7077 --name spark-pi --class org.apache.spark.examples.SparkPi local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar 100
You should see the logs of the Spark Pi example running, and the job should complete successfully.
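Since the image ships Python and the Spark binary distribution includes Python examples, you can submit a PySpark job the same way. The sketch below assumes the standard layout of the Spark 3.4.0 distribution, where the pi.py example lives under examples/src/main/python; any application you drop into ./apps on the host is visible inside the containers under /opt/spark-apps and can be submitted in the same manner:
./spark-submit --master spark://spark-master:7077 --name pyspark-pi /opt/spark/examples/src/main/python/pi.py 100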
Conclusion
It is important to note that while the Docker Compose method provides a convenient way to deploy Apache Spark for development purposes, it does have its limitations. Docker Compose is primarily designed for single-host deployments and lacks some advanced features required for production-grade Spark clusters.
In a production environment, where scalability, fault-tolerance, and resource management are crucial, deploying Spark on Kubernetes offers significant advantages. Kubernetes provides a robust container orchestration platform that automates the deployment, scaling, and management of containerized applications across clusters of machines.
In the next part of this article, we will delve into deploying Apache Spark on a Kubernetes cluster. We will explore different deployment approaches, such as using the spark-submit method and leveraging Helm charts for streamlined deployments. By embracing Kubernetes for Spark deployment, you will unlock the full potential of container orchestration and unleash the power of Spark at scale. So stay tuned as we dive into the world of Kubernetes and Spark deployment in the upcoming section.