DeadSimple: PySpark + Docker Spark Cluster on your Laptop

Adam Hughes
Programmer’s Journey
4 min read · Mar 15, 2024

(Image: Spark in a nutshell)

Repo: https://github.com/hughesadam87/pyspark-sandbox-cluster

I recently found myself needing a sandbox Spark cluster running on my laptop that I could hit from a Pyspark application. Our production cluster was hard to access, and I needed to sanity-check some basic Pyspark code I was developing. Getting this simple setup up and running was a bit painful, so hopefully this saves you a few hours of twiddling.

In this basic configuration, we will create the following:

  1. A hello-pyspark.py application that counts the numbers 1–1000 in a parallelized operation (the driver/client).
  2. A 2-node (1 master, 1 worker) Spark cluster that runs in Docker.

I ran these both on my laptop, but in theory, the driver and cluster can be on separate machines.

Step 1: Install Java on the Driver Node

This is kind of bizarre, but even though we’re using the Python driver (Pyspark), Java is still required on the driver node (i.e., my laptop). Without it, Pyspark will crash with errors like:

JAVA_HOME not set

Install a JRE and set this env variable:

JAVA_HOME = C:\Program Files\Java\jre-1.8

You may also see warnings about HADOOP_HOME, so optionally install HADOOP or suppress the warnings.
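If you’d rather sanity-check the Java setup from Python before going further, a quick script along these lines (assuming java is also on your PATH) will surface problems early:

import os
import subprocess

# Check that JAVA_HOME is set and that a Java runtime is actually reachable.
# (Assumes `java` is on your PATH; java prints its version to stderr.)
print("JAVA_HOME =", os.environ.get("JAVA_HOME", "<not set>"))
subprocess.run(["java", "-version"], check=True)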

Step 2: Docker Pull a Spark Image

I’m using the bitnami/spark images (dockerhub link), mostly because GPT recommends them. I went with the latest tag, 3.5.1 at the time of writing:

docker pull bitnami/spark:3.5.1

Careful — the version of spark is important. This image uses

  • spark=3.5.1
  • python=3.11

As such, the driver node that submits jobs must also be running python==3.11 and pyspark==3.5.1.

If your driver application is using an older version of Python or Spark, you’ll need to find a compatible matching image. For example, if you’re on python==3.8 and pyspark==3.3.1, you’ll want this image:

docker pull bitnami/spark:3.3.1

Incompatibilities will lead to serialization errors when your jobs are submitted to the cluster. For example:

spark-1         | java.io.InvalidClassException: org.apache.spark.deploy.ApplicationDescription; local class incompatible: stream classdesc serialVersionUID = -6826680068825109317, local class serialVersionUID = 1574364215946805297

You can shell into the bitnami container and run pip list to see which Python and Pyspark versions are installed.
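To compare against the driver side, you can check your local versions directly from Python, along these lines:

import sys

import pyspark

# These should line up with the python/pyspark versions reported by `pip list`
# inside the bitnami container (3.11 and 3.5.1 for the image above).
print("Driver Python :", sys.version.split()[0])
print("Driver PySpark:", pyspark.__version__)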

Step 3: Create a Python Env and hello-pyspark.py

First, I created a virtual env with the correct Python and Pyspark:

conda create -n pyspark-311 python=3.11 pyspark=3.5.1
conda activate pyspark-311

And then drop in a simple application:

from pyspark.sql import SparkSession


def main():
    # Initialize SparkSession pointed at the Docker cluster
    spark = SparkSession.builder \
        .appName("HelloWorld") \
        .master("spark://localhost:7077") \
        .getOrCreate()

    # Create an RDD containing the numbers from 1 to 1000
    numbers_rdd = spark.sparkContext.parallelize(range(1, 1001))

    # Count the elements in the RDD
    count = numbers_rdd.count()

    print(f"Count of numbers from 1 to 1000 is: {count}")

    # Stop the SparkSession
    spark.stop()


if __name__ == "__main__":
    main()
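As an aside, if you want to verify the job logic without the Docker cluster in the loop at all, the same code can run in Spark’s local mode by swapping the master URL, roughly like this:

from pyspark.sql import SparkSession

# Same job, but running entirely in-process ("local[*]"), no cluster needed.
# Handy for separating code bugs from cluster/networking issues.
spark = (
    SparkSession.builder
    .appName("HelloWorldLocal")
    .master("local[*]")
    .getOrCreate()
)

count = spark.sparkContext.parallelize(range(1, 1001)).count()
print(f"Count of numbers from 1 to 1000 is: {count}")

spark.stop()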

Step 4 (Optional): Create a Dockerfile

You can use the bitnami/spark image directly, but it’s likely you’ll want to customize it. For example, I wanted to change the logging settings. Or you may need to put custom code onto the Spark cluster: if your driver is submitting pandas jobs, you’ll need to install pandas on the cluster as well. As such, I recommend extending the parent image.

FROM bitnami/spark:3.5.1


# Custom logging
COPY log4j2.properties /opt/bitnami/spark/conf/log4j2.properties

# Any files/libraries you need on the cluster, install here, e.g.:
# RUN pip install scipy

An example log4j2.properties file is in the repo.

Step 5: Docker Compose

We’ll create a simple one-master, one-worker setup.

version: '3.7'

services:
  spark:
    build: .
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - '8080:8080'
      - '7077:7077'

  spark-worker:
    build: .
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G # <--- adjust accordingly
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark

If you aren’t using a custom docker image, just replace build: . with image: bitnami/spark:3.5.1

Step 6: Stand up the cluster

Make sure Docker Desktop (or whatever engine you use) is running, then:

docker-compose up --build  #Brings up the cluster

On my setup, I see the startup logs and both containers running.

Likewise, the Spark dashboard is running at http://localhost:8080.
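If you prefer a scripted check to opening a browser, hitting the master UI from Python works too; something like:

from urllib.request import urlopen

# Quick check that the Spark master web UI is answering on localhost:8080.
with urlopen("http://localhost:8080", timeout=5) as resp:
    print("Spark master UI responded with HTTP", resp.status)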

Step 7: Submit a job

conda activate pyspark-311
python hello-pyspark.py

If it runs, you should see the following log

Count of numbers from 1 to 1000 is: 1000

Troubleshooting

In Step 7, my app just hung. It was due to the following:

  1. A networking issue between the driver (my Windows laptop) and the cluster (Docker containers, also on my laptop).

As described in this GH issue, TL;DR: add this extra_hosts flag:

version: '3.7'

services:
  spark:
    build: .
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - '8080:8080'
      - '7077:7077'
    # MUST BE OFF VPN FOR THIS TO WORK
    extra_hosts:
      - "host.docker.internal:${HOST_IP}"

  spark-worker:
    build: .
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    extra_hosts:
      - "host.docker.internal:${HOST_IP}"

I put the value of HOST_IP into a .env file in the same folder as the docker-compose file. Also, be mindful that being connected to a VPN will alter this value.
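A related knob, separate from the extra_hosts fix above: Spark’s spark.driver.host and spark.driver.bindAddress settings control the address the driver advertises to executors and the interface it binds to. If the executors still can’t call back to the driver, a sketch like this (assuming the extra_hosts mapping above, so host.docker.internal resolves to the laptop from inside the containers) is worth trying:

from pyspark.sql import SparkSession

# Sketch: make the driver reachable from the executors running in Docker.
# Assumes the extra_hosts mapping above, so host.docker.internal resolves to
# the laptop's IP from inside the containers.
spark = (
    SparkSession.builder
    .appName("HelloWorld")
    .master("spark://localhost:7077")
    .config("spark.driver.host", "host.docker.internal")
    .config("spark.driver.bindAddress", "0.0.0.0")
    .getOrCreate()
)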

Bonus: sharing a disk between the driver and worker

My app had an additional requirement that the worker nodes should write files. Because the cluster and driver are both on my laptop, I was able to share my filesystem with the containers. In an actual application, you’d probably just use cloud storage or something. But I ended up having to add a volume to my worker that binds my local files under /path/to/thing to a directory in the container named /thing.

Then workers could write files to /thing and they’d show up on my laptop.

  spark-worker:
    build: .
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    extra_hosts:
      - "host.docker.internal:${HOST_IP}"
    volumes:
      - C:/Users/adam.hughes/path/to/thing:/thing
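For completeness, here’s a rough sketch of what writing from the workers to the shared volume can look like; the mapPartitionsWithIndex approach and filenames are illustrative, not taken from the original app:

from pyspark.sql import SparkSession


def write_partition(index, rows):
    # Runs on a worker: writes this partition's numbers under /thing, which the
    # volume mapping above surfaces at C:/Users/adam.hughes/path/to/thing.
    with open(f"/thing/partition-{index}.txt", "w") as f:
        for row in rows:
            f.write(f"{row}\n")
    yield index


spark = (
    SparkSession.builder
    .appName("WriteToSharedVolume")
    .master("spark://localhost:7077")
    .getOrCreate()
)

numbers = spark.sparkContext.parallelize(range(1, 1001))
# mapPartitionsWithIndex runs write_partition on the workers; count() forces execution.
numbers.mapPartitionsWithIndex(write_partition).count()

spark.stop()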
