Using Apache Spark Docker containers to run PySpark programs with spark-submit

Mehmood Amjad
Feb 17, 2023

What is Spark? — A quick overview

Apache Spark is an open-source big data processing framework designed to process and analyze large datasets in a distributed and efficient manner. It provides a faster and more flexible alternative to MapReduce, which was primarily used for batch processing. Spark’s distributed computing model processes data in parallel across multiple nodes in a cluster, and its high-level programming interface simplifies the development of data processing applications. Spark’s speed, flexibility, and ease of use have made it a popular choice for big data processing and analysis in a variety of use cases.
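For example, a word count that would once have required a full MapReduce job fits in a few lines of PySpark. The sketch below is purely illustrative and assumes a local text file named input.txt:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read a text file (one row per line, in a column named "value"),
# split each line into words, and count how often each word appears
lines = spark.read.text("input.txt")
counts = (
    lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    .groupBy("word")
    .count()
)
counts.show()

spark.stop()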

Spark setup — A problem?

Setting up Spark locally is feasible for smaller datasets and simpler use cases, but as data size and application complexity grow, a single machine may not have enough resources to handle the load. A local setup also lacks the fault tolerance and redundancy of a distributed deployment, which can lead to data loss or downtime, and scaling the cluster dynamically to handle varying workloads is hard to achieve on one machine. For larger, more complex use cases that demand high performance, scalability, and fault tolerance, a distributed cluster with sufficient resources is the better fit.

The best way to counter this problem is by containerizing Apache Spark.

Containerization — A fix to the problem

Containerization is a software packaging method that involves encapsulating an application and its dependencies into a container for consistent and reliable deployment across different computing environments. It provides benefits such as increased portability, scalability, and security, and is popular for deploying applications in modern computing environments. Docker and Kubernetes are popular containerization platforms.

Containerizing Apache Spark can improve deployment and management of Spark clusters, increase resource utilization and scalability, and enhance security by providing a degree of isolation. Containerization provides a consistent and portable runtime environment for Spark applications, leading to more efficient development and deployment processes.

Setup Spark Container

Here is a detailed explanation of how to set up an Apache Spark cluster in Docker containers and run PySpark programs on it using spark-submit.

Pre-Requisites

  1. Docker installed and running on your system.
  2. Basic knowledge of Apache Spark and Docker containers.
  3. Familiarity with docker-compose files.

Apache Spark Docker Container

Below is a docker-compose file to set up a Spark cluster with 1 master and 2 worker nodes.

version: '3.7'

services:
  spark-master:
    image: bitnami/spark:latest
    command: bin/spark-class org.apache.spark.deploy.master.Master
    ports:
      - "9090:8080"
      - "7077:7077"
  spark-worker-1:
    image: bitnami/spark:latest
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    depends_on:
      - spark-master
    environment:
      SPARK_MODE: worker
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_MASTER_URL: spark://spark-master:7077
  spark-worker-2:
    image: bitnami/spark:latest
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    depends_on:
      - spark-master
    environment:
      SPARK_MODE: worker
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_MASTER_URL: spark://spark-master:7077

This setup uses the bitnami/spark image. You can add more workers (for example, by copying the spark-worker-2 block and renaming it spark-worker-3) and adjust SPARK_WORKER_CORES and SPARK_WORKER_MEMORY in the environment section to match your system's specifications.

After creating the docker-compose.yml file, you just need to go to the directory of the compose file and type:

docker-compose up -d

This will start the Spark master and worker containers in the background. You can verify they are running with docker-compose ps, and open the master's web UI at http://localhost:9090 (port 9090 on the host is mapped to the master's port 8080 in the compose file). To stop the containers, type the following command.

docker-compose down

Run PySpark on Dockerized Spark using spark-submit

Now that the Spark cluster is up and running, we need to test that it works. For that, we'll run a simple PySpark script using the spark-submit command.

# Import the necessary modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Create a SparkSession
spark = SparkSession.builder \
    .appName("My App") \
    .getOrCreate()

# Create an RDD of the integers 1 through 99 and sum them on the cluster
rdd = spark.sparkContext.parallelize(range(1, 100))

print("THE SUM IS HERE: ", rdd.sum())
# Stop the SparkSession
spark.stop()

This is a simple PySpark program that sums the integers from 1 to 99, so it should print 4950.
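Any PySpark script can be submitted to the cluster in exactly the same way. For instance, a slightly richer DataFrame-based job might look like the following sketch (the names and data here are made up purely for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataFrame Demo").getOrCreate()

# A tiny in-memory DataFrame; a real job would read from files or a database
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Run a simple aggregation on the cluster and print the result
df.agg(F.avg("age").alias("avg_age")).show()

spark.stop()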

After setting up the program, we need to copy it into the Spark master container. To do this, we can use the docker cp command. The container name may differ on your machine, so run docker ps to check it.

docker cp -L your_program.py spark_spark-master_1:/opt/bitnami/spark/anyfilename.py

Now we need to get the URL at which the Spark master is running. To do that, we can use the docker logs command.

docker logs spark_spark-master_1

In the log output, look for the line that shows the master URL. In this example, the Spark master is running at spark://172.18.0.2:7077.

Now we can execute the PySpark file using the following command:

docker-compose exec spark-master spark-submit --master spark://172.18.0.2:7077 anyfilename.py

If the program executed properly, it will display the sum.

Conclusion

You now have a working Apache Spark cluster running in Docker containers that executes your PySpark programs via spark-submit. You can further enhance this cluster by adding more workers or by increasing the cores and memory allocated to the worker nodes.
