In this post we will cover the necessary steps to create a Spark standalone cluster with Docker and docker-compose.
We will be using a very basic project structure as follows:
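A layout along these lines works well (the directory names here are illustrative, not prescribed):

```
docker-spark-cluster/
├── docker-compose.yml
└── docker/
    ├── base/Dockerfile           # spark-base:2.3.1
    ├── spark-master/Dockerfile   # spark-master:2.3.1
    ├── spark-worker/Dockerfile   # spark-worker:2.3.1
    └── spark-submit/Dockerfile   # spark-submit:2.3.1
```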
The base Images
We will be using some base images to get the job done, these are the images used to create the cluster:
- spark-base:2.3.1: A base image based on java:alpine-jdk-8 which ships Scala, Python 3, and Spark 2.3.1. Basically we need to download the Spark and Scala packages, configure them in the PATH, and add Python 3 (for PySpark).
- spark-master:2.3.1: An image based on the previously created Spark base image, used to create Spark master containers. We just need to configure the basics of the master: the master and web UI ports, and a bootstrap script that starts the spark-master service.
- spark-worker:2.3.1: An image based on the previously created Spark base image, used to create Spark worker containers. As with the master image, we just need to configure the ports to be exposed and the master URL. Notice that we configured spark://spark-master:7077 (Docker resolves this hostname for you internally).
- spark-submit:2.3.1: An image based on the previously created Spark base image, used to create spark-submit containers (run, deliver the driver, and exit gracefully).
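To make the base image concrete, here is a minimal sketch of what its Dockerfile could look like. The download URL, Hadoop build, and install paths are assumptions for illustration; adapt them to your setup:

```dockerfile
# Hypothetical sketch of spark-base:2.3.1 — versions, URLs, and paths are assumptions
FROM java:alpine-jdk-8

# Add Python 3 (for PySpark) plus basic tooling
RUN apk add --no-cache python3 bash curl

# Download and unpack Spark 2.3.1 (Hadoop build is an assumption)
ENV SPARK_VERSION=2.3.1 \
    HADOOP_VERSION=2.7
RUN curl -fsSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
      | tar -xz -C /opt \
 && mv "/opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" /opt/spark

# Configure Spark in the PATH
ENV SPARK_HOME=/opt/spark
ENV PATH="${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${PATH}"
```

The master and worker images then only need to extend this base and start `org.apache.spark.deploy.master.Master` or `org.apache.spark.deploy.worker.Worker` respectively in their bootstrap scripts.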
The compose file
The compose file will contain four services (containers): one Spark master and three Spark workers.
These containers are wired together by a custom network (10.5.0.0/16); every container gets a static IP address on this network to make things easier.
One important thing to notice is that every container has two volume mounts:
- /mnt/spark-apps:/opt/spark-apps: It will be used to make application code and configuration available on every worker and master alike.
- /mnt/spark-data:/opt/spark-data: It will be used to make application input and output files available on every worker and master alike.
Both mounts simulate a distributed file system using Docker volume mounts; this comes in handy for making your application code and files available without any effort.
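Putting the network and mounts together, a compose file for this setup could look roughly like the sketch below. The subnet, master URL, and volume mounts come from the description above; the service names, host ports, and specific IP addresses are assumptions you should adjust (the additional workers follow the same pattern as spark-worker-1):

```yaml
version: "3"

services:
  spark-master:
    image: spark-master:2.3.1
    ports:
      - "8080:8080"   # master web UI (host port is an assumption)
      - "7077:7077"   # master service port
    volumes:
      - /mnt/spark-apps:/opt/spark-apps
      - /mnt/spark-data:/opt/spark-data
    networks:
      spark-network:
        ipv4_address: 10.5.0.2

  spark-worker-1:
    image: spark-worker:2.3.1
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark://spark-master:7077
    volumes:
      - /mnt/spark-apps:/opt/spark-apps
      - /mnt/spark-data:/opt/spark-data
    networks:
      spark-network:
        ipv4_address: 10.5.0.3

networks:
  spark-network:
    driver: bridge
    ipam:
      config:
        - subnet: 10.5.0.0/16
```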
Build the Images
Before running the compose we need to build every custom image as follows:
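Assuming a layout with one directory per Dockerfile (the paths below are illustrative), the builds would look like this — note that the base image must be built first, since the other three images extend it:

```sh
docker build -t spark-base:2.3.1 ./docker/base
docker build -t spark-master:2.3.1 ./docker/spark-master
docker build -t spark-worker:2.3.1 ./docker/spark-worker
docker build -t spark-submit:2.3.1 ./docker/spark-submit
```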
Run the docker-compose
The final step to create your test cluster is to run the compose file: `docker-compose up`
Validate your cluster
To validate your cluster, just access the Spark web UI at the master's URL and at each worker's URL.
Spark Worker 1
Spark Worker 2
Spark Worker 3
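If you prefer the command line, you can also probe the master UI directly. The host port here is an assumption; use whatever port you mapped in your compose file:

```sh
# Returns the master UI page if the cluster is up (port 8080 is an assumption)
curl -s http://localhost:8080 | grep -i "spark master"
```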
Getting the source
You can get the source code and a step-by-step tutorial on how to deploy the cluster from scratch.