Scalable Spark/HDFS Setup using Docker
So, why pack Spark into Docker in the first place? We have a cluster of three servers, each with 256 GB RAM and 64 CPU cores. Since we run other applications on this cluster as well, we chose Docker to isolate applications from each other while still being able to connect them when necessary. Docker can also restrict RAM and set a CPU quota for each container, so in theory you can spawn several spark-workers isolated from each other on the same machine. For instance, you could spawn 60 spark-workers on one of those servers, each with 2 GB RAM and 1/70 of the CPU time.
Docker has been around for a couple of years already and has an ecosystem where users can find dozens of images for almost every mainstream technology, including HDFS/Spark. So I collected all the available images by searching for “spark” on Docker Hub, looking for something I could use out of the box. I put all of those images into imagelayers and got a nice visualization. Then I shortlisted them by the following criteria:
1. Less than 1 GB in size. The bigger images most likely do not separate their components, which makes them hard to scale.
2. Are HDFS and Spark separated? I want to scale HDFS nodes independently of Spark nodes.
3. Is it easy to install? Is there documentation?
4. Can I scale Spark with this image?
5. Is it easy to connect Spark to HDFS?
Criteria 1 and 2 are relatively easy to check without trying the images out; for the rest I had to run them one by one. It turned out that none of the images satisfied criterion 4.
$ ./buildall.sh
$ docker network create hadoop
$ docker-compose up
That’s it! buildall.sh iterates through all the Dockerfiles inside the repo and builds the necessary images. As I am using version 1 of docker-compose, you have to create the docker network manually.
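For reference, the docker-compose.yml wires the pieces up roughly like this. This is a simplified sketch; the image names, ports, and settings below are illustrative placeholders, and the actual file in the repo is authoritative:

```yaml
# Sketch of the compose (v1 format) layout; service and image names
# are illustrative. Each service joins the pre-created "hadoop" network.
namenode:
  image: hadoop-namenode
  net: hadoop
  env_file: ./hadoop.env
  ports:
    - "50070:50070"
datanode:
  image: hadoop-datanode
  net: hadoop
  env_file: ./hadoop.env
spark-master:
  image: spark-master
  net: hadoop
  ports:
    - "8080:8080"
spark-worker:
  image: spark-worker
  net: hadoop
spark-notebook:
  image: spark-notebook
  net: hadoop
  ports:
    - "9000:9000"
```

Keeping HDFS and Spark as separate services is exactly what makes criterion 2 (and independent scaling) possible.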
You now have 1 namenode, 2 datanodes, 1 spark-master and 1 spark-worker running, as well as spark-notebook. Navigate to http://your.docker.host:50070 to see the status of your HDFS and to http://your.docker.host:8080 for Spark.
On the Spark page you will only see 1 worker. Let’s scale it up to 5.
$ docker-compose scale spark-worker=5
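This is also where the resource isolation mentioned at the start comes in: the compose v1 format lets you cap each worker. A hedged sketch (the service name, image, and values are illustrative, not taken from the repo):

```yaml
# Illustrative only: cap each spark-worker at 2 GB RAM and give it a
# reduced relative share of CPU time (cpu_shares is a relative weight,
# not a hard quota).
spark-worker:
  image: spark-worker
  net: hadoop
  mem_limit: 2g
  cpu_shares: 256
```

With limits like these in place, `docker-compose scale` gives you many small, mutually isolated workers on one machine.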
Make sure you are inside the git repo folder (the data directory is created automatically when you run docker-compose):
$ cd data
$ wget https://data.mattilsynet.no/vannverk/vannbehandlingsanlegg.csv
$ docker run -it --rm --env-file=../hadoop.env --net hadoop uhopper/hadoop hadoop fs -mkdir -p /user/root
$ docker run -it --rm --env-file=../hadoop.env --volume $(pwd):/data --net hadoop uhopper/hadoop hadoop fs -put /data/vannbehandlingsanlegg.csv /user/root
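Both commands pass hadoop.env so the throwaway container knows where the namenode lives. Assuming the uhopper images’ `CONF`-style environment-variable convention, the relevant part of that file looks roughly like this (a sketch; check the actual file in the repo):

```
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
```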
Go to http://localhost:50070 and check that the file exists under the path `/user/root/vannbehandlingsanlegg.csv`.
Open spark-notebook on localhost:9000 and choose the “core/simple spark” example. Run the cells by clicking on them and pressing the play button. Then add a new cell using the plus button in the toolbar. Let’s read our file:
val textFile = sc.textFile("/user/root/vannbehandlingsanlegg.csv")
textFile.count()
It will show you the execution time and a result like this:
textFile: org.apache.spark.rdd.RDD[String] = /user/root/vannbehandlingsanlegg.csv MapPartitionsRDD at textFile at <console>:53
res8: Long = 4071
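From here you would typically split each line into fields before doing anything useful. The core of that step is plain Scala; below is a minimal sketch, assuming the file is semicolon-delimited (common for Norwegian open data, but verify against the actual header row — the sample line and field meanings here are made up for illustration):

```scala
// Hypothetical parsing of one semicolon-delimited CSV line.
// The column layout of vannbehandlingsanlegg.csv is assumed, not checked;
// inspect the real header before relying on field positions.
val sampleLine = "1;Oslo;Vannverk A"
val fields: Array[String] = sampleLine.split(";")
val id = fields(0)           // first column
val municipality = fields(1) // second column
```

In the notebook you would apply the same split across the whole RDD, e.g. `textFile.map(_.split(";"))`, after filtering out the header row.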
That’s it! If you have any questions, feel free to open an issue on GitHub or leave a comment on this article.