Scalable Spark/HDFS Setup using Docker

Note: an extended version of this article (with the Hue HDFS FileBrowser) is published on the BDE2020 project blog. The repository has been migrated to the BDE2020 GitHub.

In this article I discuss how I created a scalable HDFS/Spark setup using Docker and docker-compose.

So, why do I need to pack Spark into Docker in the first place? We have a cluster of three servers, each with 256 GB RAM and 64 CPU cores. As we run some other applications on this cluster, we chose Docker to isolate applications from each other, while still being able to connect them when necessary. Docker can also restrict RAM and set a CPU quota for each container. Thus, in theory, you can spawn several Spark workers isolated from each other on the same machine. For instance, you can spawn 60 Spark workers on one of those servers, each with 2 GB RAM and 1/70 of the CPU time.
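As a sketch, such per-container limits can be declared directly in a version-1 docker-compose file (the service and image names below are illustrative, not taken from the actual repo, and assume a compose release that supports `cpu_quota`):

```yaml
# Illustrative compose (v1) fragment: cap a worker at 2 GB RAM and
# roughly 1/70 of the CPU time of a 64-core machine.
# Service and image names here are hypothetical.
spark-worker:
  image: my-spark-worker   # hypothetical image name
  mem_limit: 2g            # hard RAM cap for the container
  cpu_quota: 91000         # ~0.91 CPU (64 cores / 70), per default 100000µs period
```

The same caps can also be passed ad hoc via `docker run --memory` and `--cpu-quota` if you are not using compose.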

Docker has existed for a couple of years already and has an ecosystem where users can find dozens of Docker images for almost all mainstream technologies, including HDFS/Spark. So I collected all the available images by searching for "spark" on Docker Hub, looking for something I could use out of the box. I put all of those images into imagelayers and got a nice visualization. Then I shortlisted them by the following criteria:

  • Less than 1 GB in size. The bigger images most likely do not separate components, which makes them hard to scale.
  • HDFS/Spark separated? I want to scale HDFS nodes separately from Spark nodes.
  • Is it easy to install? Is there documentation?
  • Can I scale Spark with this image?
  • Is it easy to connect Spark to HDFS?

After checking against criteria 1 and 2, which are relatively easy to evaluate without running anything, I had to try all the remaining images one by one. I found out that none of them fit criterion 4.

If I throw out scaling, uhopper/hadoop-spark looks good. So I forked it and modified it for our setup. Let's see how to use my simplified version (use git to clone my repo):

$ ./
$ docker network create hadoop
$ docker-compose up

That's it! The script iterates through all the Dockerfiles inside the repo and builds the necessary images. As I am using version 1 of docker-compose, you'll have to create the Docker network manually.

You now have 1 namenode, 2 datanodes, 1 Spark master and 1 Spark worker, as well as spark-notebook, running. Navigate to http://localhost:50070 to see the status of your HDFS, and to the Spark master web UI for Spark.

On the Spark master page you will only see 1 worker. Let's scale it up to 5.

$ docker-compose scale spark-worker=5
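Besides the web UI, you can check what compose actually started from the command line (exact output format depends on your compose version); after scaling you should see five spark-worker containers listed:

```shell
# List the containers managed by this compose project;
# five spark-worker instances should appear after scaling.
docker-compose ps
```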

Refresh the Spark master page; you will see 5 spark-workers registered now.

Let's throw some data into our HDFS now and run a simple job. Let's use the Waterworks — Treatment CSV dataset from the EU data portal.

# You should be inside the git repo folder; after running docker-compose, the data dir is created automatically.
$ cd data
$ wget
$ docker run -it --rm --env-file=../hadoop.env --net hadoop uhopper/hadoop hadoop fs -mkdir -p /user/root
$ docker run -it --rm --env-file=../hadoop.env --volume $(pwd):/data --net hadoop uhopper/hadoop hadoop fs -put /data/vannbehandlingsanlegg.csv /user/root
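You can also confirm the upload from the command line, reusing the same throwaway-container pattern as the commands above (this sketch assumes you are still in the data directory, with hadoop.env one level up):

```shell
# List /user/root on the cluster to confirm the file landed in HDFS,
# reusing the env file and the hadoop network from the commands above.
docker run -it --rm --env-file=../hadoop.env --net hadoop \
  uhopper/hadoop hadoop fs -ls /user/root
```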

Go to http://localhost:50070 and check that the file exists under the path /user/root/vannbehandlingsanlegg.csv.

Open spark-notebook on localhost:9000 and choose the "core/simple spark" example. Run the cells by clicking on them and pressing the play button. Then add a new cell (using the plus button in the toolbar). Let's read our file and count its lines:

val textFile = sc.textFile("/user/root/vannbehandlingsanlegg.csv")
textFile.count()

It will show you the execution time and a result like this:

textFile: org.apache.spark.rdd.RDD[String] = /user/root/vannbehandlingsanlegg.csv MapPartitionsRDD[2] at textFile at <console>:53
res8: Long = 4071
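As a next step, you could run a slightly bigger transformation in another cell. The snippet below is only a sketch: it assumes nothing about the CSV's columns and simply treats each line as a string, counting distinct non-empty rows.

```scala
// Sketch of a follow-up notebook cell: basic RDD transformations
// on the same file. No assumptions are made about the CSV schema;
// lines are handled as plain strings.
val lines = sc.textFile("/user/root/vannbehandlingsanlegg.csv")
val distinctNonEmpty = lines
  .filter(_.trim.nonEmpty)  // drop blank lines
  .distinct()               // remove duplicate rows
  .count()
println(s"Distinct non-empty lines: $distinctNonEmpty")
```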

That's it! If you have any questions, feel free to open an issue on GitHub or leave a comment on this article.