Running a Spark cluster setup in Docker containers

Ruben
2 min read · Apr 29, 2018

After writing my previous post about how to run a Hadoop multi-node setup in Docker, the natural next step was to install Spark on top of it.

The task ended up being more difficult than expected, mainly because installing Spark and Hadoop separately is the hardest option: Spark can instead be shipped with its own Hadoop instance embedded.

Spark provides a couple of handy scripts to manage the cluster from the master node:

  • start-master.sh (and stop-master.sh)
  • start-slaves.sh (and stop-slaves.sh)
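
For reference, starting and stopping the standalone cluster with those scripts typically looks like this (assuming SPARK_HOME points at the Spark installation and conf/slaves lists the worker hostnames, one per line):

    # On the master node: start the standalone master,
    # then every worker listed in $SPARK_HOME/conf/slaves.
    $SPARK_HOME/sbin/start-master.sh
    $SPARK_HOME/sbin/start-slaves.sh

    # Tear the cluster down again.
    $SPARK_HOME/sbin/stop-slaves.sh
    $SPARK_HOME/sbin/stop-master.sh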

However (perhaps due to the Docker nature of the setup), the slaves didn’t start because of missing library dependencies (the slf4j API, to be precise). Everything worked fine when the scripts were run from within the containers, so I had to add a small wrapper to do exactly that.
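
The wrapper boils down to running the start scripts inside each container instead of from the host. A minimal sketch, assuming containers named spark-master, spark-slave1 and spark-slave2 with Spark installed under /usr/local/spark (the names are illustrative, not necessarily the ones used in the repository):

    #!/bin/bash
    # Start the master inside its own container, then each slave inside
    # its own, so the scripts pick up their library dependencies from
    # the container itself instead of the host.
    docker exec spark-master /usr/local/spark/sbin/start-master.sh

    for slave in spark-slave1 spark-slave2; do
      docker exec "$slave" /usr/local/spark/sbin/start-slave.sh spark://spark-master:7077
    done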

One lesson I learned is that even when the -u username option is provided, Docker doesn’t set the USER environment variable by default, so you have to set it yourself in your scripts (or pass -e USER=username).
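
In practice that just means passing the variable explicitly when the container is started, for example (image and user names here are placeholders):

    # -u switches the process user, but USER is not exported automatically,
    # and Spark's daemon scripts fall back on $USER (e.g. for log and pid
    # file names), so pass it in with -e.
    docker run -d -u spark -e USER=spark my-spark-image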

After a few ups and downs I got there, and the result can be cloned from here.

The resulting cluster contains a master node and two Spark slaves to play with.

Two Spark slave nodes
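
A quick way to check that both slaves are actually doing work is to submit the bundled SparkPi example against the master; the command below assumes the default standalone port 7077 and a container hostname of spark-master:

    # Run from inside the master container.
    $SPARK_HOME/bin/spark-submit \
      --master spark://spark-master:7077 \
      --class org.apache.spark.examples.SparkPi \
      $SPARK_HOME/examples/jars/spark-examples_*.jar 100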


Ruben

Data analytics, data management in financial services. Solutions Architect @ AWS. http://rubenafo.com