Learn Docker to get started with Spark

A practical introduction to the nuances of Docker.

What is Docker?

Docker is a containerization platform that simplifies the development and deployment of applications. To put it very simply, Docker is software that lets you reproduce, on your local machine, the same environment your code would run in PROD.

Docker for Spark

Before we get started, we need to understand some Docker terminologies.

  1. Registry: The central repository for Docker images, from which you can download them. Docker Hub is one such example. We can also set up a private Docker registry that isn’t publicly available. We can pull (i.e. download) an image from the registry, or push an image to it.
  2. Image: Essentially a blueprint for what constitutes your Docker container. For example, to deploy a Spark cluster you might want to start with a base Linux, install Java, and so on. All of these requirements are baked into an image that can be pulled from a registry or built locally from your Dockerfile.
  3. Container: As per Docker’s documentation, it is “a standardized unit of software”. It is an instance of an image. Basically, a container is like a lightweight, isolated virtual machine (not exactly, but a good analogy). Docker uses a couple of cool Linux features, namely namespaces and cgroups, to provide us an isolated environment in which to run our lightweight images.
  4. Dockerfile: A text file, like a script, containing detailed instructions for the commands you want to run, the things you want to download, and so on. We will be writing one of these by the end of this article.
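To sketch how these pieces fit together, a typical workflow pulls an image from the registry and then starts a container from it. The snippet below is guarded so it also runs on machines without a Docker daemon:

```shell
# Typical registry -> image -> container workflow.
image="ubuntu:18.04"

if command -v docker >/dev/null 2>&1; then
  docker pull "$image" || true                            # download the image from Docker Hub
  docker run --rm "$image" echo "hello from a container" || true  # run a throwaway container from it
else
  echo "docker not installed; would pull and run $image"
fi
```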

Now that we know some basic definitions, it’s time to ask the main question: why do I care?

There are many reasons you might want to use Docker. I will give my perspective on why I started to learn about it.

I had to test my Kafka producers and consumers locally, instead of deploying my code to DEV/QA before I was sure things worked, while also being confident that the same code would behave the same way when deployed in other environments.

docker run --rm -it -p 2181:2181 -p 3030:3030 -p 8081:8081 -p 8082:8082 -p 8083:8083 -p 9092:9092 -e ADV_HOST=127.0.0.1 landoop/fast-data-dev

Don’t worry about the specifics; we will get into them later in the blog. But this one-liner spins up a Kafka broker, with a UI, on my local machine. Isn’t it awesome!

PS: We obviously need Docker installed on our machine first...


In this article, we will scratch the surface and start your journey toward understanding Docker.

Writing your first Dockerfile.

Let’s try to create a simple Docker image which will have an isolated environment to run your Spark application. This base image can be used to create a multi-node cluster as well.

# Build-time argument for the Ubuntu version, defaulting to 18.04
ARG ubuntu_version=18.04
# Use Ubuntu as the base image
FROM ubuntu:${ubuntu_version}
# Any label to recognise this image
LABEL image=Spark-base-image
# Set the Spark version as an environment variable inside the container
ENV SPARK_VERSION=2.4.1
# Install the packages below on the Ubuntu image
RUN apt-get update -qq && \
    apt-get install -qq -y gnupg2 wget openjdk-8-jdk scala
# Download the Spark binaries from the mirror
RUN wget --no-verbose http://www.gtlib.gatech.edu/pub/apache/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz
# Untar the downloaded binaries, move them to a folder named spark,
# and add the Spark bin directory to the PATH
RUN tar -xzf /spark-${SPARK_VERSION}-bin-hadoop2.7.tgz && \
    mv spark-${SPARK_VERSION}-bin-hadoop2.7 spark && \
    echo "export PATH=$PATH:/spark/bin" >> ~/.bashrc
# Expose the Spark UI port 4040
EXPOSE 4040

So this is a sample Dockerfile. When we build it, it creates an image, and we can spin up containers from that image. Also, note that each instruction creates an intermediate image layer; Docker caches these layers and reuses them on subsequent builds.
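Those layers can be inspected once the image is built. A guarded sketch, assuming the image has been tagged spark-base-image as later in this article:

```shell
# Each Dockerfile instruction produces a layer; `docker history`
# lists them for a built image.
image="spark-base-image"

if command -v docker >/dev/null 2>&1; then
  docker history "$image" || echo "image $image not built yet"
else
  echo "docker not installed; would inspect layers of $image"
fi
```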

FROM: This is the first instruction of your Dockerfile (unless you use ARGs, which may come before it). It specifies the base image you are starting from. Here we say we are going to use vanilla Ubuntu and install our binaries on top of it.

ARG: This can be used to set a variable at the time we build the image.

docker build --build-arg ubuntu_version=18.04 .

LABEL: As the name suggests, these are just the labels of the image that would be created.

ENV: This is used to set the environment variables within the containers created from this image. Here we use it to just set the SPARK_VERSION.

RUN: This runs the specified commands while building the image. Here, we use this instruction to perform various operations. It is also good practice to combine related commands into a single multiline RUN instruction (using \), since each RUN creates a new layer.
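As an illustration of that practice, a single RUN can install packages and clean up after itself in the same layer, so the intermediate files never persist in an earlier layer (a sketch, not part of the Dockerfile above):

```dockerfile
# One RUN, one layer: install and clean up the apt cache together.
RUN apt-get update -qq && \
    apt-get install -qq -y wget && \
    rm -rf /var/lib/apt/lists/*
```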

EXPOSE: This will expose the port within the network of Docker containers, but it won’t be available outside of that network (multiple Docker containers can be hooked up together, and EXPOSE can be used for inter-container communication). EXPOSE does not make the ports of the container accessible to the host. This comes in handy, say, when we need to open up ports between the Spark master and workers in a Spark cluster.

To reach an exposed port from the host, we will have to use the docker run -p option when we run our container (more on this later).

docker run --rm -dit -p 14040:4040 --name test001 spark-base-image

Note that we are publishing port 4040 of the Docker container as host port 14040, so the Spark UI will be accessible at http://localhost:14040
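One way to check the mapping from the host side is to probe the published port. This is a hypothetical check that assumes the container above is running; when nothing is listening, curl reports an HTTP code of 000:

```shell
# Probe the host port that -p 14040:4040 published.
host_port=14040

if command -v curl >/dev/null 2>&1; then
  curl -s -o /dev/null -w "%{http_code}\n" "http://localhost:${host_port}" || true
else
  echo "curl not installed"
fi
```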

To build a Docker image

docker build -t spark-base-image ~/home/myDockerFileFo/

This will create an image from the above Dockerfile and tag it as spark-base-image. If we don’t tag it with a specific name, it will be untagged and identified only by an ImageId, as per the build output: something like Successfully built 955016ee2387
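An untagged image can still be tagged after the fact by its ImageId. A guarded sketch, reusing the id from the sample build output above (hypothetical; the id on your machine will differ):

```shell
# Attach a repository:tag to an existing image by its id.
image_id="955016ee2387"

if command -v docker >/dev/null 2>&1; then
  docker tag "$image_id" spark-base-image:latest || echo "no image with id $image_id"
else
  echo "docker not installed"
fi
```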

To list all the images available

docker images --format "table {{.ID}}\t{{.Repository}}\t{{.Tag}}"
IMAGE ID            REPOSITORY          TAG
d1b494784ab8        <none>              <none>
ed7532b3f781        spark-base          2.4
ba3684c184e1        spark               spark-docker
b1fc416d936f        openjdk             latest
c842abf5149c        openjdk             12-jdk-oraclelinux7
94e814e2efa8        ubuntu              18.04

We can use the docker images command to list all the images available on our machine. Because the output is structured, we can query for exactly the images we need: filter, format, and a lot more.
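Since --format emits plain text, ordinary shell tools can post-process the listing. The sketch below filters a captured sample of the output above (hard-coded here so the snippet runs without a Docker daemon) for a given repository:

```shell
# A captured sample of `docker images --format "{{.ID}} {{.Repository}} {{.Tag}}"` output.
sample="d1b494784ab8 <none> <none>
ed7532b3f781 spark-base 2.4
94e814e2efa8 ubuntu 18.04"

# Pick out the image id whose repository is spark-base.
spark_id=$(printf '%s\n' "$sample" | awk '$2 == "spark-base" {print $1}')
echo "$spark_id"
```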

To run a docker image

There are a ton of things that we can do when we run a Docker container.

docker run --rm -dit -p 14040:4040 --name mySpark spark-base

Here we are telling Docker to run a container called mySpark using the spark-base image that we just created. Some of the flags we pass: --rm removes the container when we stop it; -dit is an important one, because without it our container would just start and immediately stop. -d runs the container in detached mode, -i keeps STDIN open so that we can use it later, and -t allocates a pseudo-terminal.

To get inside the container, we can use:

user@myMac:~/home $ docker exec -it test001 bin/bash
root@7b5d8dcbd265:/# ls
bin boot dev etc home lib lib64 media mnt opt proc root run sbin spark spark-2.4.1-bin-hadoop2.7.tgz srv sys tmp usr var

To check all the containers running:

docker ps    # List containers running currently
docker ps -a # This will list even the stopped containers

To stop and remove the running container:

docker stop mySpark
docker rm mySpark

To remove an image:

docker rmi imagename
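Putting the last few commands together, a full teardown of the container and image used in this article might look like the following guarded sketch (2>/dev/null || true keeps it harmless when the container or image is already gone):

```shell
# Stop and remove the container, then remove the image it came from.
container="mySpark"
image="spark-base"

if command -v docker >/dev/null 2>&1; then
  docker stop "$container" 2>/dev/null || true
  docker rm   "$container" 2>/dev/null || true
  docker rmi  "$image"     2>/dev/null || true
else
  echo "docker not installed"
fi
```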

These are some of the basics to get started with Docker. I will write a follow-up article on some more involved Docker concepts. Happy learning! As always, thanks for reading! Please share the article if you liked it. Any comments or suggestions are welcome! Check out my other articles here.