Docker for Data Science

For the past three years, I have heard a lot of buzz about Docker containers. I wanted to figure out how this technology could make me a more productive developer and data scientist.

I have tried to convey my findings in this blog post so you don’t need to parse all the information out there. Let’s get started.

What is Docker?

Docker is an open-source project based on Linux containers. It uses Linux kernel features like namespaces and control groups to create containers on top of an operating system. Docker is a tool designed to make it easier to create, deploy, and run applications by using containers.

We can think of Docker containers as lightweight virtual machines that contain everything you need to run an application. Even giants like Google, Amazon, and VMware have built services to support it. That’s all you need to know about Docker for now.


Containers are not new to the tech world; Google has been using its own container technology for years.

A virtual machine like VMware provides hardware-level virtualisation, whereas a container provides operating-system-level virtualisation. The big difference between a VM and a container is that containers share the host machine’s kernel with other co-hosted containers.

Who is Docker for?

Docker is a tool that is designed to benefit both developers and system administrators, making it a part of many DevOps (developers + operations) toolchains. For developers, it means that they can focus on writing code without worrying about the system it will ultimately be running on. For the operations team, Docker gives flexibility and potentially reduces the number of systems needed because of its light weight and OS-level virtualization.

Why do you need Docker?


Because Docker containers running on a single machine share that machine’s operating system kernel, they start instantly and use less compute and RAM. Images are constructed from filesystem layers and share common files. This minimizes disk usage, and image downloads are much faster.


Docker is based on open standards and runs on all major Linux distributions, on Microsoft Windows, and on any infrastructure, including VMs, bare metal, and the cloud.


Docker containers isolate applications from one another and from the underlying infrastructure. Docker provides strong default isolation that limits issues to a single container instead of the entire machine.


As a data scientist working in machine learning, being able to rapidly change environments can significantly affect your productivity. Data science work often begins with data cleaning, data transformation, and model building. This work usually happens on your laptop; however, there often comes a moment when different compute resources, such as more CPUs or RAM, could speed up your work. Docker simplifies the process of porting your environment to a remote machine or a cloud environment. You can even participate in Kaggle competitions by porting your environment to a cloud service like AWS or Google Cloud. The common saying is “build once, run anywhere”.

Docker Terminology

  • Image: A blueprint for what you want to build. Example: Ubuntu + Spark + Python with a running Jupyter server.
  • Container: An instantiation of an image that you have brought to a running state. You can also have multiple copies of the same image running. Newcomers often confuse images with containers.
  • Dockerfile: The cookbook for creating an image. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image.
  • DockerHub/Image Registry: A place where users can share their images publicly or privately. Users can pull existing images and use them locally to create containers.
  • Commit: Like Git, Docker offers version control. Containers are generally stateless unless you explicitly save their state. You can save the state of your container at any time as a new versioned image by committing the changes.
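To make the commit idea concrete, here is a sketch of saving a container’s state as a new image; the container and image names are hypothetical examples:

```shell
# Start a container from the official Ubuntu image and change something inside it
docker run -it --name my-container ubuntu:16.04 bash
#   (inside the container) apt-get update && apt-get install -y vim && exit

# Commit the container's current state as a new, versioned image
docker commit my-container my-image:v2
```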

I will be using the above terminology for the rest of the tutorial, so please refer back if you get lost somewhere.

Install Docker

You can download and install Docker community edition for free here
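Once installed, you can verify that Docker works with two quick commands:

```shell
# Print the installed Docker version
docker --version

# Pull and run the tiny hello-world test image from DockerHub
docker run hello-world
```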

Create Your First Docker Image

Get ready to get your hands dirty by creating your first Docker image. Let’s go through the Dockerfile below step by step. With this Dockerfile you can create a PySpark cluster (local mode) with Jupyter Notebook enabled. The same Dockerfile pattern works for most Python data science packages as well.

FROM Statement

FROM ubuntu:16.04

The FROM statement is the first line of a Dockerfile. It specifies the base image for your application. Here we are using ubuntu:16.04 as the base image. This image is a minimal installation, which means it doesn’t have all the packages that a general Ubuntu installation contains.

As soon as Docker executes this statement, it looks for the image locally (on your system); if it is not present locally, Docker looks in DockerHub/the image registry and pulls the official Ubuntu image. You can also build a container on top of a pre-built application image, such as the Anaconda image, and you can push your own images to DockerHub.

It is worth mentioning that image versions are defined after the colon (:) for each image. In our ubuntu:16.04 image, 16.04 is the version tag; you can also set your own tag, such as :latest, while building your own image.

An important note about Docker images: be cautious while pulling images from DockerHub, as random images from random people could potentially damage your system. As a best practice, use the official images for the respective products.
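For example, pulling an official base image explicitly and checking that it arrived looks like this:

```shell
# Pull the official Ubuntu 16.04 image from DockerHub
docker pull ubuntu:16.04

# List the images now available locally
docker images
```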

LABEL Maintainer

LABEL maintainer="Bharath"

This statement adds metadata to your image and is completely optional. I add it so that others know whom to contact about the image, and so that I can search for my Docker containers, especially when many of them are running concurrently on a server.

RUN Statement

RUN apt-get update && apt-get install -y \
    software-properties-common \
    wget \
    vim \
    git
The RUN command is used to install any packages required in the image. For example, you can run apt-get update and install wget, git, and so on.

As I mentioned before, the base image has minimal packages, so we need to install the common packages a user needs, such as vim, wget, and git, which are also required in further steps of the Dockerfile.

ENV Statement

ENV PYSPARK_DRIVER_PYTHON_OPTS "notebook --no-browser --port=8888 --allow-root --ip=''"
ENV PATH /root/spark/bin:$PATH

Most Linux users are familiar with environment variables. ENV is used to set environment variables inside the container. Here we configure PySpark to launch a Jupyter Notebook server as its driver and add Spark’s bin directory to the PATH.

EXPOSE Statement

EXPOSE 8888 4040

Docker containers generally don’t expose their ports to the outside world, not even to the local system you are working on. So we need to explicitly expose the required ports; in our case, I am exposing ports 8888 and 4040 for Jupyter notebooks and the Spark UI.

WORKDIR Statement

WORKDIR /root/jup

The WORKDIR statement sets the given directory as the working directory for subsequent instructions. This comes in handy when you want to issue a command from the application directory.

ADD Statement

ADD /home/ubuntu/notebooks /home/ubuntu/

The ADD statement copies your local files into the Docker image. You can copy folders as well. Note that the source path must be inside the build context (the directory containing your Dockerfile).

From documentation of docker

ADD <src> ... <dest>

The ADD instruction copies files and folders from <src> to the <dest> location.

CMD Statement

CMD ["pyspark"]

The CMD statement issues the executable command at the end of the Dockerfile.

There can only be one CMD instruction in a Dockerfile. If you list more than one CMD then only the last CMD will take effect.
The main purpose of a CMD is to provide defaults for an executing container. These defaults can include an executable, or they can omit the executable, in which case you must specify an ENTRYPOINT instruction as well.
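Putting all the snippets above together, the complete Dockerfile looks roughly like this. Treat it as a sketch: the steps that install Python, Jupyter, and Spark itself (into /root/spark) are assumed and omitted here, and the notebooks folder must exist inside your build context.

```dockerfile
FROM ubuntu:16.04

LABEL maintainer="Bharath"

RUN apt-get update && apt-get install -y \
    software-properties-common \
    wget \
    vim \
    git

# (Steps installing Python, Jupyter, and Spark into /root/spark would go here)

ENV PYSPARK_DRIVER_PYTHON_OPTS "notebook --no-browser --port=8888 --allow-root --ip=''"
ENV PATH /root/spark/bin:$PATH

EXPOSE 8888 4040

WORKDIR /root/jup

# 'notebooks' must be a folder inside the build context
ADD notebooks /root/jup

CMD ["pyspark"]
```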

Building a Docker Image

We are now ready to build an image using the recipe in the Dockerfile. You can accomplish this with the command below.

docker build -t pyspark:ver1 .

The above command builds a Docker image tagged pyspark:ver1, where ver1 is the version tag. Please note that this builds a Docker image, not a container (re-read the terminology at the beginning of this post if you don’t remember the difference).

Run Container from Docker Image

Finally, we are ready to run our shiny new first container. The command below maps port 8888 in the container to port 8888 on the host so that we can access the Jupyter notebook.

docker run -it -p 8888:8888 pyspark:ver1

After the command executes, the Jupyter notebook server is up and running. Copy the URL from the output and paste it into your browser to open the Jupyter notebook.

List running containers

To list the running containers in your Docker environment:

docker ps

List Images

docker images
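Cleaning up is equally simple; these housekeeping commands remove containers and images you no longer need (the container ID is a placeholder):

```shell
# List all containers, including stopped ones
docker ps -a

# Remove a stopped container by ID or name
docker rm <container-id>

# Remove a local image
docker rmi pyspark:ver1
```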

Push your docker image to Dockerhub

When you decide to share your work publicly or with your colleagues, just push the image to DockerHub, a public repository for images.

docker push <your-dockerhub-username>/pyspark:ver1

Before pushing, you need to log in with your DockerHub credentials and tag the image with your DockerHub username. This is useful for people who want to reproduce and extend your research.
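As a sketch, the full push sequence looks like this; <your-dockerhub-username> is a placeholder for your own DockerHub account:

```shell
# Authenticate against DockerHub
docker login

# Tag the local image with your DockerHub username, then push it
docker tag pyspark:ver1 <your-dockerhub-username>/pyspark:ver1
docker push <your-dockerhub-username>/pyspark:ver1
```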

Further Reading

Helpful docker commands

Dockerfile reference

Pushing docker image to dockerhub

Further Learning: Managing Containers in Production


Production grade container orchestration

If you want to run thousands of containers in production with scale-in and scale-out features, manage deployments in a declarative style, and get service discovery, load balancing, and automated roll-outs and roll-backs, you will need a production-grade container orchestrator.


Stay tuned for my next blog: Kubernetes Container Orchestration