Docker for Data Science
For the past 3 years, I have heard a lot of buzz about Docker containers, and I wanted to figure out how this technology could make me a more productive developer or data scientist.
I tried to convey my findings through this blog so you don't need to parse all the information out there. Let's get started.
Docker is an open-source project based on Linux containers. It uses Linux kernel features like namespaces and control groups to create containers on top of an operating system. Docker is a tool designed to make it easier to create, deploy, and run applications by using containers.
We can think of Docker containers as lightweight virtual machines that contain everything you need to run an application. Even biggies like Google, Amazon, and VMware have built services to support it. That's all you need to know about Docker for now.
Containers are not new to the tech world; Google has been using its own container technology for years.
A virtual machine like VMware provides hardware-level virtualisation, while a container provides operating-system-level virtualisation. The big difference between a VM and a container is that containers share the host machine's kernel with other co-hosted containers.
Who is Docker for?
Docker is a tool designed to benefit both developers and system administrators, making it a part of many DevOps (developers + operations) toolchains. For developers, it means they can focus on writing code without worrying about the system it will ultimately run on. For operations teams, Docker gives flexibility and potentially reduces the number of systems needed, thanks to its small footprint and OS-level virtualization.
Why do you need Docker?
Because Docker containers running on a single machine share that machine's operating system kernel, they start instantly and use less compute and RAM. Images are constructed from filesystem layers and share common files, which minimizes disk usage and makes image downloads much faster.
Docker is based on open standards and runs on all major Linux distributions, on Microsoft Windows, and on any infrastructure, including VMs, bare metal, and the cloud.
Docker containers isolate applications from one another and from the underlying infrastructure. Docker provides strong default isolation that limits issues to a single container instead of the entire machine.
As a data scientist working in machine learning, being able to change environments rapidly can significantly affect your productivity. Data science work often begins with data cleaning, data transformation, and model building. This work usually happens on your laptop; however, there often comes a moment when different compute resources, such as more CPUs or RAM, could speed up your work. Docker simplifies the process of porting your environment to a remote machine or cloud environment. You can even participate in Kaggle competitions by porting your environment to a cloud service such as AWS or Google Cloud. The common saying is "Build once, run anywhere."
- Image: A blueprint for what you want to build. For example: Ubuntu + Spark + Python with a running Jupyter server.
- Container: An instantiation of an image that you have brought to a running state. You can have multiple copies of the same image running at once. Newcomers often confuse images and containers: the image is the template; the container is a running instance of it.
- Dockerfile: The cookbook for creating an image. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image.
- DockerHub/Image Registry: A place where users can share their images publicly or privately. Users can pull existing images and use them locally to create containers.
- Commit: Like Git, Docker containers offer version control. Containers are generally stateless unless you explicitly make them stateful. You can save the state of a container at any time as a new, versioned image by committing the changes.
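The commit workflow described above looks roughly like this on the command line (the container ID and the ver2 tag are placeholders for illustration):

```shell
# Find the ID of the running container you want to snapshot
docker ps

# Save its current filesystem state as a new, versioned image
docker commit <container_id> pyspark:ver2
```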
I will be using the above terminology for the rest of the tutorial; please refer back if you get lost somewhere.
You can download and install Docker community edition for free here
Create Your First Docker Image
Get ready to get your hands dirty by creating your first Dockerfile. Let's go through the Dockerfile below slowly. With its help you can create a PySpark cluster (local mode) with Jupyter Notebook enabled. The same Dockerfile pattern works for most Python data science packages as well.
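Assembled in one place, the Dockerfile we will walk through might look like this. Treat it as a sketch: the apt package list, the notebooks folder, and the omitted step that downloads Spark into /root/spark are assumptions, not necessarily the exact original file.

```dockerfile
# Minimal Ubuntu base image
FROM ubuntu:16.04

LABEL maintainer="Bharath <email@example.com>"

# Basic tooling plus Python 3 and Jupyter (package list is illustrative)
RUN apt-get update && apt-get install -y \
    vim \
    wget \
    git \
    python3 \
    python3-pip \
 && pip3 install jupyter

# (Downloading and unpacking Spark into /root/spark is omitted for brevity)

# Use Python 3 for PySpark and launch Jupyter as the driver
ENV PYSPARK_PYTHON python3
ENV PYSPARK_DRIVER_PYTHON jupyter
ENV PYSPARK_DRIVER_PYTHON_OPTS "notebook --no-browser --port=8888 --allow-root --ip='0.0.0.0'"
ENV PATH /root/spark/bin:$PATH

# Jupyter (8888) and the Spark UI (4040)
EXPOSE 8888 4040

WORKDIR /home/ubuntu

# Copy notebooks from the build context into the image
ADD notebooks /home/ubuntu/notebooks

# Starting pyspark launches the notebook server, per the ENV settings above
CMD ["pyspark"]
```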
FROM ubuntu:16.04
The FROM statement is the first line of a Dockerfile. It specifies the base image for your application; here we are using ubuntu:16.04 as the base image. This image is a minimal installation, which means it doesn't include all the packages a general Ubuntu system contains.
As soon as Docker executes this statement, it looks for the image locally (on your system); if it is not present, Docker pulls the official ubuntu image from DockerHub/the image registry. You can also build a container on top of a pre-built application image, such as the Anaconda image, and you can push your own image to DockerHub.
It is worth mentioning that image versions are specified after the colon (:). In our ubuntu:16.04 image, 16.04 is the version tag; you can set your own tag, such as :latest, while building your own image.
An important note about Docker images: be cautious while pulling images from DockerHub, as arbitrary images from unknown publishers could potentially damage your system. As a best practice, use the official images for the respective products.
LABEL maintainer="Bharath <email@example.com>"
This statement adds metadata to your image and is completely optional. I add it so that others know whom to contact about the image, and so I can search for my Docker containers, especially when many of them are running concurrently on a server.
RUN apt-get update && apt-get install -y \
    vim \
    wget \
    git
The RUN instruction is used to install any packages the image requires; for example, you can run apt-get update and then install wget, git, and so on. As I mentioned before, the base image ships with minimal packages, so we install the regular tools a user needs, such as vim, wget, and git, which are also required in later steps of the Dockerfile.
ENV PYSPARK_PYTHON python3
ENV PYSPARK_DRIVER_PYTHON jupyter
ENV PYSPARK_DRIVER_PYTHON_OPTS "notebook --no-browser --port=8888 --allow-root --ip='0.0.0.0'"
ENV PATH /root/spark/bin:$PATH
Most Linux users are familiar with environment variables; the ENV instruction sets environment variables inside the container. Here we point PySpark at Python 3, make Jupyter the PySpark driver, and put Spark's bin directory on the PATH.
EXPOSE 8888 4040
Docker containers generally don't expose their ports to the outside world, not even to the local system you are working on. So we need to explicitly declare the ports that are required; in our case, ports 8888 and 4040 for Jupyter notebooks and the Spark UI.
The WORKDIR instruction sets the given directory as the present working directory. This comes in handy when you want to issue a command from the application directory.
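In our image this could be as simple as the following (the directory is an assumption consistent with the ADD step that comes next):

```dockerfile
# Later instructions, and the container's default shell, start from here
WORKDIR /home/ubuntu
```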
ADD notebooks /home/ubuntu/notebooks
The ADD instruction copies your local files into the Docker image; you can copy whole folders as well. Note that the source path is resolved relative to the build context (the directory you run docker build from), so here we assume a notebooks folder sitting next to the Dockerfile; an absolute host path would not resolve as you might expect.
From the Docker documentation:
ADD <src> ... <dest>
The ADD instruction copies files and folders from <src> to the <dest> location.
The CMD instruction, usually the last line of a Dockerfile, specifies the default command to execute when a container starts.
From the Docker documentation: "There can only be one CMD instruction in a Dockerfile. If you list more than one CMD then only the last CMD will take effect. The main purpose of a CMD is to provide defaults for an executing container. These defaults can include an executable, or they can omit the executable, in which case you must specify an ENTRYPOINT instruction as well."
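For this image, a plausible default (an assumption consistent with the ENV settings earlier, not necessarily the original file) is to launch pyspark, which in turn starts the Jupyter server:

```dockerfile
# Default command; anything passed after the image name in `docker run`
# (e.g. `docker run -it pyspark:ver1 bash`) replaces it
CMD ["pyspark"]
```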
Building a Docker Image
We are now ready to build an image using the recipe in our Dockerfile. You can accomplish this with the command below.
docker build -t pyspark:ver1 .
The above command builds a Docker image named pyspark with the version tag ver1. Please note that this builds a Docker image, not a container (reread the terminology at the beginning of this post if you don't remember the difference).
Run Container from Docker Image
Finally, we are ready to run our shiny new first container. The command below maps port 8888 in the container to port 8888 on the host, so that we can access the Jupyter notebook.
docker run -it -p 8888:8888 pyspark:ver1
After the command executes, the Jupyter notebook server is up and running. Copy the URL it prints and paste it into your browser to open the notebook.
List running containers
To list the running containers in your Docker environment:
docker ps
Push your docker image to Dockerhub
When you decide to share your work publicly or with your colleagues, just push the image to DockerHub, a public repository for images.
docker login
docker tag pyspark:ver1 <your-dockerhub-username>/pyspark:ver1
docker push <your-dockerhub-username>/pyspark:ver1
Before pushing, you need to log in with your DockerHub credentials and re-tag the image under your DockerHub username (shown as a placeholder above). This is useful for people who want to reproduce and extend your research.
Further Learning to Manage containers in Production
Production grade container orchestration
If you want to run thousands of containers in production, with scale-in and scale-out, declarative deployment management, service discovery, load balancing, and automated roll-outs and roll-backs, you need a container orchestrator.
Stay tuned for my next blog: Kubernetes Container Orchestration