Docker: The most comprehensive tool for Data Science projects

A step-by-step guide for Docker newbies

--

When I was working in post-production in the film industry, the first thing we always focused on was the workflow. It was critical because it touched every department. We constantly needed to exchange different kinds of files, such as EDL, XML, and AAF, along with footage in various formats. Without a clear workflow and specification, it would be hard for other people to continue the work if we had to hand it over, and hard for the team to collaborate.

The same situation came up when I was working on the final project with my team members at the Le Wagon Data Science Bootcamp. We split the tasks among members and used GitHub for collaboration: making pull requests, merging, and so on. However, we ran into an environment setup issue at our very first merge. It turned out the team had 3 different operating systems with different library versions, which caused the problem. In the end, the solution was simply to uninstall and reinstall the libraries, but the issue still cost us time to figure out.

When I read an article from S Ahmad about how Docker works, it immediately caught my eye. Docker is a great solution for CI/CD (Continuous Integration and Continuous Delivery) and definitely helps with collaboration in a production development environment. Before the guide starts, I want to share some articles that offer great explanations and deeper knowledge if you are an experienced coder.

The above articles are great and informative; however, they were hard for me at the very beginning. Eventually, I figured out the key concepts, and I share them with you here.

There are 2 purposes and 3 elements you need to know to understand why we use Docker for our Data Science projects.

2 Purposes:

  • Reproducibility
  • Portability

3 Elements:

  • Dockerfile
  • Image
  • Container

Simply put, if you have experience using virtual machines, you will quickly get the concept. The process is: use the “Dockerfile” to create the “Image”, and then run it as a “Container”.

If you need more information about these key concepts, I strongly recommend you take some time to read Hamel Husain’s article (Link).

Step by Step Guide

Note: I only build a very simple image for this demonstration. The process will vary depending on the scale of your work.

📌 Build the Dockerfile and .dockerignore

Think of the Dockerfile as a series of installation steps. It’s like installing a new virtual Linux OS with a set of default rules. A minimal example follows the instruction descriptions below.

FROM: Consider this as the base system you want to install. Each image will be an independent Linux system.

LABEL: This is metadata that provides information about the image.

WORKDIR: It sets the default working directory in the virtual machine, creating it if it doesn’t exist.

COPY: Copy all files from the current directory (your project directory) to your virtual machine’s working directory.

RUN: (Assuming you already have a requirements.txt file) It automatically runs the command you give it, in this case installing the listed packages. If you are curious about how to create this file, please check out my previous article here👈.

EXPOSE: This documents which port the container listens on, so that the port can be published to the host when you run the container.

CMD: This is the default command executed when you start running the container.

It’s strongly recommended that one image serves only one purpose and runs one command.
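
To put these instructions together, here is a minimal sketch of what such a Dockerfile could look like for a Jupyter-based project. The base image, label, paths, and command are assumptions for illustration, not the exact file from my project:

# Assumed example: minimal Dockerfile for a Jupyter-based data science project
FROM python:3.8-slim
LABEL maintainer="your_name"
WORKDIR /app
# Copy the project files into the working directory of the image
COPY . /app
# Install the dependencies (requirements.txt is assumed to include jupyter)
RUN pip install -r requirements.txt
# Document the port Jupyter will listen on
EXPOSE 8888
# Default command when the container starts
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]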

.dockerignore: It works just like a “.gitignore” file. You can exclude unnecessary files from the image-building process.
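
For example, a .dockerignore for a typical Python project might contain entries like these (the exact list depends on your project; these are assumptions):

.git
__pycache__/
.ipynb_checkpoints/
data/raw/
.env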

Once you have the files ready, you can execute the command below to build your image. (❗Don’t forget the period at the end if you are in the project folder; otherwise, give the full path to the project folder where your source is.)

docker build -t <image_repository> .

📌 Basic Docker functions

Once the image is built, there are some basic commands for managing your images and containers.

Check all available images

docker images

Rename the image tag

docker image tag <image_id> <new_image_name>:<image_version>

Remove image
(Note: if the image has been used in any container, you might need to force remove the image, or remove the related container first.)

docker rmi <image_repository> or <image_id>
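
If you do need to force the removal mentioned in the note above, the -f flag does that (use it with care):

docker rmi -f <image_id>     # Force-remove the image, e.g. when a stopped container still references it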

Check container status

docker ps       # Check currently running containers
docker ps -a    # Check all containers (running and stopped)

Remove container(s)

docker rm <container_id> or <container_name>    # Remove one container
docker rm $(docker ps -a -q) # Remove all containers

Stop the container

docker stop <container_id> or <container_name>

Run the container
(Note: Running a container can be tricky because there are many different flags that change its behavior. In this example, I use it to run a Jupyter notebook. If you have a different purpose, I suggest you check the documentation here.)

docker run --rm -it -p 8888:8888 --name=custom_name -v /host/project/folder/path:/image/target/path <image_repository>

I’m going to go through this code and explain how it works.

--rm: Clean up the container. By default, a container’s file system persists even after the container exits. If you are running a short-term foreground process and you don’t want containers to pile up, adding --rm automatically removes the container when it exits.

-it: For an interactive process, you can use “-it” to keep stdin open and allocate a tty for the container process.

-p: This is the port configuration. It works together with EXPOSE in your Dockerfile: you publish a container’s port (or a range of ports) to the host. The format is <host port>:<container port>, for example 8888:8888.

--name: Assign a name to the container.

-v (mount): Bind-mount a host directory into the container. This is the key element for keeping your codebase changes in sync while you work. When I mount the host directory, which is my project directory, onto the container’s working directory, the code inside the container automatically stays up to date. This way, you don’t need to rebuild the image every time you modify your code, which is time-consuming. Keep in mind that with this method the mounted code only tracks the current version. If you want to record your progress as a separate version, I recommend using docker commit to create a new image first, and then modifying the code on top of that new version.
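
A sketch of that commit-then-continue workflow might look like this (the image name my_project and its tag are placeholders):

docker commit <container_id> my_project:v2     # Snapshot the current container as a new image version
docker run --rm -it -p 8888:8888 -v /host/project/folder/path:/image/target/path my_project:v2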

I still recommend Hamel Husain’s article because he explains each flag well (Link).

Copy files from / to the container

docker cp /host/file/path container_name:/container/file/path    # host -> container
docker cp container_name:/container/file/path /host/file/path    # container -> host

This is very useful when you have new data you want to add to the container, or when you want to copy the container’s contents to the host.

⚙ Execute a shell in the container

docker exec -it container_name bash

This allows you to browse the file system in a shell, just like operating a Linux system. If you don’t set up a specific user, you will be the root user by default.

Docker Hub

Just like GitHub, Docker Hub provides a platform for people to share images. A free account allows only 1 private image; however, $5/month for unlimited private images on Docker Hub might be a good option for building a personal image library.

Pushing an image to Docker Hub is easy and straightforward. Before you push, make sure of 2 things:

  1. Register a Docker Hub account
  2. Rename your image repository to this format: username/image_name

docker push username/image_name
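
For example, the renaming and push could look like this (username and my_project are placeholders for your own account and image names):

docker login                                             # Log in to your Docker Hub account
docker image tag my_project:v1 username/my_project:v1    # Rename to the username/image_name format
docker push username/my_project:v1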

Once you have successfully pushed to Docker Hub, you can remove the local image, pull it from the hub, and check that it works.

docker pull username/image_name

Note: The image you upload to Docker Hub will be private by default. If you are on a free account, I would assume the second image you push will have to be public, unless you have unlimited private images enabled.

Docker Hub has many pre-built images that users can pull and work with immediately. It’s recommended to download the official images, because an image can be created by anyone and we don’t know whether it contains any suspicious files.

One cool thing about Docker Hub is that I can connect my GitHub repo to the image and enable automatic builds of a new image whenever I git push to master, just like Heroku’s automatic deployment.

This article just scratches the surface of Docker. It has many advanced features, such as composing multiple containers together, creating and mounting new volumes, and so on. You can take time to read the articles I recommended, or watch this video here👈.
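
As one small taste of those features, creating and mounting a named volume could look roughly like this (my_data is a placeholder name):

docker volume create my_data                                            # Create a named volume managed by Docker
docker run --rm -it -v my_data:/image/target/path <image_repository>    # Mount the volume into a container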

Please feel free to leave a comment if you spot any misinformation or have more advanced suggestions for using this powerful tool.

Happy coding.
