Create a compact and transferable dockerized environment for data scientists
The world’s leading research and advisory companies, such as Gartner¹, predict upcoming trends for the data and analytics markets. According to Gartner¹, the number one trend is Augmented Analytics. Using machine learning and AI techniques, Augmented Analytics transforms how analytics content is developed, consumed, and shared. Augmented Analytics is expected to be a 2020 milestone in IT and the main driver of purchases of analytics and data science platforms.
At ableneo, we provide the full range of data science competencies in line with the latest trends, such as Augmented Analytics, machine learning research, and cutting-edge technologies. ableneo’s values focus strongly on teamwork and customer satisfaction, and this also applies to the competencies of our data scientists.
In this blog post, we will show you how to get a fast start with Augmented Analytics anywhere, whether at your company, on your personal machine, or at the customer’s site, using a dockerized environment for data scientists.
Competence of a Data Scientist
A valuable data scientist in a company not only designs and conducts analyses but also incorporates them into pipelines and, above all, integrates these pipelines into the business. Nowadays, data scientists need the skills of data engineers and machine learning engineers. Besides standard statistical analysis and machine learning modeling, many data scientists design advanced solutions using deep learning or reinforcement learning.
Working for the customer requires a preset or pre-arranged environment for fast prototyping of analyses, business insights, and modeling. There is simply no time for installing drivers, libraries and packages for all those kinds of analyses and models.
Here comes the solution — Docker!
In short, a Docker container encapsulates an application with all its dependencies. Containers provide reproducible and reliable execution of applications without the overhead of a full virtual machine.
Standard Docker containers only allow CPU-based applications to be deployed across multiple machines, because a standard container cannot communicate with the host’s GPU. nvidia-docker solves this problem. From NVIDIA GPU Cloud, we can simply pull an nvidia-docker image with our favorite deep learning framework, preinstalled Python, and useful libraries such as NumPy or pandas.
Figure 1 depicts how the nvidia-docker tools sit above the host operating system and its GPU drivers. These tools create containers, and the containers hold the applications, the deep learning SDKs, and the CUDA Toolkit.
Without nvidia-docker, the deep learning SDKs and the CUDA Toolkit must be installed on each machine separately, which is often tedious and time-consuming.
Currently, the only drawback is limited compatibility: only Windows Server (2016 or later), Linux distributions, and macOS are supported. You cannot install nvidia-docker on a Windows PC, because Hyper-V there lacks GPU passthrough. Passthrough would require Discrete Device Assignment, which is currently available only in Windows Server.
Docker container ready for data science
- Install Docker and nvidia-docker
- Follow the working installation guide at https://docs.nvidia.com/deeplearning/dgx/preparing-containers/index.html (tested on Ubuntu 18.04)
- Register an account on NVIDIA GPU Cloud: https://ngc.nvidia.com/catalog/containers
- Create an API key for the account
- Configuration -> Get API key -> Generate API key -> Confirm
You will need the API key when pulling a Docker image from NVIDIA GPU Cloud.
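The login step that uses this key can be sketched as follows. This is a minimal sketch, assuming the key is stored in an environment variable called NGC_API_KEY (a hypothetical name chosen for illustration). For the nvcr.io registry, the username is the literal string $oauthtoken and the password is the API key.

```shell
#!/usr/bin/env bash
# Sketch: log in to the nvcr.io registry with an NGC API key.
# NGC_API_KEY is a hypothetical variable name used only for this example.

ngc_login() {
    if [ -z "${NGC_API_KEY:-}" ]; then
        echo "NGC_API_KEY is not set" >&2
        return 1
    fi
    # The username is the literal text $oauthtoken; the single quotes
    # prevent the shell from expanding it as a variable.
    # The key is passed on standard input instead of on the command line.
    printf '%s' "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
}
```

After exporting NGC_API_KEY, calling ngc_login once should be enough for the subsequent pulls from nvcr.io. Passing the key on standard input keeps it out of the shell history.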
Now you can pull your favorite image from the cloud and create a working container:
docker pull nvcr.io/nvidia/pytorch:19.01-py3
nvidia-docker run -it --rm -v local_dir:container_dir nvcr.io/nvidia/pytorch:19.01-py3
- local_dir: absolute path on your host system that you want to access from inside the container
- container_dir: target directory inside the container
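Because local_dir has to be an absolute path, a small wrapper can resolve it before starting the container. The following is only a sketch: the function name run_with_mount and the example paths in the usage note are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: start the PyTorch container with a host directory mounted into it.
# run_with_mount is a hypothetical helper, not part of Docker or nvidia-docker.

run_with_mount() {
    local host_dir container_dir
    # Resolve the host directory to an absolute path; fail if it does not exist.
    host_dir="$(cd "$1" && pwd)" || return 1
    container_dir="$2"
    # --rm removes the container on exit; -v maps host_dir to container_dir.
    nvidia-docker run -it --rm \
        -v "${host_dir}:${container_dir}" \
        nvcr.io/nvidia/pytorch:19.01-py3
}
```

For example, run_with_mount ./data /workspace/data would make the contents of the local data directory visible under /workspace/data inside the container. Both paths are placeholders.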
You are now ready to use the container with a working PyTorch framework and other Python libraries.
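Before starting real work, it can be worth checking that the GPU is actually visible from inside the container. The sketch below assumes it is run in a shell inside the container; check_gpu is a hypothetical helper name, and the messages it prints are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: report whether the GPU is visible from the current environment.
# check_gpu is a hypothetical helper intended to be run inside the container.

check_gpu() {
    local py
    # nvidia-smi is only present when the container runs with the NVIDIA runtime.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader \
            || echo "nvidia-smi is installed but could not query a GPU"
    else
        echo "nvidia-smi not found: was the container started with nvidia-docker?"
    fi

    # Ask PyTorch whether it can see a CUDA device, if PyTorch is importable.
    py="$(command -v python3 || command -v python || true)"
    if [ -z "$py" ]; then
        echo "no Python interpreter found"
        return 0
    fi
    if "$py" -c 'import torch' >/dev/null 2>&1; then
        "$py" -c 'import torch; print("CUDA available:", torch.cuda.is_available())'
    else
        echo "PyTorch is not importable here"
    fi
}
```

If PyTorch reports that CUDA is not available, the container was most likely started with plain docker run instead of nvidia-docker run.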
Nvidia-docker with Jupyter Lab
However, this container lacks an environment for developing our analyses, applications, or prototypes. Jupyter Lab is a popular tool for this. To combine additional packages and libraries such as Jupyter with a Docker image provided by NVIDIA GPU Cloud, we first need to build our own image from a Dockerfile.
- Create a file named “Dockerfile”
- Enter the following:
FROM nvcr.io/nvidia/pytorch:19.01-py3
RUN pip install jupyter jupyterlab
- Build the image from the directory that contains the Dockerfile
docker build -t my-pytorch-container .
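The Dockerfile and build steps can also be scripted. The sketch below assumes that the image should extend the NGC PyTorch image pulled earlier and that Jupyter Lab is installed with the jupyterlab package; the helper names write_dockerfile and build_image are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: write a Dockerfile that extends the NGC PyTorch image, then build it.
# write_dockerfile and build_image are hypothetical helper names.

write_dockerfile() {
    cat > Dockerfile <<'EOF'
# Base image: assumed to be the NGC PyTorch image pulled earlier.
FROM nvcr.io/nvidia/pytorch:19.01-py3

# Add Jupyter and Jupyter Lab on top of the preinstalled libraries.
RUN pip install jupyter jupyterlab

# Document the port that Jupyter Lab will listen on.
EXPOSE 8888
EOF
}

build_image() {
    # Build in the current directory and tag the result as my-pytorch-container.
    write_dockerfile && docker build -t my-pytorch-container .
}
```

Running build_image in an empty project directory creates the Dockerfile there and builds the image from it. EXPOSE only documents the port; it still has to be published with -p when the container is started.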
We have now built our image. Volume mounting, a very handy feature, lets the container automatically save files created in a specific container folder to the host.
- Create a volume
docker volume create PyTorch
- Start the container with the NVIDIA runtime, the volume mounted, and port 8888 published for Jupyter Lab
docker run --runtime=nvidia -it -p "8888:8888" -v PyTorch:/project-files my-pytorch-container
- Run Jupyter Lab inside the Docker container
jupyter lab --port=8888 --ip=0.0.0.0 --allow-root
We are now ready to create solutions using PyTorch and Jupyter notebooks inside our nvidia-docker container, which has access to the host’s GPU resources. Furthermore, notebooks that we create under /project-files are automatically stored on the host machine inside the PyTorch volume. The -v parameter also mounts the current content of the PyTorch volume into the container.
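The volume, container, and Jupyter Lab steps can be combined into a single command. This is a sketch under the assumptions used in this article (volume PyTorch, image my-pytorch-container, mount point /project-files); the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: start Jupyter Lab in one step and locate the volume on the host.
# start_jupyter_lab and volume_mountpoint are hypothetical helper names.

start_jupyter_lab() {
    # Creating a volume that already exists is harmless, so this is safe to repeat.
    docker volume create PyTorch >/dev/null &&
    # -w sets the working directory, so new notebooks land inside the volume.
    docker run --runtime=nvidia --rm -it \
        -p 8888:8888 \
        -v PyTorch:/project-files \
        -w /project-files \
        my-pytorch-container \
        jupyter lab --port=8888 --ip=0.0.0.0 --allow-root
}

volume_mountpoint() {
    # Print the host directory where Docker stores the volume's files.
    docker volume inspect --format '{{ .Mountpoint }}' PyTorch
}
```

Setting the working directory to /project-files is a design choice: Jupyter Lab starts in that folder, so notebooks are saved into the volume by default rather than into the container's own file system, which is discarded when the container is removed.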
Let’s sum it up
We have shown how to create a compact, transferable, and reproducible environment using nvidia-docker containers. By following these simple steps, we can create a dockerized environment on any virtual machine in the cloud. At ableneo, we try to maximize efficiency when working for customers, and this setup lets us concentrate on the main parts of our data science work.
Stay tuned for more data science topics from ableneo #BeAbleToChange