An Introduction to Automating Data Processes with Docker

Featuring python, venv, and Azure blobs

Charlotte Patola
CodeX
13 min read · Mar 29, 2022


Docker logo from https://www.docker.com/, Azure logo from https://azure.microsoft.com/ & python logo from https://www.python.org/

What is Docker?

Docker is a tool that can be used to create a virtual environment, or a container, for your code to run in. If you run a piece of code on your local machine, for example, print('Hello') in python, the code will run in the operating system (OS) of your computer, with the version of python and the python packages you have installed, within your current file structure, and possibly using your locally defined environment settings.

With Docker, you can define that your code shall be run in a completely different environment. The definition of this environment is called an image. On Docker Hub, you find generic base images where only the OS is set, but also more specific images. An example of the latter is the python base image. Here Linux Debian is used as the OS, base settings are defined, and python is installed. Base images are used as starting points for your own images, i.e., environment definitions.

Anyone can upload images to Docker Hub and use them for their own needs, a bit like with GitHub or GitLab. This means that you will also find complete projects on Docker Hub, for example, images that run whole applications.

Why should I use Docker?

Using Docker is a way to solve the problem of code running on one person’s local machine but not on another one’s. These problems arise since the environments of the local machines differ.

Developer A might have version 2.2 of a python package, while developer B has version 1.9. A used a new function that was added in version 2.2, and thus the code does not work for B, who has the earlier version of the package. This situation could also arise if A later updates a package used in the code. The new version might no longer be compatible with the other packages used, and the code, therefore, no longer runs.

Another reason for using Docker (and Docker Compose, which is a way of orchestrating runs of several Docker containers at the same time) is that it drastically simplifies development, especially for large applications. If you are developing a web application for which you have a frontend, a backend, and several databases, with Docker you can deploy them all at once with a single command!

Even though the benefits of Docker might be most evident when multiple people work together on web applications, it also brings benefits in smaller projects. Docker might not make it easier to develop and run your code, but it will surely save you the hassle of debugging environment mismatches (which is not a negligible problem!).

Tutorial Prerequisites

To follow this docker tutorial, you

Need to:

  1. have admin rights to your computer. If you do not have these rights on your physical computer, you can create a virtual machine via Azure/AWS/Google Cloud and use that one instead
  2. install docker. You will find the instructions here. If you use Linux, you should also create a docker user group so you don’t have to type sudo before all docker commands. You find instructions for that here.

Good to have:

  3. a basic understanding of python
  4. pip installed
  5. venv installed
  6. an Azure account with a Storage account tied to it (for part 2 of the demo)

Our Demo in a Nutshell

In this demo, we are going to:

  1. Read a CSV file with course feedback data into python
  2. Add a column to the data frame
  3. Save the data frame to a new CSV file
  4. Make the above happen within a Docker container, which is run from an image that we will define
  5. Schedule the run of the docker container with crontab

We start by reading and writing the files locally (Demo part 1) and then switch to reading and writing from Azure Containers (Demo part 2).

Main building blocks of our Project

To dockerize our new project, we need the project directory to have the following structure:

Project Directory Structure
  1. The code we want to run, in this case, a file named handle_data.py
  2. A list of packages to install that the Dockerfile will refer to. In this case, named requirements.txt. We create the package list with the help of the venv package
  3. A Dockerfile. This is the definition/recipe for the environment in which we want to run the code (operating system, python version, python packages…)
  4. A data folder for our input and output data
  5. A .dockerignore file containing names of files and folders not needed when the docker image is built. In our case, one line with /venv and one with .env.
  6. (A .env file holding environment variables referred to in handle_data.py.)
  7. (If you use version control, a .gitignore file having one line with /venv and one with .env)
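
Putting the pieces together, the project directory could look something like the sketch below (the venv folder appears once the virtual environment described later in this tutorial has been created):

<project directory>/
├── data/
│   └── courses.csv
├── venv/
├── .dockerignore
├── .env
├── .gitignore
├── Dockerfile
├── handle_data.py
└── requirements.txt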

Demo Part 1: Local Files

In the first part of the demo, we will read and write files from our local computer. This is an easy way to get started and does not require connectors to outside sources.

1. Code

Our code in the file handle_data.py is pretty straightforward. We start by importing pandas and then read, transform, and write the data. The file paths will be created and used within the Docker image, so the part before /data does not necessarily have to correspond to your local file structure.

import pandas as pd

# Read base data from local csv file
INPUT_FILEPATH = './data/courses.csv'
course_feedback = pd.read_csv(INPUT_FILEPATH)

# Create overall grade column
course_feedback['Overall'] = course_feedback.iloc[:, 4:7].mean(axis=1).round(2)

# Write new data frame to CSV locally
OUTPUT_FILEPATH = './data/course_feedback_finished.csv'
course_feedback.to_csv(OUTPUT_FILEPATH, encoding='utf-8', index=False)
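
Note that iloc[:, 4:7] picks out the fifth through seventh columns of the data frame (positions 4, 5, and 6), so the input file is assumed to have its three numeric grade columns in exactly those positions. As a purely hypothetical illustration, a courses.csv with a header like the one below would work, with Content, Materials, and Teaching being averaged into Overall:

Course,Teacher,Term,Respondent,Content,Materials,Teaching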

2. Virtual Environment

Next, we need a list of all packages the code above uses so that we can make sure they are all available in our Docker image and the code will run smoothly. Even though we only import pandas in the code, it does not mean that pandas is the only package needed. Pandas has dependencies that we also need to include on the list. One could google what they are and what their most current versions are. However, an easier way is to use a python virtual environment.

The idea behind virtual environments is similar to the idea of Docker. Where Docker creates a separate environment starting from the OS and ending with the execution of your project code, a python virtual environment creates a separate environment starting with the python version and ending with the python packages and their versions. When we create a python virtual environment for a specific project, we can install only the python packages (and versions of them) that we need for that specific project. Hence we get a slim and clean environment. When we then list the installed packages in the virtual environment, the list is much shorter than if we listed all the packages ever installed in our base environment. Our Docker image will perhaps not break if we install a lot of extra packages into it, but it increases the size of the image (which can be quite large anyway).

With the help of the python package venv, we get our thin package list like this:

  1. When you are inside the project directory, run python -m venv ./venv to create a virtual environment named venv in that directory.
  2. Run source venv/bin/activate to activate the environment
  3. Now install pandas. Pip will install needed dependencies automatically.
  4. Save the list of packages to a text file that the Docker image can refer to: pip freeze > requirements.txt. The file has now been created in the root of your project directory
  5. Deactivate the virtual environment: deactivate
  6. (If you use version control, ignore the virtual environment directory)
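
For reference, because only pandas and its dependencies are installed in the virtual environment, the generated requirements.txt stays short. At the time of writing it would look roughly like this (the exact package versions on your machine will differ):

numpy==1.22.3
pandas==1.4.1
python-dateutil==2.8.2
pytz==2021.3
six==1.16.0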

3. Dockerfile

The Dockerfile is a text file that (usually) has the name Dockerfile without any extension. The “recipe” it contains is to be read top-down. The type of every step in the recipe (FROM, COPY…) is written with capital letters. You find detailed info on the file structure here.

Every Dockerfile starts with a FROM clause. Here we define which image we want to use as our starting point. As we will be running a python script, we can use an image with python installed. I chose python version 3.9.10 built on Linux Debian’s bullseye version.

After updating pip with RUN pip install --upgrade pip (installations are done with the RUN command), we create a working directory to use within this Docker image. This makes it easier to manage files within the image and to map them to files on your local computer (more about that in a minute). I chose the path /handle_data and wrote that after the WORKDIR clause.

The next step is to install all the packages we need. This is done by first copying (COPY clause) requirements.txt from our local computer to the docker image and — after running pip upgrade — installing all the packages with pip install.

After this, we copy all the files from the project directory on our local machine to the docker image.

Lastly, in the CMD, i.e., command step, we define that the file handle_data.py shall be run with python when we use this image. All steps up to this point will be executed when we build our image (mix the ingredients), but this one will only be executed when we run a container from the image (start consuming the finished image).

FROM python:3.9.10-bullseye

RUN pip install --upgrade pip

WORKDIR /handle_data

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD [ "python", "handle_data.py" ]

4. Place to Store Input and Output Data

In the first part of the demo, we will read the input data locally from a file. We, therefore, create a data folder that contains this file.

5. Time for Action!

We are now ready to build our Docker image and then run a container from it!

We aim to read the input CSV file from the data folder, let handle_data.py add a new column to the data frame, and write out the result as a new CSV file to the data folder. We start by building the image. While in the root of the project directory, run docker image build . -t <the name you want to give your image>. I named the image handle_data.

The build will take a few minutes and you can follow the process in your terminal. If there are any problems during the process, you will get an error message and can investigate it from there.

Our image is now ready for use. The basic way of running a container from an image is to use the command docker run <image name>. However, when running a container, we must be aware that it is running in its own environment, separately from our local machine. The files will be read and written inside the container, and we will not be able to access them from our local machine. This is not what we want right now.

Fortunately, there is a way to fix this. We can map a directory on our local machine to the working directory we set up in the Docker image. This way we can make the output CSV file appear in the data folder in the project directory on our local machine. This is done with the -v flag, followed by the path on the local machine (we can use $(pwd)/data if we run the container directly from the project directory), a colon, and the path in the Docker image.
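
In its generic form, the mapping part of the command looks like this (local path first, container path second):

docker run -v <path on local machine>:<path inside the container> <image name>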

When running a container from a new Docker image, it is also a clever idea to map the Docker terminal output to the terminal of our local machine. This way we can see info or error messages if there are any (otherwise these would only be visible inside the container). This is done with the flag -it.

The full command to run the container from within the project directory is now: docker run -it -v "$(pwd)/data:/handle_data/data" handle_data.

When the code has run successfully, you should see the output file course_feedback_finished.csv in the data folder in your project directory.

It is great that we have now successfully set up a basic docker structure. However, how often do we need to automate a file read and write on our local computer? Probably not so often. Usually, we read from one external source and write to another external source. This is what we will be doing in part 2 of this demo.

Demo Part 2: Files in Azure Containers

In this part of the demo, we will read data from one Azure container and write it to another. Both containers are private, and we use Shared Access Signature (SAS) URLs for authentication.

1. Code

Our code requires a few updates to read and write from Azure containers. To write to our Azure containers, we need to import ContainerClient from azure.storage.blob. We will store our SAS credentials as environment variables and therefore need to import environ from os to access them.

Next, we change the path given to read_csv to use the value of the new environment variable URL_TO_INPUT_BLOB. Finally, we add a few lines to write the output file to our container in Azure (URL_TO_OUTPUT_CONTAINER). You find all the updates in the code below.

from os import environ
import pandas as pd
from azure.storage.blob import ContainerClient

# Read base data via blob sas url
course_feedback = pd.read_csv(environ.get('URL_TO_INPUT_BLOB'))

# Create overall grade column
course_feedback['Overall'] = course_feedback.iloc[:, 4:7].mean(axis=1).round(2)

# Write new dataframe to csv locally
LOCAL_FILEPATH = './data/course_feedback_finished.csv'
course_feedback.to_csv(LOCAL_FILEPATH, encoding='utf-8', index=False)

# Write local csv to Azure blob
cont_cli = ContainerClient.from_container_url(environ.get('URL_TO_OUTPUT_CONTAINER'))
with open(LOCAL_FILEPATH, 'rb') as data:
    cont_cli.upload_blob('course_feedback_finished.csv', data, overwrite=True)
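
As a side note, the intermediate local file is not strictly required. The write section at the end of the script could, for example, be replaced with an in-memory upload, a minimal sketch using the same ContainerClient and environment variables:

# Build the CSV in memory and upload the bytes directly to the Azure container
cont_cli = ContainerClient.from_container_url(environ.get('URL_TO_OUTPUT_CONTAINER'))
csv_bytes = course_feedback.to_csv(index=False).encode('utf-8')
cont_cli.upload_blob('course_feedback_finished.csv', csv_bytes, overwrite=True)

The demo keeps the local copy in ./data, which can be handy for debugging, so treat this only as an alternative.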

2. Virtual Environment

As we just imported more modules into our code, we need to update our package list requirements.txt. Do it by going through steps 2–5 in the virtual environment part of Demo part 1, but exchange pip install pandas for pip install azure-storage-blob. We do not need to install os, as it is part of the python standard library and therefore already available in the python base image we use as FROM in our Dockerfile.

3. Environment Variables

As you saw in the changes in our handle_data.py file, we will use two different Azure containers, one as input and one as output. Let’s create those!

When you have signed in to your Azure account and gone to your storage account, start by creating a container where you will store the input data as a blob. When you have uploaded the file, left-click it and choose “Generate SAS”. In the menu that pops up, choose the timeframe for which the SAS is to be valid and click “Generate SAS token and URL”. Don’t close the browser tab.

Input Blob
Generate SAS for blob

Next, open a new browser tab and create a new container for your output data. Left-click that container and choose “Generate SAS” again. Now, be sure to choose permissions that are extensive enough. I checked all but “Immutable storage”. Then, choose the validity period and click “Generate SAS token and URL” again. Don’t close the browser tab.

Generate SAS for output container
SAS options

We will pass the environment variables via a file named .env. Create that file in the project directory and write one line with URL_TO_INPUT_BLOB=the SAS URL of your input blob and one with URL_TO_OUTPUT_CONTAINER=the SAS URL of your output container.

.env file model
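
Written out, the .env file is just two lines of KEY=value pairs (no quotes, no spaces around the equals sign). The values below are placeholders; paste in the SAS URLs you generated in the previous step:

URL_TO_INPUT_BLOB=https://<storage account>.blob.core.windows.net/<input container>/courses.csv?<sas token>
URL_TO_OUTPUT_CONTAINER=https://<storage account>.blob.core.windows.net/<output container>?<sas token>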

4. Time for Action!

Our modifications are done! Let’s build the updated Docker image again: docker image build . -t handle_data.

This time, when we run a container from the image, we won’t need to map any local folder to a Docker folder. Instead, we need to tell Docker where it will find the environment variables we have referred to in our code. This is done with the flag --env-file, followed by the path of the file.

The -it flag is still good to keep, as this is the first time we use the updated version of the image and we want to see possible error messages. If you run it from the root of the project directory, the full run command is docker run --env-file .env -it handle_data.

If everything works, you will see the output CSV in your Azure output container.

5. Schedule Automatic Runs of the Container

Now that we have tested the image and know that it works, we can schedule the container to run regularly, for example, every night. For this, we use the scheduling tool crontab, which is available on Linux and Mac machines.

Open crontab in the terminal by writing crontab -e. To run the container once a day at 7 pm, add a line like this:

0 19 * * * cd <full path to your project directory on your local computer> && docker run --env-file .env handle_data
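
For reference, the five leading fields of a cron entry are minute, hour, day of month, month, and day of week, so 0 19 * * * means “at minute 0 of hour 19, every day”. To run the same container at 2 am instead, the entry would simply start differently:

0 2 * * * cd <full path to your project directory on your local computer> && docker run --env-file .env handle_data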

If you want to learn more about how cron works, you can have a look at this guide.

Cleanup

If you have a few Docker images on your system, they quickly start taking up a significant amount of space on your machine. It is good to make it a habit to clean up images you don’t need anymore. You can inspect all your images with the command docker image ls. Remove images you no longer need with the command docker image rm <image name>.

It is also good practice to remove redundant containers. You can do this one by one, like for images, but a faster way is to use docker system prune. This will remove all containers that are not currently running. Build cache and unused networks will also get deleted. You find more info about the command here.

Conclusion

We now have a way to run our code on schedule from any machine. No matter what OS, python packages or file structure the local machine has, our code will always be run on the same OS, with the same packages and the same file structure.

With this, we can enjoy an easily transferable and stable execution of our code.

If you are interested in learning more about docker, I recommend the University of Helsinki’s Docker MOOC.
