Building ML Pipelines

Dockerizing Your Code

John Aven
Hashmap, an NTT DATA Company
10 min read · Nov 19, 2020


Machine learning continues to evolve. While many folks are moving toward more modern approaches, there is still apprehension around using Docker and Docker-related technologies in much of data science. Making the change to a containerized solution like Docker is not necessarily an easy step to take, much like moving your data science workloads to the cloud wasn't (and maybe you still haven't taken that step).

Docker, often called a containerization technology, uses an abstraction of virtualization that reuses the host system's Linux kernel to package and run an application as a platform-independent, immutable deployment unit. This packaging creates immutable software deployments meant to be identically executable (and often reusable) across any platform with a running Docker or Kubernetes installation.

Why?

Now, what does this have to do with building an ML pipeline? That's an easy question. The difficult one is how to dockerize your data science code.

Data science as a practice has some shortcomings (many sciences really do), though scientists don't really want to admit it. In data science, the shortcomings lie in how experiments are executed, how results are managed, and how executable code is versioned. In reality, if you want your experiments to be as ethical as possible, then you want completely versioned steps end-to-end. But we aren't here to discuss the end-to-end problem: baby steps are in order.

An incremental improvement in your data science/machine learning practice is to get into the habit of dockerizing (creating Docker executable packages/images) the various components of your pipeline. Each piece, or an appropriate collection of executable pieces, should be built into an immutable, versioned, and tagged Docker image. This will, much more than version control alone, ensure that model training is repeatable regardless of where it runs. A Docker hub/registry, whether public or private, should be used so that you can store copies of all of your execution artifacts. All of this (and more) will help make sure that you are using ethical and compliant ML practices.

Stitching the execution of these pipelines together is not always a simple matter. It takes some time to educate yourself on building the pipelines and to settle on the approach that best fits you and your team.

Dockerizing Your Pipeline

The first step in dockerizing your code is to know the basics of Docker. And since I have no intention of creating a tutorial on Docker at this time, I refer you to this excellent tutorial. The remaining steps are:

  1. Create a Dockerfile for each section of your pipeline that you wish to keep separate in execution from the others. This separation can be done at language boundaries (R vs. Python), library version differences, natural/organic separations (feature selection vs. model training), data locality (on-premises vs. cloud), and so on.
  2. Choose a base Docker Image that will satisfy most of your requirements; there are many pre-built Docker images tailored to handle different machine learning needs. As a rule, though, most pre-built images will not fit your specific needs exactly. You should create images that fit your common stack so that you minimize build times for repetitive actions (like rerunning pipelines with small changes). The Docker image used as your base is declared with the FROM statement.
Base Docker image for TensorFlow
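
For illustration, such a line might look like the following; the tensorflow/tensorflow GPU tag shown here is just an example, and you would pick whatever official image matches your stack:

    FROM tensorflow/tensorflow:2.3.1-gpu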

3. Add Requirements to the Docker image. These requirements can include small datasets, configuration information (be careful about storing secrets here when your image may be public, and even when it is private; there are better strategies for that), code dependencies, and much more. These are the things that are needed to make your code GO. Adding content is done through either the COPY or ADD statement.

Copying Python requirements
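
A sketch of what that might look like for a Python project; the file name and the /app path are illustrative, not prescriptive:

    COPY requirements.txt /app/requirements.txt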

4. Copy Code that will be used in your execution. If you are working on a large project where many Docker images share the same code base, it is most likely a better option to copy the entire codebase into your image (these codebases are often small in the grand scheme of things). This is done the same way as adding requirements, using the COPY or ADD statement; the intent is just different.

Copying code to be executed.
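
Again purely illustrative, assuming your pipeline code lives under a src/ directory:

    COPY src/ /app/src/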

5. Install Dependencies and Build Code that will be needed to execute your pipeline. Sometimes, as in most Python and R pipelines, it suffices to install your libraries, but if you are running code that is Java/Scala, C/C++, etc., you will need to compile your code within the Docker image. Dependencies are installed using RUN. A RUN statement executes the shell command that follows it. To keep your image as small as possible (it's a Docker thing), you should chain your installation and build steps into a single command using && whenever possible.

Executing a pip install on Python requirements
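
A minimal example, assuming the requirements file copied in the step above:

    RUN pip install --no-cache-dir -r /app/requirements.txt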

6. Set the Execution Entrypoint so that when you run the Docker image, it knows what code to execute. This is done using the CMD statement. This command is a bit odd when you first encounter it, but it is essentially a list of strings that make up a shell command.

Execution using preferred exec format
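
In the exec form, that might look like this, where train.py is a placeholder for whatever script drives the stage:

    CMD ["python", "/app/src/train.py"]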

Putting it all together, you have a rather simple layout of what you should do. Now, Dockerfiles in the wild often have much more going on. Sometimes the underlying OS dependencies need to be updated, more complex installations are executed, more and different kinds of files are copied, environment variables are set, and much more.

Contents of a Simple Dockerfile
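
Pulling the illustrative snippets above together, a simple Dockerfile for a single pipeline stage could read roughly as follows. Every name here (base image, paths, script) is an assumption for the sake of the example, not a prescription:

    # Example Dockerfile for one pipeline stage (illustrative only)
    FROM tensorflow/tensorflow:2.3.1-gpu

    # Add requirements first so the dependency layer is cached between builds
    COPY requirements.txt /app/requirements.txt
    RUN pip install --no-cache-dir -r /app/requirements.txt

    # Copy the code to be executed
    COPY src/ /app/src/

    # Entrypoint in the preferred exec format
    CMD ["python", "/app/src/train.py"]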

7. Building the Image is the act of taking the instructions laid out in the Dockerfile and executing them against the Docker build engine to produce an executable image. This step is not unlike the process one would take when compiling executable code, and if you read enough and happen to have the right software engineering background, you will start seeing a lot of parallels with object-oriented programming and software engineering in general. Anyway, this image is built through a command-line command. It creates a binary image on your device that is now portable and executable on any platform where Docker is running. Magic!

Command to build your Docker image
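
For example, run from the directory containing the Dockerfile; the image name and tag are placeholders:

    docker build -t my-registry/model_training:0.1.0 .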

8. Tag & Publish your image to make sure that it is available to remote environments (relative to where you built the image) for execution. Tagging is how you assign identifiable versions to an image. It is not uncommon to have multiple tags for an image. The latest tag is most often used for the latest stable image (e.g., not test or experimental images).

Tag and Push commands
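
Something along these lines, with the registry name and version made up for illustration:

    docker tag my-registry/model_training:0.1.0 my-registry/model_training:latest
    docker push my-registry/model_training:0.1.0
    docker push my-registry/model_training:latest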

Organizing your Code

When working with Docker, especially when you have to create and manage your Docker images yourself, it is important to have a logical way of organizing your code. While the way I present here is how I personally like to manage code with Docker, I wouldn't call it the preferred way. Ideally, whatever way you use, any tooling you adopt alongside it should give you the freedom to keep your own organization.

My mind likes to compartmentalize different units of work. This is how I operate. So, naturally, I tend to organize my code similarly (all things code). Each logical portion of a pipeline/workflow tends to be placed within its own directory. This allows me to look at the code and see what I am doing. In the sample below, I will make the working assumption that all code for each stage is in one file (this is usually the exception, not the rule, in well-organized code with a high reuse factor).

Roughly how I organize my code
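
A rough sketch of that kind of layout, with made-up stage names, one directory (and one Dockerfile) per pipeline stage:

    ml-pipeline/
    ├── feature_engineering/
    │   ├── Dockerfile
    │   ├── requirements.txt
    │   └── feature_engineering.py
    ├── model_training/
    │   ├── Dockerfile
    │   ├── requirements.txt
    │   └── model_training.py
    └── model_evaluation/
        ├── Dockerfile
        ├── requirements.txt
        └── model_evaluation.py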

Of course, you wouldn't necessarily break things down this way (it isn't always beneficial to have such granularity), but the idea remains. What is important is that each thing I deploy has a separate Dockerfile (I'm not going to talk about advanced approaches with templatization or other tools out there; we have one coming very soon and will link it here when ready, so pay attention).

When I am working manually, I like to create a Bash script that encapsulates all of the work I would be doing when deploying my code. In this case, that means a complete build of all of the images (this is very quickly put-together bash and not what someone would do in production).

Sample Docker build and publish script.
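
A quick-and-dirty version of such a script might look like the following; the registry name, stage names, and version are assumptions matching the layout sketched above, and error handling is deliberately minimal:

    #!/usr/bin/env bash
    # Build, tag, and push one image per pipeline stage (illustrative only)
    set -e

    REGISTRY=my-registry
    VERSION=0.1.0

    for STAGE in feature_engineering model_training model_evaluation; do
        docker build -t "${REGISTRY}/${STAGE}:${VERSION}" "./${STAGE}"
        docker tag "${REGISTRY}/${STAGE}:${VERSION}" "${REGISTRY}/${STAGE}:latest"
        docker push "${REGISTRY}/${STAGE}:${VERSION}"
        docker push "${REGISTRY}/${STAGE}:latest"
    done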

Whatever you do, make sure that you have some structure in place that helps you manage your build process, whatever that may be (we can help).

Using your Dockerized Pipeline

Running your dockerized data science pipelines can be done in many ways. In traditional approaches (tech ages FAST), you would run them on an on-premises system or a VM using docker-compose, and hopefully not with docker run (unless you are testing). Docker-compose, Kubernetes (with Argo; see this blog), Prefect, Apache Airflow, and various other technologies can be used to build an ML pipeline. These are all used to create an execution pipeline known as a DAG (see my blog here explaining what a DAG is). Once you have deployed your Docker images and chosen your pipeline orchestration technology, you can run the pipeline either locally or remotely (the preferred approach).
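
As a minimal illustration of the docker-compose route (service and image names are assumptions carried over from the earlier examples), something like the following wires two stages together. Note that depends_on only orders container startup, not completion, so a true DAG with completion-based dependencies is better served by Argo, Airflow, or a similar orchestrator:

    # docker-compose.yml (illustrative sketch)
    version: "3.8"
    services:
      feature_engineering:
        image: my-registry/feature_engineering:0.1.0
      model_training:
        image: my-registry/model_training:0.1.0
        depends_on:
          - feature_engineering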

Your solution is now versioned, in a repository, and the data experiments you are performing are completely repeatable (cough, data changes, cough).

The best part, though, is that you will get the same results when you run it remotely, no matter where it is running. You won't need to worry about where it is running, and you will have your machine free from computational load to continue to develop more and better models.

What I Didn't Talk About

Quite honestly, a ton of topics; it is just not possible to do this justice in a single blog post. I can only give you a taste, and it is up to you where to go next. Building automation around your training process, storing and organizing your simulation results, deploying containerized models, hiding away the creation of your pipelines in complex tools, and so on: there are many topics. And I will be walking through these topics one at a time in this Building ML Pipelines series. They aren't in any particular order: they will reference each other (as you have surely seen in some links above), are meant to provide you some insight into the intricacies of doing ML in our evolving tech world, and hopefully will help guide you down your path to MLOps maturity.

How Hashmap Can Help

The next step is deciding how dockerized ML pipelines should fit into the data analytics solution for your organization. Hashmap can help you here. Our machine learning and MLOps experts are here to help you on your journey and to bring you and your organization to the next level. Let us help you get ahead of your competition and become truly efficient in your data analytics.

If you’d like assistance along the way, then please contact us.

Hashmap offers a range of enablement workshops and assessment services, cloud modernization and migration services, data science, MLOps, and various other technology consulting services.

John Aven, Ph.D., is the Director of Engineering at Hashmap, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure and connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.
