Conducto for Data Science

Execution Environment

Matt Jachowski
Conducto
5 min read · Apr 16, 2020

In this tutorial, you will learn how to specify the dependencies and code necessary for your commands to run. Conducto strives to make this as simple as possible.

When we walked through creating your first pipeline, we glossed over an important detail: specifying the execution environment of your commands. That is, for each command, you must be able to specify the operating system, software dependencies, and code it needs to run.

Explore our live demo, view the source code for this tutorial, or clone the demo and run it for yourself.

git clone https://github.com/conducto/demo.git
cd demo/data_science
python execution_env.py --local

Alternatively, download the zip archive here.

Containers and Images

Conducto achieves this by running each of your exec node commands inside a Docker container, which is defined by an image that you help to configure. An image is a template for an execution environment that contains a base operating system and filesystem contents, including libraries, packages, and user code. A container is an instantiation of an image, and is like a virtual machine, but lighter weight and quicker to create and destroy.

It is OK if you are new to containers; Conducto handles a lot of the details for you.

Let's take a deep dive into how you configure an image. As a refresher, this is the pipeline from your first pipeline tutorial; note the image parameter.

import conducto as co

def download_and_plot() -> co.Serial:
    dockerfile = "./docker/Dockerfile.first"
    image = co.Image(dockerfile=dockerfile, copy_dir="./code")
    with co.Serial(image=image) as pipeline:
        co.Exec(download_command, name="download")
        with co.Parallel(name="plot"):
            # ...
    return pipeline

if __name__ == "__main__":
    co.main(default=download_and_plot)

Image Specification

In Conducto, there are two ways to specify an image.

Existing Image

Specifying an existing image looks like this.

image = co.Image("r-base:3.6.0")

This particular image contains R, a programming language and environment for statistical computing, in a Debian Linux operating system, and is one of the many official R images available on DockerHub. You can specify any image from any public image registry, or a locally built image.
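With such an image in hand, an exec node can run R commands directly. This is a minimal sketch using only the co.Image, co.Serial, and co.Exec calls shown in this tutorial; the node name and R command are illustrative.

```python
import conducto as co

def r_pipeline() -> co.Serial:
    image = co.Image("r-base:3.6.0")
    with co.Serial(image=image) as pipeline:
        # Rscript is available because the image ships with R.
        co.Exec("Rscript -e 'print(R.version.string)'", name="version")
    return pipeline

if __name__ == "__main__":
    co.main(default=r_pipeline)
```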

Python Image + Python Requirements

If you specify an image with Python installed, we also allow you to specify any Python package requirements inline.

image = co.Image("python:3.8-slim", reqs_py=["numpy"])

This specific example is equivalent to having Python 3.8 installed on Debian Linux, with the following pip command having been run.

pip install numpy

Custom Dockerfile

For more control, you can specify your own Dockerfile, which Conducto will build into an image. You may specify dockerfile with an absolute or relative path, which is evaluated relative to the location of your pipeline script. You must also specify context, which is the docker build context.

image = co.Image(
    dockerfile="./docker/Dockerfile.simple",
    context="."
)

Here is a very simple Dockerfile that results in an image equivalent to the python example from the previous section.

FROM python:3.8-slim
RUN pip install numpy
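A custom Dockerfile also lets you install things pip cannot, such as system packages. Here is a hypothetical variant; the curl package is only an example of a system-level dependency.

```
FROM python:3.8-slim
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*
RUN pip install numpy
```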

Adding Your Own Code

So far we have discussed how to use images to include required software dependencies. But, you likely also need to include your own code in the image.

Fun fact: Conducto was almost named Blue Steel.

There are a few ways to do this.

Copy a Local Directory

You can specify a local directory with your own files to be copied into your image with the copy_dir argument. You may use an absolute or relative path for the directory, which is evaluated relative to the location of your pipeline script.

image = co.Image("r-base:3.6.0", copy_dir="./code")

This copies the directory ./code into your image. You may specify copy_dir with any image specification from above: an existing image or a dockerfile.
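Once your code is in the image, an exec node can invoke it. Here is a sketch; plot.R is a hypothetical script, and the in-container path assumes the copied directory is reachable as ./code from the node's working directory, which is Conducto's choice rather than something this tutorial specifies.

```python
import conducto as co

def plot_pipeline() -> co.Serial:
    image = co.Image("r-base:3.6.0", copy_dir="./code")
    with co.Serial(image=image) as pipeline:
        # plot.R is a hypothetical script inside ./code; the path below
        # assumes the copied directory is available as ./code at runtime.
        co.Exec("Rscript code/plot.R", name="plot")
    return pipeline
```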

Clone from Git

You can also specify a git repository and branch to clone into your image with the copy_url and copy_branch arguments. This is useful for ensuring that your data science pipelines run against clean, versioned code, and not scripts with local uncommitted changes that could be lost. Here is an example using our demo repo on GitHub.

git_url = "https://github.com/conducto/demo.git"
dockerfile = "./docker/Dockerfile.git"
image = co.Image(
    dockerfile=dockerfile, copy_url=git_url, copy_branch="master"
)

Just like copy_dir, you can pass copy_url and copy_branch with any image specification.

COPY or ADD in Dockerfile

Finally, if you specify your own custom Dockerfile, you can COPY or ADD any files you want. Here is a Dockerfile that explicitly copies a code directory into the image. In this example, ./code is a path relative to the docker build context, specified by the context argument as seen earlier.

FROM r-base:3.6.0
COPY ./code /root/code

Mounting Local Code for Debugging

One of our favorite features in Conducto is live debugging. We show an example of this in our debugging tutorial. When you debug a node, you get a shell in a container with your full execution environment, including any code you have added to the image. If possible, we will mount your local code, creating a live debug environment. In this mode, any edits you make to your code outside of the container are visible inside the container, where you can test your command in its full execution environment. This allows you to use your regular editor and debug tools outside of the container to make the debug process as painless as possible.

We can do this in two scenarios:

  • you add your code with copy_dir, or
  • you specify path_map to explicitly map paths outside the container to inside the container.

So, you get the feature for no effort if you use copy_dir, but you have to specify an extra parameter if you want to use live debug with the clone from git or dockerfile image specifications.

Clone from Git + path_map

If you always have a local checkout of the git repo that you specify for an image, you can safely specify a path_map to make any later debugging easier. Here is the example from above with path_map added.

git_url = "https://github.com/conducto/demo.git"
path_map = {".": "data_science"}
image = co.Image(
    dockerfile="./docker/Dockerfile.git",
    copy_url=git_url,
    copy_branch="master",
    path_map=path_map
)

This maps the local directory ., relative to the location of the pipeline script, which is outside the container, to the data_science directory relative to the root of the cloned git repo inside the container.
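To make the mapping concrete, here is an illustration, not Conducto internals: an edit to a file under the local directory shows up at the corresponding path under data_science/ in the cloned repo inside the container.

```python
# Illustrative only: what the path_map entry {".": "data_science"} means.
import posixpath

path_map = {".": "data_science"}
host_file = "execution_env.py"  # relative to the pipeline script, on the host
container_file = posixpath.join(path_map["."], host_file)
print(container_file)  # data_science/execution_env.py
```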

COPY or ADD in Dockerfile + path_map

It works the same way for an image with a dockerfile that adds its own files, except that the target path inside the container must be absolute. This is because, in this scenario, Conducto has no way to choose a reasonable default root directory inside the container. Here is an example.

path_map = {"./code": "/root/code"}
image = co.Image(
    dockerfile="./docker/Dockerfile.copy",
    context=".",
    path_map=path_map
)

The Dockerfile is the same as above.

FROM r-base:3.6.0
COPY ./code /root/code

Image Inheritance

Finally, a node with an unspecified image parameter will inherit the value from its parent. The pipeline from our first tutorial shows this, with all nodes sharing an image with the root node.

import conducto as co

def download_and_plot() -> co.Serial:
    dockerfile = "./docker/Dockerfile.first"
    image = co.Image(dockerfile=dockerfile, copy_dir="./code")
    with co.Serial(image=image) as pipeline:
        co.Exec(download_command, name="download")
        with co.Parallel(name="plot"):
            # ...
    return pipeline

if __name__ == "__main__":
    co.main(default=download_and_plot)
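Inheritance also means a child node can override the image it would otherwise inherit. Here is a hedged sketch; it assumes the per-node image argument mirrors the parent-level usage shown above, and the commands are illustrative.

```python
import conducto as co

def mixed_pipeline() -> co.Serial:
    py_image = co.Image("python:3.8-slim", reqs_py=["numpy"])
    r_image = co.Image("r-base:3.6.0")
    with co.Serial(image=py_image) as pipeline:
        # Inherits py_image from the parent Serial node.
        co.Exec("python -c 'import numpy; print(numpy.__version__)'",
                name="numpy-version")
        # Overrides the inherited image for just this node.
        co.Exec("Rscript -e 'print(R.version.string)'",
                image=r_image, name="r-version")
    return pipeline
```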

That is all there is to it! Now, with the information you learned in Your First Pipeline, Environment Variables and Secrets, Data Stores, Node Parameters, Easy and Powerful Python Pipelines, and here, you can create arbitrarily complex data science pipelines.
