Stories by Matt Jachowski on Medium

Conducto for Data Science

Matt Jachowski — Tue, 21 Apr 2020 22:34:45 GMT

Conducto for Data Science

We make bold claims about why Conducto is great for Data Science. Our intelligent container-based architecture and thoughtful developer-driven design make it possible to:

easily write pipelines with the full power of python,
dynamically modify pipelines at runtime,
execute locally for free, or in the cloud for immediate scale,
interact with our intuitive and simple pipeline view in the web app,
debug and deploy fixes to live pipelines easily, and
effortlessly collaborate with teammates.

Conducto’s container-based architecture.

How much more data do you need? Explore our live demo to get a taste. Then, get started on Linux, macOS, Windows, or WSL and immediately become more productive.

If you have already started but want to learn more, here is our recommended reading list.

Conducto for Data Science was originally published in Conducto on Medium, where people are continuing the conversation by highlighting and responding to this story.

Easy and Powerful Python Pipelines

Matt Jachowski — Tue, 21 Apr 2020 22:18:41 GMT

Conducto for Data Science

You can build pipelines out of commands in any language with Conducto, but we have some extra support for python that allows you to easily glue python functions together into rich and dynamic pipelines.

Pass a python function to co.Exec.
Lazily define your pipeline at runtime with co.Lazy.
Use Markdown to display rich output in an Exec node. (This is not specific to python.)

This example does a parallel word count over a randomly generated list of words. The algorithm is simple but illustrates a common pattern in data science.

Get the data.
Do parallelized analysis over the data.
Aggregate the results.
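The three steps above can be sketched in plain Python, without Conducto. This is only an illustration of the pattern and not the demo's code: the function names (get_data, count_chunk, aggregate) and the tiny word list are hypothetical, and the chunks are processed in a simple loop where a pipeline would run them as parallel nodes.

```python
from collections import Counter
from typing import List


def get_data() -> List[str]:
    # Hypothetical stand-in for downloading or generating the word list.
    return ["apple", "pear", "apple", "fig", "pear", "apple"]


def count_chunk(words: List[str]) -> Counter:
    # Analysis step: count the words in one chunk.
    return Counter(words)


def aggregate(partials: List[Counter]) -> Counter:
    # Aggregation step: merge the per-chunk counts into one total.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


words = get_data()
chunksize = 2
chunks = [words[i:i + chunksize] for i in range(0, len(words), chunksize)]
summary = aggregate([count_chunk(chunk) for chunk in chunks])
print(summary.most_common(2))
```

Because each chunk is counted independently, the analysis step is the part that benefits from running in parallel.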

Explore our live demo, view the source code for this tutorial, or clone the demo and run it for yourself.

git clone https://github.com/conducto/demo.git
cd demo/data_science
python easy_python.py --local

Alternatively, download the zip archive here.

Pass a Python Function to co.Exec

Conducto can automatically call python functions from the shell so you do not have to build your own command-line interface. Instead of calling co.Exec with a shell command, pass it a function and its arguments.

In this example, we want to execute this function in an Exec node.

def gen_data(path: str, count: int):
    words = _get_words(count)
    text = b"\n".join(words) + b"\n"
    co.temp_data.puts(path, text)

So, we pass the gen_data function and its arguments to co.Exec.

co.Exec(gen_data, WORDLIST_PATH, count=50000)

This auto-generates the shell command below for the Exec node. Note that the conducto executable is largely just a wrapper for python.

conducto easy.py gen_data \
    --path=conducto/demo_data/wordlist --count=50000

Requirements

Conducto needs to be able to find this function in the image that the Exec node runs. Therefore, the Exec node must run with a co.Image that has copy_dir, copy_url, or path_map set. Also:

The image must include the file with the function.
The function name cannot start with an underscore (_).
The image must install conducto.
You must set typical node parameters like image, env, doc, etc. outside of the constructor, either in a parent node or by setting the fields directly.

Function Arguments

All arguments are serialized to the command line, so only pass parameters and paths. Large amounts of data should be passed via a data store like co.temp_data instead.

Arguments can be basic python types (int, float, etc.), date/time/datetime, or lists thereof. Conducto infers types from the default arguments or from type hints, and deserializes accordingly.
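To make the round trip concrete, here is a minimal sketch of how typed arguments could be turned into command-line flags and then converted back using type hints. It is not Conducto's implementation: to_flags and from_flags are hypothetical helpers, and this gen_data is a stand-in that only shares the signature of the demo function.

```python
from typing import Callable, Dict, List, get_type_hints


def to_flags(**kwargs) -> List[str]:
    # Serialize keyword arguments to --name=value strings.
    return [f"--{name}={value}" for name, value in kwargs.items()]


def from_flags(func: Callable, flags: List[str]) -> Dict[str, object]:
    # Convert each string back to the type named in the function's hints.
    hints = get_type_hints(func)
    parsed = {}
    for flag in flags:
        name, _, raw = flag[2:].partition("=")
        parsed[name] = hints.get(name, str)(raw)
    return parsed


def gen_data(path: str, count: int):
    # Stand-in body; the real function writes a word list.
    return f"{count} words -> {path}"


flags = to_flags(path="conducto/demo_data/wordlist", count=50000)
kwargs = from_flags(gen_data, flags)
print(flags)
print(gen_data(**kwargs))
```

Everything passes through a string on the way, which is why only small parameters and paths are good candidates for arguments.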

Lazy Pipeline Definition

Data science pipelines often benefit from dynamically defining the pipeline structure based on properties of the data that only become evident as you begin analyzing it. For example, you may not know the size of your data until you download it, and that size determines how you want to chunk your parallel analysis for maximum efficiency.

Conducto empowers you to lazily define your pipeline such that new nodes can be defined as the pipeline runs. Simply write a function that returns a Parallel or Serial node that represents a new subtree to add to the pipeline, and call it with co.Lazy.
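The kind of decision such a function makes at runtime can be sketched without Conducto. In this hypothetical example, plan_chunks returns a plain dictionary as a stand-in for a Parallel node, with one entry per chunk. The count_words command text is made up for illustration.

```python
from typing import Dict


def plan_chunks(total: int, chunksize: int) -> Dict[str, str]:
    # One entry per chunk; the value is the command a child node might run.
    plan = {}
    for start in range(0, total, chunksize):
        end = min(start + chunksize, total)
        plan[f"chunk_{start}_{end}"] = f"count_words --start={start} --end={end}"
    return plan


# The total is only known once the data has been fetched.
plan = plan_chunks(total=2500, chunksize=1000)
for name, command in plan.items():
    print(name, "->", command)
```

The number of children depends on the data size, which is exactly the information that is unavailable when the pipeline is first defined.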

The parallel_word_count node defines a pipeline to chunk and analyze the input data in parallel. It is generated by the parallelize function, whose declaration is shown here. Importantly, it is type-hinted to return a Parallel node.

def parallelize(
    wordlist_path, result_dir, top: int, chunksize: int
) -> co.Parallel:

The lazy node is generated by assigning the node to the result of co.Lazy:

output["parallel_word_count"] = co.Lazy(
    parallelize, WORDLIST_PATH, RESULT_DIR, top=15, chunksize=1000
)

co.Lazy produces two nodes inside the parallel_word_count Serial node.

The Generate and Execute nodes are auto-generated by co.Lazy. Note that the Execute node is an empty parallel node, because the Generate node that populates it has not run yet.

The first node, Generate, is an Exec node that calls the parallelize function and prints out the pipeline that it returns. This is the command it runs:

conducto easy.py parallelize \
    --wordlist_path=conducto/demo_data/wordlist \
    --result_dir=conducto/demo_data/results \
    --top=15 --chunksize=1000

Once the Generate node finishes and returns its new pipeline subtree, the subtree is deserialized into an Execute node, which then runs.

The output of the Generate node is the pipeline definition for the Execute node, which can then run.

Requirements

co.Lazy has all the same limitations as co.Exec(func) that you saw above. Additionally, the function must be type hinted to return a Parallel or Serial node, as in def func() -> co.Parallel.

When to use it

The demo pipeline uses co.Lazy to dynamically parallelize over input data, but there are many other common uses:

Processing streaming data in batches: When processing a new batch, use co.Lazy to filter out data that has already been processed, and only generate nodes for new data. Use the same logic to backfill data.
Relational mapping: To join relational data, simply use a for loop. When joining datasets A and B, iterate over A at runtime and create Exec nodes that run in parallel. Each node looks up the rows in B that correspond to its A value. You have full control over the parallelism and can debug any failed or incorrect mappings.
Time-consuming pipeline generation logic: Sometimes, even figuring out the work to do can take a while. Use co.Lazy to parallelize pipeline creation and get it out of the critical path.

These uses can arise multiple times in the same pipeline. co.Lazy is fully nestable, so you can handle them all and lazily generate as sophisticated a pipeline as you need.
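As a small illustration of the streaming case above, the filtering step could look like the sketch below. The unprocessed helper and the date-named batches are hypothetical. In a real pipeline, the returned list would drive which nodes the lazy function generates.

```python
from typing import Iterable, List, Set


def unprocessed(batch_ids: Iterable[str], done: Set[str]) -> List[str]:
    # Keep the input order and drop anything already handled.
    return [batch_id for batch_id in batch_ids if batch_id not in done]


done = {"2020-04-18", "2020-04-19"}
incoming = ["2020-04-18", "2020-04-19", "2020-04-20", "2020-04-21"]
todo = unprocessed(incoming, done)
print(todo)
```

The same function covers backfills: pass a longer list of batch identifiers and only the missing ones come back.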

Markdown to Display Rich Output

The goal of data science pipelines is often to produce human-understandable results. While you are always free to send data to external visualization tools, Conducto supports using Markdown to display tables, links, and graphs in your node’s output. Note that this is not specific to python and can be used by any commands.

Simply print ... in your stdout/stderr, and Conducto will render the Markdown between the tags.

The summarize node in the demo summarizes the results of the parallel_word_count step using a graph and a table. This is the relevant output code from the summarize function.

print("")
print(f"![img]({url})")
print()
print("rank | word | count")
print("-----|------|------")
for rank, (word, count) in enumerate(summary.most_common(top), 1):
    print(f"#{rank} | {word} | {count}")
print("")

And this is the output as rendered in the node pane.

Show a graph and a table in stdout using Markdown.
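If you want to check the table text before running a pipeline, the same rows can be built as a list of strings. This is a sketch with a hypothetical markdown_table helper and made-up counts. It only produces the Markdown table lines and leaves the surrounding output to the summarize function.

```python
from collections import Counter
from typing import List


def markdown_table(summary: Counter, top: int) -> List[str]:
    # Header row, separator row, then one row per ranked word.
    lines = ["rank | word | count", "-----|------|------"]
    for rank, (word, count) in enumerate(summary.most_common(top), 1):
        lines.append(f"#{rank} | {word} | {count}")
    return lines


summary = Counter({"apple": 3, "pear": 2, "fig": 1})
table = markdown_table(summary, top=2)
print("\n".join(table))
```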

That’s it! By now you should know how to construct some powerful data science pipelines with Conducto. If you think you missed anything, check out our recommended reading list here.

Easy and Powerful Python Pipelines was originally published in Conducto on Medium, where people are continuing the conversation by highlighting and responding to this story.

Easy Error Resolution

Matt Jachowski — Tue, 21 Apr 2020 19:18:15 GMT

Conducto for Data Science

Anyone who has spent time with complex data science pipelines has spent a lot of that time resolving errors with them. Bugs are just a reality when you are trying to implement a complex system. Conducto makes it as easy as possible to resolve the three types of errors we think that you are most likely to encounter:

flaky errors that you should fix, but do not have time for now,
pipeline specification errors, like a typo in a command or missing env, and
errors that require serious debugging

We think that our thoughtful approach to error surfacing and handling will save you a ton of time and make you more productive.

Explore our live demo, view the source code, or clone the demo and run it for yourself.

git clone https://github.com/conducto/demo.git
cd demo/data_science
python error_resolution.py --local

Alternatively, download the zip archive here.

Flaky Errors

Sometimes your pipeline has a flaky command that periodically fails for no good reason. You really should fix it, but you do not want it to block you now. Or, your pipeline computes features over 500 days' worth of data in parallel, and 2 days out of 500 fail due to corrupt data. In the first case, you can Reset the node to try again. In either case, you can Skip the node to ignore the error and move on.

This is the flaky error example from our demo with the Reset and Skip buttons boxed in yellow.

Reset

If the command passes 80% of the time and fails 20% of the time, and you just want to run it again to give it a chance to pass, click the Reset button in the toolbar to re-run the node. If it passes, then great, your pipeline will continue on.
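A quick calculation shows why a couple of resets usually clear a flaky failure. The sketch assumes each run is independent and passes with the same probability, which may not hold if the flakiness has an external cause. The chance_of_pass helper is hypothetical.

```python
def chance_of_pass(pass_rate: float, attempts: int) -> float:
    # Probability that at least one of `attempts` independent runs passes.
    return 1 - (1 - pass_rate) ** attempts


for attempts in (1, 2, 3):
    print(attempts, round(chance_of_pass(0.8, attempts), 3))
```

With an 80% pass rate, the chance of at least one pass is 0.8 after one run, 0.96 after two, and 0.992 after three.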

After clicking Reset, the node still fails, as seen in the timeline.

Skip

In this scenario, the command keeps failing even after a few resets, so you should just skip the node. Select the errored feature2 node and click the Skip button in the web app to let your pipeline continue to the build_model node. Alternatively, you can select and skip the errored parent compute_features node, which marks all of its subnodes as skipped and also lets your pipeline continue to the build_model node.

After skipping the errored feature2 node, the pipeline is able to continue to the build_model node.

Specification Errors

You are going to make typos or forget things like environment variables when you write a pipeline specification; that is just human. In Conducto, you can quickly fix errors like these: select the errored node, click the Modify button in the toolbar, fix the offending parameter, then click the Reset button to immediately re-run the node.

Note that these fixes are isolated to the live instance of the pipeline, and do not modify anything in the pipeline script. You need to port your fixes to the pipeline script so that future runs do not suffer from the same errors.

Fix an Environment Variable

In the demo, we made a typo in the name of an environment variable. You can fix the error by selecting either the errored env_error node or its specification_error parent node, clicking the Modify button, then correcting the typo: CRATCH_DIR -> SCRATCH_DIR.

Correct the typo, CRATCH_DIR -> SCRATCH_DIR, in the Modify modal.

After clicking Update, you can verify that you see the expected diff in the right hand node pane.

Verify that the change you made is correct by viewing the Execution Parameters diff.

Finally, click Reset and you will see the node complete successfully.

Fix a Command

In the next node, we made a typo in the command. You can fix that error by selecting the errored command_error node, clicking the Modify button, then correcting the typo: lss -> ls.

Correct the typo in the command, lss -> ls, in the Modify modal.

After clicking Update, you can verify that you see the expected diff in the right hand node pane.

Verify that the change you made is correct by viewing the Execution Parameters diff.

Finally, click Reset and you will see the node complete successfully.

Errors Requiring Debugging

Sometimes you have a real issue that you need to debug. You can use debug mode by clicking the empty bug icon or live debug mode by clicking the lightning bug icon.

You can choose to debug with a snapshot of your code or live debug with your local code mounted directly into your debug container.

Debug Mode

Debug mode gives you a shell in a container with the node’s command and execution environment, including environment variables and a copy of your code. You can immediately reproduce the exact results you see in your pipeline. You can modify command, environment, and code in this container. Any changes are discarded when you exit this shell, so you must manually port your fixes back to your local code.

Live Debug Mode

Live debug mode gives you the same shell as debug mode, but also mounts your local code so that you can edit code outside of the shell with your own editor. Conversely, any changes you make inside the livedebug container persist outside on your local host even after you exit the shell, allowing you to instantly commit any of your fixes to your repo.

Debug Example

In this example, you should use live debug mode. Click the lightning bug in the upper right hand corner of the node pane to get a command copied to your clipboard. Paste that command into a local shell. Run the command to immediately reproduce the error reported by the pipeline.

Now, since the live debug container mounts the code from your local filesystem, you can edit and debug using your own editor and debug environment. Test your fix by re-running the command in the live debug container.

A debug container works the same way, but the code is copied into the container and has no connection to your local machine. So, you must edit and debug entirely within the debug shell.

A live debug session starts with a command that you paste into a shell. In the debug container you can cat the command, execute it to immediately reproduce the error, and re-run it to test your fix once you have debugged it in your own local editor.

Once you have fixed the code, you must click Rebuild Image to rebuild the image so that the pipeline can see the updated code. Once the image is rebuilt, you can click Reset to re-run the node to see it run successfully. As a shortcut, you can click Rebuild and Reset in the upper right hand corner of the node pane.

Rebuild the image then Reset to re-run the node in one step by clicking Rebuild and Reset, which is conveniently the default button displayed in the yellow box.

You can view the history of each run of a node in the node pane timeline. Select any row in the timeline to see the Command, Execution Parameters, Stdout, and Stderr for that run of the node. Here, we can see the output of the first run that errored, and the second run that was successful.

Toggle between different rows in the timeline to see Command, Execution Parameters, Stdout, and Stderr for different runs of a node.

If you have not yet, get started with Conducto now. Local mode is always free and is only limited by the cpu and memory on your machine. Cloud mode gives you immediate scale. Use the full power of python to write pipelines with ease. And, enjoy easy error resolution.

Easy Error Resolution was originally published in Conducto on Medium, where people are continuing the conversation by highlighting and responding to this story.

Easy Error Resolution

Matt Jachowski — Tue, 21 Apr 2020 09:25:36 GMT

Conducto for CI/CD

Anyone who has spent time with complex CI/CD pipelines has spent a lot of that time resolving errors with them. Bugs are just a reality when you are trying to implement a complex system. Conducto makes it as easy as possible to resolve the three types of errors we think that you are most likely to encounter:

flaky errors that you should fix, but do not have time for now,
pipeline specification errors, like a typo in a command or missing env, and
errors that require serious debugging

We think that our thoughtful approach to error surfacing and handling will save you a ton of time and make you more productive.

Explore our live demo, view the source code, or clone the demo and run it for yourself.

git clone https://github.com/conducto/demo.git
cd demo/cicd
python error_resolution.py --local

Alternatively, download the zip archive here.

Flaky Errors

Sometimes your pipeline has a flaky test that periodically fails for no good reason. You really should fix it, but you do not want it to block you now. You have two options: you can Reset the node to try again, or you can Skip the node to ignore the error and move on.

This is the flaky error example from our demo with the Reset and Skip buttons boxed in yellow.

Reset

After clicking Reset, the node still fails, as seen in the timeline.

Skip

In this scenario, the test keeps failing even after a few resets, so you should just skip the node. Select the errored test2 node and click the Skip button in the toolbar to let your pipeline continue to the deploy node. Alternatively, you can select and skip the errored parent test node, which marks all of its subnodes as skipped and also lets your pipeline continue to the deploy node.

After skipping the errored test2 node, the pipeline is able to continue to the deploy node.

Specification Errors

Fix an Environment Variable

Correct the typo, CRATCH_DIR -> SCRATCH_DIR, in the Modify modal.

After clicking Update, you can verify that you see the expected diff in the right hand node pane.

Verify that the change you made is correct by viewing the Execution Parameters diff.

Finally, click Reset and you will see the node complete successfully.

Fix a Command

In the next node, we made a typo in the command. You can fix that error by selecting the errored command_error node, clicking the Modify button, then correcting the typo: lss -> ls.

Correct the typo in the command, lss -> ls, in the Modify modal.

After clicking Update, you can verify that you see the expected diff in the right hand node pane.

Verify that the change you made is correct by viewing the Execution Parameters diff.

Finally, click Reset and you will see the node complete successfully.

Errors Requiring Debugging

Sometimes you have a real issue that you need to debug. You can use debug mode by clicking the empty bug icon or live debug mode by clicking the lightning bug icon.

You can choose to debug with a snapshot of your code or live debug with your local code mounted directly into your debug container.

Debug Mode

Live Debug Mode

Debug Example

A debug container works the same way, but the code is copied into the container and has no connection to your local machine. So, you must edit and debug entirely within the debug shell.

Rebuild the image then Reset to re-run the node in one step by clicking Rebuild and Reset, which is conveniently the default button displayed in the yellow box.

Toggle between different rows in the timeline to see Command, Execution Parameters, Stdout, and Stderr for different runs of a node.

We are developers who know the pain of re-creating execution environments and debugging in fragile setups. So, we built Conducto to make error resolution as quick and easy as possible. We hope that you will find debugging in Conducto to be a breath of fresh air. Check out Rapid and Painless Debugging to see us applying these techniques to our actual internal CI/CD pipeline.

Easy Error Resolution was originally published in Conducto on Medium, where people are continuing the conversation by highlighting and responding to this story.

Node Parameters

Matt Jachowski — Tue, 21 Apr 2020 00:25:50 GMT

Conducto for Data Science

Exec, Serial, and Parallel Nodes support several parameters that make pipeline specification in Conducto extremely powerful. You have already learned about image and env. You can also specify:

cpu and mem to constrain resources
requires_docker to run docker commands
stop_on_error to implement the finally pattern
same_container to control container sharing
doc to show pretty documentation in the web app
skip to skip a node by default

Explore our live demo, view the source code for this tutorial, or clone the demo and run it for yourself.

git clone https://github.com/conducto/demo.git
cd demo/data_science
python node_params.py --local

Alternatively, download the zip archive here.

You can view most of these parameters for any node in the Execution Parameters section of the node pane.

Most node and image parameters are listed in the node pane.

And, you can modify most of these parameters for any node in a live pipeline from the Modify modal and Reset the node to re-run in place.

You can modify many of the node parameters in the Modify modal.

cpu and mem

The cpu and mem parameters limit the CPU and memory assigned to an Exec node. The defaults are cpu=1 (one CPU) and mem=2 (2 GB). If your commands need very little CPU or memory, allocate less so that your local pipeline can launch more nodes in parallel. Allocate more if necessary.

co.Exec("echo not doing much", cpu=0.25, mem=0.25)
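To see why smaller allocations allow more parallelism, consider this back-of-the-envelope sketch. It assumes that concurrency is limited only by the cpu and mem reservations and that the machine has 4 CPUs and 16 GB. Both the machine size and the max_parallel helper are hypothetical, and Conducto's actual scheduling may take other factors into account.

```python
def max_parallel(total_cpu: float, total_mem: float,
                 cpu: float = 1, mem: float = 2) -> int:
    # Nodes that fit at once if each reserves `cpu` CPUs and `mem` GB.
    return int(min(total_cpu // cpu, total_mem // mem))


print(max_parallel(4, 16))              # defaults: cpu=1, mem=2
print(max_parallel(4, 16, 0.25, 0.25))  # lighter nodes
```

Under these assumptions the default reservation fits 4 nodes at a time, while cpu=0.25 and mem=0.25 fits 16.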

requires_docker

To enable running docker commands like docker build, docker run, etc. in a node, you must set requires_docker=True. This is because your commands run within a docker container already, and running docker within docker requires non-trivial setup that Conducto will not do by default. Also, note that your image must have docker installed.

image = co.Image("docker:19.03")
co.Exec("docker run hello-world", requires_docker=True, image=image)

stop_on_error

A Serial node defaults to stop_on_error=True, which means that it stops and reports itself as errored as soon as any child node encounters an error. If stop_on_error=False, then it will run all child nodes, but will still report itself as errored if any child encountered an error. This is useful for implementing a finally pattern to guarantee that your pipeline always runs a cleanup step.

with co.Serial(name="stop_on_error_false", stop_on_error=False):
    co.Exec("echo doing some setup", name="setup")
    co.Exec("this_command_will_fail", name="bad_command")
    co.Exec("echo doing some cleanup", name="finally_cleanup")

A pipeline with the default stop_on_error=True behavior (above) vs one with stop_on_error=False (below). You can ensure that a final cleanup step always runs with stop_on_error=False.
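The difference between the two settings can be simulated in a few lines of plain Python. This is a sketch of the behavior described above and not Conducto code: run_serial is a hypothetical helper, each step is a (name, succeeds) pair, and the "error" and "done" state names are made up.

```python
from typing import Dict, List, Tuple


def run_serial(steps: List[Tuple[str, bool]],
               stop_on_error: bool = True) -> Dict[str, object]:
    # Returns which steps ran and the final state of the serial node.
    ran, errored = [], False
    for name, succeeds in steps:
        if errored and stop_on_error:
            break  # default behavior: stop at the first error
        ran.append(name)
        if not succeeds:
            errored = True
    return {"ran": ran, "state": "error" if errored else "done"}


steps = [("setup", True), ("bad_command", False), ("finally_cleanup", True)]
print(run_serial(steps, stop_on_error=True))
print(run_serial(steps, stop_on_error=False))
```

In both cases the node ends in the error state. Only with stop_on_error=False does finally_cleanup get a chance to run.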

same_container

Exec nodes are not guaranteed to run in the same containers, although Conducto will reuse containers when possible for efficiency. You can force commands to run in the same container with the argument same_container=co.SameContainer.NEW. All child nodes will have the default same_container=co.SameContainer.INHERIT and will share the container with the parent. This is useful if you want greater visibility into a command that chains together multiple subcommands. An error in a single subcommand will be easier to identify than an error in a long command.

long_command = """set -ex
echo This is a long command.
echo First I do this.
echo Then I do that.
oops_this_is_not_a_valid_command
echo Then I go home.
"""
co.Exec(long_command)

versus

with co.Serial(name="example", same_container=co.SameContainer.NEW):
    co.Exec("echo This is a long command.", name="intro")
    co.Exec("echo First I do this.", name="do_this")
    co.Exec("echo Then I do that.", name="do_that")
    co.Exec("oops_this_is_not_a_valid_command", name="oops")
    co.Exec("echo Then I go home.", name="go_home")

It is easier to identify where the error occurred after splitting a long command into several commands sharing the same container.

Another reason to use same_container=co.SameContainer.NEW to force container sharing is when you want your commands to share a filesystem. This makes a download and analyze pipeline very easy, for example, because you simply download the data to the filesystem in one node, and the analyze node can automatically see it. There is no need to put the binary in a separate data store.

with co.Serial(name="shared", same_container=co.SameContainer.NEW):
    co.Exec(f"curl {data_url} > /tmp/data.zip", name="download")
    co.Exec("unzip -pq /tmp/data.zip > /tmp/data", name="unzip")
    co.Exec("wc -l /tmp/data", name="analyze")

However, there is a downside to this same_container mode. When sharing a container, Exec nodes will always run in serial, even if the parent is a Parallel node. So, you lose the ability to parallelize.

with co.Parallel(
    name="always_serial", same_container=co.SameContainer.NEW
):
    co.Exec("echo I cannot run in parallel", name="parallel_exec_1")
    co.Exec("echo even if I want to", name="parallel_exec_2")

doc

Nodes can be documented with the doc parameter. It supports Markdown and is rendered at the top of the node pane. Nodes with docs are marked with a doc icon in the pipeline pane. We make extensive use of this feature in all of our demos.

markdown_doc = "### I _can_ **use** `markdown`"

more_markdown_doc = """
Markdown even supports [links](https://www.conducto.com)
and images ![alt text](
http://cdn.loc.gov/service/pnp/highsm/21700/21778r.jpg "a pretty picture")
"""

co.Exec("echo doc example 1", doc=markdown_doc)
co.Exec("echo doc example 2", doc=more_markdown_doc)

The example uses simple Markdown in the doc.

This example uses Markdown to display a link and an image.

skip

Nodes can be skipped in the web app or with skip=True. This is useful, for example, if you have a pipeline that has a reasonable default way to run, but you want the ability to manually enable (unskip) additional steps from the web app. A specific example might be deploying a production model. You could skip the deployment node by default, and require that someone manually reviews the output of the pipeline before unskipping and running the node to complete the deployment.

image = co.Image("bash:5.0")
with co.Serial(image=image) as skip_example:
    co.Exec("echo build model", name="build")
    co.Exec("echo test model", name="test")
    co.Exec("echo deploy model", name="deploy", skip=True)
    co.Exec("echo send status email", name="send email")

Default skip the deploy step, and force someone to manually unskip it from the toolbar.

Now, with the information you learned in Your First Pipeline, Execution Environment, Environment Variables and Secrets, Data Stores, Easy and Powerful Python Pipelines, and here, you can create arbitrarily complex pipelines.

Node Parameters was originally published in Conducto on Medium, where people are continuing the conversation by highlighting and responding to this story.

Data Stores

Matt Jachowski — Mon, 20 Apr 2020 22:37:49 GMT

Conducto for Data Science

Data science pipelines necessarily generate data, plots, or intermediate results that need to be stored for some amount of time. You cannot simply persist these files on the local filesystem, because each command runs in a container with its own filesystem that disappears when the container exits. And, in cloud mode, containers run on different machines, so there is no shared filesystem to mount. So, Conducto supports a few different approaches that work in a containerized world.

Connect to your own data store.
Use co.data.pipeline/conducto-data-pipeline as a pipeline-local key-value store.
Use co.data.user/conducto-data-user as a user-scoped persistent key-value store.

Explore our live demo, view the source code for this tutorial, or clone the demo and run it for yourself.

git clone https://github.com/conducto/demo.git
cd demo/data_science
python data_stores.py --local

Alternatively, download the zip archive here.

Your Own Data Store

There are many standard ways to store persistent data: databases, AWS S3, and in-memory caches like redis, just to name a few. An exec node can run any shell command, so it is easy to use any of these approaches. Here is a trivial example that sets AWS credentials and writes to S3 with the AWS CLI.

image = co.Image("python:3.8-alpine", reqs_py=["awscli"])
env = {
    "AWS_ACCESS_KEY_ID": "my_access_key_id",
    "AWS_SECRET_ACCESS_KEY": "my_secret_key"
}
s3_command = "aws s3 cp my_file s3://my_s3_bucket/"
s3_exec_node = co.Exec(s3_command, image=image, env=env)

Note that in a real pipeline, you would want to store your AWS credentials as secrets.

Use co.data.pipeline / conducto-data-pipeline

co.data.pipeline is a pipeline-local key-value store. This data is only visible to your pipeline and persists until your pipeline is deleted. It is useful for writing data in one pipeline step, and reading it in another. In local mode, pipeline data lives on your local filesystem. In cloud mode, pipeline data lives in AWS S3.

co.data.pipeline has both a python interface and command line interface as conducto-data-pipeline. Here is the condensed interface. Our demo prints the command line usage to show the full interface.

usage: conducto-data-pipeline [-h]  [< --arg1 val1 --arg2 val2 ...>]

methods:
    delete         (name)    
    exists         (name)    
    get            (name, file)    
    gets           (name, byte_range:List[int]=None)    
    list           (prefix)    
    put            (name, file)    
    puts           (name)    
    url            (name)    
    cache-exists   (name, checksum)    
    clear-cache    (name, checksum=None)    
    save-cache     (name, checksum, save_dir)    
    restore-cache  (name, checksum, restore_dir)
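When testing pipeline code away from Conducto, it can help to have a stand-in with the same key-value shape. The class below is a hypothetical, dictionary-backed sketch that mirrors a few of the method names listed above (puts, gets, exists, list, delete). It is not Conducto's implementation and it omits files, byte ranges, URLs, and caching.

```python
from typing import Dict, List


class InMemoryStore:
    """Dictionary-backed stand-in for a key-value data store."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def puts(self, name: str, data: bytes) -> None:
        self._data[name] = data

    def gets(self, name: str) -> bytes:
        return self._data[name]

    def exists(self, name: str) -> bool:
        return name in self._data

    def list(self, prefix: str) -> List[str]:
        # Return all keys that start with the prefix, in sorted order.
        return sorted(key for key in self._data if key.startswith(prefix))

    def delete(self, name: str) -> None:
        del self._data[name]


store = InMemoryStore()
store.puts("results/a", b"1")
store.puts("results/b", b"2")
store.puts("other/c", b"3")
print(store.list("results/"))
store.delete("results/a")
print(store.exists("results/a"))
```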

One useful application is performing and summarizing a parameter search. In this example, we try different parameterizations of an algorithm in parallel. Each one stores its results using co.data.pipeline.puts(). Once all of the parallel tasks are done, it reads the results using co.data.pipeline.gets() and prints a summary.

Here is the pipeline specification.

# Location to store data.
data_dir = "demo/data_science/pipeline_data"

# Image installs python, R, and conducto.
output = co.Serial(image=image)

# Parameter search over 3 parameters in nested for loops.
output["parameter_search"] = ps = co.Parallel()

for window in [25, 50, 100]:
    ps[f"window={window}"] = w = co.Parallel()

    for mean in [.05, .08, .11]:
        w[f"mean={mean}"] = m = co.Parallel()

        for volatility in [.1, .125, .15, .2]:
            m[f"volatility={volatility}"] = co.Exec(
                f"python temp_data.py --window={window} "
                f"--mean={mean} --volatility={volatility} "
                f"--data-dir={data_dir}"
            )

# Summarize parameter search results.
output["summarize"] = co.Exec(f"Rscript temp_data.R {data_dir}")

This results in the following pipeline, where I have drilled down to an arbitrary step of the parameter search.

View of the pipeline pane for the parameter search example pipeline.

Any Exec node shows the command being run for a single step of the parameter search.

The node pane shows the command being run for a single step of the parameter search.
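
As a rough check on the size of this search, the same grid can be enumerated with itertools.product. This is a standalone sketch that only builds the command strings. It does not use Conducto, and the slash-separated key format is an assumption made for illustration.

```python
from itertools import product

data_dir = "demo/data_science/pipeline_data"
windows = [25, 50, 100]
means = [.05, .08, .11]
volatilities = [.1, .125, .15, .2]

# One command per (window, mean, volatility) combination,
# mirroring the three nested for loops in the pipeline definition.
commands = {
    f"window={w}/mean={m}/volatility={v}": (
        f"python temp_data.py --window={w} "
        f"--mean={m} --volatility={v} "
        f"--data-dir={data_dir}"
    )
    for w, m, v in product(windows, means, volatilities)
}

print(len(commands))  # 3 * 3 * 4 = 36 commands
```

Each of the 36 entries corresponds to one Exec node in the parameter search.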

The script being run for each step of the parameter search is temp_data.py and can be viewed here. In particular, this is the code it uses to store results to co.data.pipeline.

# Save result to Conducto's pipeline data store
path = "{}/mn={:.2f}_vol={:.2f}_win={:03}".format(
    data_dir, mean, volatility, window
)
data = json.dumps(output).encode()
co.data.pipeline.puts(path, data)
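
To see what the resulting key looks like, here is a small standalone sketch that applies the same format string. A plain dict stands in for co.data.pipeline, and the result_path helper and the placeholder output dict are hypothetical names used only for this illustration.

```python
import json


def result_path(data_dir, mean, volatility, window):
    # Same format string as in the snippet above.
    return "{}/mn={:.2f}_vol={:.2f}_win={:03}".format(
        data_dir, mean, volatility, window
    )


store = {}  # hypothetical in-memory stand-in for co.data.pipeline

output = {"example": 1}  # placeholder result, not real data
path = result_path("demo/data_science/pipeline_data", 0.05, 0.1, 25)
store[path] = json.dumps(output).encode()

print(path)
# demo/data_science/pipeline_data/mn=0.05_vol=0.10_win=025
print(json.loads(store[path].decode()))
# {'example': 1}
```

The zero-padded window and fixed two-decimal formatting keep the keys aligned, so that listing them by prefix returns them in a predictable order.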

In contrast, the summarize step runs temp_data.R, which can be viewed here, and uses the command line interface conducto-data-pipeline.

# Use `conducto-data-pipeline list` command to get all the files.
cmd = sprintf("conducto-data-pipeline list --prefix=%s", argv$dir)
files = fromJSON(system(cmd, intern=TRUE))

names(files) <- gsub(".*/", "", files)
datas = lapply(files, function(f) {
    # Call `conducto-data-pipeline gets` to get an individual dataset.
    cmd = sprintf("conducto-data-pipeline gets --name=%s", f)
    fromJSON(system(cmd, intern=TRUE))
})

Use co.data.user / conducto-data-user

co.data.user is a user-scoped persistent key-value store. This is just like co.data.pipeline, but data is visible in all pipelines and persists beyond the lifetime of your pipeline. You are responsible for manually clearing your data when you no longer need it. In local mode, user data lives on your local filesystem. In cloud mode, user data lives in AWS S3.

co.data.user has both a python interface and command line interface as conducto-data-user. Here is the condensed interface. Our demo prints the command line usage to show the full interface.

usage: conducto-data-user [-h]  [< --arg1 val1 --arg2 val2 ...>]

methods:
    delete         (name)    
    exists         (name)    
    get            (name, file)    
    gets           (name, byte_range:List[int]=None)    
    list           (prefix)    
    put            (name, file)    
    puts           (name)    
    url            (name)    
    cache-exists   (name, checksum)    
    clear-cache    (name, checksum=None)    
    save-cache     (name, checksum, save_dir)    
    restore-cache  (name, checksum, restore_dir)
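
If you want to exercise code that talks to a key-value store like this without running a pipeline, a tiny in-memory stand-in can help. The class below is a hypothetical sketch, not Conducto's implementation. It borrows a few method names from the usage above, and the python signatures for list and delete are assumptions.

```python
class InMemoryStore:
    """Hypothetical stand-in that mimics a few of the method names above."""

    def __init__(self):
        self._data = {}

    def puts(self, name, data):
        # Store raw bytes under a name.
        self._data[name] = data

    def gets(self, name):
        # Return the bytes stored under a name.
        return self._data[name]

    def exists(self, name):
        return name in self._data

    def list(self, prefix):
        # All names that start with the given prefix, sorted.
        return sorted(k for k in self._data if k.startswith(prefix))

    def delete(self, name):
        self._data.pop(name, None)


store = InMemoryStore()
store.puts("conducto/demo/a", b"1")
store.puts("conducto/demo/b", b"2")
store.puts("other/c", b"3")

print(store.list("conducto/demo/"))
# ['conducto/demo/a', 'conducto/demo/b']
store.delete("conducto/demo/a")
print(store.exists("conducto/demo/a"))
# False
```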

One useful application in data science is storing downloaded data. In this example, we download data from the Bitcoin blockchain. This can be time-consuming, so we want to avoid downloading the same data twice. By storing the data in co.data.user, we pull it once and persist it across pipelines.

# Image installs python and conducto.
with co.Serial(image=image) as out:
    out["download_20-11"] = \
        co.Exec("python btc.py download --start=-20 --end=-11")
    out["download_15-6"] = \
        co.Exec("python btc.py download --start=-15 --end=-6")
    out["download_10-now"] = \
        co.Exec("python btc.py download --start=-10 --end=-1")

Notice that this example contains three “download” nodes with overlapping ranges. They each download their range and skip any blocks that are already downloaded.

The code using co.data.user is in btc.py, which you can view here. This is the relevant section of the download function, which calls co.data.user.exists, gets, and puts.

for height in range(start, end + 1):
    path = f"conducto/demo/btc/height={height}"

    # Check if `co.data.user` already has this block.
    if co.data.user.exists(path):
        print(f"Data already exists for block at height {height}")
        data_bytes = co.data.user.gets(path)
        _print_block(height, data_bytes)
        continue

    print(f"Downloading block at height={height}")
    data = _download_block(height)

    # Put the data into `co.data.user`.
    data_bytes = json.dumps(data).encode()
    co.data.user.puts(path, data_bytes)
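
The effect of the overlapping ranges can be checked with a short simulation. This sketch assumes the three download nodes run one after another, as in the Serial node above, and uses a plain set as a stand-in for the keys held in co.data.user. The simulate_downloads helper is a hypothetical name.

```python
def simulate_downloads(ranges):
    stored = set()  # stand-in for block heights already in the store
    downloaded = 0
    skipped = 0
    for start, end in ranges:
        for height in range(start, end + 1):
            if height in stored:
                # Block is already stored, so the download is skipped.
                skipped += 1
                continue
            stored.add(height)
            downloaded += 1
    return downloaded, skipped


print(simulate_downloads([(-20, -11), (-15, -6), (-10, -1)]))
# (20, 10)
```

Of the 30 blocks requested across the three nodes, only 20 are actually downloaded; the other 10 are found in the store.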

If you download the demo, you can run this pipeline and see that it takes some time to download the data. But, if you click the Reset button and re-run the pipeline, you will see that it runs much faster. This is expected, because all of the data, aside from any new data generated since the pipeline last ran, is already in user data. Select any of the download nodes and look at the timeline in the node pane to see how long your first and second runs took.

The timeline shows that the first run took 1 minute and 88 MB of memory. The second run took 2.7 seconds and 47 MB of memory because the data was already in co.data.user.

That’s it! Now, with the information you learned in Your First Pipeline, Execution Environment, Environment Variables and Secrets, Node Parameters, Easy and Powerful Python Pipelines, and here, you can create arbitrarily complex data science pipelines.

Data Stores was originally published in Conducto on Medium, where people are continuing the conversation by highlighting and responding to this story.

Introduction to Conducto Pipelines

Matt Jachowski — Sun, 19 Apr 2020 08:57:04 GMT

Getting Started with Conducto

A pipeline is a sequence of commands that must be executed in a specific order. Some steps can happen concurrently, while other steps must happen one after another.

Conducto is a tool for writing, executing, visualizing, and debugging pipelines. At its most basic level, Conducto makes it trivial to chain together sequences of shell commands into pipelines using a simple python interface.

Explore the live demo, view the source code for this tutorial, or clone the demo and run it for yourself.

git clone https://github.com/conducto/demo.git
cd demo
python demo.py islands --local

Alternatively, download the zip archive here.

Boilerplate

In this introduction, we will build a simple pipeline of echo commands. First, create an empty python file (mine is called demo.py), then add this standard Conducto boilerplate code.

import conducto as co

# We will add more code here.

if __name__ == "__main__":
    co.main()

Nodes

You can conceptualize a pipeline as a sequence of commands that happen in parallel (at the same time), or in serial (one after the other). Conducto exposes three Node classes that directly map onto these ideas: Exec, Parallel, and Serial. Note that the code below is just for illustration purposes and should not be copied into your python file.

An Exec Node is a shell command.

exec_node = co.Exec("echo hello world")

A Parallel Node holds other nodes that can be executed in parallel.

parallel_node = co.Parallel()
parallel_node["task1"] = co.Exec("echo whistle")
parallel_node["task2"] = co.Exec("echo while you work")

A Serial Node holds other nodes that must be executed in serial.

serial_node = co.Serial()
serial_node["task1"] = co.Exec("echo first do this")
serial_node["task2"] = co.Exec("echo then do that")

Pipeline Specification

Pipeline Function

A pipeline is specified in a function that returns the root node of a tree that combines Exec, Parallel, and Serial Nodes. So, let us go back to our file, and create an empty pipeline function. Here, we begin by defining a pipeline function named islands.

import conducto as co

def islands() -> co.Serial:
    return None

if __name__ == "__main__":
    co.main()

The islands function is annotated with a type hint indicating that it will return a Serial Node. It is ok if you are not familiar with type hints. Just ensure that your pipeline function signature always ends with -> co.[NodeType].

Pipeline Definition

Now we can actually define our pipeline. We are going to define a toy pipeline that prints the nickname of each Hawaiian island, starting with the southernmost island and moving north. Islands in the same county will be grouped into either a Parallel or Serial node. In pseudocode, the pipeline should look like:

hawaii -> echo big island
maui county:
    maui -> echo valley isle
    lanai -> echo pineapple isle
    molokai -> echo friendly isle
    kahoolawe -> echo target isle
oahu -> echo gathering place
kauai county:
    kauai -> echo garden isle
    niihau -> echo forbidden isle

We can easily translate this into python using Node objects. Note that the choice of Parallel and Serial Nodes for maui_county and kauai_county below is arbitrary.

pipeline = co.Serial()
pipeline["hawaii"] = co.Exec("echo big island")

pipeline["maui_county"] = co.Parallel()
pipeline["maui_county"]["maui"] = co.Exec("echo valley isle")
pipeline["maui_county"]["lanai"] = co.Exec("echo pineapple isle")
pipeline["maui_county"]["molokai"] = co.Exec("echo friendly isle")
pipeline["maui_county"]["kahoolawe"] = co.Exec("echo target isle")

pipeline["oahu"] = co.Exec("echo gathering place")

pipeline["kauai_county"] = co.Serial()
pipeline["kauai_county"]["kauai"] = co.Exec("echo garden isle")
pipeline["kauai_county"]["niihau"] = co.Exec("echo forbidden isle")

This is straightforward, but I believe that the pipeline structure is even clearer when we leverage python’s with statement. This code is an equivalent way to express our pipeline.

with co.Serial() as pipeline:
    pipeline["hawaii"] = co.Exec("echo big island")

    with co.Parallel(name="maui_county") as maui_county:
        maui_county["maui"] = co.Exec("echo valley isle")
        maui_county["lanai"] = co.Exec("echo pineapple isle")
        maui_county["molokai"] = co.Exec("echo friendly isle")
        maui_county["kahoolawe"] = co.Exec("echo target isle")

    pipeline["oahu"] = co.Exec("echo gathering place")

    with co.Serial(name="kauai_county") as kauai_county:
        kauai_county["kauai"] = co.Exec("echo garden isle")
        kauai_county["niihau"] = co.Exec("echo forbidden isle")
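
If you are curious how a with block can attach a named child to its enclosing node, one common approach is to keep a stack of the nodes whose with blocks are currently open. The toy model below illustrates that idea with a hypothetical ToyNode class. It is not Conducto's implementation and runs no commands.

```python
class ToyNode:
    """Hypothetical node that only records its children."""

    _open_nodes = []  # nodes whose `with` block is currently open

    def __init__(self, name=None):
        self.children = {}
        # A named node created inside a `with` block is attached
        # to the innermost open node.
        if name is not None and ToyNode._open_nodes:
            ToyNode._open_nodes[-1][name] = self

    def __setitem__(self, name, child):
        self.children[name] = child

    def __getitem__(self, name):
        return self.children[name]

    def __enter__(self):
        ToyNode._open_nodes.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        ToyNode._open_nodes.pop()
        return False  # do not swallow exceptions


with ToyNode() as pipeline:
    pipeline["hawaii"] = ToyNode()
    with ToyNode(name="maui_county") as maui_county:
        maui_county["maui"] = ToyNode()
    pipeline["oahu"] = ToyNode()

print(list(pipeline.children))
# ['hawaii', 'maui_county', 'oahu']
```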

Now, we can put this code into our islands function from before, return the root pipeline node, and we are done.

import conducto as co

def islands() -> co.Serial:
    with co.Serial() as pipeline:
        pipeline["hawaii"] = co.Exec("echo big island")
        with co.Parallel(name="maui_county") as maui_county:
            maui_county["maui"] = co.Exec("echo valley isle")
            maui_county["lanai"] = co.Exec("echo pineapple isle")
            maui_county["molokai"] = co.Exec("echo friendly isle")
            maui_county["kahoolawe"] = co.Exec("echo target isle")
        pipeline["oahu"] = co.Exec("echo gathering place")
        with co.Serial(name="kauai_county") as kauai_county:
            kauai_county["kauai"] = co.Exec("echo garden isle")
            kauai_county["niihau"] = co.Exec("echo forbidden isle")
    return pipeline

if __name__ == "__main__":
    co.main()

Pipeline Execution

The python file contains our full pipeline specification. Now, we can execute it. First, run the script with the --help option.

python demo.py --help

You will see a message like the one below. Conducto recognizes our pipeline function: islands is listed under the methods that return conducto pipelines.

usage: demo.py [-h]  [< --arg1 val1 --arg2 val2 ...>]
                [--cloud] [--local] [--run] [--sleep-when-done]
methods that return conducto pipelines:
    islands  () -> Serial

optional arguments:
  -h, --help  show this help message and exit
  --version   show conducto package version
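
This also hints at why the return type hint matters. One plausible way for a main function to find pipeline functions is to inspect return annotations. The sketch below shows that idea with toy Serial and Parallel classes and a hypothetical find_pipeline_functions helper. It is an assumption about the general technique, not a description of how co.main() works.

```python
import inspect


class Serial:  # toy stand-in, not conducto's class
    pass


class Parallel:  # toy stand-in, not conducto's class
    pass


def islands() -> Serial:
    return Serial()


def helper(x: int) -> int:
    return x + 1


def find_pipeline_functions(namespace, node_types=(Serial, Parallel)):
    """Return {function name: node type name} for annotated functions."""
    found = {}
    for name, obj in namespace.items():
        if not inspect.isfunction(obj):
            continue
        ret = inspect.signature(obj).return_annotation
        if ret in node_types:
            found[name] = ret.__name__
    return found


print(find_pipeline_functions({"islands": islands, "helper": helper}))
# {'islands': 'Serial'}
```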

Now, execute the script in local mode, which means that the entire pipeline will execute on your local machine. In a future release, you will also be able to execute the same script in cloud mode for immediate scale.

python demo.py islands --local

This should open a new browser window or tab to conducto.com where you can see the pipeline. If this does not happen, copy the printed URL into your browser.

The left-hand side of the screen is called the pipeline pane and has a toolbar with icons at the top. Click the View button to expand the pipeline and see the pipeline tree we have created. Click the Run button to execute the pipeline.

This is the pipeline pane. Click View to expand the pipeline tree and Run to execute the pipeline.

This interactive tree representation gives you a useful visual summary of the pipeline. You can see that Exec, Parallel, and Serial Node types are indicated by unique icons.

Notice how closely the pipeline tree in the web app mirrors our python specification.

Pipeline specification and visualization mirror each other.

Finally, click on one of the Exec nodes and examine the execution details. It contains useful information like the command, duration, memory used, return code, and stdout.

Summary

Now you have written and executed a simple pipeline in Conducto. I hope you are already imagining how Conducto can enable you to easily write and execute your own pipelines.

In my previous job, the predecessor to Conducto was the secret sauce that enabled our algorithmic trading team to run an ultra-productive data science and machine learning effort that has run for a decade and driven billions of dollars in revenue. So it stands to reason that Conducto is great for data science.

But, pipelines are everywhere, and when we switched our internal CI/CD pipeline from CircleCI to Conducto, we immediately became more productive. Try Conducto for CI/CD if you do not love your current solution.

Introduction to Conducto Pipelines was originally published in Conducto on Medium, where people are continuing the conversation by highlighting and responding to this story.

Environment Variables and Secrets

Matt Jachowski — Fri, 17 Apr 2020 01:29:00 GMT

Conducto for Data Science

Non-trivial pipelines require the specification of environment variables and secrets. This is easy in Conducto.

Explore our live demo, view the source code for this tutorial, or clone the demo and run it for yourself.

git clone https://github.com/conducto/demo.git
cd demo/data_science
python env_secrets.py --local

Alternatively, download the zip archive here.

Environment Variables

To specify environment variables, just supply the env argument to any node. Assign a dictionary of key-value pairs in which both keys and values are strings.

env = {
    "NUM_THREADS": "4",
    "MY_DATASET": "volcano_data",
}
image = co.Image("bash:5.0")
command = "env | grep -e NUM_THREADS -e MY_DATASET"
env_test = co.Exec(command, env=env, image=image)
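
Because values must be strings, numeric settings need to be converted before they are passed as env. Here is a minimal sketch of a hypothetical stringify_env helper that does this. It is plain python and not part of the Conducto API.

```python
def stringify_env(values):
    """Return a copy of `values` with every value converted to a string."""
    env = {}
    for key, value in values.items():
        if not isinstance(key, str):
            raise TypeError(f"env key {key!r} must be a string")
        env[key] = value if isinstance(value, str) else str(value)
    return env


env = stringify_env({"NUM_THREADS": 4, "MY_DATASET": "volcano_data"})
print(env)
# {'NUM_THREADS': '4', 'MY_DATASET': 'volcano_data'}
```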

Secrets

Some environment variables, like passwords and tokens, are sensitive and should not be hardcoded into any scripts. You can configure Conducto with both user- and org-level secrets (if you are an admin), which will be injected into each running exec node. You can specify a dictionary of secrets with our Secrets API.

# get_my_secrets_dict() returns a dict of string to string
user_secrets = get_my_secrets_dict()
token = co.api.Auth().get_token_from_shell()
secrets = co.api.Secrets()
secrets.put_user_secrets(token, user_secrets, replace=False)

Or you can enter them through our web interface.

Specifying AWS keys as user-level secrets.

That’s it! Now, with the information you learned in Your First Pipeline, Execution Environment, Data Stores, Node Parameters, Easy and Powerful Python Pipelines, and here, you can create arbitrarily complex data science pipelines.

Environment Variables and Secrets was originally published in Conducto on Medium, where people are continuing the conversation by highlighting and responding to this story.

Execution Environment

Matt Jachowski — Thu, 16 Apr 2020 20:31:13 GMT

Conducto for Data Science

In this tutorial, you will learn how to specify the dependencies and code necessary for your commands to run. Conducto strives to make this as simple as possible.

When we walked through creating your first pipeline, we glossed over an important detail — specifying the execution environment of your commands. That is, for each command, you must be able to specify:

the software dependencies required, and
a copy of your own code

Explore our live demo, view the source code for this tutorial, or clone the demo and run it for yourself.

git clone https://github.com/conducto/demo.git
cd demo/data_science
python execution_env.py --local

Alternatively, download the zip archive here.

Containers and Images

Conducto achieves this by running each of your exec node commands inside of a docker container, which is defined by an image that you help to configure. An image is a template for an execution environment that contains a base operating system and filesystem contents, including libraries, packages, and user code. A container is an instantiation of an image, and is like a virtual machine, but lighter weight and quicker to create and destroy.

It is ok if you are new to containers; Conducto handles a lot of the details for you.

We will deep dive into how you configure an image. As a refresher, this is the pipeline from your first pipeline tutorial. Note the image parameter passed to co.Serial.

import conducto as co

def download_and_plot() -> co.Serial:
    dockerfile = "./docker/Dockerfile.first"
    image = co.Image(dockerfile=dockerfile, copy_dir="./code")
    with co.Serial(image=image) as pipeline:
        co.Exec(download_command, name="download")
        with co.Parallel(name="plot"):
            # ...
    return pipeline

if __name__ == "__main__":
    co.main(default=download_and_plot)

Image Specification

In Conducto, there are two ways to specify an image.

Specifying an existing image from DockerHub or another image registry.
Specifying a custom Dockerfile.

Existing Image

Specifying an existing image looks like this.

image = co.Image("r-base:3.6.0")

This particular image contains R, a programming language and environment for statistical computing, in a Debian Linux operating system, and is one of the many official R images available on DockerHub. You can specify any image from any public image registry, or a locally built image.

Python Image + Python Requirements

If you specify an image with python installed, we also allow you to specify any python package requirements inline.

image = co.Image("python:3.8-slim", reqs_py=["numpy"])

This specific example is equivalent to having python 3.8 installed in Debian Linux, with the following pip command having been run.

pip install numpy

Custom Dockerfile

For more control, you can specify your own Dockerfile, which Conducto will build into an image. You may specify dockerfile with an absolute or relative path; a relative path is evaluated relative to the location of your pipeline script. You must also specify context, which is the docker build context.

image = co.Image(
    dockerfile="./docker/Dockerfile.simple",
    context="."
)

Here is a very simple Dockerfile that results in an image equivalent to the python example from the previous section.

FROM python:3.8-slim
RUN pip install numpy
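
To make the equivalence concrete, the sketch below generates that Dockerfile text from a base image and a list of python requirements. The dockerfile_for helper is hypothetical. It only illustrates the correspondence between the two forms and is not how Conducto builds images.

```python
def dockerfile_for(base_image, reqs_py=()):
    """Build Dockerfile text for a base image plus pip requirements."""
    lines = [f"FROM {base_image}"]
    if reqs_py:
        lines.append("RUN pip install " + " ".join(reqs_py))
    return "\n".join(lines) + "\n"


print(dockerfile_for("python:3.8-slim", ["numpy"]), end="")
# FROM python:3.8-slim
# RUN pip install numpy
```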

Adding Your Own Code

So far we have discussed how to use images to include required software dependencies. But, you likely also need to include your own code in the image.

Fun fact: Conducto was almost named Blue Steel.

There are a few ways to do this.

Copy a local directory directly into the image.
Clone a specific branch from a git repository into the image.
COPY or ADD files explicitly in a Dockerfile.

Copy a Local Directory

You can specify a local directory with your own files to be copied into your image with the copy_dir argument. You may use an absolute or relative path for the directory; a relative path is evaluated relative to the location of your pipeline script.

image = co.Image("r-base:3.6.0", copy_dir="./code")

This copies the directory ./code into your image. You may specify copy_dir for any version of image specification from above: existing image or dockerfile.

Clone from Git

You can also specify a git repository and branch to clone into your image with the copy_url and copy_branch arguments. This is useful for ensuring that your data science pipelines run against clean, versioned code, and not scripts with local uncommitted changes that could be lost. Here is an example using our demo repo on GitHub.

git_url = "https://github.com/conducto/demo.git"
dockerfile = "./docker/Dockerfile.git"
image = co.Image(
    dockerfile=dockerfile, copy_url=git_url, copy_branch="master"
)

Just like copy_dir, you can specify copy_url and copy_branch with any version of image specification.

COPY or ADD in Dockerfile

Finally, if you specify your own custom Dockerfile, you can COPY or ADD any files you want. Here is a Dockerfile that explicitly copies a code directory into the image. In this example, ./code is a path relative to the docker build context, specified by the context argument as seen earlier.

FROM r-base:3.6.0
COPY ./code /root/code

Mounting Local Code for Debugging

One of our favorite features in Conducto is live debugging. We show an example of this in our debugging tutorial. When you debug a node, you get a shell in a container with your full execution environment, including any code you have added to the image. If possible, we will mount your local code, creating a live debug environment. In this mode, any edits you make to your code outside of the container are visible inside the container, where you can test your command in its full execution environment. This allows you to use your regular editor and debug tools outside of the container to make the debug process as painless as possible.

We can do this in two scenarios:

you add your code with copy_dir, or
you specify path_map to explicitly map paths outside the container to inside the container.

So, you get the feature for no effort if you use copy_dir, but you have to specify an extra parameter if you want to use live debug with the clone from git or dockerfile image specifications.

Clone from Git + path_map

If you always have a local checkout of the git repo that you specify to an image, you can safely specify a path_map to make any later debugging easier. Here is the example from above with path_map added.

git_url = "https://github.com/conducto/demo.git"
path_map = {".": "data_science"}
image = co.Image(
    dockerfile="./docker/Dockerfile.git",
    copy_url=git_url,
    copy_branch="master",
    path_map=path_map
)

This maps the local directory ., relative to the location of the pipeline script, which is outside the container, to the data_science directory relative to the root of the cloned git repo inside the container.

COPY or ADD in Dockerfile + path_map

It works the same way for an image with a dockerfile that adds its own files, except that the target path inside the container must be absolute. This is because in this scenario, Conducto has no way to choose a reasonable default root directory inside the container. Here is an example.

path_map = {"./code": "/root/code"}
image = co.Image(
    dockerfile="./docker/Dockerfile.copy",
    context=".",
    path_map=path_map
)

Where the Dockerfile is the same as above.

FROM r-base:3.6.0
COPY ./code /root/code
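
The two path_map rules can be summarized in a few lines of python: local paths are resolved against the location of the pipeline script, and a relative target is resolved against a root directory inside the container. The resolve_path_map helper, the script directory, and the /repo container root below are all hypothetical values chosen for illustration. This is not Conducto's code.

```python
from pathlib import PurePosixPath


def resolve_path_map(path_map, script_dir, container_root=None):
    """Resolve a path_map into absolute local and container paths."""
    resolved = {}
    for local, target in path_map.items():
        # Local paths are relative to the pipeline script's directory.
        local_abs = PurePosixPath(script_dir) / local
        target_path = PurePosixPath(target)
        if not target_path.is_absolute():
            if container_root is None:
                raise ValueError("relative target needs a container root")
            # Relative targets are resolved against the container root.
            target_path = PurePosixPath(container_root) / target_path
        resolved[str(local_abs)] = str(target_path)
    return resolved


script_dir = "/home/me/demo/data_science"  # hypothetical location

# Clone-from-git case: the target is relative to the cloned repo root.
print(resolve_path_map({".": "data_science"}, script_dir, "/repo"))
# {'/home/me/demo/data_science': '/repo/data_science'}

# Dockerfile COPY case: the target must already be absolute.
print(resolve_path_map({"./code": "/root/code"}, script_dir))
# {'/home/me/demo/data_science/code': '/root/code'}
```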

Image Inheritance

Finally, a node with an unspecified image parameter will inherit the value of its parent. The pipeline from our first tutorial shows this, with all nodes sharing an image with the root node.

import conducto as co

def download_and_plot() -> co.Serial:
    dockerfile = "./docker/Dockerfile.first"
    image = co.Image(dockerfile=dockerfile, copy_dir="./code")
    with co.Serial(image=image) as pipeline:
        co.Exec(download_command, name="download")
        with co.Parallel(name="plot"):
            # ...
    return pipeline

if __name__ == "__main__":
    co.main(default=download_and_plot)
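
Inheritance of this kind can be pictured as walking up the tree until a node with an explicit value is found. The sketch below models that lookup with a hypothetical ToyNode class and placeholder image strings. It illustrates the idea only and is not Conducto's implementation.

```python
class ToyNode:
    """Hypothetical node that only tracks an image and a parent."""

    def __init__(self, image=None, parent=None):
        self.image = image
        self.parent = parent

    def effective_image(self):
        # Walk up the tree until some ancestor specifies an image.
        node = self
        while node is not None:
            if node.image is not None:
                return node.image
            node = node.parent
        return None


root = ToyNode(image="image-from-Dockerfile.first")
plot = ToyNode(parent=root)
heating = ToyNode(parent=plot)
override = ToyNode(image="r-base:3.6.0", parent=plot)

print(heating.effective_image())   # image-from-Dockerfile.first
print(override.effective_image())  # r-base:3.6.0
```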

That is all there is to it! Now, with the information you learned in Your First Pipeline, Environment Variables and Secrets, Data Stores, Node Parameters, Easy and Powerful Python Pipelines, and here, you can create arbitrarily complex data science pipelines.

Execution Environment was originally published in Conducto on Medium, where people are continuing the conversation by highlighting and responding to this story.

Your First Data Science Pipeline

Matt Jachowski — Thu, 16 Apr 2020 06:39:07 GMT

Conducto for Data Science

In this tutorial, you will learn how to define, execute, and interact with a simple Conducto pipeline.

Upon completion, you will understand how to use the following minimal API.

co.Exec, co.Serial, and co.Parallel node classes,
co.Image to specify execution environment, and
co.main() to make your pipeline executable

Explore our live demo, view the source code for this tutorial, or clone the demo and run it for yourself.

git clone https://github.com/conducto/demo.git
cd demo/data_science
python first_pipeline.py --local

Alternatively, download the zip archive here.

Define Your Pipeline

In Conducto, you express your pipeline as a series of commands that need to be executed in serial and/or parallel. Our python API exposes a minimal set of Node classes to get this done quickly and painlessly. Then, you have the full power of python to nest these nodes for arbitrarily complex pipelines.

First, you need to import conducto.

import conducto as co

Then, you start building your pipeline with nodes.

Exec Node

An exec node simply wraps a shell command.

plot = co.Exec("python plot.py --dataset heating")

Serial Node

A serial node specifies that a series of sub-nodes must happen one after another. If one of the sub-nodes fails, execution stops and the entire serial node is marked as failed.

steps = co.Serial()
steps["download"] = co.Exec(download_command)
steps["plot"] = co.Exec("python plot.py --dataset heating")

Note that the definition of download_command is omitted for clarity. See the source code in the demo for the full details.

Parallel Node

A parallel node specifies that a series of sub-nodes can occur in parallel. All nodes are executed, and if any nodes fail, the entire parallel node is marked as failed.

plot = co.Parallel()
plot["heating"] = co.Exec("python plot.py --dataset heating")
plot["cooling"] = co.Exec("python plot.py --dataset cooling")
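
The difference in failure handling between serial and parallel nodes can be modeled in plain python. In this sketch, each step is a function that returns True on success. The run_serial and run_parallel helpers are hypothetical, and the parallel version runs its steps sequentially for simplicity, since only the failure semantics are being illustrated.

```python
def run_serial(steps):
    """Run steps in order and stop at the first failure."""
    ran = []
    for name, step in steps.items():
        ran.append(name)
        if not step():
            return False, ran
    return True, ran


def run_parallel(steps):
    """Run every step; the group fails if any step failed."""
    ran = []
    ok = True
    for name, step in steps.items():
        ran.append(name)
        ok = step() and ok  # step() is always called
    return ok, ran


steps = {
    "a": lambda: True,
    "b": lambda: False,  # this step fails
    "c": lambda: True,
}

print(run_serial(steps))    # (False, ['a', 'b'])
print(run_parallel(steps))  # (False, ['a', 'b', 'c'])
```

Both groups are marked as failed, but only the parallel group still runs step c.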

Nesting

Serial and parallel nodes may contain any node type, not just exec nodes. This allows the creation of non-trivial pipelines.

pl = co.Serial()
pl["download"] = co.Exec(download_command)
pl["plot"] = co.Serial()
pl["plot"]["heating"] = co.Exec("python plot.py --dataset heating")
pl["plot"]["cooling"] = co.Exec("python plot.py --dataset cooling")

Easy to do, but perhaps more verbose than you prefer. We can use python to make it nicer.

with co.Serial() as pl:
    co.Exec(download_command, name="download")
    with co.Serial(name="plot"):
        co.Exec("python plot.py --dataset heating", name="heating")
        co.Exec("python plot.py --dataset cooling", name="cooling")

Image

Of course, your commands will only be able to run in an execution environment with:

your software dependencies installed,
a copy of your own code present, and
any necessary environment variables set

Conducto achieves this by running each of your exec commands inside of a docker container, which is defined by an image that you help to configure. Read full details in the Execution Environment and Environment Variables and Secrets tutorials. But for now, we will skip over these details, and just provide an appropriate image for our example. This particular image includes python and some packages to manipulate data, and copies over your local ./code directory. Note that the . is relative to the location of the pipeline script.

dockerfile = "./docker/Dockerfile.first"
image = co.Image(dockerfile=dockerfile, copy_dir="./code")
with co.Serial(image=image) as pipeline:
    # ...

Main

Now that you have a pipeline specified, make it executable. First, wrap your pipeline in a function that returns the top-level node.

def download_and_plot() -> co.Serial:
    dockerfile = "./docker/Dockerfile.first"
    image = co.Image(dockerfile=dockerfile, copy_dir="./code")
    with co.Serial(image=image) as pipeline:
        co.Exec(download_command, name="download")
        with co.Parallel(name="plot"):
            # ...
    return pipeline

Conducto requires that you write a type hint to indicate the node return type of the function. Do not worry if type hints are new to you. Simply ensure that the first line of your function includes -> co.[NodeClass], like this:

def download_and_plot() -> co.Serial:

Finally, define the main function of your python script.

def download_and_plot() -> co.Serial:
    dockerfile = "./docker/Dockerfile.first"
    image = co.Image(dockerfile=dockerfile, copy_dir="./code")
    with co.Serial(image=image) as pipeline:
        co.Exec(download_command, name="download")
        with co.Parallel(name="plot"):
            # ...
    return pipeline

if __name__ == "__main__":
    co.main(default=download_and_plot)

Execute Your Pipeline

Executing your pipeline is easy. First, if you want to spot-check your pipeline, run your script with no arguments.

python first_pipeline.py

You will see a pipeline serialization like this.

/
├─0 download   set -ex\ncurl http://...
└─1 plot
  ├─ heating   python plot.py --dataset heating
  └─ cooling   python plot.py --dataset cooling

To execute the pipeline on your local machine, which is always free, run this. Note that in local mode, your code never leaves your machine.

python first_pipeline.py --local

Coming soon, you will be able to effortlessly run the same pipeline in the cloud too.

python first_pipeline.py --cloud

Interact With Your Pipeline

The script will print a URL and pop it open in your browser. You can view your pipeline,

The pipeline summary is the row at the top, the pipeline pane is on the left, and the node pane is on the right. The pipeline pane shows your pipeline, with parallel, serial, and exec nodes getting unique icons.

run it and quickly identify pipeline status,

Press the Run button in the upper left of the pipeline pane. See the execution status of each node: Pending, Queued, Running, Done, Errored, and Killed.

examine the output of any exec node,

View the command, execution params, stdout, and stderr of a node in the right-hand node pane. Stdout can even include plots!

and rapidly and painlessly debug errors. Collaborate with anyone else in your org by sharing the URL.

Put your pipeline to sleep when you are finished with it. Its state, logs, and data are stored for 7 days. During this period you can wake it up. After 7 days, it is deleted.

The “zzz” icon in the pipeline summary puts the pipeline to sleep. When no pipelines are selected you see a list of available ones. Click the “alarm clock” button on a sleeping pipeline to get a wakeup command to run in a local shell.

How Much More Data Do You Need?

This was a simple example, but once you add in Environment Variables and Secrets, Data Stores, Node Parameters, and Easy and Powerful Python Pipelines, you can easily express the most complex of data science pipelines in Conducto.

In my previous job, the predecessor to Conducto was the secret sauce that enabled our algorithmic trading team to run an ultra-productive data science and machine learning effort that has driven billions of dollars in revenue for a decade. Simply put, Conducto multiplied the impact of each team member by a lot.

How much more data do you need? Get started with Conducto now. Local mode is always free and is only limited by the CPU and memory on your machine. Cloud mode gives you immediate scale. Use the full power of python to write pipelines with ease. And, experience painless debugging and easy error resolution.

Your First Data Science Pipeline was originally published in Conducto on Medium, where people are continuing the conversation by highlighting and responding to this story.