<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Matt Jachowski on Medium]]></title>
        <description><![CDATA[Stories by Matt Jachowski on Medium]]></description>
        <link>https://medium.com/@jachowski?source=rss-42ece86b5daf------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/2*0uOEyQS6timBw9wlosY06w.jpeg</url>
            <title>Stories by Matt Jachowski on Medium</title>
            <link>https://medium.com/@jachowski?source=rss-42ece86b5daf------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 24 May 2026 02:29:02 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@jachowski/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Conducto for Data Science]]></title>
            <link>https://medium.com/conducto/conducto-for-data-science-59f426ee57b?source=rss-42ece86b5daf------2</link>
            <guid isPermaLink="false">https://medium.com/p/59f426ee57b</guid>
            <category><![CDATA[workflow]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[containers]]></category>
            <category><![CDATA[data-visualization]]></category>
            <dc:creator><![CDATA[Matt Jachowski]]></dc:creator>
            <pubDate>Tue, 21 Apr 2020 22:34:45 GMT</pubDate>
            <atom:updated>2020-05-12T07:17:05.811Z</atom:updated>
            <content:encoded><![CDATA[<h3>Conducto for Data Science</h3><p>We make bold claims about why Conducto is great for Data Science. Our intelligent container-based architecture and thoughtful developer-driven design make it possible to:</p><ul><li><a href="https://medium.com/conducto/your-first-data-science-pipeline-cc9ceac142f6">easily write pipelines with the full power of python</a>,</li><li><a href="https://medium.com/conducto/easy-and-powerful-python-pipelines-2de5825375f2">dynamically modify pipelines at runtime</a>,</li><li>execute locally for free, or in the cloud for immediate scale,</li><li><a href="https://medium.com/conducto/your-first-data-science-pipeline-cc9ceac142f6">interact with our intuitive and simple pipeline view in the web app</a>,</li><li><a href="https://medium.com/conducto/easy-error-resolution-45ca08d40f1d">debug and deploy fixes to live pipelines easily</a>, and</li><li>effortlessly collaborate with teammates.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S1giSQF9vM5tkdzR58cIgA.png" /><figcaption>Conducto’s container-based architecture.</figcaption></figure><p>How much more data do you need? Explore our <a href="https://www.conducto.com/demo/data_science">live demo</a> to get a taste. 
Then, <a href="https://medium.com/conducto/getting-started/home">get started</a> on Linux, macOS, Windows, or WSL and immediately become more productive.</p><p>If you have already started but want to learn more, here is our recommended reading list.</p><ol><li><a href="https://medium.com/conducto/your-first-data-science-pipeline-cc9ceac142f6">Your First Pipeline</a></li><li><a href="https://medium.com/conducto/execution-environment-3bb663549a0c">Execution Environment</a></li><li><a href="https://medium.com/conducto/environment-variables-and-secrets-9acab502ec77">Environment Variables and Secrets</a></li><li><a href="https://medium.com/conducto/data-stores-f6dc90104029">Data Stores</a></li><li><a href="https://medium.com/conducto/node-parameters-7be236eaeaac">Node Parameters</a></li><li><a href="https://medium.com/conducto/easy-and-powerful-python-pipelines-2de5825375f2">Easy and Powerful Python Pipelines</a></li><li><a href="https://medium.com/conducto/easy-error-resolution-45ca08d40f1d">Easy Error Resolution</a></li></ol><hr><p><a href="https://medium.com/conducto/conducto-for-data-science-59f426ee57b">Conducto for Data Science</a> was originally published in <a href="https://medium.com/conducto">Conducto</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Easy and Powerful Python Pipelines]]></title>
            <link>https://medium.com/conducto/easy-and-powerful-python-pipelines-2de5825375f2?source=rss-42ece86b5daf------2</link>
            <guid isPermaLink="false">https://medium.com/p/2de5825375f2</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[containers]]></category>
            <category><![CDATA[workflow]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-visualization]]></category>
            <dc:creator><![CDATA[Matt Jachowski]]></dc:creator>
            <pubDate>Tue, 21 Apr 2020 22:18:41 GMT</pubDate>
            <atom:updated>2020-07-29T20:28:49.787Z</atom:updated>
            <content:encoded><![CDATA[<h4><a href="https://medium.com/conducto/data/home">Conducto for Data Science</a></h4><p>You can build pipelines out of commands in any language with Conducto, but we have some extra support for python that allows you to easily glue python functions together into rich and dynamic pipelines.</p><ul><li><a href="#84e0">Pass a python function to </a><a href="#84e0">co.Exec</a>.</li><li><a href="#d675">Lazily define your pipeline at runtime with </a><a href="#d675">co.Lazy</a>.</li><li><a href="#2139">Use Markdown to display rich output in an Exec node</a>. (This is not specific to python.)</li></ul><p>This example does a parallel word count over a randomly generated list of words. The algorithm is simple but illustrates a common pattern in data science.</p><ol><li>Get the data.</li><li>Do parallelized analysis over the data.</li><li>Aggregate the results.</li></ol><p>Explore our <a href="https://www.conducto.com/demo/data_science">live demo</a>, view the <a href="https://github.com/conducto/demo/blob/main/data_science/easy_python.py">source code for this tutorial</a>, or clone the <a href="https://github.com/conducto/demo">demo</a> and run it for yourself.</p><pre>git clone <a href="https://github.com/conducto/demo.git">https://github.com/conducto/demo.git</a><br>cd demo/data_science<br>python easy_python.py --local</pre><p>Alternatively, download the zip archive <a href="https://github.com/conducto/demo/archive/master.zip">here</a>.</p><h3>Pass a Python Function to co.Exec</h3><p>Conducto can automatically call python functions from the shell so you do not have to build your own command-line interface. 
Instead of calling co.Exec with a shell command, pass it a function and its arguments.</p><p>In this example, we want to execute this function in an Exec node.</p><pre><strong>def gen_data(path: str, count: int):<br></strong>    words = _get_words(count)<br>    text = b&quot;\n&quot;.join(words) + b&quot;\n&quot;<br>    co.temp_data.puts(path, text)</pre><p>So, we pass the gen_data function and its arguments to co.Exec.</p><pre>co.Exec(<strong>gen_data</strong>, WORDLIST_PATH, count=50000)</pre><p>This auto-generates the shell command below for the Exec node. Note that the conducto executable is largely just a wrapper for python.</p><pre>conducto easy.py gen_data \<br>    --path=conducto/demo_data/wordlist --count=50000</pre><h4>Requirements</h4><p>Conducto needs to be able to find this function in the image that the Exec node runs. Therefore, the Exec node must run with a co.Image that has copy_dir, copy_url, or path_map set. Also:</p><ul><li>The image must include the file with the function.</li><li>The function name cannot start with an underscore (_).</li><li>The image must install conducto.</li><li>You must set typical node parameters like image, env, doc, etc. outside of the constructor, either in a parent node or by setting the fields directly.</li></ul><h4>Function Arguments</h4><p>All arguments are serialized to the command line, so only pass parameters and paths. Large amounts of data should be passed via a data store like co.temp_data instead.</p><p>Arguments can be basic python types (int, float, etc.), date/time/datetime, or lists thereof. Conducto infers types from the default arguments or from type hints, and deserializes accordingly.</p><h3>Lazy Pipeline Definition</h3><p>Data science pipelines often benefit from dynamically defining the pipeline structure based on the properties of data that only become evident as you begin analyzing it. 
For example, you may not know the size of your data until you download it, which determines how you want to chunk your parallel analysis for maximum efficiency.</p><p>Conducto empowers you to lazily define your pipeline such that new nodes can be defined as the pipeline runs. Simply write a function that returns a Parallel or Serial node that represents a new subtree to add to the pipeline, and call it with co.Lazy.</p><p>The parallel_word_count node defines a pipeline to chunk and analyze the input data in parallel. This is the declaration of the parallelize function that generates it. Importantly, it is type-hinted to return a Parallel node.</p><pre>def parallelize(<br>    wordlist_path, result_dir, top: int, chunksize: int<br>) -&gt; co.Parallel:</pre><p>The lazy node is created by assigning the result of co.Lazy to a node in the pipeline:</p><pre>output[&quot;parallel_word_count&quot;] = co.Lazy(<br>    parallelize, WORDLIST_PATH, RESULT_DIR, top=15, chunksize=1000<br>)</pre><p>co.Lazy produces two nodes inside the parallel_word_count Serial node.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sbSHhRni-MlA5IkvdJzmcQ.png" /><figcaption>The <strong>Generate</strong> and <strong>Execute</strong> nodes are auto-generated by <strong>co.Lazy</strong>. Note that the <strong>Execute</strong> node is an empty parallel node, because the <strong>Generate</strong> node that populates it has not run yet.</figcaption></figure><p>The first node, Generate, is an Exec node that calls the parallelize function and prints out the pipeline that it returns. 
This is the command it runs:</p><pre>conducto easy.py parallelize \<br>    --wordlist_path=conducto/demo_data/wordlist \<br>    --result_dir=conducto/demo_data/results \<br>    --top=15 --chunksize=1000</pre><p>Once the Generate node finishes and returns its new pipeline subtree, the subtree is deserialized into an Execute node, which then runs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EbVoR5ckfRaUWZPedmgNNQ.png" /><figcaption>The output of the <strong>Generate</strong> node is the pipeline definition for the <strong>Execute</strong> node, which can then run.</figcaption></figure><h4>Requirements</h4><p>co.Lazy has all the same limitations as co.Exec(func) that you saw above. Additionally, the function must be type hinted to return a Parallel or Serial node, as in def func() -&gt; co.Parallel.</p><h4>When to use it</h4><p>The demo pipeline uses co.Lazy to dynamically parallelize over input data, but there are many other common uses:</p><ul><li><strong>Processing streaming data in batches</strong>: When processing a new batch, use co.Lazy to filter out data that has already been processed, and only generate nodes for new data. Use the same logic to backfill data.</li><li><strong>Relational mapping</strong>: To join relational data, simply use a for loop. When joining datasets A and B, iterate over A at runtime and create Exec nodes that run in parallel. Each node looks up the rows in B that correspond to its A value. You have full control over the parallelism and can debug any failed or incorrect mappings.</li><li><strong>Time-consuming pipeline generation logic</strong>: Sometimes, even figuring out the work to do can take a while. Use co.Lazy to parallelize pipeline creation and get it out of the critical path.</li></ul><p>These uses can arise multiple times in the same pipeline. 
co.Lazy is fully nestable, so you can handle them all and lazily generate as sophisticated a pipeline as you need.</p><h3>Markdown to Display Rich Output</h3><p>The goal of data science pipelines is often to produce human-understandable results. While you are always free to send data to external visualization tools, Conducto supports using Markdown to display tables, links, and graphs in your node’s output. Note that this is not specific to python and can be used by any command.</p><p>Simply print &lt;ConductoMarkdown&gt;...&lt;/ConductoMarkdown&gt; to stdout or stderr, and Conducto will render the Markdown between the tags.</p><p>The summarize node in the demo summarizes the results of the parallel_word_count step using a graph and a table. This is the relevant output code from the summarize function.</p><pre>print(&quot;&lt;ConductoMarkdown&gt;&quot;)<br>print(f&quot;![img]({url})&quot;)<br>print()<br>print(&quot;rank | word | count&quot;)<br>print(&quot;-----|------|------&quot;)<br>for rank, (word, count) in enumerate(summary.most_common(top), 1):<br>    print(f&quot;#{rank} | {word} | {count}&quot;)<br>print(&quot;&lt;/ConductoMarkdown&gt;&quot;)</pre><p>And this is the output as rendered in the node pane.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lvTkhpYH4UhGpynlLQElAQ.png" /><figcaption>Show a graph and a table in stdout using Markdown.</figcaption></figure><p>That’s it! By now you should know how to construct some powerful data science pipelines with <a href="https://www.conducto.com/">Conducto</a>. 
If you think you missed anything, check out our recommended reading list <a href="https://medium.com/conducto/conducto-for-data-science-59f426ee57b">here</a>.</p><hr><p><a href="https://medium.com/conducto/easy-and-powerful-python-pipelines-2de5825375f2">Easy and Powerful Python Pipelines</a> was originally published in <a href="https://medium.com/conducto">Conducto</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Easy Error Resolution]]></title>
            <link>https://medium.com/conducto/easy-error-resolution-45ca08d40f1d?source=rss-42ece86b5daf------2</link>
            <guid isPermaLink="false">https://medium.com/p/45ca08d40f1d</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[debugging]]></category>
            <category><![CDATA[workflow]]></category>
            <category><![CDATA[containers]]></category>
            <dc:creator><![CDATA[Matt Jachowski]]></dc:creator>
            <pubDate>Tue, 21 Apr 2020 19:18:15 GMT</pubDate>
            <atom:updated>2020-07-29T20:28:21.240Z</atom:updated>
            <content:encoded><![CDATA[<h4><a href="https://medium.com/conducto/data/home">Conducto for Data Science</a></h4><p>Anyone who has spent time with complex data science pipelines has spent <em>a lot</em> of that time resolving errors with them. Bugs are just a reality when you are trying to implement a complex system. Conducto makes it as easy as possible to resolve the three types of errors we think that you are most likely to encounter:</p><ul><li><a href="#0746">flaky errors that you should fix, but do not have time for now</a>,</li><li><a href="#41f8">pipeline specification errors, like a typo in a command or missing env</a>, and</li><li><a href="#abd2">errors that require serious debugging</a></li></ul><p>We think that our thoughtful approach to error surfacing and handling will save you a ton of time and make you more productive.</p><p>Explore our <a href="https://www.conducto.com/demo/data_science">live demo</a>, view the <a href="https://github.com/conducto/demo/blob/main/data_science/error_resolution.py">source code</a>, or clone the <a href="https://github.com/conducto/demo">demo</a> and run it for yourself.</p><pre>git clone <a href="https://github.com/conducto/demo.git">https://github.com/conducto/demo.git</a><br>cd demo/data_science<br>python error_resolution.py --local</pre><p>Alternatively, download the zip archive <a href="https://github.com/conducto/demo/archive/master.zip">here</a>.</p><h3>Flaky Errors</h3><p>Sometimes your pipeline has a flaky command that periodically fails for no good reason. You really should fix it, but you do not want it to block you now. Or, your pipeline computes features over 500 days’ worth of data in parallel, and 2 days out of 500 fail due to corrupt data. In the first case, you can <strong><em>Reset</em></strong> the node to try again. 
Or, in either case, you can <strong><em>Skip</em></strong> the node to ignore the error and move on.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*96j_dGnupcuT5oyVk2qokA.png" /><figcaption>This is the flaky error example from our demo with the <strong>Reset</strong> and <strong>Skip</strong> buttons boxed in yellow.</figcaption></figure><h4>Reset</h4><p>If the command passes 80% of the time and fails 20% of the time, and you just want to run it again to give it a chance to pass, click the <em>Reset</em> button in the toolbar to re-run the node. If it passes, then great, your pipeline will continue on.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*19O1AZtmodNhTEzTW7vWZg.png" /><figcaption>After clicking <strong>Reset</strong>, the node still fails, as seen in the <strong>timeline</strong>.</figcaption></figure><h4>Skip</h4><p>In this scenario, the command keeps failing even after a few resets. In this case, you should just skip the node. Select the errored feature2 node and click the <em>Skip</em> button in the web app to let your pipeline continue to the build_model node. Alternatively, you can select and skip the errored parent compute_features node, which marks all of its subnodes as skipped and lets your pipeline continue to the build_model node.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yM7eDwPNZtpH1uonjsq0Vg.png" /><figcaption>After skipping the errored <strong>feature2</strong> node, the pipeline is able to continue to the <strong>build_model</strong> node.</figcaption></figure><h3>Specification Errors</h3><p>You are going to make typos or forget things like environment variables when you write a pipeline specification; that is just human. 
In Conducto, quickly fix errors like these by selecting the errored node, clicking the <em>Modify</em> button in the toolbar, fixing the offending parameter, and then clicking the <em>Reset</em> button to immediately re-run the node.</p><p>Note that these fixes are isolated to the <em>live instance</em> of the pipeline, and do not modify anything in the pipeline script. You need to port your fixes to the pipeline script so that future runs do not suffer from the same errors.</p><h4>Fix an Environment Variable</h4><p>In the demo, we made a typo in the name of an environment variable. You can fix the error by selecting either the errored env_error node or its specification_error parent node, clicking the <em>Modify</em> button, then correcting the typo: CRATCH_DIR -&gt; SCRATCH_DIR.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jTK4CTlB6UhXcIX4.png" /><figcaption>Correct the typo, CRATCH_DIR -&gt; SCRATCH_DIR, in the Modify modal.</figcaption></figure><p>After clicking <em>Update</em>, you can verify that you see the expected diff in the right-hand node pane.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FFnObXQ_ZjYogad0.png" /><figcaption>Verify that the change you made is correct by viewing the Execution Parameters diff.</figcaption></figure><p>Finally, click <em>Reset</em> and you will see the node complete successfully.</p><h4>Fix a Command</h4><p>In the next node, we made a typo in the command. 
You can fix that error by selecting the errored command_error node, clicking the <em>Modify</em> button, then correcting the typo: lss -&gt; ls.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*v6qPHwHPK3c5RAVH.png" /><figcaption>Correct the typo in the command, lss -&gt; ls, in the Modify modal.</figcaption></figure><p>After clicking <em>Update</em>, you can verify that you see the expected diff in the right-hand node pane.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*cEtO45TxOOki6xHS.png" /><figcaption>Verify that the change you made is correct by viewing the Execution Parameters diff.</figcaption></figure><p>Finally, click <em>Reset</em> and you will see the node complete successfully.</p><h3>Errors Requiring Debugging</h3><p>Sometimes you have a real issue that you need to debug. You can use <strong><em>debug</em></strong> mode by clicking the <em>empty bug</em> icon or <strong><em>live debug</em></strong> mode by clicking the <em>lightning bug</em> icon.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0nN3wWl3NPc2Wutn.png" /></figure><p>You can choose to <strong>debug</strong> with a snapshot of your code or <strong>live debug</strong> with your local code mounted directly into your debug container.</p><h4>Debug Mode</h4><p>Debug mode gives you a shell in a container with the node’s command and execution environment, including environment variables and a <em>copy</em> of your code. You can immediately reproduce the exact results you see in your pipeline. You can modify command, environment, and code in this container. Any changes are discarded when you exit this shell, so you must manually port your fixes back to your local code.</p><h4>Live Debug Mode</h4><p>Live debug mode gives you the same shell as debug mode, but also mounts your local code so that you can edit code outside of the shell with your own editor. 
Conversely, any changes you make inside the live debug container persist on your local host even after you exit the shell, allowing you to instantly commit any of your fixes to your repo.</p><h4>Debug Example</h4><p>In this example, you should use <em>live debug</em> mode. Click the lightning bug in the upper right-hand corner of the node pane to get a command copied to your clipboard. Paste that command into a local shell. Run the command to <em>immediately reproduce</em> the error reported by the pipeline.</p><p>Now, since the <em>live debug</em> container mounts the code from your local filesystem, you can edit and debug using your own editor and debug environment. Test your fix by re-running the command in the live debug container.</p><p>A <em>debug</em> container works the same way, but the code is copied into the container and has no connection to your local machine. So, you must edit and debug entirely within the debug shell.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*cuwy6-hMEp1hstFO.png" /><figcaption>A <strong>live debug</strong> session starts with a command that you paste into a shell. In the debug container you can cat the command, execute it to immediately reproduce the error, and re-run it to test your fix once you have debugged it in your own local editor.</figcaption></figure><p>Once you have fixed the code, you must click <em>Rebuild Image</em> to rebuild the image so that the pipeline can see the updated code. Once the image is rebuilt, you can click <em>Reset</em> to re-run the node to see it run successfully. 
As a shortcut, you can click <em>Rebuild and Reset</em> in the upper right-hand corner of the node pane.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*agDRSzB8KMvPleCg.png" /><figcaption><strong>Rebuild</strong> the image then <strong>Reset</strong> to re-run the node in one step by clicking <strong>Rebuild and Reset</strong>, which is conveniently the default button displayed in the yellow box.</figcaption></figure><p>You can view the history of each run of a node in the node pane <em>timeline</em>. Select any row in the timeline to see the Command, Execution Parameters, Stdout, and Stderr for that run of the node. Here, we can see the output of the first run that errored, and the second run that was successful.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZiN7l-tNNEn8Yo_5i_tPcw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*k0SCoZD7WEts3wfBnlA9zQ.png" /><figcaption>Toggle between different rows in the timeline to see Command, Execution Parameters, Stdout, and Stderr for different runs of a node.</figcaption></figure><p>We are developers who know the pain of re-creating execution environments and debugging in fragile setups. So, we built Conducto to make error resolution as quick and easy as possible. We hope that you will find debugging in Conducto to be a breath of fresh air.</p><p>If you have not yet, <a href="https://medium.com/conducto/conducto-for-data-science-59f426ee57b">get started with Conducto now</a>. Local mode is always free and is only limited by the CPU and memory on your machine. Cloud mode gives you immediate scale. <a href="https://medium.com/conducto/your-first-data-science-pipeline-cc9ceac142f6">Use the full power of python to write pipelines with ease</a>. 
And, enjoy easy error resolution.</p><hr><p><a href="https://medium.com/conducto/easy-error-resolution-45ca08d40f1d">Easy Error Resolution</a> was originally published in <a href="https://medium.com/conducto">Conducto</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Easy Error Resolution]]></title>
            <link>https://medium.com/conducto/easy-error-resolution-b9f2b54f22b7?source=rss-42ece86b5daf------2</link>
            <guid isPermaLink="false">https://medium.com/p/b9f2b54f22b7</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[ci-cd-pipeline]]></category>
            <category><![CDATA[cicd]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[containers]]></category>
            <dc:creator><![CDATA[Matt Jachowski]]></dc:creator>
            <pubDate>Tue, 21 Apr 2020 09:25:36 GMT</pubDate>
            <atom:updated>2020-07-29T20:24:42.655Z</atom:updated>
            <content:encoded><![CDATA[<h4><a href="https://medium.com/conducto/cicd/home">Conducto for CI/CD</a></h4><p>Anyone who has spent time with complex CI/CD pipelines has spent <em>a lot</em> of that time resolving errors with them. Bugs are just a reality when you are trying to implement a complex system. Conducto makes it as easy as possible to resolve the three types of errors we think that you are most likely to encounter:</p><ul><li><a href="#2540">flaky errors that you should fix, but do not have time for now</a>,</li><li><a href="#b24c">pipeline specification errors, like a typo in a command or missing env</a>, and</li><li><a href="#ce43">errors that require serious debugging</a></li></ul><p>We think that our thoughtful approach to error surfacing and handling will save you a ton of time and make you more productive.</p><p>Explore our <a href="https://www.conducto.com/demo/cicd">live demo</a>, view the <a href="https://github.com/conducto/demo/blob/main/cicd/error_resolution.py">source code</a>, or clone the <a href="https://github.com/conducto/demo">demo</a> and run it for yourself.</p><pre>git clone <a href="https://github.com/conducto/demo.git">https://github.com/conducto/demo.git</a><br>cd demo/cicd<br>python error_resolution.py --local</pre><p>Alternatively, download the zip archive <a href="https://github.com/conducto/demo/archive/master.zip">here</a>.</p><h3>Flaky Errors</h3><p>Sometimes your pipeline has a flaky test that periodically fails for no good reason. You really should fix it, but you do not want it to block you now. 
You have two options: you can <strong><em>Reset</em></strong> the node to try again, or you can <strong><em>Skip</em></strong> the node to ignore the error and move on.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*goXkY1KLnDZn3g-h9NzvmA.png" /><figcaption>This is the flaky error example from our demo with the <strong>Reset</strong> and <strong>Skip</strong> buttons boxed in yellow.</figcaption></figure><h4>Reset</h4><p>If the test passes 80% of the time and fails 20% of the time, and you just want to run it again to give it a chance to pass, click the <em>Reset</em> button in the toolbar to re-run the node. If it passes, then great, your pipeline will continue on.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SK_ytqL9ZWFxKLZYpnAG2A.png" /><figcaption>After clicking <strong>Reset</strong>, the node still fails, as seen in the <strong>timeline</strong>.</figcaption></figure><h4>Skip</h4><p>In this scenario, the test keeps failing even after a few resets. In this case, you should just skip the node. Select the errored test2 node and click the <em>Skip</em> button in the toolbar to let your pipeline continue to the deploy node. Alternatively, you can select and skip the errored parent test node, which marks all of its subnodes as skipped and lets your pipeline continue to the deploy node.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hJZs3KwZyyKt4g4GepdFtw.png" /><figcaption>After skipping the errored <strong>test2</strong> node, the pipeline is able to continue to the <strong>deploy</strong> node.</figcaption></figure><h3>Specification Errors</h3><p>You are going to make typos or forget things like environment variables when you write a pipeline specification; that is just human. 
In Conducto, quickly fix errors like these by selecting the errored node, clicking the <em>Modify</em> button in the toolbar, fixing the offending parameter, and then clicking the <em>Reset</em> button to immediately re-run the node.</p><p>Note that these fixes are isolated to the <em>live instance</em> of the pipeline, and do not modify anything in the pipeline script. You need to port your fixes to the pipeline script so that future runs do not suffer from the same errors.</p><h4>Fix an Environment Variable</h4><p>In the demo, we made a typo in the name of an environment variable. You can fix the error by selecting either the errored env_error node or its specification_error parent node, clicking the <em>Modify</em> button, then correcting the typo: CRATCH_DIR -&gt; SCRATCH_DIR.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uYlYsOY53jrzKfBNDpXt0Q.png" /><figcaption>Correct the typo, CRATCH_DIR -&gt; SCRATCH_DIR, in the Modify modal.</figcaption></figure><p>After clicking <em>Update</em>, you can verify that you see the expected diff in the right-hand node pane.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VjGaeJq8dXPDrAA8p822Fg.png" /><figcaption>Verify that the change you made is correct by viewing the Execution Parameters diff.</figcaption></figure><p>Finally, click <em>Reset</em> and you will see the node complete successfully.</p><h4>Fix a Command</h4><p>In the next node, we made a typo in the command. 
You can fix that error by selecting the errored command_error node, clicking the <em>Modify</em> button, then correcting the typo: lss -&gt; ls.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZuMOTppOi-6Mw4TKPdNqgQ.png" /><figcaption>Correct the typo in the command, lss -&gt; ls, in the Modify modal.</figcaption></figure><p>After clicking <em>Update</em>, you can verify that you see the expected diff in the right-hand node pane.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FbwtshucwdnrwENsyeCdww.png" /><figcaption>Verify that the change you made is correct by viewing the Execution Parameters diff.</figcaption></figure><p>Finally, click <em>Reset</em> and you will see the node complete successfully.</p><h3>Errors Requiring Debugging</h3><p>Sometimes you have a real issue that you need to debug. You can use <strong><em>debug</em></strong> mode by clicking the <em>empty bug</em> icon or <strong><em>live debug</em></strong> mode by clicking the <em>lightning bug</em> icon.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NcEKmh2fVxaquBHMEwEEsQ.png" /><figcaption>You can choose to <strong>debug</strong> with a snapshot of your code or <strong>live debug</strong> with your local code mounted directly into your debug container.</figcaption></figure><h4>Debug Mode</h4><p>Debug mode gives you a shell in a container with the node’s command and execution environment, including environment variables and a <em>copy</em> of your code. You can immediately reproduce the exact results you see in your pipeline. You can modify command, environment, and code in this container. Any changes are discarded when you exit this shell, so you must manually port your fixes back to your local code.</p><h4>Live Debug Mode</h4><p>Live debug mode gives you the same shell as debug mode, but also mounts your local code so that you can edit code outside of the shell with your own editor. 
Conversely, any changes you make inside the live debug container persist outside on your local host even after you exit the shell, allowing you to instantly commit any of your fixes to your repo.</p><h4>Debug Example</h4><p>In this example, you should use <em>live debug</em> mode. Click the lightning bug in the upper right-hand corner of the node pane to get a command copied to your clipboard. Paste that command into a local shell. Run the command to <em>immediately reproduce</em> the error reported by the pipeline.</p><p>Now, since the <em>live debug</em> container mounts the code from your local filesystem, you can edit and debug using your own editor and debug environment. Test your fix by re-running the command in the live debug container.</p><p>A <em>debug</em> container works the same way, but the code is copied into the container and has no connection to your local machine. So, you must edit and debug entirely within the debug shell.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XEuiWbljK7xzCvPetEkq_A.png" /><figcaption>A <strong>live debug</strong> session starts with a command that you paste into a shell. In the debug container you can cat the command, execute it to immediately reproduce the error, and re-run it to test your fix once you have debugged it in your own local editor.</figcaption></figure><p>Once you have fixed the code, you must click <em>Rebuild Image</em> to rebuild the image so that the pipeline can see the updated code. Once the image is rebuilt, you can click <em>Reset</em> to re-run the node to see it run successfully. 
As a shortcut, you can click <em>Rebuild and Reset</em> in the upper right-hand corner of the node pane.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2qGMYCDn_Ts4YdGUCMTbuQ.png" /><figcaption><strong>Rebuild</strong> the image then <strong>Reset</strong> to re-run the node in one step by clicking <strong>Rebuild and Reset</strong>, which is conveniently the default button displayed in the yellow box.</figcaption></figure><p>You can view the history of each run of a node in the node pane <em>timeline</em>. Select any row in the timeline to see the Command, Execution Parameters, Stdout, and Stderr for that run of the node. Here, we can see the output of the first run that errored, and the second run that was successful.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZiN7l-tNNEn8Yo_5i_tPcw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*k0SCoZD7WEts3wfBnlA9zQ.png" /><figcaption>Toggle between different rows in the timeline to see Command, Execution Parameters, Stdout, and Stderr for different runs of a node.</figcaption></figure><p>We are developers who know the pain of re-creating execution environments and debugging in fragile setups. So, we built Conducto to make error resolution as quick and easy as possible. We hope that you will find debugging in Conducto to be a breath of fresh air. Check out <a href="https://medium.com/conducto/rapid-and-painless-debugging-ff2abdba44c1">Rapid and Painless Debugging</a> to see us applying these techniques to our actual internal CI/CD pipeline.</p><p>If you have not yet, <a href="https://medium.com/conducto/getting-started-with-conducto-for-ci-cd-b6afb626f410">get started with Conducto now</a>. Local mode is always free and is only limited by the cpu and memory on your machine. Cloud mode gives you immediate scale. 
<a href="https://medium.com/conducto/your-first-pipeline-32a303b2cc5d">Use the full power of python to write pipelines with ease</a>. And, enjoy easy error resolution.</p><hr><p><a href="https://medium.com/conducto/easy-error-resolution-b9f2b54f22b7">Easy Error Resolution</a> was originally published in <a href="https://medium.com/conducto">Conducto</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Node Parameters]]></title>
            <link>https://medium.com/conducto/node-parameters-7be236eaeaac?source=rss-42ece86b5daf------2</link>
            <guid isPermaLink="false">https://medium.com/p/7be236eaeaac</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[containers]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[workflow]]></category>
            <category><![CDATA[data-visualization]]></category>
            <dc:creator><![CDATA[Matt Jachowski]]></dc:creator>
            <pubDate>Tue, 21 Apr 2020 00:25:50 GMT</pubDate>
            <atom:updated>2020-07-29T20:27:48.147Z</atom:updated>
            <content:encoded><![CDATA[<h4><a href="https://medium.com/conducto/data/home">Conducto for Data Science</a></h4><p>Exec, Serial, and Parallel Nodes support several parameters that make pipeline specification in Conducto extremely powerful. You have already learned about <a href="https://medium.com/conducto/execution-environment-3bb663549a0c#6743">image</a> and <a href="https://medium.com/conducto/environment-variables-and-secrets-9acab502ec77">env</a>. You can also specify:</p><ul><li><a href="#8990">cpu and </a><a href="#8990">mem to constrain resources</a></li><li><a href="#5942">requires_docker to run docker commands</a></li><li><a href="#b670">stop_on_error to implement the <em>finally</em> pattern</a></li><li><a href="#fc0c">same_container to control container sharing</a></li><li><a href="#d7aa">doc to show pretty documentation in the web app</a></li><li><a href="#7d76">skip to skip a node by default</a></li></ul><p>Explore our <a href="https://www.conducto.com/demo/data_science">live demo</a>, view the <a href="https://github.com/conducto/demo/blob/main/data_science/node_params.py">source code for this tutorial</a>, or clone the <a href="https://github.com/conducto/demo">demo</a> and run it for yourself.</p><pre>git clone <a href="https://github.com/conducto/demo.git">https://github.com/conducto/demo.git</a><br>cd demo/data_science<br>python node_params.py --local</pre><p>Alternatively, download the zip archive <a href="https://github.com/conducto/demo/archive/master.zip">here</a>.</p><p>You can view most of these parameters for any node in the <em>Execution Parameters</em> section of the node pane.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*3uoeRyhI6WV4Yg3w.png" /><figcaption>Most node and image parameters are listed in the node pane.</figcaption></figure><p>And, you can modify most of these parameters for any node in a live pipeline from the <em>Modify</em> modal and <em>Reset</em> the node to re-run in 
place.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lTbxsKSLyEBps-rk.png" /><figcaption>You can modify many of the node parameters in the <strong>Modify</strong> modal.</figcaption></figure><h3>cpu and mem</h3><p>The cpu and mem parameters limit the cpu and memory that get assigned to an Exec node. The default values are cpu=1 (one cpu) and mem=2 (2 GB). If your commands require very little cpu or memory, allocate less so that your local pipeline can launch more nodes in parallel. Allocate more if necessary.</p><pre>co.Exec(&quot;echo not doing much&quot;, <strong>cpu=0.25</strong>, <strong>mem=0.25</strong>)</pre><h3>requires_docker</h3><p>To enable running docker commands like docker build, docker run, etc. in a node, you must set requires_docker=True. This is because your commands run within a docker container already, and running docker within docker requires non-trivial setup that Conducto will not do by default. Also, note that your image must have docker installed.</p><pre>image = co.Image(<strong>&quot;docker:19.03&quot;</strong>)<br>co.Exec(&quot;docker run hello-world&quot;, <strong>requires_docker=True</strong>, image=image)</pre><h3>stop_on_error</h3><p>A Serial node defaults to stop_on_error=True, which means that it stops and reports itself as errored as soon as any child node encounters an error. If stop_on_error=False, then it will run <em>all</em> child nodes, but will still report itself as errored if any child encountered an error. 
This is useful for implementing a <em>finally</em> pattern to guarantee that your pipeline always runs a cleanup step.</p><pre>with co.Serial(name=&quot;stop_on_error_false&quot;, <strong>stop_on_error=False</strong>):<br>    co.Exec(&quot;echo doing some setup&quot;, name=&quot;setup&quot;)<br>    co.Exec(&quot;this_command_will_fail&quot;, name=&quot;bad_command&quot;)<br>    co.Exec(&quot;echo doing some cleanup&quot;, name=&quot;finally_cleanup&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ReV5e_lQEUlDA4W7.png" /><figcaption>A pipeline with the default stop_on_error=True behavior (above) vs one with stop_on_error=False (below). You can ensure that a final cleanup step always runs with stop_on_error=False.</figcaption></figure><h3>same_container</h3><p>Exec nodes are not guaranteed to run in the same containers, although Conducto will reuse containers when possible for efficiency. You can force commands to run in the same container with the argument same_container=co.SameContainer.NEW. All child nodes will have the default same_container=co.SameContainer.INHERIT and will share the container with the parent. This is useful if you want greater visibility into a command that chains together multiple subcommands. 
An error in a single subcommand will be easier to identify than an error in a long command.</p><pre>long_command = &quot;&quot;&quot;set -ex<br>echo This is a long command.<br>echo First I do this.<br>echo Then I do that.<br>oops_this_is_not_a_valid_command<br>echo Then I go home.<br>&quot;&quot;&quot;<br>co.Exec(long_command)</pre><p>versus</p><pre>with co.Serial(name=&quot;example&quot;, same_container=co.SameContainer.NEW):<br>    co.Exec(&quot;echo This is a long command.&quot;, name=&quot;intro&quot;)<br>    co.Exec(&quot;echo First I do this.&quot;, name=&quot;do_this&quot;)<br>    co.Exec(&quot;echo Then I do that.&quot;, name=&quot;do_that&quot;)<br>    co.Exec(&quot;oops_this_is_not_a_valid_command&quot;, name=&quot;oops&quot;)<br>    co.Exec(&quot;echo Then I go home.&quot;, name=&quot;go_home&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fCv7_CtRp9jPI0PX.png" /><figcaption>It is easier to identify where the error occurred after splitting a long command into several commands sharing the same container.</figcaption></figure><p>Another reason to use same_container=co.SameContainer.NEW to force container sharing is when you want your commands to share a filesystem. This makes a download and analyze pipeline very easy, for example, because you simply download the data to the filesystem in one node, and the analyze node can automatically see it. There is no need to put the binary in a separate data store.</p><pre>with co.Serial(name=&quot;shared&quot;, same_container=co.SameContainer.NEW):<br>    co.Exec(f&quot;curl {data_url} &gt; /tmp/data.zip&quot;, name=&quot;download&quot;)<br>    co.Exec(&quot;unzip -pq /tmp/data.zip &gt; /tmp/data&quot;, name=&quot;unzip&quot;)<br>    co.Exec(&quot;wc -l /tmp/data&quot;, name=&quot;analyze&quot;)</pre><p>However, there is a downside to this same_container mode. When sharing a container, Exec nodes will <em>always run in serial</em>, even if the parent is a Parallel node. 
So, you lose the ability to parallelize.</p><pre>with co.Parallel(<br>    name=&quot;always_serial&quot;, same_container=co.SameContainer.NEW<br>):<br>    co.Exec(&quot;echo I cannot run in parallel&quot;, name=&quot;parallel_exec_1&quot;)<br>    co.Exec(&quot;echo even if I want to&quot;, name=&quot;parallel_exec_2&quot;)</pre><h3>doc</h3><p>Nodes can be documented with the doc parameter. It supports Markdown and is rendered at the top of the node pane. Nodes with docs are marked with a <em>doc</em> icon in the pipeline pane. We make extensive use of this feature in all of our demos.</p><pre>markdown_doc = &quot;### I _can_ **use** `markdown`&quot;</pre><pre>more_markdown_doc = &quot;&quot;&quot;<br>Markdown even supports [links](https://www.conducto.com)<br>and images ![alt text](<br><a href="http://cdn.loc.gov/service/pnp/highsm/21700/21778r.jpg">http://cdn.loc.gov/service/pnp/highsm/21700/21778r.jpg</a> &quot;a pretty picture&quot;)<br>&quot;&quot;&quot;</pre><pre>co.Exec(&quot;echo doc example 1&quot;, doc=markdown_doc)<br>co.Exec(&quot;echo doc example 2&quot;, doc=more_markdown_doc)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jU00KwVUbamU-379.png" /><figcaption>The example uses simple Markdown in the doc.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*y-K-lF2AzCY9AH57.png" /><figcaption>This example uses Markdown to display a link and an image.</figcaption></figure><h3>skip</h3><p>Nodes can be skipped in the web app or with skip=True. This is useful, for example, if you have a pipeline that has a reasonable default way to run, but you want the ability to manually enable (unskip) additional steps from the web app. A specific example might be deploying a production model. 
You could skip the deployment node by default, and require that someone manually reviews the output of the pipeline before unskipping and running the node to complete the deployment.</p><pre>image = co.Image(&quot;bash:5.0&quot;)<br>with co.Serial(image=image) as skip_example:<br>    co.Exec(&quot;echo build model&quot;, name=&quot;build&quot;)<br>    co.Exec(&quot;echo test model&quot;, name=&quot;test&quot;)<br>    co.Exec(&quot;echo deploy model&quot;, name=&quot;deploy&quot;, <strong>skip=True</strong>)<br>    co.Exec(&quot;echo send status email&quot;, name=&quot;send email&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5wJf2053X_5EV57UK79h_Q.png" /><figcaption>Default <strong>skip</strong> the deploy step, and force someone to manually <strong>unskip</strong> it from the toolbar.</figcaption></figure><p>Now, with the information you learned in <a href="https://medium.com/conducto/your-first-data-science-pipeline-cc9ceac142f6">Your First Pipeline</a>, <a href="https://medium.com/conducto/execution-environment-3bb663549a0c">Execution Environment</a>, <a href="https://medium.com/conducto/environment-variables-and-secrets-9acab502ec77">Environment Variables and Secrets</a>, <a href="https://medium.com/conducto/data-stores-f6dc90104029">Data Stores</a>, <a href="https://medium.com/conducto/easy-and-powerful-python-pipelines-2de5825375f2">Easy and Powerful Python Pipelines</a>, and here, you can create arbitrarily complex pipelines.</p><hr><p><a href="https://medium.com/conducto/node-parameters-7be236eaeaac">Node Parameters</a> was originally published in <a href="https://medium.com/conducto">Conducto</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Data Stores]]></title>
            <link>https://medium.com/conducto/data-stores-f6dc90104029?source=rss-42ece86b5daf------2</link>
            <guid isPermaLink="false">https://medium.com/p/f6dc90104029</guid>
            <category><![CDATA[workflow]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[containers]]></category>
            <category><![CDATA[data-visualization]]></category>
            <dc:creator><![CDATA[Matt Jachowski]]></dc:creator>
            <pubDate>Mon, 20 Apr 2020 22:37:49 GMT</pubDate>
            <atom:updated>2020-07-29T20:27:10.080Z</atom:updated>
            <content:encoded><![CDATA[<h4><a href="https://medium.com/conducto/data/home">Conducto for Data Science</a></h4><p>Data science pipelines necessarily generate data, plots, or intermediate results that need to be stored for some amount of time. You cannot simply persist these files on the local filesystem, because each command runs in a container with its own filesystem that disappears when the container exits. And, in cloud mode, containers run on different machines, so there is no shared filesystem to mount. So, <a href="https://conducto.com/">Conducto</a> supports a few different approaches that work in a containerized world.</p><ul><li><a href="#da90">Connect to your own data store</a>.</li><li><a href="#69e1">Use </a><a href="#69e1">co.data.pipeline/conducto-data-pipeline as a pipeline-local key-value store</a>.</li><li><a href="#1b84">Use </a><a href="#1b84">co.data.user/conducto-data-user as a user-scoped persistent key-value store</a>.</li></ul><p>Explore our <a href="https://www.conducto.com/demo/data_science">live demo</a>, view the <a href="https://github.com/conducto/demo/blob/main/data_science/data_stores.py">source code for this tutorial</a>, or clone the <a href="https://github.com/conducto/demo">demo</a> and run it for yourself.</p><pre>git clone <a href="https://github.com/conducto/demo.git">https://github.com/conducto/demo.git</a><br>cd demo/data_science<br>python data_stores.py --local</pre><p>Alternatively, download the zip archive <a href="https://github.com/conducto/demo/archive/master.zip">here</a>.</p><h3>Your Own Data Store</h3><p>There are many standard ways to store persistent data: databases, AWS S3, and in-memory caches like redis, just to name a few. An Exec node can run any shell command, so it is easy to use any of these approaches. 
Here is a trivial example that sets AWS credentials and writes to S3 with the AWS CLI.</p><pre>image = co.Image(&quot;python:3.8-alpine&quot;, reqs_py=[&quot;awscli&quot;])<br>env = {<br>    &quot;AWS_ACCESS_KEY_ID&quot;: &quot;my_access_key_id&quot;,<br>    &quot;AWS_SECRET_ACCESS_KEY&quot;: &quot;my_secret_key&quot;<br>}<br>s3_command = &quot;aws s3 cp my_file s3://my_s3_bucket/&quot;<br>s3_exec_node = co.Exec(s3_command, image=image, env=env)</pre><p>Note that in a real pipeline, you would want to store your AWS credentials as <a href="https://medium.com/conducto/environment-variables-and-secrets-9acab502ec77">secrets</a>.</p><h3>Use co.data.pipeline / conducto-data-pipeline</h3><p>co.data.pipeline is a pipeline-local key-value store. This data is only visible to your pipeline and persists until your pipeline is deleted. It is useful for writing data in one pipeline step, and reading it in another. In local mode, pipeline data lives on your local filesystem. In cloud mode, pipeline data lives in AWS S3.</p><p>co.data.pipeline has both a python interface and command line interface as conducto-data-pipeline. Here is the condensed interface. Our demo prints the command line usage to show the full interface.</p><pre>usage: conducto-data-pipeline [-h] &lt;method&gt; [&lt; --arg1 val1 --arg2 val2 ...&gt;]<br><br>methods:<br>    delete         (name)<br>    exists         (name)<br>    get            (name, file)<br>    gets           (name, byte_range:List[int]=None)<br>    list           (prefix)<br>    put            (name, file)<br>    puts           (name)<br>    url            (name)<br>    cache-exists   (name, checksum)<br>    clear-cache    (name, checksum=None)<br>    save-cache     (name, checksum, save_dir)<br>    restore-cache  (name, checksum, restore_dir)</pre><p>One useful application is performing and summarizing a parameter search. 
In this example, we try different parameterizations of an algorithm in parallel. Each one stores its results using co.data.pipeline.puts(). Once all of the parallel tasks are done, the summarize step reads the results using co.data.pipeline.gets() and prints a summary.</p><p>Here is the pipeline specification. Each pipeline node is bolded for clarity.</p><pre># Location to store data.<br>data_dir = &quot;demo/data_science/pipeline_data&quot;</pre><pre># Image installs python, R, and conducto.<br>output = co.Serial(image=image)</pre><pre># Parameter search over 3 parameters in nested for loops.<br><strong>output[&quot;parameter_search&quot;] = ps = co.Parallel()</strong></pre><pre>for window in [25, 50, 100]:<br>    <strong>ps[f&quot;window={window}&quot;] = w = co.Parallel()</strong></pre><pre>    for mean in [.05, .08, .11]:<br>        <strong>w[f&quot;mean={mean}&quot;] = m = co.Parallel()</strong></pre><pre>        for volatility in [.1, .125, .15, .2]:<br>            <strong>m[f&quot;volatility={volatility}&quot;] = co.Exec(<br>                f&quot;python temp_data.py --window={window} &quot;<br>                f&quot;--mean={mean} --volatility={volatility} &quot;<br>                f&quot;--data-dir={data_dir}&quot;<br>            )</strong></pre><pre># Summarize parameter search results.<br><strong>output[&quot;summarize&quot;] = co.Exec(f&quot;Rscript temp_data.R {data_dir}&quot;)</strong></pre><p>This results in the following pipeline, where I have drilled down to an arbitrary step of the parameter search.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JbERL7kIkgWj4QS02LYhmg.png" /><figcaption>View of the pipeline pane for the parameter search example pipeline.</figcaption></figure><p>Any Exec node shows the command being run for a single step of the parameter search.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SGA96fN05o6s6WTIm7bnPQ.png" /><figcaption>The node pane shows the command being run for a single step of the parameter search.</figcaption></figure><p>The script being run for each step of the parameter search is temp_data.py and can be viewed <a href="https://github.com/conducto/demo/blob/master/data_science/code/temp_data.py">here</a>. 
In particular, this is the code it uses to store results to co.data.pipeline.</p><pre># Save result to Conducto&#39;s pipeline data store<br>path = &quot;{}/mn={:.2f}_vol={:.2f}_win={:03}&quot;.format(<br>    data_dir, mean, volatility, window<br>)<br>data = json.dumps(output).encode()<br><strong>co.data.pipeline.puts(path, data)</strong></pre><p>In contrast, the summarize step runs temp_data.R, which can be viewed <a href="https://github.com/conducto/demo/blob/master/data_science/code/temp_data.R">here</a>, and uses the command line interface conducto-data-pipeline.</p><pre># Use `conducto-data-pipeline list` command to get all the files.<br><strong>cmd = sprintf(&quot;conducto-data-pipeline list --prefix=%s&quot;, argv$dir)</strong><br>files = fromJSON(system(cmd, intern=TRUE))</pre><pre>names(files) &lt;- gsub(&quot;.*/&quot;, &quot;&quot;, files)<br>datas = lapply(files, function(f) {<br>    # Call `conducto-data-pipeline gets` to get an individual dataset.<br>    <strong>cmd = sprintf(&quot;conducto-data-pipeline gets --name=%s&quot;, f)</strong><br>    fromJSON(system(cmd, intern=TRUE))<br>})</pre><h3>Use co.data.user / conducto-data-user</h3><p>co.data.user is a user-scoped persistent key-value store. This is just like co.data.pipeline, but data is visible in all pipelines and persists beyond the lifetime of your pipeline. You are responsible for manually clearing your data when you no longer need it. In local mode, user data lives on your local filesystem. In cloud mode, user data lives in AWS S3.</p><p>co.data.user has both a python interface and command line interface as conducto-data-user. Here is the condensed interface. 
Our demo prints the command line usage to show the full interface.</p><pre>usage: conducto-data-user [-h] &lt;method&gt; [&lt; --arg1 val1 --arg2 val2 ...&gt;]<br><br>methods:<br>    delete         (name)    <br>    exists         (name)    <br>    get            (name, file)    <br>    gets           (name, byte_range:List[int]=None)    <br>    list           (prefix)    <br>    put            (name, file)    <br>    puts           (name)    <br>    url            (name)    <br>    cache-exists   (name, checksum)    <br>    clear-cache    (name, checksum=None)    <br>    save-cache     (name, checksum, save_dir)    <br>    restore-cache  (name, checksum, restore_dir)</pre><p>One useful application in data science is storing downloaded data. In this example, we download data from the Bitcoin blockchain. This can be time-consuming, so we want to avoid downloading the same data twice. By storing the data in co.data.user, we pull it once and persist it across pipelines.</p><pre># Image installs python and conducto.<br>with co.Serial(image=image) as out:<br>    out[&quot;download_20-11&quot;] = \<br>        co.Exec(&quot;python btc.py download --start=-20 --end=-11&quot;)<br>    out[&quot;download_15-6&quot;] = \<br>        co.Exec(&quot;python btc.py download --start=-15 --end=-6&quot;)<br>    out[&quot;download_10-now&quot;] = \<br>        co.Exec(&quot;python btc.py download --start=-10 --end=-1&quot;)</pre><p>Notice that this example contains three “download” nodes with overlapping ranges. They each download their range and skip any blocks that are already downloaded.</p><p>The code using co.data.user is in btc.py, which you can view <a href="https://github.com/conducto/demo/blob/master/data_science/code/btc.py">here</a>. 
This is a relevant section of the download function, with co.data.user usage bolded.</p><pre>for height in range(start, end + 1):<br>    path = f&quot;conducto/demo/btc/height={height}&quot;</pre><pre>    # Check if `co.data.user` already has this block.<br><strong>    if co.data.user.exists(path):<br></strong>        print(f&quot;Data already exists for block at height {height}&quot;)<br><strong>        data_bytes = co.data.user.gets(path)<br></strong>        _print_block(height, data_bytes)<br>        continue</pre><pre>    print(f&quot;Downloading block at height={height}&quot;)<br>    data = _download_block(height)</pre><pre>    # Put the data into `co.data.user`.<br>    data_bytes = json.dumps(data).encode()<br>    <strong>co.data.user.puts(path, data_bytes)</strong></pre><p>If you download the demo, you can run this pipeline and see that it takes some time to download the data. But, if you click the <em>Reset</em> button and re-run the pipeline, you will see that it runs much faster. This is expected, because all of the data, aside from any new data generated since the pipeline last ran, is already in user data. Select any of the download nodes and look at the <em>timeline </em>in the node pane to see how long your first and second runs took.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4i1IREhYBOPWPP0xiaENlA.png" /><figcaption>The timeline shows that the first run took 1 minute and 88 MB of memory. The second run took 2.7 seconds and 47 MB of memory because the data was already in co.data.user.</figcaption></figure><p>That’s it! 
Now, with the information you learned in <a href="https://medium.com/conducto/your-first-data-science-pipeline-cc9ceac142f6">Your First Pipeline</a>, <a href="https://medium.com/conducto/execution-environment-3bb663549a0c">Execution Environment</a>, <a href="https://medium.com/conducto/environment-variables-and-secrets-9acab502ec77">Environment Variables and Secrets</a>, <a href="https://medium.com/conducto/node-parameters-7be236eaeaac">Node Parameters</a>, <a href="https://medium.com/conducto/easy-and-powerful-python-pipelines-2de5825375f2">Easy and Powerful Python Pipelines</a>, and here, you can create arbitrarily complex data science pipelines.</p><hr><p><a href="https://medium.com/conducto/data-stores-f6dc90104029">Data Stores</a> was originally published in <a href="https://medium.com/conducto">Conducto</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Introduction to Conducto Pipelines]]></title>
            <link>https://medium.com/conducto/introduction-to-conducto-pipelines-2759ecf876a2?source=rss-42ece86b5daf------2</link>
            <guid isPermaLink="false">https://medium.com/p/2759ecf876a2</guid>
            <category><![CDATA[ci-cd-pipeline]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[containers]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Matt Jachowski]]></dc:creator>
            <pubDate>Sun, 19 Apr 2020 08:57:04 GMT</pubDate>
            <atom:updated>2020-07-29T20:18:44.527Z</atom:updated>
            <content:encoded><![CDATA[<h4><a href="https://medium.com/conducto/getting-started/home">Getting Started with Conducto</a></h4><p>A pipeline is a sequence of commands that must be executed in a specific order. Some steps can happen concurrently, while other steps must happen one after another.</p><p>Conducto is a tool for writing, executing, visualizing, and debugging pipelines. At its most basic level, Conducto makes it trivial to chain together sequences of shell commands into pipelines using a simple python interface.</p><p>Explore the <a href="https://www.conducto.com/demo/simple">live demo</a>, view the <a href="https://github.com/conducto/demo/blob/main/demo.py">source code for this tutorial</a>, or clone the <a href="https://github.com/conducto/demo">demo</a> and run it for yourself.</p><pre>git clone <a href="https://github.com/conducto/demo.git">https://github.com/conducto/demo.git</a><br>cd demo<br>python demo.py islands --local</pre><p>Alternatively, download the zip archive <a href="https://github.com/conducto/demo/archive/master.zip">here</a>.</p><h3>Boilerplate</h3><p>In this introduction, we will build a simple pipeline of echo commands. First, create an empty python file (mine is called demo.py), then add this standard Conducto boilerplate code.</p><pre>import conducto as co</pre><pre># We will add more code here.</pre><pre>if __name__ == &quot;__main__&quot;:<br>    co.main()</pre><h3>Nodes</h3><p>You can conceptualize a pipeline as a sequence of <em>commands</em> that happen in <em>parallel</em> (at the same time), or in <em>serial</em> (one after the other). Conducto exposes three <em>Node</em> classes that directly map onto these ideas: <em>Exec</em>, <em>Parallel</em>, and <em>Serial</em>. 
Note that the code below is just for illustration purposes and should not be copied into your python file.</p><h4>An <em>Exec Node</em> is a shell command.</h4><pre>exec_node = co.Exec(&quot;echo hello world&quot;)</pre><h4>A <em>Parallel Node</em> holds other nodes that can be executed in parallel.</h4><pre>parallel_node = co.Parallel()<br>parallel_node[&quot;task1&quot;] = co.Exec(&quot;echo whistle&quot;)<br>parallel_node[&quot;task2&quot;] = co.Exec(&quot;echo while you work&quot;)</pre><h4>A <em>Serial Node</em> holds other nodes that must be executed in serial.</h4><pre>serial_node = co.Serial()<br>serial_node[&quot;task1&quot;] = co.Exec(&quot;echo first do this&quot;)<br>serial_node[&quot;task2&quot;] = co.Exec(&quot;echo then do that&quot;)</pre><h3>Pipeline Specification</h3><h4>Pipeline Function</h4><p>A pipeline is specified in a function that returns the root node of a tree that combines Exec, Parallel, and Serial Nodes. So, let us go back to our file, and create an empty pipeline function. Here, we begin by defining a pipeline function named islands.</p><pre>import conducto as co</pre><pre><strong>def islands() -&gt; co.Serial:<br>    return None</strong></pre><pre>if __name__ == &quot;__main__&quot;:<br>    co.main()</pre><p>The islands function is annotated with a <em>type hint</em> indicating that it will return a Serial Node. It is ok if you are not familiar with type hints. Just ensure that your pipeline function signature always ends with -&gt; co.[NodeType].</p><h4>Pipeline Definition</h4><p>Now we can actually define our pipeline. We are going to define a toy pipeline that prints the nickname of each Hawaiian island, starting with the southernmost island and moving north. Islands in the same county will be grouped into either a Parallel or Serial node. 
In pseudocode, the pipeline should look like:</p><pre>hawaii -&gt; echo big island<br>maui county:<br>    maui -&gt; echo valley isle<br>    lanai -&gt; echo pineapple isle<br>    molokai -&gt; echo friendly isle<br>    kahoolawe -&gt; echo target isle<br>oahu -&gt; echo gathering place<br>kauai county:<br>    kauai -&gt; echo garden isle<br>    niihau -&gt; echo forbidden isle</pre><p>We can easily translate this into python using Node objects. Note that the choice of Parallel and Serial Nodes for maui_county and kauai_county below is arbitrary.</p><pre>pipeline = co.Serial()<br>pipeline[&quot;hawaii&quot;] = co.Exec(&quot;echo big island&quot;)</pre><pre>pipeline[&quot;maui_county&quot;] = co.Parallel()<br>pipeline[&quot;maui_county&quot;][&quot;maui&quot;] = co.Exec(&quot;echo valley isle&quot;)<br>pipeline[&quot;maui_county&quot;][&quot;lanai&quot;] = co.Exec(&quot;echo pineapple isle&quot;)<br>pipeline[&quot;maui_county&quot;][&quot;molokai&quot;] = co.Exec(&quot;echo friendly isle&quot;)<br>pipeline[&quot;maui_county&quot;][&quot;kahoolawe&quot;] = co.Exec(&quot;echo target isle&quot;)</pre><pre>pipeline[&quot;oahu&quot;] = co.Exec(&quot;echo gathering place&quot;)</pre><pre>pipeline[&quot;kauai_county&quot;] = co.Serial()<br>pipeline[&quot;kauai_county&quot;][&quot;kauai&quot;] = co.Exec(&quot;echo garden isle&quot;)<br>pipeline[&quot;kauai_county&quot;][&quot;niihau&quot;] = co.Exec(&quot;echo forbidden isle&quot;)</pre><p>This is straightforward, but I believe that the pipeline structure is even clearer when we leverage python’s with statement. 
This code is an equivalent way to express our pipeline.</p><pre>with co.Serial() as pipeline:<br>    pipeline[&quot;hawaii&quot;] = co.Exec(&quot;echo big island&quot;)</pre><pre>    with co.Parallel(name=&quot;maui_county&quot;) as maui_county:<br>        maui_county[&quot;maui&quot;] = co.Exec(&quot;echo valley isle&quot;)<br>        maui_county[&quot;lanai&quot;] = co.Exec(&quot;echo pineapple isle&quot;)<br>        maui_county[&quot;molokai&quot;] = co.Exec(&quot;echo friendly isle&quot;)<br>        maui_county[&quot;kahoolawe&quot;] = co.Exec(&quot;echo target isle&quot;)</pre><pre>    pipeline[&quot;oahu&quot;] = co.Exec(&quot;echo gathering place&quot;)</pre><pre>    with co.Serial(name=&quot;kauai_county&quot;) as kauai_county:<br>        kauai_county[&quot;kauai&quot;] = co.Exec(&quot;echo garden isle&quot;)<br>        kauai_county[&quot;niihau&quot;] = co.Exec(&quot;echo forbidden isle&quot;)</pre><p>Now, we can put this code into our islands function from before, return the root pipeline node, and we are done.</p><pre>import conducto as co</pre><pre>def islands() -&gt; co.Serial:<br><strong>    with co.Serial() as pipeline:<br>        pipeline[&quot;hawaii&quot;] = co.Exec(&quot;echo big island&quot;)<br>        with co.Parallel(name=&quot;maui_county&quot;) as maui_county:<br>            maui_county[&quot;maui&quot;] = co.Exec(&quot;echo valley isle&quot;)<br>            maui_county[&quot;lanai&quot;] = co.Exec(&quot;echo pineapple isle&quot;)<br>            maui_county[&quot;molokai&quot;] = co.Exec(&quot;echo friendly isle&quot;)<br>            maui_county[&quot;kahoolawe&quot;] = co.Exec(&quot;echo target isle&quot;)<br>        pipeline[&quot;oahu&quot;] = co.Exec(&quot;echo gathering place&quot;)<br>        with co.Serial(name=&quot;kauai_county&quot;) as kauai_county:<br>            kauai_county[&quot;kauai&quot;] = co.Exec(&quot;echo garden isle&quot;)<br>            kauai_county[&quot;niihau&quot;] = co.Exec(&quot;echo forbidden isle&quot;)<br>    
return pipeline</strong></pre><pre>if __name__ == &quot;__main__&quot;:<br>    co.main()</pre><h3>Pipeline Execution</h3><p>The python file contains our full pipeline specification. Now, we can execute it. First, run the script with the --help option.</p><pre>python demo.py --help</pre><p>You will see a message like the one below. You can see that Conducto recognizes our pipeline function from the bolded text.</p><pre>usage: demo.py [-h] &lt;method&gt; [&lt; --arg1 val1 --arg2 val2 ...&gt;]<br>                [--cloud] [--local] [--run] [--sleep-when-done]<br><strong>methods that return conducto pipelines:<br>    islands  () -&gt; Serial</strong></pre><pre>optional arguments:<br>  -h, --help  show this help message and exit<br>  --version   show conducto package version</pre><p>Now, execute the script in <em>local</em> mode, which means that the entire pipeline will execute on your local machine. In a future release, you will also be able to execute the same script in <em>cloud</em> mode for immediate scale.</p><pre>python demo.py islands --local</pre><p>This should open a new browser window or tab to conducto.com where you can see the pipeline. If this does not happen, copy the printed URL into your browser.</p><p>The left-hand side of the screen is called the <em>pipeline pane</em> and has a toolbar with icons at the top. Click the <em>View</em> button to expand the pipeline and see the pipeline tree we have created. Click the <em>Run</em> button to execute the pipeline.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Gz76_hmi5sEAJW5NFaaOtw.png" /><figcaption>This is the pipeline pane. Click <strong>View</strong> to expand the pipeline tree and <strong>Run</strong> to execute the pipeline.</figcaption></figure><p>This interactive tree representation gives you a useful visual summary of the pipeline. 
You can see that Exec, Parallel, and Serial Node types are indicated by unique icons.</p><p>Notice how closely the pipeline tree in the web app mirrors our python specification.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RkjAt_-Kvl6XgFQuzkShwQ.png" /><figcaption>Pipeline specification and visualization mirror each other.</figcaption></figure><p>Finally, click on one of the Exec nodes and examine the execution details. It contains useful information like the command, duration, memory used, return code, and stdout.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Bs0i7jMQKSzKGGD1c36CAg.png" /></figure><h3>Summary</h3><p>Now you have written and executed a simple pipeline in Conducto. I hope you are already imagining how Conducto can enable you to <em>easily </em>write and execute your own pipelines.</p><p>In my previous job, the predecessor to Conducto was the secret sauce that enabled our algorithmic trading team to run an ultra-productive data science and machine learning effort that has run for a decade and driven billions of dollars in revenue. So it stands to reason that Conducto is great for <a href="https://medium.com/conducto/data/home">data science</a>.</p><p>But, pipelines are everywhere, and when we switched our internal CI/CD pipeline from CircleCI to Conducto, <a href="https://medium.com/conducto/supercharge-your-ci-cd-with-conducto-e701c4ea5be3">we immediately became more productive</a>. 
Try <a href="https://medium.com/conducto/getting-started-with-conducto-for-ci-cd-b6afb626f410">Conducto for CI/CD</a> if you do not love your current solution.</p><hr><p><a href="https://medium.com/conducto/introduction-to-conducto-pipelines-2759ecf876a2">Introduction to Conducto Pipelines</a> was originally published in <a href="https://medium.com/conducto">Conducto</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Environment Variables and Secrets]]></title>
            <link>https://medium.com/conducto/environment-variables-and-secrets-9acab502ec77?source=rss-42ece86b5daf------2</link>
            <guid isPermaLink="false">https://medium.com/p/9acab502ec77</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[containers]]></category>
            <category><![CDATA[workflow]]></category>
            <category><![CDATA[data-visualization]]></category>
            <dc:creator><![CDATA[Matt Jachowski]]></dc:creator>
            <pubDate>Fri, 17 Apr 2020 01:29:00 GMT</pubDate>
            <atom:updated>2020-07-29T20:26:37.344Z</atom:updated>
            <content:encoded><![CDATA[<h4><a href="https://medium.com/conducto/data/home">Conducto for Data Science</a></h4><p>Non-trivial pipelines require the specification of environment variables and secrets. This is easy in <a href="https://conducto.com/">Conducto</a>.</p><p>Explore our <a href="https://www.conducto.com/demo/data_science">live demo</a>, view the <a href="https://github.com/conducto/demo/blob/main/data_science/env_secrets.py">source code for this tutorial</a>, or clone the <a href="https://github.com/conducto/demo">demo</a> and run it for yourself.</p><pre>git clone <a href="https://github.com/conducto/demo.git">https://github.com/conducto/demo.git</a><br>cd demo/data_science<br>python env_secrets.py --local</pre><p>Alternatively, download the zip archive <a href="https://github.com/conducto/demo/archive/master.zip">here</a>.</p><h3>Environment Variables</h3><p>To specify environment variables, just supply the env argument to any node. Assign a dictionary of key-value pairs where <em>both keys and values must be strings</em>.</p><pre><strong>env = {<br>    &quot;NUM_THREADS&quot;: &quot;4&quot;,<br>    &quot;MY_DATASET&quot;: &quot;volcano_data&quot;,<br>}</strong><br>image = co.Image(&quot;bash:5.0&quot;)<br>command = &quot;env | grep -e NUM_THREADS -e MY_DATASET&quot;<br>env_test = co.Exec(command, <strong>env=env</strong>, image=image)</pre><h3>Secrets</h3><p>Some environment variables, like passwords and tokens, are sensitive and should not be hardcoded into any scripts. You can configure Conducto with both user- and org-level secrets (if you are an admin), which will be injected into each running exec node. 
You can specify a dictionary of secrets with our Secrets API.</p><pre># get_my_secrets_dict() returns a dict of string to string<br>user_secrets = get_my_secrets_dict()<br><strong>token = co.api.Auth().get_token_from_shell()<br>secrets = co.api.Secrets()</strong><br><strong>secrets.put_user_secrets(token, user_secrets, replace=False)</strong></pre><p>Or you can enter them through our web interface.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*tD1-TnNmgnON17Bp.png" /><figcaption>Specifying AWS keys as user-level secrets.</figcaption></figure><p>That’s it! Now, with the information you learned in <a href="https://medium.com/conducto/your-first-data-science-pipeline-cc9ceac142f6">Your First Pipeline</a>, <a href="https://medium.com/conducto/execution-environment-3bb663549a0c">Execution Environment</a>, <a href="https://medium.com/conducto/data-stores-f6dc90104029">Data Stores</a>, <a href="https://medium.com/conducto/node-parameters-7be236eaeaac">Node Parameters</a>, <a href="https://medium.com/conducto/easy-and-powerful-python-pipelines-2de5825375f2">Easy and Powerful Python Pipelines</a>, and here, you can create arbitrarily complex data science pipelines.</p><hr><p><a href="https://medium.com/conducto/environment-variables-and-secrets-9acab502ec77">Environment Variables and Secrets</a> was originally published in <a href="https://medium.com/conducto">Conducto</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Execution Environment]]></title>
            <link>https://medium.com/conducto/execution-environment-3bb663549a0c?source=rss-42ece86b5daf------2</link>
            <guid isPermaLink="false">https://medium.com/p/3bb663549a0c</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[workflow]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[containers]]></category>
            <dc:creator><![CDATA[Matt Jachowski]]></dc:creator>
            <pubDate>Thu, 16 Apr 2020 20:31:13 GMT</pubDate>
            <atom:updated>2020-07-29T20:25:57.637Z</atom:updated>
            <content:encoded><![CDATA[<h4><a href="https://medium.com/conducto/data/home">Conducto for Data Science</a></h4><p>In this tutorial, you will learn how to specify the dependencies and code necessary for your commands to run. <a href="https://conducto.com/">Conducto</a> strives to make this as simple as possible.</p><p>When we walked through creating <a href="https://medium.com/conducto/your-first-data-science-pipeline-cc9ceac142f6">your first pipeline</a>, we glossed over an important detail — specifying the execution environment of your commands. That is, for each command, you must be able to specify:</p><ul><li><a href="#6743">the software dependencies required</a>, and</li><li><a href="#6f6d">a copy of your own code</a></li></ul><p>Explore our <a href="https://www.conducto.com/demo/data_science">live demo</a>, view the <a href="https://github.com/conducto/demo/blob/main/data_science/execution_env.py">source code for this tutorial</a>, or clone the <a href="https://github.com/conducto/demo">demo</a> and run it for yourself.</p><pre>git clone <a href="https://github.com/conducto/demo.git">https://github.com/conducto/demo.git</a><br>cd demo/data_science<br>python execution_env.py --local</pre><p>Alternatively, download the zip archive <a href="https://github.com/conducto/demo/archive/master.zip">here</a>.</p><h3>Containers and Images</h3><p>Conducto achieves this by running each of your exec node commands inside of a <a href="https://www.docker.com/resources/what-container"><em>docker container</em></a><em>, </em>which is defined by an <em>image</em> that you help to configure. An <em>image</em> is a template for an execution environment that contains a base operating system and filesystem contents, including libraries, packages, and user code. 
A <em>container</em> is an instantiation of an image, and is like a virtual machine, but lighter weight and quicker to create and destroy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/500/0*v150VHTf40B7WUmw.jpeg" /><figcaption>It is ok if you are new to containers; Conducto handles a lot of the details for you.</figcaption></figure><p>We will take a deep dive into how you configure an image. As a refresher, this is the pipeline from <a href="https://medium.com/conducto/your-first-data-science-pipeline-cc9ceac142f6">your first pipeline tutorial</a>, with the image parameter bolded.</p><pre>import conducto as co</pre><pre>def download_and_plot() -&gt; co.Serial:<br>    dockerfile = &quot;./docker/Dockerfile.first&quot;<br>    <strong>image = co.Image(dockerfile=dockerfile, copy_dir=&quot;./code&quot;)</strong><br>    with co.Serial(<strong>image=image</strong>) as pipeline:<br>        co.Exec(download_command, name=&quot;download&quot;)<br>        with co.Parallel(name=&quot;plot&quot;):<br>            # ...<br>    return pipeline</pre><pre>if __name__ == &quot;__main__&quot;:<br>    co.main(default=download_and_plot)</pre><h3>Image Specification</h3><p>In Conducto, there are two ways to specify an image.</p><ul><li><a href="#1f39">Specifying an existing image</a> from DockerHub or another image registry.</li><li><a href="#9f79">Specifying a custom Dockerfile</a>.</li></ul><h3>Existing Image</h3><p>Specifying an existing image looks like this.</p><pre>image = co.Image(<strong>&quot;r-base:3.6.0&quot;</strong>)</pre><p>This particular image contains R, a programming language and environment for statistical computing, in a Debian Linux operating system, and is one of the <a href="https://hub.docker.com/_/r-base">many official R images available on DockerHub</a>. 
You can specify any image from any public image registry, or a locally built image.</p><h3>Python Image + Python Requirements</h3><p>If you specify an image with python installed, we also allow you to specify any python package requirements inline.</p><pre>image = co.Image(&quot;python:3.8-slim&quot;, <strong>reqs_py=[&quot;numpy&quot;]</strong>)</pre><p>This specific example is equivalent to having python 3.8 installed in Debian Linux, with the following pip command having been run.</p><pre>pip install numpy</pre><h3>Custom Dockerfile</h3><p>For more control, you can <a href="https://docs.docker.com/get-started/part2/#define-a-container-with-dockerfile">specify your own Dockerfile</a>, which Conducto will build into an image. You may specify dockerfile with an absolute or relative path, which is evaluated relative to the location of your pipeline script. You must also specify context, which is the <a href="https://docs.docker.com/engine/reference/commandline/build/">docker build context</a>.</p><pre>image = co.Image(<br>    <strong>dockerfile=&quot;./docker/Dockerfile.simple&quot;,<br>    context=&quot;.&quot;<br></strong>)</pre><p>Here is a very simple Dockerfile that results in an image equivalent to the python example from the previous section.</p><pre>FROM python:3.8-slim<br>RUN pip install numpy</pre><h3>Adding Your Own Code</h3><p>So far we have discussed how to use images to include required software dependencies. 
But, you likely also need to include your own code in the image.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/619/0*1md5kL7Q9EU4uTiz.jpg" /><figcaption>Fun fact: Conducto was almost named Blue Steel.</figcaption></figure><p>There are a few ways to do this.</p><ul><li><a href="#2608">Copy a local directory directly into the image</a>.</li><li><a href="#e21c">Clone a specific branch from a git repository into the image</a>.</li><li><a href="#be01">COPY or ADD files explicitly in a Dockerfile</a>.</li></ul><h3>Copy a Local Directory</h3><p>You can specify a local directory with your own files to be copied into your image with the copy_dir argument. You may use an absolute or relative path for the directory, which is evaluated relative to the location of your pipeline script.</p><pre>image = co.Image(&quot;r-base:3.6.0&quot;, <strong>copy_dir=&quot;./code&quot;</strong>)</pre><p>This copies the directory ./code into your image. You may specify copy_dir for any version of image specification from above: existing image or dockerfile.</p><h3>Clone from Git</h3><p>You can also specify a git repository and branch to clone into your image with the copy_url and copy_branch arguments. This is useful for ensuring that your data science pipelines run against clean, versioned code, and not scripts with local uncommitted changes that could be lost. Here is an example using our demo repo on GitHub.</p><pre>git_url = &quot;https://github.com/conducto/demo.git&quot;<br>dockerfile = &quot;./docker/Dockerfile.git&quot;<br>image = co.Image(<br>    dockerfile=dockerfile, <strong>copy_url=git_url, copy_branch=&quot;master&quot;<br></strong>)</pre><p>Just like copy_dir, you can specify copy_url and copy_branch for any version of image specification.</p><h3>COPY or ADD in Dockerfile</h3><p>Finally, if you specify your own custom Dockerfile, you can COPY or ADD any files you want. 
Here is a Dockerfile that explicitly copies a code directory into the image. In this example, ./code is a path relative to the <a href="https://docs.docker.com/engine/reference/commandline/build/">docker build context</a>, specified by the context argument as seen <a href="#9f79">earlier</a>.</p><pre>FROM r-base:3.6.0<br><strong>COPY ./code /root/code</strong></pre><h3>Mounting Local Code for Debugging</h3><p>One of our favorite features in Conducto is <em>live debugging</em>. We show an example of this in our <a href="https://medium.com/conducto/easy-error-resolution-45ca08d40f1d">debugging tutorial</a>. When you debug a node, you get a shell in a container with your full execution environment, including any code you have added to the image. If possible, we will mount your local code, creating a <em>live debug</em> environment. In this mode, any edits you make to your code outside of the container are visible inside the container, where you can test your command in its full execution environment. <em>This allows you to use your regular editor and debug tools outside of the container to make the debug process as painless as possible.</em></p><p>We can do this in two scenarios:</p><ul><li>you add your code with copy_dir, or</li><li>you specify path_map to explicitly map paths outside the container to inside the container.</li></ul><p>So, you get the feature for no effort if you use copy_dir, but you have to specify an extra parameter if you want to use live debug with the <em>clone from git </em>or <em>dockerfile</em> image specifications.</p><h3>Clone from Git + path_map</h3><p>If you always have a local checkout of the git repo that you specify to an image, you can safely specify a path_map to make any later debugging easier. 
Here is the example from above with path_map added.</p><pre>git_url = &quot;https://github.com/conducto/demo.git&quot;<br><strong>path_map = {&quot;.&quot;: &quot;data_science&quot;}</strong><br>image = co.Image(<br>    dockerfile=&quot;./docker/Dockerfile.git&quot;,<br>    copy_url=git_url,<br>    copy_branch=&quot;master&quot;,<br>    <strong>path_map=path_map</strong><br>)</pre><p>This maps the local directory ., relative to the location of the pipeline script, which is <em>outside</em> the container, to the data_science directory relative to the root of the cloned git repo <em>inside</em> the container.</p><h3>COPY or ADD in Dockerfile + path_map</h3><p>It works the same way for an image with a dockerfile that adds its own files, except that the target path inside the container must be absolute. This is because in this scenario, Conducto has no way to choose a reasonable default root directory inside the container. Here is an example.</p><pre><strong>path_map = {&quot;./code&quot;: &quot;/root/code&quot;}<br></strong>image = co.Image(<br>    dockerfile=&quot;./docker/Dockerfile.copy&quot;,<br>    context=&quot;.&quot;,<br><strong>    path_map=path_map<br></strong>)</pre><p>Where the Dockerfile is the same as above.</p><pre>FROM r-base:3.6.0<br>COPY ./code /root/code</pre><h3>Image Inheritance</h3><p>Finally, a node with an unspecified image parameter inherits the value from its parent. 
The pipeline from our first tutorial shows this, with all nodes sharing an image with the root node.</p><pre>import conducto as co</pre><pre>def download_and_plot() -&gt; co.Serial:<br>    dockerfile = &quot;./docker/Dockerfile.first&quot;<br>    <strong>image = co.Image(dockerfile=dockerfile, copy_dir=&quot;./code&quot;)</strong><br>    with co.Serial(<strong>image=image</strong>) as pipeline:<br>        co.Exec(download_command, name=&quot;download&quot;)<br>        with co.Parallel(name=&quot;plot&quot;):<br>            # ...<br>    return pipeline</pre><pre>if __name__ == &quot;__main__&quot;:<br>    co.main(default=download_and_plot)</pre><p>That is all there is to it! Now, with the information you learned in <a href="https://medium.com/conducto/your-first-data-science-pipeline-cc9ceac142f6">Your First Pipeline</a>, <a href="https://medium.com/conducto/environment-variables-and-secrets-9acab502ec77">Environment Variables and Secrets</a>, <a href="https://medium.com/conducto/data-stores-f6dc90104029">Data Stores</a>, <a href="https://medium.com/conducto/node-parameters-7be236eaeaac">Node Parameters</a>, <a href="https://medium.com/conducto/easy-and-powerful-python-pipelines-2de5825375f2">Easy and Powerful Python Pipelines</a>, and here, you can create arbitrarily complex data science pipelines.</p><hr><p><a href="https://medium.com/conducto/execution-environment-3bb663549a0c">Execution Environment</a> was originally published in <a href="https://medium.com/conducto">Conducto</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Your First Data Science Pipeline]]></title>
            <link>https://medium.com/conducto/your-first-data-science-pipeline-cc9ceac142f6?source=rss-42ece86b5daf------2</link>
            <guid isPermaLink="false">https://medium.com/p/cc9ceac142f6</guid>
            <category><![CDATA[containers]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[workflow]]></category>
            <dc:creator><![CDATA[Matt Jachowski]]></dc:creator>
            <pubDate>Thu, 16 Apr 2020 06:39:07 GMT</pubDate>
            <atom:updated>2020-07-29T20:25:18.498Z</atom:updated>
            <content:encoded><![CDATA[<h4><a href="https://medium.com/conducto/data/home">Conducto for Data Science</a></h4><p>In this tutorial, you will learn how to <a href="https://medium.com/conducto/your-first-pipeline-32a303b2cc5d#1738">define</a>, <a href="https://medium.com/conducto/your-first-pipeline-32a303b2cc5d#cae2">execute</a>, and <a href="https://medium.com/conducto/your-first-pipeline-32a303b2cc5d#fd60">interact</a> with a simple <a href="https://conducto.com/">Conducto</a> pipeline.</p><p>Upon completion, you will understand how to use the following minimal API.</p><ul><li><a href="https://medium.com/conducto/your-first-pipeline-32a303b2cc5d#6dfd">co.Exec</a>, <a href="https://medium.com/conducto/your-first-pipeline-32a303b2cc5d#3d20">co.Serial</a>, and <a href="https://medium.com/conducto/your-first-pipeline-32a303b2cc5d#c9e0">co.Parallel</a> node classes,</li><li><a href="https://medium.com/conducto/your-first-pipeline-32a303b2cc5d#ed67">co.Image</a> to specify execution environment, and</li><li><a href="https://medium.com/conducto/your-first-pipeline-32a303b2cc5d#31d1">co.main()</a> to make your pipeline executable</li></ul><p>Explore our <a href="https://www.conducto.com/demo/data_science">live demo</a>, view the <a href="https://github.com/conducto/demo/blob/main/data_science/first_pipeline.py">source code for this tutorial</a>, or clone the <a href="https://github.com/conducto/demo">demo</a> and run it for yourself.</p><pre>git clone <a href="https://github.com/conducto/demo.git">https://github.com/conducto/demo.git</a><br>cd demo/data_science<br>python first_pipeline.py --local</pre><p>Alternatively, download the zip archive <a href="https://github.com/conducto/demo/archive/master.zip">here</a>.</p><h3>Define Your Pipeline</h3><p>In Conducto, you express your pipeline as a series of commands that need to be executed in serial and/or parallel. 
Our python API exposes a minimal set of <em>Node</em> classes to get this done quickly and painlessly. Then, you have the full power of python to nest these nodes for arbitrarily complex pipelines.</p><p>First, you need to import conducto.</p><pre>import conducto as co</pre><p>Then, you start building your pipeline with nodes.</p><h3>Exec Node</h3><p>An <em>exec node</em> simply wraps a shell command.</p><pre>plot = co.Exec(&quot;python plot.py --dataset heating&quot;)</pre><h3>Serial Node</h3><p>A <em>serial node</em> specifies that a series of sub-nodes must happen one after another. If one of the sub-nodes fails, execution stops and the entire serial node is marked as failed.</p><pre>steps = co.Serial()<br>steps[&quot;download&quot;] = co.Exec(download_command)<br>steps[&quot;plot&quot;] = co.Exec(&quot;python plot.py --dataset heating&quot;)</pre><p>Note that the definition of download_command is omitted for clarity. See the <a href="https://github.com/conducto/demo/blob/master/data_science/first_pipeline.py">source code in the demo</a> for the full details.</p><h3>Parallel Node</h3><p>A <em>parallel node</em> specifies that a series of sub-nodes can occur in parallel. All nodes are executed, and if any nodes fail, the entire parallel node is marked as failed.</p><pre>plot = co.Parallel()<br>plot[&quot;heating&quot;] = co.Exec(&quot;python plot.py --dataset heating&quot;)<br>plot[&quot;cooling&quot;] = co.Exec(&quot;python plot.py --dataset cooling&quot;)</pre><h3>Nesting</h3><p>Serial and parallel nodes may contain any node type, not just exec nodes. 
This allows the creation of non-trivial pipelines.</p><pre>pl = co.Serial()<br>pl[&quot;download&quot;] = co.Exec(download_command)<br>pl[&quot;plot&quot;] = co.Serial()<br>pl[&quot;plot&quot;][&quot;heating&quot;] = co.Exec(&quot;python plot.py --dataset heating&quot;)<br>pl[&quot;plot&quot;][&quot;cooling&quot;] = co.Exec(&quot;python plot.py --dataset cooling&quot;)</pre><p>Easy to do, but perhaps more verbose than you prefer. We can use python to make it nicer.</p><pre>with co.Serial() as pl:<br>    co.Exec(download_command, name=&quot;download&quot;)<br>    with co.Serial(name=&quot;plot&quot;):<br>        co.Exec(&quot;python plot.py --dataset heating&quot;, name=&quot;heating&quot;)<br>        co.Exec(&quot;python plot.py --dataset cooling&quot;, name=&quot;cooling&quot;)</pre><h3>Image</h3><p>Of course, your commands will only be able to run in an execution environment with:</p><ul><li>your software dependencies installed,</li><li>a copy of your own code present, and</li><li>any necessary environment variables set</li></ul><p>Conducto achieves this by running each of your exec commands inside of a <em>docker container, </em>which is defined by an <em>image</em> that you help to configure. Read full details in the <a href="https://medium.com/conducto/execution-environment-5a66ff0a10bc">Execution Environment</a> and <a href="https://medium.com/conducto/environment-variables-and-secrets-12256150e94d">Environment Variables and Secrets</a> tutorials. But for now, we will skip over these details, and just provide an appropriate image for our example. This particular image includes python and some packages to manipulate data, and copies over your local ./code directory. Note that the . 
is relative to the location of the pipeline script.</p><pre>dockerfile = &quot;./docker/Dockerfile.first&quot;<strong><br>image = co.Image(dockerfile=dockerfile, copy_dir=&quot;./code&quot;)<br></strong>with co.Serial(<strong>image=image</strong>) as pipeline:<br>    # ...</pre><h3>Main</h3><p>Now that you have a pipeline specified, make it executable. First, wrap your pipeline in a function that returns the top-level node.</p><pre><strong>def download_and_plot() -&gt; co.Serial:<br>    </strong>dockerfile = &quot;./docker/Dockerfile.first&quot;<br>    image = co.Image(dockerfile=dockerfile, copy_dir=&quot;./code&quot;)<br>    with co.Serial(image=image) as pipeline:<br>        co.Exec(download_command, name=&quot;download&quot;)<br>        with co.Parallel(name=&quot;plot&quot;):<br>            # ...<br><strong>    return pipeline</strong></pre><p>Conducto requires that you write a <em>type hint</em> to indicate the node return type of the function. Do not worry if type hints are new to you. Simply ensure that the first line of your function includes -&gt; co.[NodeClass], like this:</p><pre>def download_and_plot() <strong>-&gt; co.Serial</strong>:</pre><p>Finally, define the main function of your python script.</p><pre>def download_and_plot() -&gt; co.Serial:<strong><br>    </strong>dockerfile = &quot;./docker/Dockerfile.first&quot;<br>    image = co.Image(dockerfile=dockerfile, copy_dir=&quot;./code&quot;)<br>    with co.Serial(image=image) as pipeline:<br>        co.Exec(download_command, name=&quot;download&quot;)<br>        with co.Parallel(name=&quot;plot&quot;):<br>            # ...<br>    return pipeline</pre><pre><strong>if __name__ == &quot;__main__&quot;:<br>    co.main(default=download_and_plot)</strong></pre><h3>Execute Your Pipeline</h3><p>Executing your pipeline is easy. 
First, if you want to spot-check your pipeline, run your script with no arguments.</p><pre>python first_pipeline.py</pre><p>You will see a pipeline serialization like this.</p><pre>/<br>├─0 download   set -ex\ncurl http://...<br>└─1 plot<br>  ├─ heating   python plot.py --dataset heating<br>  └─ cooling   python plot.py --dataset cooling</pre><p>To execute the pipeline on your local machine, <em>which is always free</em>, run this. Note that in local mode, your code never leaves your machine.</p><pre>python first_pipeline.py --local</pre><p>Coming soon, you will be able to effortlessly run the same pipeline in the cloud too.</p><pre>python first_pipeline.py --cloud</pre><h3>Interact With Your Pipeline</h3><p>The script will print a URL and pop it open in your browser. You can view your pipeline,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SbwwuDHuFPiCfin4HvO8UA.png" /><figcaption>The <strong>pipeline summary</strong> is the row at the top, the <strong>pipeline pane</strong> is on the left, and the <strong>node pane</strong> is on the right. The <strong>pipeline pane</strong> shows your pipeline, with parallel, serial, and exec nodes getting unique icons.</figcaption></figure><p>run it and quickly identify pipeline status,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6aojP9naTm6iM29cf0jCNw.png" /><figcaption>Press the <strong>Run</strong> button in the upper left of the pipeline pane. See the execution status of each node: <strong>P</strong>ending, <strong>Q</strong>ueued, <strong>R</strong>unning, <strong>D</strong>one, <strong>E</strong>rrored, and <strong>K</strong>illed.</figcaption></figure><p>examine the output of any exec node,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3QbVVMu__kBSQjsMKl6dRA.png" /><figcaption>View the command, execution params, stdout, and stderr of a node in the right hand node pane. 
Stdout can even include plots!</figcaption></figure><p>and <a href="https://medium.com/conducto/rapid-and-painless-debugging-ff2abdba44c1">rapidly and painlessly debug errors</a>. Collaborate with anyone else in your org by sharing the URL.</p><p>Put your pipeline to sleep when you are finished with it. Its state, logs, and data are stored for 7 days. During this period you can wake it up. After 7 days, it is deleted.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/403/1*LWd5Q6TGDyoqX--Wop9zWg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/403/1*aBuHPGOhf8RPf_Qx-ZsWwA.png" /><figcaption>The “zzz” icon in the pipeline summary puts the pipeline to sleep. When no pipeline is selected, you see a list of available ones. Click the “alarm clock” button on a sleeping pipeline to get a wakeup command to run in a local shell.</figcaption></figure><h3>How Much More Data Do You Need?</h3><p>This was a simple example, but once you add in <a href="https://medium.com/conducto/environment-variables-and-secrets-9acab502ec77">Environment Variables and Secrets</a>, <a href="https://medium.com/conducto/data-stores-f6dc90104029">Data Stores</a>, <a href="https://medium.com/conducto/node-parameters-7be236eaeaac">Node Parameters</a>, and <a href="https://medium.com/conducto/easy-and-powerful-python-pipelines-2de5825375f2">Easy and Powerful Python Pipelines</a>, you can easily express the most complex of data science pipelines in Conducto.</p><p>In my previous job, the predecessor to Conducto was the secret sauce that enabled our algorithmic trading team to run an ultra-productive data science and machine learning effort that has driven billions of dollars in revenue for a decade. Simply put, Conducto multiplied the impact of each team member by a lot.</p><p>How much more data do you need? Get started with Conducto now. Local mode is always free and is only limited by the CPU and memory on your machine. Cloud mode gives you immediate scale. 
Use the full power of python to write pipelines with ease. And <a href="https://medium.com/conducto/easy-error-resolution-45ca08d40f1d">experience painless debugging and easy error resolution</a>.</p><hr><p><a href="https://medium.com/conducto/your-first-data-science-pipeline-cc9ceac142f6">Your First Data Science Pipeline</a> was originally published in <a href="https://medium.com/conducto">Conducto</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>